Diffstat (limited to 'src/ceph/doc/dev/osd_internals')
-rw-r--r--  src/ceph/doc/dev/osd_internals/backfill_reservation.rst              38
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding.rst                    82
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst   223
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst         207
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst           33
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst         385
-rw-r--r--  src/ceph/doc/dev/osd_internals/index.rst                             10
-rw-r--r--  src/ceph/doc/dev/osd_internals/last_epoch_started.rst                60
-rw-r--r--  src/ceph/doc/dev/osd_internals/log_based_pg.rst                     206
-rw-r--r--  src/ceph/doc/dev/osd_internals/map_message_handling.rst             131
-rw-r--r--  src/ceph/doc/dev/osd_internals/osd_overview.rst                     106
-rw-r--r--  src/ceph/doc/dev/osd_internals/osd_throttles.rst                     93
-rw-r--r--  src/ceph/doc/dev/osd_internals/osd_throttles.txt                     21
-rw-r--r--  src/ceph/doc/dev/osd_internals/pg.rst                                31
-rw-r--r--  src/ceph/doc/dev/osd_internals/pg_removal.rst                        56
-rw-r--r--  src/ceph/doc/dev/osd_internals/pgpool.rst                            22
-rw-r--r--  src/ceph/doc/dev/osd_internals/recovery_reservation.rst              75
-rw-r--r--  src/ceph/doc/dev/osd_internals/scrub.rst                             30
-rw-r--r--  src/ceph/doc/dev/osd_internals/snaps.rst                            128
-rw-r--r--  src/ceph/doc/dev/osd_internals/watch_notify.rst                      81
-rw-r--r--  src/ceph/doc/dev/osd_internals/wbthrottle.rst                        28
21 files changed, 0 insertions, 2046 deletions
diff --git a/src/ceph/doc/dev/osd_internals/backfill_reservation.rst b/src/ceph/doc/dev/osd_internals/backfill_reservation.rst
deleted file mode 100644
index cf9dab4..0000000
--- a/src/ceph/doc/dev/osd_internals/backfill_reservation.rst
+++ /dev/null
@@ -1,38 +0,0 @@
-====================
-Backfill Reservation
-====================
-
-When a new OSD joins a cluster, all PGs containing it must eventually backfill
-to it. If all of these backfills happened simultaneously, they would put
-excessive load on the OSD. osd_max_backfills limits the number of outgoing
-backfills on a single node, and independently limits the number of incoming
-backfills, so there can be at most osd_max_backfills * 2 simultaneous
-backfills on one OSD.
-
-Each OSDService now has two AsyncReserver instances: one for backfills going
-from the osd (local_reserver) and one for backfills going to the osd
-(remote_reserver). An AsyncReserver (common/AsyncReserver.h) manages a queue
-by priority of waiting items and a set of current reservation holders. When a
-slot frees up, the AsyncReserver queues the Context* associated with the next
-item on the highest priority queue in the finisher provided to the constructor.
-
-For a primary to initiate a backfill, it must first obtain a reservation from
-its own local_reserver. Then, it must obtain a reservation from the backfill
-target's remote_reserver via an MBackfillReserve message. This process is
-managed by substates of Active and ReplicaActive (see the substates of Active
-in PG.h). The reservations are dropped either on the Backfilled event (which
-is sent on the primary before calling recovery_complete, and on the replica on
-receipt of the BackfillComplete progress message), or upon leaving Active or
-ReplicaActive.
-
-It's important that we always grab the local reservation before the remote
-reservation in order to prevent a circular dependency.
-
-We want to minimize the risk of data loss by prioritizing the order in
-which PGs are recovered. The highest priority is log based recovery
-(OSD_RECOVERY_PRIORITY_MAX) since this must always complete before
-backfill can start. The next priority is backfill of degraded PGs and
-is a function of the degradation. A backfill for a PG missing two
-replicas will have a priority higher than a backfill for a PG missing
-one replica. The lowest priority is backfill of non-degraded PGs.
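The reserver mechanics above can be sketched as a toy priority queue. This is an illustrative sketch only; `SimpleReserver` and its members are hypothetical stand-ins for the actual AsyncReserver API, which additionally runs callbacks through a Finisher thread:

```cpp
// Sketch of the AsyncReserver idea: a priority-ordered set of waiting
// requests and a bounded set of granted reservations. When a slot frees
// up, the highest-priority waiter is granted and its callback runs.
#include <cassert>
#include <deque>
#include <functional>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>

class SimpleReserver {
  unsigned max_allowed;                      // e.g. osd_max_backfills
  // waiting queues keyed by priority; higher key = more urgent
  std::map<unsigned,
           std::deque<std::pair<std::string, std::function<void()>>>> queues;
  std::set<std::string> granted;

  void do_queued() {
    while (granted.size() < max_allowed && !queues.empty()) {
      auto &q = queues.rbegin()->second;     // highest-priority queue first
      auto [item, on_grant] = q.front();
      q.pop_front();
      if (q.empty())
        queues.erase(std::prev(queues.end()));
      granted.insert(item);
      on_grant();                            // in Ceph, a Context* queued in a Finisher
    }
  }

public:
  explicit SimpleReserver(unsigned max) : max_allowed(max) {}

  void request(const std::string &item, unsigned prio,
               std::function<void()> on_grant) {
    queues[prio].emplace_back(item, std::move(on_grant));
    do_queued();
  }

  void release(const std::string &item) {    // dropping a reservation frees a slot
    granted.erase(item);
    do_queued();
  }

  bool has(const std::string &item) const { return granted.count(item) > 0; }
};
```

Note that a grant is never revoked once made: a later, higher-priority request waits for a slot rather than preempting a holder, which mirrors why grabbing the local reservation before the remote one avoids deadlock.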
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding.rst b/src/ceph/doc/dev/osd_internals/erasure_coding.rst
deleted file mode 100644
index 7263cc3..0000000
--- a/src/ceph/doc/dev/osd_internals/erasure_coding.rst
+++ /dev/null
@@ -1,82 +0,0 @@
-==============================
-Erasure Coded Placement Groups
-==============================
-
-Glossary
---------
-
-*chunk*
- when the encoding function is called, it returns chunks of the same
- size: data chunks, which can be concatenated to reconstruct the
- original object, and coding chunks, which can be used to rebuild a
- lost chunk.
-
-*chunk rank*
- the index of a chunk when returned by the encoding function. The
- rank of the first chunk is 0, the rank of the second chunk is 1
- etc.
-
-*stripe*
- when an object is too large to be encoded with a single call,
- each set of chunks created by a call to the encoding function is
- called a stripe.
-
-*shard|strip*
- an ordered sequence of chunks of the same rank from the same
- object. For a given placement group, each OSD contains shards of
- the same rank. When dealing with objects that are encoded with a
- single operation, *chunk* is sometimes used instead of *shard*
- because the shard is made of a single chunk. The *chunks* in a
- *shard* are ordered according to the rank of the stripe they belong
- to.
-
-*K*
- the number of data *chunks*, i.e. the number of *chunks* in which the
- original object is divided. For instance if *K* = 2 a 10KB object
- will be divided into *K* chunks of 5KB each.
-
-*M*
- the number of coding *chunks*, i.e. the number of additional *chunks*
- computed by the encoding functions. If there are 2 coding *chunks*,
- it means 2 OSDs can be out without losing data.
-
-*N*
- the number of data *chunks* plus the number of coding *chunks*,
- i.e. *K+M*.
-
-*rate*
- the proportion of the *chunks* that contain useful information, i.e. *K/N*.
- For instance, for *K* = 9 and *M* = 3 (i.e. *K+M* = *N* = 12) the rate is
- *K/N* = 9/12 = 0.75, i.e. 75% of the chunks contain useful information.
-
-The definitions are illustrated as follows (PG stands for placement group):
-::
-
- OSD 40 OSD 33
- +-------------------------+ +-------------------------+
- | shard 0 - PG 10 | | shard 1 - PG 10 |
- |+------ object O -------+| |+------ object O -------+|
- ||+---------------------+|| ||+---------------------+||
- stripe||| chunk 0 ||| ||| chunk 1 ||| ...
- 0 ||| stripe 0 ||| ||| stripe 0 |||
- ||+---------------------+|| ||+---------------------+||
- ||+---------------------+|| ||+---------------------+||
- stripe||| chunk 0 ||| ||| chunk 1 ||| ...
- 1 ||| stripe 1 ||| ||| stripe 1 |||
- ||+---------------------+|| ||+---------------------+||
- ||+---------------------+|| ||+---------------------+||
- stripe||| chunk 0 ||| ||| chunk 1 ||| ...
- 2 ||| stripe 2 ||| ||| stripe 2 |||
- ||+---------------------+|| ||+---------------------+||
- |+-----------------------+| |+-----------------------+|
- | ... | | ... |
- +-------------------------+ +-------------------------+
-
-Table of contents
------------------
-
-.. toctree::
- :maxdepth: 1
-
- Developer notes <erasure_coding/developer_notes>
- Jerasure plugin <erasure_coding/jerasure>
- High level design document <erasure_coding/ecbackend>
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst
deleted file mode 100644
index 770ff4a..0000000
--- a/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst
+++ /dev/null
@@ -1,223 +0,0 @@
-============================
-Erasure Code developer notes
-============================
-
-Introduction
-------------
-
-Each chapter of this document explains an aspect of the implementation
-of the erasure code within Ceph. It relies mostly on worked examples
-to demonstrate how things work.
-
-Reading and writing encoded chunks from and to OSDs
----------------------------------------------------
-
-An erasure coded pool stores each object as K+M chunks: K data
-chunks and M coding chunks. The pool is configured to have a size of
-K+M so that each chunk is stored in an OSD in the acting set. The
-rank of the chunk is stored as an attribute of the object.
-
-Let's say an erasure coded pool is created to use five OSDs (K+M =
-5) and sustain the loss of two of them (M = 2).
-
-When the object *NYAN* containing *ABCDEFGHI* is written to it, the
-erasure encoding function splits the content into three data chunks,
-simply by dividing the content in three: the first contains *ABC*,
-the second *DEF* and the last *GHI*. The content will be padded if its
-length is not a multiple of K. The function also creates two
-coding chunks: the fourth with *YXY* and the fifth with *QGC*. Each
-chunk is stored in an OSD in the acting set. The chunks are stored in
-objects that have the same name (*NYAN*) but reside on different
-OSDs. The order in which the chunks were created must be preserved and
-is stored as an attribute of the object (shard_t), in addition to its
-name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4*
-contains *YXY* and is stored on *OSD3*.
-
-::
-
- +-------------------+
- name | NYAN |
- +-------------------+
- content | ABCDEFGHI |
- +--------+----------+
- |
- |
- v
- +------+------+
- +---------------+ encode(3,2) +-----------+
- | +--+--+---+---+ |
- | | | | |
- | +-------+ | +-----+ |
- | | | | |
- +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
- name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
- +------+ +------+ +------+ +------+ +------+
- shard | 1 | | 2 | | 3 | | 4 | | 5 |
- +------+ +------+ +------+ +------+ +------+
- content | ABC | | DEF | | GHI | | YXY | | QGC |
- +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
- | | | | |
- | | | | |
- | | +--+---+ | |
- | | | OSD1 | | |
- | | +------+ | |
- | | +------+ | |
- | +------>| OSD2 | | |
- | +------+ | |
- | +------+ | |
- | | OSD3 |<----+ |
- | +------+ |
- | +------+ |
- | | OSD4 |<--------------+
- | +------+
- | +------+
- +----------------->| OSD5 |
- +------+
-
-
-
-
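The split-and-pad step shown in the diagram can be sketched as follows. `split_into_data_chunks` is a hypothetical helper, not a Ceph function; generation of the coding chunks is left to the erasure code library and is not shown:

```cpp
// Divide object content into K equal data chunks, padding the tail with
// zero bytes so the total length is a multiple of K.
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> split_into_data_chunks(std::string content, size_t k) {
  size_t chunk_size = (content.size() + k - 1) / k;   // round up
  content.resize(chunk_size * k, '\0');               // pad with zero bytes
  std::vector<std::string> chunks;
  for (size_t i = 0; i < k; ++i)
    chunks.push_back(content.substr(i * chunk_size, chunk_size));
  return chunks;
}
```

For *NYAN*, `split_into_data_chunks("ABCDEFGHI", 3)` yields `"ABC"`, `"DEF"`, `"GHI"`; an 8-byte object would have its last chunk padded to 3 bytes.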
-When the object *NYAN* is read from the erasure coded pool, the
-decoding function reads three chunks: chunk *1* containing *ABC*,
-chunk *3* containing *GHI* and chunk *4* containing *YXY*, and rebuilds
-the original content of the object, *ABCDEFGHI*. The decoding function
-is informed that the chunks *2* and *5* are missing (they are called
-*erasures*). The chunk *5* could not be read because *OSD4* is
-*out*.
-
-The decoding function could be called as soon as three chunks are
-read: *OSD2* was the slowest and its chunk does not need to be taken into
-account. This optimization is not implemented in Firefly.
-
-::
-
- +-------------------+
- name | NYAN |
- +-------------------+
- content | ABCDEFGHI |
- +--------+----------+
- ^
- |
- |
- +------+------+
- | decode(3,2) |
- | erasures 2,5|
- +-------------->| |
- | +-------------+
- | ^ ^
- | | +-----+
- | | |
- +--+---+ +------+ +--+---+ +--+---+
- name | NYAN | | NYAN | | NYAN | | NYAN |
- +------+ +------+ +------+ +------+
- shard | 1 | | 2 | | 3 | | 4 |
- +------+ +------+ +------+ +------+
- content | ABC | | DEF | | GHI | | YXY |
- +--+---+ +--+---+ +--+---+ +--+---+
- ^ . ^ ^
- | TOO . | |
- | SLOW . +--+---+ |
- | ^ | OSD1 | |
- | | +------+ |
- | | +------+ |
- | +-------| OSD2 | |
- | +------+ |
- | +------+ |
- | | OSD3 |-----+
- | +------+
- | +------+
- | | OSD4 | OUT
- | +------+
- | +------+
- +------------------| OSD5 |
- +------+
-
-
-Erasure code library
---------------------
-
-Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_,
-with parameters K+M, object O is encoded by dividing it into data chunks
-O1, O2, ... OK and computing coding chunks P1, P2, ... PM. Any K chunks
-out of the available K+M chunks can be used to obtain the original
-object. If data chunk O2 or coding chunk P2 is lost, it can be
-repaired using any K of the remaining chunks. If more than M
-chunks are lost, it is not possible to recover the object.
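A full Reed-Solomon implementation is out of scope here, but the M = 1 case degenerates to simple bytewise XOR parity, which is enough to illustrate encoding and erasure recovery. The helper below is illustrative only, not part of any Ceph plugin:

```cpp
// XOR-parity sketch of the M = 1 case: the single coding chunk is the
// bytewise XOR of the K data chunks. Any one lost chunk (an "erasure")
// can be rebuilt by XORing the K survivors. Real Reed-Solomon
// generalizes this to arbitrary M using GF(2^8) arithmetic.
#include <cassert>
#include <string>
#include <vector>

std::string xor_chunks(const std::vector<std::string> &chunks) {
  // assumes all chunks have the same size, as the encoder guarantees
  std::string out(chunks.at(0).size(), '\0');
  for (const auto &c : chunks)
    for (size_t i = 0; i < out.size(); ++i)
      out[i] ^= c[i];
  return out;
}
```

Encoding computes `parity = xor_chunks(data)`; recovering a lost data chunk is the same operation applied to the surviving data chunks plus the parity chunk.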
-
-Reading the original content of object O can be a simple
-concatenation of O1, O2, ... OK, because the plugins use
-`systematic codes
-<http://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks
-must be given to the erasure code library *decode* method to retrieve
-the content of the object.
-
-Performance depends on the parameters to the encoding functions and
-is also influenced by the packet sizes used when calling the encoding
-functions (for Cauchy or Liberation, for instance): smaller packets
-mean more calls and more overhead.
-
-Although Reed-Solomon is provided as a default, Ceph uses it via an
-`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to
-allow each pool to choose the plugin that implements it using
-key=value pairs stored in an `erasure code profile`_.
-
-.. _erasure code profile: ../../../erasure-coded-pool
-
-::
-
- $ ceph osd erasure-code-profile set myprofile \
- crush-failure-domain=osd
- $ ceph osd erasure-code-profile get myprofile
- directory=/usr/lib/ceph/erasure-code
- k=2
- m=1
- plugin=jerasure
- technique=reed_sol_van
- crush-failure-domain=osd
- $ ceph osd pool create ecpool 12 12 erasure myprofile
-
-The *plugin* is dynamically loaded from *directory* and expected to
-implement the *int __erasure_code_init(char *plugin_name, char *directory)* function
-which is responsible for registering an object derived from *ErasureCodePlugin*
-in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads:
-
-::
-
- ErasureCodePluginRegistry &instance =
- ErasureCodePluginRegistry::instance();
- instance.add(plugin_name, new ErasureCodePluginExample());
-
-The *ErasureCodePlugin* derived object must provide a factory method
-from which the concrete implementation of the *ErasureCodeInterface*
-object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads:
-
-::
-
- virtual int factory(const map<std::string,std::string> &parameters,
- ErasureCodeInterfaceRef *erasure_code) {
- *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
- return 0;
- }
-
-The *parameters* argument is the list of *key=value* pairs that were
-set in the erasure code profile, before the pool was created.
-
-::
-
- ceph osd erasure-code-profile set myprofile \
- directory=<dir> \ # mandatory
- plugin=jerasure \ # mandatory
- m=10 \ # optional and plugin dependent
- k=3 \ # optional and plugin dependent
- technique=reed_sol_van \ # optional and plugin dependent
-
-Notes
------
-
-If the objects are large, it may be impractical to encode and decode
-them in memory. However, when using *RBD* a 1TB device is divided into
-many individual 4MB objects and *RGW* does the same.
-
-Encoding and decoding is implemented in the OSD. Although it could be
-implemented client side for reads and writes, the OSD must be able to
-encode and decode on its own when scrubbing.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst
deleted file mode 100644
index 624ec21..0000000
--- a/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst
+++ /dev/null
@@ -1,207 +0,0 @@
-=================================
-ECBackend Implementation Strategy
-=================================
-
-Misc initial design notes
-=========================
-
-The initial (and still true for ec pools without the hacky ec
-overwrites debug flag enabled) design for ec pools restricted
-EC pools to operations which can be easily rolled back:
-
-- CEPH_OSD_OP_APPEND: We can roll back an append locally by
- including the previous object size as part of the PG log event.
-- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
- requires that we retain the deleted object until all replicas have
- persisted the deletion event. The erasure coded backend will therefore
- need to store objects with the version at which they were created
- included in the key provided to the filestore. Old versions of an
- object can be pruned when all replicas have committed up to the log
- event deleting the object.
-- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
- to be set or removed, we can roll back these operations locally.
-
-Log entries contain a structure explaining how to locally undo the
-operation represented by the log entry
-(see osd_types.h:TransactionInfo::LocalRollBack).
-
-PGTemp and Crush
-----------------
-
-Primaries are able to request a temp acting set mapping in order to
-allow an up-to-date OSD to serve requests while a new primary is
-backfilled (and for other reasons). An erasure coded pg needs to be
-able to designate a primary for these reasons without putting it in
-the first position of the acting set. It also needs to be able to
-leave holes in the requested acting set.
-
-Core Changes:
-
-- OSDMap::pg_to_*_osds needs to separately return a primary. For most
- cases, this can continue to be acting[0].
-- MOSDPGTemp (and related OSD structures) needs to be able to specify
- a primary as well as an acting set.
-- Much of the existing code base assumes that acting[0] is the primary
- and that all elements of acting are valid. This needs to be cleaned
- up since the acting set may contain holes.
-
-Distinguished acting set positions
-----------------------------------
-
-With the replicated strategy, all replicas of a PG are
-interchangeable. With erasure coding, different positions in the
-acting set have different pieces of the erasure coding scheme and are
-not interchangeable. Worse, crush might cause chunk 2 to be written
-to an OSD which already contains an (old) copy of chunk 4.
-This means that the OSD and PG messages need to work in terms of a
-type like pair<shard_t, pg_t> in order to distinguish different pg
-chunks on a single OSD.
-
-Because the mapping of object name to object in the filestore must
-be 1-to-1, we must ensure that the objects in chunk 2 and the objects
-in chunk 4 have different names. To that end, the objectstore must
-include the chunk id in the object key.
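The key shape described above can be sketched as a small ordered struct. The field names below are illustrative stand-ins, not the real ghobject_t layout:

```cpp
// Sketch of a shard-qualified object key: extending the object key with
// a shard id so that chunk 2 and chunk 4 of the same object, stored on
// the same OSD, map to distinct objectstore keys.
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <tuple>

struct ShardedKey {
  std::string name;   // stands in for hobject_t
  uint64_t    gen;    // stands in for gen_t
  int8_t      shard;  // stands in for shard_t

  bool operator<(const ShardedKey &o) const {
    return std::tie(name, gen, shard) < std::tie(o.name, o.gen, o.shard);
  }
};
```

With the shard id in the key, two chunks of one object coexist in one ordered key space, which is the 1-to-1 naming property the text requires.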
-
-Core changes:
-
-- The objectstore `ghobject_t needs to also include a chunk id
- <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
- tuple<hobject_t, gen_t, shard_t>.
-- coll_t needs to include a shard_t.
-- The OSD pg_map and similar pg mappings need to work in terms of a
- spg_t (essentially
- pair<pg_t, shard_t>). Similarly, pg->pg messages need to include
- a shard_t
-- For client->PG messages, the OSD will need a way to know which PG
- chunk should get the message since the OSD may contain both a
- primary and non-primary chunk for the same pg
-
-Object Classes
---------------
-
-Reads from object classes will return ENOTSUP on ec pools by invoking
-a special SYNC read.
-
-Scrub
------
-
-The main catch for ec pools is that sending a crc32 of the
-stored chunk on a replica isn't particularly helpful, since the chunks
-on different replicas presumably store different data. Because we
-don't support overwrites except via DELETE, however, we have the
-option of maintaining a crc32 on each chunk through each append.
-Thus, each replica instead simply computes a crc32 of its own stored
-chunk and compares it with the locally stored checksum. The replica
-then reports to the primary whether the checksums match.
-
-With overwrites, all scrubs are disabled for now until we work out
-what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst).
-
-Crush
------
-
-If crush is unable to generate a replacement for a down member of an
-acting set, the acting set should have a hole at that position rather
-than shifting the other elements of the acting set out of position.
-
-=========
-ECBackend
-=========
-
-MAIN OPERATION OVERVIEW
-=======================
-
-A RADOS put operation can span
-multiple stripes of a single object. There must be code that
-tessellates the application level write into a set of per-stripe write
-operations -- some whole-stripes and up to two partial
-stripes. Without loss of generality, for the remainder of this
-document we will focus exclusively on writing a single stripe (whole
-or partial). We will use the symbol "W" to represent the number of
-blocks within a stripe that are being written, i.e., W <= K.
-
-There are two data flows for handling a write into an EC stripe. The
-choice between the two is based on the size of the write operation
-and the arithmetic properties of the selected parity-generation
-algorithm.
-
-(1) the whole stripe is written/overwritten.
-(2) a read-modify-write operation is performed.
-
-WHOLE STRIPE WRITE
-------------------
-
-This is the simple case, and is already performed in the existing code
-(for appends, that is). The primary receives all of the data for the
-stripe in the RADOS request, computes the appropriate parity blocks
-and sends the data and parity blocks to their destination shards, which
-write them. This is essentially the current EC code.
-
-READ-MODIFY-WRITE
------------------
-
-The primary determines which K-W blocks will remain unmodified
-and reads them from the shards. Once all of the data is received, it is
-combined with the received new data, and new parity blocks are
-computed. The modified blocks are sent to their respective shards and
-written. The RADOS operation is acknowledged.
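The read-modify-write flow can be sketched with XOR parity (M = 1) standing in for a real parity-generation algorithm. All names below are illustrative, not ECBackend APIs:

```cpp
// Illustrative read-modify-write over a single stripe: merge the W new
// blocks into the stripe (the K-W untouched blocks having been "read"),
// then recompute and rewrite the parity.
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Stripe {
  std::vector<std::string> blocks;  // K data blocks
  std::string parity;               // single XOR parity block (M = 1)
};

std::string compute_parity(const std::vector<std::string> &blocks) {
  std::string p(blocks.at(0).size(), '\0');
  for (const auto &b : blocks)
    for (size_t i = 0; i < p.size(); ++i)
      p[i] ^= b[i];
  return p;
}

// new_blocks maps block index -> new content for the W written blocks.
void read_modify_write(Stripe &s,
                       const std::map<size_t, std::string> &new_blocks) {
  for (const auto &[idx, content] : new_blocks)  // "modify"
    s.blocks[idx] = content;
  s.parity = compute_parity(s.blocks);           // recompute, then "write"
}
```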
-
-OSD Object Write and Consistency
---------------------------------
-
-Regardless of the algorithm chosen above, writing of the data is a two
-phase process: commit and rollforward. The primary sends the log
-entries with the operation described (see
-osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)).
-In all cases, the "commit" is performed in place, possibly leaving some
-information required for a rollback in a write-aside object. The
-rollforward phase occurs once all acting set replicas have committed
-the commit (sorry, overloaded term) and removes the rollback information.
-
-In the case of overwrites of existing stripes, the rollback information
-has the form of a sparse object containing the old values of the
-overwritten extents, populated using clone_range. This is essentially
-a place-holder implementation; in real life, bluestore will have an
-efficient primitive for this.
-
-The rollforward part can be delayed since we report the operation as
-committed once all replicas have committed. Currently, whenever we
-send a write, we also indicate that all previously committed
-operations should be rolled forward (see
-ECBackend::try_reads_to_commit). If there aren't any in the pipeline
-when we arrive at the waiting_rollforward queue, we start a dummy
-write to move things along (see the Pipeline section later on and
-ECBackend::try_finish_rmw).
-
-ExtentCache
------------
-
-It's pretty important to be able to pipeline writes on the same
-object. For this reason, there is a cache of extents written by
-cacheable operations. Each extent remains pinned until the operations
-referring to it are committed. The pipeline prevents rmw operations
-from running until uncacheable transactions (clones, etc) are flushed
-from the pipeline.
-
-See ExtentCache.h for a detailed explanation of how the cache
-states correspond to the higher level invariants about the conditions
-under which concurrent operations can refer to the same object.
-
-Pipeline
---------
-
-Reading src/osd/ExtentCache.h should have given a good idea of how
-operations might overlap. There are several states involved in
-processing a write operation, and an important invariant which
-isn't enforced by PrimaryLogPG at a higher level and which needs to be
-managed by ECBackend. The important invariant is that we can't
-have uncacheable and rmw operations running at the same time
-on the same object. For simplicity, we simply enforce that any
-operation which contains an rmw operation must wait until
-all in-progress uncacheable operations complete.
-
-There are improvements to be made here in the future.
-
-For more details, see ECBackend::waiting_* and
-ECBackend::try_<from>_to_<to>.
-
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst
deleted file mode 100644
index 27669a0..0000000
--- a/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst
+++ /dev/null
@@ -1,33 +0,0 @@
-===============
-jerasure plugin
-===============
-
-Introduction
-------------
-
-The parameters interpreted by the jerasure plugin are:
-
-::
-
- ceph osd erasure-code-profile set myprofile \
- directory=<dir> \ # plugin directory absolute path
- plugin=jerasure \ # plugin name (only jerasure)
- k=<k> \ # data chunks (default 2)
- m=<m> \ # coding chunks (default 2)
- technique=<technique> \ # coding technique
-
-The coding techniques can be chosen among *reed_sol_van*,
-*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*,
-*blaum_roth* and *liber8tion*.
-
-The *src/erasure-code/jerasure* directory contains the
-implementation. It is a wrapper around the code found at
-`https://github.com/ceph/jerasure <https://github.com/ceph/jerasure>`_
-and `https://github.com/ceph/gf-complete
-<https://github.com/ceph/gf-complete>`_ , pinned to the latest stable
-version in *.gitmodules*. These repositories are copies of the
-upstream repositories `http://jerasure.org/jerasure/jerasure
-<http://jerasure.org/jerasure/jerasure>`_ and
-`http://jerasure.org/jerasure/gf-complete
-<http://jerasure.org/jerasure/gf-complete>`_ . The difference
-between the two, if any, should match pull requests against upstream.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst
deleted file mode 100644
index 52f98e8..0000000
--- a/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst
+++ /dev/null
@@ -1,385 +0,0 @@
-:orphan:
-
-=================================
-Proposed Next Steps for ECBackend
-=================================
-
-PARITY-DELTA-WRITE
-------------------
-
-RMW operations currently require 4 network hops (2 round trips). In
-principle, for some codes, we can reduce this to 3 by sending the
-update to the replicas holding the data blocks and having them
-compute a delta to forward onto the parity blocks.
-
-The primary reads the current values of the "W" blocks and then uses
-the new values of the "W" blocks to compute parity-deltas for each of
-the parity blocks. The W blocks and the parity delta-blocks are sent
-to their respective shards.
-
-The choice of whether to use a read-modify-write or a
-parity-delta-write is a complex policy issue that is TBD in the details
-and is likely to be heavily dependent on the computational costs
-associated with a parity-delta vs. a regular parity-generation
-operation. However, it is believed that the parity-delta scheme is
-likely to be the preferred choice, when available.
-
-The internal interface to the erasure coding library plug-ins needs to
-be extended to support the ability to query if parity-delta
-computation is possible for a selected algorithm as well as an
-interface to the actual parity-delta computation algorithm when
-available.
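For linear codes, the delta idea can be illustrated with XOR parity, the simplest case. `parity_delta_write` below is a hypothetical helper showing only that applying the delta at the parity shard matches a full recompute; a real code would apply the delta through the code's GF arithmetic:

```cpp
// Parity-delta sketch for a linear (XOR) code: the shard holding the
// data block computes delta = old_data XOR new_data, and the parity
// shard applies new_parity = old_parity XOR delta. No other data
// blocks need to be read.
#include <cassert>
#include <string>

std::string xor_bytes(const std::string &a, const std::string &b) {
  std::string out = a;
  for (size_t i = 0; i < out.size(); ++i)
    out[i] ^= b[i];
  return out;
}

std::string parity_delta_write(const std::string &old_parity,
                               const std::string &old_data,
                               const std::string &new_data) {
  std::string delta = xor_bytes(old_data, new_data);  // on the data shard
  return xor_bytes(old_parity, delta);                // on the parity shard
}
```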
-
-Stripe Cache
-------------
-
-It may be a good idea to extend the current ExtentCache usage to
-cache some data past when the pinning operation releases it.
-One application pattern that is important to optimize is the small
-block sequential write operation (think of the journal of a journaling
-file system or a database transaction log). Regardless of the chosen
-redundancy algorithm, it is advantageous for the primary to
-retain/buffer recently read/written portions of a stripe in order to
-reduce network traffic. The dynamic contents of this cache may be used
-in the determination of whether a read-modify-write or a
-parity-delta-write is performed. The sizing of this cache is TBD, but
-we should plan on allowing at least a few full stripes per active
-client. Limiting the cache occupancy on a per-client basis will reduce
-the noisy neighbor problem.
-
-Recovery and Rollback Details
-=============================
-
-Implementing a Rollback-able Prepare Operation
-----------------------------------------------
-
-The prepare operation is implemented at each OSD through a simulation
-of a versioning or copy-on-write capability for modifying a portion of
-an object.
-
-When a prepare operation is performed, the new data is written into a
-temporary object. The PG log for the
-operation will contain a reference to the temporary object so that it
-can be located for recovery purposes as well as a record of all of the
-shards which are involved in the operation.
-
-In order to avoid fragmentation (and hence preserve future read
-performance), creation of the temporary object needs special
-attention. The name of the temporary object affects its location
-within the KV store. Right now it's unclear whether it's desirable for
-the name to locate near the base object or whether a separate subset
-of keyspace should be used for temporary objects. Sam believes that
-colocation with the base object is preferred (he suggests using the
-generation counter of the ghobject for temporaries), whereas Allen
-believes that using a separate subset of keyspace is desirable since
-these keys are ephemeral and we don't want to actually colocate them
-with the base object keys. Perhaps some modeling here can help resolve
-this issue. The data of the temporary object wants to be located as
-close to the data of the base object as possible. This may be best
-performed by adding a new ObjectStore creation primitive that takes
-the base object as an additional parameter that is a hint to the allocator.
-
-Sam: I think that the short lived thing may be a red herring. We'll
-be updating the donor and primary objects atomically, so it seems like
-we'd want them adjacent in the key space, regardless of the donor's
-lifecycle.
-
-The apply operation moves the data from the temporary object into the
-correct position within the base object and deletes the associated
-temporary object. This operation is done using a specialized
-ObjectStore primitive. In the current ObjectStore interface, this can
-be done using the clonerange function followed by a delete, but can be
-done more efficiently with a specialized move primitive.
-Implementation of the specialized primitive on FileStore can be done
-by copying the data. Some file systems have extensions that might also
-be able to implement this operation (like a defrag API that swaps
-chunks between files). It is expected that NewStore will be able to
-support this efficiently and natively (It has been noted that this
-sequence requires that temporary object allocations, which tend to be
-small, be efficiently converted into blocks for main objects and that
-blocks that were formerly inside of main objects must be reusable with
-minimal overhead).
-
-The prepare and apply operations can be separated arbitrarily in
-time. If a read operation accesses an object that has been altered by
-a prepare operation (but without a corresponding apply operation) it
-must return the data after the prepare operation. This is done by
-creating an in-memory database of objects which have had a prepare
-operation without a corresponding apply operation. All read operations
-must consult this in-memory data structure in order to get the correct
-data. It should be explicitly recognized that it is likely that there
-will be multiple prepare operations against a single base object and
-the code must handle this case correctly. This code is implemented as
-a layer between ObjectStore and all existing readers. Annoyingly,
-we'll want to trash this state when the interval changes, so the first
-thing that needs to happen after activation is that the primary and
-replicas apply up to last_update so that the empty cache will be
-correct.
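-A minimal sketch of such an in-memory layer follows; the names are
-hypothetical (the real code would sit between ObjectStore and all
-existing readers), and it deliberately supports multiple outstanding
-prepare operations against one object:

```python
# Sketch of the in-memory database of prepared-but-unapplied extents.
# Reads consult it and overlay pending data onto the base object's bytes.

class ExtentCache:
    def __init__(self):
        # obj -> list of (offset, data), oldest first; several prepares
        # against a single base object must all be retained.
        self.pending = {}

    def prepare(self, obj, off, data):
        self.pending.setdefault(obj, []).append((off, data))

    def apply_oldest(self, obj):
        # The apply operation moved the data into place; drop the entry.
        self.pending[obj].pop(0)
        if not self.pending[obj]:
            del self.pending[obj]

    def read(self, obj, base, off, length):
        """Return base bytes with all pending prepares overlaid."""
        buf = bytearray(base[off:off + length])
        for poff, pdata in self.pending.get(obj, []):
            for i in range(len(pdata)):
                pos = poff + i - off
                if 0 <= pos < length:
                    buf[pos] = pdata[i]
        return bytes(buf)
```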
-
-During peering, it is now obvious that an unapplied prepare operation
-can easily be rolled back simply by deleting the associated temporary
-object and removing that entry from the in-memory data structure.
-
-Partial Application Peering/Recovery modifications
---------------------------------------------------
-
-Some writes will be small enough to not require updating all of the
-shards holding data blocks. For write amplification minimization
-reasons, it would be best to avoid writing to those shards at all,
-and delay even sending the log entries until the next write which
-actually hits that shard.
-
-The delaying (buffering) of the transmission of the prepare and apply
-operations for witnessing OSDs creates new situations that peering
-must handle. In particular the logic for determining the authoritative
-last_update value (and hence the selection of the OSD which has the
-authoritative log) must be modified to account for the valid but
-missing (i.e., delayed/buffered) pglog entries to which the
-authoritative OSD was only a witness.
-
-Because a partial write might complete without persisting a log entry
-on every replica, we have to do a bit more work to determine an
-authoritative last_update. The constraint (as with a replicated PG)
-is that last_update >= the most recent log entry for which a commit
-was sent to the client (call this actual_last_update). Secondarily,
-we want last_update to be as small as possible since any log entry
-past actual_last_update (we do not apply a log entry until we have
-sent the commit to the client) must be able to be rolled back. Thus,
-the smaller a last_update we choose, the less recovery will need to
-happen (we can always roll back, but rolling a replica forward may
-require an object rebuild). Thus, we will set last_update to one entry before
-the oldest log entry we can prove cannot have been committed. In
-current master, this is simply the last_update of the shortest log
-from that interval (because that log did not persist any entry past
-that point -- a precondition for sending a commit to the client). For
-this design, we must consider the possibility that any log is missing
-at its head log entries in which it did not participate. Thus, we
-must determine the most recent interval in which we went active
-(essentially, this is what find_best_info currently does). We then
-pull the log from each live osd from that interval back to the minimum
-last_update among them. Then, we extend all logs from the
-authoritative interval until each hits an entry in which it should
-have participated, but did not record. The shortest of these extended
-logs must therefore contain any log entry for which we sent a commit
-to the client -- and the last entry gives us our last_update.
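-The log-extension procedure above can be sketched as follows. The
-representation (an ordered entry list for the interval plus per-shard
-log heads) is a simplification for illustration, not the PGLog data
-structures:

```python
# Sketch of the authoritative last_update selection for partial writes:
# walk each shard's log head forward through entries the shard merely
# witnessed (did not participate in); the minimum of these extended
# heads bounds every write for which a commit could have been sent.

def authoritative_last_update(entries, heads):
    """entries: ordered [(version, participating_shards)] for the
    authoritative interval.
    heads: {shard: last version recorded in that shard's log}."""
    extended = {}
    for shard, head in heads.items():
        for version, participants in entries:
            if version <= head:
                continue
            if shard in participants:
                break            # shard should have recorded this; stop
            head = version       # shard was only a witness: extend past it
        extended[shard] = head
    return min(extended.values())
```

-The shortest extended log wins, exactly as in the prose: any entry past
-it cannot have been committed to a client.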
-
-Deep scrub support
-------------------
-
-The simple answer here is probably our best bet. EC pools can't use
-the omap namespace at all right now. The simplest solution would be
-to take a prefix of the omap space and pack N M byte L bit checksums
-into each key/value. The prefixing seems like a sensible precaution
-against eventually wanting to store something else in the omap space.
-It seems like any write will need to read at least the blocks
-containing the modified range. However, with a code able to compute
-parity deltas, we may not need to read a whole stripe. Even without
-that, we don't want to have to write to blocks not participating in
-the write. Thus, each shard should store checksums only for itself.
-It seems like you'd be able to store checksums for all shards on the
-parity blocks, but there may not be distinguished parity blocks which
-are modified on all writes (LRC or shec provide two examples). L
-should probably have a fixed number of options (16, 32, 64?) and be
-configurable per-pool at pool creation. N and M should likewise be
-configurable at pool creation with sensible defaults.
-
-We need to handle online upgrade. I think the right answer is that
-the first overwrite to an object with an append only checksum
-removes the append only checksum and writes in whatever stripe
-checksums actually got written. The next deep scrub then writes
-out the full checksum omap entries.
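-As a concrete (and entirely hypothetical) illustration of the packing,
-the sketch below stores N CRC32 checksums per omap value under a
-reserved key prefix; the prefix, N, and chunk size are assumptions for
-the demo, not a proposed on-disk format:

```python
import struct
import zlib

PREFIX = b"cksum_"   # hypothetical reserved prefix in the omap space
N = 4                # checksums packed into each key/value pair
CHUNK = 8            # bytes of shard data covered by one checksum (tiny,
                     # for the demo; L = 32 bits via CRC32)

def build_checksum_omap(shard_data):
    """Pack per-chunk CRC32s of this shard's data into omap entries."""
    sums = [zlib.crc32(shard_data[i:i + CHUNK])
            for i in range(0, len(shard_data), CHUNK)]
    omap = {}
    for i in range(0, len(sums), N):
        group = sums[i:i + N]
        key = PREFIX + struct.pack(">I", i // N)
        omap[key] = struct.pack(">%dI" % len(group), *group)
    return omap
```

-Note each shard checksums only its own data, per the argument above.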
-
-RADOS Client Acknowledgement Generation Optimization
-====================================================
-
-Now that the recovery scheme is understood, we can discuss the
-generation of the RADOS operation acknowledgement (ACK) by the
-primary ("sufficient" from above). It is NOT required that the primary
-wait for all shards to complete their respective prepare
-operations. Using our example where the RADOS operation writes only
-"W" chunks of the stripe, the primary will generate and send W+M
-prepare operations (possibly including a send-to-self). The primary
-need only wait for enough shards to be written to ensure recovery of
-the data. Thus, after writing W + M chunks, you can afford the loss of M
-chunks. Hence the primary can generate the RADOS ACK after W+M-M => W
-of those prepare operations are completed.
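-The arithmetic reduces to a one-line predicate (``can_ack`` is an
-illustrative name):

```python
# With w + m prepares in flight and at most m tolerable losses, the
# primary may generate the RADOS ACK once any w prepares have completed.

def can_ack(completed, w, m):
    """True once enough prepares have completed to guarantee recovery."""
    return completed >= (w + m) - m   # i.e. completed >= w
```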
-
-Inconsistent object_info_t versions
-===================================
-
-A natural consequence of only writing the blocks which actually
-changed is that we don't want to update the object_info_t of the
-objects which didn't. I actually think it would pose a problem to do
-so: pg ghobject namespaces are generally large, and unless the osd is
-seeing a bunch of overwrites on a small set of objects, I'd expect
-each write to be far enough apart in the backing ghobject_t->data
-mapping to each constitute a random metadata update. Thus, we have to
-accept that not every shard will have the current version in its
-object_info_t. We can't even bound how old the version on a
-particular shard will happen to be. In particular, the primary does
-not necessarily have the current version. One could argue that the
-parity shards would always have the current version, but not every
-code necessarily has designated parity shards which see every write
-(certainly LRC, iirc shec, and even with a more pedestrian code, it
-might be desirable to rotate the shards based on object hash). Even
-if you chose to designate a shard as witnessing all writes, the pg
-might be degraded with that particular shard missing. This is a bit
-tricky: currently, reads and writes implicitly return the most recent
-version of the object written. On reads, we'd have to read K shards
-to answer that question. We can get around that by adding a "don't
-tell me the current version" flag. Writes are more problematic: we
-need an object_info from the most recent write in order to form the
-new object_info and log_entry.
-
-A truly terrifying option would be to eliminate version and
-prior_version entirely from the object_info_t. There are a few
-specific purposes it serves:
-
-#. On OSD startup, we prime the missing set by scanning backwards
- from last_update to last_complete comparing the stored object's
- object_info_t to the version of most recent log entry.
-#. During backfill, we compare versions between primary and target
- to avoid some pushes. We use it elsewhere as well.
-#. While pushing and pulling objects, we verify the version.
-#. We return it on reads and writes and allow the librados user to
- assert it atomically on writes to allow the user to deal with write
- races (used extensively by rbd).
-
-Case (3) isn't actually essential, just convenient. Oh well. (4)
-is more annoying. Writes are easy since we know the version. Reads
-are tricky because we may not need to read from all of the replicas.
-Simplest solution is to add a flag to rados operations to just not
-return the user version on read. We can also just not support the
-user version assert on ec for now (I think? Only user is rgw bucket
-indices iirc, and those will always be on replicated because they use
-omap).
-
-We can avoid (1) by maintaining the missing set explicitly. It's
-already possible for there to be a missing object without a
-corresponding log entry (Consider the case where the most recent write
-is to an object which has not been updated in weeks. If that write
-becomes divergent, the written object needs to be marked missing based
-on the prior_version which is not in the log.) The PGLog already has
-a way of handling those edge cases (see divergent_priors). We'd
-simply expand that to contain the entire missing set and maintain it
-atomically with the log and the objects. This isn't really an
-unreasonable option: the additional keys would be fewer than the
-existing log keys + divergent_priors and aren't updated in the fast
-write path anyway.
-
-The second case is a bit trickier. It's really an optimization for
-the case where a pg became not in the acting set long enough for the
-logs to no longer overlap but not long enough for the PG to have
-healed and removed the old copy. Unfortunately, this describes the
-case where a node was taken down for maintenance with noout set. It's
-probably not acceptable to re-backfill the whole OSD in such a case,
-so we need to be able to quickly determine whether a particular shard
-is up to date given a valid acting set of other shards.
-
-Let ordinary writes which do not change the object size not touch the
-object_info at all. That means that the object_info version won't
-match the pg log entry version. Include in the pg_log_entry_t the
-current object_info version as well as which shards participated (as
-mentioned above). In addition to the object_info_t attr, record on
-each shard s a vector recording for each other shard s' the most
-recent write which spanned both s and s'. Operationally, we maintain
-an attr on each shard containing that vector. A write touching S
-updates the version stamp entry for each shard in S on each shard in
-S's attribute (and leaves the rest alone). If we have a valid acting
-set during backfill, we must have a witness of every write which
-completed -- so taking the max of each entry over all of the acting
-set shards must give us the current version for each shard. During
-recovery, we set the attribute on the recovery target to that max
-vector (Question: with LRC, we may not need to touch much of the
-acting set to recover a particular shard -- can we just use the max of
-the shards we used to recover, or do we need to grab the version
-vector from the rest of the acting set as well? I'm not sure, not a
-big deal anyway, I think).
-
-The above lets us perform blind writes without knowing the current
-object version (log entry version, that is) while still allowing us to
-avoid backfilling up to date objects. The only catch is that our
-backfill scans will scan all replicas, not just the primary and the
-backfill targets.
-
-It would be worth adding into scrub the ability to check the
-consistency of the gathered version vectors -- probably by just
-taking 3 random valid subsets and verifying that they generate
-the same authoritative version vector.
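-The max-vector computation and the scrub-time consistency check can be
-sketched as follows (``authoritative_vector`` and ``scrub_check`` are
-hypothetical names, and the acting-set subsets are supplied by the
-caller rather than chosen randomly):

```python
# Sketch of the version-vector recovery step: the entrywise max over the
# acting set's vectors yields the current version for each shard, since
# every completed write has a witness in any valid acting set.

def authoritative_vector(vectors):
    """vectors: [{shard: version of last write spanning it}, ...]."""
    out = {}
    for vec in vectors:
        for shard, version in vec.items():
            out[shard] = max(out.get(shard, version), version)
    return out

def scrub_check(vectors, subsets):
    """Scrub-style check: every valid subset must agree on the result."""
    results = [authoritative_vector([vectors[i] for i in sub])
               for sub in subsets]
    return all(r == results[0] for r in results)
```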
-
-Implementation Strategy
-=======================
-
-It goes without saying that it would be unwise to attempt to do all of
-this in one massive PR. It's also not a good idea to merge code which
-isn't being tested. To that end, it's worth thinking a bit about
-which bits can be tested on their own (perhaps with a bit of temporary
-scaffolding).
-
-We can implement the overwrite friendly checksumming scheme easily
-enough with the current implementation. We'll want to enable it on a
-per-pool basis (probably using a flag which we'll later repurpose for
-actual overwrite support). We can enable it in some of the ec
-thrashing tests in the suite. We can also add a simple test
-validating the behavior of turning it on for an existing ec pool
-(later, we'll want to be able to convert append-only ec pools to
-overwrite ec pools, so that test will simply be expanded as we go).
-The flag should be gated by the experimental feature flag since we
-won't want to support this as a valid configuration -- testing only.
-We need to upgrade append only ones in place during deep scrub.
-
-Similarly, we can implement the unstable extent cache with the current
-implementation; it even lets us cut out the readable ack that the replicas
-send to the primary after the commit which lets it release the lock.
-Same deal, implement, gate with experimental flag, add to some of the
-automated tests. I don't really see a reason not to use the same flag
-as above.
-
-We can certainly implement the move-range primitive with unit tests
-before there are any users. Adding coverage to the existing
-objectstore tests would suffice here.
-
-Explicit missing set can be implemented now, same deal as above --
-might as well even use the same feature bit.
-
-The TPC protocol outlined above can actually be implemented on an append
-only EC pool. Same deal as above, can even use the same feature bit.
-
-The RADOS flag to suppress the read op user version return can be
-implemented immediately. Mostly just needs unit tests.
-
-The version vector problem is an interesting one. For append only EC
-pools, it would be pointless since all writes increase the size and
-therefore update the object_info. We could do it for replicated pools
-though. It's a bit silly since all "shards" see all writes, but it
-would still let us implement and partially test the augmented backfill
-code as well as the extra pg log entry fields -- this depends on the
-explicit pg log entry branch having already merged. It's not entirely
-clear to me that this one is worth doing separately. It's enough code
-that I'd really prefer to get it done independently, but it's also a
-fair amount of scaffolding that will be later discarded.
-
-PGLog entries need to be able to record the participants and log
-comparison needs to be modified to extend logs with entries they
-wouldn't have witnessed. This logic should be abstracted behind
-PGLog so it can be unittested -- that would let us test it somewhat
-before the actual ec overwrites code merges.
-
-Whatever needs to happen to the ec plugin interface can probably be
-done independently of the rest of this (pending resolution of
-questions below).
-
-The actual nuts and bolts of performing the ec overwrite it seems to
-me can't be productively tested (and therefore implemented) until the
-above are complete, so best to get all of the supporting code in
-first.
-
-Open Questions
-==============
-
-Is there a code we should be using that would let us compute a parity
-delta without rereading and reencoding the full stripe? If so, is it
-the kind of thing we need to design for now, or can it be reasonably
-put off?
-
-What needs to happen to the EC plugin interface?
diff --git a/src/ceph/doc/dev/osd_internals/index.rst b/src/ceph/doc/dev/osd_internals/index.rst
deleted file mode 100644
index 7e82914..0000000
--- a/src/ceph/doc/dev/osd_internals/index.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-==============================
-OSD developer documentation
-==============================
-
-.. rubric:: Contents
-
-.. toctree::
- :glob:
-
- *
diff --git a/src/ceph/doc/dev/osd_internals/last_epoch_started.rst b/src/ceph/doc/dev/osd_internals/last_epoch_started.rst
deleted file mode 100644
index 9978bd3..0000000
--- a/src/ceph/doc/dev/osd_internals/last_epoch_started.rst
+++ /dev/null
@@ -1,60 +0,0 @@
-======================
-last_epoch_started
-======================
-
-info.last_epoch_started records an activation epoch e for interval i
-such that all writes committed in i or earlier are reflected in the
-local info/log and no writes after i are reflected in the local
-info/log. Since no committed write is ever divergent, even if we
-get an authoritative log/info with an older info.last_epoch_started,
-we can leave our info.last_epoch_started alone since no writes could
-have committed in any intervening interval (See PG::proc_master_log).
-
-info.history.last_epoch_started records a lower bound on the most
-recent interval in which the pg as a whole went active and accepted
-writes. On a particular osd, it is also an upper bound on the
-activation epoch of intervals in which writes in the local pg log
-occurred (we update it before accepting writes). Because all
-committed writes are committed by all acting set osds, any
-non-divergent writes ensure that history.last_epoch_started was
-recorded by all acting set members in the interval. Once peering has
-queried one osd from each interval back to some seen
-history.last_epoch_started, it follows that no interval after the max
-history.last_epoch_started can have reported writes as committed
-(since we record it before recording client writes in an interval).
-Thus, the minimum last_update across all infos with
-info.last_epoch_started >= MAX(history.last_epoch_started) must be an
-upper bound on writes reported as committed to the client.
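-The resulting bound can be expressed as a small function over the
-gathered infos; this is a simplification for illustration (the real
-logic lives in peering):

```python
# Sketch of the committed-write bound: among infos whose
# last_epoch_started is at least the max history.last_epoch_started
# seen, the minimum last_update upper-bounds writes reported committed.

def committed_upper_bound(infos):
    """infos: [(last_epoch_started, history_les, last_update)] per OSD."""
    max_hles = max(h for _, h, _ in infos)
    return min(lu for les, _, lu in infos if les >= max_hles)
```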
-
-We update info.last_epoch_started with the initial activation message,
-but we only update history.last_epoch_started after the new
-info.last_epoch_started is persisted (possibly along with the first
-write). This ensures that we do not require an osd with the most
-recent info.last_epoch_started until all acting set osds have recorded
-it.
-
-In find_best_info, we do include info.last_epoch_started values when
-calculating the max_last_epoch_started_found because we want to avoid
-designating a log entry divergent which in a prior interval would have
-been non-divergent since it might have been used to serve a read. In
-activate(), we use the peer's last_epoch_started value as a bound on
-how far back divergent log entries can be found.
-
-However, in a case like
-
-.. code::
-
- calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
- calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
- calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
- calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
-
-since osd.1 is the only one which recorded info.les=477 while 4,0
-which were the acting set in that interval did not (4 restarted and 0
-did not get the message in time) the pg is marked incomplete when
-either 4 or 0 would have been valid choices. To avoid this, we do not
-consider info.les for incomplete peers when calculating
-min_last_epoch_started_found. It would not have been in the acting
-set, so we must have another osd from that interval anyway (if
-maybe_went_rw). If that osd does not remember that info.les, then we
-cannot have served reads.
diff --git a/src/ceph/doc/dev/osd_internals/log_based_pg.rst b/src/ceph/doc/dev/osd_internals/log_based_pg.rst
deleted file mode 100644
index 8b11012..0000000
--- a/src/ceph/doc/dev/osd_internals/log_based_pg.rst
+++ /dev/null
@@ -1,206 +0,0 @@
-============
-Log Based PG
-============
-
-Background
-==========
-
-Why PrimaryLogPG?
------------------
-
-Currently, consistency for all ceph pool types is ensured by primary
-log-based replication. This goes for both erasure-coded and
-replicated pools.
-
-Primary log-based replication
------------------------------
-
-Reads must return data written by any write which completed (where the
-client could possibly have received a commit message). There are lots
-of ways to handle this, but ceph's architecture makes it easy for
-everyone at any map epoch to know who the primary is. Thus, the easy
-answer is to route all writes for a particular pg through a single
-ordering primary and then out to the replicas. Though we only
-actually need to serialize writes on a single object (and even then,
-the partial ordering only really needs to provide an ordering between
-writes on overlapping regions), we might as well serialize writes on
-the whole PG since it lets us represent the current state of the PG
-using two numbers: the epoch of the map on the primary in which the
-most recent write started (this is a bit stranger than it might seem
-since map distribution itself is asynchronous -- see Peering and the
-concept of interval changes) and an increasing per-pg version number
--- this is referred to in the code with type eversion_t and stored as
-pg_info_t::last_update. Furthermore, we maintain a log of "recent"
-operations extending back at least far enough to include any
-*unstable* writes (writes which have been started but not committed)
-and objects which aren't up to date locally (see recovery and
-backfill). In practice, the log will extend much further
-(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not
-clean) because it's handy for quickly performing recovery.
-
-Using this log, as long as we talk to a non-empty subset of the OSDs
-which must have accepted any completed writes from the most recent
-interval in which we accepted writes, we can determine a conservative
-log which must contain any write which has been reported to a client
-as committed. There is some freedom here, we can choose any log entry
-between the oldest head remembered by an element of that set (any
-newer cannot have completed without that log containing it) and the
-newest head remembered (clearly, all writes in the log were started,
-so it's fine for us to remember them) as the new head. This is the
-main point of divergence between replicated pools and ec pools in
-PG/PrimaryLogPG: replicated pools try to choose the newest valid
-option to avoid the client needing to replay those operations and
-instead recover the other copies. EC pools instead try to choose
-the *oldest* option available to them.
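-The difference reduces to which end of the valid range is chosen; a
-deliberately minimal sketch:

```python
# Both pool types must pick a head between the oldest and the newest
# head remembered by the probed OSDs; replicated pools take the newest
# (recover the other copies), EC pools take the oldest (roll back).

def choose_head(heads, ec_pool):
    """heads: last_update values from OSDs that must have seen any
    completed write in the most recent writable interval."""
    return min(heads) if ec_pool else max(heads)
```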
-
-The reason for this gets to the heart of the rest of the differences
-in implementation: one copy will not generally be enough to
-reconstruct an ec object. Indeed, there are encodings where some log
-combinations would leave unrecoverable objects (as with a 4+2 encoding
-where 3 of the replicas remember a write, but the other 3 do not -- we
-don't have 3 copies of either version). For this reason, log entries
-representing *unstable* writes (writes not yet committed to the
-client) must be rollbackable using only local information on ec pools.
-Log entries in general may therefore be rollbackable (and in that case,
-via a delayed application or via a set of instructions for rolling
-back an inplace update) or not. Replicated pool log entries are
-never able to be rolled back.
-
-For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
-osd_types.h:pg_log_entry_t, and peering in general.
-
-ReplicatedBackend/ECBackend unification strategy
-================================================
-
-PGBackend
----------
-
-So, the fundamental difference between replication and erasure coding
-is that replication can do destructive updates while erasure coding
-cannot. It would be really annoying if we needed to have two entire
-implementations of PrimaryLogPG, one for each of the two, if there
-are really only a few fundamental differences:
-
-#. How reads work -- async only, requires remote reads for ec
-#. How writes work -- either restricted to append, or must write aside and do a
- tpc
-#. Whether we choose the oldest or newest possible head entry during peering
-#. A bit of extra information in the log entry to enable rollback
-
-and so many similarities
-
-#. All of the stats and metadata for objects
-#. The high level locking rules for mixing client IO with recovery and scrub
-#. The high level locking rules for mixing reads and writes without exposing
- uncommitted state (which might be rolled back or forgotten later)
-#. The process, metadata, and protocol needed to determine the set of osds
- which participated in the most recent interval in which we accepted writes
-#. etc.
-
-Instead, we choose a few abstractions (and a few kludges) to paper over the differences:
-
-#. PGBackend
-#. PGTransaction
-#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
-#. Various bits of the write pipeline disallow some operations based on pool
- type -- like omap operations, class operation reads, and writes which are
- not aligned appends (officially, so far) for ec
-#. Misc other kludges here and there
-
-PGBackend and PGTransaction enable abstraction of differences 1, 2,
-and the addition of 4 as needed to the log entries.
-
-The replicated implementation is in ReplicatedBackend.h/cc and doesn't
-require much explanation, I think. More detail on the ECBackend can be
-found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.
-
-PGBackend Interface Explanation
-===============================
-
-Note: this is from a design document from before the original firefly
-and is probably out of date w.r.t. some of the method names.
-
-Readable vs Degraded
---------------------
-
-For a replicated pool, an object is readable iff it is present on
-the primary (at the right version). For an ec pool, we need at least
-M shards present to do a read, and we need it on the primary. For
-this reason, PGBackend needs to include some interfaces for determining
-when recovery is required to serve a read vs a write. This also
-changes the rules for when peering has enough logs to prove that it
-can safely go active.
-
-Core Changes:
-
-- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
- | objects to allow the user to make these determinations.
-
-Client Reads
-------------
-
-Reads with the replicated strategy can always be satisfied
-synchronously out of the primary OSD. With an erasure coded strategy,
-the primary will need to request data from some number of replicas in
-order to satisfy a read. PGBackend will therefore need to provide
-separate objects_read_sync and objects_read_async interfaces where
-the former won't be implemented by the ECBackend.
-
-PGBackend interfaces:
-
-- objects_read_sync
-- objects_read_async
-
-Scrub
------
-
-We currently have two scrub modes with different default frequencies:
-
-#. [shallow] scrub: compares the set of objects and metadata, but not
- the contents
-#. deep scrub: compares the set of objects, metadata, and a crc32 of
- the object contents (including omap)
-
-The primary requests a scrubmap from each replica for a particular
-range of objects. The replica fills out this scrubmap for the range
-of objects including, if the scrub is deep, a crc32 of the contents of
-each object. The primary gathers these scrubmaps from each replica
-and performs a comparison identifying inconsistent objects.
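-A toy version of the comparison step (not the actual be_* code), with
-each scrubmap entry reduced to a (size, crc) pair:

```python
# Sketch of the primary's scrubmap comparison: an object is flagged as
# inconsistent if any replica is missing it or disagrees on its
# size/crc pair.

def find_inconsistent(scrubmaps):
    """scrubmaps: {replica: {object: (size, crc)}}."""
    bad = set()
    objects = set().union(*(m.keys() for m in scrubmaps.values()))
    for obj in objects:
        seen = {m.get(obj) for m in scrubmaps.values()}
        if len(seen) != 1:       # missing on a replica, or mismatched
            bad.add(obj)
    return bad
```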
-
-Most of this can work essentially unchanged with erasure coded PG with
-the caveat that the PGBackend implementation must be in charge of
-actually doing the scan.
-
-
-PGBackend interfaces:
-
-- be_*
-
-Recovery
---------
-
-The logic for recovering an object depends on the backend. With
-the current replicated strategy, we first pull the object replica
-to the primary and then concurrently push it out to the replicas.
-With the erasure coded strategy, we probably want to read the
-minimum number of replica chunks required to reconstruct the object
-and push out the replacement chunks concurrently.
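-As a toy stand-in for a real erasure code, a 2+1 XOR code shows the
-idea of reading only the minimum surviving chunks to rebuild a missing
-one (the function names are illustrative):

```python
# 2+1 XOR erasure code: shards 0 and 1 hold data, shard 2 holds their
# XOR parity.  Any single missing shard is the XOR of the other two, so
# recovery reads exactly the minimum number of surviving chunks.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(d0, d1):
    return {0: d0, 1: d1, 2: xor(d0, d1)}

def recover(shards, missing):
    """Reconstruct the missing shard from the two survivors."""
    survivors = [shards[i] for i in sorted(shards) if i != missing]
    return xor(survivors[0], survivors[1])
```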
-
-Another difference is that objects in erasure coded pg may be
-unrecoverable without being unfound. The "unfound" concept
-should probably then be renamed to unrecoverable. Also, the
-PGBackend implementation will have to be able to direct the search
-for pg replicas with unrecoverable object chunks and to be able
-to determine whether a particular object is recoverable.
-
-
-Core changes:
-
-- s/unfound/unrecoverable
-
-PGBackend interfaces:
-
-- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
-- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
-- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
-- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
-- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_
diff --git a/src/ceph/doc/dev/osd_internals/map_message_handling.rst b/src/ceph/doc/dev/osd_internals/map_message_handling.rst
deleted file mode 100644
index a5013c2..0000000
--- a/src/ceph/doc/dev/osd_internals/map_message_handling.rst
+++ /dev/null
@@ -1,131 +0,0 @@
-===========================
-Map and PG Message handling
-===========================
-
-Overview
---------
-The OSD handles routing incoming messages to PGs, creating the PG if necessary
-in some cases.
-
-PG messages generally come in two varieties:
-
- 1. Peering Messages
- 2. Ops/SubOps
-
-There are several ways in which a message might be dropped or delayed. It is
-important that the message delaying does not result in a violation of certain
-message ordering requirements on the way to the relevant PG handling logic:
-
- 1. Ops referring to the same object must not be reordered.
- 2. Peering messages must not be reordered.
- 3. Subops must not be reordered.
-
-MOSDMap
--------
-MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
-OSD must perform several tasks:
-
- 1. Persist the new maps to the filestore.
- Several PG operations rely on having access to maps dating back to the last
- time the PG was clean.
- 2. Update and persist the superblock.
- 3. Update OSD state related to the current map.
- 4. Expose new maps to PG processes via *OSDService*.
- 5. Remove PGs due to pool removal.
- 6. Queue dummy events to trigger PG map catchup.
-
-Each PG asynchronously catches up to the currently published map during
-process_peering_events before processing the event. As a result, different
-PGs may have different views as to the "current" map.
-
-One consequence of this design is that messages containing submessages from
-multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
-with the PG's epoch as well as tagging the message as a whole with the OSD's
-current published epoch.
-
-MOSDPGOp/MOSDPGSubOp
---------------------
-See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
-
-MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
-between OSDs to coordinate most non peering activities including replicating
-MOSDPGOp operations.
-
-OSD::require_same_or_newer map checks that the current OSDMap is at least
-as new as the map epoch indicated on the message. If not, the message is
-queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
-cannot violate the above conditions since any two messages will be queued
-in order of receipt and if a message is received with epoch e0, a later message
-from the same source must be at epoch at least e0. Note that two PGs from
-the same OSD count for these purposes as different sources for single PG
-messages. That is, messages from different PGs may be reordered.
-
-
-MOSDPGOps follow the following process:
-
- 1. OSD::handle_op: validates permissions and crush mapping.
- discards the request if the client is not connected and cannot receive the reply (see OSD::op_is_discardable)
- See OSDService::handle_misdirected_op
- See PG::op_has_sufficient_caps
- See OSD::require_same_or_newer_map
- 2. OSD::enqueue_op
-
-MOSDSubOps follow the following process:
-
- 1. OSD::handle_sub_op checks that sender is an OSD
- 2. OSD::enqueue_op
-
-OSD::enqueue_op calls PG::queue_op which checks waiting_for_map before calling OpWQ::queue which adds the op to the queue of the PG responsible for handling it.
-
-OSD::dequeue_op is then eventually called, with a lock on the PG. At
-this time, the op is passed to PG::do_request, which checks that:
-
- 1. the PG map is new enough (PG::must_delay_op)
- 2. the client requesting the op has enough permissions (PG::op_has_sufficient_caps)
- 3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill})
- 4. the PG is active (PG::flushed boolean)
- 5. the op is a CEPH_MSG_OSD_OP and the PG is in PG_STATE_ACTIVE state and not in PG_STATE_REPLAY
-
-If these conditions are not met, the op is either discarded or queued for later processing. If all conditions are met, the op is processed according to its type:
-
- 1. CEPH_MSG_OSD_OP is handled by PG::do_op
- 2. MSG_OSD_SUBOP is handled by PG::do_sub_op
- 3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply
- 4. MSG_OSD_PG_SCAN is handled by PG::do_scan
- 5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill
-
-CEPH_MSG_OSD_OP processing
---------------------------
-
-PrimaryLogPG::do_op handles CEPH_MSG_OSD_OP op and will queue it
-
- 1. in wait_for_all_missing if it is a CEPH_OSD_OP_PGLS for a designated snapid and some object updates are still missing
- 2. in waiting_for_active if the op may write but the scrubber is working
- 3. in waiting_for_missing_object if the op requires an object or a snapdir or a specific snap that is still missing
- 4. in waiting_for_degraded_object if the op may write an object or a snapdir that is degraded, or if another object blocks it ("blocked_by")
- 5. in waiting_for_backfill_pos if the op requires an object that will be available after the backfill is complete
- 6. in waiting_for_ack if an ack from another OSD is expected
- 7. in waiting_for_ondisk if the op is waiting for a write to complete
-
-Peering Messages
-----------------
-See OSD::handle_pg_(notify|info|log|query)
-
-Peering messages are tagged with two epochs:
-
- 1. epoch_sent: map epoch at which the message was sent
- 2. query_epoch: map epoch at which the message triggering the message was sent
-
-These are the same in cases where there was no triggering message. We discard
-a peering message if the PG in question has entered a new epoch since the
-message's query_epoch (See PG::old_peering_evt, PG::queue_peering_event).
-Notifies, infos, and logs are all handled as PG::RecoveryMachine events and
-are wrapped by PG::queue_* in PG::CephPeeringEvts, which include the created
-state machine event along with epoch_sent and query_epoch in order to
-generically check PG::old_peering_msg upon insertion into and removal from the
-queue.
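-As a sketch (hypothetical names and simplified epochs; the real checks are
-PG::old_peering_evt and PG::queue_peering_event), the stale-event filter
-reduces to:

```python
def old_peering_evt(pg_interval_start_epoch, epoch_sent):
    """An event is stale if the PG started a new peering interval
    after the event was sent."""
    return epoch_sent < pg_interval_start_epoch


def queue_peering_event(queue, pg_interval_start_epoch,
                        epoch_sent, query_epoch, evt):
    """Drop stale events on insertion; the same check is repeated
    on removal, since the interval may have changed while queued."""
    if old_peering_evt(pg_interval_start_epoch, epoch_sent):
        return False
    queue.append((epoch_sent, query_epoch, evt))
    return True
```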
-
-Note, notifies, logs, and infos can trigger the creation of a PG. See
-OSD::get_or_create_pg.
-
-
diff --git a/src/ceph/doc/dev/osd_internals/osd_overview.rst b/src/ceph/doc/dev/osd_internals/osd_overview.rst
deleted file mode 100644
index 192ddf8..0000000
--- a/src/ceph/doc/dev/osd_internals/osd_overview.rst
+++ /dev/null
@@ -1,106 +0,0 @@
-===
-OSD
-===
-
-Concepts
---------
-
-*Messenger*
- See src/msg/Messenger.h
-
- Handles sending and receipt of messages on behalf of the OSD. The OSD uses
- two messengers:
-
- 1. cluster_messenger - handles traffic to other OSDs, monitors
- 2. client_messenger - handles client traffic
-
- This division allows the OSD to be configured with different interfaces for
- client and cluster traffic.
-
-*Dispatcher*
- See src/msg/Dispatcher.h
-
- OSD implements the Dispatcher interface. Of particular note is ms_dispatch,
- which serves as the entry point for messages received via either the client
- or cluster messenger. Because there are two messengers, ms_dispatch may be
- called from at least two threads. The osd_lock is always held during
- ms_dispatch.
-
-*WorkQueue*
- See src/common/WorkQueue.h
-
- The WorkQueue class abstracts the process of queueing independent tasks
- for asynchronous execution. Each OSD process contains workqueues for
- distinct tasks:
-
- 1. OpWQ: handles ops (from clients) and subops (from other OSDs).
- Runs in the op_tp threadpool.
- 2. PeeringWQ: handles peering tasks and pg map advancement
- Runs in the op_tp threadpool.
- See Peering
- 3. CommandWQ: handles commands (pg query, etc)
- Runs in the command_tp threadpool.
- 4. RecoveryWQ: handles recovery tasks.
- Runs in the recovery_tp threadpool.
- 5. SnapTrimWQ: handles snap trimming
- Runs in the disk_tp threadpool.
- See SnapTrimmer
- 6. ScrubWQ: handles primary scrub path
- Runs in the disk_tp threadpool.
- See Scrub
- 7. ScrubFinalizeWQ: handles primary scrub finalize
- Runs in the disk_tp threadpool.
- See Scrub
- 8. RepScrubWQ: handles replica scrub path
- Runs in the disk_tp threadpool
- See Scrub
- 9. RemoveWQ: Asynchronously removes old pg directories
- Runs in the disk_tp threadpool
- See PGRemoval
-
-*ThreadPool*
- See src/common/WorkQueue.h
- See also above.
-
- There are 4 OSD threadpools:
-
- 1. op_tp: handles ops and subops
- 2. recovery_tp: handles recovery tasks
- 3. disk_tp: handles disk intensive tasks
- 4. command_tp: handles commands
-
-*OSDMap*
- See src/osd/OSDMap.h
-
- The crush algorithm takes two inputs: a picture of the cluster
- with status information about which nodes are up/down and in/out,
- and the pgid to place. The former is encapsulated by the OSDMap.
- Maps are numbered by *epoch* (epoch_t). These maps are passed around
- within the OSD as std::tr1::shared_ptr<const OSDMap>.
-
- See MapHandling
-
-*PG*
- See src/osd/PG.* src/osd/PrimaryLogPG.*
-
- Objects in rados are hashed into *PGs* and *PGs* are placed via crush onto
- OSDs. The PG structure is responsible for handling requests pertaining to
- a particular *PG* as well as for maintaining relevant metadata and controlling
- recovery.
-
-*OSDService*
- See src/osd/OSD.cc OSDService
-
- The OSDService acts as a broker between PG threads and OSD state which allows
- PGs to perform actions using OSD services such as workqueues and messengers.
- This is still a work in progress. Future cleanups will focus on moving such
- state entirely from the OSD into the OSDService.
-
-Overview
---------
- See src/ceph_osd.cc
-
- The OSD process represents one leaf device in the crush hierarchy. There
- might be one OSD process per physical machine, or more than one if, for
- example, the user configures one OSD instance per disk.
-
diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.rst b/src/ceph/doc/dev/osd_internals/osd_throttles.rst
deleted file mode 100644
index 6739bd9..0000000
--- a/src/ceph/doc/dev/osd_internals/osd_throttles.rst
+++ /dev/null
@@ -1,93 +0,0 @@
-=============
-OSD Throttles
-=============
-
-There are three significant throttles in the filestore: wbthrottle,
-op_queue_throttle, and a throttle based on journal usage.
-
-WBThrottle
-----------
-The WBThrottle is defined in src/os/filestore/WBThrottle.[h,cc] and
-included in FileStore as FileStore::wbthrottle. The intention is to
-bound the amount of outstanding IO we need to do to flush the journal.
-At the same time, we don't want to necessarily do it inline in case we
-might be able to combine several IOs on the same object close together
-in time. Thus, in FileStore::_write, we queue the fd for asynchronous
-flushing and block in FileStore::_do_op if we have exceeded any hard
-limits until the background flusher catches up.
-
-The relevant config options are filestore_wbthrottle*. There are
-different defaults for xfs and btrfs. Each set has hard and soft
-limits on bytes (total dirty bytes), ios (total dirty ios), and
-inodes (total dirty fds). The WBThrottle will begin flushing
-when any of these hits the soft limit and will block in throttle()
-while any has exceeded the hard limit.
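-The soft/hard limit behavior can be sketched as follows (the dict shape and
-the limit values in the test are illustrative, not the
-filestore_wbthrottle_* defaults):

```python
def wbthrottle_state(dirty, soft, hard):
    """dirty/soft/hard are dicts keyed by 'bytes', 'ios', 'inodes'.
    Flushing starts when any counter hits its soft limit; callers
    block in throttle() while any counter exceeds its hard limit."""
    if any(dirty[k] > hard[k] for k in dirty):
        return "block"      # throttle() blocks the caller
    if any(dirty[k] >= soft[k] for k in dirty):
        return "flush"      # background flusher starts working
    return "idle"
```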
-
-Tighter soft limits will cause writeback to happen more quickly,
-but may cause the OSD to miss opportunities for write coalescing.
-Tighter hard limits may cause a reduction in latency variance by
-reducing time spent flushing the journal, but may reduce writeback
-parallelism.
-
-op_queue_throttle
------------------
-The op queue throttle is intended to bound the amount of queued but
-uncompleted work in the filestore by delaying threads calling
-queue_transactions more and more based on how many ops and bytes are
-currently queued. The throttle is taken in queue_transactions and
-released when the op is applied to the filesystem. This period
-includes time spent in the journal queue, time spent writing to the
-journal, time spent in the actual op queue, time spent waiting for the
-wbthrottle to open up (thus, the wbthrottle can push back indirectly
-on the queue_transactions caller), and time spent actually applying
-the op to the filesystem. A BackoffThrottle is used to gradually
-delay the queueing thread after each throttle becomes more than
-filestore_queue_low_threshhold full (a ratio of
-filestore_queue_max_(bytes|ops)). The throttles will block once the
-max value is reached (filestore_queue_max_(bytes|ops)).
-
-The significant config options are:
-filestore_queue_low_threshhold
-filestore_queue_high_threshhold
-filestore_expected_throughput_ops
-filestore_expected_throughput_bytes
-filestore_queue_high_delay_multiple
-filestore_queue_max_delay_multiple
-
-While each throttle is at less than low_threshhold of the max,
-no delay happens. Between low and high, the throttle will
-inject a per-op delay (per op or byte) ramping from 0 at low to
-high_delay_multiple/expected_throughput at high. From high to
-1, the delay will ramp from high_delay_multiple/expected_throughput
-to max_delay_multiple/expected_throughput.
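-A sketch of that piecewise-linear ramp (the default values below are
-illustrative, not the shipped config defaults; the real logic lives in
-BackoffThrottle):

```python
def backoff_delay(ratio, low=0.3, high=0.9,
                  expected_throughput_ops=200.0,
                  high_multiple=2.0, max_multiple=10.0):
    """Per-op delay (seconds) as a function of how full the throttle is.
    0 below 'low'; ramps to high_multiple/expected at 'high'; ramps on
    to max_multiple/expected at 100% full."""
    s = high_multiple / expected_throughput_ops   # delay at the high mark
    m = max_multiple / expected_throughput_ops    # delay when full
    if ratio < low:
        return 0.0
    if ratio < high:
        return (ratio - low) / (high - low) * s
    return s + (min(ratio, 1.0) - high) / (1.0 - high) * (m - s)
```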
-
-filestore_queue_high_delay_multiple and
-filestore_queue_max_delay_multiple probably do not need to be
-changed.
-
-Setting these properly should help to smooth out op latencies by
-mostly avoiding the hard limit.
-
-See FileStore::throttle_ops and FileStore::throttle_bytes.
-
-journal usage throttle
-----------------------
-See src/os/filestore/JournalThrottle.h/cc
-
-The intention of the journal usage throttle is to gradually slow
-down queue_transactions callers as the journal fills up in order
-to smooth out hiccup during filestore syncs. JournalThrottle
-wraps a BackoffThrottle and tracks journaled but not flushed
-journal entries so that the throttle can be released when the
-journal is flushed. The configs work very similarly to the
-op_queue_throttle.
-
-The significant config options are:
-journal_throttle_low_threshhold
-journal_throttle_high_threshhold
-filestore_expected_throughput_ops
-filestore_expected_throughput_bytes
-journal_throttle_high_multiple
-journal_throttle_max_multiple
-
-.. literalinclude:: osd_throttles.txt
diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.txt b/src/ceph/doc/dev/osd_internals/osd_throttles.txt
deleted file mode 100644
index 0332377..0000000
--- a/src/ceph/doc/dev/osd_internals/osd_throttles.txt
+++ /dev/null
@@ -1,21 +0,0 @@
- Messenger throttle (number and size)
- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- FileStore op_queue throttle (number and size, includes a soft throttle based on filestore_expected_throughput_(ops|bytes))
- |--------------------------------------------------------|
- WBThrottle
- |---------------------------------------------------------------------------------------------------------|
- Journal (size, includes a soft throttle based on filestore_expected_throughput_bytes)
- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- |----------------------------------------------------------------------------------------------------> flushed ----------------> synced
- |
-Op: Read Header --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> op_applied -------------------------------------------------------------> Complete
- | |
-SubOp: --Messenger--> ReadHeader --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> sub_op_applied -
- |
- |-----------------------------> flushed ----------------> synced
- |------------------------------------------------------------------------------------------|
- Journal (size)
- |---------------------------------|
- WBThrottle
- |-----------------------------------------------------|
- FileStore op_queue throttle (number and size)
diff --git a/src/ceph/doc/dev/osd_internals/pg.rst b/src/ceph/doc/dev/osd_internals/pg.rst
deleted file mode 100644
index 4055363..0000000
--- a/src/ceph/doc/dev/osd_internals/pg.rst
+++ /dev/null
@@ -1,31 +0,0 @@
-====
-PG
-====
-
-Concepts
---------
-
-*Peering Interval*
- See PG::start_peering_interval.
- See PG::acting_up_affected
- See PG::RecoveryState::Reset
-
- A peering interval is a maximal set of contiguous map epochs in which the
- up and acting sets did not change. PG::RecoveryMachine represents a
- transition from one interval to another as passing through
- RecoveryState::Reset. On PG::RecoveryState::AdvMap PG::acting_up_affected can
- cause the pg to transition to Reset.
-
-
-Peering Details and Gotchas
----------------------------
-For an overview of peering, see `Peering <../../peering>`_.
-
- * PG::flushed defaults to false and is set to false in
- PG::start_peering_interval. Upon transitioning to PG::RecoveryState::Started
-   we send a transaction through the pg op sequencer which, upon completion,
- sends a FlushedEvt which sets flushed to true. The primary cannot go
- active until this happens (See PG::RecoveryState::WaitFlushedPeering).
- Replicas can go active but cannot serve ops (writes or reads).
- This is necessary because we cannot read our ondisk state until unstable
- transactions from the previous interval have cleared.
diff --git a/src/ceph/doc/dev/osd_internals/pg_removal.rst b/src/ceph/doc/dev/osd_internals/pg_removal.rst
deleted file mode 100644
index d968ecc..0000000
--- a/src/ceph/doc/dev/osd_internals/pg_removal.rst
+++ /dev/null
@@ -1,56 +0,0 @@
-==========
-PG Removal
-==========
-
-See OSD::_remove_pg, OSD::RemoveWQ
-
-There are two ways for a pg to be removed from an OSD:
-
- 1. MOSDPGRemove from the primary
- 2. OSD::advance_map finds that the pool has been removed
-
-In either case, our general strategy for removing the pg is to
-atomically set the metadata objects (pg->log_oid, pg->biginfo_oid) to
-backfill and asynchronously remove the pg collections. We do not do
-this inline because scanning the collections to remove the objects is
-an expensive operation.
-
-OSDService::deleting_pgs tracks all pgs in the process of being
-deleted. Each DeletingState object in deleting_pgs lives while at
-least one reference to it remains. Each item in RemoveWQ carries a
-reference to the DeletingState for the relevant pg such that
-deleting_pgs.lookup(pgid) will return a null ref only if there are no
-collections currently being deleted for that pg.
-
-The DeletingState for a pg also carries information about the status
-of the current deletion and allows the deletion to be cancelled.
-The possible states are:
-
- 1. QUEUED: the PG is in the RemoveWQ
- 2. CLEARING_DIR: the PG's contents are being removed synchronously
- 3. DELETING_DIR: the PG's directories and metadata are being queued for removal
- 4. DELETED_DIR: the final removal transaction has been queued
- 5. CANCELED: the deletion has been canceled
-
-In 1 and 2, the deletion can be canceled. Each state transition
-method (and check_canceled) returns false if deletion has been
-canceled and true if the state transition was successful. Similarly,
-try_stop_deletion() returns true if it succeeds in canceling the
-deletion. Additionally, if try_stop_deletion() fails to stop the
-deletion, it does not return until the final removal transaction has
-been queued. This ensures that any operations queued after
-that point will be ordered after the pg deletion.
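-A minimal sketch of that state machine (hypothetical Python; see
-OSD::DeletingState for the real implementation):

```python
# Deletion states, in order of progress.
QUEUED, CLEARING_DIR, DELETING_DIR, DELETED_DIR, CANCELED = range(5)


class DeletingState:
    """Cancellation only succeeds before the final removal transaction
    is queued, i.e. in the QUEUED and CLEARING_DIR states."""

    def __init__(self):
        self.status = QUEUED

    def try_stop_deletion(self):
        if self.status in (QUEUED, CLEARING_DIR):
            self.status = CANCELED
            return True
        return False            # too late: caller waits for DELETED_DIR

    def transition(self, new_status):
        """State-transition helper; returns False once canceled."""
        if self.status == CANCELED:
            return False
        self.status = new_status
        return True
```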
-
-OSD::_create_lock_pg must handle two cases:
-
- 1. Either there is no DeletingStateRef for the pg, or it failed to cancel
- 2. We succeeded in canceling the deletion.
-
-In case 1., we proceed as if there were no deletion occurring, except that
-we avoid writing to the PG until the deletion finishes. In case 2., we
-proceed as in case 1., except that we first mark the PG as backfilling.
-
-Similarly, OSD::osr_registry ensures that the OpSequencers for those
-pgs can be reused for a new pg if created before the old one is fully
-removed, ensuring that operations on the new pg are sequenced properly
-with respect to operations on the old one.
diff --git a/src/ceph/doc/dev/osd_internals/pgpool.rst b/src/ceph/doc/dev/osd_internals/pgpool.rst
deleted file mode 100644
index 45a252b..0000000
--- a/src/ceph/doc/dev/osd_internals/pgpool.rst
+++ /dev/null
@@ -1,22 +0,0 @@
-==================
-PGPool
-==================
-
-PGPool is a structure used to manage and update the status of removed
-snapshots. It does this by maintaining two fields, cached_removed_snaps - the
-current removed snap set and newly_removed_snaps - newly removed snaps in the
-last epoch. In OSD::load_pgs the osd map is recovered from the pg's file store
-and passed down to OSD::_get_pool where a PGPool object is initialised with the
-map.
-
-With each new map we receive, we call PGPool::update with the new map. In
-that function we build a list of newly removed snaps
-(pg_pool_t::build_removed_snaps) and merge that with our cached_removed_snaps.
-This function includes checks to make sure we only do this update when things
-have changed or there has been a map gap.
-
-When we activate the pg we initialise the snap trim queue from
-cached_removed_snaps and subtract the snaps we have already purged
-(purged_snaps), leaving us with the list of snaps that need to be trimmed.
-Trimming is later
-performed asynchronously by the snap_trim_wq.
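-The activation-time computation reduces to a set difference (snap sets
-modeled here as plain Python sets, a hypothetical simplification of the
-interval sets used in the code):

```python
def build_snap_trimq(cached_removed_snaps, purged_snaps):
    """Snaps that are removed but not yet purged still need trimming."""
    return sorted(set(cached_removed_snaps) - set(purged_snaps))
```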
-
diff --git a/src/ceph/doc/dev/osd_internals/recovery_reservation.rst b/src/ceph/doc/dev/osd_internals/recovery_reservation.rst
deleted file mode 100644
index 4ab0319..0000000
--- a/src/ceph/doc/dev/osd_internals/recovery_reservation.rst
+++ /dev/null
@@ -1,75 +0,0 @@
-====================
-Recovery Reservation
-====================
-
-Recovery reservation extends and subsumes backfill reservation. The
-reservation system from backfill recovery is used for local and remote
-reservations.
-
-When a PG goes active, first it determines what type of recovery is
-necessary, if any. It may need log-based recovery, backfill recovery,
-both, or neither.
-
-In log-based recovery, the primary first acquires a local reservation
-from the OSDService's local_reserver. Then a MRemoteReservationRequest
-message is sent to each replica in order of OSD number. These requests
-will always be granted (i.e., cannot be rejected), but they may take
-some time to be granted if the remotes have already granted all their
-remote reservation slots.
-
-After all reservations are acquired, log-based recovery proceeds as it
-would without the reservation system.
-
-After log-based recovery completes, the primary releases all remote
-reservations. The local reservation remains held. The primary then
-determines whether backfill is necessary. If it is not necessary, the
-primary releases its local reservation and waits in the Recovered state
-for all OSDs to indicate that they are clean.
-
-If backfill recovery occurs after log-based recovery, the local
-reservation does not need to be reacquired since it is still held from
-before. If it occurs immediately after activation (log-based recovery
-not possible/necessary), the local reservation is acquired according to
-the typical process.
-
-Once the primary has its local reservation, it requests a remote
-reservation from the backfill target. This reservation CAN be rejected,
-for instance if the OSD is too full (backfillfull_ratio osd setting).
-If the reservation is rejected, the primary drops its local
-reservation, waits (osd_backfill_retry_interval), and then retries. It
-will retry indefinitely.
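-The request/reject/retry loop can be sketched as follows (a hypothetical
-simplification: 'replies' stands in for the remote OSD's grant/reject
-messages, and the osd_backfill_retry_interval sleep is elided):

```python
def acquire_backfill_reservations(local_reserver, replies):
    """replies: iterable of remote replies, True=granted, False=rejected.
    Returns the number of attempts once both reservations are held."""
    attempts = 0
    for granted in replies:
        attempts += 1
        local_reserver.append("local")   # local reservation is always taken first
        if granted:
            return attempts              # local + remote both held
        local_reserver.pop()             # rejected: drop local, wait, retry
    raise AssertionError("unreachable: the primary retries indefinitely")
```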
-
-Once the primary has the local and remote reservations, backfill
-proceeds as usual. After backfill completes the remote reservation is
-dropped.
-
-Finally, after backfill (or log-based recovery if backfill was not
-necessary), the primary drops the local reservation and enters the
-Recovered state. Once all the PGs have reported they are clean, the
-primary enters the Clean state and marks itself active+clean.
-
-
---------------
-Things to Note
---------------
-
-We always grab the local reservation first, to prevent a circular
-dependency. We grab remote reservations in order of OSD number for the
-same reason.
-
-The recovery reservation state chart controls the PG state as reported
-to the monitor. The state chart can set:
-
- - recovery_wait: waiting for local/remote reservations
- - recovering: recovering
- - recovery_toofull: recovery stopped, OSD(s) above full ratio
- - backfill_wait: waiting for remote backfill reservations
- - backfilling: backfilling
- - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio
-
-
---------
-See Also
---------
-
-The Active substate of the automatically generated OSD state diagram.
diff --git a/src/ceph/doc/dev/osd_internals/scrub.rst b/src/ceph/doc/dev/osd_internals/scrub.rst
deleted file mode 100644
index 3343b39..0000000
--- a/src/ceph/doc/dev/osd_internals/scrub.rst
+++ /dev/null
@@ -1,30 +0,0 @@
-
-Scrubbing Behavior Table
-========================
-
-+-------------------------------------------------+----------+-----------+---------------+----------------------+
-| Flags | none | noscrub | nodeep_scrub | noscrub/nodeep_scrub |
-+=================================================+==========+===========+===============+======================+
-| Periodic tick | S | X | S | X |
-+-------------------------------------------------+----------+-----------+---------------+----------------------+
-| Periodic tick after osd_deep_scrub_interval | D | D | S | X |
-+-------------------------------------------------+----------+-----------+---------------+----------------------+
-| Initiated scrub | S | S | S | S |
-+-------------------------------------------------+----------+-----------+---------------+----------------------+
-| Initiated scrub after osd_deep_scrub_interval | D | D | S | S |
-+-------------------------------------------------+----------+-----------+---------------+----------------------+
-| Initiated deep scrub | D | D | D | D |
-+-------------------------------------------------+----------+-----------+---------------+----------------------+
-
-- X = Do nothing
-- S = Do regular scrub
-- D = Do deep scrub
-
-State variables
----------------
-
-- Periodic tick state is !must_scrub && !must_deep_scrub && !time_for_deep
-- Periodic tick after osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep
-- Initiated scrub state is must_scrub && !must_deep_scrub && !time_for_deep
-- Initiated scrub after osd_deep_scrub_interval state is must_scrub && !must_deep_scrub && time_for_deep
-- Initiated deep scrub state is must_scrub && must_deep_scrub
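-The table and state variables above collapse into one decision function
-(a sketch; the names follow the table, not the scrub scheduler's actual
-code):

```python
def scrub_action(must_scrub, must_deep_scrub, time_for_deep,
                 noscrub, nodeep_scrub):
    """Returns 'X' (do nothing), 'S' (regular scrub), or 'D' (deep scrub),
    reproducing the behavior table row by row."""
    if must_deep_scrub:
        return "D"                           # initiated deep scrub always runs
    if must_scrub:                           # initiated (regular) scrub
        if time_for_deep and not nodeep_scrub:
            return "D"
        return "S"
    if time_for_deep:                        # periodic tick past the interval
        if not nodeep_scrub:
            return "D"
        return "X" if noscrub else "S"
    return "X" if noscrub else "S"           # plain periodic tick
```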
diff --git a/src/ceph/doc/dev/osd_internals/snaps.rst b/src/ceph/doc/dev/osd_internals/snaps.rst
deleted file mode 100644
index e17378f..0000000
--- a/src/ceph/doc/dev/osd_internals/snaps.rst
+++ /dev/null
@@ -1,128 +0,0 @@
-======
-Snaps
-======
-
-Overview
---------
-Rados supports two related snapshotting mechanisms:
-
- 1. *pool snaps*: snapshots are implicitly applied to all objects
- in a pool
- 2. *self managed snaps*: the user must provide the current *SnapContext*
- on each write.
-
-These two are mutually exclusive: only one or the other can be used on
-a particular pool.
-
-The *SnapContext* is the set of snapshots currently defined for an object
-as well as the most recent snapshot (the *seq*) requested from the mon for
-sequencing purposes (a *SnapContext* with a newer *seq* is considered to
-be more recent).
-
-The difference between *pool snaps* and *self managed snaps* from the
-OSD's point of view lies in whether the *SnapContext* comes to the OSD
-via the client's MOSDOp or via the most recent OSDMap.
-
-See OSD::make_writeable
-
-Ondisk Structures
------------------
-Each object has in the pg collection a *head* object (or *snapdir*, which we
-will come to shortly) and possibly a set of *clone* objects.
-Each hobject_t has a snap field. For the *head* (the only writeable version
-of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
-snap field is set to the *seq* of the *SnapContext* at their creation.
-When the OSD services a write, it first checks whether the most recent
-*clone* is tagged with a snapid prior to the most recent snap represented
-in the *SnapContext*. If so, at least one snapshot has occurred between
-the time of the write and the time of the last clone. Therefore, prior
-to performing the mutation, the OSD creates a new clone for servicing
-reads on snaps between the snapid of the last clone and the most recent
-snapid.
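-The clone-or-not decision reduces to a snapid comparison (hypothetical
-sketch; the real make_writeable logic also copies object data and adjusts
-the clone overlaps):

```python
def needs_clone(most_recent_clone_snap, snapc_snaps):
    """snapc_snaps: snapids in the write's SnapContext.
    most_recent_clone_snap is None when no clones exist yet.
    Clone before mutating if a snapshot newer than the most recent
    clone appears in the SnapContext."""
    if not snapc_snaps:
        return False                    # no snapshots defined, mutate in place
    newest = max(snapc_snaps)
    return most_recent_clone_snap is None or most_recent_clone_snap < newest
```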
-
-The *head* object contains a *SnapSet* encoded in an attribute, which tracks
-
- 1. The full set of snaps defined for the object
- 2. The full set of clones which currently exist
- 3. Overlapping intervals between clones for tracking space usage
- 4. Clone size
-
-If the *head* is deleted while there are still clones, a *snapdir* object
-is created instead to house the *SnapSet*.
-
-Additionally, the *object_info_t* on each clone includes a vector of the snaps
-for which that clone is defined.
-
-Snap Removal
-------------
-To remove a snapshot, a request is made to the *Monitor* cluster to
-add the snapshot id to the list of purged snaps (or to remove it from
-the set of pool snaps in the case of *pool snaps*). In either case,
-the *PG* adds the snap to its *snap_trimq* for trimming.
-
-A clone can be removed when all of its snaps have been removed. In
-order to determine which clones might need to be removed upon snap
-removal, we maintain a mapping from snap to *hobject_t* using the
-*SnapMapper*.
-
-See PrimaryLogPG::SnapTrimmer, SnapMapper
-
-This trimming is performed asynchronously by the snap_trim_wq while the
-pg is clean and not scrubbing.
-
- #. The next snap in PG::snap_trimq is selected for trimming
- #. We determine the next object for trimming out of PG::snap_mapper.
- For each object, we create a log entry and repop updating the
- object info and the snap set (including adjusting the overlaps).
- If the object is a clone which no longer belongs to any live snapshots,
- it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
- is empty.)
- #. We also locally update our *SnapMapper* instance with the object's
- new snaps.
- #. The log entry containing the modification of the object also
- contains the new set of snaps, which the replica uses to update
- its own *SnapMapper* instance.
- #. The primary shares the info with the replica, which persists
- the new set of purged_snaps along with the rest of the info.
-
-
-
-Recovery
---------
-Because the trim operations are implemented using repops and log entries,
-normal pg peering and recovery maintain the snap trimmer operations with
-the caveat that push and removal operations need to update the local
-*SnapMapper* instance. If the purged_snaps update is lost, we merely
-retrim a now empty snap.
-
-SnapMapper
-----------
-*SnapMapper* is implemented on top of map_cacher<string, bufferlist>,
-which provides an interface over a backing store such as the filesystem
-with async transactions. While transactions are incomplete, the map_cacher
-instance buffers unstable keys allowing consistent access without having
-to flush the filestore. *SnapMapper* provides two mappings:
-
- 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone
- object
- 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot
-    as one of their snaps
-
-Assumption: there are lots of hobjects and relatively few snaps. The
-first encoding has a stringification of the object as the key and an
-encoding of the set of snaps as a value. The second mapping, because there
-might be many hobjects for a single snap, is stored as a collection of keys
-of the form stringify(snap)_stringify(object) such that stringify(snap)
-is constant length. These keys have a bufferlist encoding
-pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
-object does not involve reading all objects for any snap. Additionally,
-upon construction, the *SnapMapper* is provided with a mask for filtering
-the objects in the single SnapMapper keyspace belonging to that pg.
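-That key scheme can be sketched as below; the 16-hex-digit width and the
-separator character are illustrative, not the on-disk format:

```python
def snap_to_object_key(snap, obj):
    """Constant-width snap prefix keeps all keys for one snap contiguous
    in the sorted keyspace."""
    return "%016X_%s" % (snap, obj)


def prefix_for_snap(snap):
    """Range-scan prefix: every object in 'snap' sorts under this prefix,
    so trimming a snap never requires scanning other snaps' keys."""
    return "%016X_" % snap
```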
-
-Split
------
-The snapid_t -> hobject_t key entries are arranged such that for any pg,
-up to 8 prefixes need to be checked to determine all hobjects in a particular
-snap for a particular pg. Upon split, the prefixes to check on the parent
-are adjusted such that only the objects remaining in the pg will be visible.
-The children will immediately have the correct mapping.
diff --git a/src/ceph/doc/dev/osd_internals/watch_notify.rst b/src/ceph/doc/dev/osd_internals/watch_notify.rst
deleted file mode 100644
index 8c2ce09..0000000
--- a/src/ceph/doc/dev/osd_internals/watch_notify.rst
+++ /dev/null
@@ -1,81 +0,0 @@
-============
-Watch Notify
-============
-
-See librados for the watch/notify interface.
-
-Overview
---------
-The object_info (See osd/osd_types.h) tracks the set of watchers for
-a particular object persistently in the object_info_t::watchers map.
-In order to track notify progress, we also maintain some ephemeral
-structures associated with the ObjectContext.
-
-Each Watch has an associated Watch object (See osd/Watch.h). The
-ObjectContext for a watched object will have a (strong) reference
-to one Watch object per watch, and each Watch object holds a
-reference to the corresponding ObjectContext. This circular reference
-is deliberate and is broken when the Watch state is discarded on
-a new peering interval or removed upon timeout expiration or an
-unwatch operation.
-
-A watch tracks the associated connection via a strong
-ConnectionRef Watch::conn. The associated connection has a
-WatchConState stashed in the OSD::Session for tracking associated
-Watches in order to be able to notify them upon ms_handle_reset()
-(via WatchConState::reset()).
-
-Each Watch object tracks the set of currently un-acked notifies.
-start_notify() on a Watch object adds a reference to a new in-progress
-Notify to the Watch and either:
-
-* if the Watch is *connected*, sends a Notify message to the client
-* if the Watch is *unconnected*, does nothing.
-
-When the Watch becomes connected (in PrimaryLogPG::do_osd_op_effects),
-Notifies are resent to all remaining tracked Notify objects.
-
-Each Notify object tracks the set of un-notified Watchers via
-calls to complete_watcher(). Once the remaining set is empty or the
-timeout expires (cb, registered in init()) a notify completion
-is sent to the client.
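-Notify progress tracking can be sketched as follows (hypothetical Python;
-see osd/Watch.h for the real Notify class):

```python
class Notify:
    """Completion fires when every watcher has acked, or early when the
    timeout expires, whichever comes first."""

    def __init__(self, watchers):
        self.unacked = set(watchers)
        self.complete = False

    def complete_watcher(self, w):
        self.unacked.discard(w)
        if not self.unacked and not self.complete:
            self.complete = True
            return "notify_complete"        # reply sent to the client
        return None

    def timeout(self):
        if not self.complete:
            self.complete = True
            return "notify_complete_early"  # sent despite missing acks
        return None
```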
-
-Watch Lifecycle
----------------
-A watch may be in one of 5 states:
-
-1. Non existent.
-2. On disk, but not registered with an object context.
-3. Connected
-4. Disconnected, callback registered with timer
-5. Disconnected, callback in queue for scrub or is_degraded
-
-Case 2 occurs between when an OSD goes active and the ObjectContext
-for an object with watchers is loaded into memory due to an access.
-During Case 2, no state is registered for the watch. Case 2
-transitions to Case 4 in PrimaryLogPG::populate_obc_watchers() during
-PrimaryLogPG::find_object_context. Case 1 becomes case 3 via
-PrimaryLogPG::do_osd_op_effects due to a watch operation. Cases 4 and 5 become case
-3 in the same way. Case 3 becomes case 4 when the connection resets
-on a watcher's session.
-
-Cases 4 and 5 deserve some explanation. Normally, when a Watch enters Case
-4, a callback is registered with the OSDService::watch_timer to be
-called at timeout expiration. At the time that the callback is
-called, however, the pg might be in a state where it cannot write
-to the object in order to remove the watch (i.e., during a scrub
-or while the object is degraded). In that case, we use
-Watch::get_delayed_cb() to generate another Context for use from
-the callbacks_for_degraded_object and Scrubber::callbacks lists.
-In either case, Watch::unregister_cb() does the right thing
-(SafeTimer::cancel_event() is harmless for contexts not registered
-with the timer).
-
-Notify Lifecycle
-----------------
-The notify timeout is simpler: a timeout callback is registered when
-the notify is init()'d. If all watchers ack notifies before the
-timeout occurs, the timeout is canceled and the client is notified
-of the notify completion. Otherwise, the timeout fires, the Notify
-object pings each Watch via cancel_notify to remove itself, and
-sends the notify completion to the client early.
diff --git a/src/ceph/doc/dev/osd_internals/wbthrottle.rst b/src/ceph/doc/dev/osd_internals/wbthrottle.rst
deleted file mode 100644
index a3ae00d..0000000
--- a/src/ceph/doc/dev/osd_internals/wbthrottle.rst
+++ /dev/null
@@ -1,28 +0,0 @@
-==================
-Writeback Throttle
-==================
-
-Previously, the filestore had a problem when handling large numbers of
-small ios. We throttle dirty data implicitly via the journal, but
-a large number of inodes can be dirtied without filling the journal
-resulting in a very long sync time when the sync finally does happen.
-The flusher was not an adequate solution to this problem since it
-forced writeback of small writes too eagerly, killing performance.
-
-WBThrottle tracks unflushed io per hobject_t and ::fsyncs in LRU
-order once the start_flusher threshold is exceeded for any of
-dirty bytes, dirty ios, or dirty inodes. While any of these exceed
-the hard_limit, we block on throttle() in _do_op.
-
-See src/os/WBThrottle.h, src/os/WBThrottle.cc
-
-To track the open FDs through the writeback process, there is now an
-fdcache to cache open fds. lfn_open now returns a cached FDRef which
-implicitly closes the fd once all references have expired.
-
-Filestore syncs have a side effect of flushing all outstanding objects
-in the wbthrottle.
-
-lfn_unlink clears the cached FDRef and wbthrottle entries for the
-unlinked object when the last link is removed and asserts that all
-outstanding FDRefs for that object are dead.