Diffstat (limited to 'src/ceph/doc/dev/osd_internals')
21 files changed, 0 insertions, 2046 deletions
diff --git a/src/ceph/doc/dev/osd_internals/backfill_reservation.rst b/src/ceph/doc/dev/osd_internals/backfill_reservation.rst deleted file mode 100644 index cf9dab4..0000000 --- a/src/ceph/doc/dev/osd_internals/backfill_reservation.rst +++ /dev/null @@ -1,38 +0,0 @@ -==================== -Backfill Reservation -==================== - -When a new osd joins a cluster, all pgs containing it must eventually backfill -to it. If all of these backfills happened simultaneously, they would put excessive -load on the osd. osd_max_backfills limits the number of outgoing and the number of -incoming backfills on a single node: at most osd_max_backfills outgoing and at most -osd_max_backfills incoming. Therefore there can be at most osd_max_backfills * 2 -simultaneous backfills on one osd. - -Each OSDService now has two AsyncReserver instances: one for backfills going -from the osd (local_reserver) and one for backfills going to the osd -(remote_reserver). An AsyncReserver (common/AsyncReserver.h) manages a priority -queue of waiting items and a set of current reservation holders. When a -slot frees up, the AsyncReserver queues the Context* associated with the next -item on the highest priority queue in the finisher provided to the constructor. - -For a primary to initiate a backfill, it must first obtain a reservation from -its own local_reserver. Then, it must obtain a reservation from the backfill -target's remote_reserver via an MBackfillReserve message. This process is -managed by substates of Active and ReplicaActive (see the substates of Active -in PG.h). The reservations are dropped either on the Backfilled event (which -is sent on the primary before calling recovery_complete, and on the replica on -receipt of the BackfillComplete progress message), or upon leaving Active or -ReplicaActive. - -It's important that we always grab the local reservation before the remote -reservation in order to prevent a circular dependency. 
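The reservation pattern described above can be sketched in a few lines. This is an illustrative Python model of the AsyncReserver idea (per-priority wait queues plus a bounded set of holders), not the real common/AsyncReserver.h API: the class and method names only loosely mirror the C++ ones, and callbacks run inline here rather than on a Finisher thread.

```python
from collections import deque

class AsyncReserver:
    """Toy sketch of the AsyncReserver pattern: a bounded set of
    reservation holders plus per-priority wait queues. Grants always
    come from the highest-priority non-empty queue."""

    def __init__(self, max_allowed):
        self.max_allowed = max_allowed
        self.queues = {}          # priority -> deque of (item, callback)
        self.in_progress = set()  # current reservation holders

    def request_reservation(self, item, priority, callback):
        self.queues.setdefault(priority, deque()).append((item, callback))
        self._do_queues()

    def release_reservation(self, item):
        self.in_progress.discard(item)
        self._do_queues()

    def _do_queues(self):
        # Grant slots while capacity remains, highest priority first.
        while len(self.in_progress) < self.max_allowed:
            ready = [p for p, q in self.queues.items() if q]
            if not ready:
                return
            item, cb = self.queues[max(ready)].popleft()
            self.in_progress.add(item)
            cb(item)
```

With osd_max_backfills as max_allowed, a primary would hold a grant from its local reserver before requesting one from the remote reserver, mirroring the ordering rule above.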
- -We want to minimize the risk of data loss by prioritizing the order in -which PGs are recovered. The highest priority is log-based recovery -(OSD_RECOVERY_PRIORITY_MAX) since this must always complete before -backfill can start. The next priority is backfill of degraded PGs and -is a function of the degradation: a backfill for a PG missing two -replicas will have a higher priority than a backfill for a PG missing -one replica. The lowest priority is backfill of non-degraded PGs. diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding.rst b/src/ceph/doc/dev/osd_internals/erasure_coding.rst deleted file mode 100644 index 7263cc3..0000000 --- a/src/ceph/doc/dev/osd_internals/erasure_coding.rst +++ /dev/null @@ -1,82 +0,0 @@ -============================== -Erasure Coded Placement Groups -============================== - -Glossary --------- - -*chunk* - when the encoding function is called, it returns chunks of the same - size: data chunks, which can be concatenated to reconstruct the original - object, and coding chunks, which can be used to rebuild a lost chunk. - -*chunk rank* - the index of a chunk when returned by the encoding function. The - rank of the first chunk is 0, the rank of the second chunk is 1 - etc. - -*stripe* - when an object is too large to be encoded with a single call, - each set of chunks created by a call to the encoding function is - called a stripe. - -*shard|strip* - an ordered sequence of chunks of the same rank from the same - object. For a given placement group, each OSD contains shards of - the same rank. When dealing with objects that are encoded with a - single operation, *chunk* is sometimes used instead of *shard* - because the shard is made of a single chunk. The *chunks* in a - *shard* are ordered according to the rank of the stripe they belong - to. - -*K* - the number of data *chunks*, i.e. the number of *chunks* in which the - original object is divided. 
For instance if *K* = 2 a 10KB object - will be divided into *K* chunks of 5KB each. - -*M* - the number of coding *chunks*, i.e. the number of additional *chunks* - computed by the encoding functions. If there are 2 coding *chunks*, - it means 2 OSDs can be out without losing data. - -*N* - the number of data *chunks* plus the number of coding *chunks*, - i.e. *K+M*. - -*rate* - the proportion of the *chunks* that contain useful information, i.e. *K/N*. - For instance, for *K* = 9 and *M* = 3 (i.e. *K+M* = *N* = 12) the rate is - *K/N* = 9/12 = 0.75, i.e. 75% of the chunks contain useful information. - -The definitions are illustrated as follows (PG stands for placement group): -:: - - OSD 40 OSD 33 - +-------------------------+ +-------------------------+ - | shard 0 - PG 10 | | shard 1 - PG 10 | - |+------ object O -------+| |+------ object O -------+| - ||+---------------------+|| ||+---------------------+|| - stripe||| chunk 0 ||| ||| chunk 1 ||| ... - 0 ||| stripe 0 ||| ||| stripe 0 ||| - ||+---------------------+|| ||+---------------------+|| - ||+---------------------+|| ||+---------------------+|| - stripe||| chunk 0 ||| ||| chunk 1 ||| ... - 1 ||| stripe 1 ||| ||| stripe 1 ||| - ||+---------------------+|| ||+---------------------+|| - ||+---------------------+|| ||+---------------------+|| - stripe||| chunk 0 ||| ||| chunk 1 ||| ... - 2 ||| stripe 2 ||| ||| stripe 2 ||| - ||+---------------------+|| ||+---------------------+|| - |+-----------------------+| |+-----------------------+| - | ... | | ... | - +-------------------------+ +-------------------------+ - -Table of content ----------------- - -.. 
toctree:: - :maxdepth: 1 - - Developer notes <erasure_coding/developer_notes> - Jerasure plugin <erasure_coding/jerasure> - High level design document <erasure_coding/ecbackend> diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst deleted file mode 100644 index 770ff4a..0000000 --- a/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst +++ /dev/null @@ -1,223 +0,0 @@ -============================ -Erasure Code developer notes -============================ - -Introduction ------------- - -Each chapter of this document explains an aspect of the implementation -of the erasure code within Ceph. It is mostly based on worked examples -that demonstrate how things work. - -Reading and writing encoded chunks from and to OSDs ---------------------------------------------------- - -An erasure coded pool stores each object as K+M chunks: K data -chunks and M coding chunks. The pool is configured to have -a size of K+M so that each chunk is stored in an OSD in the acting -set. The rank of the chunk is stored as an attribute of the object. - -Let's say an erasure coded pool is created to use five OSDs ( K+M = -5 ) and sustain the loss of two of them ( M = 2 ). - -When the object *NYAN* containing *ABCDEFGHI* is written to it, the -erasure encoding function splits the content into three data chunks, -simply by dividing the content in three : the first contains *ABC*, -the second *DEF* and the last *GHI*. The content will be padded if the -content length is not a multiple of K. The function also creates two -coding chunks : the fourth with *YXY* and the fifth with *QGC*. Each -chunk is stored in an OSD in the acting set. The chunks are stored in -objects that have the same name ( *NYAN* ) but reside on different -OSDs. 
The order in which the chunks were created must be preserved and -is stored as an attribute of the object ( shard_t ), in addition to its -name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4* -contains *YXY* and is stored on *OSD3*. - -:: - - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +--------+----------+ - | - | - v - +------+------+ - +---------------+ encode(3,2) +-----------+ - | +--+--+---+---+ | - | | | | | - | +-------+ | +-----+ | - | | | | | - +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ - name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | | 5 | - +------+ +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | | QGC | - +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ - | | | | | - | | | | | - | | +--+---+ | | - | | | OSD1 | | | - | | +------+ | | - | | +------+ | | - | +------>| OSD2 | | | - | +------+ | | - | +------+ | | - | | OSD3 |<----+ | - | +------+ | - | +------+ | - | | OSD4 |<--------------+ - | +------+ - | +------+ - +----------------->| OSD5 | - +------+ - - - - -When the object *NYAN* is read from the erasure coded pool, the -decoding function reads three chunks : chunk *1* containing *ABC*, -chunk *3* containing *GHI* and chunk *4* containing *YXY* and rebuild -the original content of the object *ABCDEFGHI*. The decoding function -is informed that the chunks *2* and *5* are missing ( they are called -*erasures* ). The chunk *5* could not be read because the *OSD4* is -*out*. - -The decoding function could be called as soon as three chunks are -read : *OSD2* was the slowest and its chunk does not need to be taken into -account. This optimization is not implemented in Firefly. 
- -:: - - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +--------+----------+ - ^ - | - | - +------+------+ - | decode(3,2) | - | erasures 2,5| - +-------------->| | - | +-------------+ - | ^ ^ - | | +-----+ - | | | - +--+---+ +------+ +--+---+ +--+---+ - name | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | - +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | - +--+---+ +--+---+ +--+---+ +--+---+ - ^ . ^ ^ - | TOO . | | - | SLOW . +--+---+ | - | ^ | OSD1 | | - | | +------+ | - | | +------+ | - | +-------| OSD2 | | - | +------+ | - | +------+ | - | | OSD3 |-----+ - | +------+ - | +------+ - | | OSD4 | OUT - | +------+ - | +------+ - +------------------| OSD5 | - +------+ - - -Erasure code library --------------------- - -Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_, -with parameters K+M, object O is encoded by dividing it into chunks O1, -O2, ... OM and computing coding chunks P1, P2, ... PK. Any K chunks -out of the available K+M chunks can be used to obtain the original -object. If data chunk O2 or coding chunk P2 are lost, they can be -repaired using any K chunks out of the K+M chunks. If more than M -chunks are lost, it is not possible to recover the object. - -Reading the original content of object O can be a simple -concatenation of O1, O2, ... OM, because the plugins are using -`systematic codes -<http://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks -must be given to the erasure code library *decode* method to retrieve -the content of the object. - -Performance depend on the parameters to the encoding functions and -is also influenced by the packet sizes used when calling the encoding -functions ( for Cauchy or Liberation for instance ): smaller packets -means more calls and more overhead. 
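The chunking and padding rule from the *NYAN* example above can be sketched as follows. This is an illustrative Python helper, not part of Ceph; computing the M coding chunks is left to the erasure code library and is not modelled here.

```python
def split_into_chunks(data: bytes, k: int) -> list:
    """Split object data into k equally sized data chunks, padding the
    tail with zero bytes when len(data) is not a multiple of k, as in
    the NYAN example (ABCDEFGHI with k=3 -> ABC, DEF, GHI)."""
    chunk_len = -(-len(data) // k)           # ceiling division
    data = data.ljust(k * chunk_len, b"\0")  # pad to a multiple of k
    return [data[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
```

For content whose length is already a multiple of K no padding is added; otherwise the last chunk carries the zero padding.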
- -Although Reed-Solomon is provided as a default, Ceph uses it via an -`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to -allow each pool to choose the plugin that implements it using -key=value pairs stored in an `erasure code profile`_. - -.. _erasure code profile: ../../../erasure-coded-pool - -:: - - $ ceph osd erasure-code-profile set myprofile \ - crush-failure-domain=osd - $ ceph osd erasure-code-profile get myprofile - directory=/usr/lib/ceph/erasure-code - k=2 - m=1 - plugin=jerasure - technique=reed_sol_van - crush-failure-domain=osd - $ ceph osd pool create ecpool 12 12 erasure myprofile - -The *plugin* is dynamically loaded from *directory* and expected to -implement the *int __erasure_code_init(char *plugin_name, char *directory)* function -which is responsible for registering an object derived from *ErasureCodePlugin* -in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads: - -:: - - ErasureCodePluginRegistry &instance = - ErasureCodePluginRegistry::instance(); - instance.add(plugin_name, new ErasureCodePluginExample()); - -The *ErasureCodePlugin* derived object must provide a factory method -from which the concrete implementation of the *ErasureCodeInterface* -object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads: - -:: - - virtual int factory(const map<std::string,std::string> &parameters, - ErasureCodeInterfaceRef *erasure_code) { - *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters)); - return 0; - } - -The *parameters* argument is the list of *key=value* pairs that were -set in the erasure code profile, before the pool was created. 
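The registry/factory pattern described above can be sketched in Python. The names loosely mirror the C++ interfaces (ErasureCodePluginRegistry, factory); this sketch and the ExamplePlugin class are hypothetical illustrations, not the real API.

```python
class ErasureCodePluginRegistry:
    """Toy singleton registry: plugins register under a name, and the
    profile's plugin= key selects which factory builds the codec."""
    _instance = None

    def __init__(self):
        self.plugins = {}

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def add(self, name, plugin):
        self.plugins[name] = plugin

    def factory(self, profile):
        # profile is the key=value map from the erasure code profile
        plugin = self.plugins[profile["plugin"]]
        return plugin.factory(profile)

class ExamplePlugin:
    """Hypothetical plugin: its factory just records k and m."""
    def factory(self, profile):
        return {"k": int(profile.get("k", 2)), "m": int(profile.get("m", 1))}
```

The real plugin is a shared object loaded from *directory*; the key point the sketch keeps is that the profile's key=value pairs flow unchanged into the plugin's factory.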
- -:: - - ceph osd erasure-code-profile set myprofile \ - directory=<dir> \ # mandatory - plugin=jerasure \ # mandatory - m=10 \ # optional and plugin dependent - k=3 \ # optional and plugin dependent - technique=reed_sol_van \ # optional and plugin dependent - -Notes ------ - -If the objects are large, it may be impractical to encode and decode -them in memory. However, when using *RBD* a 1TB device is divided into -many individual 4MB objects and *RGW* does the same. - -Encoding and decoding are implemented in the OSD. Although they could be -implemented client side for reads and writes, the OSD must be able to encode -and decode on its own when scrubbing. diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst deleted file mode 100644 index 624ec21..0000000 --- a/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst +++ /dev/null @@ -1,207 +0,0 @@ -================================= -ECBackend Implementation Strategy -================================= - -Misc initial design notes -========================= - -The initial (and still true for ec pools without the hacky ec -overwrites debug flag enabled) design for ec pools restricted -EC pools to operations which can be easily rolled back: - -- CEPH_OSD_OP_APPEND: We can roll back an append locally by - including the previous object size as part of the PG log event. -- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete - requires that we retain the deleted object until all replicas have - persisted the deletion event. The erasure coded backend will therefore - need to store objects with the version at which they were created - included in the key provided to the filestore. Old versions of an - object can be pruned when all replicas have committed up to the log - event deleting the object. -- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr - to be set or removed, we can roll back these operations locally. 
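The CEPH_OSD_OP_APPEND case above (recording the previous object size so an append can be undone locally) can be sketched as a toy model. This is illustrative Python with a dict standing in for the object store; the function names are hypothetical, not Ceph APIs.

```python
def apply_append(store, oid, data, log):
    """Roll-back-able append: the log entry records the previous object
    size, which is all the state needed to undo the append locally."""
    old = store.get(oid, b"")
    log.append((oid, len(old)))   # rollback info in the PG log entry
    store[oid] = old + data

def rollback_append(store, log):
    """Undo the most recent append by truncating to the recorded size."""
    oid, old_size = log.pop()
    store[oid] = store[oid][:old_size]
```

The delete and attr cases follow the same shape: the log entry carries just enough prior state (the whole object, or the prior attr value) to undo the operation locally.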
- -Log entries contain a structure explaining how to locally undo the -operation represented by the operation -(see osd_types.h:TransactionInfo::LocalRollBack). - -PGTemp and Crush ----------------- - -Primaries are able to request a temp acting set mapping in order to -allow an up-to-date OSD to serve requests while a new primary is -backfilled (and for other reasons). An erasure coded pg needs to be -able to designate a primary for these reasons without putting it in -the first position of the acting set. It also needs to be able to -leave holes in the requested acting set. - -Core Changes: - -- OSDMap::pg_to_*_osds needs to separately return a primary. For most - cases, this can continue to be acting[0]. -- MOSDPGTemp (and related OSD structures) needs to be able to specify - a primary as well as an acting set. -- Much of the existing code base assumes that acting[0] is the primary - and that all elements of acting are valid. This needs to be cleaned - up since the acting set may contain holes. - -Distinguished acting set positions ----------------------------------- - -With the replicated strategy, all replicas of a PG are -interchangeable. With erasure coding, different positions in the -acting set have different pieces of the erasure coding scheme and are -not interchangeable. Worse, crush might cause chunk 2 to be written -to an OSD which happens already to contain an (old) copy of chunk 4. -This means that the OSD and PG messages need to work in terms of a -type like pair<shard_t, pg_t> in order to distinguish different pg -chunks on a single OSD. - -Because the mapping of object name to object in the filestore must -be 1-to-1, we must ensure that the objects in chunk 2 and the objects -in chunk 4 have different names. To that end, the objectstore must -include the chunk id in the object key. 
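The shard-qualified identifiers motivated above can be sketched as plain tuples. This loosely models spg_t (essentially pair<pg_t, shard_t>) and the shard id added to the object key; the Python field names are illustrative only.

```python
from collections import namedtuple

# With EC, two chunks of the same PG may land on one OSD, so both the
# PG id and the object key carry a shard id. Loosely modelled on spg_t
# and ghobject_t; field names here are illustrative.
SPG = namedtuple("SPG", ["pg", "shard"])
GHObject = namedtuple("GHObject", ["name", "gen", "shard"])

# One OSD holding two different chunks of the same PG: the shard id is
# what keeps the pg_map keys (and the object keys) distinct.
pg_map = {
    SPG(pg="10", shard=2): [GHObject("NYAN", 0, 2)],
    SPG(pg="10", shard=4): [GHObject("NYAN", 0, 4)],
}
```

Without the shard component, both entries would collapse onto the same key, which is exactly the 1-to-1 naming problem the text describes.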
- -Core changes: - -- The objectstore `ghobject_t needs to also include a chunk id - <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like - tuple<hobject_t, gen_t, shard_t>. -- coll_t needs to include a shard_t. -- The OSD pg_map and similar pg mappings need to work in terms of a - spg_t (essentially - pair<pg_t, shard_t>). Similarly, pg->pg messages need to include - a shard_t -- For client->PG messages, the OSD will need a way to know which PG - chunk should get the message since the OSD may contain both a - primary and non-primary chunk for the same pg - -Object Classes --------------- - -Reads from object classes will return ENOTSUP on ec pools by invoking -a special SYNC read. - -Scrub ------ - -The main catch, however, for ec pools is that sending a crc32 of the -stored chunk on a replica isn't particularly helpful since the chunks -on different replicas presumably store different data. Because we -don't support overwrites except via DELETE, however, we have the -option of maintaining a crc32 on each chunk through each append. -Thus, each replica instead simply computes a crc32 of its own stored -chunk and compares it with the locally stored checksum. The replica -then reports to the primary whether the checksums match. - -With overwrites, all scrubs are disabled for now until we work out -what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst). - -Crush ------ - -If crush is unable to generate a replacement for a down member of an -acting set, the acting set should have a hole at that position rather -than shifting the other elements of the acting set out of position. - -========= -ECBackend -========= - -MAIN OPERATION OVERVIEW -======================= - -A RADOS put operation can span -multiple stripes of a single object. There must be code that -tessellates the application level write into a set of per-stripe write -operations -- some whole-stripes and up to two partial -stripes. 
Without loss of generality, for the remainder of this -document we will focus exclusively on writing a single stripe (whole -or partial). We will use the symbol "W" to represent the number of -blocks within a stripe that are being written, i.e., W <= K. - -There are two data flows for handling a write into an EC stripe. The -choice between the two data flows is based on the size -of the write operation and the arithmetic properties of the selected -parity-generation algorithm. - -(1) whole stripe is written/overwritten -(2) a read-modify-write operation is performed. - -WHOLE STRIPE WRITE ------------------- - -This is the simple case, and is already performed in the existing code -(for appends, that is). The primary receives all of the data for the -stripe in the RADOS request, computes the appropriate parity blocks -and sends the data and parity blocks to their destination shards, which -write them. This is essentially the current EC code. - -READ-MODIFY-WRITE ----------------- - -The primary determines which of the K-W blocks are to be unmodified, -and reads them from the shards. Once all of the data is received it is -combined with the received new data and new parity blocks are -computed. The modified blocks are sent to their respective shards and -written. The RADOS operation is acknowledged. - -OSD Object Write and Consistency -------------------------------- - -Regardless of the algorithm chosen above, writing of the data is a two--phase process: commit and rollforward. The primary sends the log -entries with the operation described (see -osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)). -In all cases, the "commit" is performed in place, possibly leaving some -information required for a rollback in a write-aside object. The -rollforward phase occurs once all acting set replicas have committed -the commit (sorry, overloaded term) and removes the rollback information. 
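The read-modify-write flow can be sketched for a single stripe. This toy Python version uses one XOR parity block in place of a real parity-generation algorithm (so effectively M = 1), and plain byte strings in place of shard I/O; the function names are hypothetical.

```python
def xor_parity(chunks):
    """Compute a single XOR parity block over equal-length data chunks
    (a stand-in for the real parity-generation algorithm)."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

def read_modify_write(shards, updates):
    """Sketch of the RMW flow above for one stripe: 'shards' holds K
    data chunks plus one parity chunk; 'updates' maps data-block index
    to its new bytes. The unmodified K-W blocks are 'read', merged with
    the new data, and the parity is recomputed."""
    k = len(shards) - 1
    # Read the K-W unmodified blocks and merge in the new data...
    data = [updates.get(i, shards[i]) for i in range(k)]
    # ...then recompute parity; modified blocks + parity get written.
    return data + [xor_parity(data)]
```

A whole-stripe write is the degenerate case where updates covers all K blocks and no reads are needed.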
- -In the case of overwrites of existing stripes, the rollback information -has the form of a sparse object containing the old values of the -overwritten extents populated using clone_range. This is essentially -a place-holder implementation; in real life, bluestore will have an -efficient primitive for this. - -The rollforward part can be delayed since we report the operation as -committed once all replicas have committed. Currently, whenever we -send a write, we also indicate that all previously committed -operations should be rolled forward (see -ECBackend::try_reads_to_commit). If there aren't any in the pipeline -when we arrive at the waiting_rollforward queue, we start a dummy -write to move things along (see the Pipeline section later on and -ECBackend::try_finish_rmw). - -ExtentCache ----------- - -It's pretty important to be able to pipeline writes on the same -object. For this reason, there is a cache of extents written by -cacheable operations. Each extent remains pinned until the operations -referring to it are committed. The pipeline prevents rmw operations -from running until uncacheable transactions (clones, etc) are flushed -from the pipeline. - -See ExtentCache.h for a detailed explanation of how the cache -states correspond to the higher level invariants about the conditions -under which concurrent operations can refer to the same object. - -Pipeline -------- - -Reading src/osd/ExtentCache.h should have given a good idea of how -operations might overlap. There are several states involved in -processing a write operation, and an important invariant which -isn't enforced by PrimaryLogPG at a higher level and needs to be -managed by ECBackend. The important invariant is that we can't -have uncacheable and rmw operations running at the same time -on the same object. For simplicity, we simply enforce that any -operation which contains an rmw operation must wait until -all in-progress uncacheable operations complete. 
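The invariant described above (no rmw operation may run while an uncacheable operation is in progress on the same object) can be sketched as a small gate. This is an illustrative Python model, not the actual ECBackend pipeline; names are hypothetical and operations run inline.

```python
from collections import deque

class Pipeline:
    """Toy gate for one object: rmw operations wait until all
    in-progress uncacheable operations have completed."""
    def __init__(self):
        self.in_progress_uncacheable = 0
        self.waiting_rmw = deque()

    def submit(self, op_is_rmw, run):
        if op_is_rmw and self.in_progress_uncacheable:
            self.waiting_rmw.append(run)   # must wait for the flush
        else:
            if not op_is_rmw:
                self.in_progress_uncacheable += 1
            run()

    def uncacheable_done(self):
        self.in_progress_uncacheable -= 1
        if not self.in_progress_uncacheable:
            while self.waiting_rmw:        # drain the waiters in order
                self.waiting_rmw.popleft()()
```

The real pipeline tracks several more states (see ECBackend::waiting_*); the sketch keeps only the one invariant the text singles out.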
- -There are improvements to be made here in the future. - -For more details, see ECBackend::waiting_* and -ECBackend::try_<from>_to_<to>. - diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst deleted file mode 100644 index 27669a0..0000000 --- a/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst +++ /dev/null @@ -1,33 +0,0 @@ -=============== -jerasure plugin -=============== - -Introduction ------------- - -The parameters interpreted by the jerasure plugin are: - -:: - - ceph osd erasure-code-profile set myprofile \ - directory=<dir> \ # plugin directory absolute path - plugin=jerasure \ # plugin name (only jerasure) - k=<k> \ # data chunks (default 2) - m=<m> \ # coding chunks (default 2) - technique=<technique> \ # coding technique - -The coding techniques can be chosen among *reed_sol_van*, -*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*, -*blaum_roth* and *liber8tion*. - -The *src/erasure-code/jerasure* directory contains the -implementation. It is a wrapper around the code found at -`https://github.com/ceph/jerasure <https://github.com/ceph/jerasure>`_ -and `https://github.com/ceph/gf-complete -<https://github.com/ceph/gf-complete>`_ , pinned to the latest stable -version in *.gitmodules*. These repositories are copies of the -upstream repositories `http://jerasure.org/jerasure/jerasure -<http://jerasure.org/jerasure/jerasure>`_ and -`http://jerasure.org/jerasure/gf-complete -<http://jerasure.org/jerasure/gf-complete>`_ . The difference -between the two, if any, should match pull requests against upstream. 
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst deleted file mode 100644 index 52f98e8..0000000 --- a/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst +++ /dev/null @@ -1,385 +0,0 @@ -:orphan: - -================================= -Proposed Next Steps for ECBackend -================================= - -PARITY-DELTA-WRITE ------------------- - -RMW operations currently require 4 network hops (2 round trips). In -principle, for some codes, we can reduce this to 3 by sending the -update to the replicas holding the data blocks and having them -compute a delta to forward onto the parity blocks. - -The primary reads the current values of the "W" blocks and then uses -the new values of the "W" blocks to compute parity-deltas for each of -the parity blocks. The W blocks and the parity delta-blocks are sent -to their respective shards. - -The choice of whether to use a read-modify-write or a -parity-delta-write is a complex policy issue that is TBD in the details -and is likely to be heavily dependent on the computational costs -associated with a parity-delta vs. a regular parity-generation -operation. However, it is believed that the parity-delta scheme is -likely to be the preferred choice, when available. - -The internal interface to the erasure coding library plug-ins needs to -be extended to support the ability to query if parity-delta -computation is possible for a selected algorithm as well as an -interface to the actual parity-delta computation algorithm when -available. - -Stripe Cache ------------- - -It may be a good idea to extend the current ExtentCache usage to -cache some data past when the pinning operation releases it. -One application pattern that is important to optimize is the small -block sequential write operation (think of the journal of a journaling -file system or a database transaction log). 
Regardless of the chosen -redundancy algorithm, it is advantageous for the primary to -retain/buffer recently read/written portions of a stripe in order to -reduce network traffic. The dynamic contents of this cache may be used -in the determination of whether a read-modify-write or a -parity-delta-write is performed. The sizing of this cache is TBD, but -we should plan on allowing at least a few full stripes per active -client. Limiting the cache occupancy on a per-client basis will reduce -the noisy neighbor problem. - -Recovery and Rollback Details -============================= - -Implementing a Rollback-able Prepare Operation ---------------------------------------------- - -The prepare operation is implemented at each OSD through a simulation -of a versioning or copy-on-write capability for modifying a portion of -an object. - -When a prepare operation is performed, the new data is written into a -temporary object. The PG log for the -operation will contain a reference to the temporary object so that it -can be located for recovery purposes as well as a record of all of the -shards which are involved in the operation. - -In order to avoid fragmentation (and hence protect future read performance), -creation of the temporary object needs special attention. The name of -the temporary object affects its location within the KV store. Right -now it's unclear whether it's desirable for the name to locate it near the -base object or whether a separate subset of keyspace should be used -for temporary objects. Sam believes that colocation with the base -object is preferred (he suggests using the generation counter of the -ghobject for temporaries), whereas Allen believes that using a -separate subset of keyspace is desirable since these keys are -ephemeral and we don't want to actually colocate them with the base -object keys. Perhaps some modeling here can help resolve this -issue. 
The data of the temporary object wants to be located as close -to the data of the base object as possible. This may be best performed -by adding a new ObjectStore creation primitive that takes the base -object as an additional parameter that is a hint to the allocator. - -Sam: I think that the short-lived thing may be a red herring. We'll -be updating the donor and primary objects atomically, so it seems like -we'd want them adjacent in the key space, regardless of the donor's -lifecycle. - -The apply operation moves the data from the temporary object into the -correct position within the base object and deletes the associated -temporary object. This operation is done using a specialized -ObjectStore primitive. In the current ObjectStore interface, this can -be done using the clonerange function followed by a delete, but can be -done more efficiently with a specialized move primitive. -Implementation of the specialized primitive on FileStore can be done -by copying the data. Some file systems have extensions that might also -be able to implement this operation (like a defrag API that swaps -chunks between files). It is expected that NewStore will be able to -support this efficiently and natively (it has been noted that this -sequence requires that temporary object allocations, which tend to be -small, be efficiently converted into blocks for main objects and that -blocks that were formerly inside of main objects must be reusable with -minimal overhead). - -The prepare and apply operations can be separated arbitrarily in -time. If a read operation accesses an object that has been altered by -a prepare operation (but without a corresponding apply operation) it -must return the data after the prepare operation. This is done by -creating an in-memory database of objects which have had a prepare -operation without a corresponding apply operation. All read operations -must consult this in-memory data structure in order to get the correct -data. 
It should be explicitly recognized that it is likely that there -will be multiple prepare operations against a single base object and -the code must handle this case correctly. This code is implemented as -a layer between ObjectStore and all existing readers. Annoyingly, -we'll want to trash this state when the interval changes, so the first -thing that needs to happen after activation is that the primary and -replicas apply up to last_update so that the empty cache will be -correct. - -During peering, it is now obvious that an unapplied prepare operation -can easily be rolled back simply by deleting the associated temporary -object and removing that entry from the in-memory data structure. - -Partial Application Peering/Recovery modifications --------------------------------------------------- - -Some writes will be small enough to not require updating all of the -shards holding data blocks. For write amplification minimization -reasons, it would be best to avoid writing to those shards at all, -and delay even sending the log entries until the next write which -actually hits that shard. - -The delaying (buffering) of the transmission of the prepare and apply -operations for witnessing OSDs creates new situations that peering -must handle. In particular, the logic for determining the authoritative -last_update value (and hence the selection of the OSD which has the -authoritative log) must be modified to account for the valid but -missing (i.e., delayed/buffered) pglog entries to which the -authoritative OSD was only a witness. - -Because a partial write might complete without persisting a log entry -on every replica, we have to do a bit more work to determine an -authoritative last_update. The constraint (as with a replicated PG) -is that last_update >= the most recent log entry for which a commit -was sent to the client (call this actual_last_update). 
Secondarily, we want last_update to be as small as possible, since any
log entry past actual_last_update (we do not apply a log entry until
we have sent the commit to the client) must be able to be rolled back.
Thus, the smaller a last_update we choose, the less recovery will need
to happen (we can always roll back, but rolling a replica forward may
require an object rebuild). Thus, we will set last_update to one entry
before the oldest log entry we can prove cannot have been committed.
In current master, this is simply the last_update of the shortest log
from that interval (because that log did not persist any entry past
that point -- a precondition for sending a commit to the client). For
this design, we must consider the possibility that any log is missing,
at its head, log entries in which it did not participate. Thus, we
must determine the most recent interval in which we went active
(essentially, this is what find_best_info currently does). We then
pull the log from each live osd from that interval back to the minimum
last_update among them. Then, we extend all logs from the
authoritative interval until each hits an entry in which it should
have participated, but did not record. The shortest of these extended
logs must therefore contain any log entry for which we sent a commit
to the client -- and its last entry gives us our last_update.

Deep scrub support
------------------

The simple answer here is probably our best bet. EC pools can't use
the omap namespace at all right now. The simplest solution would be
to take a prefix of the omap space and pack N M-byte L-bit checksums
into each key/value. The prefixing seems like a sensible precaution
against eventually wanting to store something else in the omap space.
It seems like any write will need to read at least the blocks
containing the modified range. However, with a code able to compute
parity deltas, we may not need to read a whole stripe.
Even without that, we don't want to have to write to blocks not
participating in the write. Thus, each shard should store checksums
only for itself. It seems like you'd be able to store checksums for
all shards on the parity blocks, but there may not be distinguished
parity blocks which are modified on all writes (LRC or shec provide
two examples). L should probably have a fixed number of options (16,
32, 64?) and be configurable per-pool at pool creation. N and M should
likewise be configurable at pool creation with sensible defaults.

We need to handle online upgrade. I think the right answer is that
the first overwrite to an object with an append-only checksum
removes the append-only checksum and writes in whatever stripe
checksums actually got written. The next deep scrub then writes
out the full checksum omap entries.

RADOS Client Acknowledgement Generation Optimization
====================================================

Now that the recovery scheme is understood, we can discuss the
generation of the RADOS operation acknowledgement (ACK) by the
primary ("sufficient" from above). It is NOT required that the primary
wait for all shards to complete their respective prepare
operations. Using our example where the RADOS operation writes only
"W" chunks of the stripe, the primary will generate and send W+M
prepare operations (possibly including a send-to-self). The primary
need only wait for enough shards to be written to ensure recovery of
the data. Thus, after writing W+M chunks, we can afford the loss of M
chunks. Hence the primary can generate the RADOS ACK after W+M-M => W
of those prepare operations are completed.

Inconsistent object_info_t versions
===================================

A natural consequence of only writing the blocks which actually
changed is that we don't want to update the object_info_t of the
objects which didn't.
I actually think it would pose a problem to do so: pg ghobject
namespaces are generally large, and unless the osd is seeing a bunch
of overwrites on a small set of objects, I'd expect each write to be
far enough apart in the backing ghobject_t->data mapping for each to
constitute a random metadata update. Thus, we have to accept that not
every shard will have the current version in its object_info_t. We
can't even bound how old the version on a particular shard will happen
to be. In particular, the primary does not necessarily have the
current version. One could argue that the parity shards would always
have the current version, but not every code necessarily has
designated parity shards which see every write (certainly LRC, iirc
shec, and even with a more pedestrian code, it might be desirable to
rotate the shards based on object hash). Even if you chose to
designate a shard as witnessing all writes, the pg might be degraded
with that particular shard missing. This is a bit tricky: currently,
reads and writes implicitly return the most recent version of the
object written. On reads, we'd have to read K shards to answer that
question. We can get around that by adding a "don't tell me the
current version" flag. Writes are more problematic: we need an
object_info from the most recent write in order to form the new
object_info and log_entry.

A truly terrifying option would be to eliminate version and
prior_version entirely from the object_info_t. There are a few
specific purposes it serves:

#. On OSD startup, we prime the missing set by scanning backwards
   from last_update to last_complete comparing the stored object's
   object_info_t to the version of the most recent log entry.
#. During backfill, we compare versions between primary and target
   to avoid some pushes. We use it elsewhere as well.
#. While pushing and pulling objects, we verify the version.
#. We return it on reads and writes and allow the librados user to
   assert it atomically on writes to allow the user to deal with write
   races (used extensively by rbd).

Case (3) isn't actually essential, just convenient. Oh well. (4)
is more annoying. Writes are easy since we know the version. Reads
are tricky because we may not need to read from all of the replicas.
The simplest solution is to add a flag to rados operations to just not
return the user version on read. We can also just not support the
user version assert on ec for now (I think? The only user is rgw
bucket indices iirc, and those will always be on replicated pools
because they use omap).

We can avoid (1) by maintaining the missing set explicitly. It's
already possible for there to be a missing object without a
corresponding log entry. (Consider the case where the most recent
write is to an object which has not been updated in weeks. If that
write becomes divergent, the written object needs to be marked missing
based on the prior_version, which is not in the log.) The PGLog
already has a way of handling those edge cases (see divergent_priors).
We'd simply expand that to contain the entire missing set and maintain
it atomically with the log and the objects. This isn't really an
unreasonable option: the additional keys would be fewer than the
existing log keys + divergent_priors and aren't updated in the fast
write path anyway.

The second case is a bit trickier. It's really an optimization for
the case where a pg became not in the acting set long enough for the
logs to no longer overlap but not long enough for the PG to have
healed and removed the old copy. Unfortunately, this describes the
case where a node was taken down for maintenance with noout set. It's
probably not acceptable to re-backfill the whole OSD in such a case,
so we need to be able to quickly determine whether a particular shard
is up to date given a valid acting set of other shards.
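A minimal sketch of what maintaining the missing set explicitly might look like: an expanded, divergent_priors-style map persisted atomically with the log instead of being reconstructed from it on startup. The class and method names here are hypothetical illustrations, not Ceph code.

```python
class ExplicitMissingSet:
    """object -> (need, have) versions, maintained atomically with the
    pg log rather than rebuilt by scanning the log on OSD startup."""
    def __init__(self):
        self.missing = {}   # obj -> (need, have)

    def mark_missing(self, obj, need, have):
        # called when a write is recorded but not applied locally, or
        # when a divergent entry forces a rollback past the log tail
        self.missing[obj] = (need, have)

    def got(self, obj, version):
        # recovery delivered the object at `version`
        need, _ = self.missing.get(obj, (None, None))
        if need is not None and version >= need:
            del self.missing[obj]

    def is_missing(self, obj):
        return obj in self.missing
```

As the text notes, the extra keys would be fewer than the existing log keys plus divergent_priors, and they sit outside the fast write path.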
Let ordinary writes which do not change the object size not touch the
object_info at all. That means that the object_info version won't
match the pg log entry version. Include in the pg_log_entry_t the
current object_info version as well as which shards participated (as
mentioned above). In addition to the object_info_t attr, record on
each shard s a vector recording, for each other shard s', the most
recent write which spanned both s and s'. Operationally, we maintain
an attr on each shard containing that vector. A write touching S
updates the version stamp entry for each shard in S on each shard in
S's attribute (and leaves the rest alone). If we have a valid acting
set during backfill, we must have a witness of every write which
completed -- so taking the max of each entry over all of the acting
set shards must give us the current version for each shard. During
recovery, we set the attribute on the recovery target to that max
vector. (Question: with LRC, we may not need to touch much of the
acting set to recover a particular shard -- can we just use the max of
the shards we used to recover, or do we need to grab the version
vector from the rest of the acting set as well? I'm not sure; not a
big deal anyway, I think.)

The above lets us perform blind writes without knowing the current
object version (log entry version, that is) while still allowing us to
avoid backfilling up-to-date objects. The only catch is that our
backfill scans will scan all replicas, not just the primary and the
backfill targets.

It would be worth adding into scrub the ability to check the
consistency of the gathered version vectors -- probably by just
taking 3 random valid subsets and verifying that they generate
the same authoritative version vector.

Implementation Strategy
=======================

It goes without saying that it would be unwise to attempt to do all of
this in one massive PR.
It's also not a good idea to merge code which isn't being tested. To
that end, it's worth thinking a bit about which bits can be tested on
their own (perhaps with a bit of temporary scaffolding).

We can implement the overwrite-friendly checksumming scheme easily
enough with the current implementation. We'll want to enable it on a
per-pool basis (probably using a flag which we'll later repurpose for
actual overwrite support). We can enable it in some of the ec
thrashing tests in the suite. We can also add a simple test
validating the behavior of turning it on for an existing ec pool
(later, we'll want to be able to convert append-only ec pools to
overwrite ec pools, so that test will simply be expanded as we go).
The flag should be gated by the experimental feature flag since we
won't want to support this as a valid configuration -- testing only.
We need to upgrade append-only pools in place during deep scrub.

Similarly, we can implement the unstable extent cache with the current
implementation; it even lets us cut out the readable ack the replicas
send to the primary after the commit, which lets it release the lock.
Same deal: implement, gate with the experimental flag, add to some of
the automated tests. I don't really see a reason not to use the same
flag as above.

We can certainly implement the move-range primitive with unit tests
before there are any users. Adding coverage to the existing
objectstore tests would suffice here.

The explicit missing set can be implemented now, same deal as above --
might as well even use the same feature bit.

The TPC protocol outlined above can actually be implemented on an
append-only EC pool. Same deal as above; it can even use the same
feature bit.

The RADOS flag to suppress the read op user version return can be
implemented immediately. It mostly just needs unit tests.

The version vector problem is an interesting one.
For append-only EC pools, it would be pointless since all writes
increase the size and therefore update the object_info. We could do
it for replicated pools, though. It's a bit silly since all "shards"
see all writes, but it would still let us implement and partially test
the augmented backfill code as well as the extra pg log entry fields
-- this depends on the explicit pg log entry branch having already
merged. It's not entirely clear to me that this one is worth doing
separately. It's enough code that I'd really prefer to get it done
independently, but it's also a fair amount of scaffolding that will be
later discarded.

PGLog entries need to be able to record the participants, and log
comparison needs to be modified to extend logs with entries they
wouldn't have witnessed. This logic should be abstracted behind
PGLog so it can be unittested -- that would let us test it somewhat
before the actual ec overwrites code merges.

Whatever needs to happen to the ec plugin interface can probably be
done independently of the rest of this (pending resolution of
questions below).

The actual nuts and bolts of performing the ec overwrite can't, it
seems to me, be productively tested (and therefore implemented) until
the above are complete, so it is best to get all of the supporting
code in first.

Open Questions
==============

Is there a code we should be using that would let us compute a parity
delta without rereading and reencoding the full stripe? If so, is it
the kind of thing we need to design for now, or can it be reasonably
put off?

What needs to happen to the EC plugin interface?
diff --git a/src/ceph/doc/dev/osd_internals/index.rst b/src/ceph/doc/dev/osd_internals/index.rst deleted file mode 100644 index 7e82914..0000000 --- a/src/ceph/doc/dev/osd_internals/index.rst +++ /dev/null @@ -1,10 +0,0 @@
==============================
OSD developer documentation
==============================

.. rubric:: Contents

.. toctree::
   :glob:

   *
diff --git a/src/ceph/doc/dev/osd_internals/last_epoch_started.rst b/src/ceph/doc/dev/osd_internals/last_epoch_started.rst deleted file mode 100644 index 9978bd3..0000000 --- a/src/ceph/doc/dev/osd_internals/last_epoch_started.rst +++ /dev/null @@ -1,60 +0,0 @@
======================
last_epoch_started
======================

info.last_epoch_started records an activation epoch e for interval i
such that all writes committed in i or earlier are reflected in the
local info/log and no writes after i are reflected in the local
info/log. Since no committed write is ever divergent, even if we
get an authoritative log/info with an older info.last_epoch_started,
we can leave our info.last_epoch_started alone, since no writes could
have committed in any intervening interval (see PG::proc_master_log).

info.history.last_epoch_started records a lower bound on the most
recent interval in which the pg as a whole went active and accepted
writes. On a particular osd, it is also an upper bound on the
activation epoch of intervals in which writes in the local pg log
occurred (we update it before accepting writes). Because all
committed writes are committed by all acting set osds, any
non-divergent write ensures that history.last_epoch_started was
recorded by all acting set members in the interval. Once peering has
queried one osd from each interval back to some seen
history.last_epoch_started, it follows that no interval after the max
history.last_epoch_started can have reported writes as committed
(since we record it before recording client writes in an interval).
Thus, the minimum last_update across all infos with
info.last_epoch_started >= MAX(history.last_epoch_started) must be an
upper bound on writes reported as committed to the client.
We update info.last_epoch_started with the initial activation message,
but we only update history.last_epoch_started after the new
info.last_epoch_started is persisted (possibly along with the first
write). This ensures that we do not require an osd with the most
recent info.last_epoch_started until all acting set osds have recorded
it.

In find_best_info, we do include info.last_epoch_started values when
calculating max_last_epoch_started_found because we want to avoid
designating as divergent a log entry which in a prior interval would
have been non-divergent, since it might have been used to serve a
read. In activate(), we use the peer's last_epoch_started value as a
bound on how far back divergent log entries can be found.

However, in a case like

.. code::

  calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
  calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556

since osd.1 is the only one which recorded info.les=477 while 4 and 0,
which were the acting set in that interval, did not (4 restarted and 0
did not get the message in time), the pg is marked incomplete even
though either 4 or 0 would have been valid choices. To avoid this, we
do not consider info.les for incomplete peers when calculating
min_last_epoch_started_found. The incomplete peer would not have been
in the acting set, so we must have another osd from that interval
anyway (if maybe_went_rw). If that osd does not remember that
info.les, then we cannot have served reads.
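The bound argued for above can be condensed into a small sketch. The structures below are illustrative only, not the actual find_best_info code; incomplete peers are excluded as just described.

```python
def upper_bound_on_committed(infos):
    """infos: per-osd dicts with 'les' (info.last_epoch_started),
    'last_update', and 'incomplete'. Returns the minimum last_update
    among infos whose les reaches the newest les seen, which bounds
    any write reported as committed to a client."""
    # incomplete peers could not have been in the acting set, so
    # their les values are ignored (as in min_last_epoch_started_found)
    max_les = max(i["les"] for i in infos if not i["incomplete"])
    return min(i["last_update"] for i in infos
               if not i["incomplete"] and i["les"] >= max_les)
```

With numbers shaped like the calc_acting example above (osd.0 and osd.4 at les=473, osd.1 incomplete at les=477), the bound comes from the complete peers at les=473.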
diff --git a/src/ceph/doc/dev/osd_internals/log_based_pg.rst b/src/ceph/doc/dev/osd_internals/log_based_pg.rst deleted file mode 100644 index 8b11012..0000000 --- a/src/ceph/doc/dev/osd_internals/log_based_pg.rst +++ /dev/null @@ -1,206 +0,0 @@
============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular pg through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-pg version number
-- this is referred to in the code with type eversion_t and stored as
pg_info_t::last_update. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up to date locally (see recovery and
backfill).
In practice, the log will extend much further -(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not -clean) because it's handy for quickly performing recovery. - -Using this log, as long as we talk to a non-empty subset of the OSDs -which must have accepted any completed writes from the most recent -interval in which we accepted writes, we can determine a conservative -log which must contain any write which has been reported to a client -as committed. There is some freedom here, we can choose any log entry -between the oldest head remembered by an element of that set (any -newer cannot have completed without that log containing it) and the -newest head remembered (clearly, all writes in the log were started, -so it's fine for us to remember them) as the new head. This is the -main point of divergence between replicated pools and ec pools in -PG/PrimaryLogPG: replicated pools try to choose the newest valid -option to avoid the client needing to replay those operations and -instead recover the other copies. EC pools instead try to choose -the *oldest* option available to them. - -The reason for this gets to the heart of the rest of the differences -in implementation: one copy will not generally be enough to -reconstruct an ec object. Indeed, there are encodings where some log -combinations would leave unrecoverable objects (as with a 4+2 encoding -where 3 of the replicas remember a write, but the other 3 do not -- we -don't have 3 copies of either version). For this reason, log entries -representing *unstable* writes (writes not yet committed to the -client) must be rollbackable using only local information on ec pools. -Log entries in general may therefore be rollbackable (and in that case, -via a delayed application or via a set of instructions for rolling -back an inplace update) or not. Replicated pool log entries are -never able to be rolled back. 
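The peering difference described above amounts to choosing opposite ends of the same safe range. A sketch (not the actual merge_log logic):

```python
def choose_new_head(remembered_heads, ec_pool):
    """remembered_heads: log head versions from the OSDs we must consult.
    Any value between the oldest and newest remembered head is a valid
    choice. Replicated pools take the newest (sparing the client a
    replay); EC pools take the oldest (so every unstable entry past the
    head stays locally rollbackable)."""
    return min(remembered_heads) if ec_pool else max(remembered_heads)
```

The asymmetry exists because a single surviving copy suffices to roll a replicated object forward, while an EC object may be unreconstructable if too few shards remember a write.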
For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
osd_types.h:pg_log_entry_t, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

So, the fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed two entire
implementations of PrimaryLogPG, one for each of the two, when there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for ec
#. How writes work -- either restricted to append, or must write aside and do a
   tpc
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of osds
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over the differences:

#. PGBackend
#. PGTransaction
#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for ec
#. Misc other kludges here and there

PGBackend and PGTransaction enable abstraction of differences 1 and 2,
and the addition of 4 as needed to the log entries.
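Difference 1 (the read path) can be sketched as a minimal backend split. The classes below are hypothetical simplifications; the real interfaces (objects_read_sync/objects_read_async, discussed under Client Reads below) live in PGBackend.h, and the concatenation here merely stands in for actual EC decode:

```python
class PGBackend:
    def objects_read_sync(self, obj):
        raise NotImplementedError

    def objects_read_async(self, obj, on_complete):
        raise NotImplementedError


class ReplicatedBackend(PGBackend):
    """Reads are served synchronously from the primary's local copy."""
    def __init__(self, store):
        self.store = store   # obj -> bytes

    def objects_read_sync(self, obj):
        return self.store[obj]

    def objects_read_async(self, obj, on_complete):
        on_complete(self.store[obj])


class ECBackend(PGBackend):
    """Reads must gather k shards and reconstruct, so only async works."""
    def __init__(self, shards, k):
        self.shards, self.k = shards, k   # list of {obj -> shard bytes}

    def objects_read_sync(self, obj):
        raise NotImplementedError("EC reads require remote shards")

    def objects_read_async(self, obj, on_complete):
        # gather k shards, then "decode" (concatenation stands in here)
        pieces = [self.shards[i][obj] for i in range(self.k)]
        on_complete(b"".join(pieces))
```

The write-side difference (append-or-TPC) would live behind a PGTransaction-like abstraction in the same spirit.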
The replicated implementation is in ReplicatedBackend.h/cc and doesn't
require much explanation, I think. More detail on the ECBackend can be
found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.

PGBackend Interface Explanation
===============================

Note: this is from a design document from before the original firefly
release and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable iff it is present on
the primary (at the right version). For an ec pool, we need at least
M shards present to do a read, and we need it on the primary. For
this reason, PGBackend needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that it

Core Changes:

- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
  | objects to allow the user to make these determinations.

Client Reads
------------

Reads with the replicated strategy can always be satisfied
synchronously out of the primary OSD. With an erasure coded strategy,
the primary will need to request data from some number of replicas in
order to satisfy a read. PGBackend will therefore need to provide
separate objects_read_sync and objects_read_async interfaces where
the former won't be implemented by the ECBackend.

PGBackend interfaces:

- objects_read_sync
- objects_read_async

Scrub
-----

We currently have two scrub modes with different default frequencies:

#. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
#. deep scrub: compares the set of objects, metadata, and a crc32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects.
The replica fills out this scrubmap for the range -of objects including, if the scrub is deep, a crc32 of the contents of -each object. The primary gathers these scrubmaps from each replica -and performs a comparison identifying inconsistent objects. - -Most of this can work essentially unchanged with erasure coded PG with -the caveat that the PGBackend implementation must be in charge of -actually doing the scan. - - -PGBackend interfaces: - -- be_* - -Recovery --------- - -The logic for recovering an object depends on the backend. With -the current replicated strategy, we first pull the object replica -to the primary and then concurrently push it out to the replicas. -With the erasure coded strategy, we probably want to read the -minimum number of replica chunks required to reconstruct the object -and push out the replacement chunks concurrently. - -Another difference is that objects in erasure coded pg may be -unrecoverable without being unfound. The "unfound" concept -should probably then be renamed to unrecoverable. Also, the -PGBackend implementation will have to be able to direct the search -for pg replicas with unrecoverable object chunks and to be able -to determine whether a particular object is recoverable. 
- - -Core changes: - -- s/unfound/unrecoverable - -PGBackend interfaces: - -- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_ -- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_ -- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_ -- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_ -- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_ diff --git a/src/ceph/doc/dev/osd_internals/map_message_handling.rst b/src/ceph/doc/dev/osd_internals/map_message_handling.rst deleted file mode 100644 index a5013c2..0000000 --- a/src/ceph/doc/dev/osd_internals/map_message_handling.rst +++ /dev/null @@ -1,131 +0,0 @@ -=========================== -Map and PG Message handling -=========================== - -Overview --------- -The OSD handles routing incoming messages to PGs, creating the PG if necessary -in some cases. - -PG messages generally come in two varieties: - - 1. Peering Messages - 2. Ops/SubOps - -There are several ways in which a message might be dropped or delayed. It is -important that the message delaying does not result in a violation of certain -message ordering requirements on the way to the relevant PG handling logic: - - 1. Ops referring to the same object must not be reordered. - 2. Peering messages must not be reordered. - 3. Subops must not be reordered. - -MOSDMap -------- -MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the -OSD must perform several tasks: - - 1. Persist the new maps to the filestore. - Several PG operations rely on having access to maps dating back to the last - time the PG was clean. - 2. Update and persist the superblock. - 3. Update OSD state related to the current map. - 4. Expose new maps to PG processes via *OSDService*. - 5. Remove PGs due to pool removal. - 6. Queue dummy events to trigger PG map catchup. 
- -Each PG asynchronously catches up to the currently published map during -process_peering_events before processing the event. As a result, different -PGs may have different views as to the "current" map. - -One consequence of this design is that messages containing submessages from -multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage -with the PG's epoch as well as tagging the message as a whole with the OSD's -current published epoch. - -MOSDPGOp/MOSDPGSubOp --------------------- -See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op - -MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used -between OSDs to coordinate most non peering activities including replicating -MOSDPGOp operations. - -OSD::require_same_or_newer map checks that the current OSDMap is at least -as new as the map epoch indicated on the message. If not, the message is -queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this -cannot violate the above conditions since any two messages will be queued -in order of receipt and if a message is received with epoch e0, a later message -from the same source must be at epoch at least e0. Note that two PGs from -the same OSD count for these purposes as different sources for single PG -messages. That is, messages from different PGs may be reordered. - - -MOSDPGOps follow the following process: - - 1. OSD::handle_op: validates permissions and crush mapping. - discard the request if they are not connected and the client cannot get the reply ( See OSD::op_is_discardable ) - See OSDService::handle_misdirected_op - See PG::op_has_sufficient_caps - See OSD::require_same_or_newer_map - 2. OSD::enqueue_op - -MOSDSubOps follow the following process: - - 1. OSD::handle_sub_op checks that sender is an OSD - 2. OSD::enqueue_op - -OSD::enqueue_op calls PG::queue_op which checks waiting_for_map before calling OpWQ::queue which adds the op to the queue of the PG responsible for handling it. 
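The map-epoch gate described above can be sketched as follows (hypothetical names; the real logic lives in OSD::require_same_or_newer_map and OSD::wait_for_new_map). Because any later message from the same source carries an epoch at least as new, a FIFO wait queue preserves per-source ordering:

```python
class MapGate:
    """Queue ops whose epoch is ahead of the OSD's published map."""
    def __init__(self, published_epoch):
        self.epoch = published_epoch
        self.waiting = []          # FIFO: arrival order preserved

    def handle(self, op, process):
        """Process op now, or queue it until the map catches up.
        Returns True if processed immediately."""
        if op["epoch"] > self.epoch:
            self.waiting.append((op, process))
            return False
        process(op)
        return True

    def advance(self, new_epoch):
        """New map arrived: drain every op it unblocks, in order."""
        self.epoch = new_epoch
        still_waiting = []
        for op, process in self.waiting:
            if op["epoch"] <= self.epoch:
                process(op)
            else:
                still_waiting.append((op, process))
        self.waiting = still_waiting
```

Note the ordering caveat from the text carries over: ops from *different* PGs (different sources for this purpose) may still be reordered relative to each other.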
- -OSD::dequeue_op is then eventually called, with a lock on the PG. At -this time, the op is passed to PG::do_request, which checks that: - - 1. the PG map is new enough (PG::must_delay_op) - 2. the client requesting the op has enough permissions (PG::op_has_sufficient_caps) - 3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill}) - 4. the PG is active (PG::flushed boolean) - 5. the op is a CEPH_MSG_OSD_OP and the PG is in PG_STATE_ACTIVE state and not in PG_STATE_REPLAY - -If these conditions are not met, the op is either discarded or queued for later processing. If all conditions are met, the op is processed according to its type: - - 1. CEPH_MSG_OSD_OP is handled by PG::do_op - 2. MSG_OSD_SUBOP is handled by PG::do_sub_op - 3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply - 4. MSG_OSD_PG_SCAN is handled by PG::do_scan - 5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill - -CEPH_MSG_OSD_OP processing --------------------------- - -PrimaryLogPG::do_op handles CEPH_MSG_OSD_OP op and will queue it - - 1. in wait_for_all_missing if it is a CEPH_OSD_OP_PGLS for a designated snapid and some object updates are still missing - 2. in waiting_for_active if the op may write but the scrubber is working - 3. in waiting_for_missing_object if the op requires an object or a snapdir or a specific snap that is still missing - 4. in waiting_for_degraded_object if the op may write an object or a snapdir that is degraded, or if another object blocks it ("blocked_by") - 5. in waiting_for_backfill_pos if the op requires an object that will be available after the backfill is complete - 6. in waiting_for_ack if an ack from another OSD is expected - 7. in waiting_for_ondisk if the op is waiting for a write to complete - -Peering Messages ----------------- -See OSD::handle_pg_(notify|info|log|query) - -Peering messages are tagged with two epochs: - - 1. epoch_sent: map epoch at which the message was sent - 2. 
query_epoch: map epoch at which the message triggering the message was sent

These are the same in cases where there was no triggering message. We
discard a peering message if the PG in question has entered a new
epoch since the message's query_epoch (see PG::old_peering_evt,
PG::queue_peering_event). Notifies, infos, queries, and logs are all
handled as PG::RecoveryMachine events and are wrapped by PG::queue_*
into PG::CephPeeringEvts, which include the created state machine
event along with epoch_sent and query_epoch in order to generically
check PG::old_peering_message upon insertion into and removal from the
queue.

Note: notifies, logs, and infos can trigger the creation of a PG. See
OSD::get_or_create_pg.

diff --git a/src/ceph/doc/dev/osd_internals/osd_overview.rst b/src/ceph/doc/dev/osd_internals/osd_overview.rst deleted file mode 100644 index 192ddf8..0000000 --- a/src/ceph/doc/dev/osd_internals/osd_overview.rst +++ /dev/null @@ -1,106 +0,0 @@
===
OSD
===

Concepts
--------

*Messenger*
  See src/msg/Messenger.h

  Handles sending and receipt of messages on behalf of the OSD. The OSD uses
  two messengers:

  1. cluster_messenger - handles traffic to other OSDs, monitors
  2. client_messenger - handles client traffic

  This division allows the OSD to be configured with different interfaces for
  client and cluster traffic.

*Dispatcher*
  See src/msg/Dispatcher.h

  OSD implements the Dispatcher interface. Of particular note is ms_dispatch,
  which serves as the entry point for messages received via either the client
  or cluster messenger. Because there are two messengers, ms_dispatch may be
  called from at least two threads. The osd_lock is always held during
  ms_dispatch.

*WorkQueue*
  See src/common/WorkQueue.h

  The WorkQueue class abstracts the process of queueing independent tasks
  for asynchronous execution. Each OSD process contains workqueues for
  distinct tasks:

  1.
OpWQ: handles ops (from clients) and subops (from other OSDs). - Runs in the op_tp threadpool. - 2. PeeringWQ: handles peering tasks and pg map advancement - Runs in the op_tp threadpool. - See Peering - 3. CommandWQ: handles commands (pg query, etc) - Runs in the command_tp threadpool. - 4. RecoveryWQ: handles recovery tasks. - Runs in the recovery_tp threadpool. - 5. SnapTrimWQ: handles snap trimming - Runs in the disk_tp threadpool. - See SnapTrimmer - 6. ScrubWQ: handles primary scrub path - Runs in the disk_tp threadpool. - See Scrub - 7. ScrubFinalizeWQ: handles primary scrub finalize - Runs in the disk_tp threadpool. - See Scrub - 8. RepScrubWQ: handles replica scrub path - Runs in the disk_tp threadpool - See Scrub - 9. RemoveWQ: Asynchronously removes old pg directories - Runs in the disk_tp threadpool - See PGRemoval - -*ThreadPool* - See src/common/WorkQueue.h - See also above. - - There are 4 OSD threadpools: - - 1. op_tp: handles ops and subops - 2. recovery_tp: handles recovery tasks - 3. disk_tp: handles disk intensive tasks - 4. command_tp: handles commands - -*OSDMap* - See src/osd/OSDMap.h - - The crush algorithm takes two inputs: a picture of the cluster - with status information about which nodes are up/down and in/out, - and the pgid to place. The former is encapsulated by the OSDMap. - Maps are numbered by *epoch* (epoch_t). These maps are passed around - within the OSD as std::tr1::shared_ptr<const OSDMap>. - - See MapHandling - -*PG* - See src/osd/PG.* src/osd/PrimaryLogPG.* - - Objects in rados are hashed into *PGs* and *PGs* are placed via crush onto - OSDs. The PG structure is responsible for handling requests pertaining to - a particular *PG* as well as for maintaining relevant metadata and controlling - recovery. - -*OSDService* - See src/osd/OSD.cc OSDService - - The OSDService acts as a broker between PG threads and OSD state which allows - PGs to perform actions using OSD services such as workqueues and messengers. 
- This is still a work in progress. Future cleanups will focus on moving such - state entirely from the OSD into the OSDService. - -Overview -------- - See src/ceph_osd.cc - - The OSD process represents one leaf device in the crush hierarchy. There - might be one OSD process per physical machine, or more than one if, for - example, the user configures one OSD instance per disk. - diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.rst b/src/ceph/doc/dev/osd_internals/osd_throttles.rst deleted file mode 100644 index 6739bd9..0000000 --- a/src/ceph/doc/dev/osd_internals/osd_throttles.rst +++ /dev/null @@ -1,93 +0,0 @@ -============= -OSD Throttles -============= - -There are three significant throttles in the filestore: wbthrottle, -op_queue_throttle, and a throttle based on journal usage. - -WBThrottle ---------- -The WBThrottle is defined in src/os/filestore/WBThrottle.[h,cc] and -included in FileStore as FileStore::wbthrottle. The intention is to -bound the amount of outstanding IO we need to do to flush the journal. -At the same time, we don't necessarily want to do it inline in case we -might be able to combine several IOs on the same object close together -in time. Thus, in FileStore::_write, we queue the fd for asynchronous -flushing and block in FileStore::_do_op if we have exceeded any hard -limits until the background flusher catches up. - -The relevant config options are filestore_wbthrottle*. There are -different defaults for xfs and btrfs. Each set has hard and soft -limits on bytes (total dirty bytes), ios (total dirty ios), and -inodes (total dirty fds). The WBThrottle will begin flushing -when any of these hits the soft limit and will block in throttle() -while any has exceeded the hard limit. - -Tighter soft limits will cause writeback to happen more quickly, -but may cause the OSD to miss opportunities for write coalescing. 
-Tighter hard limits may cause a reduction in latency variance by -reducing time spent flushing the journal, but may reduce writeback -parallelism. - -op_queue_throttle ----------------- -The op queue throttle is intended to bound the amount of queued but -uncompleted work in the filestore by delaying threads calling -queue_transactions more and more based on how many ops and bytes are -currently queued. The throttle is taken in queue_transactions and -released when the op is applied to the filesystem. This period -includes time spent in the journal queue, time spent writing to the -journal, time spent in the actual op queue, time spent waiting for the -wbthrottle to open up (thus, the wbthrottle can push back indirectly -on the queue_transactions caller), and time spent actually applying -the op to the filesystem. A BackoffThrottle is used to gradually -delay the queueing thread after each throttle becomes more than -filestore_queue_low_threshhold full (a ratio of -filestore_queue_max_(bytes|ops)). The throttles will block once the -max value is reached (filestore_queue_max_(bytes|ops)). - -The significant config options are: -filestore_queue_low_threshhold -filestore_queue_high_threshhold -filestore_expected_throughput_ops -filestore_expected_throughput_bytes -filestore_queue_high_delay_multiple -filestore_queue_max_delay_multiple - -While each throttle is at less than low_threshhold of the max, -no delay happens. Between low and high, the throttle will -inject a per-op delay (per op or byte) ramping from 0 at low to -high_delay_multiple/expected_throughput at high. From high to -1, the delay will ramp from high_delay_multiple/expected_throughput -to max_delay_multiple/expected_throughput. - -filestore_queue_high_delay_multiple and -filestore_queue_max_delay_multiple probably do not need to be -changed. - -Setting these properly should help to smooth out op latencies by -mostly avoiding the hard limit. - -See FileStore::throttle_ops and FileStore::throttle_bytes.
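The piecewise delay ramp described above can be sketched numerically. This is an illustrative Python model of the BackoffThrottle curve, not the actual C++ implementation; the function name and parameter names are invented stand-ins for the filestore_queue_* options:

```python
def backoff_delay(current, maximum, low_ratio, high_ratio,
                  high_multiple, max_multiple, expected_throughput):
    # Piecewise-linear per-op delay: 0 below the low threshold, ramping to
    # high_multiple/expected_throughput at the high threshold, then up to
    # max_multiple/expected_throughput when the throttle is completely full.
    r = current / maximum
    high_delay = high_multiple / expected_throughput
    max_delay = max_multiple / expected_throughput
    if r < low_ratio:
        return 0.0
    if r < high_ratio:
        # ramp from 0 at low to high_delay at high
        return high_delay * (r - low_ratio) / (high_ratio - low_ratio)
    # ramp from high_delay at high to max_delay at 1.0
    return high_delay + (max_delay - high_delay) * (r - high_ratio) / (1.0 - high_ratio)
```

With low=0.3 and high=0.9, an op arriving when the queue is 60% full sees half of the high delay, and an op arriving when the queue is completely full sees the max delay.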
- -journal usage throttle ---------------------- -See src/os/filestore/JournalThrottle.h/cc - -The intention of the journal usage throttle is to gradually slow -down queue_transactions callers as the journal fills up in order -to smooth out hiccups during filestore syncs. JournalThrottle -wraps a BackoffThrottle and tracks journaled but not flushed -journal entries so that the throttle can be released when the -journal is flushed. The configs work very similarly to the -op_queue_throttle. - -The significant config options are: -journal_throttle_low_threshhold -journal_throttle_high_threshhold -filestore_expected_throughput_ops -filestore_expected_throughput_bytes -journal_throttle_high_multiple -journal_throttle_max_multiple - -.. literalinclude:: osd_throttles.txt diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.txt b/src/ceph/doc/dev/osd_internals/osd_throttles.txt deleted file mode 100644 index 0332377..0000000 --- a/src/ceph/doc/dev/osd_internals/osd_throttles.txt +++ /dev/null @@ -1,21 +0,0 @@ - Messenger throttle (number and size) - |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| - FileStore op_queue throttle (number and size, includes a soft throttle based on filestore_expected_throughput_(ops|bytes)) - |--------------------------------------------------------| - WBThrottle - |---------------------------------------------------------------------------------------------------------| - Journal (size, includes a soft throttle based on filestore_expected_throughput_bytes) - |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| - |----------------------------------------------------------------------------------------------------> flushed ---------------->
synced - | -Op: Read Header --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> op_applied -------------------------------------------------------------> Complete - | | -SubOp: --Messenger--> ReadHeader --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> sub_op_applied - - | - |-----------------------------> flushed ----------------> synced - |------------------------------------------------------------------------------------------| - Journal (size) - |---------------------------------| - WBThrottle - |-----------------------------------------------------| - FileStore op_queue throttle (number and size) diff --git a/src/ceph/doc/dev/osd_internals/pg.rst b/src/ceph/doc/dev/osd_internals/pg.rst deleted file mode 100644 index 4055363..0000000 --- a/src/ceph/doc/dev/osd_internals/pg.rst +++ /dev/null @@ -1,31 +0,0 @@ -==== -PG -==== - -Concepts -------- - -*Peering Interval* - See PG::start_peering_interval. - See PG::acting_up_affected - See PG::RecoveryState::Reset - - A peering interval is a maximal set of contiguous map epochs in which the - up and acting sets did not change. PG::RecoveryMachine represents a - transition from one interval to another as passing through - RecoveryState::Reset. On PG::RecoveryState::AdvMap PG::acting_up_affected can - cause the pg to transition to Reset. - - -Peering Details and Gotchas --------------------------- -For an overview of peering, see `Peering <../../peering>`_. - - * PG::flushed defaults to false and is set to false in - PG::start_peering_interval. Upon transitioning to PG::RecoveryState::Started - we send a transaction through the pg op sequencer which, upon completion, - sends a FlushedEvt that sets flushed to true. The primary cannot go - active until this happens (See PG::RecoveryState::WaitFlushedPeering). - Replicas can go active but cannot serve ops (writes or reads). 
- This is necessary because we cannot read our ondisk state until unstable - transactions from the previous interval have cleared. diff --git a/src/ceph/doc/dev/osd_internals/pg_removal.rst b/src/ceph/doc/dev/osd_internals/pg_removal.rst deleted file mode 100644 index d968ecc..0000000 --- a/src/ceph/doc/dev/osd_internals/pg_removal.rst +++ /dev/null @@ -1,56 +0,0 @@ -========== -PG Removal -========== - -See OSD::_remove_pg, OSD::RemoveWQ - -There are two ways for a pg to be removed from an OSD: - - 1. MOSDPGRemove from the primary - 2. OSD::advance_map finds that the pool has been removed - -In either case, our general strategy for removing the pg is to -atomically set the metadata objects (pg->log_oid, pg->biginfo_oid) to -backfill and asynchronously remove the pg collections. We do not do -this inline because scanning the collections to remove the objects is -an expensive operation. - -OSDService::deleting_pgs tracks all pgs in the process of being -deleted. Each DeletingState object in deleting_pgs lives while at -least one reference to it remains. Each item in RemoveWQ carries a -reference to the DeletingState for the relevant pg such that -deleting_pgs.lookup(pgid) will return a null ref only if there are no -collections currently being deleted for that pg. - -The DeletingState for a pg also carries information about the status -of the current deletion and allows the deletion to be cancelled. -The possible states are: - - 1. QUEUED: the PG is in the RemoveWQ - 2. CLEARING_DIR: the PG's contents are being removed synchronously - 3. DELETING_DIR: the PG's directories and metadata are being queued for removal - 4. DELETED_DIR: the final removal transaction has been queued - 5. CANCELED: the deletion has been canceled - -In states 1 and 2, the deletion can be canceled. Each state transition -method (and check_canceled) returns false if deletion has been -canceled and true if the state transition was successful.
Similarly, -try_stop_deletion() returns true if it succeeds in canceling the -deletion. Additionally, if try_stop_deletion() fails to stop the -deletion, it will not return until the final removal transaction is -queued. This ensures that any operations queued after that point will -be ordered after the pg deletion. - -OSD::_create_lock_pg must handle two cases: - - 1. Either there is no DeletingStateRef for the pg, or it failed to cancel - 2. We succeeded in canceling the deletion. - -In case 1, we proceed as if there were no deletion occurring, except that -we avoid writing to the PG until the deletion finishes. In case 2, we -proceed as in case 1, except that we first mark the PG as backfilling. - -Similarly, OSD::osr_registry ensures that the OpSequencers for those -pgs can be reused for a new pg if created before the old one is fully -removed, ensuring that operations on the new pg are sequenced properly -with respect to operations on the old one. diff --git a/src/ceph/doc/dev/osd_internals/pgpool.rst b/src/ceph/doc/dev/osd_internals/pgpool.rst deleted file mode 100644 index 45a252b..0000000 --- a/src/ceph/doc/dev/osd_internals/pgpool.rst +++ /dev/null @@ -1,22 +0,0 @@ -================== -PGPool -================== - -PGPool is a structure used to manage and update the status of removed -snapshots. It does this by maintaining two fields: cached_removed_snaps, -the current removed snap set, and newly_removed_snaps, the snaps newly -removed in the last epoch. In OSD::load_pgs the osd map is recovered from -the pg's file store and passed down to OSD::_get_pool where a PGPool -object is initialised with the map. - -With each new map we receive we call PGPool::update with the new map. In that -function we build a list of newly removed snaps -(pg_pool_t::build_removed_snaps) and merge that with our cached_removed_snaps. -This function includes checks to make sure we only do this update when things -have changed or there has been a map gap.
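The per-epoch bookkeeping in PGPool::update can be sketched as follows. This is a hedged illustration: real Ceph stores these sets as interval_set<snapid_t>, and the function name and shape here are invented:

```python
# Illustrative sketch of the removed-snaps bookkeeping described above.
# cached_removed is the PGPool's cached_removed_snaps; map_removed is the
# removed snap set carried by the new osdmap epoch.

def update_removed_snaps(cached_removed, map_removed):
    """Return (new_cached, newly_removed) for one osdmap epoch."""
    # snaps present in the new map's removed set but not yet cached are
    # exactly the "newly removed snaps in the last epoch"
    newly_removed = map_removed - cached_removed
    return set(map_removed), newly_removed
```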
- -When we activate the pg we initialise the snap trim queue from -cached_removed_snaps and subtract the purged_snaps we have already purged -leaving us with the list of snaps that need to be trimmed. Trimming is later -performed asynchronously by the snap_trim_wq. - diff --git a/src/ceph/doc/dev/osd_internals/recovery_reservation.rst b/src/ceph/doc/dev/osd_internals/recovery_reservation.rst deleted file mode 100644 index 4ab0319..0000000 --- a/src/ceph/doc/dev/osd_internals/recovery_reservation.rst +++ /dev/null @@ -1,75 +0,0 @@ -==================== -Recovery Reservation -==================== - -Recovery reservation extends and subsumes backfill reservation. The -reservation system from backfill recovery is used for local and remote -reservations. - -When a PG goes active, first it determines what type of recovery is -necessary, if any. It may need log-based recovery, backfill recovery, -both, or neither. - -In log-based recovery, the primary first acquires a local reservation -from the OSDService's local_reserver. Then a MRemoteReservationRequest -message is sent to each replica in order of OSD number. These requests -will always be granted (i.e., cannot be rejected), but they may take -some time to be granted if the remotes have already granted all their -remote reservation slots. - -After all reservations are acquired, log-based recovery proceeds as it -would without the reservation system. - -After log-based recovery completes, the primary releases all remote -reservations. The local reservation remains held. The primary then -determines whether backfill is necessary. If it is not necessary, the -primary releases its local reservation and waits in the Recovered state -for all OSDs to indicate that they are clean. - -If backfill recovery occurs after log-based recovery, the local -reservation does not need to be reacquired since it is still held from -before. 
If it occurs immediately after activation (log-based recovery -not possible/necessary), the local reservation is acquired according to -the typical process. - -Once the primary has its local reservation, it requests a remote -reservation from the backfill target. This reservation CAN be rejected, -for instance if the OSD is too full (backfillfull_ratio osd setting). -If the reservation is rejected, the primary drops its local -reservation, waits (osd_backfill_retry_interval), and then retries. It -will retry indefinitely. - -Once the primary has the local and remote reservations, backfill -proceeds as usual. After backfill completes the remote reservation is -dropped. - -Finally, after backfill (or log-based recovery if backfill was not -necessary), the primary drops the local reservation and enters the -Recovered state. Once all the PGs have reported they are clean, the -primary enters the Clean state and marks itself active+clean. - - --------------- -Things to Note --------------- - -We always grab the local reservation first, to prevent a circular -dependency. We grab remote reservations in order of OSD number for the -same reason. - -The recovery reservation state chart controls the PG state as reported -to the monitor. The state chart can set: - - - recovery_wait: waiting for local/remote reservations - - recovering: recovering - - recovery_toofull: recovery stopped, OSD(s) above full ratio - - backfill_wait: waiting for remote backfill reservations - - backfilling: backfilling - - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio - - --------- -See Also --------- - -The Active substate of the automatically generated OSD state diagram. 
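The ordering rules above (local reservation first, remote reservations in OSD-number order, drop everything and retry on rejection) can be sketched as a simple simulation. `Reserver` is an invented stand-in for common/AsyncReserver.h, using counting slots in place of priority queues and finisher contexts:

```python
# Hedged sketch of reservation ordering for backfill, not the real
# state-machine code; all names are invented for illustration.

class Reserver:
    def __init__(self, max_slots):
        self.max_slots = max_slots
        self.held = set()

    def try_reserve(self, who):
        if len(self.held) >= self.max_slots:
            return False        # e.g. remote rejects: too full / no slots
        self.held.add(who)
        return True

    def release(self, who):
        self.held.discard(who)

def reserve_for_backfill(pgid, local, remotes):
    """remotes: {osd_id: Reserver}. True iff every reservation was granted."""
    if not local.try_reserve(pgid):          # always local first
        return False
    granted = []
    for osd_id in sorted(remotes):           # fixed order prevents deadlock
        if remotes[osd_id].try_reserve(pgid):
            granted.append(osd_id)
        else:                                # rejected: drop everything,
            for g in granted:                # caller retries after
                remotes[g].release(pgid)     # osd_backfill_retry_interval
            local.release(pgid)
            return False
    return True
```

Releasing the local reservation on rejection mirrors the text above: holding it while waiting to retry would tie up a local slot other PGs could use.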
diff --git a/src/ceph/doc/dev/osd_internals/scrub.rst b/src/ceph/doc/dev/osd_internals/scrub.rst deleted file mode 100644 index 3343b39..0000000 --- a/src/ceph/doc/dev/osd_internals/scrub.rst +++ /dev/null @@ -1,30 +0,0 @@ - -Scrubbing Behavior Table -======================== - -+-------------------------------------------------+----------+-----------+---------------+----------------------+ -| Flags                                           | none     | noscrub   | nodeep_scrub  | noscrub/nodeep_scrub | -+=================================================+==========+===========+===============+======================+ -| Periodic tick                                   | S        | X         | S             | X                    | -+-------------------------------------------------+----------+-----------+---------------+----------------------+ -| Periodic tick after osd_deep_scrub_interval     | D        | D         | S             | X                    | -+-------------------------------------------------+----------+-----------+---------------+----------------------+ -| Initiated scrub                                 | S        | S         | S             | S                    | -+-------------------------------------------------+----------+-----------+---------------+----------------------+ -| Initiated scrub after osd_deep_scrub_interval   | D        | D         | S             | S                    | -+-------------------------------------------------+----------+-----------+---------------+----------------------+ -| Initiated deep scrub                            | D        | D         | D             | D                    | -+-------------------------------------------------+----------+-----------+---------------+----------------------+ - -- X = Do nothing -- S = Do regular scrub -- D = Do deep scrub - -State variables --------------- - -- Periodic tick state is !must_scrub && !must_deep_scrub && !time_for_deep -- Periodic tick after osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep -- Initiated scrub state is must_scrub && !must_deep_scrub && !time_for_deep -- Initiated scrub after osd_deep_scrub_interval state is must_scrub && !must_deep_scrub && time_for_deep -- Initiated deep scrub state is must_scrub && must_deep_scrub diff --git a/src/ceph/doc/dev/osd_internals/snaps.rst 
b/src/ceph/doc/dev/osd_internals/snaps.rst deleted file mode 100644 index e17378f..0000000 --- a/src/ceph/doc/dev/osd_internals/snaps.rst +++ /dev/null @@ -1,128 +0,0 @@ -====== -Snaps -====== - -Overview -------- -Rados supports two related snapshotting mechanisms: - - 1. *pool snaps*: snapshots are implicitly applied to all objects - in a pool - 2. *self managed snaps*: the user must provide the current *SnapContext* - on each write. - -These two are mutually exclusive; only one or the other can be used on -a particular pool. - -The *SnapContext* is the set of snapshots currently defined for an object -as well as the most recent snapshot (the *seq*) requested from the mon for -sequencing purposes (a *SnapContext* with a newer *seq* is considered to -be more recent). - -The difference between *pool snaps* and *self managed snaps* from the -OSD's point of view lies in whether the *SnapContext* comes to the OSD -via the client's MOSDOp or via the most recent OSDMap. - -See OSD::make_writeable - -Ondisk Structures ----------------- -Each object has, in the pg collection, a *head* object (or *snapdir*, which we -will come to shortly) and possibly a set of *clone* objects. -Each hobject_t has a snap field. For the *head* (the only writeable version -of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the -snap field is set to the *seq* of the *SnapContext* at their creation. -When the OSD services a write, it first checks whether the most recent -*clone* is tagged with a snapid prior to the most recent snap represented -in the *SnapContext*. If so, at least one snapshot has occurred between -the time of the write and the time of the last clone. Therefore, prior -to performing the mutation, the OSD creates a new clone for servicing -reads on snaps between the snapid of the last clone and the most recent -snapid. - -The *head* object contains a *SnapSet* encoded in an attribute, which tracks - - 1. The full set of snaps defined for the object - 2. 
The full set of clones which currently exist - 3. Overlapping intervals between clones for tracking space usage - 4. Clone size - -If the *head* is deleted while there are still clones, a *snapdir* object -is created instead to house the *SnapSet*. - -Additionally, the *object_info_t* on each clone includes a vector of snaps -for which the clone is defined. - -Snap Removal ------------ -To remove a snapshot, a request is made to the *Monitor* cluster to -add the snapshot id to the list of purged snaps (or to remove it from -the set of pool snaps in the case of *pool snaps*). In either case, -the *PG* adds the snap to its *snap_trimq* for trimming. - -A clone can be removed when all of its snaps have been removed. In -order to determine which clones might need to be removed upon snap -removal, we maintain a mapping from snap to *hobject_t* using the -*SnapMapper*. - -See PrimaryLogPG::SnapTrimmer, SnapMapper - -This trimming is performed asynchronously by the snap_trim_wq while the -pg is clean and not scrubbing. - - #. The next snap in PG::snap_trimq is selected for trimming - #. We determine the next object for trimming out of PG::snap_mapper. - For each object, we create a log entry and repop updating the - object info and the snap set (including adjusting the overlaps). - If the object is a clone which no longer belongs to any live snapshots, - it is removed here. (See PrimaryLogPG::trim_object() when new_snaps - is empty.) - #. We also locally update our *SnapMapper* instance with the object's - new snaps. - #. The log entry containing the modification of the object also - contains the new set of snaps, which the replica uses to update - its own *SnapMapper* instance. - #. The primary shares the info with the replica, which persists - the new set of purged_snaps along with the rest of the info. 
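The trim steps above can be sketched as a walk over an invented snap-to-object mapping (standing in for the SnapMapper lookups); cf. PrimaryLogPG::trim_object(), which removes a clone outright when its new_snaps comes up empty:

```python
# Hedged sketch of per-snap trimming, not the real repop/log-entry flow.
# obj_snaps stands in for the object_info/SnapSet state of each clone.

def trim_snap(snap, obj_snaps):
    """obj_snaps: {object: set(snaps)}. Trim `snap`; return removed objects."""
    removed = []
    for obj in [o for o, snaps in obj_snaps.items() if snap in snaps]:
        new_snaps = obj_snaps[obj] - {snap}
        if new_snaps:
            obj_snaps[obj] = new_snaps   # update the object's snap set
        else:
            del obj_snaps[obj]           # clone is in no live snap: remove it
            removed.append(obj)
    return removed
```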
- - - -Recovery --------- -Because the trim operations are implemented using repops and log entries, -normal pg peering and recovery maintain the snap trimmer operations with -the caveat that push and removal operations need to update the local -*SnapMapper* instance. If the purged_snaps update is lost, we merely -retrim a now empty snap. - -SnapMapper ----------- -*SnapMapper* is implemented on top of map_cacher<string, bufferlist>, -which provides an interface over a backing store such as the filesystem -with async transactions. While transactions are incomplete, the map_cacher -instance buffers unstable keys allowing consistent access without having -to flush the filestore. *SnapMapper* provides two mappings: - - 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone - object - 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot - as one of its snaps - -Assumption: there are lots of hobjects and relatively few snaps. The -first encoding has a stringification of the object as the key and an -encoding of the set of snaps as a value. The second mapping, because there -might be many hobjects for a single snap, is stored as a collection of keys -of the form stringify(snap)_stringify(object) such that stringify(snap) -is constant length. These keys have a bufferlist encoding -pair<snapid, hobject_t> as a value. Thus, creating or trimming a single -object does not involve reading all objects for any snap. Additionally, -upon construction, the *SnapMapper* is provided with a mask for filtering -the objects in the single SnapMapper keyspace belonging to that pg. - -Split ------ -The snapid_t -> hobject_t key entries are arranged such that for any pg, -up to 8 prefixes need to be checked to determine all hobjects in a particular -snap for a particular pg. Upon split, the prefixes to check on the parent -are adjusted such that only the objects remaining in the pg will be visible. -The children will immediately have the correct mapping. 
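The key scheme described above can be illustrated as follows. The 16-hex-digit width is an assumption for illustration; the point is only that a constant-length snap prefix turns "all objects in snap S" into one contiguous key range, so trimming never scans unrelated snaps:

```python
# Hedged sketch of the SnapMapper snap->object keyspace described above;
# function names are invented, and the real keys carry a pg mask/prefix.

def snap_object_key(snap, obj):
    # constant-length stringify(snap), then the object name
    return "%016x_%s" % (snap, obj)

def objects_in_snap(keys, snap):
    # all keys for one snap share a prefix, so a sorted scan of that
    # prefix range yields exactly the objects in the snap
    prefix = "%016x_" % snap
    return [k[len(prefix):] for k in sorted(keys) if k.startswith(prefix)]
```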
diff --git a/src/ceph/doc/dev/osd_internals/watch_notify.rst b/src/ceph/doc/dev/osd_internals/watch_notify.rst deleted file mode 100644 index 8c2ce09..0000000 --- a/src/ceph/doc/dev/osd_internals/watch_notify.rst +++ /dev/null @@ -1,81 +0,0 @@ -============ -Watch Notify -============ - -See librados for the watch/notify interface. - -Overview --------- -The object_info (See osd/osd_types.h) tracks the set of watchers for -a particular object persistently in the object_info_t::watchers map. -In order to track notify progress, we also maintain some ephemeral -structures associated with the ObjectContext. - -Each Watch has an associated Watch object (See osd/Watch.h). The -ObjectContext for a watched object will have a (strong) reference -to one Watch object per watch, and each Watch object holds a -reference to the corresponding ObjectContext. This circular reference -is deliberate and is broken when the Watch state is discarded on -a new peering interval or removed upon timeout expiration or an -unwatch operation. - -A watch tracks the associated connection via a strong -ConnectionRef Watch::conn. The associated connection has a -WatchConState stashed in the OSD::Session for tracking associated -Watches in order to be able to notify them upon ms_handle_reset() -(via WatchConState::reset()). - -Each Watch object tracks the set of currently un-acked notifies. -start_notify() on a Watch object adds a reference to a new in-progress -Notify to the Watch and either: - -* if the Watch is *connected*, sends a Notify message to the client -* if the Watch is *unconnected*, does nothing. - -When the Watch becomes connected (in PrimaryLogPG::do_osd_op_effects), -Notifies are resent to all remaining tracked Notify objects. - -Each Notify object tracks the set of un-notified Watchers via -calls to complete_watcher(). Once the remaining set is empty or the -timeout expires (cb, registered in init()) a notify completion -is sent to the client. 
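The un-acked-watcher bookkeeping described above can be sketched like this (an invented Notify class, ignoring timers and message plumbing):

```python
# Hedged sketch of Notify completion tracking: a notify completes when
# every watcher has acked (complete_watcher) or when the timeout fires,
# in which case the completion is sent to the client early.

class Notify:
    def __init__(self, watchers):
        self.unacked = set(watchers)
        self.completed = False

    def complete_watcher(self, watcher):
        self.unacked.discard(watcher)
        if not self.unacked:
            self.completed = True    # all acks in: reply to the client
        return self.completed

    def timeout(self):
        # timer fired first: reply early, reporting who never acked
        self.completed = True
        return sorted(self.unacked)
```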
- -Watch Lifecycle ---------------- -A watch may be in one of five states: - -1. Nonexistent. -2. On disk, but not registered with an object context. -3. Connected. -4. Disconnected, callback registered with timer. -5. Disconnected, callback in queue for scrub or is_degraded. - -Case 2 occurs between when an OSD goes active and the ObjectContext -for an object with watchers is loaded into memory due to an access. -During Case 2, no state is registered for the watch. Case 2 -transitions to Case 4 in PrimaryLogPG::populate_obc_watchers() during -PrimaryLogPG::find_object_context. Case 1 becomes case 3 via -OSD::do_osd_op_effects due to a watch operation. Cases 4 and 5 become -case 3 in the same way. Case 3 becomes case 4 when the connection resets -on a watcher's session. - -Cases 4 and 5 deserve some explanation. Normally, when a Watch enters Case -4, a callback is registered with the OSDService::watch_timer to be -called at timeout expiration. At the time that the callback is -called, however, the pg might be in a state where it cannot write -to the object in order to remove the watch (i.e., during a scrub -or while the object is degraded). In that case, we use -Watch::get_delayed_cb() to generate another Context for use from -the callbacks_for_degraded_object and Scrubber::callbacks lists. -In either case, Watch::unregister_cb() does the right thing -(SafeTimer::cancel_event() is harmless for contexts not registered -with the timer). - -Notify Lifecycle ---------------- -The notify timeout is simpler: a timeout callback is registered when -the notify is init()'d. If all watchers ack notifies before the -timeout occurs, the timeout is canceled and the client is notified -of the notify completion. Otherwise, the timeout fires, the Notify -object pings each Watch via cancel_notify to remove itself, and -sends the notify completion to the client early. 
diff --git a/src/ceph/doc/dev/osd_internals/wbthrottle.rst b/src/ceph/doc/dev/osd_internals/wbthrottle.rst deleted file mode 100644 index a3ae00d..0000000 --- a/src/ceph/doc/dev/osd_internals/wbthrottle.rst +++ /dev/null @@ -1,28 +0,0 @@ -================== -Writeback Throttle -================== - -Previously, the filestore had a problem when handling large numbers of -small ios. We throttle dirty data implicitly via the journal, but -a large number of inodes can be dirtied without filling the journal, -resulting in a very long sync time when the sync finally does happen. -The flusher was not an adequate solution to this problem since it -forced writeback of small writes too eagerly, killing performance. - -WBThrottle tracks unflushed io per hobject_t and ::fsyncs in lru -order once the start_flusher threshold is exceeded for any of -dirty bytes, dirty ios, or dirty inodes. While any of these exceed -the hard_limit, we block on throttle() in _do_op. - -See src/os/WBThrottle.h, src/os/WBThrottle.cc - -To track the open FDs through the writeback process, there is now an -fdcache to cache open fds. lfn_open now returns a cached FDRef which -implicitly closes the fd once all references have expired. - -Filestore syncs have a side effect of flushing all outstanding objects -in the wbthrottle. - -lfn_unlink clears the cached FDRef and wbthrottle entries for the -unlinked object when the last link is removed and asserts that all -outstanding FDRefs for that object are dead.
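The lru-order flushing described above can be sketched with an ordered map. The names and the single-io-count threshold are invented simplifications (the real WBThrottle tracks separate byte/io/inode limits per object):

```python
from collections import OrderedDict

# Hedged sketch: writes refresh an object's recency, and the background
# flusher fsyncs the least recently written objects first once the
# start_flusher threshold is crossed.

class LruFlusher:
    def __init__(self, start_flusher_ios):
        self.start_flusher_ios = start_flusher_ios
        self.pending = OrderedDict()   # object -> dirty io count, LRU order

    def queue_wb(self, obj):
        # pop + reinsert moves the object to the most-recently-written end
        self.pending[obj] = self.pending.pop(obj, 0) + 1

    def flush_one(self):
        # fsync the least recently written object once the threshold is hit
        if sum(self.pending.values()) < self.start_flusher_ios:
            return None
        obj, _ = self.pending.popitem(last=False)
        return obj
```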