summaryrefslogtreecommitdiffstats
path: root/src/ceph/doc/dev/peering.rst
diff options
context:
space:
mode:
authorQiaowei Ren <qiaowei.ren@intel.com>2018-01-04 13:43:33 +0800
committerQiaowei Ren <qiaowei.ren@intel.com>2018-01-05 11:59:39 +0800
commit812ff6ca9fcd3e629e49d4328905f33eee8ca3f5 (patch)
tree04ece7b4da00d9d2f98093774594f4057ae561d4 /src/ceph/doc/dev/peering.rst
parent15280273faafb77777eab341909a3f495cf248d9 (diff)
initial code repo
This patch creates initial code repo. For ceph, luminous stable release will be used for base code, and next changes and optimization for ceph will be added to it. For opensds, currently any changes can be upstreamed into original opensds repo (https://github.com/opensds/opensds), and so stor4nfv will directly clone opensds code to deploy stor4nfv environment. And the scripts for deployment based on ceph and opensds will be put into 'ci' directory. Change-Id: I46a32218884c75dda2936337604ff03c554648e4 Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Diffstat (limited to 'src/ceph/doc/dev/peering.rst')
-rw-r--r--src/ceph/doc/dev/peering.rst259
1 files changed, 259 insertions, 0 deletions
diff --git a/src/ceph/doc/dev/peering.rst b/src/ceph/doc/dev/peering.rst
new file mode 100644
index 0000000..7ee5deb
--- /dev/null
+++ b/src/ceph/doc/dev/peering.rst
@@ -0,0 +1,259 @@
+======================
+Peering
+======================
+
+Concepts
+--------
+
+*Peering*
+ the process of bringing all of the OSDs that store
+ a Placement Group (PG) into agreement about the state
+ of all of the objects (and their metadata) in that PG.
+ Note that agreeing on the state does not mean that
+ they all have the latest contents.
+
+*Acting set*
+ the ordered list of OSDs who are (or were as of some epoch)
+ responsible for a particular PG.
+
+*Up set*
+ the ordered list of OSDs responsible for a particular PG for
+ a particular epoch according to CRUSH. Normally this
+ is the same as the *acting set*, except when the *acting set* has been
+ explicitly overridden via *PG temp* in the OSDMap.
+
+*PG temp*
+ a temporary placement group acting set used while backfilling the
+ primary osd. Let say acting is [0,1,2] and we are
+ active+clean. Something happens and acting is now [3,1,2]. osd 3 is
+ empty and can't serve reads although it is the primary. osd.3 will
+ see that and request a *PG temp* of [1,2,3] to the monitors using a
+ MOSDPGTemp message so that osd.1 temporarily becomes the
+ primary. It will select osd.3 as a backfill peer and continue to
+ serve reads and writes while osd.3 is backfilled. When backfilling
+ is complete, *PG temp* is discarded and the acting set changes back
+ to [3,1,2] and osd.3 becomes the primary.
+
+*current interval* or *past interval*
+ a sequence of OSD map epochs during which the *acting set* and *up
+ set* for particular PG do not change
+
+*primary*
+ the (by convention first) member of the *acting set*,
+ who is responsible for coordination peering, and is
+ the only OSD that will accept client initiated
+ writes to objects in a placement group.
+
+*replica*
+ a non-primary OSD in the *acting set* for a placement group
+ (and who has been recognized as such and *activated* by the primary).
+
+*stray*
+ an OSD who is not a member of the current *acting set*, but
+ has not yet been told that it can delete its copies of a
+ particular placement group.
+
+*recovery*
+ ensuring that copies of all of the objects in a PG
+ are on all of the OSDs in the *acting set*. Once
+ *peering* has been performed, the primary can start
+ accepting write operations, and *recovery* can proceed
+ in the background.
+
+*PG info* basic metadata about the PG's creation epoch, the version
+ for the most recent write to the PG, *last epoch started*, *last
+ epoch clean*, and the beginning of the *current interval*. Any
+ inter-OSD communication about PGs includes the *PG info*, such that
+ any OSD that knows a PG exists (or once existed) also has a lower
+ bound on *last epoch clean* or *last epoch started*.
+
+*PG log*
+ a list of recent updates made to objects in a PG.
+ Note that these logs can be truncated after all OSDs
+ in the *acting set* have acknowledged up to a certain
+ point.
+
+*missing set*
+ Each OSD notes update log entries and if they imply updates to
+ the contents of an object, adds that object to a list of needed
+ updates. This list is called the *missing set* for that <OSD,PG>.
+
+*Authoritative History*
+ a complete, and fully ordered set of operations that, if
+ performed, would bring an OSD's copy of a Placement Group
+ up to date.
+
+*epoch*
+ a (monotonically increasing) OSD map version number
+
+*last epoch start*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group agreed on an
+ *authoritative history*. At this point, *peering* is
+ deemed to have been successful.
+
+*up_thru*
+ before a primary can successfully complete the *peering* process,
+ it must inform a monitor that is alive through the current
+ OSD map epoch by having the monitor set its *up_thru* in the osd
+ map. This helps peering ignore previous *acting sets* for which
+ peering never completed after certain sequences of failures, such as
+ the second interval below:
+
+ - *acting set* = [A,B]
+ - *acting set* = [A]
+ - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+ - *acting set* = [B] (B restarts, A does not)
+
+*last epoch clean*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group were completely
+ up to date (both PG logs and object contents).
+ At this point, *recovery* is deemed to have been
+ completed.
+
+Description of the Peering Process
+----------------------------------
+
+The *Golden Rule* is that no write operation to any PG
+is acknowledged to a client until it has been persisted
+by all members of the *acting set* for that PG. This means
+that if we can communicate with at least one member of
+each *acting set* since the last successful *peering*, someone
+will have a record of every (acknowledged) operation
+since the last successful *peering*.
+This means that it should be possible for the current
+primary to construct and disseminate a new *authoritative history*.
+
+It is also important to appreciate the role of the OSD map
+(list of all known OSDs and their states, as well as some
+information about the placement groups) in the *peering*
+process:
+
+ When OSDs go up or down (or get added or removed)
+ this has the potential to affect the *active sets*
+ of many placement groups.
+
+ Before a primary successfully completes the *peering*
+ process, the OSD map must reflect that the OSD was alive
+ and well as of the first epoch in the *current interval*.
+
+ Changes can only be made after successful *peering*.
+
+Thus, a new primary can use the latest OSD map along with a recent
+history of past maps to generate a set of *past intervals* to
+determine which OSDs must be consulted before we can successfully
+*peer*. The set of past intervals is bounded by *last epoch started*,
+the most recent *past interval* for which we know *peering* completed.
+The process by which an OSD discovers a PG exists in the first place is
+by exchanging *PG info* messages, so the OSD always has some lower
+bound on *last epoch started*.
+
+The high level process is for the current PG primary to:
+
+ 1. get a recent OSD map (to identify the members of the all
+ interesting *acting sets*, and confirm that we are still the
+ primary).
+
+ #. generate a list of *past intervals* since *last epoch started*.
+ Consider the subset of those for which *up_thru* was greater than
+ the first interval epoch by the last interval epoch's OSD map; that is,
+ the subset for which *peering* could have completed before the *acting
+ set* changed to another set of OSDs.
+
+ Successful *peering* will require that we be able to contact at
+ least one OSD from each of *past interval*'s *acting set*.
+
+ #. ask every node in that list for its *PG info*, which includes the most
+ recent write made to the PG, and a value for *last epoch started*. If
+ we learn about a *last epoch started* that is newer than our own, we can
+ prune older *past intervals* and reduce the peer OSDs we need to contact.
+
+ #. if anyone else has (in its PG log) operations that I do not have,
+ instruct them to send me the missing log entries so that the primary's
+ *PG log* is up to date (includes the newest write)..
+
+ #. for each member of the current *acting set*:
+
+ a. ask it for copies of all PG log entries since *last epoch start*
+ so that I can verify that they agree with mine (or know what
+ objects I will be telling it to delete).
+
+ If the cluster failed before an operation was persisted by all
+ members of the *acting set*, and the subsequent *peering* did not
+ remember that operation, and a node that did remember that
+ operation later rejoined, its logs would record a different
+ (divergent) history than the *authoritative history* that was
+ reconstructed in the *peering* after the failure.
+
+ Since the *divergent* events were not recorded in other logs
+ from that *acting set*, they were not acknowledged to the client,
+ and there is no harm in discarding them (so that all OSDs agree
+ on the *authoritative history*). But, we will have to instruct
+ any OSD that stores data from a divergent update to delete the
+ affected (and now deemed to be apocryphal) objects.
+
+ #. ask it for its *missing set* (object updates recorded
+ in its PG log, but for which it does not have the new data).
+ This is the list of objects that must be fully replicated
+ before we can accept writes.
+
+ #. at this point, the primary's PG log contains an *authoritative history* of
+ the placement group, and the OSD now has sufficient
+ information to bring any other OSD in the *acting set* up to date.
+
+ #. if the primary's *up_thru* value in the current OSD map is not greater than
+ or equal to the first epoch in the *current interval*, send a request to the
+ monitor to update it, and wait until receive an updated OSD map that reflects
+ the change.
+
+ #. for each member of the current *acting set*:
+
+ a. send them log updates to bring their PG logs into agreement with
+ my own (*authoritative history*) ... which may involve deciding
+ to delete divergent objects.
+
+ #. await acknowledgment that they have persisted the PG log entries.
+
+ #. at this point all OSDs in the *acting set* agree on all of the meta-data,
+ and would (in any future *peering*) return identical accounts of all
+ updates.
+
+ a. start accepting client write operations (because we have unanimous
+ agreement on the state of the objects into which those updates are
+ being accepted). Note, however, that if a client tries to write to an
+ object it will be promoted to the front of the recovery queue, and the
+ write willy be applied after it is fully replicated to the current *acting set*.
+
+ #. update the *last epoch started* value in our local *PG info*, and instruct
+ other *active set* OSDs to do the same.
+
+ #. start pulling object data updates that other OSDs have, but I do not. We may
+ need to query OSDs from additional *past intervals* prior to *last epoch started*
+ (the last time *peering* completed) and following *last epoch clean* (the last epoch that
+ recovery completed) in order to find copies of all objects.
+
+ #. start pushing object data updates to other OSDs that do not yet have them.
+
+ We push these updates from the primary (rather than having the replicas
+ pull them) because this allows the primary to ensure that a replica has
+ the current contents before sending it an update write. It also makes
+ it possible for a single read (from the primary) to be used to write
+ the data to multiple replicas. If each replica did its own pulls,
+ the data might have to be read multiple times.
+
+ #. once all replicas store the all copies of all objects (that
+ existed prior to the start of this epoch) we can update *last
+ epoch clean* in the *PG info*, and we can dismiss all of the
+ *stray* replicas, allowing them to delete their copies of objects
+ for which they are no longer in the *acting set*.
+
+ We could not dismiss the *strays* prior to this because it was possible
+ that one of those *strays* might hold the sole surviving copy of an
+ old object (all of whose copies disappeared before they could be
+ replicated on members of the current *acting set*).
+
+State Model
+-----------
+
+.. graphviz:: peering_graph.generated.dot