==========================
 Placement Group Concepts
==========================

When you execute commands like ``ceph -w``, ``ceph osd dump``, and other 
commands related to placement groups, Ceph may return values using some
of the following terms: 

*Peering*
   The process of bringing all of the OSDs that store
   a Placement Group (PG) into agreement about the state
   of all of the objects (and their metadata) in that PG.
   Note that agreeing on the state does not mean that
   they all have the latest contents.

*Acting Set*
   The ordered list of OSDs that are (or were as of some epoch)
   responsible for a particular placement group.

*Up Set*
   The ordered list of OSDs responsible for a particular placement
   group for a particular epoch according to CRUSH. Normally this
   is the same as the *Acting Set*, except when the *Acting Set* has 
   been explicitly overridden via ``pg_temp`` in the OSD Map.
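
The relationship between the *Up Set* and the *Acting Set* can be sketched
as a lookup with an optional override. This is a simplified model, not Ceph's
implementation: the function name ``acting_set``, the dictionary standing in
for the ``pg_temp`` entries of the OSD map, and the PG and OSD identifiers are
all assumptions made for illustration.

```python
def acting_set(pgid, up_set, pg_temp):
    """Return the acting set for a PG (simplified model).

    pgid    -- placement group id (hypothetical string form such as "1.2f")
    up_set  -- ordered list of OSD ids chosen by CRUSH for this epoch
    pg_temp -- dict mapping pgid -> explicit override list (assumed layout)
    """
    override = pg_temp.get(pgid)
    # With no override, the acting set is simply the up set.
    return list(override) if override else list(up_set)


up = [3, 1, 5]
overrides = {"1.2f": [1, 5, 3]}

print(acting_set("1.2f", up, overrides))  # override applies
print(acting_set("1.30", up, overrides))  # falls back to the up set
```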

*Current Interval* or *Past Interval*
   A sequence of OSD map epochs during which the *Acting Set* and *Up
   Set* for a particular placement group do not change.
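
As an illustration of this definition, the sketch below groups a per-epoch
history into intervals. It only compares the *Up Set* and *Acting Set*, as the
definition above states; a real OSD may take further conditions into account.
The function name ``split_intervals`` and the tuple layout are hypothetical.

```python
def split_intervals(history):
    """Group consecutive epochs with identical up and acting sets.

    history -- list of (epoch, up_set, acting_set), sorted by epoch
    Returns a list of (first_epoch, last_epoch, up_set, acting_set).
    """
    intervals = []
    for epoch, up, acting in history:
        if intervals and intervals[-1][2] == up and intervals[-1][3] == acting:
            # Same sets as the previous epoch: extend the open interval.
            first, _, u, a = intervals[-1]
            intervals[-1] = (first, epoch, u, a)
        else:
            # Either set changed: a new interval starts at this epoch.
            intervals.append((epoch, epoch, up, acting))
    return intervals


history = [
    (10, [0, 1], [0, 1]),
    (11, [0, 1], [0, 1]),
    (12, [0], [0]),
    (13, [0], [0]),
    (14, [1], [1]),
]
for interval in split_intervals(history):
    print(interval)
```

The last tuple returned corresponds to the *Current Interval*; the earlier
ones are *Past Intervals*.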

*Primary*
   The member (by convention, the first) of the *Acting Set* that is
   responsible for coordinating peering, and the only OSD that
   accepts client-initiated writes to objects in a placement group.

*Replica*
   A non-primary OSD in the *Acting Set* for a placement group
   (and that has been recognized as such and *activated* by the primary).

*Stray*
   An OSD that is not a member of the current *Acting Set*, but
   has not yet been told that it can delete its copies of a
   particular placement group.

*Recovery*
   Ensuring that copies of all of the objects in a placement group
   are on all of the OSDs in the *Acting Set*.  Once *Peering* has 
   been performed, the *Primary* can start accepting write operations, 
   and *Recovery* can proceed in the background.

*PG Info* 
   Basic metadata about the placement group's creation epoch, the version
   for the most recent write to the placement group, *last epoch started*, 
   *last epoch clean*, and the beginning of the *current interval*.  Any
   inter-OSD communication about placement groups includes the *PG Info*, 
   such that any OSD that knows a placement group exists (or once existed) 
   also has a lower bound on *last epoch clean* or *last epoch started*.

*PG Log*
   A list of recent updates made to objects in a placement group.
   Note that these logs can be truncated after all OSDs
   in the *Acting Set* have acknowledged up to a certain
   point.
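
The truncation rule can be pictured with the following sketch. It assumes
that versions are plain integers and that each OSD in the *Acting Set* reports
the highest version it has acknowledged; both are simplifications, and the
names ``trim_log`` and ``acked_versions`` are hypothetical. The actual
trimming policy in Ceph depends on further limits not modeled here.

```python
def trim_log(log, acked_versions):
    """Drop log entries that every OSD in the acting set has acknowledged.

    log            -- list of (version, object_name), sorted by version
    acked_versions -- dict mapping OSD id -> highest acknowledged version
    """
    if not acked_versions:
        return list(log)
    # Entries up to the slowest OSD's acknowledgement are safe to discard.
    safe = min(acked_versions.values())
    return [entry for entry in log if entry[0] > safe]


log = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
acked = {0: 4, 1: 2, 2: 3}
print(trim_log(log, acked))
```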

*Missing Set*
   Each OSD examines the update log entries and, if they imply updates to
   the contents of an object, adds that object to a list of needed
   updates.  This list is called the *Missing Set* for that ``<OSD,PG>``.
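
A minimal sketch of how such a list could be derived from log entries is
shown below. It assumes integer versions and a local map from object name to
the version currently stored; the names ``build_missing_set`` and
``local_versions`` are invented for this example and do not correspond to
Ceph's internal API.

```python
def build_missing_set(log_entries, local_versions):
    """Compute which objects an OSD still needs for one PG.

    log_entries    -- iterable of (version, object_name) in log order
    local_versions -- dict mapping object_name -> version held locally
    Returns a dict mapping object_name -> newest version needed.
    """
    missing = {}
    for version, obj in log_entries:
        # An entry newer than the local copy implies a needed update.
        if local_versions.get(obj, 0) < version:
            missing[obj] = version
    return missing


log = [(5, "a"), (6, "b"), (7, "a")]
local = {"a": 5, "b": 6}
print(build_missing_set(log, local))
```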

*Authoritative History*
   A complete and fully ordered set of operations that, if
   performed, would bring an OSD's copy of a placement group
   up to date.

*Epoch*
   A (monotonically increasing) OSD map version number.

*Last Epoch Started*
   The last epoch at which all nodes in the *Acting Set*
   for a particular placement group agreed on an
   *Authoritative History*.  At this point, *Peering* is
   deemed to have been successful.

*up_thru*
   Before a *Primary* can successfully complete the *Peering* process,
   it must inform a monitor that it is alive through the current
   OSD map *Epoch* by having the monitor set its *up_thru* in the OSD
   map.  This helps *Peering* ignore previous *Acting Sets* for which
   *Peering* never completed after certain sequences of failures, such as
   the second interval below:

   - *acting set* = [A,B]
   - *acting set* = [A]
   - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
   - *acting set* = [B] (B restarts, A does not)
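
The sequence above can be modeled with a small check that decides whether a
past interval could have accepted writes. The criterion used here (the
primary's recorded *up_thru* must be at least the first epoch of the interval)
is a simplification of the idea described in this entry, and the epoch
numbers, the ``up_thru`` values, and the function name
``may_have_accepted_writes`` are hypothetical.

```python
def may_have_accepted_writes(interval, up_thru):
    """Return True if peering could have completed in this interval.

    interval -- (first_epoch, acting_set) for one past interval
    up_thru  -- dict mapping OSD name -> up_thru recorded in the OSD map
    """
    first_epoch, acting = interval
    if not acting:
        return False  # nobody was responsible for the PG
    primary = acting[0]
    # The primary must have been marked alive through this interval's start.
    return up_thru.get(primary, 0) >= first_epoch


# Hypothetical epochs for the four acting sets listed above.
intervals = [(1, ["A", "B"]), (5, ["A"]), (6, []), (7, ["B"])]
# A confirmed up_thru only for the first interval, then failed.
up_thru = {"A": 1, "B": 0}

for interval in intervals[:-1]:
    print(interval, may_have_accepted_writes(interval, up_thru))
```

Under these assumptions the second interval is reported as never having gone
active, so B does not need to wait for A before peering on its own.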

*Last Epoch Clean*
   The last *Epoch* at which all nodes in the *Acting Set*
   for a particular placement group were completely
   up to date (both placement group logs and object contents).
   At this point, *Recovery* is deemed to have been
   completed.