======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance, if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair.

The *lrc* erasure code plugin creates local parity chunks that make it
possible to recover using fewer OSDs. For instance, if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.
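
Such a profile stores fifteen chunks per object in total: eight data
chunks, four coding chunks and one local parity chunk per group of
four, so at least fifteen OSDs are required. A profile along these
lines could be created with (a sketch; the profile name is
arbitrary)::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             k=8 m=4 l=4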

Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             k=4 m=2 l=3 \
             crush-failure-domain=host
        $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
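
With **k=4**, **m=2** and **l=3** the profile stores eight chunks per
object: the six data and coding chunks plus one local parity chunk per
group of three. The profile stored above can be displayed with::

        $ ceph osd erasure-code-profile get LRCprofile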


Reduce recovery bandwidth between racks
---------------------------------------

In Firefly, the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             k=4 m=2 l=3 \
             crush-locality=rack \
             crush-failure-domain=host
        $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
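
The crush ruleset generated for the pool (usually named after the pool;
the name below is an assumption) can be inspected to verify that chunks
are spread over two racks and distinct hosts::

        $ ceph osd crush rule ls
        $ ceph osd crush rule dump lrcpool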


Create an lrc profile
=====================

To create a new lrc erasure code profile::

        ceph osd erasure-code-profile set {name} \
             plugin=lrc \
             k={data-chunks} \
             m={coding-chunks} \
             l={locality} \
             [crush-root={root}] \
             [crush-locality={bucket-type}] \
             [crush-failure-domain={bucket-type}] \
             [crush-device-class={device-class}] \
             [directory={directory}] \
             [--force]

Where:

``k={data chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

:Type: Integer
:Required: Yes.
:Example: 3

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

:Type: String
:Required: No.

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host**, no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.
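
Combining several of the optional settings, a profile that keeps each
group of **l** chunks in its own rack, never places two chunks on the
same host and only uses rotational devices could look as follows (a
sketch; it assumes the crush map contains **rack** buckets and devices
of class **hdd**)::

        ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             k=4 m=2 l=3 \
             crush-locality=rack \
             crush-failure-domain=host \
             crush-device-class=hdd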

Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks within
datacenters. The **k/m/l** parameters are implemented by generating a
low level configuration.
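
For instance, the two uneven groups could be expressed with a low level
configuration along these lines (a sketch; the mapping and layers
follow the syntax shown in the sections below and the profile name is
arbitrary). The first group has one local parity chunk and three data
chunks, the second one local parity chunk and two data chunks::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             mapping=_DDD_DD \
             layers='[
                       [ "cDDD___", "" ],
                       [ "____cDD", "" ]
                     ]'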

The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks, and the
loss of chunk *2* can be recovered with the first four
chunks.

Erasure code profile examples using low level configuration
===========================================================

Minimal testing
---------------

This is strictly equivalent to using the default erasure code profile:
the *DD* implies *k=2*, the *c* implies *m=1* and the *jerasure* plugin
is used by default::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             mapping=DD_ \
             layers='[ [ "DDc", "" ] ]'
        $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
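
A quick way to check that the pool is usable is to write an object and
read it back; the object name and files used here are arbitrary::

        $ rados --pool lrcpool put HELLO /etc/hosts
        $ rados --pool lrcpool get HELLO /tmp/HELLO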

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             mapping=__DD__DD \
             layers='[
                       [ "_cDD_cDD", "" ],
                       [ "cDDD____", "" ],
                       [ "____cDDD", "" ],
                     ]'
        $ ceph osd pool create lrcpool 12 12 erasure LRCprofile


Reduce recovery bandwidth between racks
---------------------------------------

In Firefly, the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             mapping=__DD__DD \
             layers='[
                       [ "_cDD_cDD", "" ],
                       [ "cDDD____", "" ],
                       [ "____cDDD", "" ],
                     ]' \
             crush-steps='[
                             [ "choose", "rack", 2 ],
                             [ "chooseleaf", "host", 4 ],
                            ]'
        $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per-layer basis using the low
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             mapping=DD_ \
             layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
        $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each
layer::

        $ ceph osd erasure-code-profile set LRCprofile \
             plugin=lrc \
             mapping=__DD__DD \
             layers='[
                       [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                       [ "cDDD____", "plugin=isa" ],
                       [ "____cDDD", "plugin=jerasure" ],
                     ]'
        $ ceph osd pool create lrcpool 12 12 erasure LRCprofile



Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in chunks 1 and 5, respectively.

*Step 2* re-uses the content written by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

*Step 3* stores a single coding chunk *c* at position 4. The three
chunks at positions 5, 6 and 7, written by *step 1*, are used to compute
this coding chunk, i.e. the coding chunk from *step 1* becomes a data
chunk in *step 3*.

If chunk *2* is lost::

   chunk nr    01234567

   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

*Step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows the content
of chunk *2* to be recovered. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

   chunk nr    01234567

   step 1      _c  _c D
   step 2      cD  __ _
   step 3      __  cD D

*Step 3* can recover the content of chunk *6*::

   chunk nr    01234567

   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

*Step 2* fails to recover and is skipped because two chunks are
missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow the
content of chunks *2* and *3* to be recovered::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

   chunk nr    01234567

   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that selects two crush buckets of type
*rack* and, for each of them, chooses four OSDs, each located in a
different bucket of type *host*.

The ruleset can also be manually crafted for finer control.
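
For illustration only, a hand crafted rule equivalent to the crush-steps
above could look like the following fragment of a decompiled crush map;
the rule name, numeric id and size bounds are placeholders and the exact
syntax depends on the Ceph release::

   rule lrcpool {
           id 1
           type erasure
           min_size 3
           max_size 20
           step set_chooseleaf_tries 5
           step take default
           step choose indep 2 type rack
           step chooseleaf indep 4 type host
           step emit
   }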