author: Qiaowei Ren <qiaowei.ren@intel.com>, 2018-01-04 13:43:33 +0800
committer: Qiaowei Ren <qiaowei.ren@intel.com>, 2018-01-05 11:59:39 +0800
commit: 812ff6ca9fcd3e629e49d4328905f33eee8ca3f5
tree: 04ece7b4da00d9d2f98093774594f4057ae561d4 /src/ceph/doc/rados/operations
parent: 15280273faafb77777eab341909a3f495cf248d9
initial code repo
This patch creates the initial code repo. For Ceph, the Luminous stable release is used as the base code, and subsequent changes and optimizations for Ceph will be added to it. For OpenSDS, any changes can currently be upstreamed into the original opensds repo (https://github.com/opensds/opensds), so stor4nfv will clone the opensds code directly to deploy the stor4nfv environment. The deployment scripts based on Ceph and OpenSDS will be put into the 'ci' directory.

Change-Id: I46a32218884c75dda2936337604ff03c554648e4
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Diffstat (limited to 'src/ceph/doc/rados/operations')
-rw-r--r-- src/ceph/doc/rados/operations/add-or-rm-mons.rst | 370
-rw-r--r-- src/ceph/doc/rados/operations/add-or-rm-osds.rst | 366
-rw-r--r-- src/ceph/doc/rados/operations/cache-tiering.rst | 461
-rw-r--r-- src/ceph/doc/rados/operations/control.rst | 453
-rw-r--r-- src/ceph/doc/rados/operations/crush-map-edits.rst | 654
-rw-r--r-- src/ceph/doc/rados/operations/crush-map.rst | 956
-rw-r--r-- src/ceph/doc/rados/operations/data-placement.rst | 37
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-isa.rst | 105
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-jerasure.rst | 120
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-lrc.rst | 371
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-profile.rst | 121
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-shec.rst | 144
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code.rst | 195
-rw-r--r-- src/ceph/doc/rados/operations/health-checks.rst | 527
-rw-r--r-- src/ceph/doc/rados/operations/index.rst | 90
-rw-r--r-- src/ceph/doc/rados/operations/monitoring-osd-pg.rst | 617
-rw-r--r-- src/ceph/doc/rados/operations/monitoring.rst | 351
-rw-r--r-- src/ceph/doc/rados/operations/operating.rst | 251
-rw-r--r-- src/ceph/doc/rados/operations/pg-concepts.rst | 102
-rw-r--r-- src/ceph/doc/rados/operations/pg-repair.rst | 4
-rw-r--r-- src/ceph/doc/rados/operations/pg-states.rst | 80
-rw-r--r-- src/ceph/doc/rados/operations/placement-groups.rst | 469
-rw-r--r-- src/ceph/doc/rados/operations/pools.rst | 798
-rw-r--r-- src/ceph/doc/rados/operations/upmap.rst | 75
-rw-r--r-- src/ceph/doc/rados/operations/user-management.rst | 665
25 files changed, 8382 insertions, 0 deletions
diff --git a/src/ceph/doc/rados/operations/add-or-rm-mons.rst b/src/ceph/doc/rados/operations/add-or-rm-mons.rst
new file mode 100644
index 0000000..0cdc431
--- /dev/null
+++ b/src/ceph/doc/rados/operations/add-or-rm-mons.rst
@@ -0,0 +1,370 @@
+==========================
+ Adding/Removing Monitors
+==========================
+
+When you have a cluster up and running, you may add or remove monitors
+from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_
+or `Monitor Bootstrap`_.
+
+Adding Monitors
+===============
+
+Ceph monitors are light-weight processes that maintain a master copy of the
+cluster map. You can run a cluster with 1 monitor. We recommend at least 3
+monitors for a production cluster. Ceph monitors use a variation of the
+`Paxos`_ protocol to establish consensus about maps and other critical
+information across the cluster. Due to the nature of Paxos, Ceph requires
+a majority of monitors running to establish a quorum (thus establishing
+consensus).
+
+It is advisable, but not mandatory, to run an odd number of monitors. Adding
+a monitor to reach an even number does not increase the number of failures
+the cluster can tolerate. For instance, in a 2 monitor deployment, no
+failures can be tolerated while maintaining a quorum; with 3 monitors,
+one failure can be tolerated; in a 4 monitor deployment, one failure can
+still be tolerated; with 5 monitors, two failures can be tolerated. This is
+why an odd number is advisable. In summary, Ceph needs a majority of
+monitors to be running (and able to communicate with each other), and that
+majority can be achieved using a single monitor, or 2 out of 2 monitors,
+2 out of 3, 3 out of 4, etc.
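+
+The majority rule can be sketched as a short shell calculation. This is an
+illustrative sketch only: the function names ``quorum_size`` and
+``tolerated_failures`` are hypothetical helpers, not part of Ceph. ::
+
+    # Smallest number of monitors that forms a majority of n monitors.
+    quorum_size() {
+        echo $(( $1 / 2 + 1 ))
+    }
+
+    # Number of monitors that can fail while a majority remains.
+    tolerated_failures() {
+        echo $(( ($1 - 1) / 2 ))
+    }
+
+    for n in 1 2 3 4 5; do
+        echo "$n monitors: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n")"
+    done
+
+Going from 3 to 4 monitors raises the quorum size from 2 to 3 while the
+number of tolerated failures stays at 1.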
+
+For an initial deployment of a multi-node Ceph cluster, it is advisable to
+deploy three monitors, increasing the number two at a time if a valid need
+for more than three exists.
+
+Since monitors are light-weight, it is possible to run them on the same
+host as an OSD; however, we recommend running them on separate hosts,
+because fsync issues with the kernel may impair performance.
+
+.. note:: A *majority* of monitors in your cluster must be able to
+ reach each other in order to establish a quorum.
+
+Deploy your Hardware
+--------------------
+
+If you are adding a new host when adding a new monitor, see `Hardware
+Recommendations`_ for details on minimum recommendations for monitor hardware.
+To add a monitor host to your cluster, first make sure you have an up-to-date
+version of Linux installed (typically Ubuntu 14.04 or RHEL 7).
+
+Add your monitor host to a rack in your cluster, connect it to the network
+and ensure that it has network connectivity.
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+
+Install the Required Software
+-----------------------------
+
+For manually deployed clusters, you must install Ceph packages
+manually. See `Installing Packages`_ for details.
+You should configure SSH to a user with password-less authentication
+and root permissions.
+
+.. _Installing Packages: ../../../install/install-storage-cluster
+
+
+.. _Adding a Monitor (Manual):
+
+Adding a Monitor (Manual)
+-------------------------
+
+This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map
+and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If
+this results in only two monitor daemons, you may add more monitors by
+repeating this procedure until you have a sufficient number of ``ceph-mon``
+daemons to achieve a quorum.
+
+At this point you should define your monitor's id. Traditionally, monitors
+have been named with single letters (``a``, ``b``, ``c``, ...), but you are
+free to define the id as you see fit. For the purpose of this document,
+please take into account that ``{mon-id}`` should be the id you chose,
+without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a``
+in ``mon.a``).
+
+#. Create the default directory on the machine that will host your
+ new monitor. ::
+
+ ssh {new-mon-host}
+ sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
+
+#. Create a temporary directory ``{tmp}`` to keep the files needed during
+ this process. This directory should be different from the monitor's default
+ directory created in the previous step, and can be removed after all the
+ steps are executed. ::
+
+ mkdir {tmp}
+
+#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to
+ the retrieved keyring, and ``{key-filename}`` is the name of the file
+ containing the retrieved monitor key. ::
+
+ ceph auth get mon. -o {tmp}/{key-filename}
+
+#. Retrieve the monitor map, where ``{tmp}`` is the path to
+ the retrieved monitor map, and ``{map-filename}`` is the name of the file
+ containing the retrieved monitor map. ::
+
+ ceph mon getmap -o {tmp}/{map-filename}
+
+#. Prepare the monitor's data directory created in the first step. You must
+ specify the path to the monitor map so that you can retrieve the
+ information about a quorum of monitors and their ``fsid``. You must also
+ specify a path to the monitor keyring::
+
+ sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
+
+
+#. Start the new monitor and it will automatically join the cluster.
+ The daemon needs to know which address to bind to, either via
+ ``--public-addr {ip:port}`` or by setting ``mon addr`` in the
+ appropriate section of ``ceph.conf``. For example::
+
+ ceph-mon -i {mon-id} --public-addr {ip:port}
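+
+After the new monitor starts, its membership can be checked from any host
+that has an admin keyring. The following is a minimal sketch, assuming the
+standard ``ceph`` CLI is available on that host; the wrapper name
+``show_mon_quorum`` is hypothetical. ::
+
+    # Print a one-line monitor summary, then the detailed quorum status.
+    show_mon_quorum() {
+        ceph mon stat
+        ceph quorum_status --format json-pretty
+    }
+
+Run ``show_mon_quorum`` and confirm that the new ``{mon-id}`` is listed among
+the monitors in the quorum.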
+
+
+Removing Monitors
+=================
+
+When you remove monitors from a cluster, consider that Ceph monitors use
+Paxos to establish consensus about the master cluster map. You must have
+a sufficient number of monitors to establish a quorum for consensus about
+the cluster map.
+
+.. _Removing a Monitor (Manual):
+
+Removing a Monitor (Manual)
+---------------------------
+
+This procedure removes a ``ceph-mon`` daemon from your cluster. If this
+procedure results in only two monitor daemons, you may add or remove another
+monitor until you have a number of ``ceph-mon`` daemons that can achieve a
+quorum.
+
+#. Stop the monitor. ::
+
+ service ceph -a stop mon.{mon-id}
+
+#. Remove the monitor from the cluster. ::
+
+ ceph mon remove {mon-id}
+
+#. Remove the monitor entry from ``ceph.conf``.
+
+
+Removing Monitors from an Unhealthy Cluster
+-------------------------------------------
+
+This procedure removes a ``ceph-mon`` daemon from an unhealthy
+cluster, for example a cluster where the monitors cannot form a
+quorum.
+
+
+#. Stop all ``ceph-mon`` daemons on all monitor hosts. ::
+
+ ssh {mon-host}
+ service ceph stop mon || stop ceph-mon-all
+ # and repeat for all mons
+
+#. Identify a surviving monitor and log in to that host. ::
+
+ ssh {mon-host}
+
+#. Extract a copy of the monmap file. ::
+
+ ceph-mon -i {mon-id} --extract-monmap {map-path}
+ # in most cases, that's
+ ceph-mon -i `hostname` --extract-monmap /tmp/monmap
+
+#. Remove the non-surviving or problematic monitors. For example, if
+ you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where
+ only ``mon.a`` will survive, follow the example below::
+
+ monmaptool {map-path} --rm {mon-id}
+ # for example,
+ monmaptool /tmp/monmap --rm b
+ monmaptool /tmp/monmap --rm c
+
+#. Inject the modified map, with the failed monitors removed, into the
+ surviving monitor(s). For example, to inject a map into monitor
+ ``mon.a``, follow the example below::
+
+ ceph-mon -i {mon-id} --inject-monmap {map-path}
+ # for example,
+ ceph-mon -i a --inject-monmap /tmp/monmap
+
+#. Start only the surviving monitors.
+
+#. Verify the monitors form a quorum (``ceph -s``).
+
+#. You may wish to archive the removed monitors' data directory in
+ ``/var/lib/ceph/mon`` in a safe location, or delete it if you are
+ confident the remaining monitors are healthy and are sufficiently
+ redundant.
+
+.. _Changing a Monitor's IP address:
+
+Changing a Monitor's IP Address
+===============================
+
+.. important:: Existing monitors are not supposed to change their IP addresses.
+
+Monitors are critical components of a Ceph cluster, and they need to maintain a
+quorum for the whole system to work properly. To establish a quorum, the
+monitors need to discover each other. Ceph has strict requirements for
+discovering monitors.
+
+Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors.
+However, monitors discover each other using the monitor map, not ``ceph.conf``.
+For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you
+need to obtain the current monmap for the cluster when creating a new monitor,
+as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The
+following sections explain the consistency requirements for Ceph monitors, and a
+few safe ways to change a monitor's IP address.
+
+
+Consistency Requirements
+------------------------
+
+A monitor always refers to the local copy of the monmap when discovering other
+monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids
+errors that could break the cluster (e.g., typos in ``ceph.conf`` when
+specifying a monitor address or port). Since monitors use monmaps for discovery
+and they share monmaps with clients and other Ceph daemons, the monmap provides
+monitors with a strict guarantee that their consensus is valid.
+
+Strict consistency also applies to updates to the monmap. As with any other
+updates on the monitor, changes to the monmap always run through a distributed
+consensus algorithm called `Paxos`_. The monitors must agree on each update to
+the monmap, such as adding or removing a monitor, to ensure that each monitor in
+the quorum has the same version of the monmap. Updates to the monmap are
+incremental so that monitors have the latest agreed upon version, and a set of
+previous versions, allowing a monitor that has an older version of the monmap to
+catch up with the current state of the cluster.
+
+If monitors discovered each other through the Ceph configuration file instead of
+through the monmap, it would introduce additional risks because the Ceph
+configuration files are not updated and distributed automatically. Monitors
+might inadvertently use an older ``ceph.conf`` file, fail to recognize a
+monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able
+to determine the current state of the system accurately. Consequently, making
+changes to an existing monitor's IP address must be done with great care.
+
+
+Changing a Monitor's IP address (The Right Way)
+-----------------------------------------------
+
+Changing a monitor's IP address in ``ceph.conf`` alone is not sufficient to
+ensure that other monitors in the cluster will receive the update. To change a
+monitor's IP address, you must add a new monitor with the IP address you want
+to use (as described in `Adding a Monitor (Manual)`_), ensure that the new
+monitor successfully joins the quorum, and then remove the monitor that uses
+the old IP address. Finally, update the ``ceph.conf`` file to ensure that
+clients and other daemons know the IP address of the new monitor.
+
+For example, let's assume there are three monitors in place, such as ::
+
+ [mon.a]
+ host = host01
+ addr = 10.0.0.1:6789
+ [mon.b]
+ host = host02
+ addr = 10.0.0.2:6789
+ [mon.c]
+ host = host03
+ addr = 10.0.0.3:6789
+
+To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the
+steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure
+that ``mon.d`` is running before removing ``mon.c``, or it will break the
+quorum. Remove ``mon.c`` as described in `Removing a Monitor (Manual)`_. Moving
+all three monitors would thus require repeating this process as many times as
+needed.
+
+
+Changing a Monitor's IP address (The Messy Way)
+-----------------------------------------------
+
+There may come a time when the monitors must be moved to a different network, a
+different part of the datacenter or a different datacenter altogether. While it
+is possible to do it, the process becomes a bit more hazardous.
+
+In such a case, the solution is to generate a new monmap with updated IP
+addresses for all the monitors in the cluster, and inject the new map on each
+individual monitor. This is not the most user-friendly approach, but we do not
+expect this to be something that needs to be done every other week. As
+stated at the top of this section, monitors are not supposed to change
+IP addresses.
+
+Using the previous monitor configuration as an example, assume you want to move
+all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these
+networks are unable to communicate. Use the following procedure:
+
+#. Retrieve the monitor map, where ``{tmp}`` is the path to
+ the retrieved monitor map, and ``{filename}`` is the name of the file
+ containing the retrieved monitor map. ::
+
+ ceph mon getmap -o {tmp}/{filename}
+
+#. The following example demonstrates the contents of the monmap. ::
+
+ $ monmaptool --print {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ epoch 1
+ fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
+ last_changed 2012-12-17 02:46:41.591248
+ created 2012-12-17 02:46:41.591248
+ 0: 10.0.0.1:6789/0 mon.a
+ 1: 10.0.0.2:6789/0 mon.b
+ 2: 10.0.0.3:6789/0 mon.c
+
+#. Remove the existing monitors. ::
+
+ $ monmaptool --rm a --rm b --rm c {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ monmaptool: removing a
+ monmaptool: removing b
+ monmaptool: removing c
+ monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
+
+#. Add the new monitor locations. ::
+
+ $ monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
+
+#. Check new contents. ::
+
+ $ monmaptool --print {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ epoch 1
+ fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
+ last_changed 2012-12-17 02:46:41.591248
+ created 2012-12-17 02:46:41.591248
+ 0: 10.1.0.1:6789/0 mon.a
+ 1: 10.1.0.2:6789/0 mon.b
+ 2: 10.1.0.3:6789/0 mon.c
+
+At this point, we assume the monitors (and stores) are installed at the new
+location. The next step is to propagate the modified monmap to the new
+monitors, and inject the modified monmap into each new monitor.
+
+#. First, make sure to stop all your monitors. Injection must be done while
+ the daemon is not running.
+
+#. Inject the monmap. ::
+
+ ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
+
+#. Restart the monitors.
+
+After this step, migration to the new location is complete and
+the monitors should operate successfully.
+
+
+.. _Manual Deployment: ../../../install/manual-deployment
+.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
diff --git a/src/ceph/doc/rados/operations/add-or-rm-osds.rst b/src/ceph/doc/rados/operations/add-or-rm-osds.rst
new file mode 100644
index 0000000..59ce4c7
--- /dev/null
+++ b/src/ceph/doc/rados/operations/add-or-rm-osds.rst
@@ -0,0 +1,366 @@
+======================
+ Adding/Removing OSDs
+======================
+
+When you have a cluster up and running, you may add OSDs or remove OSDs
+from the cluster at runtime.
+
+Adding OSDs
+===========
+
+When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an
+OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a
+host machine. If your host has multiple storage drives, you may map one
+``ceph-osd`` daemon for each drive.
+
+Generally, it's a good idea to check the capacity of your cluster to see if you
+are reaching the upper end of its capacity. As your cluster reaches its ``near
+full`` ratio, you should add one or more OSDs to expand your cluster's capacity.
+
+.. warning:: Do not let your cluster reach its ``full ratio`` before
+ adding an OSD. OSD failures that occur after the cluster reaches
+ its ``near full`` ratio may cause the cluster to exceed its
+ ``full ratio``.
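+
+As a rough sketch of the capacity check, the helper below compares a usage
+fraction against a threshold. The function name ``is_near_full`` is
+hypothetical, and the ``0.85`` default is an assumption standing in for your
+cluster's configured ``near full`` ratio. The used and total figures can be
+read from the output of ``ceph df``. ::
+
+    # Succeeds (exit status 0) when used/total is at or above the threshold.
+    # Arguments: used total [threshold]; used and total in the same unit.
+    is_near_full() {
+        awk -v u="$1" -v t="$2" -v r="${3:-0.85}" \
+            'BEGIN { exit !(t > 0 && u / t >= r) }'
+    }
+
+    # Example with made-up figures: 870 GB used out of 1000 GB.
+    if is_near_full 870 1000; then
+        echo "near full: add OSDs before the cluster reaches its full ratio"
+    fi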
+
+Deploy your Hardware
+--------------------
+
+If you are adding a new host when adding a new OSD, see `Hardware
+Recommendations`_ for details on minimum recommendations for OSD hardware. To
+add an OSD host to your cluster, first make sure you have an up-to-date version
+of Linux installed, and you have made some initial preparations for your
+storage drives. See `Filesystem Recommendations`_ for details.
+
+Add your OSD host to a rack in your cluster, connect it to the network
+and ensure that it has network connectivity. See the `Network Configuration
+Reference`_ for details.
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations
+.. _Network Configuration Reference: ../../configuration/network-config-ref
+
+Install the Required Software
+-----------------------------
+
+For manually deployed clusters, you must install Ceph packages
+manually. See `Installing Ceph (Manual)`_ for details.
+You should configure SSH to a user with password-less authentication
+and root permissions.
+
+.. _Installing Ceph (Manual): ../../../install
+
+
+Adding an OSD (Manual)
+----------------------
+
+This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive,
+and configures the cluster to distribute data to the OSD. If your host has
+multiple drives, you may add an OSD for each drive by repeating this procedure.
+
+To add an OSD, create a data directory for it, mount a drive to that directory,
+add the OSD to the cluster, and then add it to the CRUSH map.
+
+When you add the OSD to the CRUSH map, consider the weight you give to the new
+OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger
+hard drives than older hosts in the cluster (i.e., they may have greater
+weight).
+
+.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives
+ of dissimilar size, you can adjust their weights. However, for best
+ performance, consider a CRUSH hierarchy with drives of the same type/size.
+
+#. Create the OSD. If no UUID is given, it will be set automatically when the
+ OSD starts up. The following command will output the OSD number, which you
+ will need for subsequent steps. ::
+
+ ceph osd create [{uuid} [{id}]]
+
+ If the optional parameter {id} is given it will be used as the OSD id.
+ Note, in this case the command may fail if the number is already in use.
+
+ .. warning:: In general, explicitly specifying {id} is not recommended.
+ IDs are allocated as an array, and skipping entries consumes some extra
+ memory. This can become significant if there are large gaps and/or
+ clusters are large. If {id} is not specified, the smallest available is
+ used.
+
+#. Create the default directory on your new OSD. ::
+
+ ssh {new-osd-host}
+ sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
+
+
+#. If the OSD is for a drive other than the OS drive, prepare it
+ for use with Ceph, and mount it to the directory you just created::
+
+ ssh {new-osd-host}
+ sudo mkfs -t {fstype} /dev/{drive}
+ sudo mount -o user_xattr /dev/{drive} /var/lib/ceph/osd/ceph-{osd-number}
+
+
+#. Initialize the OSD data directory. ::
+
+ ssh {new-osd-host}
+ ceph-osd -i {osd-num} --mkfs --mkkey
+
+ The directory must be empty before you can run ``ceph-osd``.
+
+#. Register the OSD authentication key. The ``ceph`` in
+ ``ceph-{osd-num}`` in the path is the cluster name (the directory name
+ follows the ``$cluster-$id`` pattern). If your cluster name differs from
+ ``ceph``, use your cluster name instead. ::
+
+ ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring
+
+
+#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The
+ ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy
+ wherever you wish. If you specify at least one bucket, the command
+ will place the OSD into the most specific bucket you specify, *and* it will
+ move that bucket underneath any other buckets you specify. **Important:** If
+ you specify only the root bucket, the command will attach the OSD directly
+ to the root, but CRUSH rules expect OSDs to be inside of hosts.
+
+ For Argonaut (v 0.48), execute the following::
+
+ ceph osd crush add {id} {name} {weight} [{bucket-type}={bucket-name} ...]
+
+ For Bobtail (v 0.56) and later releases, execute the following::
+
+ ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...]
+
+ You may also decompile the CRUSH map, add the OSD to the device list, add the
+ host as a bucket (if it's not already in the CRUSH map), add the device as an
+ item in the host, assign it a weight, recompile it and set it. See
+ `Add/Move an OSD`_ for details.
+
+
+.. topic:: Argonaut (v0.48) Best Practices
+
+ To limit impact on user I/O performance, add an OSD to the CRUSH map
+ with an initial weight of ``0``. Then, ramp up the CRUSH weight a
+ little bit at a time. For example, to ramp by increments of ``0.2``,
+ start with::
+
+ ceph osd crush reweight {osd-id} .2
+
+ and allow migration to complete before reweighting to ``0.4``,
+ ``0.6``, and so on until the desired CRUSH weight is reached.
+
+ To limit the impact of OSD failures, you can set::
+
+ mon osd down out interval = 0
+
+ which prevents down OSDs from automatically being marked out, and then
+ ramp them down manually with::
+
+ ceph osd reweight {osd-num} .8
+
+ Again, wait for the cluster to finish migrating data, and then adjust
+ the weight further until you reach a weight of 0. Note that this
+ setting prevents the cluster from automatically re-replicating data after
+ a failure, so please ensure that sufficient monitoring is in place for
+ an administrator to intervene promptly.
+
+ Note that this practice will no longer be necessary in Bobtail and
+ subsequent releases.
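+
+The incremental ramp-up described above can be sketched as a generator that
+only prints the reweight commands, so that each one can be run by hand once
+the previous migration has finished. The function name
+``print_reweight_ramp`` and the example values are hypothetical. ::
+
+    # Print "ceph osd crush reweight" commands from one step up to the target.
+    # Arguments: OSD name, target CRUSH weight, step size (must be positive).
+    print_reweight_ramp() {
+        awk -v osd="$1" -v target="$2" -v step="$3" 'BEGIN {
+            if (step <= 0) exit 1
+            for (w = step; w < target - 1e-9; w += step)
+                printf "ceph osd crush reweight %s %.2f\n", osd, w
+            printf "ceph osd crush reweight %s %.2f\n", osd, target
+        }'
+    }
+
+    print_reweight_ramp osd.3 1.0 0.2
+
+Printing the commands instead of executing them keeps the operator in
+control of when each step is applied.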
+
+
+Replacing an OSD
+----------------
+
+When disks fail, or if an administrator wants to reprovision OSDs with a new
+backend (for instance, to switch from FileStore to BlueStore), OSDs need to
+be replaced. Unlike `Removing the OSD`_, the replaced OSD's id and CRUSH map
+entry need to be kept intact after the OSD is destroyed for replacement.
+
+#. Destroy the OSD first::
+
+ ceph osd destroy {id} --yes-i-really-mean-it
+
+#. Zap a disk for the new OSD, if the disk was used before for other purposes.
+ It's not necessary for a new disk::
+
+ ceph-disk zap /dev/sdX
+
+#. Prepare the disk for replacement by using the previously destroyed OSD id::
+
+ ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen`
+
+#. And activate the OSD::
+
+ ceph-disk activate /dev/sdX1
+
+
+Starting the OSD
+----------------
+
+After you add an OSD to Ceph, the OSD is in your configuration. However,
+it is not yet running. The OSD is ``down`` and ``in``. You must start
+your new OSD before it can begin receiving data. You may use
+``service ceph`` from your admin host or start the OSD from its host
+machine.
+
+For Ubuntu Trusty use Upstart. ::
+
+ sudo start ceph-osd id={osd-num}
+
+For all other distros use systemd. ::
+
+ sudo systemctl start ceph-osd@{osd-num}
+
+
+Once you start your OSD, it is ``up`` and ``in``.
+
+
+Observe the Data Migration
+--------------------------
+
+Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing
+the cluster by migrating placement groups to your new OSD. You can observe this
+process with the `ceph`_ tool. ::
+
+ ceph -w
+
+You should see the placement group states change from ``active+clean`` to
+``active, some degraded objects``, and finally ``active+clean`` when migration
+completes. (Control-c to exit.)
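+
+For scripted use, waiting for the cluster to settle can be sketched as a
+polling loop around ``ceph health``. This is a minimal sketch: the function
+name ``wait_for_health_ok`` is hypothetical, and ``HEALTH_OK`` is a coarser
+signal than every placement group being ``active+clean``, so the loop will
+not finish if the cluster reports warnings for unrelated reasons. ::
+
+    # Poll "ceph health" until it reports HEALTH_OK.
+    # Argument: polling interval in seconds (defaults to 10).
+    wait_for_health_ok() {
+        interval="${1:-10}"
+        until ceph health | grep -q '^HEALTH_OK'; do
+            sleep "$interval"
+        done
+    }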
+
+
+.. _Add/Move an OSD: ../crush-map#addosd
+.. _ceph: ../monitoring
+
+
+
+Removing OSDs (Manual)
+======================
+
+When you want to reduce the size of a cluster or replace hardware, you may
+remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd``
+daemon for one storage drive within a host machine. If your host has multiple
+storage drives, you may need to remove one ``ceph-osd`` daemon for each drive.
+Generally, it's a good idea to check the capacity of your cluster to see if you
+are reaching the upper end of its capacity. Ensure that your cluster is not
+at its ``near full`` ratio when you remove an OSD.
+
+.. warning:: Do not let your cluster reach its ``full ratio`` when
+ removing an OSD. Removing OSDs could cause the cluster to reach
+ or exceed its ``full ratio``.
+
+
+Take the OSD out of the Cluster
+-----------------------------------
+
+Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it
+out of the cluster so that Ceph can begin rebalancing and copying its data to
+other OSDs. ::
+
+ ceph osd out {osd-num}
+
+
+Observe the Data Migration
+--------------------------
+
+Once you have taken your OSD ``out`` of the cluster, Ceph will begin
+rebalancing the cluster by migrating placement groups out of the OSD you
+removed. You can observe this process with the `ceph`_ tool. ::
+
+ ceph -w
+
+You should see the placement group states change from ``active+clean`` to
+``active, some degraded objects``, and finally ``active+clean`` when migration
+completes. (Control-c to exit.)
+
+.. note:: Sometimes, typically in a "small" cluster with few hosts (for
+ instance, a small testing cluster), taking the OSD ``out`` can trigger
+ a CRUSH corner case where some PGs remain stuck in the
+ ``active+remapped`` state. If this happens, mark
+ the OSD ``in`` again with:
+
+ ``ceph osd in {osd-num}``
+
+ to return to the initial state and then, instead of marking the OSD
+ ``out``, set its weight to 0 with:
+
+ ``ceph osd crush reweight osd.{osd-num} 0``
+
+ After that, you can observe the data migration, which should run to
+ completion. The difference between marking the OSD ``out`` and reweighting it
+ to 0 is that in the first case the weight of the bucket which contains
+ the OSD is not changed, whereas in the second case the weight of the bucket
+ is updated (decreased by the OSD's weight). The reweight command may
+ sometimes be preferable in the case of a "small" cluster.
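+
+The difference can be illustrated with a small calculation. In this sketch,
+the helper ``bucket_weight`` is hypothetical; it treats a host bucket's
+weight as the sum of the CRUSH weights of its OSDs, and the weights are
+made-up example values. ::
+
+    # Sum the CRUSH weights of the OSDs in one host bucket.
+    bucket_weight() {
+        printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum }'
+    }
+
+    # Host with three OSDs of CRUSH weight 1.0 each.
+    bucket_weight 1.0 1.0 1.0    # marking one OSD out leaves CRUSH weights as-is
+    bucket_weight 1.0 1.0 0      # after reweighting one OSD to 0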
+
+
+
+Stopping the OSD
+----------------
+
+After you take an OSD out of the cluster, it may still be running.
+That is, the OSD may be ``up`` and ``out``. You must stop
+your OSD before you remove it from the configuration. ::
+
+ ssh {osd-host}
+ sudo systemctl stop ceph-osd@{osd-num}
+
+Once you stop your OSD, it is ``down``.
+
+
+Removing the OSD
+----------------
+
+This procedure removes an OSD from a cluster map, removes its authentication
+key, removes the OSD from the OSD map, and removes the OSD from the
+``ceph.conf`` file. If your host has multiple drives, you may need to remove an
+OSD for each drive by repeating this procedure.
+
+#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH
+ map, removes its authentication key, and removes it from the OSD map as
+ well. Note that the `purge subcommand`_ was introduced in Luminous; for older
+ versions, see below. ::
+
+ ceph osd purge {id} --yes-i-really-mean-it
+
+#. Navigate to the host where you keep the master copy of the cluster's
+ ``ceph.conf`` file. ::
+
+ ssh {admin-host}
+ cd /etc/ceph
+ vim ceph.conf
+
+#. Remove the OSD entry from your ``ceph.conf`` file (if it exists). ::
+
+ [osd.1]
+ host = {hostname}
+
+#. From the host where you keep the master copy of the cluster's ``ceph.conf`` file,
+ copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of other
+ hosts in your cluster.
+
+If your Ceph cluster is older than Luminous, instead of using ``ceph osd purge``,
+you need to perform these steps manually:
+
+
+#. Remove the OSD from the CRUSH map so that it no longer receives data. You may
+ also decompile the CRUSH map, remove the OSD from the device list, remove the
+ device as an item in the host bucket or remove the host bucket (if it's in the
+ CRUSH map and you intend to remove the host), recompile the map and set it.
+ See `Remove an OSD`_ for details. ::
+
+ ceph osd crush remove {name}
+
+#. Remove the OSD authentication key. ::
+
+ ceph auth del osd.{osd-num}
+
+ The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the ``$cluster-$id``.
+ If your cluster name differs from ``ceph``, use your cluster name instead.
+
+#. Remove the OSD. ::
+
+ ceph osd rm {osd-num}
+ #for example
+ ceph osd rm 1
+
+
+.. _Remove an OSD: ../crush-map#removeosd
+.. _purge subcommand: /man/8/ceph#osd
diff --git a/src/ceph/doc/rados/operations/cache-tiering.rst b/src/ceph/doc/rados/operations/cache-tiering.rst
new file mode 100644
index 0000000..322c6ff
--- /dev/null
+++ b/src/ceph/doc/rados/operations/cache-tiering.rst
@@ -0,0 +1,461 @@
+===============
+ Cache Tiering
+===============
+
+A cache tier provides Ceph Clients with better I/O performance for a subset of
+the data stored in a backing storage tier. Cache tiering involves creating a
+pool of relatively fast/expensive storage devices (e.g., solid state drives)
+configured to act as a cache tier, and a backing pool of either erasure-coded
+or relatively slower/cheaper devices configured to act as an economical storage
+tier. The Ceph objecter handles where to place the objects and the tiering
+agent determines when to flush objects from the cache to the backing storage
+tier. So the cache tier and the backing storage tier are completely transparent
+to Ceph clients.
+
+
+.. ditaa::
+ +-------------+
+ | Ceph Client |
+ +------+------+
+ ^
+ Tiering is |
+ Transparent | Faster I/O
+ to Ceph | +---------------+
+ Client Ops | | |
+ | +----->+ Cache Tier |
+ | | | |
+ | | +-----+---+-----+
+ | | | ^
+ v v | | Active Data in Cache Tier
+ +------+----+--+ | |
+ | Objecter | | |
+ +-----------+--+ | |
+ ^ | | Inactive Data in Storage Tier
+ | v |
+ | +-----+---+-----+
+ | | |
+ +----->| Storage Tier |
+ | |
+ +---------------+
+ Slower I/O
+
+
+The cache tiering agent handles the migration of data between the cache tier
+and the backing storage tier automatically. However, admins have the ability to
+configure how this migration takes place. There are two main scenarios:
+
+- **Writeback Mode:** When admins configure tiers with ``writeback`` mode, Ceph
+ clients write data to the cache tier and receive an ACK from the cache tier.
+ In time, the data written to the cache tier migrates to the storage tier
+ and gets flushed from the cache tier. Conceptually, the cache tier is
+ overlaid "in front" of the backing storage tier. When a Ceph client needs
+ data that resides in the storage tier, the cache tiering agent migrates the
+ data to the cache tier on read, then it is sent to the Ceph client.
+ Thereafter, the Ceph client can perform I/O using the cache tier, until the
+ data becomes inactive. This is ideal for mutable data (e.g., photo/video
+ editing, transactional data, etc.).
+
+- **Read-proxy Mode:** This mode will use any objects that already
+ exist in the cache tier, but if an object is not present in the
+ cache the request will be proxied to the base tier. This is useful
+ for transitioning from ``writeback`` mode to a disabled cache as it
+ allows the workload to function properly while the cache is drained,
+ without adding any new objects to the cache.
+
+A word of caution
+=================
+
+Cache tiering will *degrade* performance for most workloads. Users should use
+extreme caution before using this feature.
+
+* *Workload dependent*: Whether a cache will improve performance is
+ highly dependent on the workload. Because there is a cost
+ associated with moving objects into or out of the cache, it can only
+ be effective when there is a *large skew* in the access pattern in
+ the data set, such that most of the requests touch a small number of
+ objects. The cache pool should be large enough to capture the
+ working set for your workload to avoid thrashing.
+
+* *Difficult to benchmark*: Most benchmarks that users run to measure
+ performance will show terrible performance with cache tiering, in
+ part because very few of them skew requests toward a small set of
+ objects, because it can take a long time for the cache to "warm up,"
+ and because the warm-up cost can be high.
+
+* *Usually slower*: For workloads that are not cache tiering-friendly,
+ performance is often slower than a normal RADOS pool without cache
+ tiering enabled.
+
+* *librados object enumeration*: The librados-level object enumeration
+ API is not meant to be coherent in the presence of the cache. If
+ your application is using librados directly and relies on object
+ enumeration, cache tiering will probably not work as expected.
+ (This is not a problem for RGW, RBD, or CephFS.)
+
+* *Complexity*: Enabling cache tiering means that a lot of additional
+ machinery and complexity within the RADOS cluster is being used.
+ This increases the probability that you will encounter a bug in the system
+ that other users have not yet encountered and will put your deployment at a
+ higher level of risk.
+
+Known Good Workloads
+--------------------
+
+* *RGW time-skewed*: If the RGW workload is such that almost all read
+ operations are directed at recently written objects, a simple cache
+ tiering configuration that destages recently written objects from
+ the cache to the base tier after a configurable period can work
+ well.
+
+Known Bad Workloads
+-------------------
+
+The following configurations are *known to work poorly* with cache
+tiering.
+
+* *RBD with replicated cache and erasure-coded base*: This is a common
+ request, but usually does not perform well. Even reasonably skewed
+ workloads still send some small writes to cold objects, and because
+ small writes are not yet supported by the erasure-coded pool, entire
+ (usually 4 MB) objects must be migrated into the cache in order to
+ satisfy a small (often 4 KB) write. Only a handful of users have
+ successfully deployed this configuration, and it only works for them
+ because their data is extremely cold (backups) and they are not in
+ any way sensitive to performance.
+
+* *RBD with replicated cache and base*: RBD with a replicated base
+ tier does better than when the base is erasure coded, but it is
+ still highly dependent on the amount of skew in the workload, and
+ very difficult to validate. The user will need to have a good
+ understanding of their workload and will need to tune the cache
+ tiering parameters carefully.
+
+
+Setting Up Pools
+================
+
+To set up cache tiering, you must have two pools. One will act as the
+backing storage and the other will act as the cache.
+
+
+Setting Up a Backing Storage Pool
+---------------------------------
+
+Setting up a backing storage pool typically involves one of two scenarios:
+
+- **Standard Storage:** In this scenario, the pool stores multiple copies
+ of an object in the Ceph Storage Cluster.
+
+- **Erasure Coding:** In this scenario, the pool uses erasure coding to
+ store data much more efficiently with a small performance tradeoff.
+
+In the standard storage scenario, you can set up a CRUSH ruleset to establish
+the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD
+Daemons perform optimally when all storage drives in the ruleset are of the
+same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
+for details on creating a ruleset. Once you have created a ruleset, create
+a backing storage pool.
+
+In the erasure coding scenario, the pool creation arguments will generate the
+appropriate ruleset automatically. See `Create a Pool`_ for details.
+
+In subsequent examples, we will refer to the backing storage pool
+as ``cold-storage``.
+
+
+Setting Up a Cache Pool
+-----------------------
+
+Setting up a cache pool follows the same procedure as the standard storage
+scenario, but with this difference: the drives for the cache tier are typically
+high performance drives that reside in their own servers and have their own
+ruleset. When setting up a ruleset, it should take account of the hosts that
+have the high performance drives while omitting the hosts that don't. See
+`Placing Different Pools on Different OSDs`_ for details.
+
+
+In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
+the backing pool as ``cold-storage``.
+
+For cache tier configuration and default values, see
+`Pools - Set Pool Values`_.
+
+
+Creating a Cache Tier
+=====================
+
+Setting up a cache tier involves associating a backing storage pool with
+a cache pool ::
+
+ ceph osd tier add {storagepool} {cachepool}
+
+For example ::
+
+ ceph osd tier add cold-storage hot-storage
+
+To set the cache mode, execute the following::
+
+ ceph osd tier cache-mode {cachepool} {cache-mode}
+
+For example::
+
+ ceph osd tier cache-mode hot-storage writeback
+
+The cache tiers overlay the backing storage tier, so they require one
+additional step: you must direct all client traffic from the storage pool to
+the cache pool. To direct client traffic directly to the cache pool, execute
+the following::
+
+ ceph osd tier set-overlay {storagepool} {cachepool}
+
+For example::
+
+ ceph osd tier set-overlay cold-storage hot-storage
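+
+The three commands above must be issued in order: add the tier, set the
+cache mode, then set the overlay. As a minimal sketch for anyone scripting
+this, the hypothetical helper ``cache_tier_setup_commands`` below only builds
+the argument lists for those commands; it is not part of Ceph and never
+contacts a cluster.
+

```python
from typing import List


def cache_tier_setup_commands(storage_pool: str, cache_pool: str,
                              cache_mode: str = "writeback") -> List[List[str]]:
    """Return the three setup commands, in order, as argv lists.

    Hypothetical helper: it only assembles the commands documented
    above and does not execute anything.
    """
    return [
        # 1. Associate the cache pool with the backing storage pool.
        ["ceph", "osd", "tier", "add", storage_pool, cache_pool],
        # 2. Choose how the cache behaves (for example, writeback).
        ["ceph", "osd", "tier", "cache-mode", cache_pool, cache_mode],
        # 3. Route client traffic for the storage pool through the cache.
        ["ceph", "osd", "tier", "set-overlay", storage_pool, cache_pool],
    ]


if __name__ == "__main__":
    for argv in cache_tier_setup_commands("cold-storage", "hot-storage"):
        print(" ".join(argv))
```

+
+Returning argument lists rather than one shell string keeps pool names from
+being reinterpreted by a shell if the lists are later passed to a process
+launcher.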
+
+
+Configuring a Cache Tier
+========================
+
+Cache tiers have several configuration options. You may set
+cache tier configuration options with the following usage::
+
+ ceph osd pool set {cachepool} {key} {value}
+
+See `Pools - Set Pool Values`_ for details.
+
+
+Target Size and Type
+--------------------
+
+Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``::
+
+ ceph osd pool set {cachepool} hit_set_type bloom
+
+For example::
+
+ ceph osd pool set hot-storage hit_set_type bloom
+
+``hit_set_count`` defines how many HitSets to store, and ``hit_set_period``
+defines how much time (in seconds) each HitSet should cover. ::
+
+ ceph osd pool set {cachepool} hit_set_count 12
+ ceph osd pool set {cachepool} hit_set_period 14400
+ ceph osd pool set {cachepool} target_max_bytes 1000000000000
+
+.. note:: A larger ``hit_set_count`` results in more RAM consumed by
+ the ``ceph-osd`` process.
+
+Binning accesses over time allows Ceph to determine whether a Ceph client
+accessed an object at least once, or more than once over a time period
+("age" vs "temperature").
+
+The ``min_read_recency_for_promote`` setting defines how many HitSets to check
+for the existence of an object when handling a read operation. The result of
+the check determines whether the object is promoted asynchronously. Its value
+should be between 0 and ``hit_set_count``. If it is set to 0, the object is
+always promoted. If it is set to 1, only the current HitSet is checked, and the
+object is promoted only if it is in that HitSet. For other values, that exact
+number of archive HitSets is checked, and the object is promoted if it is
+found in any of the most recent ``min_read_recency_for_promote`` HitSets.
+
+A similar parameter can be set for the write operation, which is
+``min_write_recency_for_promote``. ::
+
+ ceph osd pool set {cachepool} min_read_recency_for_promote 2
+ ceph osd pool set {cachepool} min_write_recency_for_promote 2
+
+.. note:: The longer the period and the higher the
+ ``min_read_recency_for_promote`` and
+ ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
+ daemon consumes. In particular, when the agent is active to flush
+ or evict cache objects, all ``hit_set_count`` HitSets are loaded
+ into RAM.
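+
+The promotion rule for ``min_read_recency_for_promote`` can be sketched as
+follows. This is a simplified model that follows the wording of this section
+and may differ in detail from what the OSD actually does. The function name
+``should_promote`` and the use of plain Python sets are assumptions made for
+illustration; real HitSets are Bloom filters and can report false positives.
+

```python
from typing import List, Set


def should_promote(obj: str, hit_sets: List[Set[str]], min_recency: int) -> bool:
    """Simplified model of the read-promotion check described in the text.

    hit_sets[0] stands for the current HitSet; later entries are
    progressively older archived HitSets.
    """
    if min_recency == 0:
        # A value of 0 means the object is always promoted.
        return True
    # Only the most recent `min_recency` HitSets are consulted.
    recent = hit_sets[:min_recency]
    return any(obj in hit_set for hit_set in recent)


if __name__ == "__main__":
    history = [{"a"}, {"b"}, {"c"}]
    for recency in (0, 1, 2):
        print(recency, should_promote("b", history, recency))
```

+
+In this model, raising the recency value widens the window of HitSets in which
+an earlier access counts toward promotion.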
+
+
+Cache Sizing
+------------
+
+The cache tiering agent performs two main functions:
+
+- **Flushing:** The agent identifies modified (or dirty) objects and forwards
+ them to the storage pool for long-term storage.
+
+- **Evicting:** The agent identifies objects that haven't been modified
+ (or clean) and evicts the least recently used among them from the cache.
+
+
+Absolute Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects based upon the total number
+of bytes or the total number of objects. To specify a maximum number of bytes,
+execute the following::
+
+ ceph osd pool set {cachepool} target_max_bytes {#bytes}
+
+For example, to flush or evict at 1 TiB (1099511627776 bytes), execute the following::
+
+ ceph osd pool set hot-storage target_max_bytes 1099511627776
+
+
+To specify the maximum number of objects, execute the following::
+
+ ceph osd pool set {cachepool} target_max_objects {#objects}
+
+For example, to flush or evict at 1M objects, execute the following::
+
+ ceph osd pool set hot-storage target_max_objects 1000000
+
+.. note:: Ceph cannot determine the size of a cache pool automatically, so you
+ must configure an absolute size here; otherwise, flushing and eviction will
+ not work. If you specify both limits, the cache tiering agent will begin
+ flushing or evicting when either threshold is reached.
+
+.. note:: All client requests are blocked only when ``target_max_bytes`` or
+ ``target_max_objects`` is reached.
+
+Relative Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects relative to the size of the
+cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
+`Absolute sizing`_). When the cache pool consists of a certain percentage of
+modified (or dirty) objects, the cache tiering agent will flush them to the
+storage pool. To set the ``cache_target_dirty_ratio``, execute the following::
+
+ ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}
+
+For example, setting the value to ``0.4`` will begin flushing modified
+(dirty) objects when they reach 40% of the cache pool's capacity::
+
+ ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+
+When dirty objects reach a higher percentage of the cache pool's capacity, the
+agent flushes them at a higher speed. To set the ``cache_target_dirty_high_ratio``::
+
+ ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}
+
+For example, setting the value to ``0.6`` will begin aggressively flushing dirty
+objects when they reach 60% of the cache pool's capacity. This value should lie
+between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``::
+
+ ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+
+When the cache pool reaches a certain percentage of its capacity, the cache
+tiering agent will evict objects to maintain free capacity. To set the
+``cache_target_full_ratio``, execute the following::
+
+ ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}
+
+For example, setting the value to ``0.8`` will begin evicting unmodified
+(clean) objects when the cache pool reaches 80% of its capacity::
+
+ ceph osd pool set hot-storage cache_target_full_ratio 0.8
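+
+To see how the three ratios interact, the sketch below maps a cache pool's
+dirty and total usage onto the activities described in this section. It is a
+deliberately simplified, bytes-only model: the function name ``agent_actions``
+and the action labels are hypothetical, the default ratios are just the
+example values used above, and the real agent also takes
+``target_max_objects`` into account.
+

```python
from typing import List


def agent_actions(dirty_bytes: int, used_bytes: int, target_max_bytes: int,
                  dirty_ratio: float = 0.4, dirty_high_ratio: float = 0.6,
                  full_ratio: float = 0.8) -> List[str]:
    """Return the agent activities that the configured ratios would trigger.

    Simplified model: thresholds are plain fractions of target_max_bytes.
    """
    actions = []
    if dirty_bytes >= dirty_high_ratio * target_max_bytes:
        # Above cache_target_dirty_high_ratio: flush at a higher speed.
        actions.append("flush-high-speed")
    elif dirty_bytes >= dirty_ratio * target_max_bytes:
        # Above cache_target_dirty_ratio: normal flushing.
        actions.append("flush")
    if used_bytes >= full_ratio * target_max_bytes:
        # Above cache_target_full_ratio: evict clean objects.
        actions.append("evict")
    return actions


if __name__ == "__main__":
    one_tib = 1024 ** 4
    print(agent_actions(one_tib // 2, 7 * one_tib // 10, one_tib))
```

+
+Keeping ``dirty_ratio`` below ``dirty_high_ratio`` and both below
+``full_ratio`` gives the agent room to flush before eviction pressure starts.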
+
+
+Cache Age
+---------
+
+You can specify the minimum age of an object before the cache tiering agent
+flushes a recently modified (or dirty) object to the backing storage pool::
+
+ ceph osd pool set {cachepool} cache_min_flush_age {#seconds}
+
+For example, to flush modified (or dirty) objects after 10 minutes, execute
+the following::
+
+ ceph osd pool set hot-storage cache_min_flush_age 600
+
+You can specify the minimum age of an object before it will be evicted from
+the cache tier::
+
+ ceph osd pool set {cachepool} cache_min_evict_age {#seconds}
+
+For example, to evict objects after 30 minutes, execute the following::
+
+ ceph osd pool set hot-storage cache_min_evict_age 1800
+
+
+Removing a Cache Tier
+=====================
+
+Removing a cache tier differs depending on whether it is a writeback
+cache or a read-only cache.
+
+
+Removing a Read-Only Cache
+--------------------------
+
+Since a read-only cache does not have modified data, you can disable
+and remove it without losing any recent changes to objects in the cache.
+
+#. Change the cache-mode to ``none`` to disable it. ::
+
+ ceph osd tier cache-mode {cachepool} none
+
+ For example::
+
+ ceph osd tier cache-mode hot-storage none
+
+#. Remove the cache pool from the backing pool. ::
+
+ ceph osd tier remove {storagepool} {cachepool}
+
+ For example::
+
+ ceph osd tier remove cold-storage hot-storage
+
+
+
+Removing a Writeback Cache
+--------------------------
+
+Since a writeback cache may have modified data, you must take steps to ensure
+that you do not lose any recent changes to objects in the cache before you
+disable and remove it.
+
+
+#. Change the cache mode to ``forward`` so that new and modified objects will
+ flush to the backing storage pool. ::
+
+ ceph osd tier cache-mode {cachepool} forward
+
+ For example::
+
+ ceph osd tier cache-mode hot-storage forward
+
+
+#. Ensure that the cache pool has been flushed. This may take a few minutes::
+
+ rados -p {cachepool} ls
+
+ If the cache pool still has objects, you can flush them manually.
+ For example::
+
+ rados -p {cachepool} cache-flush-evict-all
+
+
+#. Remove the overlay so that clients will not direct traffic to the cache. ::
+
+ ceph osd tier remove-overlay {storagepool}
+
+ For example::
+
+ ceph osd tier remove-overlay cold-storage
+
+
+#. Finally, remove the cache tier pool from the backing storage pool. ::
+
+ ceph osd tier remove {storagepool} {cachepool}
+
+ For example::
+
+ ceph osd tier remove cold-storage hot-storage
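+
+The four steps above can be sketched as a small script. This is an
+illustration under stated assumptions, not a supported tool: the function
+``remove_writeback_cache``, its ``run`` parameter, and the retry limit are
+hypothetical, and the only commands it issues are the ones shown in this
+section. Passing a fake ``run`` callable makes a dry run possible.
+

```python
import subprocess
from typing import Callable, List


def _run(argv: List[str]) -> str:
    """Default runner: execute the command and return its standard output."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def remove_writeback_cache(storage_pool: str, cache_pool: str,
                           run: Callable[[List[str]], str] = _run,
                           max_flush_attempts: int = 10) -> None:
    """Walk through the removal steps; `run` is injectable for dry runs."""
    # Step 1: switch to forward mode so objects drain to the backing pool.
    run(["ceph", "osd", "tier", "cache-mode", cache_pool, "forward"])
    # Step 2: flush and evict until the cache pool lists no objects.
    attempts = 0
    while run(["rados", "-p", cache_pool, "ls"]).strip():
        if attempts >= max_flush_attempts:
            raise RuntimeError("cache pool %s still holds objects" % cache_pool)
        run(["rados", "-p", cache_pool, "cache-flush-evict-all"])
        attempts += 1
    # Step 3: stop directing client traffic to the cache.
    run(["ceph", "osd", "tier", "remove-overlay", storage_pool])
    # Step 4: detach the cache pool from the backing storage pool.
    run(["ceph", "osd", "tier", "remove", storage_pool, cache_pool])
```

+
+The overlay is removed only after the listing comes back empty, which mirrors
+the ordering of the procedure and avoids detaching a cache that still holds
+modified objects.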
+
+
+.. _Create a Pool: ../pools#create-a-pool
+.. _Pools - Set Pool Values: ../pools#set-pool-values
+.. _Placing Different Pools on Different OSDs: ../crush-map/#placing-different-pools-on-different-osds
+.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
+.. _CRUSH Maps: ../crush-map
+.. _Absolute Sizing: #absolute-sizing
diff --git a/src/ceph/doc/rados/operations/control.rst b/src/ceph/doc/rados/operations/control.rst
new file mode 100644
index 0000000..1a58076
--- /dev/null
+++ b/src/ceph/doc/rados/operations/control.rst
@@ -0,0 +1,453 @@
+.. index:: control, commands
+
+==================
+ Control Commands
+==================
+
+
+Monitor Commands
+================
+
+Monitor commands are issued using the ceph utility::
+
+ ceph [-m monhost] {command}
+
+The command is usually (though not always) of the form::
+
+ ceph {subsystem} {command}
+
+
+System Commands
+===============
+
+Execute the following to display the current status of the cluster. ::
+
+ ceph -s
+ ceph status
+
+Execute the following to display a running summary of the status of the cluster,
+and major events. ::
+
+ ceph -w
+
+Execute the following to show the monitor quorum, including which monitors are
+participating and which one is the leader. ::
+
+ ceph quorum_status
+
+Execute the following to query the status of a single monitor, including whether
+or not it is in the quorum. ::
+
+ ceph [-m monhost] mon_status
+
+
+Authentication Subsystem
+========================
+
+To add a keyring for an OSD, execute the following::
+
+ ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}
+
+To list the cluster's keys and their capabilities, execute the following::
+
+ ceph auth ls
+
+
+Placement Group Subsystem
+=========================
+
+To display the statistics for all placement groups, execute the following::
+
+ ceph pg dump [--format {format}]
+
+The valid formats are ``plain`` (default) and ``json``.
+
+To display the statistics for all placement groups stuck in a specified state,
+execute the following::
+
+ ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]
+
+
+``--format`` may be ``plain`` (default) or ``json``
+
+``--threshold`` defines how many seconds "stuck" is (default: 300)
+
+**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
+with the most up-to-date data to come back.
+
+**Unclean** Placement groups contain objects that are not replicated the desired number
+of times. They should be recovering.
+
+**Stale** Placement groups are in an unknown state - the OSDs that host them have not
+reported to the monitor cluster in a while (configured by
+``mon_osd_report_timeout``).
+
+Mark unfound ("lost") objects as lost. With ``revert``, each object is rolled
+back to a previous version or, if it was just created, deleted; with ``delete``,
+the objects are removed. ::
+
+ ceph pg {pgid} mark_unfound_lost revert|delete
+
+
+OSD Subsystem
+=============
+
+Query OSD subsystem status. ::
+
+ ceph osd stat
+
+Write a copy of the most recent OSD map to a file. See
+`osdmaptool`_. ::
+
+ ceph osd getmap -o file
+
+.. _osdmaptool: ../../man/8/osdmaptool
+
+Write a copy of the crush map from the most recent OSD map to
+file. ::
+
+ ceph osd getcrushmap -o file
+
+The foregoing is functionally equivalent to ::
+
+ ceph osd getmap -o /tmp/osdmap
+ osdmaptool /tmp/osdmap --export-crush file
+
+Dump the OSD map. Valid formats for ``-f`` are ``plain`` and ``json``. If no
+``--format`` option is given, the OSD map is dumped as plain text. ::
+
+ ceph osd dump [--format {format}]
+
+Dump the OSD map as a tree with one line per OSD containing weight
+and state. ::
+
+ ceph osd tree [--format {format}]
+
+Find out where a specific object is or would be stored in the system::
+
+ ceph osd map <pool-name> <object-name>
+
+Add or move a new item (OSD) with the given id/name/weight at the specified
+location. ::
+
+ ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]]
+
+Remove an existing item (OSD) from the CRUSH map. ::
+
+ ceph osd crush remove {name}
+
+Remove an existing bucket from the CRUSH map. ::
+
+ ceph osd crush remove {bucket-name}
+
+Move an existing bucket from one position in the hierarchy to another. ::
+
+ ceph osd crush move {id} {loc1} [{loc2} ...]
+
+Set the weight of the item given by ``{name}`` to ``{weight}``. ::
+
+ ceph osd crush reweight {name} {weight}
+
+Mark an OSD as lost. This may result in permanent data loss. Use with caution. ::
+
+ ceph osd lost {id} [--yes-i-really-mean-it]
+
+Create a new OSD. If no UUID is given, it will be set automatically when the OSD
+starts up. ::
+
+ ceph osd create [{uuid}]
+
+Remove the given OSD(s). ::
+
+ ceph osd rm [{id}...]
+
+Query the current max_osd parameter in the OSD map. ::
+
+ ceph osd getmaxosd
+
+Import the given crush map. ::
+
+ ceph osd setcrushmap -i file
+
+Set the ``max_osd`` parameter in the OSD map. This is necessary when
+expanding the storage cluster. ::
+
+ ceph osd setmaxosd {num}
+
+Mark OSD ``{osd-num}`` down. ::
+
+ ceph osd down {osd-num}
+
+Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). ::
+
+ ceph osd out {osd-num}
+
+Mark ``{osd-num}`` in the distribution (i.e. allocated data). ::
+
+ ceph osd in {osd-num}
+
+Set or clear the pause flags in the OSD map. If set, no IO requests
+will be sent to any OSD. Clearing the flags via unpause results in
+resending pending requests. ::
+
+ ceph osd pause
+ ceph osd unpause
+
+Set the weight of ``{osd-num}`` to ``{weight}``. Two OSDs with the
+same weight will receive roughly the same number of I/O requests and
+store approximately the same amount of data. ``ceph osd reweight``
+sets an override weight on the OSD. This value is in the range 0 to 1,
+and forces CRUSH to re-place (1-weight) of the data that would
+otherwise live on this drive. It does not change the weights assigned
+to the buckets above the OSD in the crush map, and is a corrective
+measure in case the normal CRUSH distribution is not working out quite
+right. For instance, if one of your OSDs is at 90% and the others are
+at 50%, you could reduce this weight to try and compensate for it. ::
+
+ ceph osd reweight {osd-num} {weight}
+
+Reweights all the OSDs by reducing the weight of OSDs which are
+heavily overused. By default it will adjust the weights downward on
+OSDs which have 120% of the average utilization, but if you include
+threshold it will use that percentage instead. ::
+
+ ceph osd reweight-by-utilization [threshold]
+
+Describes what reweight-by-utilization would do. ::
+
+ ceph osd test-reweight-by-utilization
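+
+The selection rule behind ``reweight-by-utilization`` can be illustrated with
+a short sketch. It models only which OSDs count as overused relative to the
+average, assuming "exceeds the threshold percentage of the mean" as the test;
+the function name ``overused_osds`` and the input format are hypothetical,
+and how far Ceph actually lowers each weight is not modelled.
+

```python
from typing import Dict, List


def overused_osds(utilization: Dict[str, float],
                  threshold: float = 120.0) -> List[str]:
    """Return the OSDs whose utilization exceeds `threshold` percent of the mean.

    `utilization` maps an OSD name to its used fraction (0.0 to 1.0).
    """
    if not utilization:
        return []
    mean = sum(utilization.values()) / len(utilization)
    limit = mean * threshold / 100.0
    return sorted(name for name, used in utilization.items() if used > limit)


if __name__ == "__main__":
    sample = {"osd.0": 0.9, "osd.1": 0.5, "osd.2": 0.5, "osd.3": 0.5}
    print(overused_osds(sample))
```

+
+Running ``ceph osd test-reweight-by-utilization`` first remains the
+authoritative way to preview what the cluster itself would change.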
+
+Adds/removes the address to/from the blacklist. When adding an address,
+you can specify how long it should be blacklisted in seconds; otherwise,
+it will default to 1 hour. A blacklisted address is prevented from
+connecting to any OSD. Blacklisting is most often used to prevent a
+lagging metadata server from making bad changes to data on the OSDs.
+
+These commands are mostly only useful for failure testing, as
+blacklists are normally maintained automatically and shouldn't need
+manual intervention. ::
+
+ ceph osd blacklist add ADDRESS[:source_port] [TIME]
+ ceph osd blacklist rm ADDRESS[:source_port]
+
+Creates/deletes a snapshot of a pool. ::
+
+ ceph osd pool mksnap {pool-name} {snap-name}
+ ceph osd pool rmsnap {pool-name} {snap-name}
+
+Creates/deletes/renames a storage pool. ::
+
+ ceph osd pool create {pool-name} pg_num [pgp_num]
+ ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+ ceph osd pool rename {old-name} {new-name}
+
+Changes a pool setting. ::
+
+ ceph osd pool set {pool-name} {field} {value}
+
+Valid fields are:
+
+ * ``size``: Sets the number of copies of data in the pool.
+ * ``pg_num``: The placement group number.
+ * ``pgp_num``: Effective number when calculating pg placement.
+ * ``crush_ruleset``: rule number for mapping placement.
+
+Get the value of a pool setting. ::
+
+ ceph osd pool get {pool-name} {field}
+
+Valid fields are:
+
+ * ``pg_num``: The placement group number.
+ * ``pgp_num``: Effective number of placement groups when calculating placement.
+ * ``lpg_num``: The number of local placement groups.
+ * ``lpgp_num``: The number used for placing the local placement groups.
+
+
+Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. ::
+
+ ceph osd scrub {osd-num}
+
+Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. ::
+
+ ceph osd repair N
+
+Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES``
+in write requests of ``BYTES_PER_WRITE`` each. By default, the test
+writes 1 GB in total in 4-MB increments.
+The benchmark is non-destructive and will not overwrite existing live
+OSD data, but might temporarily affect the performance of clients
+concurrently accessing the OSD. ::
+
+ ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
+
+
+MDS Subsystem
+=============
+
+Change configuration parameters on a running mds. ::
+
+ ceph tell mds.{mds-id} injectargs --{switch} {value} [--{switch} {value}]
+
+Example (enables debug messages)::
+
+ ceph tell mds.0 injectargs --debug_ms 1 --debug_mds 10
+
+Display the status of all metadata servers. ::
+
+ ceph mds stat
+
+Mark the active MDS as failed, triggering failover to a standby if present. ::
+
+ ceph mds fail 0
+
+.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
+
+
+Mon Subsystem
+=============
+
+Show monitor stats::
+
+ ceph mon stat
+
+ e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
+
+
+The ``quorum`` list at the end lists monitor nodes that are part of the current quorum.
+
+This is also available more directly::
+
+ ceph quorum_status -f json-pretty
+
+.. code-block:: javascript
+
+ {
+ "election_epoch": 6,
+ "quorum": [
+ 0,
+ 1,
+ 2
+ ],
+ "quorum_names": [
+ "a",
+ "b",
+ "c"
+ ],
+ "quorum_leader_name": "a",
+ "monmap": {
+ "epoch": 2,
+ "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+ "modified": "2016-12-26 14:42:09.288066",
+ "created": "2016-12-26 14:42:03.573585",
+ "features": {
+ "persistent": [
+ "kraken"
+ ],
+ "optional": []
+ },
+ "mons": [
+ {
+ "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:40000\/0",
+ "public_addr": "127.0.0.1:40000\/0"
+ },
+ {
+ "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:40001\/0",
+ "public_addr": "127.0.0.1:40001\/0"
+ },
+ {
+ "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:40002\/0",
+ "public_addr": "127.0.0.1:40002\/0"
+ }
+ ]
+ }
+ }
+
+
+The above will block until a quorum is reached.
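+
+Because the output is JSON, it is straightforward to consume from a script.
+The sketch below parses a trimmed copy of the sample output shown above and
+reports the leader and any monitors that are in the monmap but outside the
+quorum. The function name ``quorum_summary`` and the keys of its return value
+are hypothetical; the JSON field names are taken from the sample.
+

```python
import json

# Trimmed copy of the sample ``ceph quorum_status -f json-pretty`` output.
SAMPLE = r"""
{
  "election_epoch": 6,
  "quorum": [0, 1, 2],
  "quorum_names": ["a", "b", "c"],
  "quorum_leader_name": "a",
  "monmap": {
    "epoch": 2,
    "mons": [
      {"rank": 0, "name": "a", "addr": "127.0.0.1:40000\/0"},
      {"rank": 1, "name": "b", "addr": "127.0.0.1:40001\/0"},
      {"rank": 2, "name": "c", "addr": "127.0.0.1:40002\/0"}
    ]
  }
}
"""


def quorum_summary(raw: str) -> dict:
    """Extract the leader, the quorum members, and monitors missing from quorum."""
    status = json.loads(raw)
    in_quorum = set(status["quorum_names"])
    all_mons = [mon["name"] for mon in status["monmap"]["mons"]]
    return {
        "leader": status["quorum_leader_name"],
        "in_quorum": sorted(in_quorum),
        "missing": sorted(set(all_mons) - in_quorum),
    }


if __name__ == "__main__":
    print(quorum_summary(SAMPLE))
```

+
+Comparing ``quorum_names`` against the monmap is a simple way to spot a
+monitor that is configured but not currently participating.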
+
+For a status of just the monitor you connect to (use ``-m HOST:PORT``
+to select)::
+
+ ceph mon_status -f json-pretty
+
+
+.. code-block:: javascript
+
+ {
+ "name": "b",
+ "rank": 1,
+ "state": "peon",
+ "election_epoch": 6,
+ "quorum": [
+ 0,
+ 1,
+ 2
+ ],
+ "features": {
+ "required_con": "9025616074522624",
+ "required_mon": [
+ "kraken"
+ ],
+ "quorum_con": "1152921504336314367",
+ "quorum_mon": [
+ "kraken"
+ ]
+ },
+ "outside_quorum": [],
+ "extra_probe_peers": [],
+ "sync_provider": [],
+ "monmap": {
+ "epoch": 2,
+ "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+ "modified": "2016-12-26 14:42:09.288066",
+ "created": "2016-12-26 14:42:03.573585",
+ "features": {
+ "persistent": [
+ "kraken"
+ ],
+ "optional": []
+ },
+ "mons": [
+ {
+ "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:40000\/0",
+ "public_addr": "127.0.0.1:40000\/0"
+ },
+ {
+ "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:40001\/0",
+ "public_addr": "127.0.0.1:40001\/0"
+ },
+ {
+ "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:40002\/0",
+ "public_addr": "127.0.0.1:40002\/0"
+ }
+ ]
+ }
+ }
+
+A dump of the monitor state::
+
+ ceph mon dump
+
+ dumped monmap epoch 2
+ epoch 2
+ fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
+ last_changed 2016-12-26 14:42:09.288066
+ created 2016-12-26 14:42:03.573585
+ 0: 127.0.0.1:40000/0 mon.a
+ 1: 127.0.0.1:40001/0 mon.b
+ 2: 127.0.0.1:40002/0 mon.c
+
diff --git a/src/ceph/doc/rados/operations/crush-map-edits.rst b/src/ceph/doc/rados/operations/crush-map-edits.rst
new file mode 100644
index 0000000..5222270
--- /dev/null
+++ b/src/ceph/doc/rados/operations/crush-map-edits.rst
@@ -0,0 +1,654 @@
+Manually editing a CRUSH Map
+============================
+
+.. note:: Manually editing the CRUSH map is considered an advanced
+ administrator operation. All CRUSH changes that are
+ necessary for the overwhelming majority of installations are
+ possible via the standard ceph CLI and do not require manual
+ CRUSH map edits. If you have identified a use case where
+ manual edits *are* necessary, consider contacting the Ceph
+ developers so that future versions of Ceph can make this
+ unnecessary.
+
+To edit an existing CRUSH map:
+
+#. `Get the CRUSH map`_.
+#. `Decompile`_ the CRUSH map.
+#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
+#. `Recompile`_ the CRUSH map.
+#. `Set the CRUSH map`_.
+
+To activate CRUSH map rules for a specific pool, identify the common ruleset
+number for those rules and specify that ruleset number for the pool. See `Set
+Pool Values`_ for details.
+
+.. _Get the CRUSH map: #getcrushmap
+.. _Decompile: #decompilecrushmap
+.. _Devices: #crushmapdevices
+.. _Buckets: #crushmapbuckets
+.. _Rules: #crushmaprules
+.. _Recompile: #compilecrushmap
+.. _Set the CRUSH map: #setcrushmap
+.. _Set Pool Values: ../pools#setpoolvalues
+
+.. _getcrushmap:
+
+Get a CRUSH Map
+---------------
+
+To get the CRUSH map for your cluster, execute the following::
+
+ ceph osd getcrushmap -o {compiled-crushmap-filename}
+
+Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
+the CRUSH map is in a compiled form, you must decompile it first before you can
+edit it.
+
+.. _decompilecrushmap:
+
+Decompile a CRUSH Map
+---------------------
+
+To decompile a CRUSH map, execute the following::
+
+ crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
+
+
+Sections
+--------
+
+There are six main sections to a CRUSH Map.
+
+#. **tunables:** The preamble at the top of the map describes any *tunables*
+ for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These
+ correct for old bugs, optimizations, or other changes in behavior that have
+ been made over the years to improve CRUSH's behavior.
+
+#. **devices:** Devices are individual ``ceph-osd`` daemons that can
+ store data.
+
+#. **types**: Bucket ``types`` define the types of buckets used in
+ your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
+ of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
+ their assigned weights.
+
+#. **buckets:** Once you define bucket types, you must define each node
+ in the hierarchy, its type, and which devices or other nodes it
+ contains.
+
+#. **rules:** Rules define policy about how data is distributed across
+ devices in the hierarchy.
+
+#. **choose_args:** Choose_args are alternative weights associated with
+ the hierarchy that have been adjusted to optimize data placement. A single
+ choose_args map can be used for the entire cluster, or one can be
+ created for each individual pool.
+
+
+.. _crushmapdevices:
+
+CRUSH Map Devices
+-----------------
+
+Devices are individual ``ceph-osd`` daemons that can store data. You
+will normally have one defined here for each OSD daemon in your
+cluster. Devices are identified by an id (a non-negative integer) and
+a name, normally ``osd.N`` where ``N`` is the device id.
+
+Devices may also have a *device class* associated with them (e.g.,
+``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
+crush rule.
+
+::
+
+ # devices
+ device {num} {osd.name} [class {class}]
+
+For example::
+
+ # devices
+ device 0 osd.0 class ssd
+ device 1 osd.1 class hdd
+ device 2 osd.2
+ device 3 osd.3
+
+In most cases, each device maps to a single ``ceph-osd`` daemon. This
+is normally a single storage device, a pair of devices (for example,
+one for data and one for a journal or metadata), or in some cases a
+small RAID device.
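+
+When auditing a decompiled map, it can help to read the device list
+programmatically. The sketch below parses lines of the form
+``device {num} {osd.name} [class {class}]`` as documented above. The names
+``Device``, ``DEVICE_RE``, and ``parse_devices`` are hypothetical, and the
+pattern assumes one device per line with single-token names.
+

```python
import re
from typing import List, NamedTuple, Optional


class Device(NamedTuple):
    num: int
    name: str
    device_class: Optional[str]


# Matches: device {num} {osd.name} [class {class}]
DEVICE_RE = re.compile(r"^device\s+(\d+)\s+(\S+)(?:\s+class\s+(\S+))?\s*$")


def parse_devices(crush_text: str) -> List[Device]:
    """Collect the device entries from the text of a decompiled CRUSH map."""
    devices = []
    for line in crush_text.splitlines():
        match = DEVICE_RE.match(line.strip())
        if match:
            num, name, device_class = match.groups()
            devices.append(Device(int(num), name, device_class))
    return devices


if __name__ == "__main__":
    example = "# devices\ndevice 0 osd.0 class ssd\ndevice 2 osd.2\n"
    for device in parse_devices(example):
        print(device)
```

+
+Devices without a ``class`` clause come back with ``device_class`` set to
+``None``, which makes unclassified OSDs easy to list.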
+
+
+
+
+
+CRUSH Map Bucket Types
+----------------------
+
+The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
+a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
+physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
+Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
+media.
+
+.. tip:: The term "bucket" used in the context of CRUSH means a node in
+ the hierarchy, i.e. a location or a piece of physical hardware. It
+ is a different concept from the term "bucket" when used in the
+ context of RADOS Gateway APIs.
+
+To add a bucket type to the CRUSH map, create a new line under your list of
+bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
+By convention, there is one leaf bucket and it is ``type 0``; however, you may
+give it any name you like (e.g., osd, disk, drive, storage, etc.)::
+
+ #types
+ type {num} {bucket-name}
+
+For example::
+
+ # types
+ type 0 osd
+ type 1 host
+ type 2 chassis
+ type 3 rack
+ type 4 row
+ type 5 pdu
+ type 6 pod
+ type 7 room
+ type 8 datacenter
+ type 9 region
+ type 10 root
+
+
+
+.. _crushmapbuckets:
+
+CRUSH Map Bucket Hierarchy
+--------------------------
+
+The CRUSH algorithm distributes data objects among storage devices according
+to a per-device weight value, approximating a uniform probability distribution.
+CRUSH distributes objects and their replicas according to the hierarchical
+cluster map you define. Your CRUSH map represents the available storage
+devices and the logical elements that contain them.
+
+To map placement groups to OSDs across failure domains, a CRUSH map defines a
+hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
+map). The purpose of creating a bucket hierarchy is to segregate the
+leaf nodes by their failure domains, such as hosts, chassis, racks, power
+distribution units, pods, rows, rooms, and data centers. With the exception of
+the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
+you may define it according to your own needs.
+
+We recommend adapting your CRUSH map to your firm's hardware naming conventions
+and using instance names that reflect the physical hardware. Good naming
+practice makes it easier to administer the cluster and to troubleshoot
+problems when an OSD or other hardware malfunctions and the administrator
+needs access to the physical hardware.
+
+In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
+and two node buckets named ``host`` and ``rack`` respectively.
+
+.. ditaa::
+                           +-----------+
+                           | {o}rack   |
+                           |   Bucket  |
+                           +-----+-----+
+                                 |
+                 +---------------+---------------+
+                 |                               |
+           +-----+-----+                   +-----+-----+
+           | {o}host   |                   | {o}host   |
+           |   Bucket  |                   |   Bucket  |
+           +-----+-----+                   +-----+-----+
+                 |                               |
+         +-------+-------+               +-------+-------+
+         |               |               |               |
+   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
+   |    osd    |   |    osd    |   |    osd    |   |    osd    |
+   |   Bucket  |   |   Bucket  |   |   Bucket  |   |   Bucket  |
+   +-----------+   +-----------+   +-----------+   +-----------+
+
+.. note:: The higher numbered ``rack`` bucket type aggregates the lower
+ numbered ``host`` bucket type.
+
+Since leaf nodes reflect storage devices declared under the ``#devices`` list
+at the beginning of the CRUSH map, you do not need to declare them as bucket
+instances. The second lowest bucket type in your hierarchy usually aggregates
+the devices (i.e., it's usually the computer containing the storage media, and
+uses whatever term you prefer to describe it, such as "node", "computer",
+"server," "host", "machine", etc.). In high density environments, it is
+increasingly common to see multiple hosts/nodes per chassis. You should account
+for chassis failure too--e.g., the need to pull a chassis if a node fails may
+result in bringing down numerous hosts/nodes and their OSDs.
+
+When declaring a bucket instance, you must specify its type, give it a unique
+name (string), assign it a unique ID expressed as a negative integer (optional),
+specify a weight relative to the total capacity/capability of its item(s),
+specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``,
+reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
+The items may consist of node buckets or leaves. Items may have a weight that
+reflects the relative weight of the item.
+
+You may declare a node bucket with the following syntax::
+
+ [bucket-type] [bucket-name] {
+ id [a unique negative numeric ID]
+ weight [the relative capacity/capability of the item(s)]
+ alg [the bucket type: uniform | list | tree | straw ]
+ hash [the hash type: 0 by default]
+ item [item-name] weight [weight]
+ }
+
+For example, using the diagram above, we would define two host buckets
+and one rack bucket. The OSDs are declared as items within the host buckets::
+
+ host node1 {
+ id -1
+ alg straw
+ hash 0
+ item osd.0 weight 1.00
+ item osd.1 weight 1.00
+ }
+
+ host node2 {
+ id -2
+ alg straw
+ hash 0
+ item osd.2 weight 1.00
+ item osd.3 weight 1.00
+ }
+
+ rack rack1 {
+ id -3
+ alg straw
+ hash 0
+ item node1 weight 2.00
+ item node2 weight 2.00
+ }
+
+.. note:: In the foregoing example, note that the rack bucket does not contain
+ any OSDs. Rather it contains lower level host buckets, and includes the
+ sum total of their weight in the item entry.
+
+.. topic:: Bucket Types
+
+ Ceph supports four bucket types, each representing a tradeoff between
+ performance and reorganization efficiency. If you are unsure of which bucket
+ type to use, we recommend using a ``straw`` bucket. For a detailed
+ discussion of bucket types, refer to
+ `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
+ and more specifically to **Section 3.4**. The bucket types are:
+
+ #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same
+ weight. For example, when firms commission or decommission hardware, they
+ typically do so with many machines that have exactly the same physical
+ configuration (e.g., bulk purchases). When storage devices have exactly
+ the same weight, you may use the ``uniform`` bucket type, which allows
+ CRUSH to map replicas into uniform buckets in constant time. With
+ non-uniform weights, you should use another bucket algorithm.
+
+ #. **List**: List buckets aggregate their content as linked lists. Based on
+ the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
+ a list is a natural and intuitive choice for an **expanding cluster**:
+ either an object is relocated to the newest device with some appropriate
+ probability, or it remains on the older devices as before. The result is
+ optimal data migration when items are added to the bucket. Items removed
+ from the middle or tail of the list, however, can result in a significant
+ amount of unnecessary movement, making list buckets most suitable for
+ circumstances in which they **never (or very rarely) shrink**.
+
+ #. **Tree**: Tree buckets use a binary search tree. They are more efficient
+ than list buckets when a bucket contains a larger set of items. Based on
+ the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
+ tree buckets reduce the placement time to O(log :sub:`n`), making them
+ suitable for managing much larger sets of devices or nested buckets.
+
+ #. **Straw:** List and Tree buckets use a divide and conquer strategy
+ in a way that either gives certain items precedence (e.g., those
+ at the beginning of a list) or obviates the need to consider entire
+ subtrees of items at all. That improves the performance of the replica
+ placement process, but can also introduce suboptimal reorganization
+ behavior when the contents of a bucket change due to an addition, removal,
+ or re-weighting of an item. The straw bucket type allows all items to
+ fairly “compete” against each other for replica placement through a
+ process analogous to a draw of straws.
+
+.. topic:: Hash
+
+ Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
+ Enter ``0`` as your hash setting to select ``rjenkins1``.
+
+
+.. _weightingbucketitems:
+
+.. topic:: Weighting Bucket Items
+
+ Ceph expresses bucket weights as doubles, which allows for fine
+ weighting. A weight is the relative difference between device capacities. We
+ recommend using ``1.00`` as the relative weight for a 1TB storage device.
+ In such a scenario, a weight of ``0.5`` would represent approximately 500GB,
+ and a weight of ``3.00`` would represent approximately 3TB. Higher level
+ buckets have a weight that is the sum total of the leaf items aggregated by
+ the bucket.
+
+ A bucket item weight is one dimensional, but you may also calculate your
+ item weights to reflect the performance of the storage drive. For example,
+ if you have many 1TB drives where some have relatively low data transfer
+ rate and the others have a relatively high data transfer rate, you may
+ weight them differently, even though they have the same capacity (e.g.,
+ a weight of 0.80 for the first set of drives with lower total throughput,
+ and 1.20 for the second set of drives with higher total throughput).
+
+
+.. _crushmaprules:
+
+CRUSH Map Rules
+---------------
+
+CRUSH maps support the notion of 'CRUSH rules', which are the rules that
+determine data placement for a pool. For large clusters, you will likely create
+many pools where each pool may have its own CRUSH ruleset and rules. The default
+CRUSH map has a rule for each pool, and one ruleset assigned to each of the
+default pools.
+
+.. note:: In most cases, you will not need to modify the default rules. When
+ you create a new pool, its default ruleset is ``0``.
+
+
+CRUSH rules define placement and replication strategies or distribution policies
+that allow you to specify exactly how CRUSH places object replicas. For
+example, you might create a rule selecting a pair of targets for 2-way
+mirroring, another rule for selecting three targets in two different data
+centers for 3-way mirroring, and yet another rule for erasure coding over six
+storage devices. For a detailed discussion of CRUSH rules, refer to
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
+and more specifically to **Section 3.2**.
+
+A rule takes the following form::
+
+ rule <rulename> {
+
+ ruleset <ruleset>
+ type [ replicated | erasure ]
+ min_size <min-size>
+ max_size <max-size>
+ step take <bucket-name> [class <device-class>]
+ step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
+ step emit
+ }
+
+
+``ruleset``
+
+:Description: A means of classifying a rule as belonging to a set of rules.
+ Activated by `setting the ruleset in a pool`_.
+
+:Purpose: A component of the rule mask.
+:Type: Integer
+:Required: Yes
+:Default: 0
+
+.. _setting the ruleset in a pool: ../pools#setpoolvalues
+
+
+``type``
+
+:Description: Declares whether the rule applies to a replicated pool
+ or to an erasure-coded pool.
+
+:Purpose: A component of the rule mask.
+:Type: String
+:Required: Yes
+:Default: ``replicated``
+:Valid Values: Currently only ``replicated`` and ``erasure``
+
+``min_size``
+
+:Description: If a pool makes fewer replicas than this number, CRUSH will
+ **NOT** select this rule.
+
+:Type: Integer
+:Purpose: A component of the rule mask.
+:Required: Yes
+:Default: ``1``
+
+``max_size``
+
+:Description: If a pool makes more replicas than this number, CRUSH will
+ **NOT** select this rule.
+
+:Type: Integer
+:Purpose: A component of the rule mask.
+:Required: Yes
+:Default: 10
+
+
+``step take <bucket-name> [class <device-class>]``
+
+:Description: Takes a bucket name, and begins iterating down the tree.
+ If the ``device-class`` is specified, it must match
+ a class previously used when defining a device. All
+ devices that do not belong to the class are excluded.
+:Purpose: A component of the rule.
+:Required: Yes
+:Example: ``step take data``
+
+
+``step choose firstn {num} type {bucket-type}``
+
+:Description: Selects the number of buckets of the given type. The number is
+ usually the number of replicas in the pool (i.e., pool size).
+
+ - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
+ - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
+ - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|`` (for example, ``-1`` selects one fewer than the pool size).
+
+:Purpose: A component of the rule.
+:Prerequisite: Follows ``step take`` or ``step choose``.
+:Example: ``step choose firstn 1 type row``
+
+
+``step chooseleaf firstn {num} type {bucket-type}``
+
+:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
+ node from the subtree of each bucket in the set of buckets. The
+ number of buckets in the set is usually the number of replicas in
+ the pool (i.e., pool size).
+
+ - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
+ - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
+ - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|`` (for example, ``-1`` selects one fewer than the pool size).
+
+:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
+:Prerequisite: Follows ``step take`` or ``step choose``.
+:Example: ``step chooseleaf firstn 0 type row``
+
+
+
+``step emit``
+
+:Description: Outputs the current value and empties the stack. Typically used
+ at the end of a rule, but may also be used to pick from different
+ trees in the same rule.
+
+:Purpose: A component of the rule.
+:Prerequisite: Follows ``step choose``.
+:Example: ``step emit``
+
+.. important:: To activate one or more rules with a common ruleset number to a
+ pool, set the ruleset number of the pool.
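+
+As an illustration of how the fields above fit together, the following is a
+minimal sketch of a replicated rule that places each replica in a different
+rack. The rule name ``replicated_racks``, the ruleset number ``1``, and the
+``default`` root are assumptions chosen for this example; adjust them to match
+your own map::
+
+    rule replicated_racks {
+        # hypothetical ruleset number; must not clash with existing rules
+        ruleset 1
+        type replicated
+        min_size 1
+        max_size 10
+        # start at the (assumed) root bucket named "default"
+        step take default
+        # pick pool-size racks, then one OSD beneath each rack
+        step chooseleaf firstn 0 type rack
+        step emit
+    }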
+
+
+Placing Different Pools on Different OSDs:
+==========================================
+
+Suppose you want to have most pools default to OSDs backed by large hard drives,
+but have some pools mapped to OSDs backed by fast solid-state drives (SSDs).
+It's possible to have multiple independent CRUSH hierarchies within the same
+CRUSH map. Define two hierarchies with two different root nodes--one for hard
+disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown
+below::
+
+ device 0 osd.0
+ device 1 osd.1
+ device 2 osd.2
+ device 3 osd.3
+ device 4 osd.4
+ device 5 osd.5
+ device 6 osd.6
+ device 7 osd.7
+
+ host ceph-osd-ssd-server-1 {
+ id -1
+ alg straw
+ hash 0
+ item osd.0 weight 1.00
+ item osd.1 weight 1.00
+ }
+
+ host ceph-osd-ssd-server-2 {
+ id -2
+ alg straw
+ hash 0
+ item osd.2 weight 1.00
+ item osd.3 weight 1.00
+ }
+
+ host ceph-osd-platter-server-1 {
+ id -3
+ alg straw
+ hash 0
+ item osd.4 weight 1.00
+ item osd.5 weight 1.00
+ }
+
+ host ceph-osd-platter-server-2 {
+ id -4
+ alg straw
+ hash 0
+ item osd.6 weight 1.00
+ item osd.7 weight 1.00
+ }
+
+ root platter {
+ id -5
+ alg straw
+ hash 0
+ item ceph-osd-platter-server-1 weight 2.00
+ item ceph-osd-platter-server-2 weight 2.00
+ }
+
+ root ssd {
+ id -6
+ alg straw
+ hash 0
+ item ceph-osd-ssd-server-1 weight 2.00
+ item ceph-osd-ssd-server-2 weight 2.00
+ }
+
+ rule data {
+ ruleset 0
+ type replicated
+ min_size 2
+ max_size 2
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule metadata {
+ ruleset 1
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule rbd {
+ ruleset 2
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule platter {
+ ruleset 3
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule ssd {
+ ruleset 4
+ type replicated
+ min_size 0
+ max_size 4
+ step take ssd
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule ssd-primary {
+ ruleset 5
+ type replicated
+ min_size 5
+ max_size 10
+ step take ssd
+ step chooseleaf firstn 1 type host
+ step emit
+ step take platter
+ step chooseleaf firstn -1 type host
+ step emit
+ }
+
+You can then set a pool to use the SSD rule by::
+
+ ceph osd pool set <poolname> crush_ruleset 4
+
+Similarly, using the ``ssd-primary`` rule will cause each placement group in the
+pool to be placed with an SSD as the primary and platters as the replicas.
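+
+Before assigning a rule to a pool, it may help to check offline which OSDs the
+rule would select. A sketch using ``crushtool`` in test mode, assuming the
+compiled map has been saved to ``/tmp/crush`` (an illustrative path) and that
+the ``ssd`` rule has rule id ``4``::
+
+    # print the OSDs chosen for a range of sample inputs with 2 replicas
+    crushtool -i /tmp/crush --test --rule 4 --num-rep 2 --show-mappings
+
+    # report only inputs for which fewer than 2 OSDs could be selected
+    crushtool -i /tmp/crush --test --rule 4 --num-rep 2 --show-bad-mappings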
+
+
+Tuning CRUSH, the hard way
+--------------------------
+
+If you can ensure that all clients are running recent code, you can
+adjust the tunables by extracting the CRUSH map, modifying the values,
+and reinjecting it into the cluster.
+
+* Extract the latest CRUSH map::
+
+ ceph osd getcrushmap -o /tmp/crush
+
+* Adjust tunables. These values appear to offer the best behavior
+ for both large and small clusters we tested with. You will need to
+ additionally specify the ``--enable-unsafe-tunables`` argument to
+ ``crushtool`` for this to work. Please use this option with
+ extreme care::
+
+ crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
+
+* Reinject modified map::
+
+ ceph osd setcrushmap -i /tmp/crush.new
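+
+Before the reinjection step above, it can be worth confirming that the new map
+carries the intended values. One possible check is to decompile the map and
+look at its ``tunable`` lines; the output path ``/tmp/crush.new.txt`` is only
+an example::
+
+    # decompile the modified binary map into editable text
+    crushtool -d /tmp/crush.new -o /tmp/crush.new.txt
+
+    # list the tunables recorded in the decompiled map
+    grep tunable /tmp/crush.new.txt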
+
+Legacy values
+-------------
+
+For reference, the legacy values for the CRUSH tunables can be set
+with::
+
+ crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
+
+Again, the special ``--enable-unsafe-tunables`` option is required.
+Further, as noted above, be careful running old versions of the
+``ceph-osd`` daemon after reverting to legacy values as the feature
+bit is not perfectly enforced.
diff --git a/src/ceph/doc/rados/operations/crush-map.rst b/src/ceph/doc/rados/operations/crush-map.rst
new file mode 100644
index 0000000..05fa4ff
--- /dev/null
+++ b/src/ceph/doc/rados/operations/crush-map.rst
@@ -0,0 +1,956 @@
+============
+ CRUSH Maps
+============
+
+The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
+determines how to store and retrieve data by computing data storage locations.
+CRUSH empowers Ceph clients to communicate with OSDs directly rather than
+through a centralized server or broker. With an algorithmically determined
+method of storing and retrieving data, Ceph avoids a single point of failure, a
+performance bottleneck, and a physical limit to its scalability.
+
+CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
+store and retrieve data in OSDs with a uniform distribution of data across the
+cluster. For a detailed discussion of CRUSH, see
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_
+
+CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
+'buckets' for aggregating the devices into physical locations, and a list of
+rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
+reflecting the underlying physical organization of the installation, CRUSH can
+model—and thereby address—potential sources of correlated device failures.
+Typical sources include physical proximity, a shared power source, and a shared
+network. By encoding this information into the cluster map, CRUSH placement
+policies can separate object replicas across different failure domains while
+still maintaining the desired distribution. For example, to address the
+possibility of concurrent failures, it may be desirable to ensure that data
+replicas are on devices using different shelves, racks, power supplies,
+controllers, and/or physical locations.
+
+When you deploy OSDs they are automatically placed within the CRUSH map under a
+``host`` node named with the hostname for the host they are running on. This,
+combined with the default CRUSH failure domain, ensures that replicas or erasure
+code shards are separated across hosts and a single host failure will not
+affect availability. For larger clusters, however, administrators should
+carefully consider their choice of failure domain. Separating replicas across
+racks, for example, is common for mid- to large-sized clusters.
+
+
+CRUSH Location
+==============
+
+The location of an OSD in terms of the CRUSH map's hierarchy is
+referred to as a ``crush location``. This location specifier takes the
+form of a list of key and value pairs describing a position. For
+example, if an OSD is in a particular row, rack, chassis and host, and
+is part of the 'default' CRUSH tree (this is the case for the vast
+majority of clusters), its crush location could be described as::
+
+ root=default row=a rack=a2 chassis=a2a host=a2a1
+
+Note:
+
+#. The order of the keys does not matter.
+#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
+ these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
+ but those types can be customized to be anything appropriate by modifying
+ the CRUSH map.
+#. Not all keys need to be specified. For example, by default, Ceph
+ automatically sets a ``ceph-osd`` daemon's location to be
+ ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).
+
+The crush location for an OSD is normally expressed via the ``crush location``
+config option being set in the ``ceph.conf`` file. Each time the OSD starts,
+it verifies it is in the correct location in the CRUSH map and, if it is not,
+it moves itself. To disable this automatic CRUSH map management, add the
+following to your configuration file in the ``[osd]`` section::
+
+ osd crush update on start = false
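+
+For comparison, a sketch of setting an explicit location in ``ceph.conf``.
+The rack and host names are the illustrative values used earlier in this
+section, not defaults, and the setting is assumed to live in the configuration
+file of the host that runs the OSDs::
+
+    [osd]
+    # illustrative values; replace with this host's real position
+    crush location = root=default rack=a2 host=a2a1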
+
+
+Custom location hooks
+---------------------
+
+A customized location hook can be used to generate a more complete
+crush location on startup. The sample ``ceph-crush-location`` utility
+will generate a CRUSH location string for a given daemon. The
+location is based on, in order of preference:
+
+#. A ``crush location`` option in ceph.conf.
+#. A default of ``root=default host=HOSTNAME`` where the hostname is
+ generated with the ``hostname -s`` command.
+
+This is not useful by itself, as the OSD itself has the exact same
+behavior. However, the script can be modified to provide additional
+location fields (for example, the rack or datacenter), and then the
+hook enabled via the config option::
+
+ crush location hook = /path/to/customized-ceph-crush-location
+
+This hook is passed several arguments (below) and should output a single line
+to stdout with the CRUSH location description::
+
+ $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE
+
+where the cluster name is typically 'ceph', the id is the daemon
+identifier (the OSD number), and the daemon type is typically ``osd``.
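+
+A minimal sketch of such a hook is shown below. It is not the shipped
+``ceph-crush-location`` script: the file ``/etc/ceph/rack`` is a hypothetical
+site-specific file assumed to contain the rack name on a single line, and the
+arguments passed in by Ceph are ignored for simplicity::
+
+    #!/bin/sh
+    # Hypothetical location hook: print one CRUSH location line to stdout.
+    # The --cluster, --id and --type arguments are accepted but not used.
+    RACK_FILE=/etc/ceph/rack   # assumed file holding this host's rack name
+
+    if [ -s "$RACK_FILE" ]; then
+        # rack name is known: include it in the location
+        echo "root=default rack=$(cat "$RACK_FILE") host=$(hostname -s)"
+    else
+        # fall back to the same location the OSD would use by default
+        echo "root=default host=$(hostname -s)"
+    fi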
+
+
+CRUSH structure
+===============
+
+The CRUSH map consists of, loosely speaking, a hierarchy describing
+the physical topology of the cluster, and a set of rules defining
+policy about how we place data on those devices. The hierarchy has
+devices (``ceph-osd`` daemons) at the leaves, and internal nodes
+corresponding to other physical features or groupings: hosts, racks,
+rows, datacenters, and so on. The rules describe how replicas are
+placed in terms of that hierarchy (e.g., 'three replicas in different
+racks').
+
+Devices
+-------
+
+Devices are individual ``ceph-osd`` daemons that can store data. You
+will normally have one defined here for each OSD daemon in your
+cluster. Devices are identified by an id (a non-negative integer) and
+a name, normally ``osd.N`` where ``N`` is the device id.
+
+Devices may also have a *device class* associated with them (e.g.,
+``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
+crush rule.
+
+Types and Buckets
+-----------------
+
+A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
+racks, rows, etc. The CRUSH map defines a series of *types* that are
+used to describe these nodes. By default, these types include:
+
+- osd (or device)
+- host
+- chassis
+- rack
+- row
+- pdu
+- pod
+- room
+- datacenter
+- region
+- root
+
+Most clusters make use of only a handful of these types, and others
+can be defined as needed.
+
+The hierarchy is built with devices (normally type ``osd``) at the
+leaves, interior nodes with non-device types, and a root node of type
+``root``. For example,
+
+.. ditaa::
+
+                        +-----------------+
+                        | {o}root default |
+                        +--------+--------+
+                                 |
+                 +---------------+---------------+
+                 |                               |
+         +-------+-------+                 +-----+-------+
+         | {o}host foo   |                 | {o}host bar |
+         +-------+-------+                 +-----+-------+
+                 |                               |
+         +-------+-------+               +-------+-------+
+         |               |               |               |
+   +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
+   |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
+   +-----------+   +-----------+   +-----------+   +-----------+
+
+Each node (device or bucket) in the hierarchy has a *weight*
+associated with it, indicating the relative proportion of the total
+data that device or hierarchy subtree should store. Weights are set
+at the leaves, indicating the size of the device, and automatically
+sum up the tree from there, such that the weight of the default node
+will be the total of all devices contained beneath it. Normally
+weights are in units of terabytes (TB).
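+
+As a hand-worked illustration of how weights sum up the tree, suppose each of
+the four devices in the hierarchy above is a 4 TB drive. The device sizes are
+assumed for the example, and the listing is not command output::
+
+    root default    16.0  (8.0 + 8.0)
+        host foo     8.0  (4.0 + 4.0)
+            osd.0    4.0
+            osd.1    4.0
+        host bar     8.0  (4.0 + 4.0)
+            osd.2    4.0
+            osd.3    4.0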
+
+You can get a simple view of the CRUSH hierarchy for your cluster,
+including the weights, with::
+
+ ceph osd crush tree
+
+Rules
+-----
+
+Rules define policy about how data is distributed across the devices
+in the hierarchy.
+
+CRUSH rules define placement and replication strategies or
+distribution policies that allow you to specify exactly how CRUSH
+places object replicas. For example, you might create a rule selecting
+a pair of targets for 2-way mirroring, another rule for selecting
+three targets in two different data centers for 3-way mirroring, and
+yet another rule for erasure coding over six storage devices. For a
+detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
+Scalable, Decentralized Placement of Replicated Data`_, and more
+specifically to **Section 3.2**.
+
+In almost all cases, CRUSH rules can be created via the CLI by
+specifying the *pool type* they will be used for (replicated or
+erasure coded), the *failure domain*, and optionally a *device class*.
+In rare cases rules must be written by hand by manually editing the
+CRUSH map.
+
+You can see what rules are defined for your cluster with::
+
+ ceph osd crush rule ls
+
+You can view the contents of the rules with::
+
+ ceph osd crush rule dump
+
+Device classes
+--------------
+
+Each device can optionally have a *class* associated with it. By
+default, OSDs automatically set their class on startup to either
+``hdd``, ``ssd``, or ``nvme`` based on the type of device they are backed
+by.
+
+The device class for one or more OSDs can be explicitly set with::
+
+ ceph osd crush set-device-class <class> <osd-name> [...]
+
+Once a device class is set, it cannot be changed to another class
+until the old class is unset with::
+
+ ceph osd crush rm-device-class <osd-name> [...]
+
+This allows administrators to set device classes without the class
+being changed on OSD restart or by some other script.
+
+A placement rule that targets a specific device class can be created with::
+
+ ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+
+A pool can then be changed to use the new rule with::
+
+ ceph osd pool set <pool-name> crush_rule <rule-name>
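+
+Put together, the commands above might be used as follows. This is a sketch
+only: the OSD names, the rule name ``fast``, and the pool name ``mypool`` are
+hypothetical, and the ``default`` root with a ``host`` failure domain is
+assumed::
+
+    # clear the automatically detected class, then mark both OSDs as ssd
+    ceph osd crush rm-device-class osd.2 osd.3
+    ceph osd crush set-device-class ssd osd.2 osd.3
+
+    # create a replicated rule restricted to ssd devices under "default"
+    ceph osd crush rule create-replicated fast default host ssd
+
+    # switch an existing pool over to the new rule
+    ceph osd pool set mypool crush_rule fast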
+
+Device classes are implemented by creating a "shadow" CRUSH hierarchy
+for each device class in use that contains only devices of that class.
+Rules can then distribute data over the shadow hierarchy. One nice
+thing about this approach is that it is fully backward compatible with
+old Ceph clients. You can view the CRUSH hierarchy with shadow items
+with::
+
+ ceph osd crush tree --show-shadow
+
+
+Weight sets
+-----------
+
+A *weight set* is an alternative set of weights to use when
+calculating data placement. The normal weights associated with each
+device in the CRUSH map are set based on the device size and indicate
+how much data we *should* be storing where. However, because CRUSH is
+based on a pseudorandom placement process, there is always some
+variation from this ideal distribution, the same way that rolling a
+die sixty times will not result in rolling exactly 10 ones and 10
+sixes. Weight sets allow the cluster to do a numerical optimization
+based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
+a balanced distribution.
+
+There are two types of weight sets supported:
+
+ #. A **compat** weight set is a single alternative set of weights for
+ each device and node in the cluster. This is not well-suited for
+ correcting for all anomalies (for example, placement groups for
+ different pools may be different sizes and have different load
+ levels, but will be mostly treated the same by the balancer).
+ However, compat weight sets have the huge advantage that they are
+ *backward compatible* with previous versions of Ceph, which means
+ that even though weight sets were first introduced in Luminous
+ v12.2.z, older clients (e.g., firefly) can still connect to the
+ cluster when a compat weight set is being used to balance data.
+ #. A **per-pool** weight set is more flexible in that it allows
+ placement to be optimized for each data pool. Additionally,
+ weights can be adjusted for each position of placement, allowing
+ the optimizer to correct for a subtle skew of data toward devices
+ with small weights relative to their peers (an effect that is
+ usually only apparent in very large clusters but which can cause
+ balancing problems).
+
+When weight sets are in use, the weights associated with each node in
+the hierarchy are visible as a separate column (labeled either
+``(compat)`` or the pool name) from the command::
+
+ ceph osd crush tree
+
+When both *compat* and *per-pool* weight sets are in use, data
+placement for a particular pool will use its own per-pool weight set
+if present. If not, it will use the compat weight set if present. If
+neither is present, it will use the normal CRUSH weights.
+
+Although weight sets can be set up and manipulated by hand, it is
+recommended that the *balancer* module be enabled to do so
+automatically.
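+
+A sketch of enabling the balancer so that it maintains a compat weight set,
+assuming a Luminous or later cluster with a running ``ceph-mgr`` daemon::
+
+    # load the balancer module into the manager
+    ceph mgr module enable balancer
+
+    # use the backward-compatible (compat weight set) mode
+    ceph balancer mode crush-compat
+
+    # start automatic balancing and check what it is doing
+    ceph balancer on
+    ceph balancer status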
+
+
+Modifying the CRUSH map
+=======================
+
+.. _addosd:
+
+Add/Move an OSD
+---------------
+
+.. note:: OSDs are normally automatically added to the CRUSH map when
+ the OSD is created. This command is rarely needed.
+
+To add or move an OSD in the CRUSH map of a running cluster::
+
+ ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+``weight``
+
+:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
+:Type: Double
+:Required: Yes
+:Example: ``2.0``
+
+
+``root``
+
+:Description: The root node of the tree in which the OSD resides (normally ``default``)
+:Type: Key/value pair.
+:Required: Yes
+:Example: ``root=default``
+
+
+``bucket-type``
+
+:Description: You may specify the OSD's location in the CRUSH hierarchy.
+:Type: Key/value pairs.
+:Required: No
+:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
+
+
+The following example adds ``osd.0`` to the hierarchy, or moves the
+OSD from a previous location. ::
+
+ ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
+
+
+Adjust OSD weight
+-----------------
+
+.. note:: Normally OSDs automatically add themselves to the CRUSH map
+ with the correct weight when they are created. This command
+ is rarely needed.
+
+To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
+the following::
+
+ ceph osd crush reweight {name} {weight}
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+``weight``
+
+:Description: The CRUSH weight for the OSD.
+:Type: Double
+:Required: Yes
+:Example: ``2.0``
+
+
+.. _removeosd:
+
+Remove an OSD
+-------------
+
+.. note:: OSDs are normally removed from the CRUSH as part of the
+ ``ceph osd purge`` command. This command is rarely needed.
+
+To remove an OSD from the CRUSH map of a running cluster, execute the
+following::
+
+ ceph osd crush remove {name}
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+Add a Bucket
+------------
+
+.. note:: Buckets are normally implicitly created when an OSD is added
+ that specifies a ``{bucket-type}={bucket-name}`` as part of its
+ location and a bucket with that name does not already exist. This
+ command is typically used when manually adjusting the structure of the
+ hierarchy after OSDs have been created (for example, to move a
+ series of hosts underneath a new rack-level bucket).
+
+To add a bucket in the CRUSH map of a running cluster, execute the
+``ceph osd crush add-bucket`` command::
+
+ ceph osd crush add-bucket {bucket-name} {bucket-type}
+
+Where:
+
+``bucket-name``
+
+:Description: The full name of the bucket.
+:Type: String
+:Required: Yes
+:Example: ``rack12``
+
+
+``bucket-type``
+
+:Description: The type of the bucket. The type must already exist in the hierarchy.
+:Type: String
+:Required: Yes
+:Example: ``rack``
+
+
+The following example adds the ``rack12`` bucket to the hierarchy::
+
+ ceph osd crush add-bucket rack12 rack
+
+Move a Bucket
+-------------
+
+To move a bucket to a different location or position in the CRUSH map
+hierarchy, execute the following::
+
+ ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]
+
+Where:
+
+``bucket-name``
+
+:Description: The name of the bucket to move/reposition.
+:Type: String
+:Required: Yes
+:Example: ``foo-bar-1``
+
+``bucket-type``
+
+:Description: You may specify the bucket's location in the CRUSH hierarchy.
+:Type: Key/value pairs.
+:Required: No
+:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
+
+Remove a Bucket
+---------------
+
+To remove a bucket from the CRUSH map hierarchy, execute the following::
+
+ ceph osd crush remove {bucket-name}
+
+.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.
+
+Where:
+
+``bucket-name``
+
+:Description: The name of the bucket that you'd like to remove.
+:Type: String
+:Required: Yes
+:Example: ``rack12``
+
+The following example removes the ``rack12`` bucket from the hierarchy::
+
+ ceph osd crush remove rack12
+
+Creating a compat weight set
+----------------------------
+
+.. note:: This step is normally done automatically by the ``balancer``
+ module when enabled.
+
+To create a *compat* weight set::
+
+ ceph osd crush weight-set create-compat
+
+Weights for the compat weight set can be adjusted with::
+
+ ceph osd crush weight-set reweight-compat {name} {weight}
+
+The compat weight set can be destroyed with::
+
+ ceph osd crush weight-set rm-compat
+
+Creating per-pool weight sets
+-----------------------------
+
+To create a weight set for a specific pool::
+
+ ceph osd crush weight-set create {pool-name} {mode}
+
+.. note:: Per-pool weight sets require that all servers and daemons
+ run Luminous v12.2.z or later.
+
+Where:
+
+``pool-name``
+
+:Description: The name of a RADOS pool
+:Type: String
+:Required: Yes
+:Example: ``rbd``
+
+``mode``
+
+:Description: Either ``flat`` or ``positional``. A *flat* weight set
+ has a single weight for each device or bucket. A
+ *positional* weight set has a potentially different
+ weight for each position in the resulting placement
+ mapping. For example, if a pool has a replica count of
+ 3, then a positional weight set will have three weights
+ for each device and bucket.
+:Type: String
+:Required: Yes
+:Example: ``flat``
+
+To adjust the weight of an item in a weight set::
+
+ ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
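+
+The difference between the two modes comes down to how many weights
+each item carries. The sketch below makes that explicit; the function
+names are hypothetical, and the only rule encoded is the one described
+above (one weight for ``flat``, one per replica position for
+``positional``).
+

```python
def weights_per_item(mode, replica_count):
    """Number of weights each device or bucket carries in a weight set."""
    if mode == "flat":
        return 1
    if mode == "positional":
        return replica_count
    raise ValueError(f"unknown weight-set mode: {mode!r}")


def reweight_args(pool, item, weights, mode, replica_count):
    """Build arguments for `ceph osd crush weight-set reweight`."""
    expected = weights_per_item(mode, replica_count)
    if len(weights) != expected:
        raise ValueError(
            f"{mode} weight set needs {expected} weight(s), got {len(weights)}")
    return ["ceph", "osd", "crush", "weight-set", "reweight", pool, item,
            *[str(w) for w in weights]]


# A positional weight set on a pool with three replicas takes three weights.
print(reweight_args("rbd", "osd.0", [1.0, 0.9, 0.8], "positional", 3))
```
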
+
+To list existing weight sets::
+
+ ceph osd crush weight-set ls
+
+To remove a weight set::
+
+ ceph osd crush weight-set rm {pool-name}
+
+Creating a rule for a replicated pool
+-------------------------------------
+
+For a replicated pool, the primary decision when creating the CRUSH
+rule is what the failure domain is going to be. For example, if a
+failure domain of ``host`` is selected, then CRUSH will ensure that
+each replica of the data is stored on a different host. If ``rack``
+is selected, then each replica will be stored in a different rack.
+What failure domain you choose primarily depends on the size of your
+cluster and how your hierarchy is structured.
+
+Normally, the entire cluster hierarchy is nested beneath a root node
+named ``default``. If you have customized your hierarchy, you may
+want to create a rule nested at some other node in the hierarchy. It
+doesn't matter what type is associated with that node (it doesn't have
+to be a ``root`` node).
+
+It is also possible to create a rule that restricts data placement to
+a specific *class* of device. By default, Ceph OSDs automatically
+classify themselves as either ``hdd`` or ``ssd``, depending on the
+underlying type of device being used. These classes can also be
+customized.
+
+To create a replicated rule::
+
+ ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
+
+Where:
+
+``name``
+
+:Description: The name of the rule
+:Type: String
+:Required: Yes
+:Example: ``rbd-rule``
+
+``root``
+
+:Description: The name of the node under which data should be placed.
+:Type: String
+:Required: Yes
+:Example: ``default``
+
+``failure-domain-type``
+
+:Description: The type of CRUSH nodes across which we should separate replicas.
+:Type: String
+:Required: Yes
+:Example: ``rack``
+
+``class``
+
+:Description: The device class data should be placed on.
+:Type: String
+:Required: No
+:Example: ``ssd``
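+
+To make the failure-domain guarantee concrete, the sketch below checks
+whether a given placement keeps every replica in a distinct bucket of
+the chosen type. It does not implement CRUSH; ``separated_by`` and the
+sample hierarchy are hypothetical and exist only to illustrate the
+constraint that a rule enforces.
+

```python
def separated_by(failure_domain, placement, location_of):
    """Return True if every OSD in `placement` sits in a distinct
    bucket of type `failure_domain`."""
    buckets = [location_of[osd][failure_domain] for osd in placement]
    return len(set(buckets)) == len(buckets)


# Hypothetical hierarchy: three hosts spread over two racks.
location_of = {
    "osd.0": {"host": "host-a", "rack": "rack1"},
    "osd.1": {"host": "host-b", "rack": "rack1"},
    "osd.2": {"host": "host-c", "rack": "rack2"},
}
placement = ["osd.0", "osd.1", "osd.2"]

print(separated_by("host", placement, location_of))  # each replica on its own host
print(separated_by("rack", placement, location_of))  # osd.0 and osd.1 share rack1
```

+
+A placement that is acceptable for a ``host`` failure domain can still
+violate a ``rack`` failure domain, which is why the choice depends on
+how the hierarchy is structured.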
+
+Creating a rule for an erasure coded pool
+-----------------------------------------
+
+For an erasure-coded pool, the same basic decisions need to be made as
+with a replicated pool: what is the failure domain, what node in the
+hierarchy will data be placed under (usually ``default``), and will
+placement be restricted to a specific device class. Erasure code
+pools are created a bit differently, however, because they need to be
+constructed carefully based on the erasure code being used. For this reason,
+you must include this information in the *erasure code profile*. A CRUSH
+rule will then be created from that either explicitly or automatically when
+the profile is used to create a pool.
+
+The erasure code profiles can be listed with::
+
+ ceph osd erasure-code-profile ls
+
+An existing profile can be viewed with::
+
+ ceph osd erasure-code-profile get {profile-name}
+
+Normally profiles should never be modified; instead, a new profile
+should be created and used when creating a new pool or creating a new
+rule for an existing pool.
+
+An erasure code profile consists of a set of key=value pairs. Most of
+these control the behavior of the erasure code that is encoding data
+in the pool. Those that begin with ``crush-``, however, affect the
+CRUSH rule that is created.
+
+The erasure code profile properties of interest are:
+
+ * **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
+ * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
+ * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
+ * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
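+
+The effect of **k** and **m** on a rule is mostly arithmetic. The
+sketch below assumes a plain ``k``/``m`` code without the ``lrc``
+plugin's extra local chunks. The function names are hypothetical, and
+the overhead formula is general erasure-coding arithmetic rather than
+anything specific to Ceph.
+

```python
def shard_count(k, m):
    """Number of shards CRUSH must place for each object."""
    return k + m


def storage_overhead(k, m):
    """Raw bytes stored per byte of user data."""
    return (k + m) / k


def enough_failure_domains(k, m, available):
    """True if there are enough buckets of the crush-failure-domain type
    to put every shard in a different one."""
    return available >= shard_count(k, m)


k, m = 4, 2
print(shard_count(k, m))                # shards per object
print(storage_overhead(k, m))           # raw-to-usable ratio
print(enough_failure_domains(k, m, 5))  # five hosts cannot hold six shards
```
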
+
+Once a profile is defined, you can create a CRUSH rule with::
+
+ ceph osd crush rule create-erasure {name} {profile-name}
+
+.. note:: When creating a new pool, it is not actually necessary to
+ explicitly create the rule. If the erasure code profile alone is
+ specified and the rule argument is left off then Ceph will create
+ the CRUSH rule automatically.
+
+Deleting rules
+--------------
+
+Rules that are not in use by pools can be deleted with::
+
+ ceph osd crush rule rm {rule-name}
+
+
+Tunables
+========
+
+Over time, we have made (and continue to make) improvements to the
+CRUSH algorithm used to calculate the placement of data. In order to
+support the change in behavior, we have introduced a series of tunable
+options that control whether the legacy or improved variation of the
+algorithm is used.
+
+In order to use newer tunables, both clients and servers must support
+the new version of CRUSH. For this reason, we have created
+``profiles`` that are named after the Ceph version in which they were
+introduced. For example, the ``firefly`` tunables are first supported
+in the firefly release, and will not work with older (e.g., dumpling)
+clients. Once a given set of tunables is changed from the legacy
+default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent
+older clients that do not support the new CRUSH features from connecting
+to the cluster.
+
+argonaut (legacy)
+-----------------
+
+The legacy CRUSH behavior used by argonaut and older releases works
+fine for most clusters, provided there are not too many OSDs that have
+been marked out.
+
+bobtail (CRUSH_TUNABLES2)
+-------------------------
+
+The bobtail tunable profile fixes a few key misbehaviors:
+
+ * For hierarchies with a small number of devices in the leaf buckets,
+ some PGs map to fewer than the desired number of replicas. This
+ commonly happens for hierarchies with "host" nodes with a small
+ number (1-3) of OSDs nested beneath each one.
+
+ * For large clusters, a small percentage of PGs map to fewer than
+ the desired number of OSDs. This is more prevalent when there are
+ several layers of the hierarchy (e.g., row, rack, host, osd).
+
+ * When some OSDs are marked out, the data tends to get redistributed
+ to nearby OSDs instead of across the entire hierarchy.
+
+The new tunables are:
+
+ * ``choose_local_tries``: Number of local retries. Legacy value is
+ 2, optimal value is 0.
+
+ * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
+ is 0.
+
+ * ``choose_total_tries``: Total number of attempts to choose an item.
+ Legacy value was 19; subsequent testing indicates that a value of
+ 50 is more appropriate for typical clusters. For extremely large
+ clusters, a larger value might be necessary.
+
+ * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
+ will retry, or only try once and allow the original placement to
+ retry. Legacy default is 0, optimal value is 1.
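+
+The legacy and optimal values listed above can be kept side by side to
+spot which tunables a cluster still has at legacy settings. This is a
+small sketch: the dictionaries restate the values from the list, and
+``non_optimal`` is a hypothetical helper, not a Ceph API.
+

```python
LEGACY = {
    "choose_local_tries": 2,
    "choose_local_fallback_tries": 5,
    "choose_total_tries": 19,
    "chooseleaf_descend_once": 0,
}

BOBTAIL = {
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
}


def non_optimal(current, optimal=BOBTAIL):
    """Map each tunable that differs to a (current, optimal) pair."""
    return {name: (current.get(name), want)
            for name, want in optimal.items()
            if current.get(name) != want}


# A cluster where only choose_total_tries has been raised so far.
current = dict(LEGACY, choose_total_tries=50)
for name, (have, want) in non_optimal(current).items():
    print(f"{name}: {have} -> {want}")
```
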
+
+Migration impact:
+
+ * Moving from argonaut to bobtail tunables triggers a moderate amount
+ of data movement. Use caution on a cluster that is already
+ populated with data.
+
+firefly (CRUSH_TUNABLES3)
+-------------------------
+
+The firefly tunable profile fixes a problem
+with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
+mappings with too few results when too many OSDs have been marked out.
+
+The new tunable is:
+
+ * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
+ start with a non-zero value of r, based on how many attempts the
+ parent has already made. Legacy default is 0, but with this value
+ CRUSH is sometimes unable to find a mapping. The optimal value (in
+ terms of computational cost and correctness) is 1.
+
+Migration impact:
+
+ * For existing clusters that have lots of existing data, changing
+ from 0 to 1 will cause a lot of data to move; a value of 4 or 5
+ will allow CRUSH to find a valid mapping but will make less data
+ move.
+
+straw_calc_version tunable (introduced with Firefly too)
+--------------------------------------------------------
+
+There were some problems with the internal weights calculated and
+stored in the CRUSH map for ``straw`` buckets. Specifically, when
+there were items with a CRUSH weight of 0, or both a mix of weights and
+some duplicated weights, CRUSH would distribute data incorrectly (i.e.,
+not in proportion to the weights).
+
+The new tunable is:
+
+ * ``straw_calc_version``: A value of 0 preserves the old, broken
+ internal weight calculation; a value of 1 fixes the behavior.
+
+Migration impact:
+
+ * Moving to straw_calc_version 1 and then adjusting a straw bucket
+ (by adding, removing, or reweighting an item, or by using the
+ reweight-all command) can trigger a small to moderate amount of
+ data movement *if* the cluster has hit one of the problematic
+ conditions.
+
+This tunable option is special because it has no impact on the kernel
+version required on the client side.
+
+hammer (CRUSH_V4)
+-----------------
+
+The hammer tunable profile does not affect the
+mapping of existing CRUSH maps simply by changing the profile. However:
+
+ * There is a new bucket type (``straw2``) supported. The new
+ ``straw2`` bucket type fixes several limitations in the original
+ ``straw`` bucket. Specifically, the old ``straw`` buckets would
+ change some mappings that should not have changed when a weight was
+ adjusted, while ``straw2`` achieves the original goal of only
+ changing mappings to or from the bucket item whose weight has
+ changed.
+
+ * ``straw2`` is the default for any newly created buckets.
+
+Migration impact:
+
+ * Changing a bucket type from ``straw`` to ``straw2`` will result in
+ a reasonably small amount of data movement, depending on how much
+ the bucket item weights vary from each other. When the weights are
+ all the same no data will move, and when item weights vary
+ significantly there will be more movement.
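+
+The property that ``straw2`` restores can be shown with a toy model.
+In the sketch below every item draws ``ln(u) / weight`` from a
+per-item hash and the largest draw wins. The SHA-256 hash and the
+floating-point arithmetic are stand-ins chosen for readability, so
+treat this as an illustration of the idea rather than Ceph's
+implementation.
+

```python
import hashlib
import math


def _unit_hash(x, item):
    """Deterministic pseudo-random value in (0, 1] for an (input, item) pair."""
    digest = hashlib.sha256(f"{x}:{item}".encode()).digest()
    return (int.from_bytes(digest[:6], "big") + 1) / 2**48


def straw2_choose(x, weights):
    """Pick the item whose scaled draw ln(u) / weight is largest."""
    best_item, best_draw = None, -math.inf
    for item, weight in weights.items():
        if weight <= 0:
            continue  # zero-weight items never win
        draw = math.log(_unit_hash(x, item)) / weight
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item


weights = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}
before = {x: straw2_choose(x, weights) for x in range(2000)}

# Raise the weight of osd.1 only; the other items' draws are unchanged.
after = {x: straw2_choose(x, dict(weights, **{"osd.1": 3.0}))
         for x in range(2000)}

moved = [x for x in before if before[x] != after[x]]
print(len(moved), "inputs moved;",
      "all moved to osd.1:", all(after[x] == "osd.1" for x in moved))
```

+
+Because each item's draw depends only on its own weight, changing one
+weight can only move inputs to or from that item.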
+
+jewel (CRUSH_TUNABLES5)
+-----------------------
+
+The jewel tunable profile improves the
+overall behavior of CRUSH such that significantly fewer mappings
+change when an OSD is marked out of the cluster.
+
+The new tunable is:
+
+ * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
+ use a better value for an inner loop that greatly reduces the number
+ of mapping changes when an OSD is marked out. The legacy value is 0,
+ while the new value of 1 uses the new approach.
+
+Migration impact:
+
+ * Changing this value on an existing cluster will result in a very
+ large amount of data movement as almost every PG mapping is likely
+ to change.
+
+
+
+
+Which client versions support CRUSH_TUNABLES
+--------------------------------------------
+
+ * argonaut series, v0.48.1 or later
+ * v0.49 or later
+ * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_TUNABLES2
+---------------------------------------------
+
+ * v0.55 or later, including bobtail series (v0.56.x)
+ * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_TUNABLES3
+---------------------------------------------
+
+ * v0.78 (firefly) or later
+ * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_V4
+--------------------------------------
+
+ * v0.94 (hammer) or later
+ * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_TUNABLES5
+---------------------------------------------
+
+ * v10.0.2 (jewel) or later
+ * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)
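+
+The kernel requirements in the lists above lend themselves to a quick
+lookup. The sketch below restates those minimum versions and checks a
+kernel version string against a profile. ``kernel_supports`` is a
+hypothetical helper, and it assumes a mainline version number, so it
+cannot account for distribution kernels that backport features.
+

```python
import re

# Minimum mainline kernel for each CRUSH feature, as listed above.
MIN_KERNEL = {
    "CRUSH_TUNABLES": (3, 6),
    "CRUSH_TUNABLES2": (3, 9),
    "CRUSH_TUNABLES3": (3, 15),
    "CRUSH_V4": (4, 1),
    "CRUSH_TUNABLES5": (4, 5),
}

# Feature introduced by each tunables profile, taken from the section titles.
PROFILE_FEATURE = {
    "bobtail": "CRUSH_TUNABLES2",
    "firefly": "CRUSH_TUNABLES3",
    "hammer": "CRUSH_V4",
    "jewel": "CRUSH_TUNABLES5",
}


def parse_kernel(version):
    """Turn a string such as '4.4.0-116-generic' or 'v3.15' into (major, minor)."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse kernel version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_supports(profile, kernel_version):
    """True if a kernel client of this version supports the profile."""
    return parse_kernel(kernel_version) >= MIN_KERNEL[PROFILE_FEATURE[profile]]


print(kernel_supports("hammer", "4.4.0-116-generic"))
print(kernel_supports("jewel", "4.4.0-116-generic"))
```
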
+
+Warning when tunables are non-optimal
+-------------------------------------
+
+Starting with version v0.74, Ceph will issue a health warning if the
+current CRUSH tunables don't include all the optimal values from the
+``default`` profile (see below for the meaning of the ``default`` profile).
+To make this warning go away, you have two options:
+
+1. Adjust the tunables on the existing cluster. Note that this will
+ result in some data movement (possibly as much as 10%). This is the
+ preferred route, but should be taken with care on a production cluster
+ where the data movement may affect performance. You can enable optimal
+ tunables with::
+
+ ceph osd crush tunables optimal
+
+ If things go poorly (e.g., too much load) and not very much
+ progress has been made, or there is a client compatibility problem
+ (old kernel cephfs or rbd clients, or pre-bobtail librados
+ clients), you can switch back with::
+
+ ceph osd crush tunables legacy
+
+2. You can make the warning go away without making any changes to CRUSH by
+ adding the following option to your ceph.conf ``[mon]`` section::
+
+ mon warn on legacy crush tunables = false
+
+ For the change to take effect, you will need to restart the monitors, or
+ apply the option to running monitors with::
+
+ ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
+
+
+A few important points
+----------------------
+
+ * Adjusting these values will result in the shift of some PGs between
+ storage nodes. If the Ceph cluster is already storing a lot of
+ data, be prepared for some fraction of the data to move.
+ * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
+ feature bits of new connections as soon as they get
+ the updated map. However, already-connected clients are
+ effectively grandfathered in, and will misbehave if they do not
+ support the new feature.
+ * If the CRUSH tunables are set to non-legacy values and then later
+ changed back to the default values, ``ceph-osd`` daemons will not be
+ required to support the feature. However, the OSD peering process
+ requires examining and understanding old maps. Therefore, you
+ should not run old versions of the ``ceph-osd`` daemon
+ if the cluster has previously used non-legacy CRUSH values, even if
+ the latest version of the map has been switched back to using the
+ legacy defaults.
+
+Tuning CRUSH
+------------
+
+The simplest way to adjust the CRUSH tunables is by changing to a known
+profile. Those are:
+
+ * ``legacy``: the legacy behavior from argonaut and earlier.
+ * ``argonaut``: the legacy values supported by the original argonaut release
+ * ``bobtail``: the values supported by the bobtail release
+ * ``firefly``: the values supported by the firefly release
+ * ``hammer``: the values supported by the hammer release
+ * ``jewel``: the values supported by the jewel release
+ * ``optimal``: the best (i.e., optimal) values of the current version of Ceph
+ * ``default``: the default values of a new cluster installed from
+ scratch. These values, which depend on the current version of Ceph,
+ are hard coded and are generally a mix of optimal and legacy values.
+ These values generally match the ``optimal`` profile of the previous
+ LTS release, or the most recent release for which we generally expect
+ most users to have up-to-date clients.
+
+You can select a profile on a running cluster with the command::
+
+ ceph osd crush tunables {PROFILE}
+
+Note that this may result in some data movement.
+
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
+
+
+Primary Affinity
+================
+
+When a Ceph Client reads or writes data, it always contacts the primary OSD in
+the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
+OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
+a slow disk or a slow controller). To prevent performance bottlenecks
+(especially on read operations) while maximizing utilization of your hardware,
+you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
+the OSD as a primary in an acting set. ::
+
+ ceph osd primary-affinity <osd-id> <weight>
+
+Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
+may set an OSD's primary affinity to any value from ``0`` to ``1``, where ``0``
+means that the OSD may **NOT** be used as a primary and ``1`` means that an OSD
+may be used as a primary. When the weight is ``< 1``, it is less likely that
+CRUSH will select the Ceph OSD Daemon to act as a primary.
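+
+The effect of the weight can be pictured with a simplified model: walk
+the acting set in order, accept each OSD with a probability equal to
+its primary affinity, and fall back to the first OSD if none is
+accepted. The sketch below is an assumption-laden illustration of that
+idea. The hash is a stand-in and the code is not Ceph's implementation.
+

```python
import hashlib


def _unit_hash(pg_seed, osd):
    """Deterministic pseudo-random value in [0, 1) for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_seed}:{osd}".encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def pick_primary(pg_seed, acting, affinity):
    """Return the first OSD in `acting` that passes its affinity check,
    or the first OSD if none passes. Missing entries default to 1."""
    for osd in acting:
        if _unit_hash(pg_seed, osd) < affinity.get(osd, 1.0):
            return osd
    return acting[0]


acting = [2, 3, 4]
pgs = range(1000)

as_default = sum(pick_primary(pg, acting, {}) == 2 for pg in pgs)
as_half = sum(pick_primary(pg, acting, {2: 0.5}) == 2 for pg in pgs)
as_zero = sum(pick_primary(pg, acting, {2: 0.0}) == 2 for pg in pgs)
print(as_default, as_half, as_zero)
```

+
+In this model an affinity of ``1`` keeps ``osd.2`` as primary for every
+PG, ``0`` never selects it while another OSD is eligible, and values in
+between select it for a corresponding share of PGs.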
+
+
+
diff --git a/src/ceph/doc/rados/operations/data-placement.rst b/src/ceph/doc/rados/operations/data-placement.rst
new file mode 100644
index 0000000..27966b0
--- /dev/null
+++ b/src/ceph/doc/rados/operations/data-placement.rst
@@ -0,0 +1,37 @@
+=========================
+ Data Placement Overview
+=========================
+
+Ceph stores, replicates and rebalances data objects across a RADOS cluster
+dynamically. With many different users storing objects in different pools for
+different purposes on countless OSDs, Ceph operations require some data
+placement planning. The main data placement planning concepts in Ceph include:
+
+- **Pools:** Ceph stores data within pools, which are logical groups for storing
+ objects. Pools manage the number of placement groups, the number of replicas,
+ and the ruleset for the pool. To store data in a pool, you must have
+ an authenticated user with permissions for the pool. Ceph can snapshot pools.
+ See `Pools`_ for additional details.
+
+- **Placement Groups:** Ceph maps objects to placement groups (PGs).
+ Placement groups (PGs) are shards or fragments of a logical object pool
+ that place objects as a group into OSDs. Placement groups reduce the amount
+ of per-object metadata when Ceph stores the data in OSDs. A larger number of
+ placement groups (e.g., 100 per OSD) leads to better balancing. See
+ `Placement Groups`_ for additional details.
+
+- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without
+ performance bottlenecks, without limitations to scalability, and without a
+ single point of failure. CRUSH maps provide the physical topology of the
+ cluster to the CRUSH algorithm to determine where the data for an object
+ and its replicas should be stored, and how to do so across failure domains
+ for added data safety among other things. See `CRUSH Maps`_ for additional
+ details.
+
+When you initially set up a test cluster, you can use the default values. Once
+you begin planning for a large Ceph cluster, refer to pools, placement groups
+and CRUSH for data placement operations.
+
+.. _Pools: ../pools
+.. _Placement Groups: ../placement-groups
+.. _CRUSH Maps: ../crush-map
diff --git a/src/ceph/doc/rados/operations/erasure-code-isa.rst b/src/ceph/doc/rados/operations/erasure-code-isa.rst
new file mode 100644
index 0000000..b52933a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-isa.rst
@@ -0,0 +1,105 @@
+=======================
+ISA erasure code plugin
+=======================
+
+The *isa* plugin encapsulates the `ISA
+<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_
+library. It only runs on Intel processors.
+
+Create an isa profile
+=====================
+
+To create a new *isa* erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=isa \
+ technique={reed_sol_van|cauchy} \
+ [k={data-chunks}] \
+ [m={coding-chunks}] \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data-chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: No.
+:Default: 7
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 3
+
+``technique={reed_sol_van|cauchy}``
+
+:Description: The ISA plugin comes in two `Reed Solomon
+ <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_
+ forms. If *reed_sol_van* is set, it is `Vandermonde
+ <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if
+ *cauchy* is set, it is `Cauchy
+ <https://en.wikipedia.org/wiki/Cauchy_matrix>`_.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the ruleset. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
diff --git a/src/ceph/doc/rados/operations/erasure-code-jerasure.rst b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst
new file mode 100644
index 0000000..e8da097
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst
@@ -0,0 +1,120 @@
+============================
+Jerasure erasure code plugin
+============================
+
+The *jerasure* plugin is the most generic and flexible plugin; it is
+also the default for Ceph erasure coded pools.
+
+The *jerasure* plugin encapsulates the `Jerasure
+<http://jerasure.org>`_ library. It is
+recommended to read the *jerasure* documentation to get a better
+understanding of the parameters.
+
+Create a jerasure profile
+=========================
+
+To create a new *jerasure* erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=jerasure \
+ k={data-chunks} \
+ m={coding-chunks} \
+ technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data-chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}``
+
+:Description: The most flexible technique is *reed_sol_van*: it is
+ enough to set *k* and *m*. The *cauchy_good* technique
+ can be faster but you need to choose the *packetsize*
+ carefully. All of *reed_sol_r6_op*, *liberation*,
+ *blaum_roth*, *liber8tion* are *RAID6* equivalents in
+ the sense that they can only be configured with *m=2*.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van
+
+``packetsize={bytes}``
+
+:Description: The encoding will be done on packets of *bytes* size at
+ a time. Choosing the right packet size is difficult. The
+ *jerasure* documentation contains extensive information
+ on this topic.
+
+:Type: Integer
+:Required: No.
+:Default: 2048
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the ruleset. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
diff --git a/src/ceph/doc/rados/operations/erasure-code-lrc.rst b/src/ceph/doc/rados/operations/erasure-code-lrc.rst
new file mode 100644
index 0000000..447ce23
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-lrc.rst
@@ -0,0 +1,371 @@
+======================================
+Locally repairable erasure code plugin
+======================================
+
+With the *jerasure* plugin, when an erasure coded object is stored on
+multiple OSDs, recovering from the loss of one OSD requires reading
+from *k* of the others. For instance if *jerasure* is configured with
+*k=8* and *m=4*, losing one OSD requires reading from eight of the
+others to repair.
+
+The *lrc* erasure code plugin creates local parity chunks to be able
+to recover using fewer OSDs. For instance if *lrc* is configured with
+*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
+every four OSDs. When a single OSD is lost, it can be recovered with
+only four OSDs instead of eight.
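+
+The chunk counts implied by *k*, *m* and *l* can be worked out
+directly. This sketch assumes one extra local parity chunk per group
+of *l* chunks and that *k + m* is a multiple of *l*, as the plugin
+requires. ``lrc_chunk_counts`` is a hypothetical helper.
+

```python
def lrc_chunk_counts(k, m, l):
    """Count the chunks an lrc profile stores for each object."""
    if (k + m) % l != 0:
        raise ValueError("k + m must be a multiple of l")
    local_parity = (k + m) // l  # one additional parity chunk per group of l
    return {
        "data": k,
        "coding": m,
        "local_parity": local_parity,
        "total": k + m + local_parity,
    }


print(lrc_chunk_counts(8, 4, 4))  # the k=8, m=4, l=4 example above
print(lrc_chunk_counts(4, 2, 3))  # the profile used in the examples below
```

+
+The local parity chunks add to the storage cost, so a profile trades a
+little extra space for cheaper single-OSD recovery.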
+
+Erasure code profile examples
+=============================
+
+Reduce recovery bandwidth between hosts
+---------------------------------------
+
+Although it is probably not an interesting use case when all hosts are
+connected to the same switch, reduced bandwidth usage can actually be
+observed::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ k=4 m=2 l=3 \
+ crush-failure-domain=host
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+Reduce recovery bandwidth between racks
+---------------------------------------
+
+In Firefly the reduced bandwidth will only be observed if the primary
+OSD is in the same rack as the lost chunk::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ k=4 m=2 l=3 \
+ crush-locality=rack \
+ crush-failure-domain=host
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+Create an lrc profile
+=====================
+
+To create a new lrc erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=lrc \
+ k={data-chunks} \
+ m={coding-chunks} \
+ l={locality} \
+ [crush-root={root}] \
+ [crush-locality={bucket-type}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data-chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``l={locality}``
+
+:Description: Group the coding and data chunks into sets of size
+ **locality**. For instance, for **k=4** and **m=2**,
+ when **locality=3** two groups of three are created.
+ Each set can be recovered without reading chunks
+ from another set.
+
+:Type: Integer
+:Required: Yes.
+:Example: 3
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the ruleset. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-locality={bucket-type}``
+
+:Description: The type of the crush bucket in which each set of chunks
+ defined by **l** will be stored. For instance, if it is
+ set to **rack**, each group of **l** chunks will be
+ placed in a different rack. It is used to create a
+ ruleset step such as **step choose rack**. If it is not
+ set, no such grouping is done.
+
+:Type: String
+:Required: No.
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+Low level plugin configuration
+==============================
+
+The sum of **k** and **m** must be a multiple of the **l** parameter.
+The low level configuration parameters do not impose such a
+restriction and it may be more convenient to use them for specific
+purposes. It is for instance possible to define two groups, one with 4
+chunks and another with 3 chunks. It is also possible to recursively
+define locality sets, for instance datacenters and racks within
+datacenters. The **k/m/l** parameters are implemented by generating a
+low level configuration.
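The restriction on **k**, **m** and **l** can be checked before a profile is created. The following is a minimal Python sketch, not part of Ceph: the function name ``lrc_kml_layout`` is hypothetical, and the assumption that one local coding chunk is added per locality group is inferred from the eight-chunk mapping that this document gives as equivalent to **k=4**, **m=2**, **l=3**.

```python
def lrc_kml_layout(k, m, l):
    """Return (group_count, total_chunks) for simple k/m/l parameters.

    Hypothetical helper for illustration only; it is not a Ceph API.
    """
    if (k + m) % l != 0:
        raise ValueError("k + m must be a multiple of l")
    group_count = (k + m) // l
    # Assumption: each locality group gets one extra local coding chunk.
    total_chunks = k + m + group_count
    return group_count, total_chunks


print(lrc_kml_layout(4, 2, 3))
```

With **k=4**, **m=2** and **l=3** this reports two groups and eight chunks, which agrees with the ``__DD__DD`` mapping used in the examples.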
+
+The *lrc* erasure code plugin recursively applies erasure code
+techniques so that recovering from the loss of some chunks only
+requires a subset of the available chunks, most of the time.
+
+For instance, when three coding steps are described as::
+
+ chunk nr 01234567
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+where *c* are coding chunks calculated from the data chunks *D*, the
+loss of chunk *7* can be recovered with the last four chunks, and the
+loss of chunk *2* can be recovered with the first four
+chunks.
+
+Erasure code profile examples using low level configuration
+===========================================================
+
+Minimal testing
+---------------
+
+It is strictly equivalent to using the default erasure code profile. The *DD*
+implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used
+by default::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=DD_ \
+ layers='[ [ "DDc", "" ] ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+Reduce recovery bandwidth between hosts
+---------------------------------------
+
+Although it is probably not an interesting use case when all hosts are
+connected to the same switch, reduced bandwidth usage can actually be
+observed. It is equivalent to **k=4**, **m=2** and **l=3** although
+the layout of the chunks is different::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "" ],
+ [ "cDDD____", "" ],
+ [ "____cDDD", "" ],
+ ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+Reduce recovery bandwidth between racks
+---------------------------------------
+
+In Firefly the reduced bandwidth will only be observed if the primary
+OSD is in the same rack as the lost chunk::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "" ],
+ [ "cDDD____", "" ],
+ [ "____cDDD", "" ],
+ ]' \
+ crush-steps='[
+ [ "choose", "rack", 2 ],
+ [ "chooseleaf", "host", 4 ],
+ ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+Testing with different Erasure Code backends
+--------------------------------------------
+
+LRC now uses jerasure as the default EC backend. It is possible to
+specify the EC backend/algorithm on a per layer basis using the low
+level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
+is actually an erasure code profile to be used for this level. The
+example below specifies the ISA backend with the cauchy technique to
+be used in the lrcpool::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=DD_ \
+ layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+You could also use a different erasure code profile for each
+layer::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
+ [ "cDDD____", "plugin=isa" ],
+ [ "____cDDD", "plugin=jerasure" ],
+ ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+
+Erasure coding and decoding algorithm
+=====================================
+
+The steps found in the layers description::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+are applied in order. For instance, if a 4K object is encoded, it will
+first go through *step 1* and be divided into four 1K chunks (the four
+uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
+order. From these, two coding chunks are calculated (the two lowercase
+c). The coding chunks are stored in the chunks 1 and 5, respectively.
+
+The *step 2* re-uses the content created by *step 1* in a similar
+fashion and stores a single coding chunk *c* at position 0. The last four
+chunks, marked with an underscore (*_*) for readability, are ignored.
+
+The *step 3* stores a single coding chunk *c* at position 4. The three
+chunks created by *step 1* are used to compute this coding chunk,
+i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.
+
+If chunk *2* is lost::
+
+ chunk nr 01234567
+
+ step 1 _c D_cDD
+ step 2 cD D____
+ step 3 __ _cDDD
+
+decoding will attempt to recover it by walking the steps in reverse
+order: *step 3* then *step 2* and finally *step 1*.
+
+The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
+and is skipped.
+
+The coding chunk from *step 2*, stored in chunk *0*, allows it to
+recover the content of chunk *2*. There are no more chunks to recover
+and the process stops, without considering *step 1*.
+
+Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
+back chunk *2*.
+
+If chunks *2, 3, 6* are lost::
+
+ chunk nr 01234567
+
+ step 1 _c _c D
+ step 2 cD __ _
+ step 3 __ cD D
+
+The *step 3* can recover the content of chunk *6*::
+
+ chunk nr 01234567
+
+ step 1 _c _cDD
+ step 2 cD ____
+ step 3 __ cDDD
+
+The *step 2* fails to recover and is skipped because there are two
+chunks missing (*2, 3*) and it can only recover from one missing
+chunk.
+
+The coding chunks from *step 1*, stored in chunks *1, 5*, allow it to
+recover the content of chunks *2, 3*::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
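
The reverse walk described above can be sketched in a few lines of Python. This is an illustration of the reasoning in this section, not the plugin's actual code: the function name ``lrc_recover`` is hypothetical, and it assumes that a layer can repair at most as many missing chunks as it has coding chunks (*c*).

```python
def lrc_recover(layers, lost):
    """Walk the layers in reverse order and report which steps repair chunks.

    Returns (steps_used, still_lost), with steps numbered from 1.
    Assumption: a layer repairs at most as many chunks as it has 'c' entries.
    """
    lost = set(lost)
    steps_used = []
    for step in range(len(layers), 0, -1):
        layer = layers[step - 1]
        # Chunks this layer knows about (everything that is not an underscore).
        members = {i for i, ch in enumerate(layer) if ch != "_"}
        missing = lost & members
        if missing and len(missing) <= layer.count("c"):
            lost -= missing
            steps_used.append(step)
        if not lost:
            break
    return steps_used, lost


LAYERS = ["_cDD_cDD", "cDDD____", "____cDDD"]
print(lrc_recover(LAYERS, {2}))
print(lrc_recover(LAYERS, {2, 3, 6}))
```

For the loss of chunk *2* only *step 2* is used. For the loss of chunks *2, 3, 6* the sketch uses *step 3* and then *step 1*, skipping *step 2*, as in the walkthrough.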
+
+Controlling crush placement
+===========================
+
+The default crush ruleset provides OSDs that are on different hosts. For instance::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+needs exactly *8* OSDs, one for each chunk. If the hosts are in two
+adjacent racks, the first four chunks can be placed in the first rack
+and the last four in the second rack, so that recovering from the loss
+of a single OSD does not require using bandwidth between the two
+racks.
+
+For instance::
+
+ crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'
+
+will create a ruleset that will select two crush buckets of type
+*rack* and for each of them choose four OSDs, each of them located in
+different buckets of type *host*.
+
+The ruleset can also be manually crafted for finer control.
diff --git a/src/ceph/doc/rados/operations/erasure-code-profile.rst b/src/ceph/doc/rados/operations/erasure-code-profile.rst
new file mode 100644
index 0000000..ddf772d
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-profile.rst
@@ -0,0 +1,121 @@
+=====================
+Erasure code profiles
+=====================
+
+Erasure code is defined by a **profile** and is used when creating an
+erasure coded pool and the associated crush ruleset.
+
+The **default** erasure code profile (which is created when the Ceph
+cluster is initialized) provides the same level of redundancy as two
+copies but requires 25% less disk space. It is described as a profile
+with **k=2** and **m=1**, meaning the information is spread over three
+OSDs (k+m == 3) and one of them can be lost.
+
+To improve redundancy without increasing raw storage requirements, a
+new profile can be created. For instance, a profile with **k=10** and
+**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an
+object on fourteen (k+m=14) OSDs. The object is first divided in
+**10** chunks (if the object is 10MB, each chunk is 1MB) and **4**
+coding chunks are computed, for recovery (each coding chunk has the
+same size as the data chunk, i.e. 1MB). The raw space overhead is only
+40% and the object will not be lost even if four OSDs break at the
+same time.
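
The arithmetic in the two paragraphs above can be reproduced with a short Python sketch. The helper names ``ec_overhead`` and ``chunk_size`` are hypothetical and only restate the figures given here; they are not part of any Ceph tool.

```python
def ec_overhead(k, m):
    """Raw space overhead of a k/m profile, as a fraction of the data size."""
    return m / k


def chunk_size(object_size, k):
    """Size of each chunk; assumes object_size divides evenly by k."""
    return object_size // k


MB = 1024 * 1024

# k=10, m=4: overhead and chunk size (in MB) for a 10MB object.
print(ec_overhead(10, 4), chunk_size(10 * MB, 10) // MB)

# Default profile (k=2, m=1) stores 1.5x the data; two copies store 2x.
saving_vs_two_copies = 1 - (1 + ec_overhead(2, 1)) / 2
print(saving_vs_two_copies)
```

The first line shows the 40% overhead and 1MB chunks of the **k=10**, **m=4** profile; the second shows the 25% saving of the default profile compared to two copies.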
+
+.. _list of available plugins:
+
+.. toctree::
+ :maxdepth: 1
+
+ erasure-code-jerasure
+ erasure-code-isa
+ erasure-code-lrc
+ erasure-code-shec
+
+osd erasure-code-profile set
+============================
+
+To create a new erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ [{directory=directory}] \
+ [{plugin=plugin}] \
+ [{stripe_unit=stripe_unit}] \
+ [{key=value} ...] \
+ [--force]
+
+Where:
+
+``{directory=directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``{plugin=plugin}``
+
+:Description: Use the erasure code **plugin** to compute coding chunks
+ and recover missing chunks. See the `list of available
+ plugins`_ for more information.
+
+:Type: String
+:Required: No.
+:Default: jerasure
+
+``{stripe_unit=stripe_unit}``
+
+:Description: The amount of data in a data chunk, per stripe. For
+ example, a profile with 2 data chunks and stripe_unit=4K
+ would put the range 0-4K in chunk 0, 4K-8K in chunk 1,
+ then 8K-12K in chunk 0 again. This should be a multiple
+ of 4K for best performance. The default value is taken
+ from the monitor config option
+ ``osd_pool_erasure_code_stripe_unit`` when a pool is
+ created. The stripe_width of a pool using this profile
+ will be the number of data chunks multiplied by this
+ stripe_unit.
+
+:Type: String
+:Required: No.
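
The ``stripe_unit`` layout described above can be illustrated with a minimal Python sketch. The function names are hypothetical; the mapping simply restates the example of 2 data chunks with stripe_unit=4K.

```python
def chunk_for_offset(offset, k, stripe_unit):
    """Index of the data chunk that holds the byte at `offset`."""
    return (offset // stripe_unit) % k


def stripe_width(k, stripe_unit):
    """Number of data chunks multiplied by the stripe unit."""
    return k * stripe_unit


FOUR_K = 4096
for offset in (0, FOUR_K, 2 * FOUR_K):
    print(offset, chunk_for_offset(offset, k=2, stripe_unit=FOUR_K))
print(stripe_width(2, FOUR_K))
```

The ranges 0-4K and 8K-12K land in chunk 0 and the range 4K-8K lands in chunk 1, giving a stripe_width of 8K.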
+
+``{key=value}``
+
+:Description: The semantics of the remaining key/value pairs are defined
+ by the erasure code plugin.
+
+:Type: String
+:Required: No.
+
+``--force``
+
+:Description: Override an existing profile by the same name, and allow
+ setting a non-4K-aligned stripe_unit.
+
+:Type: String
+:Required: No.
+
+osd erasure-code-profile rm
+============================
+
+To remove an erasure code profile::
+
+ ceph osd erasure-code-profile rm {name}
+
+If the profile is referenced by a pool, the deletion will fail.
+
+osd erasure-code-profile get
+============================
+
+To display an erasure code profile::
+
+ ceph osd erasure-code-profile get {name}
+
+osd erasure-code-profile ls
+===========================
+
+To list the names of all erasure code profiles::
+
+ ceph osd erasure-code-profile ls
+
diff --git a/src/ceph/doc/rados/operations/erasure-code-shec.rst b/src/ceph/doc/rados/operations/erasure-code-shec.rst
new file mode 100644
index 0000000..e3bab37
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-shec.rst
@@ -0,0 +1,144 @@
+========================
+SHEC erasure code plugin
+========================
+
+The *shec* plugin encapsulates the `multiple SHEC
+<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_
+library. It allows ceph to recover data more efficiently than Reed Solomon codes.
+
+Create an SHEC profile
+======================
+
+To create a new *shec* erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=shec \
+ [k={data-chunks}] \
+ [m={coding-chunks}] \
+ [c={durability-estimator}] \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data-chunks}``
+
+:Description: Each object is split in **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: No.
+:Default: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding-chunks** for each object and store them on
+ different OSDs. The number of **coding-chunks** does not necessarily
+ equal the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 3
+
+``c={durability-estimator}``
+
+:Description: The number of parity chunks each of which includes each data chunk in its
+ calculation range. The number is used as a **durability estimator**.
+ For instance, if c=2, 2 OSDs can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 2
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the ruleset. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+Brief description of SHEC's layouts
+===================================
+
+Space Efficiency
+----------------
+
+Space efficiency is the ratio of data chunks to all chunks in an object,
+expressed as k/(k+m).
+To improve space efficiency, increase k or decrease m.
+
+::
+
+ space efficiency of SHEC(4,3,2) = 4/(4+3) = 0.57
+ SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency
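
The comparison above can be checked with a small Python sketch of the k/(k+m) formula. The function name ``shec_space_efficiency`` is hypothetical; the **c** parameter is carried along only for labelling because it does not appear in the formula.

```python
def shec_space_efficiency(k, m):
    """Ratio of data chunks to all chunks, k / (k + m)."""
    return k / (k + m)


for k, m, c in [(4, 3, 2), (5, 3, 2), (4, 2, 2)]:
    print(f"SHEC({k},{m},{c}): {shec_space_efficiency(k, m):.3f}")
```

Both SHEC(5,3,2) and SHEC(4,2,2) report a higher ratio than SHEC(4,3,2).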
+
+Durability
+----------
+
+The third parameter of SHEC (=c) is a durability estimator, which approximates
+the number of OSDs that can be down without losing data.
+
+``durability estimator of SHEC(4,3,2) = 2``
+
+Recovery Efficiency
+-------------------
+
+Describing the calculation of recovery efficiency is beyond the scope of this document,
+but increasing m without increasing c improves recovery efficiency
+(at the cost of lower space efficiency).
+
+``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency``
+
+Erasure code profile examples
+=============================
+
+::
+
+ $ ceph osd erasure-code-profile set SHECprofile \
+ plugin=shec \
+ k=8 m=4 c=3 \
+ crush-failure-domain=host
+ $ ceph osd pool create shecpool 256 256 erasure SHECprofile
diff --git a/src/ceph/doc/rados/operations/erasure-code.rst b/src/ceph/doc/rados/operations/erasure-code.rst
new file mode 100644
index 0000000..6ec5a09
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code.rst
@@ -0,0 +1,195 @@
+=============
+ Erasure code
+=============
+
+A Ceph pool is associated with a type to sustain the loss of an OSD
+(i.e. a disk since most of the time there is one OSD per disk). The
+default choice when `creating a pool <../pools>`_ is *replicated*,
+meaning every object is copied on multiple disks. The `Erasure Code
+<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used
+instead to save space.
+
+Creating a sample erasure coded pool
+------------------------------------
+
+The simplest erasure coded pool is equivalent to `RAID5
+<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
+requires at least three hosts::
+
+ $ ceph osd pool create ecpool 12 12 erasure
+ pool 'ecpool' created
+ $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+ $ rados --pool ecpool get NYAN -
+ ABCDEFGHI
+
+.. note:: the 12 in *pool create* stands for
+ `the number of placement groups <../pools>`_.
+
+Erasure code profiles
+---------------------
+
+The default erasure code profile sustains the loss of a single OSD. It
+is equivalent to a replicated pool of size two but requires 1.5TB
+instead of 2TB to store 1TB of data. The default profile can be
+displayed with::
+
+ $ ceph osd erasure-code-profile get default
+ k=2
+ m=1
+ plugin=jerasure
+ crush-failure-domain=host
+ technique=reed_sol_van
+
+Choosing the right profile is important because it cannot be modified
+after the pool is created: a new pool with a different profile needs
+to be created and all objects from the previous pool moved to the new.
+
+The most important parameters of the profile are *K*, *M* and
+*crush-failure-domain* because they define the storage overhead and
+the data durability. For instance, if the desired architecture must
+sustain the loss of two racks with a storage overhead of 67%,
+the following profile can be defined::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ k=3 \
+ m=2 \
+ crush-failure-domain=rack
+ $ ceph osd pool create ecpool 12 12 erasure myprofile
+ $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+ $ rados --pool ecpool get NYAN -
+ ABCDEFGHI
+
+The *NYAN* object will be divided in three (*K=3*) and two additional
+*chunks* will be created (*M=2*). The value of *M* defines how many
+OSDs can be lost simultaneously without losing any data. The
+*crush-failure-domain=rack* will create a CRUSH ruleset that ensures
+no two *chunks* are stored in the same rack.
+
+.. ditaa::
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ |
+ |
+ v
+ +------+------+
+ +---------------+ encode(3,2) +-----------+
+ | +--+--+---+---+ |
+ | | | | |
+ | +-------+ | +-----+ |
+ | | | | |
+ +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
+ name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 | | 5 |
+ +------+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY | | QGC |
+ +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
+ | | | | |
+ | | v | |
+ | | +--+---+ | |
+ | | | OSD1 | | |
+ | | +------+ | |
+ | | | |
+ | | +------+ | |
+ | +------>| OSD2 | | |
+ | +------+ | |
+ | | |
+ | +------+ | |
+ | | OSD3 |<----+ |
+ | +------+ |
+ | |
+ | +------+ |
+ | | OSD4 |<--------------+
+ | +------+
+ |
+ | +------+
+ +----------------->| OSD5 |
+ +------+
+
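The split of the *NYAN* object into *K=3* data chunks can be sketched in Python. This only illustrates the data chunks (``ABC``, ``DEF``, ``GHI``); the two coding chunks are computed by the erasure code plugin and are not reproduced here. The function name and the zero-padding rule are assumptions made for this illustration.

```python
def split_into_data_chunks(payload, k):
    """Split payload into k equal data chunks.

    Assumption: a payload that does not divide evenly is zero-padded.
    """
    size = -(-len(payload) // k)  # ceiling division
    padded = payload.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


chunks = split_into_data_chunks(b"ABCDEFGHI", 3)
print(chunks)
```

Concatenating the data chunks in order gives back the original content, which is why a read that finds all data chunks does not need the coding chunks.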
+
+More information can be found in the `erasure code profiles
+<../erasure-code-profile>`_ documentation.
+
+
+Erasure Coding with Overwrites
+------------------------------
+
+By default, erasure coded pools only work with uses like RGW that
+perform full object writes and appends.
+
+Since Luminous, partial writes for an erasure coded pool may be
+enabled with a per-pool setting. This lets RBD and Cephfs store their
+data in an erasure coded pool::
+
+ ceph osd pool set ec_pool allow_ec_overwrites true
+
+This can only be enabled on a pool residing on bluestore OSDs, since
+bluestore's checksumming is used to detect bitrot or other corruption
+during deep-scrub. In addition to being unsafe, using filestore with
+ec overwrites yields low performance compared to bluestore.
+
+Erasure coded pools do not support omap, so to use them with RBD and
+Cephfs you must instruct them to store their data in an ec pool, and
+their metadata in a replicated pool. For RBD, this means using the
+erasure coded pool as the ``--data-pool`` during image creation::
+
+ rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
+
+For Cephfs, using an erasure coded pool means setting that pool in
+a `file layout <../../../cephfs/file-layouts>`_.
+
+
+Erasure coded pool and cache tiering
+------------------------------------
+
+Erasure coded pools require more resources than replicated pools and
+lack some functionalities such as omap. To overcome these
+limitations, one can set up a `cache tier <../cache-tiering>`_
+before the erasure coded pool.
+
+For instance, if the pool *hot-storage* is made of fast storage::
+
+ $ ceph osd tier add ecpool hot-storage
+ $ ceph osd tier cache-mode hot-storage writeback
+ $ ceph osd tier set-overlay ecpool hot-storage
+
+will place the *hot-storage* pool as tier of *ecpool* in *writeback*
+mode so that every write and read to the *ecpool* are actually using
+the *hot-storage* and benefit from its flexibility and speed.
+
+More information can be found in the `cache tiering
+<../cache-tiering>`_ documentation.
+
+Glossary
+--------
+
+*chunk*
+ when the encoding function is called, it returns chunks of the same
+ size: data chunks, which can be concatenated to reconstruct the original
+ object, and coding chunks, which can be used to rebuild a lost chunk.
+
+*K*
+ the number of data *chunks*, i.e. the number of *chunks* in which the
+ original object is divided. For instance if *K* = 2 a 10KB object
+ will be divided into *K* objects of 5KB each.
+
+*M*
+ the number of coding *chunks*, i.e. the number of additional *chunks*
+ computed by the encoding functions. If there are 2 coding *chunks*,
+ it means 2 OSDs can be out without losing data.
+
+
+Table of contents
+-----------------
+
+.. toctree::
+ :maxdepth: 1
+
+ erasure-code-profile
+ erasure-code-jerasure
+ erasure-code-isa
+ erasure-code-lrc
+ erasure-code-shec
diff --git a/src/ceph/doc/rados/operations/health-checks.rst b/src/ceph/doc/rados/operations/health-checks.rst
new file mode 100644
index 0000000..c1e2200
--- /dev/null
+++ b/src/ceph/doc/rados/operations/health-checks.rst
@@ -0,0 +1,527 @@
+
+=============
+Health checks
+=============
+
+Overview
+========
+
+There is a finite set of possible health messages that a Ceph cluster can
+raise -- these are defined as *health checks* which have unique identifiers.
+
+The identifier is a terse pseudo-human-readable (i.e. like a variable name)
+string. It is intended to enable tools (such as UIs) to make sense of
+health checks, and present them in a way that reflects their meaning.
+
+This page lists the health checks that are raised by the monitor and manager
+daemons. In addition to these, you may also see health checks that originate
+from MDS daemons (see :doc:`/cephfs/health-messages`), and health checks
+that are defined by ceph-mgr python modules.
+
+Definitions
+===========
+
+
+OSDs
+----
+
+OSD_DOWN
+________
+
+One or more OSDs are marked down. The ceph-osd daemon may have been
+stopped, or peer OSDs may be unable to reach the OSD over the network.
+Common causes include a stopped or crashed daemon, a down host, or a
+network outage.
+
+Verify the host is healthy, the daemon is started, and network is
+functioning. If the daemon has crashed, the daemon log file
+(``/var/log/ceph/ceph-osd.*``) may contain debugging information.
+
+OSD_<crush type>_DOWN
+_____________________
+
+(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN)
+
+All the OSDs within a particular CRUSH subtree are marked down, for example
+all OSDs on a host.
+
+OSD_ORPHAN
+__________
+
+An OSD is referenced in the CRUSH map hierarchy but does not exist.
+
+The OSD can be removed from the CRUSH hierarchy with::
+
+ ceph osd crush rm osd.<id>
+
+OSD_OUT_OF_ORDER_FULL
+_____________________
+
+The utilization thresholds for `nearfull`, `backfillfull`, `full`,
+and/or `failsafe_full` are not ascending. In particular, we expect
+`nearfull < backfillfull`, `backfillfull < full`, and `full <
+failsafe_full`.
+
+The thresholds can be adjusted with::
+
+ ceph osd set-backfillfull-ratio <ratio>
+ ceph osd set-nearfull-ratio <ratio>
+ ceph osd set-full-ratio <ratio>
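
Before adjusting the ratios, the intended values can be checked for ordering with a minimal Python sketch. The function name ``out_of_order`` is hypothetical. The order used below (nearfull, backfillfull, full, failsafe_full) and the sample ratios are assumptions based on commonly used values and should be compared against the actual cluster settings.

```python
def out_of_order(thresholds):
    """Return adjacent (lower, higher) name pairs that are not strictly ascending.

    `thresholds` is a dict listed in the expected ascending order.
    """
    names = list(thresholds)
    return [
        (a, b)
        for a, b in zip(names, names[1:])
        if not thresholds[a] < thresholds[b]
    ]


# Sample values only; they are assumptions, not read from a cluster.
ratios = {"nearfull": 0.85, "backfillfull": 0.90, "full": 0.95, "failsafe_full": 0.97}
print(out_of_order(ratios))
```

An empty list means the thresholds ascend; any pair that is returned names the two settings to fix.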
+
+
+OSD_FULL
+________
+
+One or more OSDs has exceeded the `full` threshold and is preventing
+the cluster from servicing writes.
+
+Utilization by pool can be checked with::
+
+ ceph df
+
+The currently defined `full` ratio can be seen with::
+
+ ceph osd dump | grep full_ratio
+
+A short-term workaround to restore write availability is to raise the full
+threshold by a small amount::
+
+ ceph osd set-full-ratio <ratio>
+
+New storage should be added to the cluster by deploying more OSDs or
+existing data should be deleted in order to free up space.
+
+OSD_BACKFILLFULL
+________________
+
+One or more OSDs has exceeded the `backfillfull` threshold, which will
+prevent data from being allowed to rebalance to this device. This is
+an early warning that rebalancing may not be able to complete and that
+the cluster is approaching full.
+
+Utilization by pool can be checked with::
+
+ ceph df
+
+OSD_NEARFULL
+____________
+
+One or more OSDs has exceeded the `nearfull` threshold. This is an early
+warning that the cluster is approaching full.
+
+Utilization by pool can be checked with::
+
+ ceph df
+
+OSDMAP_FLAGS
+____________
+
+One or more cluster flags of interest has been set. These flags include:
+
+* *full* - the cluster is flagged as full and cannot service writes
+* *pauserd*, *pausewr* - paused reads or writes
+* *noup* - OSDs are not allowed to start
+* *nodown* - OSD failure reports are being ignored, such that the
+ monitors will not mark OSDs `down`
+* *noin* - OSDs that were previously marked `out` will not be marked
+ back `in` when they start
+* *noout* - down OSDs will not automatically be marked out after the
+ configured interval
+* *nobackfill*, *norecover*, *norebalance* - recovery or data
+ rebalancing is suspended
+* *noscrub*, *nodeep_scrub* - scrubbing is disabled
+* *notieragent* - cache tiering activity is suspended
+
+With the exception of *full*, these flags can be set or cleared with::
+
+ ceph osd set <flag>
+ ceph osd unset <flag>
+
+OSD_FLAGS
+_________
+
+One or more OSDs has a per-OSD flag of interest set. These flags include:
+
+* *noup*: OSD is not allowed to start
+* *nodown*: failure reports for this OSD will be ignored
+* *noin*: if this OSD was previously marked `out` automatically
+ after a failure, it will not be marked in when it starts
+* *noout*: if this OSD is down it will not automatically be marked
+ `out` after the configured interval
+
+Per-OSD flags can be set and cleared with::
+
+ ceph osd add-<flag> <osd-id>
+ ceph osd rm-<flag> <osd-id>
+
+For example, ::
+
+ ceph osd rm-nodown osd.123
+
+OLD_CRUSH_TUNABLES
+__________________
+
+The CRUSH map is using very old settings and should be updated. The
+oldest tunables that can be used (i.e., the oldest client version that
+can connect to the cluster) without triggering this health warning is
+determined by the ``mon_crush_min_required_version`` config option.
+See :doc:`/rados/operations/crush-map/#tunables` for more information.
+
+OLD_CRUSH_STRAW_CALC_VERSION
+____________________________
+
+The CRUSH map is using an older, non-optimal method for calculating
+intermediate weight values for ``straw`` buckets.
+
+The CRUSH map should be updated to use the newer method
+(``straw_calc_version=1``). See
+:doc:`/rados/operations/crush-map/#tunables` for more information.
+
+CACHE_POOL_NO_HIT_SET
+_____________________
+
+One or more cache pools is not configured with a *hit set* to track
+utilization, which will prevent the tiering agent from identifying
+cold objects to flush and evict from the cache.
+
+Hit sets can be configured on the cache pool with::
+
+ ceph osd pool set <poolname> hit_set_type <type>
+ ceph osd pool set <poolname> hit_set_period <period-in-seconds>
+ ceph osd pool set <poolname> hit_set_count <number-of-hitsets>
+ ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate>
+
+OSD_NO_SORTBITWISE
+__________________
+
+No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
+been set.
+
+The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
+OSDs can start. You can safely set the flag with::
+
+ ceph osd set sortbitwise
+
+POOL_FULL
+_________
+
+One or more pools has reached its quota and is no longer allowing writes.
+
+Pool quotas and utilization can be seen with::
+
+ ceph df detail
+
+You can either raise the pool quota with::
+
+ ceph osd pool set-quota <poolname> max_objects <num-objects>
+ ceph osd pool set-quota <poolname> max_bytes <num-bytes>
+
+or delete some existing data to reduce utilization.
+
+
+Data health (pools & placement groups)
+--------------------------------------
+
+PG_AVAILABILITY
+_______________
+
+Data availability is reduced, meaning that the cluster is unable to
+service potential read or write requests for some data in the cluster.
+Specifically, one or more PGs is in a state that does not allow IO
+requests to be serviced. Problematic PG states include *peering*,
+*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear
+quickly).
+
+Detailed information about which PGs are affected is available from::
+
+ ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+ ceph tell <pgid> query
+
+PG_DEGRADED
+___________
+
+Data redundancy is reduced for some data, meaning the cluster does not
+have the desired number of replicas for all data (for replicated
+pools) or erasure code fragments (for erasure coded pools).
+Specifically, one or more PGs:
+
+* has the *degraded* or *undersized* flag set, meaning there are not
+ enough instances of that placement group in the cluster;
+* has not had the *clean* flag set for some time.
+
+Detailed information about which PGs are affected is available from::
+
+ ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+ ceph tell <pgid> query
+
+
+PG_DEGRADED_FULL
+________________
+
+Data redundancy may be reduced or at risk for some data due to a lack
+of free space in the cluster. Specifically, one or more PGs has the
+*backfill_toofull* or *recovery_toofull* flag set, meaning that the
+cluster is unable to migrate or recover data because one or more OSDs
+is above the *backfillfull* threshold.
+
+See the discussion for *OSD_BACKFILLFULL* or *OSD_FULL* above for
+steps to resolve this condition.
+
+PG_DAMAGED
+__________
+
+Data scrubbing has discovered some problems with data consistency in
+the cluster. Specifically, one or more PGs has the *inconsistent* or
+*snaptrim_error* flag set, indicating an earlier scrub operation
+found a problem, or has the *repair* flag set, meaning a repair
+for such an inconsistency is currently in progress.
+
+See :doc:`pg-repair` for more information.
+
+OSD_SCRUB_ERRORS
+________________
+
+Recent OSD scrubs have uncovered inconsistencies. This error is generally
+paired with *PG_DAMAGED* (see above).
+
+See :doc:`pg-repair` for more information.
+
+CACHE_POOL_NEAR_FULL
+____________________
+
+A cache tier pool is nearly full. Full in this context is determined
+by the ``target_max_bytes`` and ``target_max_objects`` properties on
+the cache pool. Once the pool reaches the target threshold, write
+requests to the pool may block while data is flushed and evicted
+from the cache, a state that normally leads to very high latencies and
+poor performance.
+
+The cache pool target size can be adjusted with::
+
+ ceph osd pool set <cache-pool-name> target_max_bytes <bytes>
+ ceph osd pool set <cache-pool-name> target_max_objects <objects>
+
+Normal cache flush and evict activity may also be throttled due to reduced
+availability or performance of the base tier, or overall cluster load.
+
+TOO_FEW_PGS
+___________
+
+The number of PGs in use in the cluster is below the configurable
+threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead
+to suboptimal distribution and balance of data across the OSDs in
+the cluster, and similarly reduce overall performance.
+
+This may be an expected condition if data pools have not yet been
+created.
+
+The PG count for existing pools can be increased or new pools can be
+created. Please refer to
+:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for
+more information.
+
+TOO_MANY_PGS
+____________
+
+The number of PGs in use in the cluster is above the configurable
+threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is
+exceeded, the cluster will not allow new pools to be created, pool `pg_num` to
+be increased, or pool replication to be increased (any of which would lead to
+more PGs in the cluster). A large number of PGs can lead
+to higher memory utilization for OSD daemons, slower peering after
+cluster state changes (like OSD restarts, additions, or removals), and
+higher load on the Manager and Monitor daemons.
+
+The simplest way to mitigate the problem is to increase the number of
+OSDs in the cluster by adding more hardware. Note that the OSD count
+used for the purposes of this health check is the number of "in" OSDs,
+so marking "out" OSDs "in" (if there are any) can also help::
+
+ ceph osd in <osd id(s)>
+
+Please refer to
+:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for
+more information.
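As a rough illustration of the ratio this check evaluates, the sketch below estimates PGs per OSD from each pool's ``pg_num`` and replica count and compares the result to a limit. The function names, the input shape, and the example limit of 200 are assumptions made for illustration; the monitor's own accounting may differ in detail.

```python
def pgs_per_osd(pools, num_in_osds):
    """Estimate PG replicas per "in" OSD.

    pools is a list of (pg_num, replica_count) pairs; this input shape is
    an assumption for the example, not a Ceph API.
    """
    if num_in_osds <= 0:
        raise ValueError("at least one 'in' OSD is required")
    total_pg_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_pg_replicas / num_in_osds


def exceeds_limit(pools, num_in_osds, max_pg_per_osd):
    """Return True when the estimate is above the given threshold."""
    return pgs_per_osd(pools, num_in_osds) > max_pg_per_osd


# Two pools (128 and 64 PGs, 3 replicas each) spread over 4 "in" OSDs.
example_pools = [(128, 3), (64, 3)]
print(pgs_per_osd(example_pools, 4))         # 144.0
print(exceeds_limit(example_pools, 4, 200))  # False
```

Adding OSDs, or marking "out" OSDs "in", raises the denominator, which is why either step lowers the ratio.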
+
+SMALLER_PGP_NUM
+_______________
+
+One or more pools has a ``pgp_num`` value less than ``pg_num``. This
+is normally an indication that the PG count was increased without
+also increasing ``pgp_num``, which governs placement behavior.
+
+This is sometimes done deliberately to separate out the `split` step
+when the PG count is adjusted from the data migration that is needed
+when ``pgp_num`` is changed.
+
+This is normally resolved by setting ``pgp_num`` to match ``pg_num``,
+triggering the data migration, with::
+
+ ceph osd pool set <pool> pgp_num <pg-num-value>
+
+MANY_OBJECTS_PER_PG
+___________________
+
+One or more pools has an average number of objects per PG that is
+significantly higher than the overall cluster average. The specific
+threshold is controlled by the ``mon_pg_warn_max_object_skew``
+configuration value.
+
+This is usually an indication that the pool(s) containing most of the
+data in the cluster have too few PGs, and/or that other pools that do
+not contain as much data have too many PGs. See the discussion of
+*TOO_MANY_PGS* above.
+
+The threshold can be raised to silence the health warning by adjusting
+the ``mon_pg_warn_max_object_skew`` config option on the monitors.
+
+POOL_APP_NOT_ENABLED
+____________________
+
+A pool exists that contains one or more objects but has not been
+tagged for use by a particular application.
+
+Resolve this warning by labeling the pool for use by an application. For
+example, if the pool is used by RBD,::
+
+ rbd pool init <poolname>
+
+If the pool is being used by a custom application 'foo', you can also label
+it via the low-level command::
+
+ ceph osd pool application enable <poolname> foo
+
+For more information, see :doc:`pools.rst#associate-pool-to-application`.
+
+POOL_FULL
+_________
+
+One or more pools has reached (or is very close to reaching) its
+quota. The threshold to trigger this error condition is controlled by
+the ``mon_pool_quota_crit_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+ ceph osd pool set-quota <pool> max_bytes <bytes>
+ ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
+
+POOL_NEAR_FULL
+______________
+
+One or more pools is approaching its quota. The threshold to trigger
+this warning condition is controlled by the
+``mon_pool_quota_warn_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+ ceph osd pool set-quota <pool> max_bytes <bytes>
+ ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
+
+OBJECT_MISPLACED
+________________
+
+One or more objects in the cluster is not stored on the node the
+cluster would like it to be stored on. This is an indication that
+data migration due to some recent cluster change has not yet completed.
+
+Misplaced data is not a dangerous condition in and of itself; data
+consistency is never at risk, and old copies of objects are never
+removed until the desired number of new copies (in the desired
+locations) are present.
+
+OBJECT_UNFOUND
+______________
+
+One or more objects in the cluster cannot be found. Specifically, the
+OSDs know that a new or updated copy of an object should exist, but a
+copy of that version of the object has not been found on OSDs that are
+currently online.
+
+Read or write requests to unfound objects will block.
+
+Ideally, a down OSD can be brought back online that has the more
+recent copy of the unfound object. Candidate OSDs can be identified from the
+peering state for the PG(s) responsible for the unfound object::
+
+ ceph tell <pgid> query
+
+If the latest copy of the object is not available, the cluster can be
+told to roll back to a previous version of the object. See
+:doc:`troubleshooting-pg#Unfound-objects` for more information.
+
+REQUEST_SLOW
+____________
+
+One or more OSD requests is taking a long time to process. This can
+be an indication of extreme load, a slow storage device, or a software
+bug.
+
+The request queue on the OSD(s) in question can be queried with the
+following command, executed from the OSD host::
+
+ ceph daemon osd.<id> ops
+
+A summary of the slowest recent requests can be seen with::
+
+ ceph daemon osd.<id> dump_historic_ops
+
+The location of an OSD can be found with::
+
+ ceph osd find osd.<id>
+
+REQUEST_STUCK
+_____________
+
+One or more OSD requests has been blocked for an extremely long time.
+This is an indication that either the cluster has been unhealthy for
+an extended period of time (e.g., not enough running OSDs) or there is
+some internal problem with the OSD. See the discussion of
+*REQUEST_SLOW* above.
+
+PG_NOT_SCRUBBED
+_______________
+
+One or more PGs has not been scrubbed recently. PGs are normally
+scrubbed every ``mon_scrub_interval`` seconds, and this warning
+triggers when ``mon_warn_not_scrubbed`` such intervals have elapsed
+without a scrub.
+
+PGs will not scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a scrub of a clean PG with::
+
+ ceph pg scrub <pgid>
+
+PG_NOT_DEEP_SCRUBBED
+____________________
+
+One or more PGs has not been deep scrubbed recently. PGs are normally
+deep scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning
+triggers when ``mon_warn_not_deep_scrubbed`` such intervals have elapsed
+without a deep scrub.
+
+PGs will not (deep) scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a deep scrub of a clean PG with::
+
+ ceph pg deep-scrub <pgid>
diff --git a/src/ceph/doc/rados/operations/index.rst b/src/ceph/doc/rados/operations/index.rst
new file mode 100644
index 0000000..aacf764
--- /dev/null
+++ b/src/ceph/doc/rados/operations/index.rst
@@ -0,0 +1,90 @@
+====================
+ Cluster Operations
+====================
+
+.. raw:: html
+
+ <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3>
+
+High-level cluster operations consist primarily of starting, stopping, and
+restarting a cluster with the ``ceph`` service; checking the cluster's health;
+and monitoring an operating cluster.
+
+.. toctree::
+ :maxdepth: 1
+
+ operating
+ health-checks
+ monitoring
+ monitoring-osd-pg
+ user-management
+
+.. raw:: html
+
+ </td><td><h3>Data Placement</h3>
+
+Once you have your cluster up and running, you may begin working with data
+placement. Ceph supports petabyte-scale data storage clusters, with storage
+pools and placement groups that distribute data across the cluster using Ceph's
+CRUSH algorithm.
+
+.. toctree::
+ :maxdepth: 1
+
+ data-placement
+ pools
+ erasure-code
+ cache-tiering
+ placement-groups
+ upmap
+ crush-map
+ crush-map-edits
+
+
+
+.. raw:: html
+
+ </td></tr><tr><td><h3>Low-level Operations</h3>
+
+Low-level cluster operations consist of starting, stopping, and restarting a
+particular daemon within a cluster; changing the settings of a particular
+daemon or subsystem; and, adding a daemon to the cluster or removing a daemon
+from the cluster. The most common use cases for low-level operations include
+growing or shrinking the Ceph cluster and replacing legacy or failed hardware
+with new hardware.
+
+.. toctree::
+ :maxdepth: 1
+
+ add-or-rm-osds
+ add-or-rm-mons
+ Command Reference <control>
+
+
+
+.. raw:: html
+
+ </td><td><h3>Troubleshooting</h3>
+
+Ceph is still on the leading edge, so you may encounter situations that require
+you to evaluate your Ceph configuration and modify your logging and debugging
+settings to identify and remedy issues you are encountering with your cluster.
+
+.. toctree::
+ :maxdepth: 1
+
+ ../troubleshooting/community
+ ../troubleshooting/troubleshooting-mon
+ ../troubleshooting/troubleshooting-osd
+ ../troubleshooting/troubleshooting-pg
+ ../troubleshooting/log-and-debug
+ ../troubleshooting/cpu-profiling
+ ../troubleshooting/memory-profiling
+
+
+
+
+.. raw:: html
+
+ </td></tr></tbody></table>
+
diff --git a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst
new file mode 100644
index 0000000..0107e34
--- /dev/null
+++ b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst
@@ -0,0 +1,617 @@
+=========================
+ Monitoring OSDs and PGs
+=========================
+
+High availability and high reliability require a fault-tolerant approach to
+managing hardware and software issues. Ceph has no single point-of-failure, and
+can service requests for data in a "degraded" mode. Ceph's `data placement`_
+introduces a layer of indirection to ensure that data doesn't bind directly to
+particular OSD addresses. This means that tracking down system faults requires
+finding the `placement group`_ and the underlying OSDs at the root of the problem.
+
+.. tip:: A fault in one part of the cluster may prevent you from accessing a
+ particular object, but that doesn't mean that you cannot access other objects.
+ When you run into a fault, don't panic. Just follow the steps for monitoring
+ your OSDs and placement groups. Then, begin troubleshooting.
+
+Ceph is generally self-repairing. However, when problems persist, monitoring
+OSDs and placement groups will help you identify the problem.
+
+
+Monitoring OSDs
+===============
+
+An OSD's status is either in the cluster (``in``) or out of the cluster
+(``out``); and, it is either up and running (``up``), or it is down and not
+running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
+(you can read and write data) or ``out`` of the cluster. If it was
+``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
+placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
+not assign placement groups to the OSD. If an OSD is ``down``, it should also be
+``out``.
+
+.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
+ will not be in a healthy state.
+
+.. ditaa:: +----------------+ +----------------+
+ | | | |
+ | OSD #n In | | OSD #n Up |
+ | | | |
+ +----------------+ +----------------+
+ ^ ^
+ | |
+ | |
+ v v
+ +----------------+ +----------------+
+ | | | |
+ | OSD #n Out | | OSD #n Down |
+ | | | |
+ +----------------+ +----------------+
+
+If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
+you may notice that the cluster does not always echo back ``HEALTH_OK``. Don't
+panic. With respect to OSDs, you should expect that the cluster will **NOT**
+echo ``HEALTH_OK`` in a few expected circumstances:
+
+#. You haven't started the cluster yet (it won't respond).
+#. You have just started or restarted the cluster and it's not ready yet,
+ because the placement groups are getting created and the OSDs are in
+ the process of peering.
+#. You just added or removed an OSD.
+#. You have just modified your cluster map.
+
+An important aspect of monitoring OSDs is to ensure that, when the cluster
+is up and running, all OSDs that are ``in`` the cluster are also ``up`` and
+running. To see if all OSDs are running, execute::
+
+ ceph osd stat
+
+The result should tell you the map epoch (eNNNN), the total number of OSDs (x),
+how many are ``up`` (y) and how many are ``in`` (z). ::
+
+ eNNNN: x osds: y up, z in
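When scripting this check, the summary line can be parsed and the two counts compared. The following is a minimal sketch that assumes the line has exactly the format shown above; the function names are hypothetical, and the output of a real cluster may carry additional fields.

```python
import re

# Matches the summary format shown above: "eNNNN: x osds: y up, z in".
OSD_STAT_RE = re.compile(r"e(\d+):\s*(\d+) osds:\s*(\d+) up,\s*(\d+) in")


def parse_osd_stat(line):
    """Return the epoch and OSD counts from one OSD summary line."""
    match = OSD_STAT_RE.search(line)
    if match is None:
        raise ValueError("unrecognized osd stat line: %r" % line)
    epoch, total, up, in_count = (int(group) for group in match.groups())
    return {"epoch": epoch, "total": total, "up": up, "in": in_count}


def more_in_than_up(stat):
    """True when more OSDs are `in` than `up`, the case worth investigating."""
    return stat["in"] > stat["up"]


stat = parse_osd_stat("e1234: 3 osds: 2 up, 3 in")
print(stat)
print(more_in_than_up(stat))
```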
+
+If the number of OSDs that are ``in`` the cluster is more than the number of
+OSDs that are ``up``, execute the following command to identify the ``ceph-osd``
+daemons that are not running::
+
+ ceph osd tree
+
+::
+
+ dumped osdmap tree epoch 1
+ # id weight type name up/down reweight
+ -1 2 pool openstack
+ -3 2 rack dell-2950-rack-A
+ -2 2 host dell-2950-A1
+ 0 1 osd.0 up 1
+ 1 1 osd.1 down 1
+
+
+.. tip:: The ability to search through a well-designed CRUSH hierarchy may help
+ you troubleshoot your cluster by identifying the physical locations faster.
+
+If an OSD is ``down``, start it::
+
+ sudo systemctl start ceph-osd@1
+
+See `OSD Not Running`_ for problems associated with OSDs that stopped, or won't
+restart.
+
+
+PG Sets
+=======
+
+When CRUSH assigns placement groups to OSDs, it looks at the number of replicas
+for the pool and assigns the placement group to OSDs such that each replica of
+the placement group gets assigned to a different OSD. For example, if the pool
+requires three replicas of a placement group, CRUSH may assign them to
+``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a
+pseudo-random placement that will take into account failure domains you set in
+your `CRUSH map`_, so you will rarely see placement groups assigned to nearest
+neighbor OSDs in a large cluster. We refer to the set of OSDs that should
+contain the replicas of a particular placement group as the **Acting Set**. In
+some cases, an OSD in the Acting Set is ``down`` or otherwise not able to
+service requests for objects in the placement group. When these situations
+arise, don't panic. Common examples include:
+
+- You added or removed an OSD. Then, CRUSH reassigned the placement group to
+ other OSDs--thereby changing the composition of the Acting Set and spawning
+ the migration of data with a "backfill" process.
+- An OSD was ``down``, was restarted, and is now ``recovering``.
+- An OSD in the Acting Set is ``down`` or unable to service requests,
+ and another OSD has temporarily assumed its duties.
+
+Ceph processes a client request using the **Up Set**, which is the set of OSDs
+that will actually handle the requests. In most cases, the Up Set and the Acting
+Set are virtually identical. When they are not, it may indicate that Ceph is
+migrating data, that an OSD is recovering, or that there is a problem (in such
+scenarios, Ceph usually echoes a ``HEALTH_WARN`` state with a "stuck stale"
+message).
+
+To retrieve a list of placement groups, execute::
+
+ ceph pg dump
+
+To view which OSDs are within the Acting Set or the Up Set for a given placement
+group, execute::
+
+ ceph pg map {pg-num}
+
+The result should tell you the osdmap epoch (eNNN), the placement group number
+({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set
+(acting[]). ::
+
+ osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2]
+
+.. note:: If the Up Set and Acting Set do not match, this may be an indicator
+ that the cluster is rebalancing itself or of a potential problem with
+ the cluster.
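The comparison between the Up Set and the Acting Set can be automated. This sketch assumes the output line follows the format shown above; the regular expression and the function names are illustrative assumptions rather than part of any Ceph tool.

```python
import re

# Matches "osdmap eNNN pg {pg-num} -> up [...] acting [...]"; any text between
# the PG ID and the arrow is skipped.
PG_MAP_RE = re.compile(
    r"osdmap e(\d+) pg (\S+).*?-> up \[([\d,]*)\] acting \[([\d,]*)\]"
)


def _osd_list(text):
    """Convert "0,1,2" into [0, 1, 2]; an empty string gives an empty list."""
    return [int(item) for item in text.split(",") if item]


def parse_pg_map(line):
    """Return the epoch, PG ID, Up Set and Acting Set from one output line."""
    match = PG_MAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized pg map line: %r" % line)
    epoch, pgid, up, acting = match.groups()
    return {
        "epoch": int(epoch),
        "pgid": pgid,
        "up": _osd_list(up),
        "acting": _osd_list(acting),
    }


def sets_differ(mapping):
    """True when the Up Set and the Acting Set are not identical."""
    return mapping["up"] != mapping["acting"]


mapping = parse_pg_map("osdmap e537 pg 0.4 -> up [0,1,2] acting [0,1,3]")
print(mapping["up"], mapping["acting"], sets_differ(mapping))
```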
+
+
+Peering
+=======
+
+Before you can write data to a placement group, it must be in an ``active``
+state, and it **should** be in a ``clean`` state. For Ceph to determine the
+current state of a placement group, the primary OSD of the placement group
+(i.e., the first OSD in the acting set) peers with the secondary and tertiary
+OSDs to establish agreement on the current state of the placement group
+(assuming a pool with 3 replicas of the PG).
+
+
+.. ditaa:: +---------+ +---------+ +-------+
+ | OSD 1 | | OSD 2 | | OSD 3 |
+ +---------+ +---------+ +-------+
+ | | |
+ | Request To | |
+ | Peer | |
+ |-------------->| |
+ |<--------------| |
+ | Peering |
+ | |
+ | Request To |
+ | Peer |
+ |----------------------------->|
+ |<-----------------------------|
+ | Peering |
+
+The OSDs also report their status to the monitor. See `Configuring Monitor/OSD
+Interaction`_ for details. To troubleshoot peering issues, see `Peering
+Failure`_.
+
+
+Monitoring Placement Group States
+=================================
+
+If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
+you may notice that the cluster does not always echo back ``HEALTH_OK``. After
+you check to see if the OSDs are running, you should also check placement group
+states. You should expect that the cluster will **NOT** echo ``HEALTH_OK`` in a
+number of placement group peering-related circumstances:
+
+#. You have just created a pool and placement groups haven't peered yet.
+#. The placement groups are recovering.
+#. You have just added an OSD to or removed an OSD from the cluster.
+#. You have just modified your CRUSH map and your placement groups are migrating.
+#. There is inconsistent data in different replicas of a placement group.
+#. Ceph is scrubbing a placement group's replicas.
+#. Ceph doesn't have enough storage capacity to complete backfilling operations.
+
+If one of the foregoing circumstances causes Ceph to echo ``HEALTH_WARN``, don't
+panic. In many cases, the cluster will recover on its own. In some cases, you
+may need to take action. An important aspect of monitoring placement groups is
+to ensure that, when the cluster is up and running, all placement groups are
+``active``, and preferably in the ``clean`` state. To see the status of all
+placement groups, execute::
+
+ ceph pg stat
+
+The result should tell you the placement group map version (vNNNNNN), the total
+number of placement groups (x), and how many placement groups are in a
+particular state such as ``active+clean`` (y). ::
+
+ vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
+
+.. note:: It is common for Ceph to report multiple states for placement groups.
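Because a placement group can report several states at once, joined with ``+``, a script usually splits the combined string before testing it. The helper names below are hypothetical; this is only a sketch of that idea.

```python
def split_pg_state(state):
    """Split a combined state such as "active+clean" into its parts."""
    return set(state.split("+"))


def is_active_and_clean(state):
    """True when both `active` and `clean` appear among the reported states."""
    return {"active", "clean"} <= split_pg_state(state)


for state in ("active+clean", "active+clean+scrubbing", "active+degraded"):
    print(state, is_active_and_clean(state))
```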
+
+In addition to the placement group states, Ceph will also echo back the amount
+of storage used (aa), the amount of storage capacity remaining (bb), and the
+total storage capacity of the cluster (cc). These numbers can be important in a
+few cases:
+
+- You are reaching your ``near full ratio`` or ``full ratio``.
+- Your data is not getting distributed across the cluster due to an
+ error in your CRUSH configuration.
+
+
+.. topic:: Placement Group IDs
+
+ Placement group IDs consist of the pool number (not pool name) followed
+ by a period (.) and the placement group ID--a hexadecimal number. You
+ can view pool numbers and their names from the output of ``ceph osd
+ lspools``. For example, the default pool ``rbd`` corresponds to
+ pool number ``0``. A fully qualified placement group ID has the
+ following form::
+
+ {pool-num}.{pg-id}
+
+ And it typically looks like this::
+
+ 0.1f
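To make the ID format concrete, the sketch below splits a placement group ID into its pool number and its hexadecimal placement group number. The function name is an assumption made for this example.

```python
def parse_pgid(pgid):
    """Split "{pool-num}.{pg-id}" into (pool number, PG number) as integers."""
    pool_text, _, pg_text = pgid.partition(".")
    if not pool_text or not pg_text:
        raise ValueError("expected {pool-num}.{pg-id}, got %r" % pgid)
    # The pool number is decimal; the placement group part is hexadecimal.
    return int(pool_text), int(pg_text, 16)


print(parse_pgid("0.1f"))  # (0, 31)
```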
+
+
+To retrieve a list of placement groups, execute the following::
+
+ ceph pg dump
+
+You can also format the output in JSON format and save it to a file::
+
+ ceph pg dump -o {filename} --format=json
+
+To query a particular placement group, execute the following::
+
+ ceph pg {poolnum}.{pg-id} query
+
+Ceph will output the query in JSON format.
+
+.. code-block:: javascript
+
+ {
+ "state": "active+clean",
+ "up": [
+ 1,
+ 0
+ ],
+ "acting": [
+ 1,
+ 0
+ ],
+ "info": {
+ "pgid": "1.e",
+ "last_update": "4'1",
+ "last_complete": "4'1",
+ "log_tail": "0'0",
+ "last_backfill": "MAX",
+ "purged_snaps": "[]",
+ "history": {
+ "epoch_created": 1,
+ "last_epoch_started": 537,
+ "last_epoch_clean": 537,
+ "last_epoch_split": 534,
+ "same_up_since": 536,
+ "same_interval_since": 536,
+ "same_primary_since": 536,
+ "last_scrub": "4'1",
+ "last_scrub_stamp": "2013-01-25 10:12:23.828174"
+ },
+ "stats": {
+ "version": "4'1",
+ "reported": "536'782",
+ "state": "active+clean",
+ "last_fresh": "2013-01-25 10:12:23.828271",
+ "last_change": "2013-01-25 10:12:23.828271",
+ "last_active": "2013-01-25 10:12:23.828271",
+ "last_clean": "2013-01-25 10:12:23.828271",
+ "last_unstale": "2013-01-25 10:12:23.828271",
+ "mapping_epoch": 535,
+ "log_start": "0'0",
+ "ondisk_log_start": "0'0",
+ "created": 1,
+ "last_epoch_clean": 1,
+ "parent": "0.0",
+ "parent_split_bits": 0,
+ "last_scrub": "4'1",
+ "last_scrub_stamp": "2013-01-25 10:12:23.828174",
+ "log_size": 128,
+ "ondisk_log_size": 128,
+ "stat_sum": {
+ "num_bytes": 205,
+ "num_objects": 1,
+ "num_object_clones": 0,
+ "num_object_copies": 0,
+ "num_objects_missing_on_primary": 0,
+ "num_objects_degraded": 0,
+ "num_objects_unfound": 0,
+ "num_read": 1,
+ "num_read_kb": 0,
+ "num_write": 3,
+ "num_write_kb": 1
+ },
+ "stat_cat_sum": {
+
+ },
+ "up": [
+ 1,
+ 0
+ ],
+ "acting": [
+ 1,
+ 0
+ ]
+ },
+ "empty": 0,
+ "dne": 0,
+ "incomplete": 0
+ },
+ "recovery_state": [
+ {
+ "name": "Started\/Primary\/Active",
+ "enter_time": "2013-01-23 09:35:37.594691",
+ "might_have_unfound": [
+
+ ],
+ "scrub": {
+ "scrub_epoch_start": "536",
+ "scrub_active": 0,
+ "scrub_block_writes": 0,
+ "finalizing_scrub": 0,
+ "scrub_waiting_on": 0,
+ "scrub_waiting_on_whom": [
+
+ ]
+ }
+ },
+ {
+ "name": "Started",
+ "enter_time": "2013-01-23 09:35:31.581160"
+ }
+ ]
+ }
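Since the query result is JSON, a few fields are usually enough for a quick health summary. The sketch below assumes the layout shown above (``state``, ``up``, ``acting``, and ``info.stats.stat_sum``); the function name and the trimmed sample document are illustrative only.

```python
import json


def summarize_pg_query(text):
    """Pull the state, Up Set, Acting Set and unfound count from query JSON."""
    doc = json.loads(text)
    stat_sum = doc.get("info", {}).get("stats", {}).get("stat_sum", {})
    return {
        "state": doc["state"],
        "up": doc["up"],
        "acting": doc["acting"],
        "unfound": stat_sum.get("num_objects_unfound", 0),
    }


# A trimmed-down sample with the same layout as the output above.
sample = """
{
  "state": "active+clean",
  "up": [1, 0],
  "acting": [1, 0],
  "info": {"stats": {"stat_sum": {"num_objects_unfound": 0}}}
}
"""
print(summarize_pg_query(sample))
```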
+
+
+
+The following subsections describe common states in greater detail.
+
+Creating
+--------
+
+When you create a pool, it will create the number of placement groups you
+specified. Ceph will echo ``creating`` when it is creating one or more
+placement groups. Once they are created, the OSDs that are part of a placement
+group's Acting Set will peer. Once peering is complete, the placement group
+status should be ``active+clean``, which means a Ceph client can begin writing
+to the placement group.
+
+.. ditaa::
+
+ /-----------\ /-----------\ /-----------\
+ | Creating |------>| Peering |------>| Active |
+ \-----------/ \-----------/ \-----------/
+
+Peering
+-------
+
+When Ceph is Peering a placement group, Ceph is bringing the OSDs that
+store the replicas of the placement group into **agreement about the state**
+of the objects and metadata in the placement group. When Ceph completes peering,
+this means that the OSDs that store the placement group agree about the current
+state of the placement group. However, completion of the peering process does
+**NOT** mean that each replica has the latest contents.
+
+.. topic:: Authoritative History
+
+ Ceph will **NOT** acknowledge a write operation to a client until
+ all OSDs of the acting set persist the write operation. This practice
+ ensures that at least one member of the acting set will have a record
+ of every acknowledged write operation since the last successful
+ peering operation.
+
+ With an accurate record of each acknowledged write operation, Ceph can
+ construct and disseminate a new authoritative history of the placement
+ group--a complete and fully ordered set of operations that, if performed,
+ would bring an OSD’s copy of a placement group up to date.
+
+
+Active
+------
+
+Once Ceph completes the peering process, a placement group may become
+``active``. The ``active`` state means that the data in the placement group is
+generally available in the primary placement group and the replicas for read
+and write operations.
+
+
+Clean
+-----
+
+When a placement group is in the ``clean`` state, the primary OSD and the
+replica OSDs have successfully peered and there are no stray replicas for the
+placement group. Ceph replicated all objects in the placement group the correct
+number of times.
+
+
+Degraded
+--------
+
+When a client writes an object to the primary OSD, the primary OSD is
+responsible for writing the replicas to the replica OSDs. After the primary OSD
+writes the object to storage, the placement group will remain in a ``degraded``
+state until the primary OSD has received an acknowledgement from the replica
+OSDs that Ceph created the replica objects successfully.
+
+The reason a placement group can be ``active+degraded`` is that an OSD may be
+``active`` even though it doesn't hold all of the objects yet. If an OSD goes
+``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
+The OSDs must peer again when the OSD comes back online. However, a client can
+still write a new object to a ``degraded`` placement group if it is ``active``.
+
+If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
+``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
+to another OSD. The time between being marked ``down`` and being marked ``out``
+is controlled by ``mon osd down out interval``, which is set to ``600`` seconds
+by default.
+
+A placement group can also be ``degraded``, because Ceph cannot find one or more
+objects that Ceph thinks should be in the placement group. While you cannot
+read or write to unfound objects, you can still access all of the other objects
+in the ``degraded`` placement group.
+
+
+Recovering
+----------
+
+Ceph was designed for fault-tolerance at a scale where hardware and software
+problems are ongoing. When an OSD goes ``down``, its contents may fall behind
+the current state of other replicas in the placement groups. When the OSD is
+back ``up``, the contents of the placement groups must be updated to reflect the
+current state. During that time period, the OSD may reflect a ``recovering``
+state.
+
+Recovery is not always trivial, because a hardware failure might cause a
+cascading failure of multiple OSDs. For example, a network switch for a rack or
+cabinet may fail, which can cause the OSDs of a number of host machines to fall
+behind the current state of the cluster. Each one of the OSDs must recover once
+the fault is resolved.
+
+Ceph provides a number of settings to balance the resource contention between
+new service requests and the need to recover data objects and restore the
+placement groups to the current state. The ``osd recovery delay start`` setting
+allows an OSD to restart, re-peer and even process some replay requests before
+starting the recovery process. The ``osd recovery thread timeout`` setting sets
+a thread timeout, because multiple OSDs may fail, restart and re-peer at
+staggered rates. The ``osd recovery max active`` setting limits the number of
+recovery requests an OSD will entertain simultaneously to prevent the OSD from
+failing to serve requests. The ``osd recovery max chunk`` setting limits the
+size of the recovered data chunks to prevent network congestion.
+
+
+Back Filling
+------------
+
+When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs
+in the cluster to the newly added OSD. Forcing the new OSD to accept the
+reassigned placement groups immediately can put excessive load on the new OSD.
+Back filling the OSD with the placement groups allows this process to begin in
+the background. Once backfilling is complete, the new OSD will begin serving
+requests when it is ready.
+
+During the backfill operations, you may see one of several states:
+``backfill_wait`` indicates that a backfill operation is pending, but is not
+underway yet; ``backfilling`` indicates that a backfill operation is underway;
+and ``backfill_toofull`` indicates that a backfill operation was requested,
+but couldn't be completed due to insufficient storage capacity. When a
+placement group cannot be backfilled, it may be considered ``incomplete``.
+
+Ceph provides a number of settings to manage the load spike associated with
+reassigning placement groups to an OSD (especially a new OSD). By default,
+``osd_max_backfills`` sets the maximum number of concurrent backfills to or from
+an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a
+backfill request if the OSD is approaching its full ratio (90%, by default);
+this ratio can be changed with the ``ceph osd set-backfillfull-ratio`` command.
+If an OSD refuses a backfill request, the ``osd backfill retry interval``
+enables an OSD to retry the request (after 10 seconds, by default). OSDs can
+also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
+intervals (64 and 512, by default).
+
+
+Remapped
+--------
+
+When the Acting Set that services a placement group changes, the data migrates
+from the old acting set to the new acting set. It may take some time for a new
+primary OSD to service requests. So it may ask the old primary to continue to
+service requests until the placement group migration is complete. Once data
+migration completes, the mapping uses the primary OSD of the new acting set.
+
+
+Stale
+-----
+
+While Ceph uses heartbeats to ensure that hosts and daemons are running, the
+``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
+reporting statistics in a timely manner (e.g., a temporary network fault). By
+default, OSD daemons report their placement group, up thru, boot and failure
+statistics every half second (i.e., ``0.5``), which is more frequent than the
+heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
+fails to report to the monitor or if other OSDs have reported the primary OSD
+``down``, the monitors will mark the placement group ``stale``.
+
+When you start your cluster, it is common to see the ``stale`` state until
+the peering process completes. After your cluster has been running for a while,
+seeing placement groups in the ``stale`` state indicates that the primary OSD
+for those placement groups is ``down`` or not reporting placement group statistics
+to the monitor.
+
+
+Identifying Troubled PGs
+========================
+
+As previously noted, a placement group is not necessarily problematic just
+because its state is not ``active+clean``. Generally, Ceph's ability to self
+repair may not be working when placement groups get stuck. The stuck states
+include:
+
+- **Unclean**: Placement groups contain objects that are not replicated the
+ desired number of times. They should be recovering.
+- **Inactive**: Placement groups cannot process reads or writes because they
+ are waiting for an OSD with the most up-to-date data to come back ``up``.
+- **Stale**: Placement groups are in an unknown state, because the OSDs that
+ host them have not reported to the monitor cluster in a while (configured
+ by ``mon osd report timeout``).
+
+To identify stuck placement groups, execute the following::
+
+ ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
+
+See `Placement Group Subsystem`_ for additional details. To troubleshoot
+stuck placement groups, see `Troubleshooting PG Errors`_.
+
+
+Finding an Object Location
+==========================
+
+To store object data in the Ceph Object Store, a Ceph client must:
+
+#. Set an object name
+#. Specify a `pool`_
+
+The Ceph client retrieves the latest cluster map and the CRUSH algorithm
+calculates how to map the object to a `placement group`_, and then calculates
+how to assign the placement group to an OSD dynamically. To find the object
+location, all you need is the object name and the pool name. For example::
+
+ ceph osd map {poolname} {object-name}
+
+.. topic:: Exercise: Locate an Object
+
+ As an exercise, let's create an object. Specify an object name, a path to a
+ test file containing some object data and a pool name using the
+ ``rados put`` command on the command line. For example::
+
+ rados put {object-name} {file-path} --pool=data
+ rados put test-object-1 testfile.txt --pool=data
+
+ To verify that the Ceph Object Store stored the object, execute the following::
+
+ rados -p data ls
+
+ Now, identify the object location::
+
+ ceph osd map {pool-name} {object-name}
+ ceph osd map data test-object-1
+
+ Ceph should output the object's location. For example::
+
+ osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]
+
+ To remove the test object, simply delete it using the ``rados rm`` command.
+ For example::
+
+ rados rm test-object-1 --pool=data
+
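+The location line can also be consumed by a script. The following is a minimal
+Python sketch, not part of Ceph: it assumes the exact output layout shown in
+the exercise above (other releases may print it differently), and the names
+``OSD_MAP_LINE`` and ``parse_osd_map_line`` are hypothetical.
+
```python
import re

# Pattern for the line format shown in the exercise above. This layout is an
# assumption; other Ceph releases may print the mapping differently.
OSD_MAP_LINE = re.compile(
    r"osdmap e(?P<epoch>\d+) pool '(?P<pool>[^']+)' \((?P<pool_id>\d+)\) "
    r"object '(?P<object>[^']+)' -> pg (?P<raw_pg>\S+) \((?P<pgid>[^)]+)\) "
    r"-> up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_osd_map_line(line):
    """Return the fields of one ``ceph osd map`` output line as a dict."""
    match = OSD_MAP_LINE.search(line)
    if match is None:
        raise ValueError("unrecognized 'ceph osd map' output: %r" % line)
    fields = match.groupdict()
    for key in ("epoch", "pool_id"):
        fields[key] = int(fields[key])
    for key in ("up", "acting"):
        # Turn "1,0" into [1, 0]; an empty set stays an empty list.
        fields[key] = [int(osd) for osd in fields[key].split(",") if osd]
    return fields


sample = ("osdmap e537 pool 'data' (0) object 'test-object-1' -> "
          "pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]")
location = parse_osd_map_line(sample)
print(location["pgid"], location["up"], location["acting"])
```
+
+Because the *up* and *acting* sets come back as lists of integers, a script can
+compare them directly to spot a placement group that is temporarily remapped.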
+
+As the cluster evolves, the object location may change dynamically. One benefit
+of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
+the migration manually. See the `Architecture`_ section for details.
+
+.. _data placement: ../data-placement
+.. _pool: ../pools
+.. _placement group: ../placement-groups
+.. _Architecture: ../../../architecture
+.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running
+.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors
+.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering
+.. _CRUSH map: ../crush-map
+.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/
+.. _Placement Group Subsystem: ../control#placement-group-subsystem
diff --git a/src/ceph/doc/rados/operations/monitoring.rst b/src/ceph/doc/rados/operations/monitoring.rst
new file mode 100644
index 0000000..c291440
--- /dev/null
+++ b/src/ceph/doc/rados/operations/monitoring.rst
@@ -0,0 +1,351 @@
+======================
+ Monitoring a Cluster
+======================
+
+Once you have a running cluster, you may use the ``ceph`` tool to monitor your
+cluster. Monitoring a cluster typically involves checking OSD status, monitor
+status, placement group status and metadata server status.
+
+Using the command line
+======================
+
+Interactive mode
+----------------
+
+To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
+with no arguments. For example::
+
+ ceph
+ ceph> health
+ ceph> status
+ ceph> quorum_status
+ ceph> mon_status
+
+Non-default paths
+-----------------
+
+If you specified non-default locations for your configuration or keyring,
+you may specify their locations::
+
+ ceph -c /path/to/conf -k /path/to/keyring health
+
+Checking a Cluster's Status
+===========================
+
+After you start your cluster, and before you start reading and/or
+writing data, check your cluster's status first.
+
+To check a cluster's status, execute the following::
+
+ ceph status
+
+Or::
+
+ ceph -s
+
+In interactive mode, type ``status`` and press **Enter**. ::
+
+ ceph> status
+
+Ceph will print the cluster status. For example, a tiny Ceph demonstration
+cluster with one of each service may print the following:
+
+::
+
+ cluster:
+ id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20
+ health: HEALTH_OK
+
+ services:
+ mon: 1 daemons, quorum a
+ mgr: x(active)
+ mds: 1/1/1 up {0=a=up:active}
+ osd: 1 osds: 1 up, 1 in
+
+ data:
+ pools: 2 pools, 16 pgs
+ objects: 21 objects, 2246 bytes
+ usage: 546 GB used, 384 GB / 931 GB avail
+ pgs: 16 active+clean
+
+
+.. topic:: How Ceph Calculates Data Usage
+
+ The ``usage`` value reflects the *actual* amount of raw storage used. The
+ ``xxx GB / xxx GB`` value means the amount available (the lesser number)
+ of the overall storage capacity of the cluster. The notional number reflects
+ the size of the stored data before it is replicated, cloned or snapshotted.
+ Therefore, the amount of data actually stored typically exceeds the notional
+ amount stored, because Ceph creates replicas of the data and may also use
+ storage capacity for cloning and snapshotting.
+
+
+Watching a Cluster
+==================
+
+In addition to local logging by each daemon, Ceph clusters maintain
+a *cluster log* that records high level events about the whole system.
+This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
+default), but can also be monitored via the command line.
+
+To follow the cluster log, use the following command:
+
+::
+
+ ceph -w
+
+Ceph will print the status of the system, followed by each log message as it
+is emitted. For example:
+
+::
+
+ cluster:
+ id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20
+ health: HEALTH_OK
+
+ services:
+ mon: 1 daemons, quorum a
+ mgr: x(active)
+ mds: 1/1/1 up {0=a=up:active}
+ osd: 1 osds: 1 up, 1 in
+
+ data:
+ pools: 2 pools, 16 pgs
+ objects: 21 objects, 2246 bytes
+ usage: 546 GB used, 384 GB / 931 GB avail
+ pgs: 16 active+clean
+
+
+ 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
+ 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
+ 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
+
+
+In addition to using ``ceph -w`` to print log lines as they are emitted,
+use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
+log.
+
+Monitoring Health Checks
+========================
+
+Ceph continuously runs various *health checks* against its own status. When
+a health check fails, this is reflected in the output of ``ceph status`` (or
+``ceph health``). In addition, messages are sent to the cluster log to
+indicate when a check fails, and when the cluster recovers.
+
+For example, when an OSD goes down, the ``health`` section of the status
+output may be updated as follows:
+
+::
+
+ health: HEALTH_WARN
+ 1 osds down
+ Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
+
+At this time, cluster log messages are also emitted to record the failure of the
+health checks:
+
+::
+
+ 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
+ 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
+
+When the OSD comes back online, the cluster log records the cluster's return
+to a healthy state:
+
+::
+
+ 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
+ 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
+ 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
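+
+Health-check messages like the ones above follow a regular enough shape to be
+picked out of the cluster log with a script. The following Python sketch is an
+illustration only: it assumes the message layout shown in these examples, and
+``HEALTH_EVENT`` and ``parse_health_event`` are hypothetical names, not part
+of Ceph.
+
```python
import re

# Matches the "Health check ..." messages in the cluster log samples above.
# The layout is assumed from those samples only.
HEALTH_EVENT = re.compile(
    r"\[(?P<level>[A-Z]+)\] Health check (?P<event>failed|update|cleared): (?P<rest>.*)$"
)


def parse_health_event(line):
    """Return (level, event, check_code) for a health-check log line, else None."""
    match = HEALTH_EVENT.search(line)
    if match is None:
        return None
    rest = match.group("rest")
    if match.group("event") == "cleared":
        # "cleared" lines name the check right after the colon.
        code = re.match(r"(?P<code>[A-Z_]+)", rest)
    else:
        # "failed" and "update" lines end with the check code in parentheses.
        code = re.search(r"\((?P<code>[A-Z_]+)\)\s*$", rest)
    return (match.group("level"), match.group("event"),
            code.group("code") if code else None)


log_lines = [
    "2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] "
    "Health check failed: 1 osds down (OSD_DOWN)",
    "2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] "
    "Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, "
    "2 pgs degraded, 2 pgs undersized)",
    "2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] "
    "Cluster is now healthy",
]
events = [parse_health_event(line) for line in log_lines]
print(events)
```
+
+Lines that are not health-check messages, such as ``Cluster is now healthy``,
+yield ``None`` and can simply be skipped.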
+
+
+Detecting configuration issues
+==============================
+
+In addition to the health checks that Ceph continuously runs on its
+own status, there are some configuration issues that may only be detected
+by an external tool.
+
+Use the `ceph-medic`_ tool to run these additional checks on your Ceph
+cluster's configuration.
+
+Checking a Cluster's Usage Stats
+================================
+
+To check a cluster's data usage and data distribution among pools, you can
+use the ``df`` option. It is similar to Linux ``df``. Execute
+the following::
+
+ ceph df
+
+The **GLOBAL** section of the output provides an overview of the amount of
+storage your cluster uses for your data.
+
+- **SIZE:** The overall storage capacity of the cluster.
+- **AVAIL:** The amount of free space available in the cluster.
+- **RAW USED:** The amount of raw storage used.
+- **% RAW USED:** The percentage of raw storage used. Use this number in
+ conjunction with the ``full ratio`` and ``near full ratio`` to ensure that
+ you are not reaching your cluster's capacity. See `Storage Capacity`_ for
+ additional details.
+
+The **POOLS** section of the output provides a list of pools and the notional
+usage of each pool. The output from this section **DOES NOT** reflect replicas,
+clones or snapshots. For example, if you store an object with 1MB of data, the
+notional usage will be 1MB, but the actual usage may be 2MB or more depending
+on the number of replicas, clones and snapshots.
+
+- **NAME:** The name of the pool.
+- **ID:** The pool ID.
+- **USED:** The notional amount of data stored in kilobytes, unless the number
+ appends **M** for megabytes or **G** for gigabytes.
+- **%USED:** The notional percentage of storage used per pool.
+- **MAX AVAIL:** An estimate of the notional amount of data that can be written
+ to this pool.
+- **Objects:** The notional number of objects stored per pool.
+
+.. note:: The numbers in the **POOLS** section are notional. They are not
+ inclusive of the number of replicas, snapshots or clones. As a result,
+ the sum of the **USED** and **%USED** amounts will not add up to the
+ **RAW USED** and **%RAW USED** amounts in the **GLOBAL** section of the
+ output.
+
+.. note:: The **MAX AVAIL** value is a complicated function of the
+ replication or erasure code used, the CRUSH rule that maps storage
+ to devices, the utilization of those devices, and the configured
+ ``mon_osd_full_ratio``.
+
+
+
+Checking OSD Status
+===================
+
+You can check OSDs to ensure they are ``up`` and ``in`` by executing::
+
+ ceph osd stat
+
+Or::
+
+ ceph osd dump
+
+You can also view OSDs according to their position in the CRUSH map. ::
+
+ ceph osd tree
+
+Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up
+and their weight. ::
+
+ # id weight type name up/down reweight
+ -1 3 pool default
+ -3 3 rack mainrack
+ -2 3 host osd-host
+ 0 1 osd.0 up 1
+ 1 1 osd.1 up 1
+ 2 1 osd.2 up 1
+
+For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
+
+Checking Monitor Status
+=======================
+
+If your cluster has multiple monitors (likely), you should check the monitor
+quorum status after you start the cluster before reading and/or writing data. A
+quorum must be present when multiple monitors are running. You should also check
+monitor status periodically to ensure that they are running.
+
+To display the monitor map, execute the following::
+
+ ceph mon stat
+
+Or::
+
+ ceph mon dump
+
+To check the quorum status for the monitor cluster, execute the following::
+
+ ceph quorum_status
+
+Ceph will return the quorum status. For example, a Ceph cluster consisting of
+three monitors may return the following:
+
+.. code-block:: javascript
+
+ { "election_epoch": 10,
+ "quorum": [
+ 0,
+ 1,
+ 2],
+ "monmap": { "epoch": 1,
+ "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
+ "modified": "2011-12-12 13:28:27.505520",
+ "created": "2011-12-12 13:28:27.505520",
+ "mons": [
+ { "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:6789\/0"},
+ { "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:6790\/0"},
+ { "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:6791\/0"}
+ ]
+ }
+ }
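+
+The JSON output lends itself to automated checks. The following is a minimal
+Python sketch, assuming the field names shown in the example above
+(``quorum``, ``monmap``, ``mons``, ``rank``, ``name``); the function name
+``monitors_out_of_quorum`` is hypothetical and the sample data is a trimmed
+copy of that example.
+
```python
import json


def monitors_out_of_quorum(quorum_status):
    """Return the names of monitors in the monmap whose rank is not in quorum."""
    in_quorum = set(quorum_status["quorum"])
    return [mon["name"]
            for mon in quorum_status["monmap"]["mons"]
            if mon["rank"] not in in_quorum]


# Trimmed copy of the example output above.
sample = json.loads("""
{ "election_epoch": 10,
  "quorum": [0, 1, 2],
  "monmap": { "epoch": 1,
              "mons": [
                {"rank": 0, "name": "a", "addr": "127.0.0.1:6789/0"},
                {"rank": 1, "name": "b", "addr": "127.0.0.1:6790/0"},
                {"rank": 2, "name": "c", "addr": "127.0.0.1:6791/0"}
              ]
            }
}
""")

missing = monitors_out_of_quorum(sample)
print("all monitors in quorum" if not missing else "out of quorum: %s" % missing)
```
+
+An empty result means every monitor listed in the monitor map is part of the
+quorum.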
+
+Checking MDS Status
+===================
+
+Metadata servers provide metadata services for Ceph FS. Metadata servers have
+two sets of states: ``up | down`` and ``active | inactive``. To ensure your
+metadata servers are ``up`` and ``active``, execute the following::
+
+ ceph mds stat
+
+To display details of the metadata cluster, execute the following::
+
+ ceph fs dump
+
+
+Checking Placement Group States
+===============================
+
+Placement groups map objects to OSDs. When you monitor your
+placement groups, you will want them to be ``active`` and ``clean``.
+For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
+
+.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg
+
+
+Using the Admin Socket
+======================
+
+The Ceph admin socket allows you to query a daemon via a socket interface.
+By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon
+via the admin socket, log in to the host running the daemon and use the
+following command::
+
+ ceph daemon {daemon-name}
+ ceph daemon {path-to-socket-file}
+
+For example, the following are equivalent::
+
+ ceph daemon osd.0 foo
+ ceph daemon /var/run/ceph/ceph-osd.0.asok foo
+
+To view the available admin socket commands, execute the following command::
+
+ ceph daemon {daemon-name} help
+
+The admin socket command enables you to show and set your configuration at
+runtime. See `Viewing a Configuration at Runtime`_ for details.
+
+Additionally, you can set configuration values at runtime directly (i.e., the
+admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id}
+injectargs``, which relies on the monitor but doesn't require you to log in
+directly to the host in question).
+
+.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config
+.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
+.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/
diff --git a/src/ceph/doc/rados/operations/operating.rst b/src/ceph/doc/rados/operations/operating.rst
new file mode 100644
index 0000000..791941a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/operating.rst
@@ -0,0 +1,251 @@
+=====================
+ Operating a Cluster
+=====================
+
+.. index:: systemd; operating a cluster
+
+
+Running Ceph with systemd
+==========================
+
+For all distributions that support systemd (CentOS 7, Fedora, Debian
+Jessie 8 and later, SUSE), Ceph daemons are now managed using native
+systemd files instead of the legacy sysvinit scripts. For example::
+
+ sudo systemctl start ceph.target # start all daemons
+ sudo systemctl status ceph-osd@12 # check status of osd.12
+
+To list the Ceph systemd units on a node, execute::
+
+ sudo systemctl status ceph\*.service ceph\*.target
+
+Starting all Daemons
+--------------------
+
+To start all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo systemctl start ceph.target
+
+
+Stopping all Daemons
+--------------------
+
+To stop all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo systemctl stop ceph\*.service ceph\*.target
+
+
+Starting all Daemons by Type
+----------------------------
+
+To start all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl start ceph-osd.target
+ sudo systemctl start ceph-mon.target
+ sudo systemctl start ceph-mds.target
+
+
+Stopping all Daemons by Type
+----------------------------
+
+To stop all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl stop ceph-mon\*.service ceph-mon.target
+ sudo systemctl stop ceph-osd\*.service ceph-osd.target
+ sudo systemctl stop ceph-mds\*.service ceph-mds.target
+
+
+Starting a Daemon
+-----------------
+
+To start a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl start ceph-osd@{id}
+ sudo systemctl start ceph-mon@{hostname}
+ sudo systemctl start ceph-mds@{hostname}
+
+For example::
+
+ sudo systemctl start ceph-osd@1
+ sudo systemctl start ceph-mon@ceph-server
+ sudo systemctl start ceph-mds@ceph-server
+
+
+Stopping a Daemon
+-----------------
+
+To stop a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl stop ceph-osd@{id}
+ sudo systemctl stop ceph-mon@{hostname}
+ sudo systemctl stop ceph-mds@{hostname}
+
+For example::
+
+ sudo systemctl stop ceph-osd@1
+ sudo systemctl stop ceph-mon@ceph-server
+ sudo systemctl stop ceph-mds@ceph-server
+
+
+.. index:: Ceph service; Upstart; operating a cluster
+
+
+
+Running Ceph with Upstart
+=========================
+
+When deploying Ceph with ``ceph-deploy`` on Ubuntu Trusty, you may start and
+stop Ceph daemons on a :term:`Ceph Node` using the event-based `Upstart`_.
+Upstart does not require you to define daemon instances in the Ceph
+configuration file.
+
+To list the Ceph Upstart jobs and instances on a node, execute::
+
+ sudo initctl list | grep ceph
+
+See `initctl`_ for additional details.
+
+
+Starting all Daemons
+--------------------
+
+To start all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo start ceph-all
+
+
+Stopping all Daemons
+--------------------
+
+To stop all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo stop ceph-all
+
+
+Starting all Daemons by Type
+----------------------------
+
+To start all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo start ceph-osd-all
+ sudo start ceph-mon-all
+ sudo start ceph-mds-all
+
+
+Stopping all Daemons by Type
+----------------------------
+
+To stop all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo stop ceph-osd-all
+ sudo stop ceph-mon-all
+ sudo stop ceph-mds-all
+
+
+Starting a Daemon
+-----------------
+
+To start a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo start ceph-osd id={id}
+ sudo start ceph-mon id={hostname}
+ sudo start ceph-mds id={hostname}
+
+For example::
+
+ sudo start ceph-osd id=1
+ sudo start ceph-mon id=ceph-server
+ sudo start ceph-mds id=ceph-server
+
+
+Stopping a Daemon
+-----------------
+
+To stop a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo stop ceph-osd id={id}
+ sudo stop ceph-mon id={hostname}
+ sudo stop ceph-mds id={hostname}
+
+For example::
+
+ sudo stop ceph-osd id=1
+ sudo stop ceph-mon id=ceph-server
+ sudo stop ceph-mds id=ceph-server
+
+
+.. index:: Ceph service; sysvinit; operating a cluster
+
+
+Running Ceph
+============
+
+Each time you **start**, **restart**, or **stop** Ceph daemons (or your
+entire cluster), you must specify at least one option and one command. You may
+also specify a daemon type or a daemon instance. ::
+
+ {commandline} [options] [commands] [daemons]
+
+
+The ``ceph`` options include:
+
++-----------------+----------+-------------------------------------------------+
+| Option          | Shortcut | Description                                     |
++=================+==========+=================================================+
+| ``--verbose``   | ``-v``   | Use verbose logging.                            |
++-----------------+----------+-------------------------------------------------+
+| ``--valgrind``  | ``N/A``  | (Dev and QA only) Use `Valgrind`_ debugging.    |
++-----------------+----------+-------------------------------------------------+
+| ``--allhosts``  | ``-a``   | Execute on all nodes in ``ceph.conf``.          |
+|                 |          | Otherwise, it only executes on ``localhost``.   |
++-----------------+----------+-------------------------------------------------+
+| ``--restart``   | ``N/A``  | Automatically restart daemon if it core dumps.  |
++-----------------+----------+-------------------------------------------------+
+| ``--norestart`` | ``N/A``  | Don't restart a daemon if it core dumps.        |
++-----------------+----------+-------------------------------------------------+
+| ``--conf``      | ``-c``   | Use an alternate configuration file.            |
++-----------------+----------+-------------------------------------------------+
+
+The ``ceph`` commands include:
+
++------------------+------------------------------------------------------------+
+| Command          | Description                                                |
++==================+============================================================+
+| ``start``        | Start the daemon(s).                                       |
++------------------+------------------------------------------------------------+
+| ``stop``         | Stop the daemon(s).                                        |
++------------------+------------------------------------------------------------+
+| ``forcestop``    | Force the daemon(s) to stop. Same as ``kill -9``.          |
++------------------+------------------------------------------------------------+
+| ``killall``      | Kill all daemons of a particular type.                     |
++------------------+------------------------------------------------------------+
+| ``cleanlogs``    | Cleans out the log directory.                              |
++------------------+------------------------------------------------------------+
+| ``cleanalllogs`` | Cleans out **everything** in the log directory.            |
++------------------+------------------------------------------------------------+
+
+For subsystem operations, the ``ceph`` service can target specific daemon types
+by adding a particular daemon type for the ``[daemons]`` option. Daemon types
+include:
+
+- ``mon``
+- ``osd``
+- ``mds``
+
+
+
+.. _Valgrind: http://www.valgrind.org/
+.. _Upstart: http://upstart.ubuntu.com/index.html
+.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html
diff --git a/src/ceph/doc/rados/operations/pg-concepts.rst b/src/ceph/doc/rados/operations/pg-concepts.rst
new file mode 100644
index 0000000..636d6bf
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pg-concepts.rst
@@ -0,0 +1,102 @@
+==========================
+ Placement Group Concepts
+==========================
+
+When you execute commands like ``ceph -w``, ``ceph osd dump``, and other
+commands related to placement groups, Ceph may return values using some
+of the following terms:
+
+*Peering*
+ The process of bringing all of the OSDs that store
+ a Placement Group (PG) into agreement about the state
+ of all of the objects (and their metadata) in that PG.
+ Note that agreeing on the state does not mean that
+ they all have the latest contents.
+
+*Acting Set*
+ The ordered list of OSDs who are (or were as of some epoch)
+ responsible for a particular placement group.
+
+*Up Set*
+ The ordered list of OSDs responsible for a particular placement
+ group for a particular epoch according to CRUSH. Normally this
+ is the same as the *Acting Set*, except when the *Acting Set* has
+ been explicitly overridden via ``pg_temp`` in the OSD Map.
+
+*Current Interval* or *Past Interval*
+ A sequence of OSD map epochs during which the *Acting Set* and *Up
+ Set* for a particular placement group do not change.
+
+*Primary*
+ The member of the *Acting Set* (by convention, the first)
+ that is responsible for coordinating peering, and is
+ the only OSD that will accept client-initiated
+ writes to objects in a placement group.
+
+*Replica*
+ A non-primary OSD in the *Acting Set* for a placement group
+ (and who has been recognized as such and *activated* by the primary).
+
+*Stray*
+ An OSD that is not a member of the current *Acting Set*, but
+ has not yet been told that it can delete its copies of a
+ particular placement group.
+
+*Recovery*
+ Ensuring that copies of all of the objects in a placement group
+ are on all of the OSDs in the *Acting Set*. Once *Peering* has
+ been performed, the *Primary* can start accepting write operations,
+ and *Recovery* can proceed in the background.
+
+*PG Info*
+ Basic metadata about the placement group's creation epoch, the version
+ for the most recent write to the placement group, *last epoch started*,
+ *last epoch clean*, and the beginning of the *current interval*. Any
+ inter-OSD communication about placement groups includes the *PG Info*,
+ such that any OSD that knows a placement group exists (or once existed)
+ also has a lower bound on *last epoch clean* or *last epoch started*.
+
+*PG Log*
+ A list of recent updates made to objects in a placement group.
+ Note that these logs can be truncated after all OSDs
+ in the *Acting Set* have acknowledged up to a certain
+ point.
+
+*Missing Set*
+ Each OSD notes update log entries and if they imply updates to
+ the contents of an object, adds that object to a list of needed
+ updates. This list is called the *Missing Set* for that ``<OSD,PG>``.
+
+*Authoritative History*
+ A complete, and fully ordered set of operations that, if
+ performed, would bring an OSD's copy of a placement group
+ up to date.
+
+*Epoch*
+ A (monotonically increasing) OSD map version number.
+
+*Last Epoch Start*
+ The last epoch at which all nodes in the *Acting Set*
+ for a particular placement group agreed on an
+ *Authoritative History*. At this point, *Peering* is
+ deemed to have been successful.
+
+*up_thru*
+ Before a *Primary* can successfully complete the *Peering* process,
+ it must inform a monitor that it is alive through the current
+ OSD map *Epoch* by having the monitor set its *up_thru* in the osd
+ map. This helps *Peering* ignore previous *Acting Sets* for which
+ *Peering* never completed after certain sequences of failures, such as
+ the second interval below:
+
+ - *acting set* = [A,B]
+ - *acting set* = [A]
+ - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+ - *acting set* = [B] (B restarts, A does not)
+
+*Last Epoch Clean*
+ The last *Epoch* at which all nodes in the *Acting set*
+ for a particular placement group were completely
+ up to date (both placement group logs and object contents).
+ At this point, *recovery* is deemed to have been
+ completed.
diff --git a/src/ceph/doc/rados/operations/pg-repair.rst b/src/ceph/doc/rados/operations/pg-repair.rst
new file mode 100644
index 0000000..0d6692a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pg-repair.rst
@@ -0,0 +1,4 @@
+Repairing PG inconsistencies
+============================
+
+
diff --git a/src/ceph/doc/rados/operations/pg-states.rst b/src/ceph/doc/rados/operations/pg-states.rst
new file mode 100644
index 0000000..0fbd3dc
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pg-states.rst
@@ -0,0 +1,80 @@
+========================
+ Placement Group States
+========================
+
+When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``),
+Ceph will report on the status of the placement groups. A placement group has
+one or more states. The optimum state for placement groups in the placement group
+map is ``active + clean``.
+
+*Creating*
+ Ceph is still creating the placement group.
+
+*Active*
+ Ceph will process requests to the placement group.
+
+*Clean*
+ Ceph replicated all objects in the placement group the correct number of times.
+
+*Down*
+ A replica with necessary data is down, so the placement group is offline.
+
+*Scrubbing*
+ Ceph is checking the placement group for inconsistencies.
+
+*Degraded*
+ Ceph has not replicated some objects in the placement group the correct number of times yet.
+
+*Inconsistent*
+ Ceph detects inconsistencies in one or more replicas of an object in the placement group
+ (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.).
+
+*Peering*
+ The placement group is undergoing the peering process.
+
+*Repair*
+ Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
+
+*Recovering*
+ Ceph is migrating/synchronizing objects and their replicas.
+
+*Forced-Recovery*
+ The user has enforced a high recovery priority for the PG.
+
+*Backfill*
+ Ceph is scanning and synchronizing the entire contents of a placement group
+ instead of inferring what contents need to be synchronized from the logs of
+ recent operations. *Backfill* is a special case of recovery.
+
+*Forced-Backfill*
+ The user has enforced a high backfill priority for the PG.
+
+*Wait-backfill*
+ The placement group is waiting in line to start backfill.
+
+*Backfill-toofull*
+ A backfill operation is waiting because the destination OSD is over its
+ full ratio.
+
+*Incomplete*
+ Ceph detects that a placement group is missing information about
+ writes that may have occurred, or does not have any healthy
+ copies. If you see this state, try to start any failed OSDs that may
+ contain the needed information. In the case of an erasure coded pool,
+ temporarily reducing ``min_size`` may allow recovery.
+
+*Stale*
+ The placement group is in an unknown state - the monitors have not received
+ an update for it since the placement group mapping changed.
+
+*Remapped*
+ The placement group is temporarily mapped to a different set of OSDs from what
+ CRUSH specified.
+
+*Undersized*
+ The placement group has fewer copies than the configured pool replication level.
+
+*Peered*
+ The placement group has peered, but cannot serve client IO due to not having
+ enough copies to reach the pool's configured min_size parameter. Recovery
+ may occur in this state, so the pg may heal up to min_size eventually.
diff --git a/src/ceph/doc/rados/operations/placement-groups.rst b/src/ceph/doc/rados/operations/placement-groups.rst
new file mode 100644
index 0000000..fee833a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/placement-groups.rst
@@ -0,0 +1,469 @@
+==================
+ Placement Groups
+==================
+
+.. _preselection:
+
+A preselection of pg_num
+========================
+
+When creating a new pool with::
+
+ ceph osd pool create {pool-name} pg_num
+
+it is mandatory to choose the value of ``pg_num`` because it cannot be
+calculated automatically. Here are a few values commonly used:
+
+- Fewer than 5 OSDs: set ``pg_num`` to 128
+
+- Between 5 and 10 OSDs: set ``pg_num`` to 512
+
+- Between 10 and 50 OSDs: set ``pg_num`` to 1024
+
+- If you have more than 50 OSDs, you need to understand the tradeoffs
+ and how to calculate the ``pg_num`` value by yourself
+
+- To calculate the ``pg_num`` value yourself, use the `pgcalc`_ tool
+
+As the number of OSDs increases, choosing the right value for pg_num
+becomes more important because it has a significant influence on the
+behavior of the cluster as well as the durability of the data when
+something goes wrong (i.e. the probability that a catastrophic event
+leads to data loss).
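+
+The rule of thumb above can be written down as a small lookup. The following
+Python sketch is an illustration only: the function name ``suggested_pg_num``
+is hypothetical, and because the ranges in the list overlap at 10 OSDs, the
+sketch assumes that exactly 10 OSDs falls into the lower bracket.
+
```python
def suggested_pg_num(osd_count):
    """Return the commonly used pg_num for a cluster size, or None above 50 OSDs.

    Boundary handling is an assumption: the documented ranges overlap at
    10 OSDs, and this sketch places 10 in the "5 to 10" bracket.
    """
    if osd_count < 1:
        raise ValueError("osd_count must be a positive integer")
    if osd_count < 5:
        return 128
    if osd_count <= 10:
        return 512
    if osd_count <= 50:
        return 1024
    # More than 50 OSDs: no preselected value; calculate it for the cluster.
    return None


for osds in (3, 8, 40, 200):
    print(osds, suggested_pg_num(osds))
```
+
+A return value of ``None`` signals that no preselected value applies and that
+``pg_num`` has to be worked out for the specific cluster.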
+
+How are Placement Groups used ?
+===============================
+
+A placement group (PG) aggregates objects within a pool because
+tracking object placement and object metadata on a per-object basis is
+computationally expensive--i.e., a system with millions of objects
+cannot realistically track placement on a per-object basis.
+
+.. ditaa::
+ /-----\ /-----\ /-----\ /-----\ /-----\
+ | obj | | obj | | obj | | obj | | obj |
+ \-----/ \-----/ \-----/ \-----/ \-----/
+ | | | | |
+ +--------+--------+ +---+----+
+ | |
+ v v
+ +-----------------------+ +-----------------------+
+ | Placement Group #1 | | Placement Group #2 |
+ | | | |
+ +-----------------------+ +-----------------------+
+ | |
+ +------------------------------+
+ |
+ v
+ +-----------------------+
+ | Pool |
+ | |
+ +-----------------------+
+
+The Ceph client will calculate which placement group an object should
+be in. It does this by hashing the object ID and applying an operation
+based on the number of PGs in the defined pool and the ID of the pool.
+See `Mapping PGs to OSDs`_ for details.
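+
+The "operation based on the number of PGs" can be illustrated with a short
+Python sketch. It makes several assumptions: the object-name hash itself is
+not reproduced, so the sketch starts from an already computed 32-bit hash
+value; the folding step is modeled on the ``ceph_stable_mod`` helper in the
+Ceph sources and should be checked against them; and ``pg_seed_for_hash`` and
+the pool's ``pg_num`` of 8 are hypothetical. The hash ``0xd1743484`` is taken
+from the sample ``ceph osd map`` output ``pg 0.d1743484 (0.4)``.
+
```python
def ceph_stable_mod(x, b, bmask):
    """Fold x into the range [0, b) using a power-of-two mask.

    Modeled on the helper of the same name in the Ceph sources (an
    assumption of this sketch). bmask is the smallest 2**n - 1 >= b - 1.
    """
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def pg_seed_for_hash(object_hash, pg_num):
    """Map an already computed object hash to a placement group number."""
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return ceph_stable_mod(object_hash, pg_num, bmask)


pool_id = 0            # pool 'data' in the sample output
object_hash = 0xd1743484
pg_num = 8             # hypothetical pg_num for the pool
print("%d.%x" % (pool_id, pg_seed_for_hash(object_hash, pg_num)))
```
+
+When ``pg_num`` is not a power of two, hash values that would land beyond the
+last placement group are folded back with the next smaller mask, which is why
+only part of the objects move when ``pg_num`` grows.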
+
+The object's contents within a placement group are stored in a set of
+OSDs. For instance, in a replicated pool of size two, each placement
+group will store objects on two OSDs, as shown below.
+
+.. ditaa::
+
+ +-----------------------+ +-----------------------+
+ | Placement Group #1 | | Placement Group #2 |
+ | | | |
+ +-----------------------+ +-----------------------+
+ | | | |
+ v v v v
+ /----------\ /----------\ /----------\ /----------\
+ | | | | | | | |
+ | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 |
+ | | | | | | | |
+ \----------/ \----------/ \----------/ \----------/
+
+
+Should OSD #2 fail, another will be assigned to Placement Group #1 and
+will be filled with copies of all objects in OSD #1. If the pool size
+is changed from two to three, an additional OSD will be assigned to
+the placement group and will receive copies of all objects in the
+placement group.
+
+Placement groups do not own the OSD; they share it with other
+placement groups from the same pool or even other pools. If OSD #2
+fails, the Placement Group #2 will also have to restore copies of
+objects, using OSD #3.
+
+When the number of placement groups increases, the new placement
+groups will be assigned OSDs. The result of the CRUSH function will
+also change and some objects from the former placement groups will be
+copied over to the new Placement Groups and removed from the old ones.
+
+Placement Groups Tradeoffs
+==========================
+
+Data durability and even distribution among all OSDs call for more
+placement groups but their number should be reduced to the minimum to
+save CPU and memory.
+
+.. _data durability:
+
+Data durability
+---------------
+
+After an OSD fails, the risk of data loss increases until the data it
+contained is fully recovered. Let's imagine a scenario that causes
+permanent data loss in a single placement group:
+
+- The OSD fails and all copies of the objects it contains are lost.
+ For all objects within the placement group, the number of replicas
+ suddenly drops from three to two.
+
+- Ceph starts recovery for this placement group by choosing a new OSD
+ to re-create the third copy of all objects.
+
+- Another OSD, within the same placement group, fails before the new
+ OSD is fully populated with the third copy. Some objects will then
+ only have one surviving copy.
+
+- Ceph picks yet another OSD and keeps copying objects to restore the
+ desired number of copies.
+
+- A third OSD, within the same placement group, fails before recovery
+ is complete. If this OSD contained the only remaining copy of an
+ object, it is permanently lost.
+
+In a cluster containing 10 OSDs with 512 placement groups in a three
+replica pool, CRUSH will give each placement group three OSDs. In the
+end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
+Groups. When the first OSD fails, the above scenario will therefore
+start recovery for all 150 placement groups at the same time.
+
+The 150 placement groups being recovered are likely to be
+homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
+therefore likely to send copies of objects to all others and also
+receive some new objects to be stored because they became part of a
+new placement group.
+
+The amount of time it takes for this recovery to complete depends
+entirely on the architecture of the Ceph cluster. Let's say each OSD is
+hosted by a 1TB SSD on a single machine, all of them are connected
+to a 10Gb/s switch, and the recovery for a single OSD completes within
+M minutes. If there are two OSDs per machine using spinners with no
+SSD journal and a 1Gb/s switch, it will be at least an order of
+magnitude slower.
+
+In a cluster of this size, the number of placement groups has almost
+no influence on data durability. It could be 128 or 8192 and the
+recovery would not be slower or faster.
+
+However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
+is likely to speed up recovery and therefore improve data durability
+significantly. Each OSD now participates in only ~75 placement groups
+instead of ~150 when there were only 10 OSDs and it will still require
+all 19 remaining OSDs to perform the same amount of object copies in
+order to recover. But where 10 OSDs had to copy approximately 100GB
+each, they now have to copy 50GB each instead. If the network was the
+bottleneck, recovery will happen twice as fast. In other words,
+recovery goes faster when the number of OSDs increases.
+
+If this cluster grows to 40 OSDs, each of them will only host ~38
+placement groups. If an OSD dies, recovery will keep going faster
+unless it is blocked by another bottleneck. However, if this cluster
+grows to 200 OSDs, each of them will only host ~7 placement groups. If
+an OSD dies, recovery will happen among at most ~21 (7 * 3) OSDs
+in these placement groups: recovery will take longer than when there
+were 40 OSDs, meaning the number of placement groups should be
+increased.
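+
+The per-OSD figures used in this section can be reproduced with a short
+calculation. The following Python sketch is only an illustration: the helper
+name ``pg_copies_per_osd`` is hypothetical, and it assumes a single pool whose
+placement groups CRUSH spreads perfectly evenly over all OSDs. The prose
+rounds the results loosely.
+

```python
def pg_copies_per_osd(pg_num: int, pool_size: int, osd_count: int) -> float:
    """Average number of placement group copies hosted by each OSD.

    Assumes one pool and a perfectly even CRUSH distribution.
    """
    return pg_num * pool_size / osd_count


# 512 placement groups in a three replica pool, for growing cluster sizes.
for osd_count in (10, 20, 40, 200):
    average = pg_copies_per_osd(512, 3, osd_count)
    print(f"{osd_count:>3} OSDs: {average:.1f} placement groups per OSD")
```

+
+The fewer placement groups an OSD hosts, the fewer peers can take part in
+its recovery, which is why the 200 OSD case argues for a larger ``pg_num``.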
+
+No matter how short the recovery time is, there is a chance for a
+second OSD to fail while it is in progress. In the 10-OSD cluster
+described above, if any of the 9 remaining OSDs fails, then ~17 placement
+groups (i.e. ~150 / 9 placement groups being recovered) will only have one
+surviving copy. And if any of the 8 remaining OSDs fails, the last
+objects of two placement groups are likely to be lost (i.e. ~17 / 8
+placement groups with only one remaining copy being recovered).
+
+When the size of the cluster grows to 20 OSDs, the number of Placement
+Groups damaged by the loss of three OSDs drops. The second OSD lost
+will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
+instead of ~17 and the third OSD lost will only lose data if it is one
+of the four OSDs containing the surviving copy. In other words, if the
+probability of losing one OSD is 0.0001% during the recovery time
+frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
+0.0001% in the cluster with 20 OSDs.
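+
+The estimates above follow from a few divisions. Here is a minimal Python
+sketch, assuming (as the text does) that the placement groups of a failed OSD
+are spread evenly over the surviving OSDs. The function name is hypothetical
+and the model is deliberately crude.
+

```python
def pgs_hit_by_next_failure(pgs_at_risk: float, surviving_osds: int) -> float:
    """Placement groups that one further OSD failure is expected to hit.

    Assumes the placement groups at risk are spread evenly over the
    surviving OSDs, which is a simplification.
    """
    return pgs_at_risk / surviving_osds


# 10 OSDs, ~150 placement groups per OSD.
second_10 = pgs_hit_by_next_failure(150, 9)        # left with one copy
third_10 = pgs_hit_by_next_failure(second_10, 8)   # likely to be lost

# 20 OSDs, ~75 placement groups per OSD.
second_20 = pgs_hit_by_next_failure(75, 19)        # left with one copy

print(round(second_10), round(third_10), round(second_20))
```

+
+Doubling the number of OSDs cuts the number of placement groups exposed to a
+second failure by roughly a factor of four in this model.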
+
+In a nutshell, more OSDs mean faster recovery and a lower risk of
+cascading failures leading to the permanent loss of a Placement
+Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
+cluster with fewer than 50 OSDs as far as data durability is concerned.
+
+Note: It may take a long time for a new OSD added to the cluster to be
+populated with placement groups that were assigned to it. However,
+no object is degraded in the meantime and the durability of the data
+contained in the cluster is not affected.
+
+.. _object distribution:
+
+Object distribution within a pool
+---------------------------------
+
+Ideally objects are evenly distributed in each placement group. Since
+CRUSH computes the placement group for each object, but does not
+actually know how much data is stored in each OSD within this
+placement group, the ratio between the number of placement groups and
+the number of OSDs may influence the distribution of the data
+significantly.
+
+For instance, if there were a single placement group for ten OSDs in a
+three replica pool, only three OSDs would be used because CRUSH would
+have no other choice. When more placement groups are available,
+objects are more likely to be evenly spread among them. CRUSH also
+makes every effort to evenly spread OSDs among all existing Placement
+Groups.
+
+As long as there are one or two orders of magnitude more Placement
+Groups than OSDs, the distribution should be even. For instance, 300
+placement groups for 3 OSDs, 1000 placement groups for 10 OSDs etc.
+
+Uneven data distribution can be caused by factors other than the ratio
+between OSDs and placement groups. Since CRUSH does not take into
+account the size of the objects, a few very large objects may create
+an imbalance. Let's say one million 4K objects totaling 4GB are evenly
+spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10
+= 400MB on each OSD. If one 400MB object is added to the pool, the
+three OSDs supporting the placement group in which the object has been
+placed will be filled with 400MB + 400MB = 800MB while the seven
+others will remain occupied with only 400MB.
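+
+The imbalance in this example can be tallied per OSD. The Python sketch below
+mirrors the simplified accounting of the paragraph above (400MB of small
+objects per OSD); which three OSDs receive the large object is an arbitrary
+choice made for illustration.
+

```python
OSD_COUNT = 10
REPLICAS = 3
SMALL_OBJECTS_TOTAL_MB = 4000   # one million 4K objects, rounded to 4GB as in the text
LARGE_OBJECT_MB = 400

# Small objects are assumed to spread evenly over all OSDs.
usage_mb = [SMALL_OBJECTS_TOTAL_MB / OSD_COUNT] * OSD_COUNT

# The large object lands in one placement group, so only the three OSDs
# backing that placement group store a copy of it. OSDs 0-2 are arbitrary.
for osd in range(REPLICAS):
    usage_mb[osd] += LARGE_OBJECT_MB

print(usage_mb)
```

+
+Three OSDs end up holding twice as much data as the other seven, even though
+the placement groups themselves are evenly distributed.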
+
+.. _resource usage:
+
+Memory, CPU and network usage
+-----------------------------
+
+For each placement group, OSDs and MONs need memory, network and CPU
+at all times and even more during recovery. Sharing this overhead by
+clustering objects within a placement group is one of the main reasons
+they exist.
+
+Minimizing the number of placement groups saves significant amounts of
+resources.
+
+Choosing the number of Placement Groups
+=======================================
+
+If you have more than 50 OSDs, we recommend approximately 50-100
+placement groups per OSD to balance out resource usage, data
+durability and distribution. If you have fewer than 50 OSDs, choosing
+among the `preselection`_ above is best. For a single pool of objects,
+you can use the following formula to get a baseline::
+
+                (OSDs * 100)
+   Total PGs =  ------------
+                 pool size
+
+Where **pool size** is either the number of replicas for replicated
+pools or the K+M sum for erasure coded pools (as returned by **ceph
+osd erasure-code-profile get**).
+
+You should then check if the result makes sense with the way you
+designed your Ceph cluster to maximize `data durability`_,
+`object distribution`_ and minimize `resource usage`_.
+
+The result should be **rounded up to the nearest power of two.**
+Rounding up is optional, but recommended for CRUSH to evenly balance
+the number of objects among placement groups.
+
+As an example, for a cluster with 200 OSDs and a pool size of 3
+replicas, you would estimate your number of PGs as follows::
+
+   (200 * 100)
+   ----------- = 6667. Nearest power of 2: 8192
+        3
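+
+The baseline formula and the rounding step can be combined in a few lines.
+This is a minimal Python sketch, not a Ceph tool: the function name
+``suggested_pg_count`` and the default of 100 placement groups per OSD are
+assumptions taken from the formula above, and the result still needs the
+sanity checks described in this section.
+

```python
def suggested_pg_count(osd_count: int, pool_size: int, pgs_per_osd: int = 100) -> int:
    """Baseline PG count for a single pool, rounded up to a power of two."""
    baseline = osd_count * pgs_per_osd / pool_size
    power_of_two = 1
    while power_of_two < baseline:
        power_of_two *= 2
    return power_of_two


# 200 OSDs and a pool size of 3, as in the example above.
print(suggested_pg_count(200, 3))
```

+
+For an erasure coded pool, pass the K+M sum as ``pool_size``.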
+
+When using multiple data pools for storing objects, you need to ensure
+that you balance the number of placement groups per pool with the
+number of placement groups per OSD so that you arrive at a reasonable
+total number of placement groups that provides reasonably low variance
+per OSD without taxing system resources or making the peering process
+too slow.
+
+For instance, a cluster of 10 pools, each with 512 placement groups, on
+ten OSDs has a total of 5,120 placement groups spread over ten OSDs,
+that is, 512 placement groups per OSD. That does not use too many
+resources. However, if 1,000 pools were created with 512 placement
+groups each, the OSDs would handle ~50,000 placement groups each and
+would require significantly more resources and time for peering.
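+
+The per-OSD totals for several pools can be checked the same way. In this
+Python sketch the function name is hypothetical, and ``pool_size=1`` counts
+each placement group once, as the paragraph above does. Passing the replica
+count instead is an extension of the text's example that counts every copy an
+OSD has to host.
+

```python
def pgs_per_osd_across_pools(pool_count: int, pg_num: int, osd_count: int,
                             pool_size: int = 1) -> float:
    """Average placement groups per OSD when every pool has pg_num PGs.

    pool_size=1 counts each placement group once; pass the replica
    count to count every copy an OSD has to host.
    """
    return pool_count * pg_num * pool_size / osd_count


print(pgs_per_osd_across_pools(10, 512, 10))
print(pgs_per_osd_across_pools(1000, 512, 10))
print(pgs_per_osd_across_pools(10, 512, 10, pool_size=3))
```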
+
+You may find the `PGCalc`_ tool helpful.
+
+
+.. _setting the number of placement groups:
+
+Set the Number of Placement Groups
+==================================
+
+To set the number of placement groups in a pool, you must specify the
+number of placement groups at the time you create the pool.
+See `Create a Pool`_ for details. Once you have set placement groups for a
+pool, you may increase the number of placement groups (but you cannot
+decrease the number of placement groups). To increase the number of
+placement groups, execute the following::
+
+ ceph osd pool set {pool-name} pg_num {pg_num}
+
+Once you increase the number of placement groups, you must also
+increase the number of placement groups for placement (``pgp_num``)
+before your cluster will rebalance. The ``pgp_num`` will be the number of
+placement groups that will be considered for placement by the CRUSH
+algorithm. Increasing ``pg_num`` splits the placement groups, but data
+will not be migrated to the newer placement groups until the number of
+placement groups for placement, i.e. ``pgp_num``, is increased. The
+``pgp_num`` should be equal to the ``pg_num``. To increase the number of
+placement groups for placement, execute the following::
+
+ ceph osd pool set {pool-name} pgp_num {pgp_num}
+
+
+Get the Number of Placement Groups
+==================================
+
+To get the number of placement groups in a pool, execute the following::
+
+ ceph osd pool get {pool-name} pg_num
+
+
+Get a Cluster's PG Statistics
+=============================
+
+To get the statistics for the placement groups in your cluster, execute the following::
+
+ ceph pg dump [--format {format}]
+
+Valid formats are ``plain`` (default) and ``json``.
+
+
+Get Statistics for Stuck PGs
+============================
+
+To get the statistics for all placement groups stuck in a specified state,
+execute the following::
+
+ ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
+
+**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
+with the most up-to-date data to come up and in.
+
+**Unclean** Placement groups contain objects that are not replicated the desired number
+of times. They should be recovering.
+
+**Stale** Placement groups are in an unknown state - the OSDs that host them have not
+reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).
+
+Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
+of seconds the placement group is stuck before including it in the returned statistics
+(default 300 seconds).
+
+
+Get a PG Map
+============
+
+To get the placement group map for a particular placement group, execute the following::
+
+ ceph pg map {pg-id}
+
+For example::
+
+ ceph pg map 1.6c
+
+Ceph will return the placement group map, the placement group, and the OSD status::
+
+ osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
+
+
+Get a PG's Statistics
+=====================
+
+To retrieve statistics for a particular placement group, execute the following::
+
+ ceph pg {pg-id} query
+
+
+Scrub a Placement Group
+=======================
+
+To scrub a placement group, execute the following::
+
+ ceph pg scrub {pg-id}
+
+Ceph checks the primary and any replica nodes, generates a catalog of all objects
+in the placement group and compares them to ensure that no objects are missing
+or mismatched, and their contents are consistent. Assuming the replicas all
+match, a final semantic sweep ensures that all of the snapshot-related object
+metadata is consistent. Errors are reported via logs.
+
+Prioritize backfill/recovery of a Placement Group(s)
+====================================================
+
+You may run into a situation where several placement groups require
+recovery and/or backfill, and some of them hold data more important
+than others (for example, some PGs may hold data for images used by running
+machines while other PGs may be used by inactive machines or less relevant data).
+In that case, you may want to prioritize recovery of those groups so that
+performance and/or availability of the data stored on them is restored
+earlier. To mark particular placement groups as prioritized during
+backfill or recovery, execute the following::
+
+ ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+ ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+
+This will cause Ceph to perform recovery or backfill on the specified placement
+groups first, before other placement groups. This does not interrupt
+ongoing backfills or recovery, but causes the specified PGs to be processed
+as soon as possible. If you change your mind or prioritize the wrong groups,
+use::
+
+ ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+ ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+
+This will remove the "force" flag from those PGs and they will be processed
+in the default order. Again, this doesn't affect placement groups that are
+currently being processed, only those that are still queued.
+
+The "force" flag is cleared automatically after recovery or backfill of the
+group is done.
+
+Revert Lost
+===========
+
+If the cluster has lost one or more objects, and you have decided to
+abandon the search for the lost data, you must mark the unfound objects
+as ``lost``.
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This is
+possible given unusual combinations of failures that allow the cluster
+to learn about writes that were performed before the writes themselves
+are recovered.
+
+The command accepts two options. The "revert" option will either roll back to
+a previous version of the object or (if it was a new object) forget about it
+entirely. The "delete" option will forget about the unfound objects
+entirely. To mark the "unfound" objects as "lost", execute the following::
+
+ ceph pg {pg-id} mark_unfound_lost revert|delete
+
+.. important:: Use this feature with caution, because it may confuse
+ applications that expect the object(s) to exist.
+
+
+.. toctree::
+ :hidden:
+
+ pg-states
+ pg-concepts
+
+
+.. _Create a Pool: ../pools#createpool
+.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
+.. _pgcalc: http://ceph.com/pgcalc/
diff --git a/src/ceph/doc/rados/operations/pools.rst b/src/ceph/doc/rados/operations/pools.rst
new file mode 100644
index 0000000..7015593
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pools.rst
@@ -0,0 +1,798 @@
+=======
+ Pools
+=======
+
+When you first deploy a cluster without creating a pool, Ceph uses the default
+pools for storing data. A pool provides you with:
+
+- **Resilience**: You can set how many OSDs are allowed to fail without losing data.
+  For replicated pools, it is the desired number of copies/replicas of an object.
+  A typical configuration stores an object and one additional copy
+  (i.e., ``size = 2``), but you can determine the number of copies/replicas.
+  For `erasure coded pools <../erasure-code>`_, it is the number of coding chunks
+  (i.e. ``m=2`` in the **erasure code profile**).
+
+- **Placement Groups**: You can set the number of placement groups for the pool.
+ A typical configuration uses approximately 100 placement groups per OSD to
+ provide optimal balancing without using up too many computing resources. When
+ setting up multiple pools, be careful to ensure you set a reasonable number of
+ placement groups for both the pool and the cluster as a whole.
+
+- **CRUSH Rules**: When you store data in a pool, a CRUSH ruleset mapped to the
+ pool enables CRUSH to identify a rule for the placement of the object
+ and its replicas (or chunks for erasure coded pools) in your cluster.
+ You can create a custom CRUSH rule for your pool.
+
+- **Snapshots**: When you create snapshots with ``ceph osd pool mksnap``,
+ you effectively take a snapshot of a particular pool.
+
+To organize data into pools, you can list, create, and remove pools.
+You can also view the utilization statistics for each pool.
+
+List Pools
+==========
+
+To list your cluster's pools, execute::
+
+ ceph osd lspools
+
+On a freshly installed cluster, only the ``rbd`` pool exists.
+
+
+.. _createpool:
+
+Create a Pool
+=============
+
+Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_.
+Ideally, you should override the default value for the number of placement
+groups in your Ceph configuration file, as the default is NOT ideal.
+For details on placement group numbers, refer to `setting the number of placement groups`_.
+
+For example::
+
+ osd pool default pg num = 100
+ osd pool default pgp num = 100
+
+.. note:: Starting with Luminous, all pools need to be associated with the
+   application using the pool. See `Associate Pool to Application`_ below for
+   more information.
+
+To create a pool, execute::
+
+ ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] \
+ [crush-rule-name] [expected-num-objects]
+ ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure \
+ [erasure-code-profile] [crush-rule-name] [expected-num-objects]
+
+Where:
+
+``{pool-name}``
+
+:Description: The name of the pool. It must be unique.
+:Type: String
+:Required: Yes.
+
+``{pg-num}``
+
+:Description: The total number of placement groups for the pool. See `Placement
+ Groups`_ for details on calculating a suitable number. The
+ default value ``8`` is NOT suitable for most systems.
+
+:Type: Integer
+:Required: Yes.
+:Default: 8
+
+``{pgp-num}``
+
+:Description: The total number of placement groups for placement purposes. This
+ **should be equal to the total number of placement groups**, except
+ for placement group splitting scenarios.
+
+:Type: Integer
+:Required: Yes. Picks up default or Ceph configuration value if not specified.
+:Default: 8
+
+``{replicated|erasure}``
+
+:Description: The pool type which may either be **replicated** to
+ recover from lost OSDs by keeping multiple copies of the
+ objects or **erasure** to get a kind of
+ `generalized RAID5 <../erasure-code>`_ capability.
+ The **replicated** pools require more
+ raw storage but implement all Ceph operations. The
+ **erasure** pools require less raw storage but only
+ implement a subset of the available operations.
+
+:Type: String
+:Required: No.
+:Default: replicated
+
+``[crush-rule-name]``
+
+:Description: The name of a CRUSH rule to use for this pool. The specified
+ rule must exist.
+
+:Type: String
+:Required: No.
+:Default: For **replicated** pools it is the ruleset specified by the ``osd
+ pool default crush replicated ruleset`` config variable. This
+ ruleset must exist.
+ For **erasure** pools it is ``erasure-code`` if the ``default``
+ `erasure code profile`_ is used or ``{pool-name}`` otherwise. This
+ ruleset will be created implicitly if it doesn't exist already.
+
+
+``[erasure-code-profile=profile]``
+
+.. _erasure code profile: ../erasure-code-profile
+
+:Description: For **erasure** pools only. Use the `erasure code profile`_. It
+ must be an existing profile as defined by
+ **osd erasure-code-profile set**.
+
+:Type: String
+:Required: No.
+
+When you create a pool, set the number of placement groups to a reasonable value
+(e.g., ``100``). Consider the total number of placement groups per OSD too.
+Placement groups are computationally expensive, so performance will degrade when
+you have many pools with many placement groups (e.g., 50 pools with 100
+placement groups each). The point of diminishing returns depends upon the power
+of the OSD host.
+
+See `Placement Groups`_ for details on calculating an appropriate number of
+placement groups for your pool.
+
+.. _Placement Groups: ../placement-groups
+
+``[expected-num-objects]``
+
+:Description: The expected number of objects for this pool. When this value is
+              set (together with a negative **filestore merge threshold**), PG folder
+              splitting happens at pool creation time, which avoids the latency
+              impact of splitting folders at runtime.
+
+:Type: Integer
+:Required: No.
+:Default: 0, no splitting at the pool creation time.
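+
+For scripted deployments, the positional arguments described above can be
+assembled programmatically. The Python sketch below only builds the argument
+list and does not run anything. The helper name ``pool_create_argv`` and the
+pool name ``mypool`` are hypothetical, and only the arguments documented above
+are used; the optional rule, profile and object count arguments are left out.
+

```python
def pool_create_argv(pool_name, pg_num, pgp_num=None, pool_type="replicated"):
    """Build the argument list for ``ceph osd pool create`` (not executed)."""
    if pool_type not in ("replicated", "erasure"):
        raise ValueError("pool type must be 'replicated' or 'erasure'")
    if pgp_num is None:
        pgp_num = pg_num          # pgp_num should normally equal pg_num
    if pgp_num > pg_num:
        raise ValueError("pgp_num must not exceed pg_num")
    return ["ceph", "osd", "pool", "create",
            pool_name, str(pg_num), str(pgp_num), pool_type]


print(" ".join(pool_create_argv("mypool", 128)))
```

+
+On a host with the ``ceph`` command and suitable credentials, the returned
+list could be handed to ``subprocess.run``.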
+
+Associate Pool to Application
+=============================
+
+Pools need to be associated with an application before use. Pools that will be
+used with CephFS or pools that are automatically created by RGW are
+automatically associated. Pools that are intended for use with RBD should be
+initialized using the ``rbd`` tool (see `Block Device Commands`_ for more
+information).
+
+For other cases, you can manually associate a free-form application name with
+a pool::
+
+ ceph osd pool application enable {pool-name} {application-name}
+
+.. note:: CephFS uses the application name ``cephfs``, RBD uses the
+ application name ``rbd``, and RGW uses the application name ``rgw``.
+
+Set Pool Quotas
+===============
+
+You can set pool quotas for the maximum number of bytes and/or the maximum
+number of objects per pool. ::
+
+ ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}]
+
+For example::
+
+ ceph osd pool set-quota data max_objects 10000
+
+To remove a quota, set its value to ``0``.
+
+
+Delete a Pool
+=============
+
+To delete a pool, execute::
+
+ ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+
+
+To remove a pool, the ``mon_allow_pool_delete`` flag must be set to true in the
+monitors' configuration. Otherwise the monitors will refuse to remove the pool.
+
+See `Monitor Configuration`_ for more information.
+
+.. _Monitor Configuration: ../../configuration/mon-config-ref
+
+If you created your own rulesets and rules for a pool you created, you should
+consider removing them when you no longer need your pool. To find the ruleset
+a pool uses, execute::
+
+ ceph osd pool get {pool-name} crush_ruleset
+
+If the ruleset was "123", for example, you can check the other pools like so::
+
+ ceph osd dump | grep "^pool" | grep "crush_ruleset 123"
+
+If no other pools use that custom ruleset, then it's safe to delete that
+ruleset from the cluster.
+
+If you created users with permissions strictly for a pool that no longer
+exists, you should consider deleting those users too::
+
+ ceph auth ls | grep -C 5 {pool-name}
+ ceph auth del {user}
+
+
+Rename a Pool
+=============
+
+To rename a pool, execute::
+
+ ceph osd pool rename {current-pool-name} {new-pool-name}
+
+If you rename a pool and you have per-pool capabilities for an authenticated
+user, you must update the user's capabilities (i.e., caps) with the new pool
+name.
+
+.. note:: Version ``0.48`` Argonaut and above.
+
+Show Pool Statistics
+====================
+
+To show a pool's utilization statistics, execute::
+
+ rados df
+
+
+Make a Snapshot of a Pool
+=========================
+
+To make a snapshot of a pool, execute::
+
+ ceph osd pool mksnap {pool-name} {snap-name}
+
+.. note:: Version ``0.48`` Argonaut and above.
+
+
+Remove a Snapshot of a Pool
+===========================
+
+To remove a snapshot of a pool, execute::
+
+ ceph osd pool rmsnap {pool-name} {snap-name}
+
+.. note:: Version ``0.48`` Argonaut and above.
+
+.. _setpoolvalues:
+
+
+Set Pool Values
+===============
+
+To set a value for a pool, execute the following::
+
+ ceph osd pool set {pool-name} {key} {value}
+
+You may set values for the following keys:
+
+.. _compression_algorithm:
+
+``compression_algorithm``
+
+:Description: Sets inline compression algorithm to use for underlying BlueStore.
+ This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression algorithm``.
+
+:Type: String
+:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd``
+
+``compression_mode``
+
+:Description: Sets the policy for the inline compression algorithm for underlying BlueStore.
+ This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression mode``.
+
+:Type: String
+:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
+
+``compression_min_blob_size``
+
+:Description: Chunks smaller than this are never compressed.
+ This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression min blob *``.
+
+:Type: Unsigned Integer
+
+``compression_max_blob_size``
+
+:Description: Chunks larger than this are broken into smaller blobs of size
+              ``compression_max_blob_size`` before being compressed.
+
+:Type: Unsigned Integer
+
+.. _size:
+
+``size``
+
+:Description: Sets the number of replicas for objects in the pool.
+ See `Set the Number of Object Replicas`_ for further details.
+ Replicated pools only.
+
+:Type: Integer
+
+.. _min_size:
+
+``min_size``
+
+:Description: Sets the minimum number of replicas required for I/O.
+ See `Set the Number of Object Replicas`_ for further details.
+ Replicated pools only.
+
+:Type: Integer
+:Version: ``0.54`` and above
+
+.. _pg_num:
+
+``pg_num``
+
+:Description: The effective number of placement groups to use when calculating
+ data placement.
+:Type: Integer
+:Valid Range: Greater than the current ``pg_num`` value.
+
+.. _pgp_num:
+
+``pgp_num``
+
+:Description: The effective number of placement groups for placement to use
+ when calculating data placement.
+
+:Type: Integer
+:Valid Range: Equal to or less than ``pg_num``.
+
+.. _crush_ruleset:
+
+``crush_ruleset``
+
+:Description: The ruleset to use for mapping object placement in the cluster.
+:Type: Integer
+
+.. _allow_ec_overwrites:
+
+``allow_ec_overwrites``
+
+:Description: Whether writes to an erasure coded pool can update part
+ of an object, so cephfs and rbd can use it. See
+ `Erasure Coding with Overwrites`_ for more details.
+:Type: Boolean
+:Version: ``12.2.0`` and above
+
+.. _hashpspool:
+
+``hashpspool``
+
+:Description: Set/Unset HASHPSPOOL flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``0.48`` Argonaut and above.
+
+.. _nodelete:
+
+``nodelete``
+
+:Description: Set/Unset NODELETE flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``FIXME``
+
+.. _nopgchange:
+
+``nopgchange``
+
+:Description: Set/Unset NOPGCHANGE flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``FIXME``
+
+.. _nosizechange:
+
+``nosizechange``
+
+:Description: Set/Unset NOSIZECHANGE flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``FIXME``
+
+.. _write_fadvise_dontneed:
+
+``write_fadvise_dontneed``
+
+:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+
+.. _noscrub:
+
+``noscrub``
+
+:Description: Set/Unset NOSCRUB flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+
+.. _nodeep-scrub:
+
+``nodeep-scrub``
+
+:Description: Set/Unset NODEEP_SCRUB flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+
+.. _hit_set_type:
+
+``hit_set_type``
+
+:Description: Enables hit set tracking for cache pools.
+ See `Bloom Filter`_ for additional information.
+
+:Type: String
+:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
+:Default: ``bloom``. Other values are for testing.
+
+.. _hit_set_count:
+
+``hit_set_count``
+
+:Description: The number of hit sets to store for cache pools. The higher
+ the number, the more RAM consumed by the ``ceph-osd`` daemon.
+
+:Type: Integer
+:Valid Range: ``1``. Agent doesn't handle > 1 yet.
+
+.. _hit_set_period:
+
+``hit_set_period``
+
+:Description: The duration of a hit set period in seconds for cache pools.
+ The higher the number, the more RAM consumed by the
+ ``ceph-osd`` daemon.
+
+:Type: Integer
+:Example: ``3600`` 1hr
+
+.. _hit_set_fpp:
+
+``hit_set_fpp``
+
+:Description: The false positive probability for the ``bloom`` hit set type.
+ See `Bloom Filter`_ for additional information.
+
+:Type: Double
+:Valid Range: 0.0 - 1.0
+:Default: ``0.05``
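+
+To get a feel for what a false positive probability costs in memory, the
+textbook sizing formulas for an optimal Bloom filter can be evaluated
+directly. This Python sketch uses the standard formulas only; it is an
+assumption that they approximate the memory used by the ``ceph-osd`` daemon,
+whose implementation may size its filters differently. The function name is
+hypothetical.
+

```python
import math


def bloom_parameters(false_positive_probability: float) -> tuple:
    """Return (bits per element, hash function count) for an optimal Bloom filter."""
    if not 0.0 < false_positive_probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    # Textbook result: m/n = -ln(p) / (ln 2)^2 and k = (m/n) * ln 2.
    bits_per_element = -math.log(false_positive_probability) / math.log(2) ** 2
    hash_count = bits_per_element * math.log(2)
    return bits_per_element, hash_count


bits, hashes = bloom_parameters(0.05)
print(f"{bits:.2f} bits per element, about {hashes:.1f} hash functions")
```

+
+Lowering the probability makes the hit set more accurate at the price of
+more bits per tracked object.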
+
+.. _cache_target_dirty_ratio:
+
+``cache_target_dirty_ratio``
+
+:Description: The percentage of the cache pool containing modified (dirty)
+ objects before the cache tiering agent will flush them to the
+ backing storage pool.
+
+:Type: Double
+:Default: ``.4``
+
+.. _cache_target_dirty_high_ratio:
+
+``cache_target_dirty_high_ratio``
+
+:Description: The percentage of the cache pool containing modified (dirty)
+ objects before the cache tiering agent will flush them to the
+ backing storage pool with a higher speed.
+
+:Type: Double
+:Default: ``.6``
+
+.. _cache_target_full_ratio:
+
+``cache_target_full_ratio``
+
+:Description: The percentage of the cache pool containing unmodified (clean)
+ objects before the cache tiering agent will evict them from the
+ cache pool.
+
+:Type: Double
+:Default: ``.8``
+
+.. _target_max_bytes:
+
+``target_max_bytes``
+
+:Description: Ceph will begin flushing or evicting objects when the
+ ``max_bytes`` threshold is triggered.
+
+:Type: Integer
+:Example: ``1000000000000`` #1-TB
+
+.. _target_max_objects:
+
+``target_max_objects``
+
+:Description: Ceph will begin flushing or evicting objects when the
+ ``max_objects`` threshold is triggered.
+
+:Type: Integer
+:Example: ``1000000`` #1M objects
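+
+The ratio settings above become concrete once they are applied to a cache
+size. The Python sketch below assumes that the ratios are taken relative to
+``target_max_bytes``; the function name and dictionary keys are hypothetical,
+and the code models only the arithmetic, not the behavior of the cache
+tiering agent.
+

```python
def cache_byte_thresholds(target_max_bytes: int,
                          dirty_ratio: float = 0.4,
                          dirty_high_ratio: float = 0.6,
                          full_ratio: float = 0.8) -> dict:
    """Translate the cache ratios into byte counts for one cache pool."""
    return {
        "flush": round(target_max_bytes * dirty_ratio),
        "flush_fast": round(target_max_bytes * dirty_high_ratio),
        "evict": round(target_max_bytes * full_ratio),
    }


# 1 TB cache pool, using the default ratios listed above.
for name, limit in cache_byte_thresholds(1000000000000).items():
    print(f"{name}: {limit} bytes")
```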
+
+
+``hit_set_grade_decay_rate``
+
+:Description: Temperature decay rate between two successive hit_sets
+:Type: Integer
+:Valid Range: 0 - 100
+:Default: ``20``
+
+
+``hit_set_search_last_n``
+
+:Description: Count at most N appearances in hit_sets for temperature calculation
+:Type: Integer
+:Valid Range: 0 - hit_set_count
+:Default: ``1``
+
+
+.. _cache_min_flush_age:
+
+``cache_min_flush_age``
+
+:Description: The time (in seconds) before the cache tiering agent will flush
+ an object from the cache pool to the storage pool.
+
+:Type: Integer
+:Example: ``600`` 10min
+
+.. _cache_min_evict_age:
+
+``cache_min_evict_age``
+
+:Description: The time (in seconds) before the cache tiering agent will evict
+ an object from the cache pool.
+
+:Type: Integer
+:Example: ``1800`` 30min
+
+.. _fast_read:
+
+``fast_read``
+
+:Description: On an erasure coded pool, if this flag is turned on, a read request
+              issues sub-reads to all shards and waits until it receives enough
+              shards to decode and serve the client. In the case of the jerasure and isa
+              erasure plugins, once the first K replies return, the client's request is
+              served immediately using the data decoded from these replies. This
+              trades some resources for better performance. Currently this
+              flag is only supported for erasure coded pools.
+
+:Type: Boolean
+:Default: ``0``
+
+.. _scrub_min_interval:
+
+``scrub_min_interval``
+
+:Description: The minimum interval in seconds for pool scrubbing when
+ load is low. If it is 0, the value osd_scrub_min_interval
+ from config is used.
+
+:Type: Double
+:Default: ``0``
+
+.. _scrub_max_interval:
+
+``scrub_max_interval``
+
+:Description: The maximum interval in seconds for pool scrubbing
+ irrespective of cluster load. If it is 0, the value
+ osd_scrub_max_interval from config is used.
+
+:Type: Double
+:Default: ``0``
+
+.. _deep_scrub_interval:
+
+``deep_scrub_interval``
+
+:Description: The interval in seconds for pool “deep” scrubbing. If it
+ is 0, the value osd_deep_scrub_interval from config is used.
+
+:Type: Double
+:Default: ``0``
+
+
+Get Pool Values
+===============
+
+To get a value from a pool, execute the following::
+
+ ceph osd pool get {pool-name} {key}
+
+You may get values for the following keys:
+
+``size``
+
+:Description: see size_
+
+:Type: Integer
+
+``min_size``
+
+:Description: see min_size_
+
+:Type: Integer
+:Version: ``0.54`` and above
+
+``pg_num``
+
+:Description: see pg_num_
+
+:Type: Integer
+
+
+``pgp_num``
+
+:Description: see pgp_num_
+
+:Type: Integer
+:Valid Range: Equal to or less than ``pg_num``.
+
+
+``crush_ruleset``
+
+:Description: see crush_ruleset_
+
+
+``hit_set_type``
+
+:Description: see hit_set_type_
+
+:Type: String
+:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
+
+``hit_set_count``
+
+:Description: see hit_set_count_
+
+:Type: Integer
+
+
+``hit_set_period``
+
+:Description: see hit_set_period_
+
+:Type: Integer
+
+
+``hit_set_fpp``
+
+:Description: see hit_set_fpp_
+
+:Type: Double
+
+
+``cache_target_dirty_ratio``
+
+:Description: see cache_target_dirty_ratio_
+
+:Type: Double
+
+
+``cache_target_dirty_high_ratio``
+
+:Description: see cache_target_dirty_high_ratio_
+
+:Type: Double
+
+
+``cache_target_full_ratio``
+
+:Description: see cache_target_full_ratio_
+
+:Type: Double
+
+
+``target_max_bytes``
+
+:Description: see target_max_bytes_
+
+:Type: Integer
+
+
+``target_max_objects``
+
+:Description: see target_max_objects_
+
+:Type: Integer
+
+
+``cache_min_flush_age``
+
+:Description: see cache_min_flush_age_
+
+:Type: Integer
+
+
+``cache_min_evict_age``
+
+:Description: see cache_min_evict_age_
+
+:Type: Integer
+
+
+``fast_read``
+
+:Description: see fast_read_
+
+:Type: Boolean
+
+
+``scrub_min_interval``
+
+:Description: see scrub_min_interval_
+
+:Type: Double
+
+
+``scrub_max_interval``
+
+:Description: see scrub_max_interval_
+
+:Type: Double
+
+
+``deep_scrub_interval``
+
+:Description: see deep_scrub_interval_
+
+:Type: Double
+
+
+Set the Number of Object Replicas
+=================================
+
+To set the number of object replicas on a replicated pool, execute the following::
+
+ ceph osd pool set {poolname} size {num-replicas}
+
+.. important:: The ``{num-replicas}`` includes the object itself.
+ If you want the object and two copies of the object for a total of
+ three instances of the object, specify ``3``.
+
+For example::
+
+ ceph osd pool set data size 3
+
+You may execute this command for each pool. **Note:** An object might accept
+I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum
+number of required replicas for I/O, you should use the ``min_size`` setting.
+For example::
+
+ ceph osd pool set data min_size 2
+
+This ensures that no object in the data pool will receive I/O with fewer than
+``min_size`` replicas.
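+
+The relationship between ``size`` and ``min_size`` can be sketched in a few
+lines of Python. This is an illustrative model only, not Ceph code: the
+function name ``accepts_io`` and its arguments are hypothetical, and the
+sketch ignores peering and recovery state, which also affect whether a
+placement group serves I/O in a real cluster.
+

```python
def accepts_io(available_replicas: int, size: int, min_size: int) -> bool:
    """Return True if at least min_size replicas are available.

    Illustrative model: with fewer than `size` replicas the object is
    degraded, but I/O continues until the count drops below `min_size`.
    """
    if not 0 < min_size <= size:
        raise ValueError("min_size must be between 1 and size")
    return available_replicas >= min_size


# Pool "data" from the examples above: size 3, min_size 2.
for replicas in (3, 2, 1):
    verdict = "serves I/O" if accepts_io(replicas, size=3, min_size=2) else "blocks I/O"
    print(f"{replicas} replica(s) available: {verdict}")
```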
+
+
+Get the Number of Object Replicas
+=================================
+
+To get the number of object replicas, execute the following::
+
+ ceph osd dump | grep 'replicated size'
+
+Ceph will list the pools, with the ``replicated size`` attribute highlighted.
+By default, Ceph creates two replicas of an object (a total of three copies, or
+a size of 3).
+
+
+
+.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
+.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
+.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups
+.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites
+.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool
+
diff --git a/src/ceph/doc/rados/operations/upmap.rst b/src/ceph/doc/rados/operations/upmap.rst
new file mode 100644
index 0000000..58f6322
--- /dev/null
+++ b/src/ceph/doc/rados/operations/upmap.rst
@@ -0,0 +1,75 @@
+Using the pg-upmap
+==================
+
+Starting in Luminous v12.2.z there is a new *pg-upmap* exception table
+in the OSDMap that allows the cluster to explicitly map specific PGs to
+specific OSDs. This allows the cluster to fine-tune the data
+distribution so that, in most cases, PGs are perfectly distributed across OSDs.
+
+The key caveat to this new mechanism is that it requires that all
+clients understand the new *pg-upmap* structure in the OSDMap.
+
+Enabling
+--------
+
+To allow use of the feature, you must tell the cluster that it only
+needs to support luminous (and newer) clients with::
+
+ ceph osd set-require-min-compat-client luminous
+
+This command will fail if any pre-luminous clients or daemons are
+connected to the monitors. You can see what client versions are in
+use with::
+
+ ceph features
+
+A word of caution
+-----------------
+
+This is a new feature and not very user friendly. At the time of this
+writing we are working on a new `balancer` module for ceph-mgr that
+will eventually do all of this automatically.
+
+Until then, apply upmap entries with the offline procedure described below.
+
+Offline optimization
+--------------------
+
+Upmap entries are updated with an offline optimizer built into ``osdmaptool``.
+
+#. Grab the latest copy of your osdmap::
+
+ ceph osd getmap -o om
+
+#. Run the optimizer::
+
+ osdmaptool om --upmap out.txt [--upmap-pool <pool>] [--upmap-max <max-count>] [--upmap-deviation <max-deviation>]
+
+ It is highly recommended that optimization be done for each pool
+ individually, or for sets of similarly-utilized pools. You can
+ specify the ``--upmap-pool`` option multiple times. "Similar pools"
+ means pools that are mapped to the same devices and store the same
+ kind of data (e.g., RBD image pools, yes; RGW index pool and RGW
+ data pool, no).
+
+ The ``max-count`` value is the maximum number of upmap entries to
+ identify in the run. The default is 100, but you may want to make
+ this a smaller number so that the tool completes more quickly (but
+ does less work). If it cannot find any additional changes to make
+ it will stop early (i.e., when the pool distribution is perfect).
+
+ The ``max-deviation`` value defaults to ``.01`` (i.e., 1%). If an OSD's
+ utilization varies from the average by less than this amount, it
+ is considered perfect.
+
+#. The proposed changes are written to the output file ``out.txt`` in
+ the example above. These are normal ceph CLI commands that can be
+ run to apply the changes to the cluster. This can be done with::
+
+ source out.txt
+
+The above steps can be repeated as many times as necessary to achieve
+a perfect distribution of PGs for each set of pools.
+
+You can see some (gory) details about what the tool is doing by
+passing ``--debug-osd 10`` to ``osdmaptool``.
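+
+The ``max-deviation`` rule described above can be illustrated with a small
+Python sketch. It is a simplified model under stated assumptions, not the
+calculation ``osdmaptool`` performs: it treats utilization as the PG count
+per OSD, assumes all OSDs have equal weight, and measures deviation relative
+to the mean. The function name ``osds_outside_deviation`` is hypothetical.
+

```python
from typing import Dict, List


def osds_outside_deviation(pgs_per_osd: Dict[int, int],
                           max_deviation: float = 0.01) -> List[int]:
    """Return the ids of OSDs that are not within max_deviation of the mean.

    Assumption: equal-weight OSDs, deviation measured relative to the mean
    PG count. An OSD that varies by less than max_deviation counts as perfect.
    """
    if not pgs_per_osd:
        return []
    mean = sum(pgs_per_osd.values()) / len(pgs_per_osd)
    if mean == 0:
        return []
    return sorted(
        osd for osd, pgs in pgs_per_osd.items()
        if abs(pgs - mean) / mean >= max_deviation
    )


# Hypothetical PG counts for three OSDs.
print(osds_outside_deviation({0: 90, 1: 100, 2: 110}))
```

+
+An empty result corresponds to the case in which the optimizer has nothing
+left to change for that set of pools.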
diff --git a/src/ceph/doc/rados/operations/user-management.rst b/src/ceph/doc/rados/operations/user-management.rst
new file mode 100644
index 0000000..8a35a50
--- /dev/null
+++ b/src/ceph/doc/rados/operations/user-management.rst
@@ -0,0 +1,665 @@
+=================
+ User Management
+=================
+
+This document describes :term:`Ceph Client` users, and their authentication and
+authorization with the :term:`Ceph Storage Cluster`. Users are either
+individuals or system actors such as applications, which use Ceph clients to
+interact with the Ceph Storage Cluster daemons.
+
+.. ditaa:: +-----+
+ | {o} |
+ | |
+ +--+--+ /---------\ /---------\
+ | | Ceph | | Ceph |
+ ---+---*----->| |<------------->| |
+ | uses | Clients | | Servers |
+ | \---------/ \---------/
+ /--+--\
+ | |
+ | |
+ actor
+
+
+When Ceph runs with authentication and authorization enabled (enabled by
+default), you must specify a user name and a keyring containing the secret key
+of the specified user (usually via the command line). If you do not specify a
+user name, Ceph will use ``client.admin`` as the default user name. If you do
+not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting
+in the Ceph configuration. For example, if you execute the ``ceph health``
+command without specifying a user or keyring::
+
+ ceph health
+
+Ceph interprets the command like this::
+
+ ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health
+
+Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid
+re-entry of the user name and secret.
+
+For details on configuring the Ceph Storage Cluster to use authentication,
+see `Cephx Config Reference`_. For details on the architecture of Cephx, see
+`Architecture - High Availability Authentication`_.
+
+
+Background
+==========
+
+Irrespective of the type of Ceph client (e.g., Block Device, Object Storage,
+Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_.
+Ceph users must have access to pools in order to read and write data.
+Additionally, Ceph users must have execute permissions to use Ceph's
+administrative commands. The following concepts will help you understand Ceph
+user management.
+
+
+User
+----
+
+A user is either an individual or a system actor such as an application.
+Creating users allows you to control who (or what) can access your Ceph Storage
+Cluster, its pools, and the data within pools.
+
+Ceph has the notion of a ``type`` of user. For the purposes of user management,
+the type will always be ``client``. Ceph identifies users in period (.)
+delimited form consisting of the user type and the user ID: for example,
+``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing
+is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol,
+but they are not clients. Distinguishing the user type helps to distinguish
+between client users and other users--streamlining access control, user
+monitoring and traceability.
+
+Sometimes Ceph's user type may seem confusing, because the Ceph command line
+allows you to specify a user with or without the type, depending upon your
+command line usage. If you specify ``--user`` or ``--id``, you can omit the
+type. So ``client.user1`` can be entered simply as ``user1``. If you specify
+``--name`` or ``-n``, you must specify the type and name, such as
+``client.user1``. We recommend using the type and name as a best practice
+wherever possible.
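+
+The difference between the two naming styles can be sketched as follows. The
+helper names ``name_from_id`` and ``split_name`` are hypothetical and only
+model the convention described above; they are not part of any Ceph library.
+

```python
from typing import Tuple


def name_from_id(user_id: str) -> str:
    """Expand a value passed to --id/--user into the full TYPE.ID form."""
    return f"client.{user_id}"


def split_name(name: str) -> Tuple[str, str]:
    """Split a value passed to --name/-n into (type, id).

    The type is everything before the first period; the ID is the rest.
    """
    user_type, sep, user_id = name.partition(".")
    if not sep or not user_type or not user_id:
        raise ValueError(f"expected TYPE.ID, got {name!r}")
    return user_type, user_id


print(name_from_id("user1"))
print(split_name("client.user1"))
```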
+
+.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage
+ user or a Ceph Filesystem user. The Ceph Object Gateway uses a Ceph Storage
+ Cluster user to communicate between the gateway daemon and the storage
+ cluster, but the gateway has its own user management functionality for end
+ users. The Ceph Filesystem uses POSIX semantics. The user space associated
+ with the Ceph Filesystem is not the same as a Ceph Storage Cluster user.
+
+
+
+Authorization (Capabilities)
+----------------------------
+
+Ceph uses the term "capabilities" (caps) to describe authorizing an
+authenticated user to exercise the functionality of the monitors, OSDs and
+metadata servers. Capabilities can also restrict access to data within a pool or
+a namespace within a pool. A Ceph administrative user sets a user's
+capabilities when creating or updating a user.
+
+Capability syntax follows the form::
+
+ {daemon-type} '{capspec}[, {capspec} ...]'
+
+- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access
+ settings or ``profile {name}``. For example::
+
+ mon 'allow rwx'
+ mon 'profile osd'
+
+- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``,
+ ``class-write`` access settings or ``profile {name}``. OSD
+ capabilities also allow for pool and namespace settings. ::
+
+ osd 'allow {access} [pool={pool-name} [namespace={namespace-name}]]'
+ osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]]'
+
+- **Metadata Server Caps:** For administrators, use ``allow *``. For all
+ other users, such as CephFS clients, consult :doc:`/cephfs/client-auth`.
+
+
+.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the
+ Ceph Storage Cluster, so it is not represented as a Ceph Storage
+ Cluster daemon type.
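+
+As a minimal sketch of the capability syntax shown above, the following Python
+helpers assemble an OSD capspec string. The function names ``osd_cap`` and
+``join_capspecs`` are hypothetical; the sketch only formats strings in the
+documented ``allow {access} [pool=... [namespace=...]]`` shape and does not
+validate them against a cluster.
+

```python
from typing import Optional


def osd_cap(access: str, pool: Optional[str] = None,
            namespace: Optional[str] = None) -> str:
    """Render one OSD capspec: allow {access} [pool=P [namespace=N]]."""
    if namespace is not None and pool is None:
        # In the documented form, the namespace is nested inside the pool setting.
        raise ValueError("a namespace is only given together with a pool")
    parts = ["allow", access]
    if pool is not None:
        parts.append(f"pool={pool}")
    if namespace is not None:
        parts.append(f"namespace={namespace}")
    return " ".join(parts)


def join_capspecs(*capspecs: str) -> str:
    """Join several capspecs for one daemon type with ', '."""
    return ", ".join(capspecs)


print(osd_cap("rw", pool="liverpool"))
```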
+
+The following entries describe each capability.
+
+``allow``
+
+:Description: Precedes access settings for a daemon. Implies ``rw``
+ for MDS only.
+
+
+``r``
+
+:Description: Gives the user read access. Required with monitors to retrieve
+ the CRUSH map.
+
+
+``w``
+
+:Description: Gives the user write access to objects.
+
+
+``x``
+
+:Description: Gives the user the capability to call class methods
+ (i.e., both read and write) and to conduct ``auth``
+ operations on monitors.
+
+
+``class-read``
+
+:Description: Gives the user the capability to call class read methods.
+ Subset of ``x``.
+
+
+``class-write``
+
+:Description: Gives the user the capability to call class write methods.
+ Subset of ``x``.
+
+
+``*``
+
+:Description: Gives the user read, write and execute permissions for a
+ particular daemon/pool, and the ability to execute
+ admin commands.
+
+
+``profile osd`` (Monitor only)
+
+:Description: Gives a user permissions to connect as an OSD to other OSDs or
+ monitors. Conferred on OSDs to enable OSDs to handle replication
+ heartbeat traffic and status reporting.
+
+
+``profile mds`` (Monitor only)
+
+:Description: Gives a user permissions to connect as an MDS to other MDSs or
+ monitors.
+
+
+``profile bootstrap-osd`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap an OSD. Conferred on
+ deployment tools such as ``ceph-disk``, ``ceph-deploy``, etc.
+ so that they have permissions to add keys, etc. when
+ bootstrapping an OSD.
+
+
+``profile bootstrap-mds`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap a metadata server.
+ Conferred on deployment tools such as ``ceph-deploy``, etc.
+ so they have permissions to add keys, etc. when bootstrapping
+ a metadata server.
+
+``profile rbd`` (Monitor and OSD)
+
+:Description: Gives a user permissions to manipulate RBD images. When used
+ as a Monitor cap, it provides the minimal privileges required
+ by an RBD client application. When used as an OSD cap, it
+ provides read-write access to an RBD client application.
+
+``profile rbd-read-only`` (OSD only)
+
+:Description: Gives a user read-only permissions to an RBD image.
+
+
+Pool
+----
+
+A pool is a logical partition where users store data.
+In Ceph deployments, it is common to create a pool as a logical partition for
+similar types of data. For example, when deploying Ceph as a backend for
+OpenStack, a typical deployment would have pools for volumes, images, backups
+and virtual machines, and users such as ``client.glance``, ``client.cinder``,
+etc.
+
+
+Namespace
+---------
+
+Objects within a pool can be associated with a namespace--a logical group of
+objects within the pool. A user's access to a pool can be associated with a
+namespace such that reads and writes by the user take place only within the
+namespace. Objects written to a namespace within the pool can only be accessed
+by users who have access to the namespace.
+
+.. note:: Namespaces are primarily useful for applications written on top of
+ ``librados`` where the logical grouping can alleviate the need to create
+ different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various
+ metadata objects.
+
+The rationale for namespaces is that pools can be a computationally expensive
+method of segregating data sets for the purposes of authorizing separate sets
+of users. For example, a pool should have ~100 placement groups per OSD. So an
+example cluster with 1000 OSDs would have 100,000 placement groups for one
+pool. Each additional pool would create another 100,000 placement groups in the
+example cluster. By contrast, writing an object to a namespace simply associates
+the namespace with the object name, without the computational overhead of a
+separate pool. Rather than creating a separate pool for a user or set of users, you may
+use a namespace. **Note:** Only available using ``librados`` at this time.
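+
+The arithmetic behind this rationale can be written out as a short Python
+sketch. It follows the simplified rule of thumb used in the text (about 100
+placement groups per OSD for each pool) and ignores replica counts; the
+function name ``placement_groups`` is hypothetical.
+

```python
def placement_groups(osds: int, pools: int, pgs_per_osd: int = 100) -> int:
    """Total PGs when every pool is sized at about pgs_per_osd PGs per OSD."""
    return osds * pgs_per_osd * pools


# The 1000-OSD example cluster from the text.
print(placement_groups(osds=1000, pools=1))
print(placement_groups(osds=1000, pools=3))
```

+
+Adding namespaces to an existing pool leaves the pool count, and therefore
+this total, unchanged.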
+
+
+Managing Users
+==============
+
+User management functionality provides Ceph Storage Cluster administrators with
+the ability to create, update and delete users directly in the Ceph Storage
+Cluster.
+
+When you create or delete users in the Ceph Storage Cluster, you may need to
+distribute keys to clients so that they can be added to keyrings. See `Keyring
+Management`_ for details.
+
+
+List Users
+----------
+
+To list the users in your cluster, execute the following::
+
+ ceph auth ls
+
+Ceph will list out all users in your cluster. For example, in a two-node
+example cluster, ``ceph auth ls`` will output something that looks like
+this::
+
+ installed auth entries:
+
+ osd.0
+ key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
+ caps: [mon] allow profile osd
+ caps: [osd] allow *
+ osd.1
+ key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
+ caps: [mon] allow profile osd
+ caps: [osd] allow *
+ client.admin
+ key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
+ caps: [mds] allow
+ caps: [mon] allow *
+ caps: [osd] allow *
+ client.bootstrap-mds
+ key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww==
+ caps: [mon] allow profile bootstrap-mds
+ client.bootstrap-osd
+ key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw==
+ caps: [mon] allow profile bootstrap-osd
+
+
+Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a
+user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type
+``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user).
+Note also that each entry has a ``key: <value>`` entry, and one or more
+``caps:`` entries.
+
+You may use the ``-o {filename}`` option with ``ceph auth ls`` to
+save the output to a file.
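+
+If you save the listing to a file, a short script can turn it into a
+structure that is easier to inspect. The following Python sketch assumes the
+plain-text layout shown in the example above (an entity line followed by
+``key:`` and ``caps:`` lines); the function name ``parse_auth_ls`` is
+hypothetical, and the sample text reuses two entries from that example.
+

```python
from typing import Dict

SAMPLE = """\
installed auth entries:

osd.0
    key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
    caps: [mon] allow profile osd
    caps: [osd] allow *
client.admin
    key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
    caps: [mds] allow
    caps: [mon] allow *
    caps: [osd] allow *
"""


def parse_auth_ls(text: str) -> Dict[str, dict]:
    """Parse the text layout shown above into {entity: {"key", "caps"}}."""
    users: Dict[str, dict] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "installed auth entries:":
            continue
        if line.startswith("key:") or line.startswith("caps:"):
            if current is None:
                raise ValueError(f"attribute before any entity: {line!r}")
            if line.startswith("key:"):
                current["key"] = line[len("key:"):].strip()
            else:
                # "caps: [mon] allow r" -> daemon "mon", spec "allow r"
                daemon, _, spec = line[len("caps:"):].strip().partition("]")
                current["caps"][daemon.lstrip("[")] = spec.strip()
        else:
            # Any other non-empty line starts a new TYPE.ID entry.
            current = users.setdefault(line, {"key": None, "caps": {}})
    return users


for entity, record in parse_auth_ls(SAMPLE).items():
    print(entity, record["caps"])
```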
+
+
+Get a User
+----------
+
+To retrieve a specific user, key and capabilities, execute the
+following::
+
+ ceph auth get {TYPE.ID}
+
+For example::
+
+ ceph auth get client.admin
+
+You may also use the ``-o {filename}`` option with ``ceph auth get`` to
+save the output to a file. Developers may also execute the following::
+
+ ceph auth export {TYPE.ID}
+
+The ``auth export`` command is identical to ``auth get``, but also prints
+out the internal ``auid``, which is not relevant to end users.
+
+
+
+Add a User
+----------
+
+Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and
+any capabilities included in the command you use to create the user.
+
+A user's key enables the user to authenticate with the Ceph Storage Cluster.
+The user's capabilities authorize the user to read, write, or execute on Ceph
+monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``).
+
+There are a few ways to add a user:
+
+- ``ceph auth add``: This command is the canonical way to add a user. It
+ will create the user, generate a key and add any specified capabilities.
+
+- ``ceph auth get-or-create``: This command is often the most convenient way
+ to create a user, because it returns a keyfile format with the user name
+ (in brackets) and the key. If the user already exists, this command
+ simply returns the user name and key in the keyfile format. You may use the
+ ``-o {filename}`` option to save the output to a file.
+
+- ``ceph auth get-or-create-key``: This command is a convenient way to create
+ a user and return the user's key (only). This is useful for clients that
+ need the key only (e.g., libvirt). If the user already exists, this command
+ simply returns the key. You may use the ``-o {filename}`` option to save the
+ output to a file.
+
+When creating client users, you may create a user with no capabilities. A user
+with no capabilities is useless beyond mere authentication, because the client
+cannot retrieve the cluster map from the monitor. However, you can create a
+user with no capabilities if you wish to defer adding capabilities until later, using
+the ``ceph auth caps`` command.
+
+A typical user has at least read capabilities on the Ceph monitor and
+read and write capability on Ceph OSDs. Additionally, a user's OSD permissions
+are often restricted to accessing a particular pool. ::
+
+ ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring
+ ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key
+
+
+.. important:: If you provide a user with capabilities to OSDs, but you DO NOT
+ restrict access to particular pools, the user will have access to ALL
+ pools in the cluster!
+
+
+.. _modify-user-capabilities:
+
+Modify User Capabilities
+------------------------
+
+The ``ceph auth caps`` command allows you to specify a user and change the
+user's capabilities. Setting new capabilities will overwrite current capabilities.
+To view current capabilities run ``ceph auth get USERTYPE.USERID``. To add
+capabilities, you should also specify the existing capabilities when using the form::
+
+ ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]']
+
+For example::
+
+ ceph auth get client.john
+ ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool'
+ ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'
+
+To remove a capability, you may reset the capability. If you want the user
+to have no access to a particular daemon that was previously set, specify
+an empty string. For example::
+
+ ceph auth caps client.ringo mon ' ' osd ' '
+
+See `Authorization (Capabilities)`_ for additional details on capabilities.
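+
+Because ``ceph auth caps`` overwrites the current capabilities, adding one
+capability means restating the existing ones. The following Python sketch
+shows one way to assemble such a command line. The helper ``caps_command`` is
+hypothetical; it only builds an argument list and does not contact a cluster,
+and the existing capabilities are assumed to have been read beforehand, for
+example from ``ceph auth get``.
+

```python
import shlex
from typing import Dict, List


def caps_command(user: str, existing: Dict[str, str],
                 changes: Dict[str, str]) -> List[str]:
    """Build argv for `ceph auth caps`, keeping existing caps and applying changes."""
    merged = {**existing, **changes}  # changes win for daemons present in both
    argv = ["ceph", "auth", "caps", user]
    for daemon in sorted(merged):
        argv += [daemon, merged[daemon]]
    return argv


# client.john already has monitor read access; add OSD access to one pool.
argv = caps_command(
    "client.john",
    existing={"mon": "allow r"},
    changes={"osd": "allow rw pool=liverpool"},
)
print(shlex.join(argv))
```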
+
+
+Delete a User
+-------------
+
+To delete a user, use ``ceph auth del``::
+
+ ceph auth del {TYPE}.{ID}
+
+Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``,
+and ``{ID}`` is the user name or ID of the daemon.
+
+
+Print a User's Key
+------------------
+
+To print a user's authentication key to standard output, execute the following::
+
+ ceph auth print-key {TYPE}.{ID}
+
+Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``,
+and ``{ID}`` is the user name or ID of the daemon.
+
+Printing a user's key is useful when you need to populate client
+software with a user's key (e.g., libvirt). ::
+
+ mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user`
+
+
+Import a User(s)
+----------------
+
+To import one or more users, use ``ceph auth import`` and
+specify a keyring::
+
+ ceph auth import -i /path/to/keyring
+
+For example::
+
+ sudo ceph auth import -i /etc/ceph/ceph.keyring
+
+
+.. note:: The ceph storage cluster will add new users, their keys and their
+ capabilities and will update existing users, their keys and their
+ capabilities.
+
+
+Keyring Management
+==================
+
+When you access Ceph via a Ceph client, the Ceph client will look for a local
+keyring. Ceph presets the ``keyring`` setting with the following four keyring
+names by default so you don't have to set them in your Ceph configuration file
+unless you want to override the defaults (not recommended):
+
+- ``/etc/ceph/$cluster.$name.keyring``
+- ``/etc/ceph/$cluster.keyring``
+- ``/etc/ceph/keyring``
+- ``/etc/ceph/keyring.bin``
+
+The ``$cluster`` metavariable is your Ceph cluster name as defined by the
+name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name
+is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user
+type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``).
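+
+The metavariable expansion can be sketched in Python. The template list is
+copied from the four default names above; the function name
+``keyring_candidates`` is hypothetical, and the sketch only expands the
+names without checking whether the files exist.
+

```python
from typing import List

DEFAULT_KEYRING_TEMPLATES = (
    "/etc/ceph/$cluster.$name.keyring",
    "/etc/ceph/$cluster.keyring",
    "/etc/ceph/keyring",
    "/etc/ceph/keyring.bin",
)


def keyring_candidates(cluster: str = "ceph",
                       name: str = "client.admin") -> List[str]:
    """Expand $cluster and $name in the default keyring names, in order."""
    return [
        template.replace("$cluster", cluster).replace("$name", name)
        for template in DEFAULT_KEYRING_TEMPLATES
    ]


for path in keyring_candidates(name="client.ringo"):
    print(path)
```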
+
+.. note:: When executing commands that read or write to ``/etc/ceph``, you may
+ need to use ``sudo`` to execute the command as ``root``.
+
+After you create a user (e.g., ``client.ringo``), you must get the key and add
+it to a keyring on a Ceph client so that the user can access the Ceph Storage
+Cluster.
+
+The `User Management`_ section details how to list, get, add, modify and delete
+users directly in the Ceph Storage Cluster. However, Ceph also provides the
+``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client.
+
+
+Create a Keyring
+----------------
+
+When you use the procedures in the `Managing Users`_ section to create users,
+you need to provide user keys to the Ceph client(s) so that the Ceph client
+can retrieve the key for the specified user and authenticate with the Ceph
+Storage Cluster. Ceph Clients access keyrings to look up a user name and
+retrieve the user's key.
+
+The ``ceph-authtool`` utility allows you to create a keyring. To create an
+empty keyring, use ``--create-keyring`` or ``-C``. For example::
+
+ ceph-authtool --create-keyring /path/to/keyring
+
+When creating a keyring with multiple users, we recommend using the cluster name
+(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the
+``/etc/ceph`` directory so that the ``keyring`` configuration default setting
+will pick up the filename without requiring you to specify it in the local copy
+of your Ceph configuration file. For example, create ``ceph.keyring`` by
+executing the following::
+
+ sudo ceph-authtool -C /etc/ceph/ceph.keyring
+
+When creating a keyring with a single user, we recommend using the cluster name,
+the user type and the user name and saving it in the ``/etc/ceph`` directory.
+For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user.
+
+To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means
+the file will have ``rw`` permissions for the ``root`` user only, which is
+appropriate when the keyring contains administrator keys. However, if you
+intend to use the keyring for a particular user or group of users, ensure
+that you execute ``chown`` or ``chmod`` to establish appropriate keyring
+ownership and access.
+
+
+Add a User to a Keyring
+-----------------------
+
+When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a
+User`_ procedure to retrieve a user, key and capabilities and save the user to a
+keyring.
+
+When you only want to use one user per keyring, the `Get a User`_ procedure with
+the ``-o`` option will save the output in the keyring file format. For example,
+to create a keyring for the ``client.admin`` user, execute the following::
+
+ sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
+
+Notice that we use the recommended file format for an individual user.
+
+When you want to import users to a keyring, you can use ``ceph-authtool``
+to specify the destination keyring and the source keyring.
+For example::
+
+ sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
+
+
+Create a User
+-------------
+
+Ceph provides the `Add a User`_ function to create a user directly in the Ceph
+Storage Cluster. However, you can also create a user, keys and capabilities
+directly on a Ceph client keyring. Then, you can import the user to the Ceph
+Storage Cluster. For example::
+
+ sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring
+
+See `Authorization (Capabilities)`_ for additional details on capabilities.
+
+You can also create a keyring and add a new user to the keyring simultaneously.
+For example::
+
+ sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key
+
+In the foregoing scenarios, the new user ``client.ringo`` exists only in the
+keyring. To make the user available in the Ceph Storage Cluster, you must
+still add it to the cluster. ::
+
+ sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring
+
+
+Modify a User
+-------------
+
+To modify the capabilities of a user record in a keyring, specify the keyring,
+and the user followed by the capabilities. For example::
+
+ sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx'
+
+To apply the change to the Ceph Storage Cluster, you must import the updated
+user from the keyring into the user entry in the Ceph Storage Cluster. ::
+
+ sudo ceph auth import -i /etc/ceph/ceph.keyring
+
+See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user
+from a keyring.
+
+You may also `Modify User Capabilities`_ directly in the cluster, store the
+results to a keyring file, and then import the keyring into your main
+``ceph.keyring`` file.
+
+
+Command Line Usage
+==================
+
+Ceph supports the following usage for user name and secret:
+
+``--id`` | ``--user``
+
+:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or
+ ``client.admin``, ``client.user1``). The ``--id`` and ``--user``
+ options enable you to specify the ID portion of the user
+ name (e.g., ``admin``, ``user1``, ``foo``, etc.). You can specify
+ the user with ``--id`` or ``--user`` and omit the type. For example,
+ to specify user ``client.foo`` enter the following::
+
+ ceph --id foo --keyring /path/to/keyring health
+ ceph --user foo --keyring /path/to/keyring health
+
+
+``--name`` | ``-n``
+
+:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or
+ ``client.admin``, ``client.user1``). The ``--name`` and ``-n``
+ options enable you to specify the fully qualified user name.
+ You must specify the user type (typically ``client``) with the
+ user ID. For example::
+
+ ceph --name client.foo --keyring /path/to/keyring health
+ ceph -n client.foo --keyring /path/to/keyring health
+
+
+``--keyring``
+
+:Description: The path to the keyring containing one or more user names and
+ secrets. The ``--secret`` option provides the same functionality,
+ but it does not work with Ceph RADOS Gateway, which uses
+ ``--secret`` for another purpose. You may retrieve a keyring with
+ ``ceph auth get-or-create`` and store it locally. This is a
+ preferred approach, because you can switch user names without
+ switching the keyring path. For example::
+
+ sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage
+
+
+.. _pools: ../pools
+
+
+Limitations
+===========
+
+The ``cephx`` protocol authenticates Ceph clients and servers to each other. It
+is not intended to handle authentication of human users or application programs
+run on their behalf. If that effect is required to handle your access control
+needs, you must have another mechanism, which is likely to be specific to the
+front end used to access the Ceph object store. This other mechanism has the
+role of ensuring that only acceptable users and programs are able to run on the
+machine that Ceph will permit to access its object store.
+
+The keys used to authenticate Ceph clients and servers are typically stored in
+a plain text file with appropriate permissions in a trusted host.
+
+.. important:: Storing keys in plaintext files has security shortcomings, but
+ they are difficult to avoid, given the basic authentication methods Ceph
+ uses in the background. Those setting up Ceph systems should be aware of
+ these shortcomings.
+
+In particular, arbitrary user machines, especially portable machines, should not
+be configured to interact directly with Ceph, since that mode of use would
+require the storage of a plaintext authentication key on an insecure machine.
+Anyone who stole that machine or obtained surreptitious access to it could
+obtain the key that will allow them to authenticate their own machines to Ceph.
+
+Rather than permitting potentially insecure machines to access a Ceph object
+store directly, users should be required to sign in to a trusted machine in
+your environment using a method that provides sufficient security for your
+purposes. That trusted machine will store the plaintext Ceph keys for the
+human users. A future version of Ceph may address these particular
+authentication issues more fully.
+
+At the moment, none of the Ceph authentication protocols provide secrecy for
+messages in transit. Thus, an eavesdropper on the wire can hear and understand
+all data sent between clients and servers in Ceph, even if it cannot create or
+alter that data. Further, Ceph does not include options to encrypt user data in the
+object store. Users can hand-encrypt and store their own data in the Ceph
+object store, of course, but Ceph provides no features to perform object
+encryption itself. Those storing sensitive data in Ceph should consider
+encrypting their data before providing it to the Ceph system.
+
+
+.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication
+.. _Cephx Config Reference: ../../configuration/auth-config-ref