author    Qiaowei Ren <qiaowei.ren@intel.com>  2018-01-04 13:43:33 +0800
committer Qiaowei Ren <qiaowei.ren@intel.com>  2018-01-05 11:59:39 +0800
commit    812ff6ca9fcd3e629e49d4328905f33eee8ca3f5 (patch)
tree      04ece7b4da00d9d2f98093774594f4057ae561d4 /src/ceph/doc/rados
parent    15280273faafb77777eab341909a3f495cf248d9 (diff)
initial code repo
This patch creates the initial code repo. For ceph, the luminous stable release will be used as the base code, and subsequent changes and optimizations for ceph will be added on top of it. For opensds, any changes can currently be upstreamed into the original opensds repo (https://github.com/opensds/opensds), so stor4nfv will directly clone the opensds code to deploy the stor4nfv environment. The scripts for deployment based on ceph and opensds will be put into the 'ci' directory.

Change-Id: I46a32218884c75dda2936337604ff03c554648e4
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Diffstat (limited to 'src/ceph/doc/rados')
-rw-r--r-- src/ceph/doc/rados/api/index.rst | 22
-rw-r--r-- src/ceph/doc/rados/api/librados-intro.rst | 1003
-rw-r--r-- src/ceph/doc/rados/api/librados.rst | 187
-rw-r--r-- src/ceph/doc/rados/api/libradospp.rst | 5
-rw-r--r-- src/ceph/doc/rados/api/objclass-sdk.rst | 37
-rw-r--r-- src/ceph/doc/rados/api/python.rst | 397
-rw-r--r-- src/ceph/doc/rados/command/list-inconsistent-obj.json | 195
-rw-r--r-- src/ceph/doc/rados/command/list-inconsistent-snap.json | 87
-rw-r--r-- src/ceph/doc/rados/configuration/auth-config-ref.rst | 432
-rw-r--r-- src/ceph/doc/rados/configuration/bluestore-config-ref.rst | 297
-rw-r--r-- src/ceph/doc/rados/configuration/ceph-conf.rst | 629
-rw-r--r-- src/ceph/doc/rados/configuration/demo-ceph.conf | 31
-rw-r--r-- src/ceph/doc/rados/configuration/filestore-config-ref.rst | 365
-rw-r--r-- src/ceph/doc/rados/configuration/general-config-ref.rst | 66
-rw-r--r-- src/ceph/doc/rados/configuration/index.rst | 64
-rw-r--r-- src/ceph/doc/rados/configuration/journal-ref.rst | 116
-rw-r--r-- src/ceph/doc/rados/configuration/mon-config-ref.rst | 1222
-rw-r--r-- src/ceph/doc/rados/configuration/mon-lookup-dns.rst | 51
-rw-r--r-- src/ceph/doc/rados/configuration/mon-osd-interaction.rst | 408
-rw-r--r-- src/ceph/doc/rados/configuration/ms-ref.rst | 154
-rw-r--r-- src/ceph/doc/rados/configuration/network-config-ref.rst | 494
-rw-r--r-- src/ceph/doc/rados/configuration/osd-config-ref.rst | 1105
-rw-r--r-- src/ceph/doc/rados/configuration/pool-pg-config-ref.rst | 270
-rw-r--r-- src/ceph/doc/rados/configuration/pool-pg.conf | 20
-rw-r--r-- src/ceph/doc/rados/configuration/storage-devices.rst | 83
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-admin.rst | 38
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-install.rst | 46
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-keys.rst | 32
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-mds.rst | 46
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-mon.rst | 56
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-new.rst | 66
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-osd.rst | 121
-rw-r--r-- src/ceph/doc/rados/deployment/ceph-deploy-purge.rst | 25
-rw-r--r-- src/ceph/doc/rados/deployment/index.rst | 58
-rw-r--r-- src/ceph/doc/rados/deployment/preflight-checklist.rst | 109
-rw-r--r-- src/ceph/doc/rados/index.rst | 76
-rw-r--r-- src/ceph/doc/rados/man/index.rst | 34
-rw-r--r-- src/ceph/doc/rados/operations/add-or-rm-mons.rst | 370
-rw-r--r-- src/ceph/doc/rados/operations/add-or-rm-osds.rst | 366
-rw-r--r-- src/ceph/doc/rados/operations/cache-tiering.rst | 461
-rw-r--r-- src/ceph/doc/rados/operations/control.rst | 453
-rw-r--r-- src/ceph/doc/rados/operations/crush-map-edits.rst | 654
-rw-r--r-- src/ceph/doc/rados/operations/crush-map.rst | 956
-rw-r--r-- src/ceph/doc/rados/operations/data-placement.rst | 37
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-isa.rst | 105
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-jerasure.rst | 120
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-lrc.rst | 371
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-profile.rst | 121
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code-shec.rst | 144
-rw-r--r-- src/ceph/doc/rados/operations/erasure-code.rst | 195
-rw-r--r-- src/ceph/doc/rados/operations/health-checks.rst | 527
-rw-r--r-- src/ceph/doc/rados/operations/index.rst | 90
-rw-r--r-- src/ceph/doc/rados/operations/monitoring-osd-pg.rst | 617
-rw-r--r-- src/ceph/doc/rados/operations/monitoring.rst | 351
-rw-r--r-- src/ceph/doc/rados/operations/operating.rst | 251
-rw-r--r-- src/ceph/doc/rados/operations/pg-concepts.rst | 102
-rw-r--r-- src/ceph/doc/rados/operations/pg-repair.rst | 4
-rw-r--r-- src/ceph/doc/rados/operations/pg-states.rst | 80
-rw-r--r-- src/ceph/doc/rados/operations/placement-groups.rst | 469
-rw-r--r-- src/ceph/doc/rados/operations/pools.rst | 798
-rw-r--r-- src/ceph/doc/rados/operations/upmap.rst | 75
-rw-r--r-- src/ceph/doc/rados/operations/user-management.rst | 665
-rw-r--r-- src/ceph/doc/rados/troubleshooting/community.rst | 29
-rw-r--r-- src/ceph/doc/rados/troubleshooting/cpu-profiling.rst | 67
-rw-r--r-- src/ceph/doc/rados/troubleshooting/index.rst | 19
-rw-r--r-- src/ceph/doc/rados/troubleshooting/log-and-debug.rst | 550
-rw-r--r-- src/ceph/doc/rados/troubleshooting/memory-profiling.rst | 142
-rw-r--r-- src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst | 567
-rw-r--r-- src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst | 536
-rw-r--r-- src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst | 668
70 files changed, 19407 insertions, 0 deletions
diff --git a/src/ceph/doc/rados/api/index.rst b/src/ceph/doc/rados/api/index.rst
new file mode 100644
index 0000000..cccc153
--- /dev/null
+++ b/src/ceph/doc/rados/api/index.rst
@@ -0,0 +1,22 @@
+===========================
+ Ceph Storage Cluster APIs
+===========================
+
+The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables
+clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`.
+``librados`` provides this functionality to :term:`Ceph Clients` in the form of
+a library. All Ceph Clients either use ``librados`` or the same functionality
+encapsulated in ``librados`` to interact with the object store. For example,
+``librbd`` and ``libcephfs`` leverage this functionality. You may use
+``librados`` to interact with Ceph directly (e.g., an application that talks to
+Ceph, your own interface to Ceph, etc.).
+
+
+.. toctree::
+ :maxdepth: 2
+
+ Introduction to librados <librados-intro>
+ librados (C) <librados>
+ librados (C++) <libradospp>
+ librados (Python) <python>
+ object class <objclass-sdk>
diff --git a/src/ceph/doc/rados/api/librados-intro.rst b/src/ceph/doc/rados/api/librados-intro.rst
new file mode 100644
index 0000000..8405f6e
--- /dev/null
+++ b/src/ceph/doc/rados/api/librados-intro.rst
@@ -0,0 +1,1003 @@
+==========================
+ Introduction to librados
+==========================
+
+The :term:`Ceph Storage Cluster` provides the basic storage service that allows
+:term:`Ceph` to uniquely deliver **object, block, and file storage** in one
+unified system. However, you are not limited to using the RESTful, block, or
+POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object
+Store)`, the ``librados`` API enables you to create your own interface to the
+Ceph Storage Cluster.
+
+The ``librados`` API enables you to interact with the two types of daemons in
+the Ceph Storage Cluster:
+
+- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map.
+- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node.
+
+.. ditaa::
+ +---------------------------------+
+ | Ceph Storage Cluster Protocol |
+ | (librados) |
+ +---------------------------------+
+ +---------------+ +---------------+
+ | OSDs | | Monitors |
+ +---------------+ +---------------+
+
+This guide provides a high-level introduction to using ``librados``.
+Refer to :doc:`../../architecture` for additional details of the Ceph
+Storage Cluster. To use the API, you need a running Ceph Storage Cluster.
+See `Installation (Quick)`_ for details.
+
+
+Step 1: Getting librados
+========================
+
+Your client application must bind with ``librados`` to connect to the Ceph
+Storage Cluster. You must install ``librados`` and any required packages to
+write applications that use ``librados``. The ``librados`` API is written in
+C++, with additional bindings for C, Python, Java and PHP.
+
+
+Getting librados for C/C++
+--------------------------
+
+To install ``librados`` development support files for C/C++ on Debian/Ubuntu
+distributions, execute the following::
+
+ sudo apt-get install librados-dev
+
+To install ``librados`` development support files for C/C++ on RHEL/CentOS
+distributions, execute the following::
+
+ sudo yum install librados2-devel
+
+Once you install ``librados`` for developers, you can find the required
+headers for C/C++ under ``/usr/include/rados``. ::
+
+ ls /usr/include/rados
+
+
+Getting librados for Python
+---------------------------
+
+The ``rados`` module provides ``librados`` support to Python
+applications. The ``librados-dev`` package for Debian/Ubuntu
+and the ``librados2-devel`` package for RHEL/CentOS will install the
+``python-rados`` package for you. You may also install ``python-rados``
+directly.
+
+To install ``librados`` development support files for Python on Debian/Ubuntu
+distributions, execute the following::
+
+ sudo apt-get install python-rados
+
+To install ``librados`` development support files for Python on RHEL/CentOS
+distributions, execute the following::
+
+ sudo yum install python-rados
+
+You can find the module under ``/usr/share/pyshared`` on Debian systems,
+or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems.
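+
+To verify that the module is importable from your Python environment, you can
+print its location (the exact path will vary by distribution)::
+
+	python -c "import rados; print(rados.__file__)"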
+
+
+Getting librados for Java
+-------------------------
+
+To install ``librados`` for Java, you need to execute the following procedure:
+
+#. Install ``jna.jar``. For Debian/Ubuntu, execute::
+
+ sudo apt-get install libjna-java
+
+ For CentOS/RHEL, execute::
+
+ sudo yum install jna
+
+ The JAR files are located in ``/usr/share/java``.
+
+#. Clone the ``rados-java`` repository::
+
+ git clone --recursive https://github.com/ceph/rados-java.git
+
+#. Build the ``rados-java`` repository::
+
+ cd rados-java
+ ant
+
+ The JAR file is located under ``rados-java/target``.
+
+#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and
+ ensure that it and the JNA JAR are in your JVM's classpath. For example::
+
+ sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar
+ sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar
+ sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar
+
+To build the documentation, execute the following::
+
+ ant docs
+
+
+Getting librados for PHP
+-------------------------
+
+To install the ``librados`` extension for PHP, you need to execute the following procedure:
+
+#. Install php-dev. For Debian/Ubuntu, execute::
+
+ sudo apt-get install php5-dev build-essential
+
+ For CentOS/RHEL, execute::
+
+ sudo yum install php-devel
+
+#. Clone the ``phprados`` repository::
+
+ git clone https://github.com/ceph/phprados.git
+
+#. Build ``phprados``::
+
+ cd phprados
+ phpize
+ ./configure
+ make
+ sudo make install
+
+#. Enable ``phprados`` in php.ini by adding::
+
+ extension=rados.so
+
+
+Step 2: Configuring a Cluster Handle
+====================================
+
+A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store
+and retrieve data. To interact with OSDs, the client app must invoke
+``librados`` and connect to a Ceph Monitor. Once connected, ``librados``
+retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app
+wants to read or write data, it creates an I/O context and binds to a
+:term:`pool`. The pool has an associated :term:`ruleset` that defines how it
+will place data in the storage cluster. Via the I/O context, the client
+provides the object name to ``librados``, which takes the object name
+and the cluster map (i.e., the topology of the cluster) and `computes`_ the
+placement group and `OSD`_ for locating the data. Then the client application
+can read or write data. The client app doesn't need to learn about the topology
+of the cluster directly.
+
+.. ditaa::
+ +--------+ Retrieves +---------------+
+ | Client |------------>| Cluster Map |
+ +--------+ +---------------+
+ |
+ v Writes
+ /-----\
+ | obj |
+ \-----/
+ | To
+ v
+ +--------+ +---------------+
+ | Pool |---------->| CRUSH Ruleset |
+ +--------+ Selects +---------------+
+
+
+The Ceph Storage Cluster handle encapsulates the client configuration, including:
+
+- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()``
+ (preferred).
+- The :term:`cephx` authentication key
+- The monitor ID and IP address
+- Logging levels
+- Debugging levels
+
+Thus, the first steps in using the cluster from your app are to 1) create
+a cluster handle that your app will use to connect to the storage cluster,
+and then 2) use that handle to connect. To connect to the cluster, the
+app must supply a monitor address, a username and an authentication key
+(cephx is enabled by default).
+
+.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster
+ with different users – requires different cluster handles.
+
+RADOS provides a number of ways for you to set the required values. For
+the monitor and encryption key settings, an easy way to handle them is to ensure
+that your Ceph configuration file contains a ``keyring`` path to a keyring file
+and at least one monitor address (e.g., ``mon host``). For example::
+
+ [global]
+ mon host = 192.168.1.1
+ keyring = /etc/ceph/ceph.client.admin.keyring
+
+Once you create the handle, you can read a Ceph configuration file to configure
+the handle. You can also pass arguments to your app and parse them with the
+function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``),
+or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some
+language wrappers may not implement these convenience methods, so you may need
+to implement the equivalent yourself. The following diagram provides a
+high-level flow for the initial connection.
+
+
+.. ditaa:: +---------+ +---------+
+ | Client | | Monitor |
+ +---------+ +---------+
+ | |
+ |-----+ create |
+ | | cluster |
+ |<----+ handle |
+ | |
+ |-----+ read |
+ | | config |
+ |<----+ file |
+ | |
+ | connect |
+ |-------------->|
+ | |
+ |<--------------|
+ | connected |
+ | |
+
+
+Once connected, your app can invoke functions that affect the whole cluster
+with only the cluster handle. For example, once you have a cluster
+handle, you can:
+
+- Get cluster statistics
+- Use pool operations (exists, create, list, delete)
+- Get and set the configuration
+
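+For example, a brief sketch of cluster-level calls using the C API (it assumes
+the connected ``cluster`` handle created in the example below; error checking
+is omitted for brevity):
+
+.. code-block:: c
+
+    struct rados_cluster_stat_t stats;
+    rados_cluster_stat(cluster, &stats);             /* cluster statistics */
+
+    char pools[1024];
+    rados_pool_list(cluster, pools, sizeof(pools));  /* NUL-separated pool names */
+
+    char fsid[37];
+    rados_cluster_fsid(cluster, fsid, sizeof(fsid)); /* cluster fsid as a string */
+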
+
+One of the powerful features of Ceph is the ability to bind to different pools.
+Each pool may have a different number of placement groups, object replicas and
+replication strategies. For example, a pool could be set up as a "hot" pool that
+uses SSDs for frequently used objects or a "cold" pool that uses erasure coding.
+
+The main difference in the various ``librados`` bindings is between C and
+the object-oriented bindings for C++, Java and Python. The object-oriented
+bindings use objects to represent cluster handles, IO Contexts, iterators,
+exceptions, etc.
+
+
+C Example
+---------
+
+For C, creating a simple cluster handle using the ``admin`` user, configuring
+it and connecting to the cluster might look something like this:
+
+.. code-block:: c
+
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <string.h>
+ #include <rados/librados.h>
+
+ int main (int argc, const char **argv)
+ {
+
+ /* Declare the cluster handle and required arguments. */
+ rados_t cluster;
+ char cluster_name[] = "ceph";
+ char user_name[] = "client.admin";
+    uint64_t flags = 0;
+
+ /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */
+ int err;
+ err = rados_create2(&cluster, cluster_name, user_name, flags);
+
+ if (err < 0) {
+ fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err));
+ exit(EXIT_FAILURE);
+ } else {
+ printf("\nCreated a cluster handle.\n");
+ }
+
+
+ /* Read a Ceph configuration file to configure the cluster handle. */
+ err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err));
+ exit(EXIT_FAILURE);
+ } else {
+ printf("\nRead the config file.\n");
+ }
+
+ /* Read command line arguments */
+ err = rados_conf_parse_argv(cluster, argc, argv);
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err));
+ exit(EXIT_FAILURE);
+ } else {
+ printf("\nRead the command line arguments.\n");
+ }
+
+ /* Connect to the cluster */
+ err = rados_connect(cluster);
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err));
+ exit(EXIT_FAILURE);
+ } else {
+ printf("\nConnected to the cluster.\n");
+ }
+
+ }
+
+Compile your client and link to ``librados`` using ``-lrados``. For example::
+
+ gcc ceph-client.c -lrados -o ceph-client
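+
+You can then run the resulting binary, optionally passing Ceph command line
+options in ``--key value`` form for ``rados_conf_parse_argv()`` to pick up
+(the ``--keyring`` option below is just one example)::
+
+	./ceph-client --keyring /etc/ceph/ceph.client.admin.keyring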
+
+
+C++ Example
+-----------
+
+The Ceph project provides a C++ example in the ``ceph/examples/librados``
+directory. For C++, a simple cluster handle using the ``admin`` user requires
+you to initialize a ``librados::Rados`` cluster handle object:
+
+.. code-block:: c++
+
+ #include <iostream>
+ #include <string>
+ #include <rados/librados.hpp>
+
+ int main(int argc, const char **argv)
+ {
+
+ int ret = 0;
+
+ /* Declare the cluster handle and required variables. */
+ librados::Rados cluster;
+ char cluster_name[] = "ceph";
+ char user_name[] = "client.admin";
+ uint64_t flags = 0;
+
+ /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */
+ {
+ ret = cluster.init2(user_name, cluster_name, flags);
+ if (ret < 0) {
+ std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl;
+ return EXIT_FAILURE;
+ } else {
+ std::cout << "Created a cluster handle." << std::endl;
+ }
+ }
+
+ /* Read a Ceph configuration file to configure the cluster handle. */
+ {
+ ret = cluster.conf_read_file("/etc/ceph/ceph.conf");
+ if (ret < 0) {
+ std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl;
+ return EXIT_FAILURE;
+ } else {
+ std::cout << "Read the Ceph configuration file." << std::endl;
+ }
+ }
+
+ /* Read command line arguments */
+ {
+ ret = cluster.conf_parse_argv(argc, argv);
+ if (ret < 0) {
+ std::cerr << "Couldn't parse command line options! error " << ret << std::endl;
+ return EXIT_FAILURE;
+ } else {
+ std::cout << "Parsed command line options." << std::endl;
+ }
+ }
+
+ /* Connect to the cluster */
+ {
+ ret = cluster.connect();
+ if (ret < 0) {
+ std::cerr << "Couldn't connect to cluster! error " << ret << std::endl;
+ return EXIT_FAILURE;
+ } else {
+ std::cout << "Connected to the cluster." << std::endl;
+ }
+ }
+
+ return 0;
+ }
+
+
+Compile the source; then, link ``librados`` using ``-lrados``.
+For example::
+
+ g++ -g -c ceph-client.cc -o ceph-client.o
+ g++ -g ceph-client.o -lrados -o ceph-client
+
+
+
+Python Example
+--------------
+
+Python uses the ``admin`` id and the ``ceph`` cluster name by default, and
+will read the standard ``ceph.conf`` file if the ``conffile`` parameter is
+set to the empty string. The Python binding converts C++ errors
+into exceptions.
+
+
+.. code-block:: python
+
+ import rados
+
+ try:
+ cluster = rados.Rados(conffile='')
+ except TypeError as e:
+ print 'Argument validation error: ', e
+ raise e
+
+ print "Created cluster handle."
+
+    try:
+        cluster.connect()
+    except Exception as e:
+        print "connection error: ", e
+        raise e
+    else:
+        print "Connected to the cluster."
+
+
+Execute the example to verify that it connects to your cluster. ::
+
+ python ceph-client.py
+
+
+Java Example
+------------
+
+Java requires you to specify the user ID (``admin``) or user name
+(``client.admin``), and uses the ``ceph`` cluster name by default. The Java
+binding converts C++-based errors into exceptions.
+
+.. code-block:: java
+
+ import com.ceph.rados.Rados;
+ import com.ceph.rados.RadosException;
+
+ import java.io.File;
+
+ public class CephClient {
+ public static void main (String args[]){
+
+ try {
+ Rados cluster = new Rados("admin");
+ System.out.println("Created cluster handle.");
+
+ File f = new File("/etc/ceph/ceph.conf");
+ cluster.confReadFile(f);
+ System.out.println("Read the configuration file.");
+
+ cluster.connect();
+ System.out.println("Connected to the cluster.");
+
+ } catch (RadosException e) {
+ System.out.println(e.getMessage() + ": " + e.getReturnValue());
+ }
+ }
+ }
+
+
+Compile the source; then, run it. If you have copied the JAR to
+``/usr/share/java`` and symlinked it from your ``ext`` directory, you won't need
+to specify the classpath. For example::
+
+ javac CephClient.java
+ java CephClient
+
+
+PHP Example
+------------
+
+With the RADOS extension enabled in PHP, you can create a new cluster handle very easily:
+
+.. code-block:: php
+
+ <?php
+
+ $r = rados_create();
+ rados_conf_read_file($r, '/etc/ceph/ceph.conf');
+ if (!rados_connect($r)) {
+ echo "Failed to connect to Ceph cluster";
+ } else {
+ echo "Successfully connected to Ceph cluster";
+ }
+
+
+Save this as ``rados.php`` and run the code::
+
+ php rados.php
+
+
+Step 3: Creating an I/O Context
+===============================
+
+Once your app has a cluster handle and a connection to a Ceph Storage Cluster,
+you may create an I/O Context and begin reading and writing data. An I/O Context
+binds the connection to a specific pool. The user must have appropriate
+`CAPS`_ permissions to access the specified pool. For example, a user with read
+access but not write access will only be able to read data. I/O Context
+functionality includes:
+
+- Write/read data and extended attributes
+- List and iterate over objects and extended attributes
+- Snapshot pools, list snapshots, etc.
+
+
+.. ditaa:: +---------+ +---------+ +---------+
+ | Client | | Monitor | | OSD |
+ +---------+ +---------+ +---------+
+ | | |
+ |-----+ create | |
+ | | I/O | |
+ |<----+ context | |
+ | | |
+ | write data | |
+ |---------------+-------------->|
+ | | |
+ | write ack | |
+ |<--------------+---------------|
+ | | |
+ | write xattr | |
+ |---------------+-------------->|
+ | | |
+ | xattr ack | |
+ |<--------------+---------------|
+ | | |
+ | read data | |
+ |---------------+-------------->|
+ | | |
+ | read ack | |
+ |<--------------+---------------|
+ | | |
+ | remove data | |
+ |---------------+-------------->|
+ | | |
+ | remove ack | |
+ |<--------------+---------------|
+
+
+
+RADOS enables you to interact both synchronously and asynchronously. Once your
+app has an I/O Context, read/write operations only require you to know the
+object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the
+cluster map to identify the appropriate OSD. OSD daemons handle the replication,
+as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also
+maps objects to placement groups, as described in `Calculating PG IDs`_.
+
+The following examples use the default ``data`` pool. However, you may also
+use the API to list pools, ensure they exist, or create and delete pools. For
+the write operations, the examples illustrate how to use synchronous mode. For
+the read operations, the examples illustrate how to use asynchronous mode.
+
+.. important:: Use caution when deleting pools with this API. If you delete
+ a pool, the pool and ALL DATA in the pool will be lost.
+
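+If the ``data`` pool does not exist on your cluster yet, you can check for it
+and create it first (a sketch using the C API; creating a pool requires the
+appropriate monitor capabilities):
+
+.. code-block:: c
+
+    int64_t pool_id = rados_pool_lookup(cluster, "data");
+    if (pool_id < 0) {
+        /* The pool is missing; try to create it. */
+        int err = rados_pool_create(cluster, "data");
+        if (err < 0) {
+            fprintf(stderr, "cannot create pool: %s\n", strerror(-err));
+        }
+    }
+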
+
+C Example
+---------
+
+
+.. code-block:: c
+
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <string.h>
+ #include <rados/librados.h>
+
+ int main (int argc, const char **argv)
+ {
+ /*
+ * Continued from previous C example, where cluster handle and
+ * connection are established. First declare an I/O Context.
+ */
+
+    int err;
+    rados_ioctx_t io;
+    const char *poolname = "data";
+
+ err = rados_ioctx_create(cluster, poolname, &io);
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err));
+ rados_shutdown(cluster);
+ exit(EXIT_FAILURE);
+ } else {
+ printf("\nCreated I/O context.\n");
+ }
+
+ /* Write data to the cluster synchronously. */
+ err = rados_write(io, "hw", "Hello World!", 12, 0);
+ if (err < 0) {
+ fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ } else {
+ printf("\nWrote \"Hello World\" to object \"hw\".\n");
+ }
+
+ char xattr[] = "en_US";
+ err = rados_setxattr(io, "hw", "lang", xattr, 5);
+ if (err < 0) {
+ fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ } else {
+ printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n");
+ }
+
+ /*
+ * Read data from the cluster asynchronously.
+ * First, set up asynchronous I/O completion.
+ */
+ rados_completion_t comp;
+ err = rados_aio_create_completion(NULL, NULL, NULL, &comp);
+ if (err < 0) {
+ fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ } else {
+ printf("\nCreated AIO completion.\n");
+ }
+
+    /* Next, read data using rados_aio_read. */
+    char read_res[100];
+    memset(read_res, 0, sizeof(read_res));  /* ensure NUL termination for printf */
+    err = rados_aio_read(io, "hw", comp, read_res, 12, 0);
+    if (err < 0) {
+        fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err));
+        rados_ioctx_destroy(io);
+        rados_shutdown(cluster);
+        exit(1);
+    }
+
+    /* Wait for the operation to complete before reading the buffer */
+    rados_aio_wait_for_complete(comp);
+    printf("\nRead object \"hw\". The contents are:\n %s \n", read_res);
+
+    /* Release the asynchronous I/O complete handle to avoid memory leaks. */
+    rados_aio_release(comp);
+
+
+ char xattr_res[100];
+ err = rados_getxattr(io, "hw", "lang", xattr_res, 5);
+ if (err < 0) {
+ fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ } else {
+ printf("\nRead xattr \"lang\" for object \"hw\". The contents are:\n %s \n", xattr_res);
+ }
+
+ err = rados_rmxattr(io, "hw", "lang");
+ if (err < 0) {
+ fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ } else {
+ printf("\nRemoved xattr \"lang\" for object \"hw\".\n");
+ }
+
+ err = rados_remove(io, "hw");
+ if (err < 0) {
+ fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ } else {
+ printf("\nRemoved object \"hw\".\n");
+ }
+
+ }
+
+
+
+C++ Example
+-----------
+
+
+.. code-block:: c++
+
+ #include <iostream>
+ #include <string>
+ #include <rados/librados.hpp>
+
+ int main(int argc, const char **argv)
+ {
+
+ /* Continued from previous C++ example, where cluster handle and
+ * connection are established. First declare an I/O Context.
+ */
+
+    int ret = 0;
+    librados::IoCtx io_ctx;
+    const char *pool_name = "data";
+
+ {
+ ret = cluster.ioctx_create(pool_name, io_ctx);
+ if (ret < 0) {
+ std::cerr << "Couldn't set up ioctx! error " << ret << std::endl;
+ exit(EXIT_FAILURE);
+ } else {
+ std::cout << "Created an ioctx for the pool." << std::endl;
+ }
+ }
+
+
+ /* Write an object synchronously. */
+ {
+ librados::bufferlist bl;
+ bl.append("Hello World!");
+ ret = io_ctx.write_full("hw", bl);
+ if (ret < 0) {
+ std::cerr << "Couldn't write object! error " << ret << std::endl;
+ exit(EXIT_FAILURE);
+ } else {
+ std::cout << "Wrote new object 'hw' " << std::endl;
+ }
+ }
+
+
+ /*
+ * Add an xattr to the object.
+ */
+ {
+ librados::bufferlist lang_bl;
+ lang_bl.append("en_US");
+ ret = io_ctx.setxattr("hw", "lang", lang_bl);
+ if (ret < 0) {
+ std::cerr << "failed to set xattr version entry! error "
+ << ret << std::endl;
+ exit(EXIT_FAILURE);
+ } else {
+ std::cout << "Set the xattr 'lang' on our object!" << std::endl;
+ }
+ }
+
+
+ /*
+ * Read the object back asynchronously.
+ */
+ {
+ librados::bufferlist read_buf;
+ int read_len = 4194304;
+
+ //Create I/O Completion.
+ librados::AioCompletion *read_completion = librados::Rados::aio_create_completion();
+
+ //Send read request.
+ ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0);
+ if (ret < 0) {
+ std::cerr << "Couldn't start read object! error " << ret << std::endl;
+ exit(EXIT_FAILURE);
+ }
+
+ // Wait for the request to complete, and check that it succeeded.
+ read_completion->wait_for_complete();
+ ret = read_completion->get_return_value();
+ if (ret < 0) {
+ std::cerr << "Couldn't read object! error " << ret << std::endl;
+ exit(EXIT_FAILURE);
+ } else {
+ std::cout << "Read object hw asynchronously with contents.\n"
+ << read_buf.c_str() << std::endl;
+ }
+ }
+
+
+ /*
+ * Read the xattr.
+ */
+ {
+ librados::bufferlist lang_res;
+ ret = io_ctx.getxattr("hw", "lang", lang_res);
+ if (ret < 0) {
+ std::cerr << "failed to get xattr version entry! error "
+ << ret << std::endl;
+ exit(EXIT_FAILURE);
+ } else {
+ std::cout << "Got the xattr 'lang' from object hw!"
+ << lang_res.c_str() << std::endl;
+ }
+ }
+
+
+ /*
+ * Remove the xattr.
+ */
+ {
+ ret = io_ctx.rmxattr("hw", "lang");
+ if (ret < 0) {
+ std::cerr << "Failed to remove xattr! error "
+ << ret << std::endl;
+ exit(EXIT_FAILURE);
+ } else {
+ std::cout << "Removed the xattr 'lang' from our object!" << std::endl;
+ }
+ }
+
+ /*
+ * Remove the object.
+ */
+ {
+ ret = io_ctx.remove("hw");
+ if (ret < 0) {
+ std::cerr << "Couldn't remove object! error " << ret << std::endl;
+ exit(EXIT_FAILURE);
+ } else {
+ std::cout << "Removed object 'hw'." << std::endl;
+ }
+ }
+ }
+
+
+
+Python Example
+--------------
+
+.. code-block:: python
+
+ print "\n\nI/O Context and Object Operations"
+ print "================================="
+
+ print "\nCreating a context for the 'data' pool"
+ if not cluster.pool_exists('data'):
+ raise RuntimeError('No data pool exists')
+ ioctx = cluster.open_ioctx('data')
+
+ print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'."
+ ioctx.write("hw", "Hello World!")
+ print "Writing XATTR 'lang' with value 'en_US' to object 'hw'"
+ ioctx.set_xattr("hw", "lang", "en_US")
+
+
+ print "\nWriting object 'bm' with contents 'Bonjour tout le monde!' to pool 'data'."
+ ioctx.write("bm", "Bonjour tout le monde!")
+ print "Writing XATTR 'lang' with value 'fr_FR' to object 'bm'"
+ ioctx.set_xattr("bm", "lang", "fr_FR")
+
+ print "\nContents of object 'hw'\n------------------------"
+ print ioctx.read("hw")
+
+ print "\n\nGetting XATTR 'lang' from object 'hw'"
+ print ioctx.get_xattr("hw", "lang")
+
+ print "\nContents of object 'bm'\n------------------------"
+ print ioctx.read("bm")
+
+ print "Getting XATTR 'lang' from object 'bm'"
+ print ioctx.get_xattr("bm", "lang")
+
+
+ print "\nRemoving object 'hw'"
+ ioctx.remove_object("hw")
+
+ print "Removing object 'bm'"
+ ioctx.remove_object("bm")
+
+
+Java Example
+------------
+
+.. code-block:: java
+
+ import com.ceph.rados.Rados;
+ import com.ceph.rados.RadosException;
+
+ import java.io.File;
+ import com.ceph.rados.IoCTX;
+
+ public class CephClient {
+ public static void main (String args[]){
+
+ try {
+ Rados cluster = new Rados("admin");
+ System.out.println("Created cluster handle.");
+
+ File f = new File("/etc/ceph/ceph.conf");
+ cluster.confReadFile(f);
+ System.out.println("Read the configuration file.");
+
+ cluster.connect();
+ System.out.println("Connected to the cluster.");
+
+ IoCTX io = cluster.ioCtxCreate("data");
+
+ String oidone = "hw";
+ String contentone = "Hello World!";
+ io.write(oidone, contentone);
+
+ String oidtwo = "bm";
+ String contenttwo = "Bonjour tout le monde!";
+ io.write(oidtwo, contenttwo);
+
+ String[] objects = io.listObjects();
+ for (String object: objects)
+ System.out.println(object);
+
+ io.remove(oidone);
+ io.remove(oidtwo);
+
+ cluster.ioCtxDestroy(io);
+
+ } catch (RadosException e) {
+ System.out.println(e.getMessage() + ": " + e.getReturnValue());
+ }
+ }
+ }
+
+
+PHP Example
+-----------
+
+.. code-block:: php
+
+ <?php
+
+ $io = rados_ioctx_create($r, "mypool");
+ rados_write_full($io, "oidOne", "mycontents");
+    rados_remove($io, "oidOne");
+ rados_ioctx_destroy($io);
+
+
+Step 4: Closing Sessions
+========================
+
+Once your app finishes with the I/O Context and cluster handle, the app should
+close the connection and shutdown the handle. For asynchronous I/O, the app
+should also ensure that pending asynchronous operations have completed.
+
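+With the C API, for example, you can block until all pending asynchronous
+operations on an I/O context are acknowledged before tearing it down (a
+minimal sketch, assuming the ``io`` context from Step 3):
+
+.. code-block:: c
+
+    rados_aio_flush(io);  /* wait for all pending aio operations on this context */
+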
+
+C Example
+---------
+
+.. code-block:: c
+
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+
+
+C++ Example
+-----------
+
+.. code-block:: c++
+
+ io_ctx.close();
+ cluster.shutdown();
+
+
+Java Example
+--------------
+
+.. code-block:: java
+
+ cluster.ioCtxDestroy(io);
+ cluster.shutDown();
+
+
+Python Example
+--------------
+
+.. code-block:: python
+
+ print "\nClosing the connection."
+ ioctx.close()
+
+ print "Shutting down the handle."
+ cluster.shutdown()
+
+PHP Example
+-----------
+
+.. code-block:: php
+
+ rados_shutdown($r);
+
+
+
+.. _user ID: ../../operations/user-management#command-line-usage
+.. _CAPS: ../../operations/user-management#authorization-capabilities
+.. _Installation (Quick): ../../../start
+.. _Smart Daemons Enable Hyperscale: ../../../architecture#smart-daemons-enable-hyperscale
+.. _Calculating PG IDs: ../../../architecture#calculating-pg-ids
+.. _computes: ../../../architecture#calculating-pg-ids
+.. _OSD: ../../../architecture#mapping-pgs-to-osds
diff --git a/src/ceph/doc/rados/api/librados.rst b/src/ceph/doc/rados/api/librados.rst
new file mode 100644
index 0000000..73d0e42
--- /dev/null
+++ b/src/ceph/doc/rados/api/librados.rst
@@ -0,0 +1,187 @@
+==============
+ Librados (C)
+==============
+
+.. highlight:: c
+
+`librados` provides low-level access to the RADOS service. For an
+overview of RADOS, see :doc:`../../architecture`.
+
+
+Example: connecting and writing an object
+=========================================
+
+To use `Librados`, you instantiate a :c:type:`rados_t` variable (a cluster handle) and
+call :c:func:`rados_create()` with a pointer to it::
+
+ int err;
+ rados_t cluster;
+
+ err = rados_create(&cluster, NULL);
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot create a cluster handle: %s\n", argv[0], strerror(-err));
+ exit(1);
+ }
+
+Then you configure your :c:type:`rados_t` to connect to your cluster,
+either by setting individual values (:c:func:`rados_conf_set()`),
+using a configuration file (:c:func:`rados_conf_read_file()`), using
+command line options (:c:func:`rados_conf_parse_argv()`), or using an
+environment variable (:c:func:`rados_conf_parse_env()`)::
+
+ err = rados_conf_read_file(cluster, "/path/to/myceph.conf");
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err));
+ exit(1);
+ }
+
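+Individual options and environment parsing follow the same pattern (a short
+sketch; passing NULL to :c:func:`rados_conf_parse_env()` reads the default
+``CEPH_ARGS`` environment variable)::
+
+	err = rados_conf_set(cluster, "mon host", "192.168.1.1");
+	if (err < 0) {
+		fprintf(stderr, "%s: cannot set config option: %s\n", argv[0], strerror(-err));
+		exit(1);
+	}
+
+	err = rados_conf_parse_env(cluster, NULL);
+	if (err < 0) {
+		fprintf(stderr, "%s: cannot parse environment: %s\n", argv[0], strerror(-err));
+		exit(1);
+	}
+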
+Once the cluster handle is configured, you can connect to the cluster with :c:func:`rados_connect()`::
+
+ err = rados_connect(cluster);
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err));
+ exit(1);
+ }
+
+Then you open an "IO context", a :c:type:`rados_ioctx_t`, with :c:func:`rados_ioctx_create()`::
+
+ rados_ioctx_t io;
+ char *poolname = "mypool";
+
+ err = rados_ioctx_create(cluster, poolname, &io);
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err));
+ rados_shutdown(cluster);
+ exit(1);
+ }
+
+Note that the pool you try to access must exist.
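+
+If it does not, you can create it with :c:func:`rados_pool_create()` before
+opening the IO context (a sketch; an ``-EEXIST`` return simply means another
+client created the pool in the meantime)::
+
+	err = rados_pool_create(cluster, poolname);
+	if (err < 0 && err != -EEXIST) {
+		fprintf(stderr, "%s: cannot create pool %s: %s\n", argv[0], poolname, strerror(-err));
+		rados_shutdown(cluster);
+		exit(1);
+	}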
+
+Then you can use the RADOS data manipulation functions, for example
+write into an object called ``greeting`` with
+:c:func:`rados_write_full()`::
+
+ err = rados_write_full(io, "greeting", "hello", 5);
+ if (err < 0) {
+ fprintf(stderr, "%s: cannot write pool %s: %s\n", argv[0], poolname, strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ }
+
+In the end, you will want to close your IO context and connection to RADOS with :c:func:`rados_ioctx_destroy()` and :c:func:`rados_shutdown()`::
+
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+
+
+Asynchronous IO
+===============
+
+When doing lots of IO, you often don't need to wait for one operation
+to complete before starting the next one. `Librados` provides
+asynchronous versions of several operations:
+
+* :c:func:`rados_aio_write`
+* :c:func:`rados_aio_append`
+* :c:func:`rados_aio_write_full`
+* :c:func:`rados_aio_read`
+
+For each operation, you must first create a
+:c:type:`rados_completion_t` that represents what to do when the
+operation is safe or complete by calling
+:c:func:`rados_aio_create_completion`. If you don't need anything
+special to happen, you can pass NULL::
+
+ rados_completion_t comp;
+ err = rados_aio_create_completion(NULL, NULL, NULL, &comp);
+ if (err < 0) {
+ fprintf(stderr, "%s: could not create aio completion: %s\n", argv[0], strerror(-err));
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ }
+
+Now you can call any of the aio operations, and wait for the result to
+be in memory or on disk on all replicas::
+
+ err = rados_aio_write(io, "foo", comp, "bar", 3, 0);
+ if (err < 0) {
+ fprintf(stderr, "%s: could not schedule aio write: %s\n", argv[0], strerror(-err));
+ rados_aio_release(comp);
+ rados_ioctx_destroy(io);
+ rados_shutdown(cluster);
+ exit(1);
+ }
+ rados_aio_wait_for_complete(comp); // in memory
+ rados_aio_wait_for_safe(comp); // on disk
+
+Finally, we need to free the memory used by the completion with :c:func:`rados_aio_release`::
+
+ rados_aio_release(comp);
+
+You can use the callbacks to tell your application when writes are
+durable, or when read buffers are full. For example, if you wanted to
+measure the latency of each operation when appending to several
+objects, you could schedule several writes and store the ack and
+commit time in the corresponding callback, then wait for all of them
+to complete using :c:func:`rados_aio_flush` before analyzing the
+latencies::
+
+ typedef struct {
+ struct timeval start;
+ struct timeval ack_end;
+ struct timeval commit_end;
+ } req_duration;
+
+ void ack_callback(rados_completion_t comp, void *arg) {
+ req_duration *dur = (req_duration *) arg;
+ gettimeofday(&dur->ack_end, NULL);
+ }
+
+ void commit_callback(rados_completion_t comp, void *arg) {
+ req_duration *dur = (req_duration *) arg;
+ gettimeofday(&dur->commit_end, NULL);
+ }
+
+ int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) {
+ req_duration times[num_writes];
+ rados_completion_t comps[num_writes];
+ for (size_t i = 0; i < num_writes; ++i) {
+ gettimeofday(&times[i].start, NULL);
+ int err = rados_aio_create_completion((void*) &times[i], ack_callback, commit_callback, &comps[i]);
+ if (err < 0) {
+ fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err));
+ return err;
+ }
+ char obj_name[100];
+        snprintf(obj_name, sizeof(obj_name), "foo%lu", (unsigned long)i);
+ err = rados_aio_append(io, obj_name, comps[i], data, len);
+ if (err < 0) {
+ fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err));
+ return err;
+ }
+ }
+ // wait until all requests finish *and* the callbacks complete
+ rados_aio_flush(io);
+ // the latencies can now be analyzed
+ printf("Request # | Ack latency (s) | Commit latency (s)\n");
+ for (size_t i = 0; i < num_writes; ++i) {
+ // don't forget to free the completions
+ rados_aio_release(comps[i]);
+ struct timeval ack_lat, commit_lat;
+ timersub(&times[i].ack_end, &times[i].start, &ack_lat);
+ timersub(&times[i].commit_end, &times[i].start, &commit_lat);
+        printf("%9lu | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, commit_lat.tv_sec, commit_lat.tv_usec);
+ }
+ return 0;
+ }
+
+Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory.
+
+
+API calls
+=========
+
+ .. autodoxygenfile:: rados_types.h
+ .. autodoxygenfile:: librados.h
diff --git a/src/ceph/doc/rados/api/libradospp.rst b/src/ceph/doc/rados/api/libradospp.rst
new file mode 100644
index 0000000..27d3fa7
--- /dev/null
+++ b/src/ceph/doc/rados/api/libradospp.rst
@@ -0,0 +1,5 @@
+==================
+ LibradosPP (C++)
+==================
+
+.. todo:: write me!
diff --git a/src/ceph/doc/rados/api/objclass-sdk.rst b/src/ceph/doc/rados/api/objclass-sdk.rst
new file mode 100644
index 0000000..6b1162f
--- /dev/null
+++ b/src/ceph/doc/rados/api/objclass-sdk.rst
@@ -0,0 +1,37 @@
+===========================
+SDK for Ceph Object Classes
+===========================
+
+`Ceph` can be extended by creating shared object classes called `Ceph Object
+Classes`. The existing framework to build these object classes has dependencies
+on the internal functionality of `Ceph`, which restricts users to build object
+classes within the tree. The aim of this project is to create an independent
+object class interface, which can be used to build object classes outside the
+`Ceph` tree. This allows for two types of object classes: 1) those that
+have in-tree dependencies and reside in the tree, and 2) those that can make use
+of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph`
+tree because they do not depend on any internal implementation of `Ceph`. This
+project decouples object class development from Ceph and encourages creation
+and distribution of object classes as packages.
+
+In order to demonstrate the use of this framework, we have provided an example
+called ``cls_sdk``, which is a very simple object class that makes use of the
+SDK framework. This object class resides in the ``src/cls`` directory.
+
+Installing objclass.h
+---------------------
+
+The object class interface that enables out-of-tree development of object
+classes resides in ``src/include/rados/`` and gets installed with `Ceph`
+installation. After running ``make install``, you should be able to see it
+in ``<prefix>/include/rados``. ::
+
+ ls /usr/local/include/rados
+
+Using the SDK example
+---------------------
+
+The ``cls_sdk`` object class resides in ``src/cls/sdk/``. It is built and
+loaded into Ceph as part of the Ceph build process. You can run the
+``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``,
+to test this class.
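+
+For example (a sketch; the test binary's location varies with your build
+system, e.g. ``build/bin`` for cmake builds)::
+
+	./build/bin/ceph_test_cls_sdk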
diff --git a/src/ceph/doc/rados/api/python.rst b/src/ceph/doc/rados/api/python.rst
new file mode 100644
index 0000000..b4fd7e0
--- /dev/null
+++ b/src/ceph/doc/rados/api/python.rst
@@ -0,0 +1,397 @@
+===================
+ Librados (Python)
+===================
+
+The ``rados`` module is a thin Python wrapper for ``librados``.
+
+Installation
+============
+
+To install Python libraries for Ceph, see `Getting librados for Python`_.
+
+
+Getting Started
+===============
+
+You can create your own Ceph client using Python. The following tutorial will
+show you how to import the Ceph Python module, connect to a Ceph cluster, and
+perform object operations as a ``client.admin`` user.
+
+.. note:: To use the Ceph Python bindings, you must have access to a
+ running Ceph cluster. To set one up quickly, see `Getting Started`_.
+
+First, create a Python source file for your Ceph client. ::
+
+	sudo vim client.py
+
+
+Import the Module
+-----------------
+
+To use the ``rados`` module, import it into your source file.
+
+.. code-block:: python
+ :linenos:
+
+ import rados
+
+
+Configure a Cluster Handle
+--------------------------
+
+Before connecting to the Ceph Storage Cluster, create a cluster handle. By
+default, the cluster handle assumes a cluster named ``ceph`` (the default for
+deployment tools and our Getting Started guides) and a ``client.admin`` user
+name. You may change these defaults to suit your needs.
+
+To connect to the Ceph Storage Cluster, your application needs to know where to
+find the Ceph Monitor. Provide this information to your application by
+specifying the path to your Ceph configuration file, which contains the location
+of the initial Ceph monitors.
+
+.. code-block:: python
+ :linenos:
+
+ import rados, sys
+
+ #Create Handle Examples.
+ cluster = rados.Rados(conffile='ceph.conf')
+ cluster = rados.Rados(conffile=sys.argv[1])
+    cluster = rados.Rados(conffile='ceph.conf', conf=dict(keyring='/path/to/keyring'))
+
+Ensure that the ``conffile`` argument provides the path and file name of your
+Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the
+Ceph configuration path and file name.
+
+Your Python client also requires a client keyring. For this example, we use the
+``client.admin`` key by default. If you would like to specify the keyring when
+creating the cluster handle, you may use the ``conf`` argument. Alternatively,
+you may specify the keyring path in your Ceph configuration file. For example,
+you may add something like the following line to your Ceph configuration file::
+
+ keyring = /path/to/ceph.client.admin.keyring
+
+For additional details on modifying your configuration via Python, see `Configuration`_.
+
+
+Connect to the Cluster
+----------------------
+
+Once you have a cluster handle configured, you may connect to the cluster.
+With a connection to the cluster, you may execute methods that return
+information about the cluster.
+
+.. code-block:: python
+ :linenos:
+ :emphasize-lines: 7
+
+ import rados, sys
+
+ cluster = rados.Rados(conffile='ceph.conf')
+ print "\nlibrados version: " + str(cluster.version())
+ print "Will attempt to connect to: " + str(cluster.conf_get('mon initial members'))
+
+ cluster.connect()
+ print "\nCluster ID: " + cluster.get_fsid()
+
+ print "\n\nCluster Statistics"
+ print "=================="
+ cluster_stats = cluster.get_cluster_stats()
+
+ for key, value in cluster_stats.iteritems():
+ print key, value
+
+
+By default, Ceph authentication is ``on``. Your application will need to know
+the location of the keyring. The ``python-ceph`` module doesn't assume a default
+keyring location, so you need to specify the path. The easiest way to specify
+the keyring is to add it to the Ceph configuration file. The following Ceph
+configuration file example uses the ``client.admin`` keyring you generated with
+``ceph-deploy``.
+
+.. code-block:: ini
+ :linenos:
+
+ [global]
+ ...
+ keyring=/path/to/keyring/ceph.client.admin.keyring
+
+
+Manage Pools
+------------
+
+When connected to the cluster, the ``Rados`` API allows you to manage pools. You
+can list pools, check for the existence of a pool, create a pool and delete a
+pool.
+
+.. code-block:: python
+ :linenos:
+ :emphasize-lines: 6, 13, 18, 25
+
+ print "\n\nPool Operations"
+ print "==============="
+
+ print "\nAvailable Pools"
+ print "----------------"
+ pools = cluster.list_pools()
+
+ for pool in pools:
+ print pool
+
+ print "\nCreate 'test' Pool"
+ print "------------------"
+ cluster.create_pool('test')
+
+ print "\nPool named 'test' exists: " + str(cluster.pool_exists('test'))
+ print "\nVerify 'test' Pool Exists"
+ print "-------------------------"
+ pools = cluster.list_pools()
+
+ for pool in pools:
+ print pool
+
+ print "\nDelete 'test' Pool"
+ print "------------------"
+ cluster.delete_pool('test')
+ print "\nPool named 'test' exists: " + str(cluster.pool_exists('test'))
+
+
+
+Input/Output Context
+--------------------
+
+Reading from and writing to the Ceph Storage Cluster requires an input/output
+context (ioctx). You can create an ioctx with the ``open_ioctx()`` method of the
+``Rados`` class. The ``ioctx_name`` parameter is the name of the pool you wish
+to use.
+
+.. code-block:: python
+ :linenos:
+
+ ioctx = cluster.open_ioctx('data')
+
+
+Once you have an I/O context, you can read/write objects, extended attributes,
+and perform a number of other operations. After you complete operations, ensure
+that you close the connection. For example:
+
+.. code-block:: python
+ :linenos:
+
+ print "\nClosing the connection."
+ ioctx.close()
+
+
+Writing, Reading and Removing Objects
+-------------------------------------
+
+Once you create an I/O context, you can write objects to the cluster. If you
+write to an object that doesn't exist, Ceph creates it. If you write to an
+object that exists, Ceph overwrites it (except when you specify a range, in
+which case it only overwrites that range). You may read objects (and object
+ranges) from the cluster. You may also remove objects from the cluster. For example:
+
+.. code-block:: python
+ :linenos:
+ :emphasize-lines: 2, 5, 8
+
+ print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'."
+ ioctx.write_full("hw", "Hello World!")
+
+ print "\n\nContents of object 'hw'\n------------------------\n"
+ print ioctx.read("hw")
+
+ print "\nRemoving object 'hw'"
+ ioctx.remove_object("hw")
+
+
+Writing and Reading XATTRS
+--------------------------
+
+Once you create an object, you can write extended attributes (XATTRs) to
+the object and read XATTRs from the object. For example:
+
+.. code-block:: python
+ :linenos:
+ :emphasize-lines: 2, 5
+
+ print "\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'"
+ ioctx.set_xattr("hw", "lang", "en_US")
+
+ print "\n\nGetting XATTR 'lang' from object 'hw'\n"
+ print ioctx.get_xattr("hw", "lang")
+
+
+Listing Objects
+---------------
+
+If you want to examine the list of objects in a pool, you may
+retrieve the list of objects and iterate over them with the object iterator.
+For example:
+
+.. code-block:: python
+ :linenos:
+ :emphasize-lines: 1, 6, 7
+
+    object_iterator = ioctx.list_objects()
+
+    while True:
+
+        try:
+            rados_object = object_iterator.next()
+            print "Object contents = " + rados_object.read()
+
+        except StopIteration:
+            break
+
+The ``Object`` class provides a file-like interface to an object, allowing
+you to read and write content and extended attributes. Object operations using
+the I/O context provide additional functionality and asynchronous capabilities.
+
+
+Cluster Handle API
+==================
+
+The ``Rados`` class provides an interface into the Ceph Storage Cluster.
+
+
+Configuration
+-------------
+
+The ``Rados`` class provides methods for getting and setting configuration
+values, reading the Ceph configuration file, and parsing arguments. You
+do not need to be connected to the Ceph Storage Cluster to invoke the following
+methods. See `Storage Cluster Configuration`_ for details on settings.
+
+.. currentmodule:: rados
+.. automethod:: Rados.conf_get(option)
+.. automethod:: Rados.conf_set(option, val)
+.. automethod:: Rados.conf_read_file(path=None)
+.. automethod:: Rados.conf_parse_argv(args)
+.. automethod:: Rados.version()
+
+
+Connection Management
+---------------------
+
+Once you configure your cluster handle, you may connect to the cluster, check
+the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown)
+from the cluster. You may also assert that the cluster handle is in a particular
+state (e.g., "configuring", "connecting", etc.).
+
+
+.. automethod:: Rados.connect(timeout=0)
+.. automethod:: Rados.shutdown()
+.. automethod:: Rados.get_fsid()
+.. automethod:: Rados.get_cluster_stats()
+.. automethod:: Rados.require_state(*args)
+
+
+Pool Operations
+---------------
+
+To use pool operation methods, you must connect to the Ceph Storage Cluster
+first. You may list the available pools, create a pool, check to see if a pool
+exists, and delete a pool.
+
+.. automethod:: Rados.list_pools()
+.. automethod:: Rados.create_pool(pool_name, auid=None, crush_rule=None)
+.. automethod:: Rados.pool_exists()
+.. automethod:: Rados.delete_pool(pool_name)
+
+
+
+Input/Output Context API
+========================
+
+To write data to and read data from the Ceph Object Store, you must create
+an Input/Output context (ioctx). The `Rados` class provides an `open_ioctx()`
+method. The remaining ``ioctx`` operations involve invoking methods of the
+`Ioctx` and other classes.
+
+.. automethod:: Rados.open_ioctx(ioctx_name)
+.. automethod:: Ioctx.require_ioctx_open()
+.. automethod:: Ioctx.get_stats()
+.. automethod:: Ioctx.change_auid(auid)
+.. automethod:: Ioctx.get_last_version()
+.. automethod:: Ioctx.close()
+
+
+.. Pool Snapshots
+.. --------------
+
+.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state.
+.. Whereas, basic pool operations only require a connection to the cluster,
+.. snapshots require an I/O context.
+
+.. Ioctx.create_snap(self, snap_name)
+.. Ioctx.list_snaps(self)
+.. SnapIterator.next(self)
+.. Snap.get_timestamp(self)
+.. Ioctx.lookup_snap(self, snap_name)
+.. Ioctx.remove_snap(self, snap_name)
+
+.. not published. This doesn't seem ready yet.
+
+Object Operations
+-----------------
+
+The Ceph Storage Cluster stores data as objects. You can read and write objects
+synchronously or asynchronously. You can read and write from offsets. An object
+has a name (or key) and data.
+
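+For example, an asynchronous write might look like the following sketch (the
+``on_complete`` callback name is illustrative; the callback receives the
+completion object, which you can also wait on directly):
+
+.. code-block:: python
+    :linenos:
+
+    def on_complete(completion):
+        print "aio_write completed"
+
+    completion = ioctx.aio_write("hw", "Hello World!", 0, on_complete)
+    completion.wait_for_complete()
+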
+
+.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None)
+.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None)
+.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None)
+.. automethod:: Ioctx.write(key, data, offset=0)
+.. automethod:: Ioctx.write_full(key, data)
+.. automethod:: Ioctx.aio_flush()
+.. automethod:: Ioctx.set_locator_key(loc_key)
+.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete)
+.. automethod:: Ioctx.read(key, length=8192, offset=0)
+.. automethod:: Ioctx.stat(key)
+.. automethod:: Ioctx.trunc(key, size)
+.. automethod:: Ioctx.remove_object(key)
+
+
+Object Extended Attributes
+--------------------------
+
+You may set extended attributes (XATTRs) on an object. You can retrieve a list
+of objects or XATTRs and iterate over them.
+
+.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value)
+.. automethod:: Ioctx.get_xattrs(oid)
+.. automethod:: XattrIterator.next()
+.. automethod:: Ioctx.get_xattr(key, xattr_name)
+.. automethod:: Ioctx.rm_xattr(key, xattr_name)
+
+
+
+Object Interface
+================
+
+From an I/O context, you can retrieve a list of objects from a pool and iterate
+over them. The object interface makes each object look like a file, and
+you may perform synchronous operations on the objects. For asynchronous
+operations, you should use the I/O context methods.
+
+.. automethod:: Ioctx.list_objects()
+.. automethod:: ObjectIterator.next()
+.. automethod:: Object.read(length = 1024*1024)
+.. automethod:: Object.write(string_to_write)
+.. automethod:: Object.get_xattrs()
+.. automethod:: Object.get_xattr(xattr_name)
+.. automethod:: Object.set_xattr(xattr_name, xattr_value)
+.. automethod:: Object.rm_xattr(xattr_name)
+.. automethod:: Object.stat()
+.. automethod:: Object.remove()
+
+
+
+
+.. _Getting Started: ../../../start
+.. _Storage Cluster Configuration: ../../configuration
+.. _Getting librados for Python: ../librados-intro#getting-librados-for-python
diff --git a/src/ceph/doc/rados/command/list-inconsistent-obj.json b/src/ceph/doc/rados/command/list-inconsistent-obj.json
new file mode 100644
index 0000000..76ca43e
--- /dev/null
+++ b/src/ceph/doc/rados/command/list-inconsistent-obj.json
@@ -0,0 +1,195 @@
+{
+ "$schema": "http://json-schema.org/draft-04/schema#",
+ "type": "object",
+ "properties": {
+ "epoch": {
+ "description": "Scrub epoch",
+ "type": "integer"
+ },
+ "inconsistents": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "object": {
+ "description": "Identify a Ceph object",
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string"
+ },
+ "nspace": {
+ "type": "string"
+ },
+ "locator": {
+ "type": "string"
+ },
+ "version": {
+ "type": "integer",
+ "minimum": 0
+ },
+ "snap": {
+ "oneOf": [
+ {
+ "type": "string",
+ "enum": [ "head", "snapdir" ]
+ },
+ {
+ "type": "integer",
+ "minimum": 0
+ }
+ ]
+ }
+ },
+ "required": [
+ "name",
+ "nspace",
+ "locator",
+ "version",
+ "snap"
+ ]
+ },
+ "selected_object_info": {
+ "type": "string"
+ },
+ "union_shard_errors": {
+ "description": "Union of all shard errors",
+ "type": "array",
+ "items": {
+ "enum": [
+ "missing",
+ "stat_error",
+ "read_error",
+ "data_digest_mismatch_oi",
+ "omap_digest_mismatch_oi",
+ "size_mismatch_oi",
+ "ec_hash_error",
+ "ec_size_error",
+ "oi_attr_missing",
+ "oi_attr_corrupted",
+ "obj_size_oi_mismatch",
+ "ss_attr_missing",
+ "ss_attr_corrupted"
+ ]
+ },
+ "minItems": 0,
+ "uniqueItems": true
+ },
+ "errors": {
+ "description": "Errors related to the analysis of this object",
+ "type": "array",
+ "items": {
+ "enum": [
+ "object_info_inconsistency",
+ "data_digest_mismatch",
+ "omap_digest_mismatch",
+ "size_mismatch",
+ "attr_value_mismatch",
+ "attr_name_mismatch"
+ ]
+ },
+ "minItems": 0,
+ "uniqueItems": true
+ },
+ "shards": {
+ "description": "All found or expected shards",
+ "type": "array",
+ "items": {
+ "description": "Information about a particular shard of object",
+ "type": "object",
+ "properties": {
+ "object_info": {
+ "type": "string"
+ },
+ "shard": {
+ "type": "integer"
+ },
+ "osd": {
+ "type": "integer"
+ },
+ "primary": {
+ "type": "boolean"
+ },
+ "size": {
+ "type": "integer"
+ },
+ "omap_digest": {
+ "description": "Hex representation (e.g. 0x1abd1234)",
+ "type": "string"
+ },
+ "data_digest": {
+ "description": "Hex representation (e.g. 0x1abd1234)",
+ "type": "string"
+ },
+ "errors": {
+ "description": "Errors with this shard",
+ "type": "array",
+ "items": {
+ "enum": [
+ "missing",
+ "stat_error",
+ "read_error",
+ "data_digest_mismatch_oi",
+ "omap_digest_mismatch_oi",
+ "size_mismatch_oi",
+ "ec_hash_error",
+ "ec_size_error",
+ "oi_attr_missing",
+ "oi_attr_corrupted",
+ "obj_size_oi_mismatch",
+ "ss_attr_missing",
+ "ss_attr_corrupted"
+ ]
+ },
+ "minItems": 0,
+ "uniqueItems": true
+ },
+ "attrs": {
+ "description": "If any shard's attr error is set then all attrs are here",
+ "type": "array",
+ "items": {
+ "description": "Information about a particular shard of object",
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string"
+ },
+ "value": {
+ "type": "string"
+ },
+ "Base64": {
+ "type": "boolean"
+ }
+ },
+ "required": [
+ "name",
+ "value",
+ "Base64"
+ ],
+ "additionalProperties": false,
+ "minItems": 1
+ }
+ }
+ },
+ "required": [
+ "osd",
+ "primary",
+ "errors"
+ ]
+ }
+ }
+ },
+ "required": [
+ "object",
+ "union_shard_errors",
+ "errors",
+ "shards"
+ ]
+ }
+ }
+ },
+ "required": [
+ "epoch",
+ "inconsistents"
+ ]
+}
diff --git a/src/ceph/doc/rados/command/list-inconsistent-snap.json b/src/ceph/doc/rados/command/list-inconsistent-snap.json
new file mode 100644
index 0000000..0da6b0f
--- /dev/null
+++ b/src/ceph/doc/rados/command/list-inconsistent-snap.json
@@ -0,0 +1,87 @@
+{
+ "$schema": "http://json-schema.org/draft-04/schema#",
+ "type": "object",
+ "properties": {
+ "epoch": {
+ "description": "Scrub epoch",
+ "type": "integer"
+ },
+ "inconsistents": {
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string"
+ },
+ "nspace": {
+ "type": "string"
+ },
+ "locator": {
+ "type": "string"
+ },
+ "snap": {
+ "oneOf": [
+ {
+ "type": "string",
+ "enum": [
+ "head",
+ "snapdir"
+ ]
+ },
+ {
+ "type": "integer",
+ "minimum": 0
+ }
+ ]
+ },
+ "errors": {
+ "description": "Errors for this object's snap",
+ "type": "array",
+ "items": {
+ "enum": [
+ "ss_attr_missing",
+ "ss_attr_corrupted",
+ "oi_attr_missing",
+ "oi_attr_corrupted",
+ "snapset_mismatch",
+ "head_mismatch",
+ "headless",
+ "size_mismatch",
+ "extra_clones",
+ "clone_missing"
+ ]
+ },
+ "minItems": 1,
+ "uniqueItems": true
+ },
+ "missing": {
+ "description": "List of missing clones if clone_missing error set",
+ "type": "array",
+ "items": {
+ "type": "integer"
+ }
+ },
+ "extra_clones": {
+ "description": "List of extra clones if extra_clones error set",
+ "type": "array",
+ "items": {
+ "type": "integer"
+ }
+ }
+ },
+ "required": [
+ "name",
+ "nspace",
+ "locator",
+ "snap",
+ "errors"
+ ]
+ }
+ }
+ },
+ "required": [
+ "epoch",
+ "inconsistents"
+ ]
+}
diff --git a/src/ceph/doc/rados/configuration/auth-config-ref.rst b/src/ceph/doc/rados/configuration/auth-config-ref.rst
new file mode 100644
index 0000000..eb14fa4
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/auth-config-ref.rst
@@ -0,0 +1,432 @@
+========================
+ Cephx Config Reference
+========================
+
+The ``cephx`` protocol is enabled by default. Cryptographic authentication has
+some computational costs, though they should generally be quite low. If the
+network environment connecting your client and server hosts is very safe and
+you cannot afford the computational cost of authentication, you can turn it
+off. **This is not generally recommended**.
+
+.. note:: If you disable authentication, you are at risk of a man-in-the-middle
+ attack altering your client/server messages, which could lead to disastrous
+ security effects.
+
+For creating users, see `User Management`_. For details on the architecture
+of Cephx, see `Architecture - High Availability Authentication`_.
+
+
+Deployment Scenarios
+====================
+
+There are two main scenarios for deploying a Ceph cluster, which impact
+how you initially configure Cephx. Most first time Ceph users use
+``ceph-deploy`` to create a cluster (easiest). For clusters using
+other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need
+to use the manual procedures or configure your deployment tool to
+bootstrap your monitor(s).
+
+ceph-deploy
+-----------
+
+When you deploy a cluster with ``ceph-deploy``, you do not have to bootstrap the
+monitor manually or create the ``client.admin`` user or keyring. The steps you
+execute in the `Storage Cluster Quick Start`_ will invoke ``ceph-deploy`` to do
+that for you.
+
+When you execute ``ceph-deploy new {initial-monitor(s)}``, Ceph will create a
+monitor keyring for you (only used to bootstrap monitors), and it will generate
+an initial Ceph configuration file for you, which contains the following
+authentication settings, indicating that Ceph enables authentication by
+default::
+
+ auth_cluster_required = cephx
+ auth_service_required = cephx
+ auth_client_required = cephx
+
+When you execute ``ceph-deploy mon create-initial``, Ceph will bootstrap the
+initial monitor(s) and retrieve a ``ceph.client.admin.keyring`` file containing
+the key for the ``client.admin`` user. Additionally, it will retrieve keyrings
+that give the ``ceph-deploy`` and ``ceph-disk`` utilities the ability to prepare
+and activate OSDs and metadata servers.
+
+When you execute ``ceph-deploy admin {node-name}`` (**note:** Ceph must be
+installed first), you are pushing a Ceph configuration file and the
+``ceph.client.admin.keyring`` to the ``/etc/ceph`` directory of the node. You
+will be able to execute Ceph administrative functions as ``root`` on the command
+line of that node.
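+
+Taken together, a minimal bootstrap sequence looks like the following (the
+hostnames ``node1`` and ``admin-node`` are placeholders for your own nodes)::
+
+    ceph-deploy new node1
+    ceph-deploy mon create-initial
+    ceph-deploy admin admin-node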
+
+
+Manual Deployment
+-----------------
+
+When you deploy a cluster manually, you have to bootstrap the monitor manually
+and create the ``client.admin`` user and keyring. To bootstrap monitors, follow
+the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are
+the logical steps you must perform when using third party deployment tools like
+Chef, Puppet, Juju, etc.
+
+
+Enabling/Disabling Cephx
+========================
+
+Enabling Cephx requires that you have deployed keys for your monitors,
+OSDs and metadata servers. If you are simply toggling Cephx on / off,
+you do not have to repeat the bootstrapping procedures.
+
+
+Enabling Cephx
+--------------
+
+When ``cephx`` is enabled, Ceph will look for the keyring in the default search
+path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override
+this location by adding a ``keyring`` option in the ``[global]`` section of
+your `Ceph configuration`_ file, but this is not recommended.
+
+Execute the following procedures to enable ``cephx`` on a cluster with
+authentication disabled. If you (or your deployment utility) have already
+generated the keys, you may skip the steps related to generating keys.
+
+#. Create a ``client.admin`` key, and save a copy of the key for your client
+ host::
+
+ ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring
+
+ **Warning:** This will clobber any existing
+   ``/etc/ceph/ceph.client.admin.keyring`` file. Do not perform this step if a
+ deployment tool has already done it for you. Be careful!
+
+#. Create a keyring for your monitor cluster and generate a monitor
+ secret key. ::
+
+ ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
+
+#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's
+ ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``,
+ use the following::
+
+ cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring
+
+#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number::
+
+ ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring
+
+#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter::
+
+ ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mds/ceph-{$id}/keyring
+
+#. Enable ``cephx`` authentication by setting the following options in the
+ ``[global]`` section of your `Ceph configuration`_ file::
+
+ auth cluster required = cephx
+ auth service required = cephx
+ auth client required = cephx
+
+
+#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
+
+For details on bootstrapping a monitor manually, see `Manual Deployment`_.
+
+
+
+Disabling Cephx
+---------------
+
+The following procedure describes how to disable Cephx. If your cluster
+environment is relatively safe, you can avoid the computational expense of
+running authentication by disabling it. **We do not recommend it.** However,
+it may be easier during setup and/or troubleshooting to temporarily disable
+authentication.
+
+#. Disable ``cephx`` authentication by setting the following options in the
+ ``[global]`` section of your `Ceph configuration`_ file::
+
+ auth cluster required = none
+ auth service required = none
+ auth client required = none
+
+
+#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
+
+
+Configuration Settings
+======================
+
+Enablement
+----------
+
+
+``auth cluster required``
+
+:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``,
+ ``ceph-osd``, and ``ceph-mds``) must authenticate with
+ each other. Valid settings are ``cephx`` or ``none``.
+
+:Type: String
+:Required: No
+:Default: ``cephx``.
+
+
+``auth service required``
+
+:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients
+ to authenticate with the Ceph Storage Cluster in order to access
+ Ceph services. Valid settings are ``cephx`` or ``none``.
+
+:Type: String
+:Required: No
+:Default: ``cephx``.
+
+
+``auth client required``
+
+:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to
+ authenticate with the Ceph Client. Valid settings are ``cephx``
+ or ``none``.
+
+:Type: String
+:Required: No
+:Default: ``cephx``.
+
+
+.. index:: keys; keyring
+
+Keys
+----
+
+When you run Ceph with authentication enabled, ``ceph`` administrative commands
+and Ceph Clients require authentication keys to access the Ceph Storage Cluster.
+
+The most common way to provide these keys to the ``ceph`` administrative
+commands and clients is to include a Ceph keyring under the ``/etc/ceph``
+directory. For Cuttlefish and later releases using ``ceph-deploy``, the filename
+is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``).
+If you include the keyring under the ``/etc/ceph`` directory, you don't need to
+specify a ``keyring`` entry in your Ceph configuration file.
+
+We recommend copying the Ceph Storage Cluster's keyring file to nodes where you
+will run administrative commands, because it contains the ``client.admin`` key.
+
+You may use ``ceph-deploy admin`` to perform this task. See `Create an Admin
+Host`_ for details. To perform this step manually, execute the following::
+
+ sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
+
+.. tip:: Ensure the ``ceph.client.admin.keyring`` file has appropriate
+   permissions set (e.g., ``chmod 644``) on your client machine.
+
+You may specify the key itself in the Ceph configuration file using the ``key``
+setting (not recommended), or a path to a keyfile using the ``keyfile`` setting.
+
+
+``keyring``
+
+:Description: The path to the keyring file.
+:Type: String
+:Required: No
+:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin``
+
+
+``keyfile``
+
+:Description: The path to a key file (i.e., a file containing only the key).
+:Type: String
+:Required: No
+:Default: None
+
+
+``key``
+
+:Description: The key (i.e., the text string of the key itself). Not recommended.
+:Type: String
+:Required: No
+:Default: None
+
+
+Daemon Keyrings
+---------------
+
+Administrative users or deployment tools (e.g., ``ceph-deploy``) may generate
+daemon keyrings in the same way as generating user keyrings. By default, Ceph
+stores daemon keyrings inside the daemon's data directory. The default keyring
+locations, and the capabilities necessary for the daemon to function, are shown
+below.
+
+``ceph-mon``
+
+:Location: ``$mon_data/keyring``
+:Capabilities: ``mon 'allow *'``
+
+``ceph-osd``
+
+:Location: ``$osd_data/keyring``
+:Capabilities: ``mon 'allow profile osd' osd 'allow *'``
+
+``ceph-mds``
+
+:Location: ``$mds_data/keyring``
+:Capabilities: ``mds 'allow' mon 'allow profile mds' osd 'allow rwx'``
+
+``radosgw``
+
+:Location: ``$rgw_data/keyring``
+:Capabilities: ``mon 'allow rwx' osd 'allow rwx'``
+
+
+.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no
+ capabilities, and is not part of the cluster ``auth`` database.
+
+The daemon data directory locations default to directories of the form::
+
+ /var/lib/ceph/$type/$cluster-$id
+
+For example, ``osd.12`` would be::
+
+ /var/lib/ceph/osd/ceph-12
+
+You can override these locations, but it is not recommended.
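+
+For example, to generate a keyring for ``osd.12`` manually with the
+capabilities shown above (a sketch; deployment tools normally do this for
+you)::
+
+    ceph auth get-or-create osd.12 mon 'allow profile osd' osd 'allow *' -o /var/lib/ceph/osd/ceph-12/keyring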
+
+
+.. index:: signatures
+
+Signatures
+----------
+
+In Ceph Bobtail and subsequent versions, we prefer that Ceph authenticate all
+ongoing messages between the entities using the session key set up for that
+initial authentication. However, Argonaut and earlier Ceph daemons do not know
+how to perform ongoing message authentication. To maintain backward
+compatibility (e.g., running both Bobtail and Argonaut daemons in the same
+cluster), message signing is **off** by default. If you are running Bobtail or
+later daemons exclusively, configure Ceph to require signatures.
+
+Like other parts of Ceph authentication, Ceph provides fine-grained control so
+you can enable/disable signatures for service messages between the client and
+Ceph, and you can enable/disable signatures for messages between Ceph daemons.
+
+
+``cephx require signatures``
+
+:Description: If set to ``true``, Ceph requires signatures on all message
+ traffic between the Ceph Client and the Ceph Storage Cluster, and
+ between daemons comprising the Ceph Storage Cluster.
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``cephx cluster require signatures``
+
+:Description: If set to ``true``, Ceph requires signatures on all message
+ traffic between Ceph daemons comprising the Ceph Storage Cluster.
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``cephx service require signatures``
+
+:Description: If set to ``true``, Ceph requires signatures on all message
+ traffic between Ceph Clients and the Ceph Storage Cluster.
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``cephx sign messages``
+
+:Description: If the Ceph version supports message signing, Ceph will sign
+ all messages so they cannot be spoofed.
+
+:Type: Boolean
+:Default: ``true``
+
+
+Time to Live
+------------
+
+``auth service ticket ttl``
+
+:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for
+ authentication, the Ceph Storage Cluster assigns the ticket a
+ time to live.
+
+:Type: Double
+:Default: ``60*60``
+
+
+Backward Compatibility
+======================
+
+For Cuttlefish and earlier releases, see `Cephx`_.
+
+In Ceph Argonaut v0.48 and earlier versions, if you enable ``cephx``
+authentication, Ceph only authenticates the initial communication between the
+client and daemon; Ceph does not authenticate the subsequent messages they send
+to each other, which has security implications. In Ceph Bobtail and subsequent
+versions, Ceph authenticates all ongoing messages between the entities using the
+session key set up for that initial authentication.
+
+We identified a backward compatibility issue between Argonaut v0.48 (and prior
+versions) and Bobtail (and subsequent versions). During testing, if you
+attempted to use Argonaut (and earlier) daemons with Bobtail (and later)
+daemons, the Argonaut daemons did not know how to perform ongoing message
+authentication, while the Bobtail versions of the daemons insist on
+authenticating message traffic subsequent to the initial
+request/response--making it impossible for Argonaut (and prior) daemons to
+interoperate with Bobtail (and subsequent) daemons.
+
+We have addressed this potential problem by providing a means for Argonaut (and
+prior) systems to interact with Bobtail (and subsequent) systems. Here's how it
+works: by default, the newer systems will not insist on seeing signatures from
+older systems that do not know how to perform them, but will simply accept such
+messages without authenticating them. This new default behavior provides the
+advantage of allowing two different releases to interact. **We do not recommend
+this as a long term solution**. Allowing newer daemons to forgo ongoing
+authentication has the unfortunate security effect that an attacker with control
+of some of your machines or some access to your network can disable session
+security simply by claiming to be unable to sign messages.
+
+.. note:: Even if you don't actually run any old versions of Ceph,
+ the attacker may be able to force some messages to be accepted unsigned in the
+ default scenario. While running Cephx with the default scenario, Ceph still
+ authenticates the initial communication, but you lose desirable session security.
+
+If you know that you are not running older versions of Ceph, or you are willing
+to accept that old servers and new servers will not be able to interoperate, you
+can eliminate this security risk. If you do so, any Ceph system that is new
+enough to support session authentication and that has Cephx enabled will reject
+unsigned messages. To preclude new servers from interacting with old servers,
+include the following in the ``[global]`` section of your `Ceph
+configuration`_ file directly below the line that specifies the use of Cephx
+for authentication::
+
+ cephx require signatures = true ; everywhere possible
+
+You can also selectively require signatures for cluster internal
+communications only, separate from client-facing service::
+
+ cephx cluster require signatures = true ; for cluster-internal communication
+ cephx service require signatures = true ; for client-facing service
+
+An option to make a client require signatures from the cluster is not
+yet implemented.
+
+**We recommend migrating all daemons to the newer versions and enabling the
+foregoing flag** at the nearest practical time so that you may avail yourself
+of the enhanced authentication.
+
+.. note:: Ceph kernel modules do not support signatures yet.
+
+
+.. _Storage Cluster Quick Start: ../../../start/quick-ceph-deploy/
+.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping
+.. _Operating a Cluster: ../../operations/operating
+.. _Manual Deployment: ../../../install/manual-deployment
+.. _Cephx: http://docs.ceph.com/docs/cuttlefish/rados/configuration/auth-config-ref/
+.. _Ceph configuration: ../ceph-conf
+.. _Create an Admin Host: ../../deployment/ceph-deploy-admin
+.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication
+.. _User Management: ../../operations/user-management
diff --git a/src/ceph/doc/rados/configuration/bluestore-config-ref.rst b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst
new file mode 100644
index 0000000..8d8ace6
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst
@@ -0,0 +1,297 @@
+==========================
+BlueStore Config Reference
+==========================
+
+Devices
+=======
+
+BlueStore manages either one, two, or (in certain cases) three storage
+devices.
+
+In the simplest case, BlueStore consumes a single (primary) storage
+device. The storage device is normally partitioned into two parts:
+
+#. A small partition is formatted with XFS and contains basic metadata
+   for the OSD. This *data directory* includes information about the
+   OSD (its identifier, which cluster it belongs to, and its private
+   keyring).
+
+#. The rest of the device is normally a large partition that is managed
+   directly by BlueStore and contains all of the actual data. This
+   *primary device* is normally identified by a ``block`` symlink in the
+   data directory.
+
+It is also possible to deploy BlueStore across two additional devices:
+
+* A *WAL device* can be used for BlueStore's internal journal or
+ write-ahead log. It is identified by the ``block.wal`` symlink in
+ the data directory. It is only useful to use a WAL device if the
+ device is faster than the primary device (e.g., when it is on an SSD
+ and the primary device is an HDD).
+* A *DB device* can be used for storing BlueStore's internal metadata.
+ BlueStore (or rather, the embedded RocksDB) will put as much
+ metadata as it can on the DB device to improve performance. If the
+ DB device fills up, metadata will spill back onto the primary device
+ (where it would have been otherwise). Again, it is only helpful to
+ provision a DB device if it is faster than the primary device.
+
+If there is only a small amount of fast storage available (e.g., less
+than a gigabyte), we recommend using it as a WAL device. If there is
+more, provisioning a DB device makes more sense. The BlueStore
+journal will always be placed on the fastest device available, so
+using a DB device will provide the same benefit that the WAL device
+would while *also* allowing additional metadata to be stored there (if
+it will fit).
+
+A single-device BlueStore OSD can be provisioned with::
+
+ ceph-disk prepare --bluestore <device>
+
+To specify a WAL device and/or DB device, ::
+
+  ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
+
+Cache size
+==========
+
+The amount of memory consumed by each OSD for BlueStore's cache is
+determined by the ``bluestore_cache_size`` configuration option. If
+that config option is not set (i.e., remains at 0), there is a
+different default value that is used depending on whether an HDD or
+SSD is used for the primary device (set by the
+``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
+options).
+
+BlueStore and the rest of the Ceph OSD currently do the best they can
+to stick to the budgeted memory. Note that on top of the configured
+cache size, there is also memory consumed by the OSD itself, and
+generally some overhead due to memory fragmentation and other
+allocator overhead.
+
+The configured cache memory budget can be used in a few different ways:
+
+* Key/Value metadata (i.e., RocksDB's internal cache)
+* BlueStore metadata
+* BlueStore data (i.e., recently read or written object data)
+
+Cache memory usage is governed by the following options:
+``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and
+``bluestore_cache_kv_max``. The fraction of the cache devoted to data
+is 1.0 minus the meta and kv ratios. The memory devoted to kv
+metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max``
+since our testing indicates there are diminishing returns beyond a
+certain point.
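+
+For example, a ``ceph.conf`` fragment like the following (the sizes are
+illustrative, not tuned recommendations) gives each OSD a 2 GB cache and caps
+the RocksDB share of it at 256 MB:
+
+.. code-block:: ini
+
+    [osd]
+    # 2 GB total BlueStore cache per OSD, overriding the HDD/SSD defaults
+    bluestore cache size = 2147483648
+    # cap the key/value (RocksDB) share of the cache at 256 MB
+    bluestore cache kv max = 268435456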
+
+``bluestore_cache_size``
+
+:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead.
+:Type: Integer
+:Required: Yes
+:Default: ``0``
+
+``bluestore_cache_size_hdd``
+
+:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD.
+:Type: Integer
+:Required: Yes
+:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)
+
+``bluestore_cache_size_ssd``
+
+:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
+:Type: Integer
+:Required: Yes
+:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)
+
+``bluestore_cache_meta_ratio``
+
+:Description: The ratio of cache devoted to metadata.
+:Type: Floating point
+:Required: Yes
+:Default: ``.01``
+
+``bluestore_cache_kv_ratio``
+
+:Description: The ratio of cache devoted to key/value data (rocksdb).
+:Type: Floating point
+:Required: Yes
+:Default: ``.99``
+
+``bluestore_cache_kv_max``
+
+:Description: The maximum amount of cache devoted to key/value data (rocksdb).
+:Type: Floating point
+:Required: Yes
+:Default: ``512 * 1024*1024`` (512 MB)
+
+
+Checksums
+=========
+
+BlueStore checksums all metadata and data written to disk. Metadata
+checksumming is handled by RocksDB and uses `crc32c`. Data
+checksumming is done by BlueStore and can make use of `crc32c`,
+`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
+suitable for most purposes.
+
+Full data checksumming does increase the amount of metadata that
+BlueStore must store and manage. When possible, e.g., when clients
+hint that data is written and read sequentially, BlueStore will
+checksum larger blocks, but in many cases it must store a checksum
+value (usually 4 bytes) for every 4 kilobyte block of data.
+
+It is possible to use a smaller checksum value by truncating the
+checksum to two or one byte, reducing the metadata overhead. The
+trade-off is that the probability that a random error will not be
+detected is higher with a smaller checksum, going from about one in
+four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
+16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
+The smaller checksum values can be used by selecting `crc32c_16` or
+`crc32c_8` as the checksum algorithm.
+
+The *checksum algorithm* can be set either via a per-pool
+``csum_type`` property or the global config option. For example, ::
+
+ ceph osd pool set <pool-name> csum_type <algorithm>
+
+``bluestore_csum_type``
+
+:Description: The default checksum algorithm to use.
+:Type: String
+:Required: Yes
+:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
+:Default: ``crc32c``
+
+
+Inline Compression
+==================
+
+BlueStore supports inline compression using `snappy`, `zlib`, or
+`lz4`. Please note that the `lz4` compression plugin is not
+distributed in the official release.
+
+Whether data in BlueStore is compressed is determined by a combination
+of the *compression mode* and any hints associated with a write
+operation. The modes are:
+
+* **none**: Never compress data.
+* **passive**: Do not compress data unless the write operation has a
+  *compressible* hint set.
+* **aggressive**: Compress data unless the write operation has an
+  *incompressible* hint set.
+* **force**: Try to compress data no matter what.
+
+For more information about the *compressible* and *incompressible* IO
+hints, see :doc:`/api/librados/#rados_set_alloc_hint`.
+
+Note that regardless of the mode, if the size of the data chunk is not
+reduced sufficiently, the compressed version will not be used and the
+original (uncompressed) data will be stored. For example, if the
+``bluestore compression required ratio`` is set to ``.7``, the compressed
+data must be no more than 70% of the size of the original.
+
+The *compression mode*, *compression algorithm*, *compression required
+ratio*, *min blob size*, and *max blob size* can be set either via a
+per-pool property or a global config option. Pool properties can be
+set with::
+
+ ceph osd pool set <pool-name> compression_algorithm <algorithm>
+ ceph osd pool set <pool-name> compression_mode <mode>
+ ceph osd pool set <pool-name> compression_required_ratio <ratio>
+ ceph osd pool set <pool-name> compression_min_blob_size <size>
+ ceph osd pool set <pool-name> compression_max_blob_size <size>
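+
+The corresponding cluster-wide defaults, described below, can be set in
+``ceph.conf``; for example (an illustrative sketch, not a recommendation):
+
+.. code-block:: ini
+
+    [osd]
+    bluestore compression algorithm = snappy
+    bluestore compression mode = aggressive
+    bluestore compression required ratio = .875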
+
+``bluestore compression algorithm``
+
+:Description: The default compressor to use (if any) if the per-pool property
+ ``compression_algorithm`` is not set. Note that zstd is *not*
+ recommended for bluestore due to high CPU overhead when
+ compressing small amounts of data.
+:Type: String
+:Required: No
+:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd``
+:Default: ``snappy``
+
+``bluestore compression mode``
+
+:Description: The default policy for using compression if the per-pool property
+ ``compression_mode`` is not set. ``none`` means never use
+ compression. ``passive`` means use compression when
+ `clients hint`_ that data is compressible. ``aggressive`` means
+ use compression unless clients hint that data is not compressible.
+ ``force`` means use compression under all circumstances even if
+ the clients hint that the data is not compressible.
+:Type: String
+:Required: No
+:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
+:Default: ``none``
+
+``bluestore compression required ratio``
+
+:Description: The size of the data chunk after compression, relative
+              to the original size, must be at most this ratio in
+              order to store the compressed version.
+
+:Type: Floating point
+:Required: No
+:Default: .875
+
+``bluestore compression min blob size``
+
+:Description: Chunks smaller than this are never compressed.
+ The per-pool property ``compression_min_blob_size`` overrides
+ this setting.
+
+:Type: Unsigned Integer
+:Required: No
+:Default: 0
+
+``bluestore compression min blob size hdd``
+
+:Description: Default value of ``bluestore compression min blob size``
+ for rotational media.
+
+:Type: Unsigned Integer
+:Required: No
+:Default: 128K
+
+``bluestore compression min blob size ssd``
+
+:Description: Default value of ``bluestore compression min blob size``
+ for non-rotational (solid state) media.
+
+:Type: Unsigned Integer
+:Required: No
+:Default: 8K
+
+``bluestore compression max blob size``
+
+:Description: Chunks larger than this are broken into smaller blobs of at
+              most ``bluestore compression max blob size`` before being
+              compressed. The per-pool property
+              ``compression_max_blob_size`` overrides this setting.
+
+:Type: Unsigned Integer
+:Required: No
+:Default: 0
+
+``bluestore compression max blob size hdd``
+
+:Description: Default value of ``bluestore compression max blob size``
+ for rotational media.
+
+:Type: Unsigned Integer
+:Required: No
+:Default: 512K
+
+``bluestore compression max blob size ssd``
+
+:Description: Default value of ``bluestore compression max blob size``
+ for non-rotational (solid state) media.
+
+:Type: Unsigned Integer
+:Required: No
+:Default: 64K
+
+.. _clients hint: ../../api/librados/#rados_set_alloc_hint
diff --git a/src/ceph/doc/rados/configuration/ceph-conf.rst b/src/ceph/doc/rados/configuration/ceph-conf.rst
new file mode 100644
index 0000000..df88452
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/ceph-conf.rst
@@ -0,0 +1,629 @@
+==================
+ Configuring Ceph
+==================
+
+When you start the Ceph service, the initialization process activates a series
+of daemons that run in the background. A :term:`Ceph Storage Cluster` runs
+two types of daemons:
+
+- :term:`Ceph Monitor` (``ceph-mon``)
+- :term:`Ceph OSD Daemon` (``ceph-osd``)
+
+Ceph Storage Clusters that support the :term:`Ceph Filesystem` run at least one
+:term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support :term:`Ceph
+Object Storage` run Ceph Gateway daemons (``radosgw``). For your convenience,
+each daemon has a series of default values (*i.e.*, many are set by
+``ceph/src/common/config_opts.h``). You may override these settings with a Ceph
+configuration file.
+
+
+.. _ceph-conf-file:
+
+The Configuration File
+======================
+
+When you start a Ceph Storage Cluster, each daemon looks for a Ceph
+configuration file (i.e., ``ceph.conf`` by default) that provides the cluster's
+configuration settings. For manual deployments, you need to create a Ceph
+configuration file. For tools that create configuration files for you (*e.g.*,
+``ceph-deploy``, Chef, etc.), you may use the information contained herein as a
+reference. The Ceph configuration file defines:
+
+- Cluster Identity
+- Authentication settings
+- Cluster membership
+- Host names
+- Host addresses
+- Paths to keyrings
+- Paths to journals
+- Paths to data
+- Other runtime options
+
+The default Ceph configuration file locations in sequential order include:
+
+#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF``
+ environment variable)
+#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument)
+#. ``/etc/ceph/ceph.conf``
+#. ``~/.ceph/config``
+#. ``./ceph.conf`` (*i.e.,* in the current working directory)
+
+
+The Ceph configuration file uses an *ini* style syntax. You can add comments
+by preceding them with a pound sign (#) or a semi-colon (;). For example:
+
+.. code-block:: ini
+
+ # <--A number (#) sign precedes a comment.
+ ; A comment may be anything.
+ # Comments always follow a semi-colon (;) or a pound (#) on each line.
+ # The end of the line terminates a comment.
+ # We recommend that you provide comments in your configuration file(s).
+
+
+.. _ceph-conf-settings:
+
+Config Sections
+===============
+
+The configuration file can configure all Ceph daemons in a Ceph Storage Cluster,
+or all Ceph daemons of a particular type. To configure a series of daemons, the
+settings must be included under the processes that will receive the
+configuration as follows:
+
+``[global]``
+
+:Description: Settings under ``[global]`` affect all daemons in a Ceph Storage
+ Cluster.
+
+:Example: ``auth supported = cephx``
+
+``[osd]``
+
+:Description: Settings under ``[osd]`` affect all ``ceph-osd`` daemons in
+ the Ceph Storage Cluster, and override the same setting in
+ ``[global]``.
+
+:Example: ``osd journal size = 1000``
+
+``[mon]``
+
+:Description: Settings under ``[mon]`` affect all ``ceph-mon`` daemons in
+ the Ceph Storage Cluster, and override the same setting in
+ ``[global]``.
+
+:Example: ``mon addr = 10.0.0.101:6789``
+
+
+``[mds]``
+
+:Description: Settings under ``[mds]`` affect all ``ceph-mds`` daemons in
+ the Ceph Storage Cluster, and override the same setting in
+ ``[global]``.
+
+:Example: ``host = myserver01``
+
+``[client]``
+
+:Description: Settings under ``[client]`` affect all Ceph Clients
+ (e.g., mounted Ceph Filesystems, mounted Ceph Block Devices,
+ etc.).
+
+:Example: ``log file = /var/log/ceph/radosgw.log``
+
+
+Global settings affect all instances of all daemons in the Ceph Storage Cluster.
+Use the ``[global]`` setting for values that are common for all daemons in the
+Ceph Storage Cluster. You can override each ``[global]`` setting by:
+
+#. Changing the setting in a particular process type
+ (*e.g.,* ``[osd]``, ``[mon]``, ``[mds]`` ).
+
+#. Changing the setting in a particular process (*e.g.,* ``[osd.1]`` ).
+
+Overriding a global setting affects all child processes, except those that
+you specifically override in a particular daemon.
+
+A typical global setting involves activating authentication. For example:
+
+.. code-block:: ini
+
+ [global]
+ #Enable authentication between hosts within the cluster.
+ #v 0.54 and earlier
+ auth supported = cephx
+
+ #v 0.55 and after
+ auth cluster required = cephx
+ auth service required = cephx
+ auth client required = cephx
+
+
+You can specify settings that apply to a particular type of daemon. When you
+specify settings under ``[osd]``, ``[mon]`` or ``[mds]`` without specifying a
+particular instance, the setting will apply to all OSDs, monitors or metadata
+daemons respectively.
+
+A typical daemon-wide setting involves setting journal sizes, filestore
+settings, etc. For example:
+
+.. code-block:: ini
+
+ [osd]
+ osd journal size = 1000
+
+
+You may specify settings for particular instances of a daemon. You may specify
+an instance by entering its type, delimited by a period (.) and by the instance
+ID. The instance ID for a Ceph OSD Daemon is always numeric, but it may be
+alphanumeric for Ceph Monitors and Ceph Metadata Servers.
+
+.. code-block:: ini
+
+ [osd.1]
+ # settings affect osd.1 only.
+
+ [mon.a]
+ # settings affect mon.a only.
+
+ [mds.b]
+ # settings affect mds.b only.
+
+
+If the daemon you specify is a Ceph Gateway client, specify the daemon and the
+instance, delimited by a period (.). For example::
+
+ [client.radosgw.instance-name]
+ # settings affect client.radosgw.instance-name only.
+
+
+
+.. _ceph-metavariables:
+
+Metavariables
+=============
+
+Metavariables simplify Ceph Storage Cluster configuration dramatically. When a
+metavariable is set in a configuration value, Ceph expands the metavariable into
+a concrete value. Metavariables are very powerful when used within the
+``[global]``, ``[osd]``, ``[mon]``, ``[mds]`` or ``[client]`` sections of your
+configuration file. Ceph metavariables are similar to Bash shell expansion.
+
+Ceph supports the following metavariables:
+
+
+``$cluster``
+
+:Description: Expands to the Ceph Storage Cluster name. Useful when running
+ multiple Ceph Storage Clusters on the same hardware.
+
+:Example: ``/etc/ceph/$cluster.keyring``
+:Default: ``ceph``
+
+
+``$type``
+
+:Description: Expands to one of ``mds``, ``osd``, or ``mon``, depending on the
+ type of the instant daemon.
+
+:Example: ``/var/lib/ceph/$type``
+
+
+``$id``
+
+:Description: Expands to the daemon identifier. For ``osd.0``, this would be
+ ``0``; for ``mds.a``, it would be ``a``.
+
+:Example: ``/var/lib/ceph/$type/$cluster-$id``
+
+
+``$host``
+
+:Description: Expands to the host name of the instant daemon.
+
+
+``$name``
+
+:Description: Expands to ``$type.$id``.
+:Example: ``/var/run/ceph/$cluster-$name.asok``
+
+``$pid``
+
+:Description: Expands to daemon pid.
+:Example: ``/var/run/ceph/$cluster-$name-$pid.asok``
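+
+For example, with a cluster named ``ceph`` and a daemon named ``osd.0``, the
+following setting (a sketch showing the default value of ``admin socket``)
+expands to ``/var/run/ceph/ceph-osd.0.asok``:
+
+.. code-block:: ini
+
+    [global]
+    admin socket = /var/run/ceph/$cluster-$name.asok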
+
+
+.. _ceph-conf-common-settings:
+
+Common Settings
+===============
+
+The `Hardware Recommendations`_ section provides some hardware guidelines for
+configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph
+Node` to run multiple daemons. For example, a single node with multiple drives
+may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a
+particular type of process. For example, some nodes may run ``ceph-osd``
+daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may
+run ``ceph-mon`` daemons.
+
+Each node has a name identified by the ``host`` setting. Monitors also specify
+a network address and port (i.e., domain name or IP address) identified by the
+``addr`` setting. A basic configuration file will typically specify only
+minimal settings for each instance of monitor daemons. For example:
+
+.. code-block:: ini
+
+ [global]
+ mon_initial_members = ceph1
+ mon_host = 10.0.0.1
+
+
+.. important:: The ``host`` setting is the short name of the node (i.e., not
+   an FQDN). It is **NOT** an IP address either. Enter ``hostname -s`` on
+ the command line to retrieve the name of the node. Do not use ``host``
+ settings for anything other than initial monitors unless you are deploying
+ Ceph manually. You **MUST NOT** specify ``host`` under individual daemons
+ when using deployment tools like ``chef`` or ``ceph-deploy``, as those tools
+ will enter the appropriate values for you in the cluster map.
+
+
+.. _ceph-network-config:
+
+Networks
+========
+
+See the `Network Configuration Reference`_ for a detailed discussion about
+configuring a network for use with Ceph.
+
+
+Monitors
+========
+
+Ceph production clusters typically deploy with a minimum of three :term:`Ceph Monitor`
+daemons to ensure high availability should a monitor instance crash. At least
+three (3) monitors ensure that the Paxos algorithm can determine which version
+of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph
+Monitors in the quorum.
+
+.. note:: You may deploy Ceph with a single monitor, but if the instance fails,
+ the lack of other monitors may interrupt data service availability.
+
+Ceph Monitors typically listen on port ``6789``. For example:
+
+.. code-block:: ini
+
+ [mon.a]
+ host = hostName
+ mon addr = 150.140.130.120:6789
+
+By default, Ceph expects that you will store a monitor's data under the
+following path::
+
+ /var/lib/ceph/mon/$cluster-$id
+
+You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding
+directory. With metavariables fully expressed and a cluster named "ceph", the
+foregoing directory would evaluate to::
+
+ /var/lib/ceph/mon/ceph-a
+
+For additional details, see the `Monitor Config Reference`_.
+
+.. _Monitor Config Reference: ../mon-config-ref
+
+
+.. _ceph-osd-config:
+
+
+Authentication
+==============
+
+.. versionadded:: Bobtail 0.56
+
+For Bobtail (v 0.56) and beyond, you should expressly enable or disable
+authentication in the ``[global]`` section of your Ceph configuration file. ::
+
+ auth cluster required = cephx
+ auth service required = cephx
+ auth client required = cephx
+
+Additionally, you should enable message signing. See `Cephx Config Reference`_ for details.
+
+.. important:: When upgrading, we recommend expressly disabling authentication
+   first, then performing the upgrade. Once the upgrade is complete, re-enable
+   authentication.
+
+.. _Cephx Config Reference: ../auth-config-ref
+
+
+.. _ceph-monitor-config:
+
+
+OSDs
+====
+
+Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node
+has one OSD daemon running a filestore on one storage drive. A typical
+deployment specifies a journal size. For example:
+
+.. code-block:: ini
+
+ [osd]
+ osd journal size = 10000
+
+ [osd.0]
+ host = {hostname} #manual deployments only.
+
+
+By default, Ceph expects that you will store a Ceph OSD Daemon's data with the
+following path::
+
+ /var/lib/ceph/osd/$cluster-$id
+
+You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding
+directory. With metavariables fully expressed and a cluster named "ceph", the
+foregoing directory would evaluate to::
+
+ /var/lib/ceph/osd/ceph-0
+
+You may override this path using the ``osd data`` setting. We don't recommend
+changing the default location. Create the default directory on your OSD host.
+
+::
+
+ ssh {osd-host}
+ sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
+
+The ``osd data`` path ideally leads to a mount point on a device that is
+separate from the device storing and running the operating system and
+daemons. If the OSD is for a disk other than the OS disk, prepare it for
+use with Ceph, and mount it to the directory you just created::
+
+ ssh {new-osd-host}
+ sudo mkfs -t {fstype} /dev/{disk}
+ sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
+
+We recommend using the ``xfs`` file system when running
+:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and no
+longer tested.)
+
+See the `OSD Config Reference`_ for additional configuration details.
+
+
+Heartbeats
+==========
+
+During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons
+and report their findings to the Ceph Monitor. You do not have to provide any
+settings. However, if you have network latency issues, you may wish to modify
+the settings.
+
+See `Configuring Monitor/OSD Interaction`_ for additional details.
+
+
+.. _ceph-logging-and-debugging:
+
+Logs / Debugging
+================
+
+Sometimes you may encounter issues with Ceph that require
+modifying logging output and using Ceph's debugging. See `Debugging and
+Logging`_ for details on log rotation.
+
+.. _Debugging and Logging: ../../troubleshooting/log-and-debug
+
+
+Example ceph.conf
+=================
+
+.. literalinclude:: demo-ceph.conf
+ :language: ini
+
+.. _ceph-runtime-config:
+
+Runtime Changes
+===============
+
+Ceph allows you to make changes to the configuration of a ``ceph-osd``,
+``ceph-mon``, or ``ceph-mds`` daemon at runtime. This capability is quite
+useful for increasing/decreasing logging output, enabling/disabling debug
+settings, and even for runtime optimization. The following reflects runtime
+configuration usage::
+
+ ceph tell {daemon-type}.{id or *} injectargs --{name} {value} [--{name} {value}]
+
+Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply
+the runtime setting to all daemons of a particular type with ``*``, or specify
+a specific daemon's ID (i.e., its number or letter). For example, to increase
+debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following::
+
+ ceph tell osd.0 injectargs --debug-osd 20 --debug-ms 1
+
+In your ``ceph.conf`` file, you may use spaces when specifying a
+setting name. When specifying a setting name on the command line,
+ensure that you use an underscore or hyphen (``_`` or ``-``) between
+terms (e.g., ``debug osd`` becomes ``--debug-osd``).
+
+
+Viewing a Configuration at Runtime
+==================================
+
+If your Ceph Storage Cluster is running, and you would like to see the
+configuration settings from a running daemon, execute the following::
+
+ ceph daemon {daemon-type}.{id} config show | less
+
+If you are on a machine where osd.0 is running, the command would be::
+
+ ceph daemon osd.0 config show | less
+
+Reading Configuration Metadata at Runtime
+=========================================
+
+Information about the available configuration options is available via
+the ``config help`` command:
+
+::
+
+ ceph daemon {daemon-type}.{id} config help | less
+
+
+This metadata is primarily intended to be used when integrating other
+software with Ceph, such as graphical user interfaces. The output is
+a list of JSON objects, for example:
+
+::
+
+ {
+ "name": "mon_host",
+ "type": "std::string",
+ "level": "basic",
+ "desc": "list of hosts or addresses to search for a monitor",
+ "long_desc": "This is a comma, whitespace, or semicolon separated list of IP addresses or hostnames. Hostnames are resolved via DNS and all A or AAAA records are included in the search list.",
+ "default": "",
+ "daemon_default": "",
+ "tags": [],
+ "services": [
+ "common"
+ ],
+ "see_also": [],
+ "enum_values": [],
+ "min": "",
+ "max": ""
+ }
+
+type
+____
+
+The type of the setting, given as a C++ type name.
+
+level
+_____
+
+One of `basic`, `advanced`, `dev`. The `dev` options are not intended
+for use outside of development and testing.
+
+desc
+____
+
+A short description -- this is a sentence fragment suitable for display
+in small spaces like a single line in a list.
+
+long_desc
+_________
+
+A full description of what the setting does; this may be as long as needed.
+
+default
+_______
+
+The default value, if any.
+
+daemon_default
+______________
+
+An alternative default used for daemons (services) as opposed to clients.
+
+tags
+____
+
+A list of strings indicating topics to which this setting relates. Examples
+of tags are `performance` and `networking`.
+
+services
+________
+
+A list of strings indicating which Ceph services the setting relates to, such
+as `osd`, `mds`, `mon`. For settings that are relevant to any Ceph client
+or server, `common` is used.
+
+see_also
+________
+
+A list of strings indicating other configuration options that may also
+be of interest to a user setting this option.
+
+enum_values
+___________
+
+Optional: a list of strings indicating the valid settings.
+
+min, max
+________
+
+Optional: upper and lower (inclusive) bounds on valid settings.
+
+
+
+
+Running Multiple Clusters
+=========================
+
+With Ceph, you can run multiple Ceph Storage Clusters on the same hardware.
+Running multiple clusters provides a higher level of isolation compared to
+using different pools on the same cluster with different CRUSH rulesets. A
+separate cluster will have separate monitor, OSD and metadata server processes.
+When running Ceph with default settings, the default cluster name is ``ceph``,
+which means you would save your Ceph configuration file with the file name
+``ceph.conf`` in the ``/etc/ceph`` default directory.
+
+See `ceph-deploy new`_ for details.
+
+When you run multiple clusters, you must name your cluster and save the Ceph
+configuration file with the name of the cluster. For example, a cluster named
+``openstack`` will have a Ceph configuration file with the file name
+``openstack.conf`` in the ``/etc/ceph`` default directory.
+
+.. important:: Cluster names must consist of letters a-z and digits 0-9 only.
+
+Separate clusters imply separate data disks and journals, which are not shared
+between clusters. Referring to `Metavariables`_, the ``$cluster`` metavariable
+evaluates to the cluster name (i.e., ``openstack`` in the foregoing example).
+Various settings use the ``$cluster`` metavariable, including:
+
+- ``keyring``
+- ``admin socket``
+- ``log file``
+- ``pid file``
+- ``mon data``
+- ``mon cluster log file``
+- ``osd data``
+- ``osd journal``
+- ``mds data``
+- ``rgw data``
+
+See `General Settings`_, `OSD Settings`_, `Monitor Settings`_, `MDS Settings`_,
+`RGW Settings`_ and `Log Settings`_ for relevant path defaults that use the
+``$cluster`` metavariable.
+
+.. _General Settings: ../general-config-ref
+.. _OSD Settings: ../osd-config-ref
+.. _Monitor Settings: ../mon-config-ref
+.. _MDS Settings: ../../../cephfs/mds-config-ref
+.. _RGW Settings: ../../../radosgw/config-ref/
+.. _Log Settings: ../../troubleshooting/log-and-debug
+
+
+When creating default directories or files, you should use the cluster
+name at the appropriate places in the path. For example::
+
+ sudo mkdir /var/lib/ceph/osd/openstack-0
+ sudo mkdir /var/lib/ceph/mon/openstack-a
+
+.. important:: When running monitors on the same host, you should use
+ different ports. By default, monitors use port 6789. If you already
+ have monitors using port 6789, use a different port for your other cluster(s).
+
+To invoke a cluster other than the default ``ceph`` cluster, use the
+``-c {filename}.conf`` option with the ``ceph`` command. For example::
+
+ ceph -c {cluster-name}.conf health
+ ceph -c openstack.conf health
+
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+.. _Network Configuration Reference: ../network-config-ref
+.. _OSD Config Reference: ../osd-config-ref
+.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
+.. _ceph-deploy new: ../../deployment/ceph-deploy-new#naming-a-cluster
diff --git a/src/ceph/doc/rados/configuration/demo-ceph.conf b/src/ceph/doc/rados/configuration/demo-ceph.conf
new file mode 100644
index 0000000..ba86d53
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/demo-ceph.conf
@@ -0,0 +1,31 @@
+[global]
+fsid = {cluster-id}
+mon initial members = {hostname}[, {hostname}]
+mon host = {ip-address}[, {ip-address}]
+
+#All clusters have a front-side public network.
+#If you have two NICs, you can configure a back-side cluster
+#network for OSD object replication, heartbeats, backfilling,
+#recovery, etc.
+public network = {network}[, {network}]
+#cluster network = {network}[, {network}]
+
+#Clusters require authentication by default.
+auth cluster required = cephx
+auth service required = cephx
+auth client required = cephx
+
+#Choose reasonable numbers for your journals, number of replicas
+#and placement groups.
+osd journal size = {n}
+osd pool default size = {n} # Write an object n times.
+osd pool default min size = {n} # Allow writing n copies in a degraded state.
+osd pool default pg num = {n}
+osd pool default pgp num = {n}
+
+#Choose a reasonable crush leaf type.
+#0 for a 1-node cluster.
+#1 for a multi-node cluster in a single rack.
+#2 for a multi-node, multi-chassis cluster with multiple hosts in a chassis.
+#3 for a multi-node cluster with hosts across racks, etc.
+osd crush chooseleaf type = {n} \ No newline at end of file
diff --git a/src/ceph/doc/rados/configuration/filestore-config-ref.rst b/src/ceph/doc/rados/configuration/filestore-config-ref.rst
new file mode 100644
index 0000000..4dff60c
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/filestore-config-ref.rst
@@ -0,0 +1,365 @@
+============================
+ Filestore Config Reference
+============================
+
+
+``filestore debug omap check``
+
+:Description: Debugging check on synchronization. Expensive. For debugging only.
+:Type: Boolean
+:Required: No
+:Default: ``0``
+
+
+.. index:: filestore; extended attributes
+
+Extended Attributes
+===================
+
+Extended Attributes (XATTRs) are an important aspect in your configuration.
+Some file systems have limits on the number of bytes stored in XATTRS.
+Additionally, in some cases, the filesystem may not be as fast as an alternative
+method of storing XATTRs. The following settings may help improve performance
+by using a method of storing XATTRs that is extrinsic to the underlying filesystem.
+
+Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided
+by the underlying file system, if it does not impose a size limit. If
+there is a size limit (4KB total on ext4, for instance), some Ceph
+XATTRs will be stored in an key/value database when either the
+``filestore max inline xattr size`` or ``filestore max inline
+xattrs`` threshold is reached.
+
+
+``filestore max inline xattr size``
+
+:Description: The maximum size of an XATTR stored in the filesystem (i.e., XFS,
+              btrfs, ext4, etc.) per object. Should not be larger than the
+              filesystem can handle. Default value of 0 means to use the value
+              specific to the underlying filesystem.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``0``
+
+
+``filestore max inline xattr size xfs``
+
+:Description: The maximum size of an XATTR stored in the XFS filesystem.
+ Only used if ``filestore max inline xattr size`` == 0.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``65536``
+
+
+``filestore max inline xattr size btrfs``
+
+:Description: The maximum size of an XATTR stored in the btrfs filesystem.
+ Only used if ``filestore max inline xattr size`` == 0.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``2048``
+
+
+``filestore max inline xattr size other``
+
+:Description: The maximum size of an XATTR stored in other filesystems.
+ Only used if ``filestore max inline xattr size`` == 0.
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``512``
+
+
+``filestore max inline xattrs``
+
+:Description: The maximum number of XATTRs stored in the filesystem per object.
+ Default value of 0 means to use the value specific to the
+ underlying filesystem.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``0``
+
+
+``filestore max inline xattrs xfs``
+
+:Description: The maximum number of XATTRs stored in the XFS filesystem per object.
+ Only used if ``filestore max inline xattrs`` == 0.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``10``
+
+
+``filestore max inline xattrs btrfs``
+
+:Description: The maximum number of XATTRs stored in the btrfs filesystem per object.
+ Only used if ``filestore max inline xattrs`` == 0.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``10``
+
+
+``filestore max inline xattrs other``
+
+:Description: The maximum number of XATTRs stored in other filesystems per object.
+ Only used if ``filestore max inline xattrs`` == 0.
+:Type: 32-bit Integer
+:Required: No
+:Default: ``2``
+
+.. index:: filestore; synchronization
+
+Synchronization Intervals
+=========================
+
+Periodically, the filestore needs to quiesce writes and synchronize the
+filesystem, which creates a consistent commit point. It can then free journal
+entries up to the commit point. Synchronizing more frequently tends to reduce
+the time required to perform synchronization, and reduces the amount of data
+that needs to remain in the journal. Less frequent synchronization allows the
+backing filesystem to coalesce small writes and metadata updates more
+optimally--potentially resulting in more efficient synchronization.
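+
+For example, to let the backing filesystem coalesce writes for up to ten
+seconds between commits (illustrative values only):
+
+.. code-block:: ini
+
+    [osd]
+    filestore min sync interval = .01
+    filestore max sync interval = 10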
+
+
+``filestore max sync interval``
+
+:Description: The maximum interval in seconds for synchronizing the filestore.
+:Type: Double
+:Required: No
+:Default: ``5``
+
+
+``filestore min sync interval``
+
+:Description: The minimum interval in seconds for synchronizing the filestore.
+:Type: Double
+:Required: No
+:Default: ``.01``
+
+
+.. index:: filestore; flusher
+
+Flusher
+=======
+
+The filestore flusher forces data from large writes to be written out using
+``sync file range`` before the sync in order to (hopefully) reduce the cost of
+the eventual sync. In practice, disabling 'filestore flusher' seems to improve
+performance in some cases.
+
+
+``filestore flusher``
+
+:Description: Enables the filestore flusher.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+.. deprecated:: v.65
+
+``filestore flusher max fds``
+
+:Description: Sets the maximum number of file descriptors for the flusher.
+:Type: Integer
+:Required: No
+:Default: ``512``
+
+.. deprecated:: v.65
+
+``filestore sync flush``
+
+:Description: Enables the synchronization flusher.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+.. deprecated:: v.65
+
+``filestore fsync flushes journal data``
+
+:Description: Flush journal data during filesystem synchronization.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+.. index:: filestore; queue
+
+Queue
+=====
+
+The following settings provide limits on the size of the filestore queue.
+
+``filestore queue max ops``
+
+:Description: Defines the maximum number of in-progress operations the filestore accepts before blocking on queuing new operations.
+:Type: Integer
+:Required: No. Minimal impact on performance.
+:Default: ``50``
+
+
+``filestore queue max bytes``
+
+:Description: The maximum number of bytes for an operation.
+:Type: Integer
+:Required: No
+:Default: ``100 << 20``
+
+
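+For example, to allow more outstanding operations before the filestore
+blocks (illustrative values only; note that ``100 << 20`` is 100 MB):
+
+.. code-block:: ini
+
+ [osd]
+ filestore queue max ops = 500
+ filestore queue max bytes = 209715200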
+
+
+.. index:: filestore; timeouts
+
+Timeouts
+========
+
+
+``filestore op threads``
+
+:Description: The number of filesystem operation threads that execute in parallel.
+:Type: Integer
+:Required: No
+:Default: ``2``
+
+
+``filestore op thread timeout``
+
+:Description: The timeout for a filesystem operation thread (in seconds).
+:Type: Integer
+:Required: No
+:Default: ``60``
+
+
+``filestore op thread suicide timeout``
+
+:Description: The timeout for a commit operation before cancelling the commit (in seconds).
+:Type: Integer
+:Required: No
+:Default: ``180``
+
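+For example, a sketch of giving the filestore more parallelism on a host
+with spare CPU while keeping the default timeouts (illustrative values):
+
+.. code-block:: ini
+
+ [osd]
+ filestore op threads = 4
+ filestore op thread timeout = 60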
+
+.. index:: filestore; btrfs
+
+B-Tree Filesystem
+=================
+
+
+``filestore btrfs snap``
+
+:Description: Enable snapshots for a ``btrfs`` filestore.
+:Type: Boolean
+:Required: No. Only used for ``btrfs``.
+:Default: ``true``
+
+
+``filestore btrfs clone range``
+
+:Description: Enable cloning ranges for a ``btrfs`` filestore.
+:Type: Boolean
+:Required: No. Only used for ``btrfs``.
+:Default: ``true``
+
+
+.. index:: filestore; journal
+
+Journal
+=======
+
+
+``filestore journal parallel``
+
+:Description: Enables parallel journaling, the default for ``btrfs``.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore journal writeahead``
+
+:Description: Enables writeahead journaling, the default for ``xfs``.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore journal trailing``
+
+:Description: Deprecated, never use.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+Misc
+====
+
+
+``filestore merge threshold``
+
+:Description: The minimum number of files in a subdirectory before merging
+ into the parent directory. NOTE: A negative value disables
+ subdirectory merging.
+:Type: Integer
+:Required: No
+:Default: ``10``
+
+
+``filestore split multiple``
+
+:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16``
+ is the maximum number of files in a subdirectory before
+ splitting into child directories.
+
+:Type: Integer
+:Required: No
+:Default: ``2``
+
+
+``filestore split rand factor``
+
+:Description: A random factor added to the split threshold to avoid
+ too many filestore splits occurring at once. See
+ ``filestore split multiple`` for details.
+ This can only be changed for an existing osd offline,
+ via ceph-objectstore-tool's apply-layout-settings command.
+
+:Type: Unsigned 32-bit Integer
+:Required: No
+:Default: ``20``
+
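+To see how these settings interact, consider the defaults: with
+``filestore split multiple`` of 2, ``filestore merge threshold`` of 10 and
+``filestore split rand factor`` of 20, a subdirectory splits once it holds
+``(2 * abs(10) + (rand() % 20)) * 16`` files, i.e. somewhere between 320 and
+624 files, while subdirectories holding fewer than 10 files merge back into
+their parent. A hypothetical override in ``ceph.conf`` might look like:
+
+.. code-block:: ini
+
+ [osd]
+ filestore merge threshold = 40
+ filestore split multiple = 8
+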
+
+``filestore update to``
+
+:Description: Limits automatic upgrades of the filestore to the specified version.
+:Type: Integer
+:Required: No
+:Default: ``1000``
+
+
+``filestore blackhole``
+
+:Description: Drop any new transactions on the floor.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore dump file``
+
+:Description: The file to which filestore transaction dumps are written.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``filestore kill at``
+
+:Description: Inject a failure at the n'th opportunity.
+:Type: String
+:Required: No
+:Default: ``false``
+
+
+``filestore fail eio``
+
+:Description: Fail/crash on EIO.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
diff --git a/src/ceph/doc/rados/configuration/general-config-ref.rst b/src/ceph/doc/rados/configuration/general-config-ref.rst
new file mode 100644
index 0000000..ca09ee5
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/general-config-ref.rst
@@ -0,0 +1,66 @@
+==========================
+ General Config Reference
+==========================
+
+
+``fsid``
+
+:Description: The filesystem ID. One per cluster.
+:Type: UUID
+:Required: No.
+:Default: N/A. Usually generated by deployment tools.
+
+
+``admin socket``
+
+:Description: The socket for executing administrative commands on a daemon,
+ irrespective of whether Ceph Monitors have established a quorum.
+
+:Type: String
+:Required: No
+:Default: ``/var/run/ceph/$cluster-$name.asok``
+
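+For example, with the default socket path shown above, you can ask a running
+``ceph-osd`` daemon for its current configuration, even without monitor
+quorum::
+
+ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
+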
+
+``pid file``
+
+:Description: The file in which the mon, osd or mds will write its
+ PID. For instance, ``/var/run/$cluster/$type.$id.pid``
+ will create /var/run/ceph/mon.a.pid for the ``mon`` with
+ id ``a`` running in the ``ceph`` cluster. The ``pid
+ file`` is removed when the daemon stops gracefully. If
+ the process is not daemonized (i.e. runs with the ``-f``
+ or ``-d`` option), the ``pid file`` is not created.
+:Type: String
+:Required: No
+:Default: None
+
+
+``chdir``
+
+:Description: The directory Ceph daemons change to once they are
+ up and running. The default ``/`` directory is recommended.
+
+:Type: String
+:Required: No
+:Default: ``/``
+
+
+``max open files``
+
+:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets
+ the ``max open fds`` at the OS level (i.e., the maximum number
+ of file descriptors). It helps prevent Ceph OSD Daemons from
+ running out of file descriptors.
+
+:Type: 64-bit Integer
+:Required: No
+:Default: ``0``
+
+
+``fatal signal handlers``
+
+:Description: If set, Ceph will install signal handlers for the SEGV, ABRT,
+ BUS, ILL, FPE, XCPU, XFSZ and SYS signals to generate a useful
+ log message.
+
+:Type: Boolean
+:Default: ``true``
diff --git a/src/ceph/doc/rados/configuration/index.rst b/src/ceph/doc/rados/configuration/index.rst
new file mode 100644
index 0000000..48b58ef
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/index.rst
@@ -0,0 +1,64 @@
+===============
+ Configuration
+===============
+
+Ceph can run with a cluster containing thousands of Object Storage Devices
+(OSDs). A minimal system will have at least two OSDs for data replication. To
+configure OSD clusters, you must provide settings in the configuration file.
+Ceph provides default values for many settings, which you can override in the
+configuration file. Additionally, you can make runtime modification to the
+configuration using command-line utilities.
+
+When Ceph starts, it activates three daemons:
+
+- ``ceph-mon`` (mandatory)
+- ``ceph-osd`` (mandatory)
+- ``ceph-mds`` (mandatory for cephfs only)
+
+Each process, daemon or utility loads the host's configuration file. A process
+may have information about more than one daemon instance (*i.e.,* multiple
+contexts). A daemon or utility only has information about a single daemon
+instance (a single context).
+
+.. note:: Ceph can run on a single host for evaluation purposes.
+
+
+.. raw:: html
+
+ <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3>
+
+For general object store configuration, refer to the following:
+
+.. toctree::
+ :maxdepth: 1
+
+ Storage devices <storage-devices>
+ ceph-conf
+
+
+.. raw:: html
+
+ </td><td><h3>Reference</h3>
+
+To optimize the performance of your cluster, refer to the following:
+
+.. toctree::
+ :maxdepth: 1
+
+ Network Settings <network-config-ref>
+ Auth Settings <auth-config-ref>
+ Monitor Settings <mon-config-ref>
+ mon-lookup-dns
+ Heartbeat Settings <mon-osd-interaction>
+ OSD Settings <osd-config-ref>
+ BlueStore Settings <bluestore-config-ref>
+ FileStore Settings <filestore-config-ref>
+ Journal Settings <journal-ref>
+ Pool, PG & CRUSH Settings <pool-pg-config-ref.rst>
+ Messaging Settings <ms-ref>
+ General Settings <general-config-ref>
+
+
+.. raw:: html
+
+ </td></tr></tbody></table>
diff --git a/src/ceph/doc/rados/configuration/journal-ref.rst b/src/ceph/doc/rados/configuration/journal-ref.rst
new file mode 100644
index 0000000..97300f4
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/journal-ref.rst
@@ -0,0 +1,116 @@
+==========================
+ Journal Config Reference
+==========================
+
+.. index:: journal; journal configuration
+
+Ceph OSDs use a journal for two reasons: speed and consistency.
+
+- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes
+ quickly. Ceph writes small, random i/o to the journal sequentially, which
+ tends to speed up bursty workloads by allowing the backing filesystem more
+ time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead
+ to spiky performance with short spurts of high-speed writes followed by
+ periods without any write progress as the filesystem catches up to the
+ journal.
+
+- **Consistency:** Ceph OSD Daemons require a filesystem interface that
+ guarantees atomic compound operations. Ceph OSD Daemons write a description
+ of the operation to the journal and apply the operation to the filesystem.
+ This enables atomic updates to an object (for example, placement group
+ metadata). Every few seconds--between ``filestore max sync interval`` and
+ ``filestore min sync interval``--the Ceph OSD Daemon stops writes and
+ synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to
+ trim operations from the journal and reuse the space. On failure, Ceph
+ OSD Daemons replay the journal starting after the last synchronization
+ operation.
+
+Ceph OSD Daemons support the following journal settings:
+
+
+``journal dio``
+
+:Description: Enables direct i/o to the journal. Requires ``journal block
+ align`` set to ``true``.
+
+:Type: Boolean
+:Required: Yes when using ``aio``.
+:Default: ``true``
+
+
+
+``journal aio``
+
+.. versionchanged:: 0.61 Cuttlefish
+
+:Description: Enables using ``libaio`` for asynchronous writes to the journal.
+ Requires ``journal dio`` set to ``true``.
+
+:Type: Boolean
+:Required: No.
+:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``.
+
+
+``journal block align``
+
+:Description: Block aligns write operations. Required for ``dio`` and ``aio``.
+:Type: Boolean
+:Required: Yes when using ``dio`` and ``aio``.
+:Default: ``true``
+
+
+``journal max write bytes``
+
+:Description: The maximum number of bytes the journal will write at
+ any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``10 << 20``
+
+
+``journal max write entries``
+
+:Description: The maximum number of entries the journal will write at
+ any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``100``
+
+
+``journal queue max ops``
+
+:Description: The maximum number of operations allowed in the queue at
+ any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``500``
+
+
+``journal queue max bytes``
+
+:Description: The maximum number of bytes allowed in the queue at
+ any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``10 << 20``
+
+
+``journal align min size``
+
+:Description: Align data payloads greater than the specified minimum.
+:Type: Integer
+:Required: No
+:Default: ``64 << 10``
+
+
+``journal zero on create``
+
+:Description: Causes the file store to overwrite the entire journal with
+ ``0``'s during ``mkfs``.
+:Type: Boolean
+:Required: No
+:Default: ``false``
diff --git a/src/ceph/doc/rados/configuration/mon-config-ref.rst b/src/ceph/doc/rados/configuration/mon-config-ref.rst
new file mode 100644
index 0000000..6c8e92b
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/mon-config-ref.rst
@@ -0,0 +1,1222 @@
+==========================
+ Monitor Config Reference
+==========================
+
+Understanding how to configure a :term:`Ceph Monitor` is an important part of
+building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters
+have at least one monitor**. A monitor configuration usually remains fairly
+consistent, but you can add, remove or replace a monitor in a cluster. See
+`Adding/Removing a Monitor`_ and `Add/Remove a Monitor (ceph-deploy)`_ for
+details.
+
+
+.. index:: Ceph Monitor; Paxos
+
+Background
+==========
+
+Ceph Monitors maintain a "master copy" of the :term:`cluster map`, which means a
+:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD
+Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and
+retrieving a current cluster map. Before Ceph Clients can read from or write to
+Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor
+first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph
+Client can compute the location for any object. The ability to compute object
+locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a
+very important aspect of Ceph's high scalability and performance. See
+`Scalability and High Availability`_ for additional details.
+
+The primary role of the Ceph Monitor is to maintain a master copy of the cluster
+map. Ceph Monitors also provide authentication and logging services. Ceph
+Monitors write all changes in the monitor services to a single Paxos instance,
+and Paxos writes the changes to a key/value store for strong consistency. Ceph
+Monitors can query the most recent version of the cluster map during sync
+operations. Ceph Monitors leverage the key/value store's snapshots and iterators
+(using leveldb) to perform store-wide synchronization.
+
+.. ditaa::
+
+ /-------------\ /-------------\
+ | Monitor | Write Changes | Paxos |
+ | cCCC +-------------->+ cCCC |
+ | | | |
+ +-------------+ \------+------/
+ | Auth | |
+ +-------------+ | Write Changes
+ | Log | |
+ +-------------+ v
+ | Monitor Map | /------+------\
+ +-------------+ | Key / Value |
+ | OSD Map | | Store |
+ +-------------+ | cCCC |
+ | PG Map | \------+------/
+ +-------------+ ^
+ | MDS Map | | Read Changes
+ +-------------+ |
+ | cCCC |*---------------------+
+ \-------------/
+
+
+.. deprecated:: version 0.58
+
+In Ceph versions 0.58 and earlier, Ceph Monitors used a Paxos instance for
+each service and stored the map as a file.
+
+.. index:: Ceph Monitor; cluster map
+
+Cluster Maps
+------------
+
+The cluster map is a composite of maps, including the monitor map, the OSD map,
+the placement group map and the metadata server map. The cluster map tracks a
+number of important things: which processes are ``in`` the Ceph Storage Cluster;
+which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running
+or ``down``; whether the placement groups are ``active`` or ``inactive``, and
+``clean`` or in some other state; and other details that reflect the current
+state of the cluster such as the total amount of storage space, and the amount
+of storage used.
+
+When there is a significant change in the state of the cluster--e.g., a Ceph OSD
+Daemon goes down, a placement group falls into a degraded state, etc.--the
+cluster map gets updated to reflect the current state of the cluster.
+Additionally, the Ceph Monitor also maintains a history of the prior states of
+the cluster. The monitor map, OSD map, placement group map and metadata server
+map each maintain a history of their map versions. We call each version an
+"epoch."
+
+When operating your Ceph Storage Cluster, keeping track of these states is an
+important part of your system administration duties. See `Monitoring a Cluster`_
+and `Monitoring OSDs and PGs`_ for additional details.
+
+.. index:: high availability; quorum
+
+Monitor Quorum
+--------------
+
+Our Configuring Ceph section provides a trivial `Ceph configuration file`_ that
+provides for one monitor in the test cluster. A cluster will run fine with a
+single monitor; however, **a single monitor is a single-point-of-failure**. To
+ensure high availability in a production Ceph Storage Cluster, you should run
+Ceph with multiple monitors so that the failure of a single monitor **WILL NOT**
+bring down your entire cluster.
+
+When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability,
+Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map.
+A consensus requires a majority of monitors running to establish a quorum for
+consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6;
+etc.).
+
+``mon force quorum join``
+
+:Description: Force monitor to join quorum even if it has been previously removed from the map
+:Type: Boolean
+:Default: ``False``
+
+.. index:: Ceph Monitor; consistency
+
+Consistency
+-----------
+
+When you add monitor settings to your Ceph configuration file, you need to be
+aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes
+strict consistency requirements** for a Ceph monitor when discovering another
+Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons
+use the Ceph configuration file to discover monitors, monitors discover each
+other using the monitor map (monmap), not the Ceph configuration file.
+
+A Ceph Monitor always refers to the local copy of the monmap when discovering
+other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the
+Ceph configuration file avoids errors that could break the cluster (e.g., typos
+in ``ceph.conf`` when specifying a monitor address or port). Since monitors use
+monmaps for discovery and they share monmaps with clients and other Ceph
+daemons, **the monmap provides monitors with a strict guarantee that their
+consensus is valid.**
+
+Strict consistency also applies to updates to the monmap. As with any other
+updates on the Ceph Monitor, changes to the monmap always run through a
+distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on
+each update to the monmap, such as adding or removing a Ceph Monitor, to ensure
+that each monitor in the quorum has the same version of the monmap. Updates to
+the monmap are incremental so that Ceph Monitors have the latest agreed upon
+version, and a set of previous versions. Maintaining a history enables a Ceph
+Monitor that has an older version of the monmap to catch up with the current
+state of the Ceph Storage Cluster.
+
+If Ceph Monitors discovered each other through the Ceph configuration file
+instead of through the monmap, it would introduce additional risks because the
+Ceph configuration files are not updated and distributed automatically. Ceph
+Monitors might inadvertently use an older Ceph configuration file, fail to
+recognize a Ceph Monitor, fall out of a quorum, or develop a situation where
+`Paxos`_ is not able to determine the current state of the system accurately.
+
+
+.. index:: Ceph Monitor; bootstrapping monitors
+
+Bootstrapping Monitors
+----------------------
+
+In most configuration and deployment cases, tools that deploy Ceph may help
+bootstrap the Ceph Monitors by generating a monitor map for you (e.g.,
+``ceph-deploy``, etc). A Ceph Monitor requires a few explicit
+settings:
+
+- **Filesystem ID**: The ``fsid`` is the unique identifier for your
+ object store. Since you can run multiple clusters on the same
+ hardware, you must specify the unique ID of the object store when
+ bootstrapping a monitor. Deployment tools usually do this for you
+ (e.g., ``ceph-deploy`` can call a tool like ``uuidgen``), but you
+ may specify the ``fsid`` manually too.
+
+- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within
+ the cluster. It is an alphanumeric value, and by convention the identifier
+ usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This
+ can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.),
+ by a deployment tool, or using the ``ceph`` commandline.
+
+- **Keys**: The monitor must have secret keys. A deployment tool such as
+ ``ceph-deploy`` usually does this for you, but you may
+ perform this step manually too. See `Monitor Keyrings`_ for details.
+
+For additional details on bootstrapping, see `Bootstrapping a Monitor`_.
+
+.. index:: Ceph Monitor; configuring monitors
+
+Configuring Monitors
+====================
+
+To apply configuration settings to the entire cluster, enter the configuration
+settings under ``[global]``. To apply configuration settings to all monitors in
+your cluster, enter the configuration settings under ``[mon]``. To apply
+configuration settings to specific monitors, specify the monitor instance
+(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation.
+
+.. code-block:: ini
+
+ [global]
+
+ [mon]
+
+ [mon.a]
+
+ [mon.b]
+
+ [mon.c]
+
+
+Minimum Configuration
+---------------------
+
+The bare minimum monitor settings for a Ceph monitor via the Ceph configuration
+file include a hostname and a monitor address for each monitor. You can configure
+these under ``[mon]`` or under the entry for a specific monitor.
+
+.. code-block:: ini
+
+ [mon]
+ mon host = hostname1,hostname2,hostname3
+ mon addr = 10.0.0.10:6789,10.0.0.11:6789,10.0.0.12:6789
+
+
+.. code-block:: ini
+
+ [mon.a]
+ host = hostname1
+ mon addr = 10.0.0.10:6789
+
+See the `Network Configuration Reference`_ for details.
+
+.. note:: This minimum configuration for monitors assumes that a deployment
+ tool generates the ``fsid`` and the ``mon.`` key for you.
+
+Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP address of
+the monitors. However, if you decide to change the monitor's IP address, you
+must follow a specific procedure. See `Changing a Monitor's IP Address`_ for
+details.
+
+Monitors can also be found by clients using DNS SRV records. See `Monitor lookup through DNS`_ for details.
+
+Cluster ID
+----------
+
+Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it
+usually appears under the ``[global]`` section of the configuration file.
+Deployment tools usually generate the ``fsid`` and store it in the monitor map,
+so the value may not appear in a configuration file. The ``fsid`` makes it
+possible to run daemons for multiple clusters on the same hardware.
+
+``fsid``
+
+:Description: The cluster ID. One per cluster.
+:Type: UUID
+:Required: Yes.
+:Default: N/A. May be generated by a deployment tool if not specified.
+
+.. note:: Do not set this value if you use a deployment tool that does
+ it for you.
+
+
+.. index:: Ceph Monitor; initial members
+
+Initial Members
+---------------
+
+We recommend running a production Ceph Storage Cluster with at least three Ceph
+Monitors to ensure high availability. When you run multiple monitors, you may
+specify the initial monitors that must be members of the cluster in order to
+establish a quorum. This may reduce the time it takes for your cluster to come
+online.
+
+.. code-block:: ini
+
+ [mon]
+ mon initial members = a,b,c
+
+
+``mon initial members``
+
+:Description: The IDs of initial monitors in a cluster during startup. If
+ specified, Ceph requires an odd number of monitors to form an
+ initial quorum (e.g., 3).
+
+:Type: String
+:Default: None
+
+.. note:: A *majority* of monitors in your cluster must be able to reach
+ each other in order to establish a quorum. You can decrease the initial
+ number of monitors to establish a quorum with this setting.
+
+.. index:: Ceph Monitor; data path
+
+Data
+----
+
+Ceph provides a default path where Ceph Monitors store data. For optimal
+performance in a production Ceph Storage Cluster, we recommend running Ceph
+Monitors on separate hosts and drives from Ceph OSD Daemons. Since leveldb
+uses ``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk
+very often, which can interfere with Ceph OSD Daemon workloads if the data
+store is co-located with the OSD Daemons.
+
+In Ceph versions 0.58 and earlier, Ceph Monitors store their data in files. This
+approach allows users to inspect monitor data with common tools like ``ls``
+and ``cat``. However, it doesn't provide strong consistency.
+
+In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value
+pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents
+recovering Ceph Monitors from running corrupted versions through Paxos, and it
+enables multiple modification operations in one single atomic batch, among other
+advantages.
+
+Generally, we do not recommend changing the default data location. If you modify
+the default location, we recommend that you make it uniform across Ceph Monitors
+by setting it in the ``[mon]`` section of the configuration file.
+
+
+``mon data``
+
+:Description: The monitor's data location.
+:Type: String
+:Default: ``/var/lib/ceph/mon/$cluster-$id``
+
+
+``mon data size warn``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log when the monitor's data
+ store goes over 15GB.
+:Type: Integer
+:Default: ``15*1024*1024*1024``
+
+
+``mon data avail warn``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log when the available disk
+ space of monitor's data store is lower or equal to this
+ percentage.
+:Type: Integer
+:Default: 30
+
+
+``mon data avail crit``
+
+:Description: Issue a ``HEALTH_ERR`` in cluster log when the available disk
+ space of monitor's data store is lower or equal to this
+ percentage.
+:Type: Integer
+:Default: 5
+
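+For example, to warn earlier about a growing monitor store and about low
+disk space, you might tighten these thresholds (illustrative values only;
+the size is in bytes, so 10737418240 is 10GB):
+
+.. code-block:: ini
+
+ [mon]
+ mon data size warn = 10737418240
+ mon data avail warn = 40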
+
+``mon warn on cache pools without hit sets``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if a cache pool does not
+ have the ``hit_set_type`` value configured.
+ See `hit set type <../operations/pools#hit-set-type>`_ for more
+ details.
+:Type: Boolean
+:Default: True
+
+
+``mon warn on crush straw calc version zero``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH's
+ ``straw_calc_version`` is zero. See
+ `CRUSH map tunables <../operations/crush-map#tunables>`_ for
+ details.
+:Type: Boolean
+:Default: True
+
+
+``mon warn on legacy crush tunables``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if
+ CRUSH tunables are too old (older than ``mon_min_crush_required_version``)
+:Type: Boolean
+:Default: True
+
+
+``mon crush min required version``
+
+:Description: The minimum tunable profile version required by the cluster.
+ See
+ `CRUSH map tunables <../operations/crush-map#tunables>`_ for
+ details.
+:Type: String
+:Default: ``firefly``
+
+
+``mon warn on osd down out interval zero``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if
+ ``mon osd down out interval`` is zero. Having this option set to
+ zero on the leader acts much like the ``noout`` flag. It's hard
+ to figure out what's going wrong with clusters that behave this
+ way without the ``noout`` flag set, so we report a warning in
+ this case.
+:Type: Boolean
+:Default: True
+
+
+``mon cache target full warn ratio``
+
+:Description: The position between a pool's ``cache_target_full`` and
+ ``target_max_object`` values at which we start warning.
+:Type: Float
+:Default: ``0.66``
+
+
+``mon health data update interval``
+
+:Description: How often (in seconds) the monitor in quorum shares its health
+ status with its peers (a negative number disables it).
+:Type: Float
+:Default: ``60``
+
+
+``mon health to clog``
+
+:Description: Enable sending health summary to cluster log periodically.
+:Type: Boolean
+:Default: True
+
+
+``mon health to clog tick interval``
+
+:Description: How often (in seconds) the monitor sends the health summary to
+ the cluster log (a non-positive number disables it). If the
+ current health summary is empty or identical to the previous
+ one, the monitor will not send it to the cluster log.
+:Type: Integer
+:Default: 3600
+
+
+``mon health to clog interval``
+
+:Description: How often (in seconds) the monitor sends the health summary to
+ the cluster log (a non-positive number disables it). The monitor
+ will always send the summary to the cluster log whether or not
+ the summary changes.
+:Type: Integer
+:Default: 60
+
+
+
+.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning
+
+Storage Capacity
+----------------
+
+When a Ceph Storage Cluster gets close to its maximum capacity (i.e., ``mon osd
+full ratio``), Ceph prevents you from writing to or reading from Ceph OSD
+Daemons as a safety measure to prevent data loss. Therefore, letting a
+production Ceph Storage Cluster approach its full ratio is not a good practice,
+because it sacrifices high availability. The default full ratio is ``.95``, or
+95% of capacity. This is a very aggressive setting for a test cluster with a
+small number of OSDs.
+
+.. tip:: When monitoring your cluster, be alert to warnings related to the
+ ``nearfull`` ratio. Reaching it means that a failure of one or more
+ OSDs could result in a temporary service disruption. Consider adding
+ more OSDs to increase storage capacity.
+
+A common scenario for test clusters involves a system administrator removing a
+Ceph OSD Daemon from the Ceph Storage Cluster to watch the cluster rebalance;
+then, removing another Ceph OSD Daemon, and so on until the Ceph Storage Cluster
+eventually reaches the full ratio and locks up. We recommend a bit of capacity
+planning even with a test cluster. Planning enables you to gauge how much spare
+capacity you will need in order to maintain high availability. Ideally, you want
+to plan for a series of Ceph OSD Daemon failures where the cluster can recover
+to an ``active + clean`` state without replacing those Ceph OSD Daemons
+immediately. You can run a cluster in an ``active + degraded`` state, but this
+is not ideal for normal operating conditions.
+
+The following diagram depicts a simplistic Ceph Storage Cluster containing 33
+Ceph Nodes with one Ceph OSD Daemon per host, each Ceph OSD Daemon reading from
+and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum
+actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph
+Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow
+Ceph Clients to read and write data. So the Ceph Storage Cluster's operating
+capacity is 95TB, not 99TB.
+
+.. ditaa::
+
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 |
+ | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+ | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare |
+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
+
+It is normal in such a cluster for one or two OSDs to fail. A less frequent but
+reasonable scenario involves a rack's router or power supply failing, which
+brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario,
+you should still strive for a cluster that can remain operational and achieve an
+``active + clean`` state--even if that means adding a few hosts with additional
+OSDs in short order. If your capacity utilization is too high, you may not lose
+data, but you could still sacrifice data availability while resolving an outage
+within a failure domain if capacity utilization of the cluster exceeds the full
+ratio. For this reason, we recommend at least some rough capacity planning.
+
+Identify two numbers for your cluster:
+
+#. The number of OSDs.
+#. The total capacity of the cluster
+
+If you divide the total capacity of your cluster by the number of OSDs in your
+cluster, you will find the mean average capacity of an OSD within your cluster.
+Consider multiplying that number by the number of OSDs you expect will fail
+simultaneously during normal operations (a relatively small number). Finally
+multiply the capacity of the cluster by the full ratio to arrive at a maximum
+operating capacity; then, subtract the number of amount of data from the OSDs
+you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing
+process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at
+a reasonable number for a near full ratio.
+
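+As a worked example of this arithmetic, using the illustrative 33-OSD, 99TB
+cluster above and assuming two simultaneous 3TB OSD failures::
+
+ 99TB * .95          = 94.05TB  maximum operating capacity
+ 94.05TB - (2 * 3TB) = 88.05TB  capacity left after two OSDs fail
+ 88.05TB / 99TB      = ~.89     a reasonable full ratio
+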
+.. code-block:: ini
+
+ [global]
+
+ mon osd full ratio = .80
+ mon osd backfillfull ratio = .75
+ mon osd nearfull ratio = .70
+
+
+``mon osd full ratio``
+
+:Description: The percentage of disk space used before an OSD is
+ considered ``full``.
+
+:Type: Float
+:Default: ``.95``
+
+
+``mon osd backfillfull ratio``
+
+:Description: The percentage of disk space used before an OSD is
+ considered too ``full`` to backfill.
+
+:Type: Float
+:Default: ``.90``
+
+
+``mon osd nearfull ratio``
+
+:Description: The percentage of disk space used before an OSD is
+ considered ``nearfull``.
+
+:Type: Float
+:Default: ``.85``
+
+
+.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you
+ may have a problem with the CRUSH weight for the nearfull OSDs.
+
+.. index:: heartbeat
+
+Heartbeat
+---------
+
+Ceph monitors know about the cluster by requiring reports from each OSD, and by
+receiving reports from OSDs about the status of their neighboring OSDs. Ceph
+provides reasonable default settings for monitor/OSD interaction; however, you
+may modify them as needed. See `Monitor/OSD Interaction`_ for details.
+
+
+.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization
+
+Monitor Store Synchronization
+-----------------------------
+
+When you run a production cluster with multiple monitors (recommended), each
+monitor checks to see if a neighboring monitor has a more recent version of the
+cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers
+higher than the most current epoch in the map of the instant monitor).
+Periodically, one monitor in the cluster may fall behind the other monitors to
+the point where it must leave the quorum, synchronize to retrieve the most
+current information about the cluster, and then rejoin the quorum. For the
+purposes of synchronization, monitors may assume one of three roles:
+
+#. **Leader**: The `Leader` is the first monitor to achieve the most recent
+ Paxos version of the cluster map.
+
+#. **Provider**: The `Provider` is a monitor that has the most recent version
+ of the cluster map, but wasn't the first to achieve the most recent version.
+
+#. **Requester:** A `Requester` is a monitor that has fallen behind the leader
+ and must synchronize in order to retrieve the most recent information about
+ the cluster before it can rejoin the quorum.
+
+These roles enable a leader to delegate synchronization duties to a provider,
+which prevents synchronization requests from overloading the leader--improving
+performance. In the following diagram, the requester has learned that it has
+fallen behind the other monitors. The requester asks the leader to synchronize,
+and the leader tells the requester to synchronize with a provider.
+
+
+.. ditaa:: +-----------+ +---------+ +----------+
+ | Requester | | Leader | | Provider |
+ +-----------+ +---------+ +----------+
+ | | |
+ | | |
+ | Ask to Synchronize | |
+ |------------------->| |
+ | | |
+ |<-------------------| |
+ | Tell Requester to | |
+ | Sync with Provider | |
+ | | |
+ | Synchronize |
+ |--------------------+-------------------->|
+ | | |
+ |<-------------------+---------------------|
+ | Send Chunk to Requester |
+ | (repeat as necessary) |
+ | Requester Acks Chunk to Provider |
+ |--------------------+-------------------->|
+ | |
+ | Sync Complete |
+ | Notification |
+ |------------------->|
+ | |
+ |<-------------------|
+ | Ack |
+ | |
+
+
+Synchronization always occurs when a new monitor joins the cluster. During
+runtime operations, monitors may receive updates to the cluster map at different
+times. This means the leader and provider roles may migrate from one monitor to
+another. If this happens while synchronizing (e.g., a provider falls behind the
+leader), the provider can terminate synchronization with a requester.
+
+Once synchronization is complete, Ceph requires trimming across the cluster.
+Trimming requires that the placement groups are ``active + clean``.
+
+
+``mon sync trim timeout``
+
+:Description:
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync heartbeat timeout``
+
+:Description:
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync heartbeat interval``
+
+:Description:
+:Type: Double
+:Default: ``5.0``
+
+
+``mon sync backoff timeout``
+
+:Description:
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync timeout``
+
+:Description: Number of seconds the monitor will wait for the next update
+ message from its sync provider before it gives up and bootstraps
+ again.
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync max retries``
+
+:Description:
+:Type: Integer
+:Default: ``5``
+
+
+``mon sync max payload size``
+
+:Description: The maximum size for a sync payload (in bytes).
+:Type: 32-bit Integer
+:Default: ``1045676``
+
+
+``paxos max join drift``
+
+:Description: The maximum Paxos iterations before we must first sync the
+ monitor data stores. When a monitor finds that its peer is too
+ far ahead of it, it will first sync its data store before moving
+ on.
+:Type: Integer
+:Default: ``10``
+
+``paxos stash full interval``
+
+:Description: How often (in commits) to stash a full copy of the PaxosService state.
+ Currently this setting only affects the ``mds``, ``mon``, ``auth`` and ``mgr``
+ PaxosServices.
+:Type: Integer
+:Default: 25
+
+``paxos propose interval``
+
+:Description: Gather updates for this time interval before proposing
+ a map update.
+:Type: Double
+:Default: ``1.0``
+
+
+``paxos min``
+
+:Description: The minimum number of paxos states to keep around
+:Type: Integer
+:Default: 500
+
+
+``paxos min wait``
+
+:Description: The minimum amount of time to gather updates after a period of
+ inactivity.
+:Type: Double
+:Default: ``0.05``
+
+
+``paxos trim min``
+
+:Description: Number of extra proposals tolerated before trimming
+:Type: Integer
+:Default: 250
+
+
+``paxos trim max``
+
+:Description: The maximum number of extra proposals to trim at a time
+:Type: Integer
+:Default: 500
+
+
+``paxos service trim min``
+
+:Description: The minimum amount of versions to trigger a trim (0 disables it)
+:Type: Integer
+:Default: 250
+
+
+``paxos service trim max``
+
+:Description: The maximum amount of versions to trim during a single proposal (0 disables it)
+:Type: Integer
+:Default: 500
+
+
+``mon max log epochs``
+
+:Description: The maximum amount of log epochs to trim during a single proposal
+:Type: Integer
+:Default: 500
+
+
+``mon max pgmap epochs``
+
+:Description: The maximum amount of pgmap epochs to trim during a single proposal
+:Type: Integer
+:Default: 500
+
+
+``mon mds force trim to``
+
+:Description: Force monitor to trim mdsmaps to this point (0 disables it;
+ dangerous, use with care).
+:Type: Integer
+:Default: 0
+
+
+``mon osd force trim to``
+
+:Description: Force monitor to trim osdmaps to this point, even if there are
+ PGs that are not clean at the specified epoch (0 disables it;
+ dangerous, use with care).
+:Type: Integer
+:Default: 0
+
+``mon osd cache size``
+
+:Description: The size of the osdmap cache, so that the monitor does not have
+ to rely on the underlying store's cache.
+:Type: Integer
+:Default: 10
+
+
+``mon election timeout``
+
+:Description: On the election proposer, the maximum waiting time in seconds for all ACKs.
+:Type: Float
+:Default: ``5``
+
+
+``mon lease``
+
+:Description: The length (in seconds) of the lease on the monitor's versions.
+:Type: Float
+:Default: ``5``
+
+
+``mon lease renew interval factor``
+
+:Description: ``mon lease`` \* ``mon lease renew interval factor`` will be the
+ interval for the Leader to renew the other monitors' leases. The
+ factor should be less than ``1.0``.
+:Type: Float
+:Default: ``0.6``
+
+
+``mon lease ack timeout factor``
+
+:Description: The Leader will wait ``mon lease`` \* ``mon lease ack timeout factor``
+ for the Providers to acknowledge the lease extension.
+:Type: Float
+:Default: ``2.0``
+
+
+``mon accept timeout factor``
+
+:Description: The Leader will wait ``mon lease`` \* ``mon accept timeout factor``
+ for the Requester(s) to accept a Paxos update. It is also used
+ during the Paxos recovery phase for similar purposes.
+:Type: Float
+:Default: ``2.0``
+
+
+``mon min osdmap epochs``
+
+:Description: Minimum number of OSD map epochs to keep at all times.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+``mon max pgmap epochs``
+
+:Description: Maximum number of PG map epochs the monitor should keep.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+``mon max log epochs``
+
+:Description: Maximum number of Log epochs the monitor should keep.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+
+.. index:: Ceph Monitor; clock
+
+Clock
+-----
+
+Ceph daemons pass critical messages to each other, which must be processed
+before daemons reach a timeout threshold. If the clocks in Ceph monitors
+are not synchronized, it can lead to a number of anomalies. For example:
+
+- Daemons ignoring received messages (e.g., timestamps outdated)
+- Timeouts triggered too soon/late when a message wasn't received in time.
+
+See `Monitor Store Synchronization`_ for details.
+
+
+.. tip:: You SHOULD install NTP on your Ceph monitor hosts to
+ ensure that the monitor cluster operates with synchronized clocks.
+
+Clock drift may still be noticeable with NTP even though the discrepancy is not
+yet harmful. Ceph's clock drift / clock skew warnings may get triggered even
+though NTP maintains a reasonable level of synchronization. Increasing the
+allowed clock drift may be tolerable under such circumstances; however, a number of
+factors such as workload, network latency, configuring overrides to default
+timeouts and the `Monitor Store Synchronization`_ settings may influence
+the level of acceptable clock drift without compromising Paxos guarantees.
+
+Ceph provides the following tunable options to allow you to find
+acceptable values.
+
+
+``clock offset``
+
+:Description: How much to offset the system clock. See ``Clock.cc`` for details.
+:Type: Double
+:Default: ``0``
+
+
+.. deprecated:: 0.58
+
+``mon tick interval``
+
+:Description: A monitor's tick interval in seconds.
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+``mon clock drift allowed``
+
+:Description: The clock drift in seconds allowed between monitors.
+:Type: Float
+:Default: ``.050``
+
+
+``mon clock drift warn backoff``
+
+:Description: Exponential backoff for clock drift warnings
+:Type: Float
+:Default: ``5``
+
+
+``mon timecheck interval``
+
+:Description: The time check interval (clock drift check) in seconds
+ for the Leader.
+
+:Type: Float
+:Default: ``300.0``
+
+
+``mon timecheck skew interval``
+
+:Description: The time check interval (clock drift check) in seconds for the
+ Leader when a clock skew is present.
+:Type: Float
+:Default: ``30.0``
+
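+For example, to tolerate slightly more clock skew between monitors before
+warning (an illustrative value; keep this as tight as your NTP setup
+allows):
+
+.. code-block:: ini
+
+ [mon]
+ mon clock drift allowed = .100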
+
+Client
+------
+
+``mon client hunt interval``
+
+:Description: The client will try a new monitor every ``N`` seconds until it
+ establishes a connection.
+
+:Type: Double
+:Default: ``3.0``
+
+
+``mon client ping interval``
+
+:Description: The client will ping the monitor every ``N`` seconds.
+:Type: Double
+:Default: ``10.0``
+
+
+``mon client max log entries per message``
+
+:Description: The maximum number of log entries a monitor will generate
+ per client message.
+
+:Type: Integer
+:Default: ``1000``
+
+
+``mon client bytes``
+
+:Description: The amount of client message data allowed in memory (in bytes).
+:Type: 64-bit Integer Unsigned
+:Default: ``100ul << 20``
+
+
+Pool settings
+=============
+Since version v0.94 there has been support for pool flags, which allow or disallow changes to be made to pools.
+
+Monitors can also disallow removal of pools if configured that way.
+
+``mon allow pool delete``
+
+:Description: Whether the monitors should allow pools to be removed, regardless of what the pool flags say.
+:Type: Boolean
+:Default: ``false``
+
+``osd pool default flag hashpspool``
+
+:Description: Set the hashpspool flag on new pools.
+:Type: Boolean
+:Default: ``true``
+
+``osd pool default flag nodelete``
+
+:Description: Set the nodelete flag on new pools, which prevents the pools from being removed.
+:Type: Boolean
+:Default: ``false``
+
+``osd pool default flag nopgchange``
+
+:Description: Set the nopgchange flag on new pools. Does not allow the number of PGs to be changed for a pool.
+:Type: Boolean
+:Default: ``false``
+
+``osd pool default flag nosizechange``
+
+:Description: Set the nosizechange flag on new pools. Does not allow the size of a pool to be changed.
+:Type: Boolean
+:Default: ``false``
+
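+For example, a sketch of protecting pools cluster-wide from accidental
+deletion and resizing by default (illustrative settings):
+
+.. code-block:: ini
+
+ [global]
+ mon allow pool delete = false
+ osd pool default flag nodelete = true
+ osd pool default flag nosizechange = true
+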
+For more information about the pool flags see `Pool values`_.
+
+Miscellaneous
+=============
+
+
+``mon max osd``
+
+:Description: The maximum number of OSDs allowed in the cluster.
+:Type: 32-bit Integer
+:Default: ``10000``
+
+``mon globalid prealloc``
+
+:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster.
+:Type: 32-bit Integer
+:Default: ``100``
+
+``mon subscribe interval``
+
+:Description: The refresh interval (in seconds) for subscriptions. The
+ subscription mechanism enables obtaining the cluster maps
+ and log information.
+
+:Type: Double
+:Default: ``300``
+
+
+``mon stat smooth intervals``
+
+:Description: Ceph will smooth statistics over the last ``N`` PG maps.
+:Type: Integer
+:Default: ``2``
+
+
+``mon probe timeout``
+
+:Description: Number of seconds the monitor will wait to find peers before bootstrapping.
+:Type: Double
+:Default: ``2.0``
+
+
+``mon daemon bytes``
+
+:Description: The message memory cap for metadata server and OSD messages (in bytes).
+:Type: 64-bit Integer Unsigned
+:Default: ``400ul << 20``
+
+
+``mon max log entries per event``
+
+:Description: The maximum number of log entries per event.
+:Type: Integer
+:Default: ``4096``
+
+
+``mon osd prime pg temp``
+
+:Description: Enables or disables priming the PGMap with the previous OSDs when an
+ ``out`` OSD comes back into the cluster. With the ``true`` setting,
+ clients will continue to use the previous OSDs until the newly
+ ``in`` OSDs have peered for that PG.
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd prime pg temp max time``
+
+:Description: How much time in seconds the monitor should spend trying to prime the
+ PGMap when an out OSD comes back into the cluster.
+:Type: Float
+:Default: ``0.5``
+
+
+``mon osd prime pg temp max time estimate``
+
+:Description: Maximum estimate of time spent on each PG before we prime all PGs
+ in parallel.
+:Type: Float
+:Default: ``0.25``
+
+
+``mon osd allow primary affinity``
+
+:Description: Allow ``primary_affinity`` to be set in the osdmap.
+:Type: Boolean
+:Default: False
+
+
+``mon osd pool ec fast read``
+
+:Description: Whether to turn on fast read for the pool. It will be used as
+ the default setting for newly created erasure-coded pools if
+ ``fast_read`` is not specified at create time.
+:Type: Boolean
+:Default: False
+
+
+``mon mds skip sanity``
+
+:Description: Skip safety assertions on FSMap (in case of bugs where we want to
+ continue anyway). The monitor terminates if the FSMap sanity check
+ fails, but this behavior can be disabled by enabling this option.
+:Type: Boolean
+:Default: False
+
+
+``mon max mdsmap epochs``
+
+:Description: The maximum amount of mdsmap epochs to trim during a single proposal.
+:Type: Integer
+:Default: 500
+
+
+``mon config key max entry size``
+
+:Description: The maximum size of config-key entry (in bytes)
+:Type: Integer
+:Default: 4096
+
+
+``mon scrub interval``
+
+:Description: How often (in seconds) the monitor scrubs its store by comparing
+ the stored checksums with the computed ones for all the stored
+ keys.
+:Type: Integer
+:Default: 3600*24
+
+
+``mon scrub max keys``
+
+:Description: The maximum number of keys to scrub each time.
+:Type: Integer
+:Default: 100
+
+
+``mon compact on start``
+
+:Description: Compact the database used as the Ceph Monitor store on
+ ``ceph-mon`` start. A manual compaction helps to shrink the
+ monitor database and improve its performance if the regular
+ compaction fails to work.
+:Type: Boolean
+:Default: False
+
+
+``mon compact on bootstrap``
+
+:Description: Compact the database used as the Ceph Monitor store on
+ bootstrap. Monitors start probing each other to create
+ a quorum after bootstrap. If a monitor times out before joining
+ the quorum, it will start over and bootstrap itself again.
+:Type: Boolean
+:Default: False
+
+
+``mon compact on trim``
+
+:Description: Compact a certain prefix (including paxos) when we trim its old states.
+:Type: Boolean
+:Default: True
+
+
+``mon cpu threads``
+
+:Description: Number of threads for performing CPU-intensive work on the monitor.
+:Type: Integer
+:Default: ``4``
+
+
+``mon osd mapping pgs per chunk``
+
+:Description: We calculate the mapping from placement group to OSDs in chunks.
+ This option specifies the number of placement groups per chunk.
+:Type: Integer
+:Default: 4096
+
+
+``mon osd max split count``
+
+:Description: Largest number of PGs per "involved" OSD to let split create.
+ When we increase the ``pg_num`` of a pool, the placement groups
+ will be split across all OSDs serving that pool. We want to avoid
+ extreme multipliers on PG splits.
+:Type: Integer
+:Default: 300
+
+
+``mon session timeout``
+
+:Description: The monitor will terminate inactive sessions that stay idle
+ beyond this time limit.
+:Type: Integer
+:Default: 300
+
+
+
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
+.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys
+.. _Ceph configuration file: ../ceph-conf/#monitors
+.. _Network Configuration Reference: ../network-config-ref
+.. _Monitor lookup through DNS: ../mon-lookup-dns
+.. _ACID: http://en.wikipedia.org/wiki/ACID
+.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons
+.. _Add/Remove a Monitor (ceph-deploy): ../../deployment/ceph-deploy-mon
+.. _Monitoring a Cluster: ../../operations/monitoring
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg
+.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap
+.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address
+.. _Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
+.. _Pool values: ../../operations/pools/#set-pool-values
diff --git a/src/ceph/doc/rados/configuration/mon-lookup-dns.rst b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst
new file mode 100644
index 0000000..e32b320
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst
@@ -0,0 +1,51 @@
+===============================
+Looking up Monitors through DNS
+===============================
+
+Since version 11.0.0 RADOS supports looking up Monitors through DNS.
+
+This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file.
+
+Using DNS SRV TCP records, clients are able to look up the monitors.
+
+This allows for less configuration on clients and monitors. Using a DNS update, clients and daemons can be made aware of changes in the monitor topology.
+
+By default, clients and daemons will look for the TCP service called *ceph-mon*, which is configured by the *mon_dns_srv_name* configuration directive.
+
+
+``mon dns srv name``
+
+:Description: The service name used when querying the DNS for the monitor hosts/addresses.
+:Type: String
+:Default: ``ceph-mon``
+
+Example
+-------
+When the DNS search domain is set to *example.com*, a DNS zone file might contain the following elements.
+
+First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA).
+
+::
+
+ mon1.example.com. AAAA 2001:db8::100
+ mon2.example.com. AAAA 2001:db8::200
+ mon3.example.com. AAAA 2001:db8::300
+
+::
+
+ mon1.example.com. A 192.168.0.1
+ mon2.example.com. A 192.168.0.2
+ mon3.example.com. A 192.168.0.3
+
+
+With those records in place, we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors.
+
+::
+
+ _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon1.example.com.
+ _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon2.example.com.
+ _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon3.example.com.
+
+In this case the Monitors are running on port *6789*, and their priority and weight are *10* and *60* respectively.
+
+The current implementation in clients and daemons will *only* respect the priority set in SRV records, and they will only connect to the monitors with the lowest-numbered priority. Targets with the same priority will be selected at random.
diff --git a/src/ceph/doc/rados/configuration/mon-osd-interaction.rst b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst
new file mode 100644
index 0000000..e335ff0
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst
@@ -0,0 +1,408 @@
+=====================================
+ Configuring Monitor/OSD Interaction
+=====================================
+
+.. index:: heartbeat
+
+After you have completed your initial Ceph configuration, you may deploy and run
+Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the
+:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage
+Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring
+reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph
+OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph
+Monitor doesn't receive reports, or if it receives reports of changes in the
+Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph
+Cluster Map`.
+
+Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon
+interaction. However, you may override the defaults. The following sections
+describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of
+monitoring the Ceph Storage Cluster.
+
+.. index:: heartbeat interval
+
+OSDs Check Heartbeats
+=====================
+
+Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6
+seconds. You can change the heartbeat interval by adding an ``osd heartbeat
+interval`` setting under the ``[osd]`` section of your Ceph configuration file,
+or by setting the value at runtime. If a neighboring Ceph OSD Daemon doesn't
+show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may
+consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph
+Monitor, which will update the Ceph Cluster Map. You may change this grace
+period by adding an ``osd heartbeat grace`` setting under the ``[mon]``
+and ``[osd]`` or ``[global]`` section of your Ceph configuration file,
+or by setting the value at runtime.
+
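+For example, a hedged sketch of relaxing both values on a busy network
+(illustrative values only):
+
+.. code-block:: ini
+
+ [global]
+ osd heartbeat interval = 6
+ osd heartbeat grace = 30
+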
+
+.. ditaa:: +---------+ +---------+
+ | OSD 1 | | OSD 2 |
+ +---------+ +---------+
+ | |
+ |----+ Heartbeat |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ | Check |
+ | Heartbeat |
+ |------------------->|
+ | |
+ |<-------------------|
+ | Heart Beating |
+ | |
+ |----+ Heartbeat |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ | Check |
+ | Heartbeat |
+ |------------------->|
+ | |
+ |----+ Grace |
+ | | Period |
+ |<---+ Exceeded |
+ | |
+ |----+ Mark |
+ | | OSD 2 |
+ |<---+ Down |
+
+
+.. index:: OSD down report
+
+OSDs Report Down OSDs
+=====================
+
+By default, two Ceph OSD Daemons from different hosts must report to the Ceph
+Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors
+acknowledge that the reported Ceph OSD Daemon is ``down``. There is a chance,
+however, that all the OSDs reporting the failure are hosted in a rack with a
+bad switch that has trouble connecting to another OSD. To avoid this sort of
+false alarm, Ceph treats the peers reporting a failure as a proxy for a
+potential "subcluster" of the overall cluster that is similarly laggy. This is
+clearly not true in all cases, but it sometimes helps localize the grace
+correction to a subset of the system that is unhappy. ``mon osd reporter
+subtree level`` groups the peers into "subclusters" by their common ancestor
+type in the CRUSH map. By default, only two reports from different subtrees
+are required to report another Ceph OSD Daemon ``down``. You can change the
+number of reporters from unique subtrees and the common ancestor type required
+to report a Ceph OSD Daemon ``down`` to a Ceph Monitor by adding the ``mon osd
+min down reporters`` and ``mon osd reporter subtree level`` settings under the
+``[mon]`` section of your Ceph configuration file, or by setting the value at
+runtime.
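+
+For instance, a sketch that requires three reporters grouped at the rack
+level (hypothetical values for illustration, not the defaults):
+
+.. code-block:: ini
+
+   [mon]
+   mon osd min down reporters = 3
+   mon osd reporter subtree level = rack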
+
+
+.. ditaa:: +---------+ +---------+ +---------+
+ | OSD 1 | | OSD 2 | | Monitor |
+ +---------+ +---------+ +---------+
+ | | |
+ | OSD 3 Is Down | |
+ |---------------+--------------->|
+ | | |
+ | | |
+ | | OSD 3 Is Down |
+ | |--------------->|
+ | | |
+ | | |
+ | | |---------+ Mark
+ | | | | OSD 3
+ | | |<--------+ Down
+
+
+.. index:: peering failure
+
+OSDs Report Peering Failure
+===========================
+
+If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its
+Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for
+the most recent copy of the cluster map every 30 seconds. You can change the
+Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval``
+setting under the ``[osd]`` section of your Ceph configuration file, or by
+setting the value at runtime.
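+
+Note that, as with the other intervals in this section, the value can be
+changed at runtime instead of in the configuration file. One way to do this
+(a sketch, assuming you run it from a host with an admin keyring) is with
+``injectargs``::
+
+    ceph tell osd.* injectargs '--osd-mon-heartbeat-interval 30'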
+
+.. ditaa:: +---------+ +---------+ +-------+ +---------+
+ | OSD 1 | | OSD 2 | | OSD 3 | | Monitor |
+ +---------+ +---------+ +-------+ +---------+
+ | | | |
+ | Request To | | |
+ | Peer | | |
+ |-------------->| | |
+ |<--------------| | |
+ | Peering | |
+ | | |
+ | Request To | |
+ | Peer | |
+ |----------------------------->| |
+ | |
+ |----+ OSD Monitor |
+ | | Heartbeat |
+ |<---+ Interval Exceeded |
+ | |
+ | Failed to Peer with OSD 3 |
+ |-------------------------------------------->|
+ |<--------------------------------------------|
+ | Receive New Cluster Map |
+
+
+.. index:: OSD status
+
+OSDs Report Their Status
+========================
+
+If a Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will
+consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout``
+elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor within 5 seconds
+of a reportable event such as a failure, a change in placement group stats, a
+change in ``up_thru``, or a boot. You can change the Ceph OSD Daemon minimum
+report interval by adding an ``osd mon report interval min`` setting under the
+``[osd]`` section of your Ceph configuration file, or by setting the value at
+runtime. A Ceph OSD Daemon also sends a report to a Ceph Monitor every 120
+seconds irrespective of whether any notable changes occur. You can change this
+maximum report interval by adding an ``osd mon report interval max`` setting
+under the ``[osd]`` section of your Ceph configuration file, or by setting the
+value at runtime.
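+
+A sketch of both report intervals in ``ceph.conf`` (the values shown are the
+defaults):
+
+.. code-block:: ini
+
+   [osd]
+   # report within 5 seconds of a reportable event
+   osd mon report interval min = 5
+   # report at least every 120 seconds regardless of events
+   osd mon report interval max = 120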
+
+
+.. ditaa:: +---------+ +---------+
+ | OSD 1 | | Monitor |
+ +---------+ +---------+
+ | |
+ |----+ Report Min |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ |----+ Reportable |
+ | | Event |
+ |<---+ Occurs |
+ | |
+ | Report To |
+ | Monitor |
+ |------------------->|
+ | |
+ |----+ Report Max |
+ | | Interval |
+ |<---+ Exceeded |
+ | |
+ | Report To |
+ | Monitor |
+ |------------------->|
+ | |
+ |----+ Monitor |
+ | | Fails |
+ |<---+ |
+ +----+ Monitor OSD
+ | | Report Timeout
+ |<---+ Exceeded
+ |
+ +----+ Mark
+ | | OSD 1
+ |<---+ Down
+
+
+
+
+Configuration Settings
+======================
+
+When modifying heartbeat settings, you should include them in the ``[global]``
+section of your configuration file.
+
+.. index:: monitor heartbeat
+
+Monitor Settings
+----------------
+
+``mon osd min up ratio``
+
+:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will
+ mark Ceph OSD Daemons ``down``.
+
+:Type: Double
+:Default: ``.3``
+
+
+``mon osd min in ratio``
+
+:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will
+ mark Ceph OSD Daemons ``out``.
+
+:Type: Double
+:Default: ``.75``
+
+
+``mon osd laggy halflife``
+
+:Description: The number of seconds laggy estimates will decay.
+:Type: Integer
+:Default: ``60*60``
+
+
+``mon osd laggy weight``
+
+:Description: The weight for new samples in laggy estimation decay.
+:Type: Double
+:Default: ``0.3``
+
+
+
+``mon osd laggy max interval``
+
+:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds).
+ Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of
+ a certain OSD. This value will be used to calculate the grace time for
+ that OSD.
+:Type: Integer
+:Default: 300
+
+``mon osd adjust heartbeat grace``
+
+:Description: If set to ``true``, Ceph will scale the heartbeat grace period
+              based on laggy estimations.
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd adjust down out interval``
+
+:Description: If set to ``true``, Ceph will scale ``mon osd down out interval``
+              based on laggy estimations.
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd auto mark in``
+
+:Description: Ceph will mark any booting Ceph OSD Daemons as ``in``
+ the Ceph Storage Cluster.
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mon osd auto mark auto out in``
+
+:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out``
+ of the Ceph Storage Cluster as ``in`` the cluster.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd auto mark new in``
+
+:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the
+ Ceph Storage Cluster.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd down out interval``
+
+:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon
+ ``down`` and ``out`` if it doesn't respond.
+
+:Type: 32-bit Integer
+:Default: ``600``
+
+
+``mon osd down out subtree limit``
+
+:Description: The smallest :term:`CRUSH` unit type that Ceph will **not**
+ automatically mark out. For instance, if set to ``host`` and if
+ all OSDs of a host are down, Ceph will not automatically mark out
+ these OSDs.
+
+:Type: String
+:Default: ``rack``
+
+
+``mon osd report timeout``
+
+:Description: The grace period in seconds before declaring
+ unresponsive Ceph OSD Daemons ``down``.
+
+:Type: 32-bit Integer
+:Default: ``900``
+
+``mon osd min down reporters``
+
+:Description: The minimum number of Ceph OSD Daemons required to report a
+ ``down`` Ceph OSD Daemon.
+
+:Type: 32-bit Integer
+:Default: ``2``
+
+
+``mon osd reporter subtree level``
+
+:Description: The CRUSH bucket level in which unique reporters are counted.
+              OSDs send failure reports to a monitor when they find that a
+              peer is not responsive, and the monitor marks the reported OSD
+              ``down`` (and later ``out``) after a grace period.
+:Type: String
+:Default: ``host``
+
+
+.. index:: OSD heartbeat
+
+OSD Settings
+------------
+
+``osd heartbeat address``
+
+:Description: A Ceph OSD Daemon's network address for heartbeats.
+:Type: Address
+:Default: The host address.
+
+
+``osd heartbeat interval``
+
+:Description: How often a Ceph OSD Daemon pings its peers (in seconds).
+:Type: 32-bit Integer
+:Default: ``6``
+
+
+``osd heartbeat grace``
+
+:Description: The elapsed time, in seconds, after which a Ceph OSD Daemon that
+              hasn't shown a heartbeat is considered ``down`` by the Ceph
+              Storage Cluster. This setting must appear in both the ``[mon]``
+              and ``[osd]`` (or ``[global]``) sections so that it is read by
+              both the MON and OSD daemons.
+:Type: 32-bit Integer
+:Default: ``20``
+
+
+``osd mon heartbeat interval``
+
+:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no
+ Ceph OSD Daemon peers.
+
+:Type: 32-bit Integer
+:Default: ``30``
+
+
+``osd mon report interval max``
+
+:Description: The maximum time in seconds that a Ceph OSD Daemon can wait before
+ it must report to a Ceph Monitor.
+
+:Type: 32-bit Integer
+:Default: ``120``
+
+
+``osd mon report interval min``
+
+:Description: The minimum number of seconds a Ceph OSD Daemon may wait
+ from startup or another reportable event before reporting
+ to a Ceph Monitor.
+
+:Type: 32-bit Integer
+:Default: ``5``
+:Valid Range: Should be less than ``osd mon report interval max``
+
+
+``osd mon ack timeout``
+
+:Description: The number of seconds to wait for a Ceph Monitor to acknowledge a
+ request for statistics.
+
+:Type: 32-bit Integer
+:Default: ``30``
diff --git a/src/ceph/doc/rados/configuration/ms-ref.rst b/src/ceph/doc/rados/configuration/ms-ref.rst
new file mode 100644
index 0000000..55d009e
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/ms-ref.rst
@@ -0,0 +1,154 @@
+===========
+ Messaging
+===========
+
+General Settings
+================
+
+``ms tcp nodelay``
+
+:Description: Disables Nagle's algorithm on messenger TCP sessions.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``ms initial backoff``
+
+:Description: The initial time to wait before reconnecting on a fault.
+:Type: Double
+:Required: No
+:Default: ``.2``
+
+
+``ms max backoff``
+
+:Description: The maximum time to wait before reconnecting on a fault.
+:Type: Double
+:Required: No
+:Default: ``15.0``
+
+
+``ms nocrc``
+
+:Description: Disables CRC on network messages. May increase performance if CPU limited.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``ms die on bad msg``
+
+:Description: Debug option; do not configure.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``ms dispatch throttle bytes``
+
+:Description: Throttles total size of messages waiting to be dispatched.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``100 << 20``
+
+
+``ms bind ipv6``
+
+:Description: Enable if you want your daemons to bind to IPv6 addresses instead of IPv4 ones. (Not required if you specify a daemon or cluster IP.)
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``ms rwthread stack bytes``
+
+:Description: Debug option for stack size; do not configure.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``1024 << 10``
+
+
+``ms tcp read timeout``
+
+:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``900``
+
+
+``ms inject socket failures``
+
+:Description: Debug option; do not configure.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``0``
+
+Async messenger options
+=======================
+
+
+``ms async transport type``
+
+:Description: Transport type used by the Async Messenger. Can be ``posix``,
+              ``dpdk`` or ``rdma``. ``posix`` uses standard TCP/IP networking
+              and is the default. The other transports may be experimental and
+              support may be limited.
+:Type: String
+:Required: No
+:Default: ``posix``
+
+
+``ms async op threads``
+
+:Description: Initial number of worker threads used by each Async Messenger
+              instance. Should be at least equal to the highest number of
+              replicas, but you can decrease it if you are low on CPU cores
+              and/or you host a lot of OSDs on a single server.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``3``
+
+
+``ms async max op threads``
+
+:Description: Maximum number of worker threads used by each Async Messenger
+              instance. Set it to lower values when your machine has a limited
+              CPU count, and increase it when your CPUs are underutilized
+              (i.e., one or more CPUs are constantly at 100% load during I/O
+              operations).
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``5``
+
+
+``ms async set affinity``
+
+:Description: Set to true to bind Async Messenger workers to particular CPU cores.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``ms async affinity cores``
+
+:Description: When ``ms async set affinity`` is true, this string specifies
+              how Async Messenger workers are bound to CPU cores. For example,
+              "0,2" will bind workers #1 and #2 to CPU cores #0 and #2,
+              respectively. NOTE: when manually setting affinity, make sure
+              not to assign workers to processors that are virtual CPUs
+              created by Hyper-Threading or similar technology, because they
+              are slower than regular CPU cores.
+:Type: String
+:Required: No
+:Default: ``(empty)``
+
+
+``ms async send inline``
+
+:Description: Send messages directly from the thread that generated them
+              instead of queuing and sending from the Async Messenger thread.
+              This option is known to decrease performance on systems with a
+              lot of CPU cores, so it's disabled by default.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
diff --git a/src/ceph/doc/rados/configuration/network-config-ref.rst b/src/ceph/doc/rados/configuration/network-config-ref.rst
new file mode 100644
index 0000000..2d7f9d6
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/network-config-ref.rst
@@ -0,0 +1,494 @@
+=================================
+ Network Configuration Reference
+=================================
+
+Network configuration is critical for building a high performance :term:`Ceph
+Storage Cluster`. The Ceph Storage Cluster does not perform request routing or
+dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make
+requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication
+on behalf of Ceph Clients, which means replication and other factors impose
+additional loads on Ceph Storage Cluster networks.
+
+Our Quick Start configurations provide a trivial `Ceph configuration file`_ that
+sets monitor IP addresses and daemon host names only. Unless you specify a
+cluster network, Ceph assumes a single "public" network. Ceph functions just
+fine with a public network only, but you may see significant performance
+improvement with a second "cluster" network in a large cluster.
+
+We recommend running a Ceph Storage Cluster with two networks: a public
+(front-side) network and a cluster (back-side) network. To support two networks,
+each :term:`Ceph Node` will need to have more than one NIC. See `Hardware
+Recommendations - Networks`_ for additional details.
+
+.. ditaa::
+ +-------------+
+ | Ceph Client |
+ +----*--*-----+
+ | ^
+ Request | : Response
+ v |
+ /----------------------------------*--*-------------------------------------\
+ | Public Network |
+ \---*--*------------*--*-------------*--*------------*--*------------*--*---/
+ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
+ | | | | | | | | | |
+ | : | : | : | : | :
+ v v v v v v v v v v
+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+
+ | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD |
+ +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+
+ ^ ^ ^ ^ ^ ^
+ The cluster network relieves | | | | | |
+ OSD replication and heartbeat | : | : | :
+ traffic from the public network. v v v v v v
+ /------------------------------------*--*------------*--*------------*--*---\
+ | cCCC Cluster Network |
+ \---------------------------------------------------------------------------/
+
+
+There are several reasons to consider operating two separate networks:
+
+#. **Performance:** Ceph OSD Daemons handle data replication for the Ceph
+ Clients. When Ceph OSD Daemons replicate data more than once, the network
+ load between Ceph OSD Daemons easily dwarfs the network load between Ceph
+ Clients and the Ceph Storage Cluster. This can introduce latency and
+ create a performance problem. Recovery and rebalancing can
+ also introduce significant latency on the public network. See
+ `Scalability and High Availability`_ for additional details on how Ceph
+ replicates data. See `Monitor / OSD Interaction`_ for details on heartbeat
+ traffic.
+
+#. **Security**: While most people are generally civil, a very tiny segment of
+ the population likes to engage in what's known as a Denial of Service (DoS)
+ attack. When traffic between Ceph OSD Daemons gets disrupted, placement
+ groups may no longer reflect an ``active + clean`` state, which may prevent
+ users from reading and writing data. A great way to defeat this type of
+ attack is to maintain a completely separate cluster network that doesn't
+ connect directly to the internet. Also, consider using `Message Signatures`_
+ to defeat spoofing attacks.
+
+
+IP Tables
+=========
+
+By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may
+configure this range at your discretion. Before configuring your IP tables,
+check the default ``iptables`` configuration. ::
+
+ sudo iptables -L
+
+Some Linux distributions include rules that reject all inbound requests
+except SSH from all network interfaces. For example::
+
+ REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
+
+You will need to delete these rules on both your public and cluster networks
+initially, and replace them with appropriate rules when you are ready to
+harden the ports on your Ceph Nodes.
+
+
+Monitor IP Tables
+-----------------
+
+Ceph Monitors listen on port ``6789`` by default. Additionally, Ceph Monitors
+always operate on the public network. When you add the rule using the example
+below, make sure you replace ``{iface}`` with the public network interface
+(e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP address of the
+public network and ``{netmask}`` with the netmask for the public network. ::
+
+ sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT
+
+
+MDS IP Tables
+-------------
+
+A :term:`Ceph Metadata Server` listens on the first available port on the public
+network beginning at port 6800. Note that this behavior is not deterministic, so
+if you are running more than one OSD or MDS on the same host, or if you restart
+the daemons within a short window of time, the daemons will bind to higher
+ports. You should open the entire 6800-7300 range by default. When you add the
+rule using the example below, make sure you replace ``{iface}`` with the public
+network interface (e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP
+address of the public network and ``{netmask}`` with the netmask of the public
+network.
+
+For example::
+
+ sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT
+
+
+OSD IP Tables
+-------------
+
+By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node
+beginning at port 6800. Note that this behavior is not deterministic, so if you
+are running more than one OSD or MDS on the same host, or if you restart the
+daemons within a short window of time, the daemons will bind to higher ports.
+Each Ceph OSD Daemon on a Ceph Node may use up to four ports:
+
+#. One for talking to clients and monitors.
+#. One for sending data to other OSDs.
+#. Two for heartbeating on each interface.
+
+.. ditaa::
+ /---------------\
+ | OSD |
+ | +---+----------------+-----------+
+ | | Clients & Monitors | Heartbeat |
+ | +---+----------------+-----------+
+ | |
+ | +---+----------------+-----------+
+ | | Data Replication | Heartbeat |
+ | +---+----------------+-----------+
+ | cCCC |
+ \---------------/
+
+When a daemon fails and restarts without letting go of the port, the restarted
+daemon will bind to a new port. You should open the entire 6800-7300 port range
+to handle this possibility.
+
+If you set up separate public and cluster networks, you must add rules for both
+the public network and the cluster network, because clients will connect using
+the public network and other Ceph OSD Daemons will connect using the cluster
+network. When you add the rule using the example below, make sure you replace
+``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.),
+``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the
+public or cluster network. For example::
+
+ sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT
+
+.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the
+ Ceph OSD Daemons, you can consolidate the public network configuration step.
+
+
+Ceph Networks
+=============
+
+To configure Ceph networks, you must add a network configuration to the
+``[global]`` section of the configuration file. Our 5-minute Quick Start
+provides a trivial `Ceph configuration file`_ that assumes one public network
+with client and server on the same network and subnet. Ceph functions just fine
+with a public network only. However, Ceph allows you to establish much more
+specific criteria, including multiple IP networks and subnet masks for your
+public network. You can also establish a separate cluster network to handle OSD
+heartbeat, object replication and recovery traffic. Don't confuse the IP
+addresses you set in your configuration with the public-facing IP addresses
+network clients may use to access your service. Typical internal IP networks are
+often ``192.168.0.0`` or ``10.0.0.0``.
+
+.. tip:: If you specify more than one IP address and subnet mask for
+ either the public or the cluster network, the subnets within the network
+ must be capable of routing to each other. Additionally, make sure you
+ include each IP address/subnet in your IP tables and open ports for them
+ as necessary.
+
+.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``).
+
+When you have configured your networks, you may restart your cluster or restart
+each daemon. Ceph daemons bind dynamically, so you do not have to restart the
+entire cluster at once if you change your network configuration.
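+
+Putting this together, a sketch of a two-network ``[global]`` section (the
+subnets shown are placeholders; substitute your own):
+
+.. code-block:: ini
+
+   [global]
+   public network = 192.168.0.0/24
+   cluster network = 10.0.0.0/24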
+
+
+Public Network
+--------------
+
+To configure a public network, add the following option to the ``[global]``
+section of your Ceph configuration file.
+
+.. code-block:: ini
+
+ [global]
+ ...
+ public network = {public-network/netmask}
+
+
+Cluster Network
+---------------
+
+If you declare a cluster network, OSDs will route heartbeat, object replication
+and recovery traffic over the cluster network. This may improve performance
+compared to using a single network. To configure a cluster network, add the
+following option to the ``[global]`` section of your Ceph configuration file.
+
+.. code-block:: ini
+
+ [global]
+ ...
+ cluster network = {cluster-network/netmask}
+
+We prefer that the cluster network is **NOT** reachable from the public network
+or the Internet for added security.
+
+
+Ceph Daemons
+============
+
+Ceph has one network configuration requirement that applies to all daemons: the
+Ceph configuration file **MUST** specify the ``host`` for each daemon. Ceph also
+requires that a Ceph configuration file specify the monitor IP address and its
+port.
+
+.. important:: Some deployment tools (e.g., ``ceph-deploy``, Chef) may create a
+ configuration file for you. **DO NOT** set these values if the deployment
+ tool does it for you.
+
+.. tip:: The ``host`` setting is the short name of the host (i.e., not
+ an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on
+ the command line to retrieve the name of the host.
+
+
+.. code-block:: ini
+
+ [mon.a]
+
+ host = {hostname}
+ mon addr = {ip-address}:6789
+
+ [osd.0]
+ host = {hostname}
+
+
+You do not have to set the host IP address for a daemon. If you have a static IP
+configuration and both public and cluster networks running, the Ceph
+configuration file may specify the IP address of the host for each daemon. To
+set a static IP address for a daemon, the following option(s) should appear in
+the daemon instance sections of your ``ceph.conf`` file.
+
+.. code-block:: ini
+
+ [osd.0]
+ public addr = {host-public-ip-address}
+ cluster addr = {host-cluster-ip-address}
+
+
+.. topic:: One NIC OSD in a Two Network Cluster
+
+ Generally, we do not recommend deploying an OSD host with a single NIC in a
+ cluster with two networks. However, you may accomplish this by forcing the
+ OSD host to operate on the public network by adding a ``public addr`` entry
+ to the ``[osd.n]`` section of the Ceph configuration file, where ``n``
+ refers to the number of the OSD with one NIC. Additionally, the public
+ network and cluster network must be able to route traffic to each other,
+ which we don't recommend for security reasons.
+
+
+Network Config Settings
+=======================
+
+Network configuration settings are not required. Ceph assumes a public network
+with all hosts operating on it unless you specifically configure a cluster
+network.
+
+
+Public Network
+--------------
+
+The public network configuration allows you to specifically define IP
+addresses and subnets for the public network. You may specifically assign
+static IP addresses or override ``public network`` settings using the
+``public addr`` setting for a specific daemon.
+
+``public network``
+
+:Description: The IP address and netmask of the public (front-side) network
+ (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify
+ comma-delimited subnets.
+
+:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]``
+:Required: No
+:Default: N/A
+
+
+``public addr``
+
+:Description: The IP address for the public (front-side) network.
+ Set for each daemon.
+
+:Type: IP Address
+:Required: No
+:Default: N/A
+
+
+
+Cluster Network
+---------------
+
+The cluster network configuration allows you to declare a cluster network, and
+specifically define IP addresses and subnets for the cluster network. You may
+specifically assign static IP addresses or override ``cluster network``
+settings using the ``cluster addr`` setting for specific OSD daemons.
+
+
+``cluster network``
+
+:Description: The IP address and netmask of the cluster (back-side) network
+ (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify
+ comma-delimited subnets.
+
+:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]``
+:Required: No
+:Default: N/A
+
+
+``cluster addr``
+
+:Description: The IP address for the cluster (back-side) network.
+ Set for each daemon.
+
+:Type: Address
+:Required: No
+:Default: N/A
+
+
+Bind
+----
+
+Bind settings set the default port ranges Ceph OSD and MDS daemons use. The
+default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration
+allows you to use the configured port range.
+
+You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4
+addresses.
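+
+A sketch combining these settings (the port values shown are the defaults;
+``ms bind ipv6`` is enabled here purely for illustration):
+
+.. code-block:: ini
+
+   [global]
+   ms bind port min = 6800
+   ms bind port max = 7300
+   ms bind ipv6 = true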
+
+
+``ms bind port min``
+
+:Description: The minimum port number to which an OSD or MDS daemon will bind.
+:Type: 32-bit Integer
+:Default: ``6800``
+:Required: No
+
+
+``ms bind port max``
+
+:Description: The maximum port number to which an OSD or MDS daemon will bind.
+:Type: 32-bit Integer
+:Default: ``7300``
+:Required: No.
+
+
+``ms bind ipv6``
+
+:Description: Enables Ceph daemons to bind to IPv6 addresses. Currently the
+ messenger *either* uses IPv4 or IPv6, but it cannot do both.
+:Type: Boolean
+:Default: ``false``
+:Required: No
+
+``public bind addr``
+
+:Description: In some dynamic deployments the Ceph MON daemon might bind
+              to an IP address locally that is different from the ``public addr``
+              advertised to other peers in the network. The environment must
+              ensure that routing rules are set correctly. If ``public bind addr``
+              is set, the Ceph MON daemon will bind to it locally and use
+              ``public addr`` in the monmaps to advertise its address to peers.
+              This behavior is limited to the MON daemon.
+
+:Type: IP Address
+:Required: No
+:Default: N/A
+
+
+
+Hosts
+-----
+
+Ceph expects at least one monitor declared in the Ceph configuration file, with
+a ``mon addr`` setting under each declared monitor. Ceph expects a ``host``
+setting under each declared monitor, metadata server and OSD in the Ceph
+configuration file. Optionally, a monitor can be assigned a priority; if
+priorities are specified, clients will always prefer to connect to the
+monitor with the lowest priority value.
+
+
+``mon addr``
+
+:Description: A list of ``{hostname}:{port}`` entries that clients can use to
+ connect to a Ceph monitor. If not set, Ceph searches ``[mon.*]``
+ sections.
+
+:Type: String
+:Required: No
+:Default: N/A
+
+``mon priority``
+
+:Description: The priority of the declared monitor. The lower the value, the
+              more preferred the monitor is when a client selects one while
+              trying to connect to the cluster.
+
+:Type: Unsigned 16-bit Integer
+:Required: No
+:Default: 0
+
+``host``
+
+:Description: The hostname. Use this setting for specific daemon instances
+ (e.g., ``[osd.0]``).
+
+:Type: String
+:Required: Yes, for daemon instances.
+:Default: ``localhost``
+
+.. tip:: Do not use ``localhost``. To get your host name, execute
+ ``hostname -s`` on your command line and use the name of your host
+ (to the first period, not the fully-qualified domain name).
+
+.. important:: You should not specify any value for ``host`` when using a third
+ party deployment system that retrieves the host name for you.
+
+
+
+TCP
+---
+
+Ceph disables TCP buffering by default.
+
+
+``ms tcp nodelay``
+
+:Description: Ceph enables ``ms tcp nodelay`` so that each request is sent
+ immediately (no buffering). Disabling `Nagle's algorithm`_
+ increases network traffic, which can introduce latency. If you
+ experience large numbers of small packets, you may try
+ disabling ``ms tcp nodelay``.
+
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+
+``ms tcp rcvbuf``
+
+:Description: The size of the socket buffer on the receiving end of a network
+              connection. Disabled by default.
+
+:Type: 32-bit Integer
+:Required: No
+:Default: ``0``
+
+
+
+``ms tcp read timeout``
+
+:Description: If a client or daemon makes a request to another Ceph daemon and
+ does not drop an unused connection, the ``ms tcp read timeout``
+ defines the connection as idle after the specified number
+ of seconds.
+
+:Type: Unsigned 64-bit Integer
+:Required: No
+:Default: ``900`` (15 minutes).
+
+
+
+.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
+.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks
+.. _Ceph configuration file: ../../../start/quick-ceph-deploy/#create-a-cluster
+.. _hardware recommendations: ../../../start/hardware-recommendations
+.. _Monitor / OSD Interaction: ../mon-osd-interaction
+.. _Message Signatures: ../auth-config-ref#signatures
+.. _CIDR: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing
+.. _Nagle's Algorithm: http://en.wikipedia.org/wiki/Nagle's_algorithm
diff --git a/src/ceph/doc/rados/configuration/osd-config-ref.rst b/src/ceph/doc/rados/configuration/osd-config-ref.rst
new file mode 100644
index 0000000..fae7078
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/osd-config-ref.rst
@@ -0,0 +1,1105 @@
+======================
+ OSD Config Reference
+======================
+
+.. index:: OSD; configuration
+
+You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
+Daemons can use the default values and a very minimal configuration. A minimal
+Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
+uses default values for nearly everything else.
+
+Ceph OSD Daemons are numerically identified in incremental fashion, beginning
+with ``0`` using the following convention. ::
+
+ osd.0
+ osd.1
+ osd.2
+
+In a configuration file, you may specify settings for all Ceph OSD Daemons in
+the cluster by adding configuration settings to the ``[osd]`` section of your
+configuration file. To add settings directly to a specific Ceph OSD Daemon
+(e.g., ``host``), enter it in an OSD-specific section of your configuration
+file. For example:
+
+.. code-block:: ini
+
+ [osd]
+ osd journal size = 1024
+
+ [osd.0]
+ host = osd-host-a
+
+ [osd.1]
+ host = osd-host-b
+
+
+.. index:: OSD; config settings
+
+General Settings
+================
+
+The following settings provide a Ceph OSD Daemon's ID, and determine paths to
+data and journals. Ceph deployment scripts typically generate the UUID
+automatically. We **DO NOT** recommend changing the default paths for data or
+journals, as it makes it more problematic to troubleshoot Ceph later.
+
+The journal size should be at least twice the product of the expected drive
+speed multiplied by ``filestore max sync interval``. However, the most common
+practice is to partition the journal drive (often an SSD), and mount it such
+that Ceph uses the entire partition for the journal.
+
+
+``osd uuid``
+
+:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
+:Type: UUID
+:Default: The UUID.
+:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
+ applies to the entire cluster.
+
+
+``osd data``
+
+:Description: The path to the OSD's data. You must create the directory when
+              deploying Ceph. You should mount a drive for OSD data at this
+              mount point. We do not recommend changing the default.
+
+:Type: String
+:Default: ``/var/lib/ceph/osd/$cluster-$id``
+
+
+``osd max write size``
+
+:Description: The maximum size of a write in megabytes.
+:Type: 32-bit Integer
+:Default: ``90``
+
+
+``osd client message size cap``
+
+:Description: The largest client data message allowed in memory.
+:Type: 64-bit Unsigned Integer
+:Default: 500 MB. ``500*1024L*1024L``
+
+
+``osd class dir``
+
+:Description: The class path for RADOS class plug-ins.
+:Type: String
+:Default: ``$libdir/rados-classes``
+
+
+.. index:: OSD; file system
+
+File System Settings
+====================
+Ceph builds and mounts file systems which are used for Ceph OSDs.
+
+``osd mkfs options {fs-type}``
+
+:Description: Options used when creating a new Ceph OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``-f -i 2048``
+:Default for other file systems: {empty string}
+
+For example::
+
+  osd mkfs options xfs = -f -d agcount=24
+
+``osd mount options {fs-type}``
+
+:Description: Options used when mounting a Ceph OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``rw,noatime,inode64``
+:Default for other file systems: ``rw, noatime``
+
+For example::
+
+  osd mount options xfs = rw, noatime, inode64, logbufs=8
+
+
+.. index:: OSD; journal settings
+
+Journal Settings
+================
+
+By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
+the following path::
+
+  /var/lib/ceph/osd/$cluster-$id/journal
+
+Without performance optimization, Ceph stores the journal on the same disk as
+the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
+a separate disk to store journal data (e.g., a solid state drive delivers high
+performance journaling).
+
+Ceph's default ``osd journal size`` is 5120 (5 gigabytes); depending on your
+expected throughput you will likely want to set this in your ``ceph.conf``
+file. To size the journal, take the product of ``filestore max sync interval``
+and the expected throughput, and multiply the product by two (2)::
+
+ osd journal size = {2 * (expected throughput * filestore max sync interval)}
+
+The expected throughput number should include the expected disk throughput
+(i.e., sustained data transfer rate), and network throughput. For example,
+a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()``
+of the disk and network throughput should provide a reasonable expected
+throughput. Some users just start off with a 10GB journal size. For
+example::
+
+ osd journal size = 10000
+
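+As a worked example of the formula above (with assumed numbers rather than
+recommendations): a disk sustaining 100 MB/s combined with a ``filestore max
+sync interval`` of 5 seconds yields ``2 * (100 * 5) = 1000`` MB, so ``osd
+journal size = 1000`` would be a reasonable lower bound.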
+
+``osd journal``
+
+:Description: The path to the OSD's journal. This may be a path to a file or a
+ block device (such as a partition of an SSD). If it is a file,
+ you must create the directory to contain it. We recommend using a
+ drive separate from the ``osd data`` drive.
+
+:Type: String
+:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``
+
+
+``osd journal size``
+
+:Description: The size of the journal in megabytes. If this is 0, and the
+ journal is a block device, the entire block device is used.
+ Since v0.54, this is ignored if the journal is a block device,
+ and the entire block device is used.
+
+:Type: 32-bit Integer
+:Default: ``5120``
+:Recommended: Begin with 1GB. Should be at least twice the product of the
+ expected speed multiplied by ``filestore max sync interval``.
+
+
+See `Journal Config Reference`_ for additional details.
+
+
+Monitor OSD Interaction
+=======================
+
+Ceph OSD Daemons check each other's heartbeats and report to monitors
+periodically. Ceph can use default values in many cases. However, if your
+network has latency issues, you may need to adopt longer intervals. See
+`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
+
+
+Data Placement
+==============
+
+See `Pool & PG Config Reference`_ for details.
+
+
+.. index:: OSD; scrubbing
+
+Scrubbing
+=========
+
+In addition to making multiple copies of objects, Ceph ensures data integrity by
+scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
+object storage layer. For each placement group, Ceph generates a catalog of all
+objects and compares each primary object and its replicas to ensure that no
+objects are missing or mismatched. Light scrubbing (daily) checks the object
+size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
+to ensure data integrity.
+
+Scrubbing is important for maintaining data integrity, but it can reduce
+performance. You can adjust the following settings to increase or decrease
+scrubbing operations.
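+
+For example, a sketch that confines scheduled scrubs to a nightly window
+(the hour values are hypothetical; the defaults of ``0`` and ``24`` allow
+scrubbing at any time):
+
+.. code-block:: ini
+
+   [osd]
+   osd scrub begin hour = 1
+   osd scrub end hour = 7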
+
+
+``osd max scrubs``
+
+:Description: The maximum number of simultaneous scrub operations for
+ a Ceph OSD Daemon.
+
+:Type: 32-bit Int
+:Default: ``1``
+
+``osd scrub begin hour``
+
+:Description: The time of day for the lower bound when a scheduled scrub can be
+ performed.
+:Type: Integer in the range of 0 to 24
+:Default: ``0``
+
+
+``osd scrub end hour``
+
+:Description: The time of day for the upper bound when a scheduled scrub can be
+              performed. Along with ``osd scrub begin hour``, this defines a
+              time window in which scrubs can happen. However, a scrub will be
+              performed regardless of the time window as long as the placement
+              group's scrub interval exceeds ``osd scrub max interval``.
+:Type: Integer in the range of 0 to 24
+:Default: ``24``
+
+
+``osd scrub during recovery``
+
+:Description: Allow scrub during recovery. Setting this to ``false`` will
+              disable scheduling new scrubs (and deep scrubs) while there is
+              active recovery. Already running scrubs will continue. This
+              might be useful to reduce load on busy clusters.
+:Type: Boolean
+:Default: ``true``
+
+
+``osd scrub thread timeout``
+
+:Description: The maximum time in seconds before timing out a scrub thread.
+:Type: 32-bit Integer
+:Default: ``60``
+
+
+``osd scrub finalize thread timeout``
+
+:Description: The maximum time in seconds before timing out a scrub finalize
+ thread.
+
+:Type: 32-bit Integer
+:Default: ``60*10``
+
+
+``osd scrub load threshold``
+
+:Description: The maximum load. Ceph will not scrub when the system load
+ (as defined by ``getloadavg()``) is higher than this number.
+ Default is ``0.5``.
+
+:Type: Float
+:Default: ``0.5``
+
+
+``osd scrub min interval``
+
+:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
+ when the Ceph Storage Cluster load is low.
+
+:Type: Float
+:Default: Once per day. ``60*60*24``
+
+
+``osd scrub max interval``
+
+:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
+ irrespective of cluster load.
+
+:Type: Float
+:Default: Once per week. ``7*60*60*24``
+
+
+``osd scrub chunk min``
+
+:Description: The minimal number of object store chunks to scrub during a
+              single operation. Ceph blocks writes to a single chunk during
+              scrub.
+
+:Type: 32-bit Integer
+:Default: 5
+
+
+``osd scrub chunk max``
+
+:Description: The maximum number of object store chunks to scrub during a single operation.
+
+:Type: 32-bit Integer
+:Default: 25
+
+
+``osd scrub sleep``
+
+:Description: Time to sleep before scrubbing the next group of chunks. Increasing
+              this value will slow down the whole scrub operation, while client
+              operations will be less impacted.
+
+:Type: Float
+:Default: 0
+
+
+``osd deep scrub interval``
+
+:Description: The interval for "deep" scrubbing (fully reading all data). The
+ ``osd scrub load threshold`` does not affect this setting.
+
+:Type: Float
+:Default: Once per week. ``60*60*24*7``
+
+
+``osd scrub interval randomize ratio``
+
+:Description: Add a random delay to ``osd scrub min interval`` when scheduling
+              the next scrub job for a placement group. The delay is a random
+              value less than ``osd scrub min interval`` \*
+              ``osd scrub interval randomize ratio``. So the default setting
+              practically spreads the scrubs randomly over the allowed time
+              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
+:Type: Float
+:Default: ``0.5``
+
+``osd deep scrub stride``
+
+:Description: Read size when doing a deep scrub.
+:Type: 32-bit Integer
+:Default: 512 KB. ``524288``
+
+
+.. index:: OSD; operations settings
+
+Operations
+==========
+
+Operations settings allow you to configure the number of threads for servicing
+requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
+By default, Ceph uses two threads with a 30 second timeout and a 30 second
+complaint time if an operation doesn't complete within those time parameters.
+You can set operations priority weights between client operations and
+recovery operations to ensure optimal performance during recovery.
+
+
+``osd op threads``
+
+:Description: The number of threads to service Ceph OSD Daemon operations.
+ Set to ``0`` to disable it. Increasing the number may increase
+ the request processing rate.
+
+:Type: 32-bit Integer
+:Default: ``2``
+
+
+``osd op queue``
+
+:Description: This sets the type of queue to be used for prioritizing ops
+ in the OSDs. Both queues feature a strict sub-queue which is
+ dequeued before the normal queue. The normal queue is different
+ between implementations. The original PrioritizedQueue (``prio``) uses a
+ token bucket system which when there are sufficient tokens will
+ dequeue high priority queues first. If there are not enough
+ tokens available, queues are dequeued low priority to high priority.
+              The WeightedPriorityQueue (``wpq``) dequeues operations from all
+              priorities in relation to their priorities to prevent the
+              starvation of any queue.
+ WPQ should help in cases where a few OSDs are more overloaded
+ than others. The new mClock based OpClassQueue
+ (``mclock_opclass``) prioritizes operations based on which class
+ they belong to (recovery, scrub, snaptrim, client op, osd subop).
+ And, the mClock based ClientQueue (``mclock_client``) also
+ incorporates the client identifier in order to promote fairness
+ between clients. See `QoS Based on mClock`_. Requires a restart.
+
+:Type: String
+:Valid Choices: prio, wpq, mclock_opclass, mclock_client
+:Default: ``prio``
+
+
+``osd op queue cut off``
+
+:Description: This selects which priority ops will be sent to the strict
+              queue versus the normal queue. The ``low`` setting sends all
+ replication ops and higher to the strict queue, while the ``high``
+ option sends only replication acknowledgement ops and higher to
+ the strict queue. Setting this to ``high`` should help when a few
+ OSDs in the cluster are very busy especially when combined with
+ ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
+ handling replication traffic could starve primary client traffic
+ on these OSDs without these settings. Requires a restart.
+
+:Type: String
+:Valid Choices: low, high
+:Default: ``low``
+
+
+``osd client op priority``
+
+:Description: The priority set for client operations. It is relative to
+ ``osd recovery op priority``.
+
+:Type: 32-bit Integer
+:Default: ``63``
+:Valid Range: 1-63
+
+
+``osd recovery op priority``
+
+:Description: The priority set for recovery operations. It is relative to
+ ``osd client op priority``.
+
+:Type: 32-bit Integer
+:Default: ``3``
+:Valid Range: 1-63
+
+
+``osd scrub priority``
+
+:Description: The priority set for scrub operations. It is relative to
+ ``osd client op priority``.
+
+:Type: 32-bit Integer
+:Default: ``5``
+:Valid Range: 1-63
+
+
+``osd snap trim priority``
+
+:Description: The priority set for snap trim operations. It is relative to
+ ``osd client op priority``.
+
+:Type: 32-bit Integer
+:Default: ``5``
+:Valid Range: 1-63
+
+
+``osd op thread timeout``
+
+:Description: The Ceph OSD Daemon operation thread timeout in seconds.
+:Type: 32-bit Integer
+:Default: ``15``
+
+
+``osd op complaint time``
+
+:Description: An operation becomes complaint worthy after the specified number
+ of seconds have elapsed.
+
+:Type: Float
+:Default: ``30``
+
+
+``osd disk threads``
+
+:Description: The number of disk threads, which are used to perform background
+ disk intensive OSD operations such as scrubbing and snap
+ trimming.
+
+:Type: 32-bit Integer
+:Default: ``1``
+
+``osd disk thread ioprio class``
+
+:Description: Warning: it will only be used if both ``osd disk thread
+              ioprio class`` and ``osd disk thread ioprio priority`` are
+              set to a non default value. Sets the ioprio_set(2) I/O
+              scheduling ``class`` for the disk thread. Acceptable
+              values are ``idle``, ``be`` or ``rt``. The ``idle``
+              class means the disk thread will have lower priority
+              than any other thread in the OSD. This is useful to slow
+              down scrubbing on an OSD that is busy handling client
+              operations. ``be`` is the default and is the same
+              priority as all other threads in the OSD. ``rt`` means
+              the disk thread will have precedence over all other
+              threads in the OSD. Note: this only works with the Linux
+              Kernel CFQ scheduler. Since Jewel, scrubbing is no longer
+              carried out by the disk iothread; see the osd priority
+              options instead.
+:Type: String
+:Default: the empty string
+
+``osd disk thread ioprio priority``
+
+:Description: Warning: it will only be used if both ``osd disk thread
+ ioprio class`` and ``osd disk thread ioprio priority`` are
+ set to a non default value. It sets the ioprio_set(2)
+ I/O scheduling ``priority`` of the disk thread ranging
+ from 0 (highest) to 7 (lowest). If all OSDs on a given
+ host were in class ``idle`` and compete for I/O
+ (i.e. due to controller congestion), it can be used to
+ lower the disk thread priority of one OSD to 7 so that
+ another OSD with priority 0 can have priority.
+ Note: Only works with the Linux Kernel CFQ scheduler.
+:Type: Integer in the range of 0 to 7 or -1 if not to be used.
+:Default: ``-1``
+
+``osd op history size``
+
+:Description: The maximum number of completed operations to track.
+:Type: 32-bit Unsigned Integer
+:Default: ``20``
+
+
+``osd op history duration``
+
+:Description: The oldest completed operation to track.
+:Type: 32-bit Unsigned Integer
+:Default: ``600``
+
+
+``osd op log threshold``
+
+:Description: How many operations logs to display at once.
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+QoS Based on mClock
+-------------------
+
+Ceph's use of mClock is currently in the experimental phase and should
+be approached with an exploratory mindset.
+
+Core Concepts
+`````````````
+
+The QoS support of Ceph is implemented using a queueing scheduler
+based on `the dmClock algorithm`_. This algorithm allocates the I/O
+resources of the Ceph cluster in proportion to weights, and enforces
+the constraints of minimum reservation and maximum limitation, so that
+the services can compete for the resources fairly. Currently the
+*mclock_opclass* operation queue divides Ceph services involving I/O
+resources into the following buckets:
+
+- client op: the iops issued by client
+- osd subop: the iops issued by primary OSD
+- snap trim: the snap trimming related requests
+- pg recovery: the recovery related requests
+- pg scrub: the scrub related requests
+
+The resources are then partitioned using the following three sets of tags. In
+other words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity when extra capacity is available
+   or the system is oversubscribed.
+
+In Ceph, operations are graded with a "cost", and the resources allocated
+for serving the various services are consumed by these "costs". So, for
+example, the more reservation a service has, the more resources it is
+guaranteed to possess, as long as it requires them. Assume there are two
+services, recovery and client ops:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery won't get more than 5
+requests per second serviced, even if it asks for more (see the CURRENT
+IMPLEMENTATION NOTE below), and that no other services are competing with
+it. But even if the clients start to issue a large number of I/O requests,
+they cannot exhaust all the I/O resources: 1 request per second
+is always allocated for recovery jobs as long as there are any such
+requests. So the recovery jobs won't be starved even in a cluster with
+high load. In the meantime, the client ops can enjoy a larger
+portion of the I/O resource, because their weight is "9" while their
+competitor's is "1". In the case of client ops, they are not clamped by the
+limit setting (``l:0``), so they can make use of all the resources if there
+is no recovery ongoing.
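+
+As a sketch, the hypothetical tags above could be expressed with the options
+documented at the end of this section (the values mirror the example, not the
+defaults):
+
+.. code-block:: ini
+
+   [osd]
+   # changing the queue type requires an OSD restart
+   osd op queue = mclock_opclass
+   # recovery: (r:1, l:5, w:1)
+   osd op queue mclock recov res = 1.0
+   osd op queue mclock recov lim = 5.0
+   osd op queue mclock recov wgt = 1.0
+   # client ops: (r:2, l:0, w:9)
+   osd op queue mclock client op res = 2.0
+   osd op queue mclock client op lim = 0.0
+   osd op queue mclock client op wgt = 9.0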
+
+Along with *mclock_opclass* another mclock operation queue named
+*mclock_client* is available. It divides operations based on category
+but also divides them based on the client making the request. This
+helps not only manage the distribution of resources spent on different
+classes of operations but also tries to ensure fairness among clients.
+
+CURRENT IMPLEMENTATION NOTE: the current experimental implementation
+does not enforce the limit values. As a first approximation we decided
+not to prevent operations that would otherwise enter the operation
+sequencer from doing so.
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per
+second. The weight, however, does not technically have a unit and the
+weights are relative to one another. So if one class of requests has a
+weight of 1 and another a weight of 9, then the latter class of
+requests should get executed at a 9 to 1 ratio relative to the first class.
+However that will only happen once the reservations are met and those
+values include the operations executed under the reservation phase.
+
+Even though the weights do not have units, one must be careful in
+choosing their values due to how the algorithm assigns weight tags to
+requests. If the weight is *W*, then for a given class of requests,
+the next one that comes in will have a weight tag of *1/W* plus the
+previous weight tag or the current time, whichever is larger. That
+means if *W* is sufficiently large and therefore *1/W* is sufficiently
+small, the calculated tag may never be assigned as it will get a value
+of the current time. The ultimate lesson is that values for weight
+should not be too large. They should be under the number of requests
+one expects to be serviced each second.
+
+Caveats
+```````
+
+There are some factors that can reduce the impact of the mClock op
+queues within Ceph. First, requests to an OSD are sharded by their
+placement group identifier. Each shard has its own mClock queue and
+these queues neither interact nor share information among them. The
+number of shards can be controlled with the configuration options
+``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
+``osd_op_num_shards_ssd``. A lower number of shards will increase the
+impact of the mClock queues, but may have other deleterious effects.
+
+Second, requests are transferred from the operation queue to the
+operation sequencer, in which they go through the phases of
+execution. The operation queue is where mClock resides and mClock
+determines the next op to transfer to the operation sequencer. The
+number of operations allowed in the operation sequencer is a complex
+issue. In general we want to keep enough operations in the sequencer
+so it's always getting work done on some operations while it's waiting
+for disk and network access to complete on other operations. On the
+other hand, once an operation is transferred to the operation
+sequencer, mClock no longer has control over it. Therefore to maximize
+the impact of mClock, we want to keep as few operations in the
+operation sequencer as possible. So we have an inherent tension.
+
+The configuration options that influence the number of operations in
+the operation sequencer are ``bluestore_throttle_bytes``,
+``bluestore_throttle_deferred_bytes``,
+``bluestore_throttle_cost_per_io``,
+``bluestore_throttle_cost_per_io_hdd``, and
+``bluestore_throttle_cost_per_io_ssd``.
+
+A third factor that affects the impact of the mClock algorithm is that
+we're using a distributed system, where requests are made to multiple
+OSDs and each OSD has (can have) multiple shards. Yet we're currently
+using the mClock algorithm, which is not distributed (note: dmClock is
+the distributed version of mClock).
+
+Various organizations and individuals are currently experimenting with
+mClock as it exists in this code base along with their modifications
+to the code base. We hope you'll share your experiences with your
+mClock and dmClock experiments on the ceph-devel mailing list.
+
+
+``osd push per object cost``
+
+:Description: the overhead for serving a push op
+
+:Type: Unsigned Integer
+:Default: 1000
+
+``osd recovery max chunk``
+
+:Description: the maximum total size of data chunks a recovery op can carry.
+
+:Type: Unsigned Integer
+:Default: 8 MiB
+
+
+``osd op queue mclock client op res``
+
+:Description: the reservation of client op.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock client op wgt``
+
+:Description: the weight of client op.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock client op lim``
+
+:Description: the limit of client op.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop res``
+
+:Description: the reservation of osd subop.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop wgt``
+
+:Description: the weight of osd subop.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock osd subop lim``
+
+:Description: the limit of osd subop.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap res``
+
+:Description: the reservation of snap trimming.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap wgt``
+
+:Description: the weight of snap trimming.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock snap lim``
+
+:Description: the limit of snap trimming.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock recov res``
+
+:Description: the reservation of recovery.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock recov wgt``
+
+:Description: the weight of recovery.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock recov lim``
+
+:Description: the limit of recovery.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock scrub res``
+
+:Description: the reservation of scrub jobs.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock scrub wgt``
+
+:Description: the weight of scrub jobs.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock scrub lim``
+
+:Description: the limit of scrub jobs.
+
+:Type: Float
+:Default: 0.001
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+
+
+.. index:: OSD; backfilling
+
+Backfilling
+===========
+
+When you add Ceph OSD Daemons to a cluster or remove them from it, the CRUSH algorithm will
+want to rebalance the cluster by moving placement groups to or from Ceph OSD
+Daemons to restore the balance. The process of migrating placement groups and
+the objects they contain can reduce the cluster's operational performance
+considerably. To maintain operational performance, Ceph performs this migration
+with 'backfilling', which allows Ceph to set backfill operations to a lower
+priority than requests to read or write data.
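+
+A common knob here is ``osd max backfills``. As a sketch, it can be raised
+temporarily at runtime (assuming an admin keyring; ``2`` is an illustrative
+value, the default being ``1``)::
+
+    ceph tell osd.* injectargs '--osd-max-backfills 2'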
+
+
+``osd max backfills``
+
+:Description: The maximum number of backfills allowed to or from a single OSD.
+:Type: 64-bit Unsigned Integer
+:Default: ``1``
+
+
+``osd backfill scan min``
+
+:Description: The minimum number of objects per backfill scan.
+
+:Type: 32-bit Integer
+:Default: ``64``
+
+
+``osd backfill scan max``
+
+:Description: The maximum number of objects per backfill scan.
+
+:Type: 32-bit Integer
+:Default: ``512``
+
+
+``osd backfill retry interval``
+
+:Description: The number of seconds to wait before retrying backfill requests.
+:Type: Double
+:Default: ``10.0``
+
+.. index:: OSD; osdmap
+
+OSD Map
+=======
+
+OSD maps reflect the OSD daemons operating in the cluster. Over time, the
+number of map epochs increases. Ceph provides some settings to ensure that
+Ceph performs well as the OSD map grows larger.
+
+
+``osd map dedup``
+
+:Description: Enable removing duplicates in the OSD map.
+:Type: Boolean
+:Default: ``true``
+
+
+``osd map cache size``
+
+:Description: The number of OSD maps to keep cached.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+``osd map cache bl size``
+
+:Description: The size of the in-memory OSD map cache in OSD daemons.
+:Type: 32-bit Integer
+:Default: ``50``
+
+
+``osd map cache bl inc size``
+
+:Description: The size of the in-memory OSD map cache incrementals in
+ OSD daemons.
+
+:Type: 32-bit Integer
+:Default: ``100``
+
+
+``osd map message max``
+
+:Description: The maximum map entries allowed per MOSDMap message.
+:Type: 32-bit Integer
+:Default: ``100``
+
+
+
+.. index:: OSD; recovery
+
+Recovery
+========
+
+When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
+begins peering with other Ceph OSD Daemons before writes can occur. See
+`Monitoring OSDs and PGs`_ for details.
+
+If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
+sync with other Ceph OSD Daemons containing more recent versions of objects in
+the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
+mode and seeks to get the latest copy of the data and bring its map back up to
+date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
+and placement groups may be significantly out of date. Also, if a failure domain
+went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
+the same time. This can make the recovery process time consuming and resource
+intensive.
+
+To maintain operational performance, Ceph performs recovery with limitations
+on the number of recovery requests, threads, and object chunk sizes, which
+allows Ceph to perform well in a degraded state.
+
+
+``osd recovery delay start``
+
+:Description: After peering completes, Ceph will delay for the specified number
+ of seconds before starting to recover objects.
+
+:Type: Float
+:Default: ``0``
+
+
+``osd recovery max active``
+
+:Description: The number of active recovery requests per OSD at one time. More
+              requests will accelerate recovery, but the requests place an
+              increased load on the cluster.
+
+:Type: 32-bit Integer
+:Default: ``3``
+
+
+``osd recovery max chunk``
+
+:Description: The maximum size of a recovered chunk of data to push.
+:Type: 64-bit Unsigned Integer
+:Default: ``8 << 20``
+
+
+``osd recovery max single start``
+
+:Description: The maximum number of recovery operations per OSD that will be
+ newly started when an OSD is recovering.
+:Type: 64-bit Unsigned Integer
+:Default: ``1``
+
+
+``osd recovery thread timeout``
+
+:Description: The maximum time in seconds before timing out a recovery thread.
+:Type: 32-bit Integer
+:Default: ``30``
+
+
+``osd recover clone overlap``
+
+:Description: Preserves clone overlap during recovery. Should always be set
+ to ``true``.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``osd recovery sleep``
+
+:Description: Time in seconds to sleep before the next recovery or backfill
+              op. Increasing this value slows down recovery operations while
+              client operations are less impacted.
+
+:Type: Float
+:Default: ``0``
+
+
+``osd recovery sleep hdd``
+
+:Description: Time in seconds to sleep before the next recovery or backfill
+              op for HDDs.
+
+:Type: Float
+:Default: ``0.1``
+
+
+``osd recovery sleep ssd``
+
+:Description: Time in seconds to sleep before the next recovery or backfill
+              op for SSDs.
+
+:Type: Float
+:Default: ``0``
+
+
+``osd recovery sleep hybrid``
+
+:Description: Time in seconds to sleep before the next recovery or backfill
+              op when OSD data is on HDD and the OSD journal is on SSD.
+
+:Type: Float
+:Default: ``0.025``
+
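+To make the trade-off concrete, an illustrative (not prescriptive)
+``ceph.conf`` snippet that slows recovery in favor of client I/O might be::
+
+    [osd]
+    osd recovery max active = 1        # fewer concurrent recovery requests
+    osd recovery max single start = 1  # start one op at a time
+    osd recovery sleep = 0.1           # pause between recovery/backfill ops
+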
+Tiering
+=======
+
+``osd agent max ops``
+
+:Description: The maximum number of simultaneous flushing ops per tiering agent
+ in the high speed mode.
+:Type: 32-bit Integer
+:Default: ``4``
+
+
+``osd agent max low ops``
+
+:Description: The maximum number of simultaneous flushing ops per tiering agent
+ in the low speed mode.
+:Type: 32-bit Integer
+:Default: ``2``
+
+See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
+objects within the high speed mode.
+
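+A hypothetical ``ceph.conf`` snippet that halves the flushing concurrency of
+the tiering agent (example values only) might look like::
+
+    [osd]
+    osd agent max ops = 2      # high speed mode flush concurrency
+    osd agent max low ops = 1  # low speed mode flush concurrency
+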
+Miscellaneous
+=============
+
+
+``osd snap trim thread timeout``
+
+:Description: The maximum time in seconds before timing out a snap trim thread.
+:Type: 32-bit Integer
+:Default: ``60*60*1``
+
+
+``osd backlog thread timeout``
+
+:Description: The maximum time in seconds before timing out a backlog thread.
+:Type: 32-bit Integer
+:Default: ``60*60*1``
+
+
+``osd default notify timeout``
+
+:Description: The OSD default notification timeout (in seconds).
+:Type: 32-bit Unsigned Integer
+:Default: ``30``
+
+
+``osd check for log corruption``
+
+:Description: Check log files for corruption. Can be computationally expensive.
+:Type: Boolean
+:Default: ``false``
+
+
+``osd remove thread timeout``
+
+:Description: The maximum time in seconds before timing out a remove OSD thread.
+:Type: 32-bit Integer
+:Default: ``60*60``
+
+
+``osd command thread timeout``
+
+:Description: The maximum time in seconds before timing out a command thread.
+:Type: 32-bit Integer
+:Default: ``10*60``
+
+
+``osd command max records``
+
+:Description: Limits the number of lost objects to return.
+:Type: 32-bit Integer
+:Default: ``256``
+
+
+``osd auto upgrade tmap``
+
+:Description: Uses ``tmap`` for ``omap`` on old objects.
+:Type: Boolean
+:Default: ``true``
+
+
+``osd tmapput sets users tmap``
+
+:Description: Uses ``tmap`` for debugging only.
+:Type: Boolean
+:Default: ``false``
+
+
+``osd fast fail on connection refused``
+
+:Description: If this option is enabled, crashed OSDs are marked down
+ immediately by connected peers and MONs (assuming that the
+ crashed OSD host survives). Disable it to restore old
+ behavior, at the expense of possible long I/O stalls when
+ OSDs crash in the middle of I/O operations.
+:Type: Boolean
+:Default: ``true``
+
+
+
+.. _pool: ../../operations/pools
+.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
+.. _Pool & PG Config Reference: ../pool-pg-config-ref
+.. _Journal Config Reference: ../journal-ref
+.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
diff --git a/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst
new file mode 100644
index 0000000..89a3707
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst
@@ -0,0 +1,270 @@
+======================================
+ Pool, PG and CRUSH Config Reference
+======================================
+
+.. index:: pools; configuration
+
+When you create pools and set the number of placement groups for a pool, Ceph
+uses default values when you don't specifically override them. **We
+recommend** overriding some of the defaults. Specifically, we recommend
+setting a pool's replica size and overriding the default number of placement
+groups. You can specifically set these values when running `pool`_ commands.
+You can also override the defaults by adding new values in the ``[global]``
+section of your Ceph configuration file.
+
+
+.. literalinclude:: pool-pg.conf
+ :language: ini
+
+
+
+``mon max pool pg num``
+
+:Description: The maximum number of placement groups per pool.
+:Type: Integer
+:Default: ``65536``
+
+
+``mon pg create interval``
+
+:Description: Number of seconds between PG creation in the same
+ Ceph OSD Daemon.
+
+:Type: Float
+:Default: ``30.0``
+
+
+``mon pg stuck threshold``
+
+:Description: Number of seconds after which PGs can be considered as
+ being stuck.
+
+:Type: 32-bit Integer
+:Default: ``300``
+
+``mon pg min inactive``
+
+:Description: Issue a ``HEALTH_ERR`` in the cluster log if the number of PGs
+              that stay inactive longer than ``mon_pg_stuck_threshold``
+              exceeds this setting. A non-positive number disables this check.
+:Type: Integer
+:Default: ``1``
+
+
+``mon pg warn min per osd``
+
+:Description: Issue a ``HEALTH_WARN`` in the cluster log if the average number
+              of PGs per ``in`` OSD is under this number. (A non-positive
+              number disables this check.)
+:Type: Integer
+:Default: ``30``
+
+
+``mon pg warn max per osd``
+
+:Description: Issue a ``HEALTH_WARN`` in the cluster log if the average number
+              of PGs per ``in`` OSD is above this number. (A non-positive
+              number disables this check.)
+:Type: Integer
+:Default: ``300``
+
+
+``mon pg warn min objects``
+
+:Description: Do not warn if the total number of objects in the cluster is
+              below this number.
+:Type: Integer
+:Default: ``1000``
+
+
+``mon pg warn min pool objects``
+
+:Description: Do not warn on pools whose object count is below this number.
+:Type: Integer
+:Default: ``1000``
+
+
+``mon pg check down all threshold``
+
+:Description: Percentage threshold of ``down`` OSDs above which we check all
+              PGs for stale ones.
+:Type: Float
+:Default: ``0.5``
+
+
+``mon pg warn max object skew``
+
+:Description: Issue a ``HEALTH_WARN`` in the cluster log if the average object
+              count of a certain pool is greater than ``mon pg warn max object
+              skew`` times the average object count of the whole pool. (A
+              non-positive number disables this check.)
+:Type: Float
+:Default: ``10``
+
+
+``mon delta reset interval``
+
+:Description: Seconds of inactivity before we reset the PG delta to 0. We
+              track the delta of the used space for each pool so that, for
+              example, it is easier to understand the progress of recovery or
+              the performance of a cache tier. If no activity is reported for
+              a certain pool, we simply reset the history of deltas for that
+              pool.
+:Type: Integer
+:Default: ``10``
+
+
+``mon osd max op age``
+
+:Description: Maximum op age before we get concerned (make it a power of 2).
+ A ``HEALTH_WARN`` will be issued if a request has been blocked longer
+ than this limit.
+:Type: Float
+:Default: ``32.0``
+
+
+``osd pg bits``
+
+:Description: Placement group bits per Ceph OSD Daemon.
+:Type: 32-bit Integer
+:Default: ``6``
+
+
+``osd pgp bits``
+
+:Description: The number of bits per Ceph OSD Daemon for PGPs.
+:Type: 32-bit Integer
+:Default: ``6``
+
+
+``osd crush chooseleaf type``
+
+:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses
+ ordinal rank rather than name.
+
+:Type: 32-bit Integer
+:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons.
+
+
+``osd crush initial weight``
+
+:Description: The initial CRUSH weight for newly added OSDs in the CRUSH map.
+
+:Type: Double
+:Default: ``the size of the newly added OSD in TB``. By default, the initial
+           CRUSH weight for a newly added OSD is set to its volume size in TB.
+           See `Weighting Bucket Items`_ for details.
+
+
+``osd pool default crush replicated ruleset``
+
+:Description: The default CRUSH ruleset to use when creating a replicated pool.
+:Type: 8-bit Integer
+:Default: ``CEPH_DEFAULT_CRUSH_REPLICATED_RULESET``, which means "pick
+ a ruleset with the lowest numerical ID and use that". This is to
+ make pool creation work in the absence of ruleset 0.
+
+
+``osd pool erasure code stripe unit``
+
+:Description: Sets the default size, in bytes, of a chunk of an object
+              stripe for erasure coded pools. Every object of size S
+              will be stored as N stripes, with each data chunk
+              receiving ``stripe unit`` bytes. Each stripe of ``N *
+              stripe unit`` bytes will be encoded/decoded
+              individually. This option is overridden by the
+              ``stripe_unit`` setting in an erasure code profile.
+
+:Type: Unsigned 32-bit Integer
+:Default: ``4096``
+
+
+``osd pool default size``
+
+:Description: Sets the number of replicas for objects in the pool. The default
+ value is the same as
+ ``ceph osd pool set {pool-name} size {size}``.
+
+:Type: 32-bit Integer
+:Default: ``3``
+
+
+``osd pool default min size``
+
+:Description: Sets the minimum number of written replicas for objects in the
+              pool in order to acknowledge a write operation to the client.
+              If the minimum is not met, Ceph will not acknowledge the write
+              to the client. This setting ensures a minimum number of replicas
+              when operating in ``degraded`` mode.
+
+:Type: 32-bit Integer
+:Default: ``0``, which means no particular minimum. If ``0``, the minimum is
+           ``size - (size / 2)``; for the default size of ``3``, that works
+           out to ``3 - 1 = 2`` with integer division.
+
+
+``osd pool default pg num``
+
+:Description: The default number of placement groups for a pool. The default
+ value is the same as ``pg_num`` with ``mkpool``.
+
+:Type: 32-bit Integer
+:Default: ``8``
+
+
+``osd pool default pgp num``
+
+:Description: The default number of placement groups for placement for a pool.
+ The default value is the same as ``pgp_num`` with ``mkpool``.
+ PG and PGP should be equal (for now).
+
+:Type: 32-bit Integer
+:Default: ``8``
+
+
+``osd pool default flags``
+
+:Description: The default flags for new pools.
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``osd max pgls``
+
+:Description: The maximum number of placement groups to list. A client
+ requesting a large number can tie up the Ceph OSD Daemon.
+
+:Type: Unsigned 64-bit Integer
+:Default: ``1024``
+:Note: Default should be fine.
+
+
+``osd min pg log entries``
+
+:Description: The minimum number of placement group logs to maintain
+ when trimming log files.
+
+:Type: 32-bit Unsigned Integer
+:Default: ``1000``
+
+
+``osd default data pool replay window``
+
+:Description: The time (in seconds) for an OSD to wait for a client to replay
+ a request.
+
+:Type: 32-bit Integer
+:Default: ``45``
+
+``osd max pg per osd hard ratio``
+
+:Description: The ratio of the number of PGs per OSD that the cluster allows
+              before an OSD refuses to create new PGs. An OSD stops creating
+              new PGs if the number of PGs it serves exceeds
+              ``osd max pg per osd hard ratio`` \* ``mon max pg per osd``.
+
+:Type: Float
+:Default: ``2``
+
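+As a worked example, assuming an illustrative ``mon max pg per osd`` value of
+``200`` together with the default ratio, the hard cap works out as follows::
+
+    [global]
+    mon max pg per osd = 200           # illustrative value
+    osd max pg per osd hard ratio = 2  # the default
+    # An OSD refuses to create new PGs once it serves more than
+    # 200 * 2 = 400 PGs.
+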
+.. _pool: ../../operations/pools
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
+.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems
diff --git a/src/ceph/doc/rados/configuration/pool-pg.conf b/src/ceph/doc/rados/configuration/pool-pg.conf
new file mode 100644
index 0000000..5f1b3b7
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/pool-pg.conf
@@ -0,0 +1,20 @@
+[global]
+
+    # By default, Ceph makes 3 replicas of objects. If you want to make four
+    # copies of an object--a primary copy and three replica copies--reset the
+    # default value as shown in 'osd pool default size'.
+ # If you want to allow Ceph to write a lesser number of copies in a degraded
+ # state, set 'osd pool default min size' to a number less than the
+ # 'osd pool default size' value.
+
+ osd pool default size = 4 # Write an object 4 times.
+ osd pool default min size = 1 # Allow writing one copy in a degraded state.
+
+ # Ensure you have a realistic number of placement groups. We recommend
+ # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
+ # divided by the number of replicas (i.e., osd pool default size). So for
+ # 10 OSDs and osd pool default size = 4, we'd recommend approximately
+ # (100 * 10) / 4 = 250.
+
+ osd pool default pg num = 250
+ osd pool default pgp num = 250
diff --git a/src/ceph/doc/rados/configuration/storage-devices.rst b/src/ceph/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 0000000..83c0c9b
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,83 @@
+=================
+ Storage Devices
+=================
+
+There are two Ceph daemons that store data on disk:
+
+* **Ceph OSDs** (or Object Storage Daemons) are where most of the
+ data is stored in Ceph. Generally speaking, each OSD is backed by
+ a single storage device, like a traditional hard disk (HDD) or
+ solid state disk (SSD). OSDs can also be backed by a combination
+ of devices, like a HDD for most data and an SSD (or partition of an
+ SSD) for some metadata. The number of OSDs in a cluster is
+ generally a function of how much data will be stored, how big each
+ storage device will be, and the level and type of redundancy
+ (replication or erasure coding).
+* **Ceph Monitor** daemons manage critical cluster state like cluster
+ membership and authentication information. For smaller clusters a
+ few gigabytes is all that is needed, although for larger clusters
+ the monitor database can reach tens or possibly hundreds of
+ gigabytes.
+
+
+OSD Backends
+============
+
+There are two ways that OSDs can manage the data they store. Starting
+with the Luminous 12.2.z release, the new default (and recommended) backend is
+*BlueStore*. Prior to Luminous, the default (and only option) was
+*FileStore*.
+
+BlueStore
+---------
+
+BlueStore is a special-purpose storage backend designed specifically
+for managing data on disk for Ceph OSD workloads. It is motivated by
+experience supporting and managing OSDs using FileStore over the
+last ten years. Key BlueStore features include:
+
+* Direct management of storage devices. BlueStore consumes raw block
+ devices or partitions. This avoids any intervening layers of
+ abstraction (such as local file systems like XFS) that may limit
+ performance or add complexity.
+* Metadata management with RocksDB. We embed RocksDB's key/value database
+ in order to manage internal metadata, such as the mapping from object
+ names to block locations on disk.
+* Full data and metadata checksumming. By default all data and
+ metadata written to BlueStore is protected by one or more
+ checksums. No data or metadata will be read from disk or returned
+ to the user without being verified.
+* Inline compression. Data written may be optionally compressed
+ before being written to disk.
+* Multi-device metadata tiering. BlueStore allows its internal
+  journal (write-ahead log) to be written to a separate, high-speed
+  device (like an SSD, NVMe, or NVDIMM) to increase performance. If
+  a significant amount of faster storage is available, internal
+  metadata can also be stored on the faster device.
+* Efficient copy-on-write. RBD and CephFS snapshots rely on a
+ copy-on-write *clone* mechanism that is implemented efficiently in
+ BlueStore. This results in efficient IO both for regular snapshots
+ and for erasure coded pools (which rely on cloning to implement
+ efficient two-phase commits).
+
+For more information, see :doc:`bluestore-config-ref`.
+
+FileStore
+---------
+
+FileStore is the legacy approach to storing objects in Ceph. It
+relies on a standard file system (normally XFS) in combination with a
+key/value database (traditionally LevelDB, now RocksDB) for some
+metadata.
+
+FileStore is well-tested and widely used in production but suffers
+from many performance deficiencies due to its overall design and
+reliance on a traditional file system for storing object data.
+
+Although FileStore is generally capable of functioning on most
+POSIX-compatible file systems (including btrfs and ext4), we only
+recommend that XFS be used. Both btrfs and ext4 have known bugs and
+deficiencies and their use may lead to data loss. By default all Ceph
+provisioning tools will use XFS.
+
+For more information, see :doc:`filestore-config-ref`.
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst b/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst
new file mode 100644
index 0000000..a91f69c
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst
@@ -0,0 +1,38 @@
+=============
+ Admin Tasks
+=============
+
+Once you have set up a cluster with ``ceph-deploy``, you may
+provide the client admin key and the Ceph configuration file
+to another host so that a user on the host may use the ``ceph``
+command line as an administrative user.
+
+
+Create an Admin Host
+====================
+
+To enable a host to execute ceph commands with administrator
+privileges, use the ``admin`` command. ::
+
+ ceph-deploy admin {host-name [host-name]...}
+
+
+Deploy Config File
+==================
+
+To send an updated copy of the Ceph configuration file to hosts
+in your cluster, use the ``config push`` command. ::
+
+ ceph-deploy config push {host-name [host-name]...}
+
+.. tip:: With a base name and an incrementing host-naming convention,
+   it is easy to deploy configuration files via simple scripts
+   (e.g., ``ceph-deploy config push hostname{1,2,3,4,5}``).
+
+Retrieve Config File
+====================
+
+To retrieve a copy of the Ceph configuration file from a host
+in your cluster, use the ``config pull`` command. ::
+
+ ceph-deploy config pull {host-name [host-name]...}
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-install.rst b/src/ceph/doc/rados/deployment/ceph-deploy-install.rst
new file mode 100644
index 0000000..849d68e
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-install.rst
@@ -0,0 +1,46 @@
+====================
+ Package Management
+====================
+
+Install
+=======
+
+To install Ceph packages on your cluster hosts, open a command line on your
+client machine and type the following::
+
+ ceph-deploy install {hostname [hostname] ...}
+
+Without additional arguments, ``ceph-deploy`` will install the most recent
+major release of Ceph to the cluster host(s). To specify a particular package,
+you may select from the following:
+
+- ``--release <code-name>``
+- ``--testing``
+- ``--dev <branch-or-tag>``
+
+For example::
+
+ ceph-deploy install --release cuttlefish hostname1
+ ceph-deploy install --testing hostname2
+ ceph-deploy install --dev wip-some-branch hostname{1,2,3,4,5}
+
+For additional usage, execute::
+
+ ceph-deploy install -h
+
+
+Uninstall
+=========
+
+To uninstall Ceph packages from your cluster hosts, open a terminal on
+your admin host and type the following::
+
+ ceph-deploy uninstall {hostname [hostname] ...}
+
+On a Debian or Ubuntu system, you may also::
+
+ ceph-deploy purge {hostname [hostname] ...}
+
+The tool will uninstall ``ceph`` packages from the specified hosts. Purge
+additionally removes configuration files.
+
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst b/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst
new file mode 100644
index 0000000..3e106c9
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst
@@ -0,0 +1,32 @@
+=================
+ Keys Management
+=================
+
+
+Gather Keys
+===========
+
+Before you can provision a host to run OSDs or metadata servers, you must gather
+monitor keys and the OSD and MDS bootstrap keyrings. To gather keys, enter the
+following::
+
+ ceph-deploy gatherkeys {monitor-host}
+
+
+.. note:: To retrieve the keys, you specify a host that has a
+ Ceph monitor.
+
+.. note:: If you have specified multiple monitors in the setup of the cluster,
+   make sure that all monitors are up and running. If the monitors haven't
+   formed a quorum, ``ceph-create-keys`` will not finish and the keys will not
+   be generated.
+
+Forget Keys
+===========
+
+When you are no longer using ``ceph-deploy`` (or if you are recreating a
+cluster), you should delete the keys in the local directory of your admin host.
+To delete keys, enter the following::
+
+ ceph-deploy forgetkeys
+
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst b/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst
new file mode 100644
index 0000000..d2afaec
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst
@@ -0,0 +1,46 @@
+============================
+ Add/Remove Metadata Server
+============================
+
+With ``ceph-deploy``, adding and removing metadata servers is a simple task. You
+just add or remove one or more metadata servers on the command line with one
+command.
+
+.. important:: You must deploy at least one metadata server to use CephFS.
+ There is experimental support for running multiple metadata servers.
+ Do not run multiple active metadata servers in production.
+
+See `MDS Config Reference`_ for details on configuring metadata servers.
+
+
+Add a Metadata Server
+=====================
+
+Once you deploy monitors and OSDs you may deploy the metadata server(s). ::
+
+ ceph-deploy mds create {host-name}[:{daemon-name}] [{host-name}[:{daemon-name}] ...]
+
+You may optionally specify a daemon instance name if you would like to run
+multiple daemons on a single server.
+
+
+Remove a Metadata Server
+========================
+
+Coming soon...
+
+.. If you have a metadata server in your cluster that you'd like to remove, you may use
+.. the ``destroy`` option. ::
+
+.. ceph-deploy mds destroy {host-name}[:{daemon-name}] [{host-name}[:{daemon-name}] ...]
+
+.. You may specify a daemon instance a name (optional) if you would like to destroy
+.. a particular daemon that runs on a single server with multiple MDS daemons.
+
+.. .. note:: Ensure that if you remove a metadata server, the remaining metadata
+ servers will be able to service requests from CephFS clients. If that is not
+ possible, consider adding a metadata server before destroying the metadata
+ server you would like to take offline.
+
+
+.. _MDS Config Reference: ../../../cephfs/mds-config-ref
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst b/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst
new file mode 100644
index 0000000..bda34fe
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst
@@ -0,0 +1,56 @@
+=====================
+ Add/Remove Monitors
+=====================
+
+With ``ceph-deploy``, adding and removing monitors is a simple task. You just
+add or remove one or more monitors on the command line with one command. Before
+``ceph-deploy``, the process of `adding and removing monitors`_ involved
+numerous manual steps. Using ``ceph-deploy`` imposes a restriction: **you may
+only install one monitor per host.**
+
+.. note:: We do not recommend comingling monitors and OSDs on
+ the same host.
+
+For high availability, you should run a production Ceph cluster with **AT
+LEAST** three monitors. Ceph uses the Paxos algorithm, which requires a
+consensus among the majority of monitors in a quorum. With Paxos, the monitors
+cannot determine a majority for establishing a quorum with only two monitors. A
+majority of monitors must be counted as follows: 1 out of 1, 2 out of 3, 3 out
+of 4, 3 out of 5, 4 out of 6, and so on.
+
+See `Monitor Config Reference`_ for details on configuring monitors.
+
+
+Add a Monitor
+=============
+
+Once you create a cluster and install Ceph packages to the monitor host(s), you
+may deploy the monitor(s) to the monitor host(s). When using ``ceph-deploy``,
+the tool enforces a single monitor per host. ::
+
+ ceph-deploy mon create {host-name [host-name]...}
+
+
+.. note:: Ensure that you add monitors such that they may arrive at a consensus
+ among a majority of monitors, otherwise other steps (like ``ceph-deploy gatherkeys``)
+ will fail.
+
+.. note:: When adding a monitor on a host that was not among the hosts
+   initially defined with the ``ceph-deploy new`` command, a ``public network``
+   statement needs to be added to the ``ceph.conf`` file.
+
+Remove a Monitor
+================
+
+If you have a monitor in your cluster that you'd like to remove, you may use
+the ``destroy`` option. ::
+
+ ceph-deploy mon destroy {host-name [host-name]...}
+
+
+.. note:: Ensure that if you remove a monitor, the remaining monitors will be
+ able to establish a consensus. If that is not possible, consider adding a
+ monitor before removing the monitor you would like to take offline.
+
+
+.. _adding and removing monitors: ../../operations/add-or-rm-mons
+.. _Monitor Config Reference: ../../configuration/mon-config-ref
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-new.rst b/src/ceph/doc/rados/deployment/ceph-deploy-new.rst
new file mode 100644
index 0000000..5eb37a9
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-new.rst
@@ -0,0 +1,66 @@
+==================
+ Create a Cluster
+==================
+
+The first step in using Ceph with ``ceph-deploy`` is to create a new Ceph
+cluster. A new Ceph cluster has:
+
+- A Ceph configuration file, and
+- A monitor keyring.
+
+The Ceph configuration file consists of at least:
+
+- Its own filesystem ID (``fsid``)
+- The initial monitor(s) hostname(s), and
+- The initial monitor(s) and IP address(es).
+
+For additional details, see the `Monitor Configuration Reference`_.
+
+The ``ceph-deploy`` tool also creates a monitor keyring and populates it with a
+``[mon.]`` key. For additional details, see the `Cephx Guide`_.
+
+
+Usage
+-----
+
+To create a cluster with ``ceph-deploy``, use the ``new`` command and specify
+the host(s) that will be initial members of the monitor quorum. ::
+
+ ceph-deploy new {host [host], ...}
+
+For example::
+
+ ceph-deploy new mon1.foo.com
+ ceph-deploy new mon{1,2,3}
+
+The ``ceph-deploy`` utility will use DNS to resolve hostnames to IP
+addresses. The monitors will be named using the first component of
+the name (e.g., ``mon1`` above). It will add the specified host names
+to the Ceph configuration file. For additional details, execute::
+
+ ceph-deploy new -h
+
+
+Naming a Cluster
+----------------
+
+By default, Ceph clusters have a cluster name of ``ceph``. You can specify
+a cluster name if you want to run multiple clusters on the same hardware. For
+example, if you want to optimize a cluster for use with block devices, and
+another for use with the gateway, you can run two different clusters on the same
+hardware if they have a different ``fsid`` and cluster name. ::
+
+ ceph-deploy --cluster {cluster-name} new {host [host], ...}
+
+For example::
+
+ ceph-deploy --cluster rbdcluster new ceph-mon1
+ ceph-deploy --cluster rbdcluster new ceph-mon{1,2,3}
+
+.. note:: If you run multiple clusters, ensure you adjust the default
+ port settings and open ports for your additional cluster(s) so that
+ the networks of the two different clusters don't conflict with each other.
+
+
+.. _Monitor Configuration Reference: ../../configuration/mon-config-ref
+.. _Cephx Guide: ../../../dev/mon-bootstrap#secret-keys
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst b/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst
new file mode 100644
index 0000000..a4eb4d1
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst
@@ -0,0 +1,121 @@
+=================
+ Add/Remove OSDs
+=================
+
+Adding Ceph OSD Daemons to your cluster and removing them may involve a few
+more steps when compared to adding and removing other Ceph daemons. Ceph OSD
+Daemons write data to the disk and to journals. So you need to provide a disk
+for the OSD and a path to the journal partition (i.e., this is the most common
+configuration, but you may configure your system to your own needs).
+
+In Ceph v0.60 and later releases, Ceph supports on-disk encryption via
+``dm-crypt``. You may specify the ``--dmcrypt`` argument when preparing an OSD
+to tell ``ceph-deploy`` that you want to use encryption. You may also specify
+the ``--dmcrypt-key-dir`` argument to specify the location of ``dm-crypt``
+encryption keys.
+
+You should test various drive configurations to gauge their throughput before
+building out a large cluster. See `Data Storage`_ for additional details.
+
+
+List Disks
+==========
+
+To list the disks on a node, execute the following command::
+
+ ceph-deploy disk list {node-name [node-name]...}
+
+
+Zap Disks
+=========
+
+To zap a disk (delete its partition table) in preparation for use with Ceph,
+execute the following::
+
+ ceph-deploy disk zap {osd-server-name}:{disk-name}
+ ceph-deploy disk zap osdserver1:sdb
+
+.. important:: This will delete all data.
+
+
+Prepare OSDs
+============
+
+Once you create a cluster, install Ceph packages, and gather keys, you
+may prepare the OSDs and deploy them to the OSD node(s). If you need to
+identify a disk or zap it prior to preparing it for use as an OSD,
+see `List Disks`_ and `Zap Disks`_. ::
+
+ ceph-deploy osd prepare {node-name}:{data-disk}[:{journal-disk}]
+ ceph-deploy osd prepare osdserver1:sdb:/dev/ssd
+ ceph-deploy osd prepare osdserver1:sdc:/dev/ssd
+
+The ``prepare`` command only prepares the OSD. On most operating
+systems, the ``activate`` phase will automatically run when the
+partitions are created on the disk (using Ceph ``udev`` rules). If not,
+use the ``activate`` command. See `Activate OSDs`_ for details.
+
+The foregoing example assumes a disk dedicated to one Ceph OSD Daemon, and
+a path to an SSD journal partition. We recommend storing the journal on
+a separate drive to maximize throughput. You may dedicate a single drive
+for the journal too (which may be expensive) or place the journal on the
+same disk as the OSD (not recommended as it impairs performance). In the
+foregoing example we store the journal on a partitioned solid state drive.
+
+You can use the ``--fs-type`` or ``--bluestore`` options to choose which file
+system or object store you want to use on the OSD drive. (For more information,
+run ``ceph-deploy osd prepare --help``.)
+
+.. note:: When running multiple Ceph OSD daemons on a single node, and
+   sharing a partitioned journal with each OSD daemon, you should consider
+   the entire node the minimum failure domain for CRUSH purposes, because
+   if the SSD drive fails, all of the Ceph OSD daemons that journal to it
+   will fail too.
+
+
+Activate OSDs
+=============
+
+Once you prepare an OSD you may activate it with the following command. ::
+
+ ceph-deploy osd activate {node-name}:{data-disk-partition}[:{journal-disk-partition}]
+ ceph-deploy osd activate osdserver1:/dev/sdb1:/dev/ssd1
+ ceph-deploy osd activate osdserver1:/dev/sdc1:/dev/ssd2
+
+The ``activate`` command will cause your OSD to come ``up`` and be placed
+``in`` the cluster. The ``activate`` command uses the path to the partition
+created when running the ``prepare`` command.
+
+
+Create OSDs
+===========
+
+You may prepare OSDs, deploy them to the OSD node(s) and activate them in one
+step with the ``create`` command. The ``create`` command is a convenience method
+for executing the ``prepare`` and ``activate`` command sequentially. ::
+
+ ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
+ ceph-deploy osd create osdserver1:sdb:/dev/ssd1
+
+.. List OSDs
+.. =========
+
+.. To list the OSDs deployed on a node(s), execute the following command::
+
+.. ceph-deploy osd list {node-name}
+
+
+Destroy OSDs
+============
+
+.. note:: Coming soon. See `Remove OSDs`_ for manual procedures.
+
+.. To destroy an OSD, execute the following command::
+
+.. ceph-deploy osd destroy {node-name}:{path-to-disk}[:{path/to/journal}]
+
+.. Destroying an OSD will take it ``down`` and ``out`` of the cluster.
+
+.. _Data Storage: ../../../start/hardware-recommendations#data-storage
+.. _Remove OSDs: ../../operations/add-or-rm-osds#removing-osds-manual
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst b/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst
new file mode 100644
index 0000000..685c3c4
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst
@@ -0,0 +1,25 @@
+==============
+ Purge a Host
+==============
+
+When you remove Ceph daemons and uninstall Ceph, there may still be extraneous
+data from the cluster on your server. The ``purge`` and ``purgedata`` commands
+provide a convenient means of cleaning up a host.
+
+
+Purge Data
+==========
+
+To remove all data from ``/var/lib/ceph`` (but leave Ceph packages intact),
+execute the ``purgedata`` command. ::
+
+ ceph-deploy purgedata {hostname} [{hostname} ...]
+
+
+Purge
+=====
+
+To remove all data from ``/var/lib/ceph`` and uninstall Ceph packages, execute
+the ``purge`` command. ::
+
+   ceph-deploy purge {hostname} [{hostname} ...]
\ No newline at end of file
diff --git a/src/ceph/doc/rados/deployment/index.rst b/src/ceph/doc/rados/deployment/index.rst
new file mode 100644
index 0000000..0853e4a
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/index.rst
@@ -0,0 +1,58 @@
+=================
+ Ceph Deployment
+=================
+
+The ``ceph-deploy`` tool is a way to deploy Ceph relying only upon SSH access to
+the servers, ``sudo``, and some Python. It runs on your workstation, and does
+not require servers, databases, or any other tools. If you set up and
+tear down Ceph clusters a lot, and want minimal extra bureaucracy,
+``ceph-deploy`` is an ideal tool. The ``ceph-deploy`` tool is not a generic
+deployment system. It was designed exclusively for Ceph users who want to get
+Ceph up and running quickly with sensible initial configuration settings without
+the overhead of installing Chef, Puppet or Juju. Users who want fine-grained
+control over security settings, partitions or directory locations should use a
+tool such as Juju, Puppet, `Chef`_ or Crowbar.
+
+
+With ``ceph-deploy``, you can develop scripts to install Ceph packages on remote
+hosts, create a cluster, add monitors, gather (or forget) keys, add OSDs and
+metadata servers, configure admin hosts, and tear down the clusters.
+
+.. raw:: html
+
+ <table cellpadding="10"><tbody valign="top"><tr><td>
+
+.. toctree::
+
+ Preflight Checklist <preflight-checklist>
+ Install Ceph <ceph-deploy-install>
+
+.. raw:: html
+
+ </td><td>
+
+.. toctree::
+
+ Create a Cluster <ceph-deploy-new>
+ Add/Remove Monitor(s) <ceph-deploy-mon>
+ Key Management <ceph-deploy-keys>
+ Add/Remove OSD(s) <ceph-deploy-osd>
+ Add/Remove MDS(s) <ceph-deploy-mds>
+
+
+.. raw:: html
+
+ </td><td>
+
+.. toctree::
+
+ Purge Hosts <ceph-deploy-purge>
+ Admin Tasks <ceph-deploy-admin>
+
+
+.. raw:: html
+
+ </td></tr></tbody></table>
+
+
+.. _Chef: http://tracker.ceph.com/projects/ceph/wiki/Deploying_Ceph_with_Chef
diff --git a/src/ceph/doc/rados/deployment/preflight-checklist.rst b/src/ceph/doc/rados/deployment/preflight-checklist.rst
new file mode 100644
index 0000000..64a669f
--- /dev/null
+++ b/src/ceph/doc/rados/deployment/preflight-checklist.rst
@@ -0,0 +1,109 @@
+=====================
+ Preflight Checklist
+=====================
+
+.. versionadded:: 0.60
+
+This **Preflight Checklist** will help you prepare an admin node for use with
+``ceph-deploy``, and server nodes for use with passwordless ``ssh`` and
+``sudo``.
+
+Before you can deploy Ceph using ``ceph-deploy``, you need to ensure that you
+have a few things set up first on your admin node and on nodes running Ceph
+daemons.
+
+
+Install an Operating System
+===========================
+
+Install a recent release of Debian or Ubuntu (e.g., 12.04 LTS, 14.04 LTS) on
+your nodes. For additional details on operating systems, or to use operating
+systems other than Debian or Ubuntu, see `OS Recommendations`_.
+
+
+Install an SSH Server
+=====================
+
+The ``ceph-deploy`` utility requires ``ssh``, so your server node(s) require an
+SSH server. ::
+
+ sudo apt-get install openssh-server
+
+
+Create a User
+=============
+
+Create a user on nodes running Ceph daemons.
+
+.. tip:: We recommend a username that brute force attackers won't
+ guess easily (e.g., something other than ``root``, ``ceph``, etc).
+
+::
+
+ ssh user@ceph-server
+ sudo useradd -d /home/ceph -m ceph
+ sudo passwd ceph
+
+
+``ceph-deploy`` installs packages onto your nodes. This means that
+the user you create requires passwordless ``sudo`` privileges.
+
+.. note:: We **DO NOT** recommend enabling the ``root`` password
+ for security reasons.
+
+To provide full privileges to the user, add the following to
+``/etc/sudoers.d/ceph``. ::
+
+ echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
+ sudo chmod 0440 /etc/sudoers.d/ceph
+
+
+Configure SSH
+=============
+
+Configure your admin machine with password-less SSH access to each node
+running Ceph daemons (leave the passphrase empty). ::
+
+ ssh-keygen
+ Generating public/private key pair.
+ Enter file in which to save the key (/ceph-client/.ssh/id_rsa):
+ Enter passphrase (empty for no passphrase):
+ Enter same passphrase again:
+ Your identification has been saved in /ceph-client/.ssh/id_rsa.
+ Your public key has been saved in /ceph-client/.ssh/id_rsa.pub.
+
+Copy the key to each node running Ceph daemons::
+
+ ssh-copy-id ceph@ceph-server
+
+Modify the ``~/.ssh/config`` file on your admin node so that it defaults
+to logging in as the user you created when no username is specified. ::
+
+ Host ceph-server
+ Hostname ceph-server.fqdn-or-ip-address.com
+ User ceph
+
+
+Install ceph-deploy
+===================
+
+To install ``ceph-deploy``, execute the following::
+
+ wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
+ echo deb http://ceph.com/debian-dumpling/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
+ sudo apt-get update
+ sudo apt-get install ceph-deploy
+
+
+Ensure Connectivity
+===================
+
+Ensure that your admin node has connectivity to the network and to your server
+nodes (e.g., configure ``iptables``, ``ufw`` or other tools that may block
+connections or traffic forwarding so that they allow what you need).
+
+
+Once you have completed this pre-flight checklist, you are ready to begin using
+``ceph-deploy``.
+
+.. _OS Recommendations: ../../../start/os-recommendations
diff --git a/src/ceph/doc/rados/index.rst b/src/ceph/doc/rados/index.rst
new file mode 100644
index 0000000..929bb7e
--- /dev/null
+++ b/src/ceph/doc/rados/index.rst
@@ -0,0 +1,76 @@
+======================
+ Ceph Storage Cluster
+======================
+
+The :term:`Ceph Storage Cluster` is the foundation for all Ceph deployments.
+Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, Ceph
+Storage Clusters consist of two types of daemons: a :term:`Ceph OSD Daemon`
+(OSD) stores data as objects on a storage node; and a :term:`Ceph Monitor` (MON)
+maintains a master copy of the cluster map. A Ceph Storage Cluster may contain
+thousands of storage nodes. A minimal system will have at least one
+Ceph Monitor and two Ceph OSD Daemons for data replication.
+
+The Ceph Filesystem, Ceph Object Storage and Ceph Block Devices read data from
+and write data to the Ceph Storage Cluster.
+
+.. raw:: html
+
+ <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style>
+ <table cellpadding="10"><colgroup><col width="33%"><col width="33%"><col width="33%"></colgroup><tbody valign="top"><tr><td><h3>Config and Deploy</h3>
+
+Ceph Storage Clusters have a few required settings, but most configuration
+settings have default values. A typical deployment uses a deployment tool
+to define a cluster and bootstrap a monitor. See `Deployment`_ for details
+on ``ceph-deploy``.
+
+.. toctree::
+ :maxdepth: 2
+
+ Configuration <configuration/index>
+ Deployment <deployment/index>
+
+.. raw:: html
+
+ </td><td><h3>Operations</h3>
+
+Once you have deployed a Ceph Storage Cluster, you may begin operating
+your cluster.
+
+.. toctree::
+ :maxdepth: 2
+
+
+ Operations <operations/index>
+
+.. toctree::
+ :maxdepth: 1
+
+ Man Pages <man/index>
+
+
+.. toctree::
+ :hidden:
+
+ troubleshooting/index
+
+.. raw:: html
+
+ </td><td><h3>APIs</h3>
+
+Most Ceph deployments use `Ceph Block Devices`_, `Ceph Object Storage`_ and/or the
+`Ceph Filesystem`_. You may also develop applications that talk directly to
+the Ceph Storage Cluster.
+
+.. toctree::
+ :maxdepth: 2
+
+ APIs <api/index>
+
+.. raw:: html
+
+ </td></tr></tbody></table>
+
+.. _Ceph Block Devices: ../rbd/
+.. _Ceph Filesystem: ../cephfs/
+.. _Ceph Object Storage: ../radosgw/
+.. _Deployment: ../rados/deployment/
diff --git a/src/ceph/doc/rados/man/index.rst b/src/ceph/doc/rados/man/index.rst
new file mode 100644
index 0000000..abeb88b
--- /dev/null
+++ b/src/ceph/doc/rados/man/index.rst
@@ -0,0 +1,34 @@
+=======================
+ Object Store Manpages
+=======================
+
+.. toctree::
+ :maxdepth: 1
+
+ ../../man/8/ceph-disk.rst
+ ../../man/8/ceph-volume.rst
+ ../../man/8/ceph-volume-systemd.rst
+ ../../man/8/ceph.rst
+ ../../man/8/ceph-deploy.rst
+ ../../man/8/ceph-rest-api.rst
+ ../../man/8/ceph-authtool.rst
+ ../../man/8/ceph-clsinfo.rst
+ ../../man/8/ceph-conf.rst
+ ../../man/8/ceph-debugpack.rst
+ ../../man/8/ceph-dencoder.rst
+ ../../man/8/ceph-mon.rst
+ ../../man/8/ceph-osd.rst
+ ../../man/8/ceph-kvstore-tool.rst
+ ../../man/8/ceph-run.rst
+ ../../man/8/ceph-syn.rst
+ ../../man/8/crushtool.rst
+ ../../man/8/librados-config.rst
+ ../../man/8/monmaptool.rst
+ ../../man/8/osdmaptool.rst
+ ../../man/8/rados.rst
+
+
+.. toctree::
+ :hidden:
+
+ ../../man/8/ceph-post-file.rst
diff --git a/src/ceph/doc/rados/operations/add-or-rm-mons.rst b/src/ceph/doc/rados/operations/add-or-rm-mons.rst
new file mode 100644
index 0000000..0cdc431
--- /dev/null
+++ b/src/ceph/doc/rados/operations/add-or-rm-mons.rst
@@ -0,0 +1,370 @@
+==========================
+ Adding/Removing Monitors
+==========================
+
+When you have a cluster up and running, you may add or remove monitors
+from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_
+or `Monitor Bootstrap`_.
+
+Adding Monitors
+===============
+
+Ceph monitors are light-weight processes that maintain a master copy of the
+cluster map. You can run a cluster with 1 monitor. We recommend at least 3
+monitors for a production cluster. Ceph monitors use a variation of the
+`Paxos`_ protocol to establish consensus about maps and other critical
+information across the cluster. Due to the nature of Paxos, Ceph requires
+a majority of monitors running to establish a quorum (thus establishing
+consensus).
+
+It is advisable, but not mandatory, to run an odd number of monitors. An
+odd number of monitors has a higher resiliency to failures than an
+even number of monitors. For instance, in a 2-monitor deployment, no
+failures can be tolerated in order to maintain a quorum; with 3 monitors,
+one failure can be tolerated; in a 4-monitor deployment, one failure can
+be tolerated; with 5 monitors, two failures can be tolerated. This is
+why an odd number is advisable. To summarize, Ceph needs a majority of
+monitors to be running (and able to communicate with each other), but that
+majority can be achieved using a single monitor, or 2 out of 2 monitors,
+2 out of 3, 3 out of 4, etc.
+
+For an initial deployment of a multi-node Ceph cluster, it is advisable to
+deploy three monitors, increasing the number two at a time if a valid need
+for more than three exists.
+
+Since monitors are light-weight, it is possible to run them on the same
+host as an OSD; however, we recommend running them on separate hosts,
+because fsync issues with the kernel may impair performance.
+
+.. note:: A *majority* of monitors in your cluster must be able to
+ reach each other in order to establish a quorum.
+
+Deploy your Hardware
+--------------------
+
+If you are adding a new host when adding a new monitor, see `Hardware
+Recommendations`_ for details on minimum recommendations for monitor hardware.
+To add a monitor host to your cluster, first make sure you have an up-to-date
+version of Linux installed (typically Ubuntu 14.04 or RHEL 7).
+
+Add your monitor host to a rack in your cluster, connect it to the network
+and ensure that it has network connectivity.
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+
+Install the Required Software
+-----------------------------
+
+For manually deployed clusters, you must install Ceph packages
+manually. See `Installing Packages`_ for details.
+You should configure SSH to a user with password-less authentication
+and root permissions.
+
+.. _Installing Packages: ../../../install/install-storage-cluster
+
+
+.. _Adding a Monitor (Manual):
+
+Adding a Monitor (Manual)
+-------------------------
+
+This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map
+and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If
+this results in only two monitor daemons, you may add more monitors by
+repeating this procedure until you have a sufficient number of ``ceph-mon``
+daemons to achieve a quorum.
+
+At this point you should define your monitor's id. Traditionally, monitors
+have been named with single letters (``a``, ``b``, ``c``, ...), but you are
+free to define the id as you see fit. For the purpose of this document,
+please take into account that ``{mon-id}`` should be the id you chose,
+without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a``
+on ``mon.a``).
+
+#. Create the default directory on the machine that will host your
+ new monitor. ::
+
+ ssh {new-mon-host}
+ sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
+
+#. Create a temporary directory ``{tmp}`` to keep the files needed during
+ this process. This directory should be different from the monitor's default
+ directory created in the previous step, and can be removed after all the
+ steps are executed. ::
+
+ mkdir {tmp}
+
+#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to
+ the retrieved keyring, and ``{key-filename}`` is the name of the file
+ containing the retrieved monitor key. ::
+
+ ceph auth get mon. -o {tmp}/{key-filename}
+
+#. Retrieve the monitor map, where ``{tmp}`` is the path to
+ the retrieved monitor map, and ``{map-filename}`` is the name of the file
+   containing the retrieved monitor map. ::
+
+ ceph mon getmap -o {tmp}/{map-filename}
+
+#. Prepare the monitor's data directory created in the first step. You must
+ specify the path to the monitor map so that you can retrieve the
+ information about a quorum of monitors and their ``fsid``. You must also
+ specify a path to the monitor keyring::
+
+ sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
+
+
+#. Start the new monitor and it will automatically join the cluster.
+ The daemon needs to know which address to bind to, either via
+ ``--public-addr {ip:port}`` or by setting ``mon addr`` in the
+ appropriate section of ``ceph.conf``. For example::
+
+ ceph-mon -i {mon-id} --public-addr {ip:port}
+
+
+Removing Monitors
+=================
+
+When you remove monitors from a cluster, consider that Ceph monitors use
+Paxos to establish consensus about the master cluster map. You must have
+a sufficient number of monitors to establish a quorum for consensus about
+the cluster map.
+
+.. _Removing a Monitor (Manual):
+
+Removing a Monitor (Manual)
+---------------------------
+
+This procedure removes a ``ceph-mon`` daemon from your cluster. If this
+procedure results in only two monitor daemons, you may add or remove another
+monitor until you have a number of ``ceph-mon`` daemons that can achieve a
+quorum.
+
+#. Stop the monitor. ::
+
+ service ceph -a stop mon.{mon-id}
+
+#. Remove the monitor from the cluster. ::
+
+ ceph mon remove {mon-id}
+
+#. Remove the monitor entry from ``ceph.conf``.
+
+
+Removing Monitors from an Unhealthy Cluster
+-------------------------------------------
+
+This procedure removes a ``ceph-mon`` daemon from an unhealthy
+cluster, for example a cluster where the monitors cannot form a
+quorum.
+
+
+#. Stop all ``ceph-mon`` daemons on all monitor hosts. ::
+
+ ssh {mon-host}
+ service ceph stop mon || stop ceph-mon-all
+ # and repeat for all mons
+
+#. Identify a surviving monitor and log in to that host. ::
+
+ ssh {mon-host}
+
+#. Extract a copy of the monmap file. ::
+
+ ceph-mon -i {mon-id} --extract-monmap {map-path}
+ # in most cases, that's
+ ceph-mon -i `hostname` --extract-monmap /tmp/monmap
+
+#. Remove the non-surviving or problematic monitors. For example, if
+ you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where
+ only ``mon.a`` will survive, follow the example below::
+
+ monmaptool {map-path} --rm {mon-id}
+ # for example,
+ monmaptool /tmp/monmap --rm b
+ monmaptool /tmp/monmap --rm c
+
+#. Inject the surviving map with the removed monitors into the
+ surviving monitor(s). For example, to inject a map into monitor
+ ``mon.a``, follow the example below::
+
+ ceph-mon -i {mon-id} --inject-monmap {map-path}
+ # for example,
+ ceph-mon -i a --inject-monmap /tmp/monmap
+
+#. Start only the surviving monitors.
+
+#. Verify the monitors form a quorum (``ceph -s``).
+
+#. You may wish to archive the removed monitors' data directory in
+ ``/var/lib/ceph/mon`` in a safe location, or delete it if you are
+ confident the remaining monitors are healthy and are sufficiently
+ redundant.
+
+.. _Changing a Monitor's IP address:
+
+Changing a Monitor's IP Address
+===============================
+
+.. important:: Existing monitors are not supposed to change their IP addresses.
+
+Monitors are critical components of a Ceph cluster, and they need to maintain a
+quorum for the whole system to work properly. To establish a quorum, the
+monitors need to discover each other. Ceph has strict requirements for
+discovering monitors.
+
+Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors.
+However, monitors discover each other using the monitor map, not ``ceph.conf``.
+For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you
+need to obtain the current monmap for the cluster when creating a new monitor,
+as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The
+following sections explain the consistency requirements for Ceph monitors, and a
+few safe ways to change a monitor's IP address.
+
+
+Consistency Requirements
+------------------------
+
+A monitor always refers to the local copy of the monmap when discovering other
+monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids
+errors that could break the cluster (e.g., typos in ``ceph.conf`` when
+specifying a monitor address or port). Since monitors use monmaps for discovery
+and they share monmaps with clients and other Ceph daemons, the monmap provides
+monitors with a strict guarantee that their consensus is valid.
+
+Strict consistency also applies to updates to the monmap. As with any other
+updates on the monitor, changes to the monmap always run through a distributed
+consensus algorithm called `Paxos`_. The monitors must agree on each update to
+the monmap, such as adding or removing a monitor, to ensure that each monitor in
+the quorum has the same version of the monmap. Updates to the monmap are
+incremental so that monitors have the latest agreed upon version, and a set of
+previous versions, allowing a monitor that has an older version of the monmap to
+catch up with the current state of the cluster.
+
+If monitors discovered each other through the Ceph configuration file instead of
+through the monmap, it would introduce additional risks because the Ceph
+configuration files are not updated and distributed automatically. Monitors
+might inadvertently use an older ``ceph.conf`` file, fail to recognize a
+monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able
+to determine the current state of the system accurately. Consequently, making
+changes to an existing monitor's IP address must be done with great care.
+
+
+Changing a Monitor's IP address (The Right Way)
+-----------------------------------------------
+
+Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to
+ensure that other monitors in the cluster will receive the update. To change a
+monitor's IP address, you must add a new monitor with the IP address you want
+to use (as described in `Adding a Monitor (Manual)`_), ensure that the new
+monitor successfully joins the quorum, and then remove the monitor that uses
+the old IP address. Finally, update the ``ceph.conf`` file to ensure that
+clients and other daemons know the IP address of the new monitor.
+
+For example, let's assume there are three monitors in place, such as ::
+
+ [mon.a]
+ host = host01
+ addr = 10.0.0.1:6789
+ [mon.b]
+ host = host02
+ addr = 10.0.0.2:6789
+ [mon.c]
+ host = host03
+ addr = 10.0.0.3:6789
+
+To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the
+steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure
+that ``mon.d`` is running before removing ``mon.c``, or it will break the
+quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving
+all three monitors would thus require repeating this process as many times as
+needed.
+
+
+Changing a Monitor's IP address (The Messy Way)
+-----------------------------------------------
+
+There may come a time when the monitors must be moved to a different network, a
+different part of the datacenter or a different datacenter altogether. While it
+is possible to do it, the process becomes a bit more hazardous.
+
+In such a case, the solution is to generate a new monmap with updated IP
+addresses for all the monitors in the cluster, and inject the new map on each
+individual monitor. This is not the most user-friendly approach, but we do not
+expect this to be something that needs to be done every other week. As it is
+clearly stated at the top of this section, monitors are not supposed to change
+their IP addresses.
+
+Using the previous monitor configuration as an example, assume you want to move
+all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these
+networks are unable to communicate. Use the following procedure:
+
+#. Retrieve the monitor map, where ``{tmp}`` is the path to
+ the retrieved monitor map, and ``{filename}`` is the name of the file
+   containing the retrieved monitor map. ::
+
+ ceph mon getmap -o {tmp}/{filename}
+
+#. The following example demonstrates the contents of the monmap. ::
+
+ $ monmaptool --print {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ epoch 1
+ fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
+ last_changed 2012-12-17 02:46:41.591248
+ created 2012-12-17 02:46:41.591248
+ 0: 10.0.0.1:6789/0 mon.a
+ 1: 10.0.0.2:6789/0 mon.b
+ 2: 10.0.0.3:6789/0 mon.c
+
+#. Remove the existing monitors. ::
+
+ $ monmaptool --rm a --rm b --rm c {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ monmaptool: removing a
+ monmaptool: removing b
+ monmaptool: removing c
+ monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
+
+#. Add the new monitor locations. ::
+
+ $ monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
+
+#. Check new contents. ::
+
+ $ monmaptool --print {tmp}/{filename}
+
+ monmaptool: monmap file {tmp}/{filename}
+ epoch 1
+ fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
+ last_changed 2012-12-17 02:46:41.591248
+ created 2012-12-17 02:46:41.591248
+ 0: 10.1.0.1:6789/0 mon.a
+ 1: 10.1.0.2:6789/0 mon.b
+ 2: 10.1.0.3:6789/0 mon.c
+
+At this point, we assume the monitors (and stores) are installed at the new
+location. The next step is to propagate the modified monmap to the new
+monitors, and inject the modified monmap into each new monitor.
+
+#. First, make sure to stop all your monitors. Injection must be done while
+ the daemon is not running.
+
+#. Inject the monmap. ::
+
+ ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
+
+#. Restart the monitors.
+
+After this step, migration to the new location is complete and
+the monitors should operate successfully.
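+
+Putting the injection steps together for a single monitor (here ``mon.a`` on a
+systemd-based host; the unit name is an assumption and may differ on your
+platform)::
+
+   sudo systemctl stop ceph-mon@a
+   ceph-mon -i a --inject-monmap {tmp}/{filename}
+   sudo systemctl start ceph-mon@a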
+
+
+.. _Manual Deployment: ../../../install/manual-deployment
+.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
diff --git a/src/ceph/doc/rados/operations/add-or-rm-osds.rst b/src/ceph/doc/rados/operations/add-or-rm-osds.rst
new file mode 100644
index 0000000..59ce4c7
--- /dev/null
+++ b/src/ceph/doc/rados/operations/add-or-rm-osds.rst
@@ -0,0 +1,366 @@
+======================
+ Adding/Removing OSDs
+======================
+
+When you have a cluster up and running, you may add OSDs or remove OSDs
+from the cluster at runtime.
+
+Adding OSDs
+===========
+
+When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an
+OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a
+host machine. If your host has multiple storage drives, you may map one
+``ceph-osd`` daemon for each drive.
+
+Generally, it's a good idea to check the capacity of your cluster to see if you
+are reaching the upper end of its capacity. As your cluster reaches its ``near
+full`` ratio, you should add one or more OSDs to expand your cluster's capacity.
+
+.. warning:: Do not let your cluster reach its ``full ratio`` before
+ adding an OSD. OSD failures that occur after the cluster reaches
+ its ``near full`` ratio may cause the cluster to exceed its
+ ``full ratio``.
+
+Deploy your Hardware
+--------------------
+
+If you are adding a new host when adding a new OSD, see `Hardware
+Recommendations`_ for details on minimum recommendations for OSD hardware. To
+add an OSD host to your cluster, first make sure you have an up-to-date version
+of Linux installed, and you have made some initial preparations for your
+storage drives. See `Filesystem Recommendations`_ for details.
+
+Add your OSD host to a rack in your cluster, connect it to the network
+and ensure that it has network connectivity. See the `Network Configuration
+Reference`_ for details.
+
+.. _Hardware Recommendations: ../../../start/hardware-recommendations
+.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations
+.. _Network Configuration Reference: ../../configuration/network-config-ref
+
+Install the Required Software
+-----------------------------
+
+For manually deployed clusters, you must install Ceph packages
+manually. See `Installing Ceph (Manual)`_ for details.
+You should configure SSH to a user with password-less authentication
+and root permissions.
+
+.. _Installing Ceph (Manual): ../../../install
+
+
+Adding an OSD (Manual)
+----------------------
+
+This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive,
+and configures the cluster to distribute data to the OSD. If your host has
+multiple drives, you may add an OSD for each drive by repeating this procedure.
+
+To add an OSD, create a data directory for it, mount a drive to that directory,
+add the OSD to the cluster, and then add it to the CRUSH map.
+
+When you add the OSD to the CRUSH map, consider the weight you give to the new
+OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger
+hard drives than older hosts in the cluster (i.e., they may have greater
+weight).
+
+.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives
+ of dissimilar size, you can adjust their weights. However, for best
+ performance, consider a CRUSH hierarchy with drives of the same type/size.
+
+#. Create the OSD. If no UUID is given, it will be set automatically when the
+ OSD starts up. The following command will output the OSD number, which you
+ will need for subsequent steps. ::
+
+ ceph osd create [{uuid} [{id}]]
+
+   If the optional parameter ``{id}`` is given, it will be used as the OSD id.
+   Note that in this case the command may fail if the number is already in use.
+
+ .. warning:: In general, explicitly specifying {id} is not recommended.
+ IDs are allocated as an array, and skipping entries consumes some extra
+ memory. This can become significant if there are large gaps and/or
+ clusters are large. If {id} is not specified, the smallest available is
+ used.
+
+#. Create the default directory on your new OSD. ::
+
+ ssh {new-osd-host}
+ sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
+
+
+#. If the OSD is for a drive other than the OS drive, prepare it
+ for use with Ceph, and mount it to the directory you just created::
+
+ ssh {new-osd-host}
+ sudo mkfs -t {fstype} /dev/{drive}
+ sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
+
+
+#. Initialize the OSD data directory. ::
+
+ ssh {new-osd-host}
+ ceph-osd -i {osd-num} --mkfs --mkkey
+
+ The directory must be empty before you can run ``ceph-osd``.
+
+#. Register the OSD authentication key. The value of ``ceph`` for
+   ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your
+   cluster name differs from ``ceph``, use your cluster name instead. ::
+
+ ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring
+
+
+#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The
+ ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy
+ wherever you wish. If you specify at least one bucket, the command
+ will place the OSD into the most specific bucket you specify, *and* it will
+ move that bucket underneath any other buckets you specify. **Important:** If
+ you specify only the root bucket, the command will attach the OSD directly
+ to the root, but CRUSH rules expect OSDs to be inside of hosts.
+
+ For Argonaut (v 0.48), execute the following::
+
+ ceph osd crush add {id} {name} {weight} [{bucket-type}={bucket-name} ...]
+
+ For Bobtail (v 0.56) and later releases, execute the following::
+
+ ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...]
+
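+   For example, assuming a (hypothetical) host bucket ``node1`` under the
+   ``default`` root, adding ``osd.1`` with a weight of ``1.0`` would be::
+
+      ceph osd crush add osd.1 1.0 root=default host=node1
+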
+ You may also decompile the CRUSH map, add the OSD to the device list, add the
+ host as a bucket (if it's not already in the CRUSH map), add the device as an
+ item in the host, assign it a weight, recompile it and set it. See
+ `Add/Move an OSD`_ for details.
+
+
+.. topic:: Argonaut (v0.48) Best Practices
+
+ To limit impact on user I/O performance, add an OSD to the CRUSH map
+ with an initial weight of ``0``. Then, ramp up the CRUSH weight a
+ little bit at a time. For example, to ramp by increments of ``0.2``,
+ start with::
+
+ ceph osd crush reweight {osd-id} .2
+
+ and allow migration to complete before reweighting to ``0.4``,
+ ``0.6``, and so on until the desired CRUSH weight is reached.
+
+ To limit the impact of OSD failures, you can set::
+
+ mon osd down out interval = 0
+
+ which prevents down OSDs from automatically being marked out, and then
+ ramp them down manually with::
+
+ ceph osd reweight {osd-num} .8
+
+ Again, wait for the cluster to finish migrating data, and then adjust
+   the weight further until you reach a weight of 0. Note that this
+   setting prevents the cluster from automatically re-replicating data
+   after a failure, so ensure that sufficient monitoring is in place for
+   an administrator to intervene promptly.
+
+ Note that this practice will no longer be necessary in Bobtail and
+ subsequent releases.
+
+
+Replacing an OSD
+----------------
+
+When disks fail, or if an administrator wants to reprovision OSDs with a new
+backend (for instance, when switching from FileStore to BlueStore), OSDs need
+to be replaced. Unlike `Removing the OSD`_, the replaced OSD's id and CRUSH map
+entry need to be kept intact after the OSD is destroyed for replacement.
+
+#. Destroy the OSD first::
+
+ ceph osd destroy {id} --yes-i-really-mean-it
+
+#. Zap the disk for the new OSD if it was previously used for other purposes.
+   This is not necessary for a new disk::
+
+ ceph-disk zap /dev/sdX
+
+#. Prepare the disk for replacement by using the previously destroyed OSD id::
+
+ ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen`
+
+#. And activate the OSD::
+
+ ceph-disk activate /dev/sdX1
+
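+Taken together, a hypothetical replacement of ``osd.12`` on ``/dev/sdb``
+would run::
+
+   ceph osd destroy 12 --yes-i-really-mean-it
+   ceph-disk zap /dev/sdb
+   ceph-disk prepare --bluestore /dev/sdb --osd-id 12 --osd-uuid `uuidgen`
+   ceph-disk activate /dev/sdb1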
+
+Starting the OSD
+----------------
+
+After you add an OSD to Ceph, the OSD is in your configuration. However,
+it is not yet running. The OSD is ``down`` and ``in``. You must start
+your new OSD before it can begin receiving data. You may use
+``service ceph`` from your admin host or start the OSD from its host
+machine.
+
+For Ubuntu Trusty use Upstart. ::
+
+ sudo start ceph-osd id={osd-num}
+
+For all other distros use systemd. ::
+
+ sudo systemctl start ceph-osd@{osd-num}
+
+
+Once you start your OSD, it is ``up`` and ``in``.
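+
+You can confirm that the new OSD is ``up`` by listing the OSD tree::
+
+   ceph osd tree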
+
+
+Observe the Data Migration
+--------------------------
+
+Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing
+the server by migrating placement groups to your new OSD. You can observe this
+process with the `ceph`_ tool. ::
+
+ ceph -w
+
+You should see the placement group states change from ``active+clean`` to
+``active, some degraded objects``, and finally ``active+clean`` when migration
+completes. (Control-c to exit.)
+
+
+.. _Add/Move an OSD: ../crush-map#addosd
+.. _ceph: ../monitoring
+
+
+
+Removing OSDs (Manual)
+======================
+
+When you want to reduce the size of a cluster or replace hardware, you may
+remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd``
+daemon for one storage drive within a host machine. If your host has multiple
+storage drives, you may need to remove one ``ceph-osd`` daemon for each drive.
+Generally, it's a good idea to check the capacity of your cluster to see if you
+are reaching the upper end of its capacity. Ensure that when you remove an OSD
+that your cluster is not at its ``near full`` ratio.
+
+.. warning:: Do not let your cluster reach its ``full ratio`` when
+ removing an OSD. Removing OSDs could cause the cluster to reach
+ or exceed its ``full ratio``.
+
+
+Take the OSD out of the Cluster
+-----------------------------------
+
+Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it
+out of the cluster so that Ceph can begin rebalancing and copying its data to
+other OSDs. ::
+
+ ceph osd out {osd-num}
+
+
+Observe the Data Migration
+--------------------------
+
+Once you have taken your OSD ``out`` of the cluster, Ceph will begin
+rebalancing the cluster by migrating placement groups out of the OSD you
+removed. You can observe this process with the `ceph`_ tool. ::
+
+ ceph -w
+
+You should see the placement group states change from ``active+clean`` to
+``active, some degraded objects``, and finally ``active+clean`` when migration
+completes. (Control-c to exit.)
+
+.. note:: Sometimes, typically in a "small" cluster with few hosts (for
+   instance with a small testing cluster), taking the OSD ``out`` can
+   trigger a CRUSH corner case where some PGs remain stuck in the
+   ``active+remapped`` state. In that case, mark the OSD ``in`` with:
+
+   ``ceph osd in {osd-num}``
+
+   to return to the initial state and then, instead of marking the OSD
+   ``out``, set its weight to 0 with:
+
+   ``ceph osd crush reweight osd.{osd-num} 0``
+
+   After that, you can observe the data migration, which should run to
+   completion. The difference between marking the OSD ``out`` and
+   reweighting it to 0 is that in the first case the weight of the bucket
+   which contains the OSD is not changed, whereas in the second case the
+   weight of the bucket is updated (decreased by the OSD weight). The
+   reweight command is therefore sometimes preferable for a "small"
+   cluster.
+
+
+
+Stopping the OSD
+----------------
+
+After you take an OSD out of the cluster, it may still be running.
+That is, the OSD may be ``up`` and ``out``. You must stop
+your OSD before you remove it from the configuration. ::
+
+ ssh {osd-host}
+ sudo systemctl stop ceph-osd@{osd-num}
+
+Once you stop your OSD, it is ``down``.
+
+
+Removing the OSD
+----------------
+
+This procedure removes an OSD from a cluster map, removes its authentication
+key, removes the OSD from the OSD map, and removes the OSD from the
+``ceph.conf`` file. If your host has multiple drives, you may need to remove an
+OSD for each drive by repeating this procedure.
+
+#. Let the cluster forget the OSD first. This step removes the OSD from the
+   CRUSH map, removes its authentication key, and removes it from the OSD map
+   as well. Note that the `purge subcommand`_ was introduced in Luminous; for
+   older versions, see below. ::
+
+ ceph osd purge {id} --yes-i-really-mean-it
+
+#. Navigate to the host where you keep the master copy of the cluster's
+ ``ceph.conf`` file. ::
+
+ ssh {admin-host}
+ cd /etc/ceph
+ vim ceph.conf
+
+#. Remove the OSD entry from your ``ceph.conf`` file (if it exists). ::
+
+ [osd.1]
+ host = {hostname}
+
+#. From the host where you keep the master copy of the cluster's ``ceph.conf`` file,
+ copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of other
+ hosts in your cluster.
+
+If your Ceph cluster is older than Luminous, instead of using ``ceph osd purge``,
+you need to perform this step manually:
+
+
+#. Remove the OSD from the CRUSH map so that it no longer receives data. You may
+ also decompile the CRUSH map, remove the OSD from the device list, remove the
+ device as an item in the host bucket or remove the host bucket (if it's in the
+ CRUSH map and you intend to remove the host), recompile the map and set it.
+ See `Remove an OSD`_ for details. ::
+
+ ceph osd crush remove {name}
+
+#. Remove the OSD authentication key. ::
+
+ ceph auth del osd.{osd-num}
+
+ The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the ``$cluster-$id``.
+ If your cluster name differs from ``ceph``, use your cluster name instead.
+
+#. Remove the OSD. ::
+
+ ceph osd rm {osd-num}
+ #for example
+ ceph osd rm 1
+
+
+.. _Remove an OSD: ../crush-map#removeosd
+.. _purge subcommand: /man/8/ceph#osd
diff --git a/src/ceph/doc/rados/operations/cache-tiering.rst b/src/ceph/doc/rados/operations/cache-tiering.rst
new file mode 100644
index 0000000..322c6ff
--- /dev/null
+++ b/src/ceph/doc/rados/operations/cache-tiering.rst
@@ -0,0 +1,461 @@
+===============
+ Cache Tiering
+===============
+
+A cache tier provides Ceph Clients with better I/O performance for a subset of
+the data stored in a backing storage tier. Cache tiering involves creating a
+pool of relatively fast/expensive storage devices (e.g., solid state drives)
+configured to act as a cache tier, and a backing pool of either erasure-coded
+or relatively slower/cheaper devices configured to act as an economical storage
+tier. The Ceph objecter handles where to place the objects and the tiering
+agent determines when to flush objects from the cache to the backing storage
+tier. So the cache tier and the backing storage tier are completely transparent
+to Ceph clients.
+
+
+.. ditaa::
+ +-------------+
+ | Ceph Client |
+ +------+------+
+ ^
+ Tiering is |
+ Transparent | Faster I/O
+ to Ceph | +---------------+
+ Client Ops | | |
+ | +----->+ Cache Tier |
+ | | | |
+ | | +-----+---+-----+
+ | | | ^
+ v v | | Active Data in Cache Tier
+ +------+----+--+ | |
+ | Objecter | | |
+ +-----------+--+ | |
+ ^ | | Inactive Data in Storage Tier
+ | v |
+ | +-----+---+-----+
+ | | |
+ +----->| Storage Tier |
+ | |
+ +---------------+
+ Slower I/O
+
+
+The cache tiering agent handles the migration of data between the cache tier
+and the backing storage tier automatically. However, admins have the ability to
+configure how this migration takes place. There are two main scenarios:
+
+- **Writeback Mode:** When admins configure tiers with ``writeback`` mode, Ceph
+ clients write data to the cache tier and receive an ACK from the cache tier.
+ In time, the data written to the cache tier migrates to the storage tier
+ and gets flushed from the cache tier. Conceptually, the cache tier is
+ overlaid "in front" of the backing storage tier. When a Ceph client needs
+ data that resides in the storage tier, the cache tiering agent migrates the
+ data to the cache tier on read, then it is sent to the Ceph client.
+ Thereafter, the Ceph client can perform I/O using the cache tier, until the
+ data becomes inactive. This is ideal for mutable data (e.g., photo/video
+ editing, transactional data, etc.).
+
+- **Read-proxy Mode:** This mode will use any objects that already
+ exist in the cache tier, but if an object is not present in the
+ cache the request will be proxied to the base tier. This is useful
+ for transitioning from ``writeback`` mode to a disabled cache as it
+ allows the workload to function properly while the cache is drained,
+ without adding any new objects to the cache.
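+
+For example, assuming the pool names used later in this document, an existing
+cache tier can be switched into ``readproxy`` mode while it is drained with::
+
+   ceph osd tier cache-mode hot-storage readproxy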
+
+A word of caution
+=================
+
+Cache tiering will *degrade* performance for most workloads. Users should use
+extreme caution before using this feature.
+
+* *Workload dependent*: Whether a cache will improve performance is
+ highly dependent on the workload. Because there is a cost
+ associated with moving objects into or out of the cache, it can only
+ be effective when there is a *large skew* in the access pattern in
+ the data set, such that most of the requests touch a small number of
+ objects. The cache pool should be large enough to capture the
+ working set for your workload to avoid thrashing.
+
+* *Difficult to benchmark*: Most benchmarks that users run to measure
+ performance will show terrible performance with cache tiering, in
+  part because very few of them skew requests toward a small set of
+  objects, because it can take a long time for the cache to "warm up,"
+  and because the warm-up cost can be high.
+
+* *Usually slower*: For workloads that are not cache tiering-friendly,
+ performance is often slower than a normal RADOS pool without cache
+ tiering enabled.
+
+* *librados object enumeration*: The librados-level object enumeration
+  API is not meant to be coherent in the presence of a cache. If
+  your application is using librados directly and relies on object
+  enumeration, cache tiering will probably not work as expected.
+  (This is not a problem for RGW, RBD, or CephFS.)
+
+* *Complexity*: Enabling cache tiering means that a lot of additional
+ machinery and complexity within the RADOS cluster is being used.
+ This increases the probability that you will encounter a bug in the system
+ that other users have not yet encountered and will put your deployment at a
+ higher level of risk.
+
+Known Good Workloads
+--------------------
+
+* *RGW time-skewed*: If the RGW workload is such that almost all read
+ operations are directed at recently written objects, a simple cache
+ tiering configuration that destages recently written objects from
+ the cache to the base tier after a configurable period can work
+ well.
+
+Known Bad Workloads
+-------------------
+
+The following configurations are *known to work poorly* with cache
+tiering.
+
+* *RBD with replicated cache and erasure-coded base*: This is a common
+ request, but usually does not perform well. Even reasonably skewed
+ workloads still send some small writes to cold objects, and because
+ small writes are not yet supported by the erasure-coded pool, entire
+ (usually 4 MB) objects must be migrated into the cache in order to
+ satisfy a small (often 4 KB) write. Only a handful of users have
+ successfully deployed this configuration, and it only works for them
+ because their data is extremely cold (backups) and they are not in
+ any way sensitive to performance.
+
+* *RBD with replicated cache and base*: RBD with a replicated base
+ tier does better than when the base is erasure coded, but it is
+ still highly dependent on the amount of skew in the workload, and
+ very difficult to validate. The user will need to have a good
+ understanding of their workload and will need to tune the cache
+ tiering parameters carefully.
+
+
+Setting Up Pools
+================
+
+To set up cache tiering, you must have two pools. One will act as the
+backing storage and the other will act as the cache.
+
+
+Setting Up a Backing Storage Pool
+---------------------------------
+
+Setting up a backing storage pool typically involves one of two scenarios:
+
+- **Standard Storage**: In this scenario, the pool stores multiple copies
+ of an object in the Ceph Storage Cluster.
+
+- **Erasure Coding:** In this scenario, the pool uses erasure coding to
+ store data much more efficiently with a small performance tradeoff.
+
+In the standard storage scenario, you can set up a CRUSH ruleset to establish
+the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD
+Daemons perform optimally when all storage drives in the ruleset are of the
+same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
+for details on creating a ruleset. Once you have created a ruleset, create
+a backing storage pool.
+
+In the erasure coding scenario, the pool creation arguments will generate the
+appropriate ruleset automatically. See `Create a Pool`_ for details.
+
+In subsequent examples, we will refer to the backing storage pool
+as ``cold-storage``.
+
+
+Setting Up a Cache Pool
+-----------------------
+
+Setting up a cache pool follows the same procedure as the standard storage
+scenario, but with this difference: the drives for the cache tier are typically
+high performance drives that reside in their own servers and have their own
+ruleset. When setting up a ruleset, it should take into account the hosts that
+have the high performance drives while omitting the hosts that don't. See
+`Placing Different Pools on Different OSDs`_ for details.
+
+
+In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
+the backing pool as ``cold-storage``.
+
+For cache tier configuration and default values, see
+`Pools - Set Pool Values`_.
+
+
+Creating a Cache Tier
+=====================
+
+Setting up a cache tier involves associating a backing storage pool with
+a cache pool ::
+
+ ceph osd tier add {storagepool} {cachepool}
+
+For example ::
+
+ ceph osd tier add cold-storage hot-storage
+
+To set the cache mode, execute the following::
+
+ ceph osd tier cache-mode {cachepool} {cache-mode}
+
+For example::
+
+ ceph osd tier cache-mode hot-storage writeback
+
+The cache tiers overlay the backing storage tier, so they require one
+additional step: you must direct all client traffic from the storage pool to
+the cache pool. To direct client traffic directly to the cache pool, execute
+the following::
+
+ ceph osd tier set-overlay {storagepool} {cachepool}
+
+For example::
+
+ ceph osd tier set-overlay cold-storage hot-storage
+
+
+Configuring a Cache Tier
+========================
+
+Cache tiers have several configuration options. You may set
+cache tier configuration options with the following usage::
+
+ ceph osd pool set {cachepool} {key} {value}
+
+See `Pools - Set Pool Values`_ for details.
+
+
+Target Size and Type
+--------------------
+
+Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``::
+
+ ceph osd pool set {cachepool} hit_set_type bloom
+
+For example::
+
+ ceph osd pool set hot-storage hit_set_type bloom
+
+The ``hit_set_count`` and ``hit_set_period`` define how much time each HitSet
+should cover, and how many such HitSets to store. ::
+
+ ceph osd pool set {cachepool} hit_set_count 12
+ ceph osd pool set {cachepool} hit_set_period 14400
+ ceph osd pool set {cachepool} target_max_bytes 1000000000000
+
+.. note:: A larger ``hit_set_count`` results in more RAM consumed by
+ the ``ceph-osd`` process.
+
+Binning accesses over time allows Ceph to determine whether a Ceph client
+accessed an object at least once, or more than once over a time period
+("age" vs "temperature").
+
+The ``min_read_recency_for_promote`` defines how many HitSets to check for the
+existence of an object when handling a read operation. The checking result is
+used to decide whether to promote the object asynchronously. Its value should be
+between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted.
+If it's set to 1, the current HitSet is checked. And if this object is in the
+current HitSet, it's promoted. Otherwise not. For the other values, the exact
+number of archive HitSets are checked. The object is promoted if the object is
+found in any of the most recent ``min_read_recency_for_promote`` HitSets.
+
+A similar parameter can be set for the write operation, which is
+``min_write_recency_for_promote``. ::
+
+ ceph osd pool set {cachepool} min_read_recency_for_promote 2
+ ceph osd pool set {cachepool} min_write_recency_for_promote 2
+
+.. note:: The longer the period and the higher the
+ ``min_read_recency_for_promote`` and
+   ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
+ daemon consumes. In particular, when the agent is active to flush
+ or evict cache objects, all ``hit_set_count`` HitSets are loaded
+ into RAM.
+
+
+Cache Sizing
+------------
+
+The cache tiering agent performs two main functions:
+
+- **Flushing:** The agent identifies modified (or dirty) objects and forwards
+ them to the storage pool for long-term storage.
+
+- **Evicting:** The agent identifies objects that haven't been modified
+ (or clean) and evicts the least recently used among them from the cache.
+
+
+Absolute Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects based upon the total number
+of bytes or the total number of objects. To specify a maximum number of bytes,
+execute the following::
+
+ ceph osd pool set {cachepool} target_max_bytes {#bytes}
+
+For example, to flush or evict at 1 TB, execute the following::
+
+ ceph osd pool set hot-storage target_max_bytes 1099511627776
+
+
+To specify the maximum number of objects, execute the following::
+
+ ceph osd pool set {cachepool} target_max_objects {#objects}
+
+For example, to flush or evict at 1M objects, execute the following::
+
+ ceph osd pool set hot-storage target_max_objects 1000000
+
+.. note:: Ceph is not able to determine the size of a cache pool automatically,
+   so configuration of the absolute size is required here; otherwise the
+   flush/evict will not work. If you specify both limits, the cache tiering
+   agent will begin flushing or evicting when either threshold is triggered.
+
+.. note:: All client requests will be blocked only when ``target_max_bytes`` or
+   ``target_max_objects`` is reached.
+
+Relative Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects relative to the size of the
+cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
+`Absolute sizing`_). When the cache pool consists of a certain percentage of
+modified (or dirty) objects, the cache tiering agent will flush them to the
+storage pool. To set the ``cache_target_dirty_ratio``, execute the following::
+
+ ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}
+
+For example, setting the value to ``0.4`` will begin flushing modified
+(dirty) objects when they reach 40% of the cache pool's capacity::
+
+ ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+
+When the dirty objects reach a certain percentage of the cache pool's capacity,
+the agent flushes dirty objects at a higher speed. To set the
+``cache_target_dirty_high_ratio``::
+
+ ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}
+
+For example, setting the value to ``0.6`` will begin aggressively flushing
+dirty objects when they reach 60% of the cache pool's capacity. This value
+should lie between the dirty ratio and the full ratio::
+
+ ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+
+When the cache pool reaches a certain percentage of its capacity, the cache
+tiering agent will evict objects to maintain free capacity. To set the
+``cache_target_full_ratio``, execute the following::
+
+ ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}
+
+For example, setting the value to ``0.8`` will begin flushing unmodified
+(clean) objects when they reach 80% of the cache pool's capacity::
+
+ ceph osd pool set hot-storage cache_target_full_ratio 0.8
+
+
+Cache Age
+---------
+
+You can specify the minimum age of an object before the cache tiering agent
+flushes a recently modified (or dirty) object to the backing storage pool::
+
+ ceph osd pool set {cachepool} cache_min_flush_age {#seconds}
+
+For example, to flush modified (or dirty) objects after 10 minutes, execute
+the following::
+
+ ceph osd pool set hot-storage cache_min_flush_age 600
+
+You can specify the minimum age of an object before it will be evicted from
+the cache tier::
+
+   ceph osd pool set {cachepool} cache_min_evict_age {#seconds}
+
+For example, to evict objects after 30 minutes, execute the following::
+
+ ceph osd pool set hot-storage cache_min_evict_age 1800
+
+
+Removing a Cache Tier
+=====================
+
+Removing a cache tier differs depending on whether it is a writeback
+cache or a read-only cache.
+
+
+Removing a Read-Only Cache
+--------------------------
+
+Since a read-only cache does not have modified data, you can disable
+and remove it without losing any recent changes to objects in the cache.
+
+#. Change the cache-mode to ``none`` to disable it. ::
+
+ ceph osd tier cache-mode {cachepool} none
+
+ For example::
+
+ ceph osd tier cache-mode hot-storage none
+
+#. Remove the cache pool from the backing pool. ::
+
+ ceph osd tier remove {storagepool} {cachepool}
+
+ For example::
+
+ ceph osd tier remove cold-storage hot-storage
+
+
+
+Removing a Writeback Cache
+--------------------------
+
+Since a writeback cache may have modified data, you must take steps to ensure
+that you do not lose any recent changes to objects in the cache before you
+disable and remove it.
+
+
+#. Change the cache mode to ``forward`` so that new and modified objects will
+ flush to the backing storage pool. ::
+
+ ceph osd tier cache-mode {cachepool} forward
+
+ For example::
+
+ ceph osd tier cache-mode hot-storage forward
+
+
+#. Ensure that the cache pool has been flushed. This may take a few minutes::
+
+ rados -p {cachepool} ls
+
+ If the cache pool still has objects, you can flush them manually.
+ For example::
+
+ rados -p {cachepool} cache-flush-evict-all
+
+
+#. Remove the overlay so that clients will not direct traffic to the cache. ::
+
+   ceph osd tier remove-overlay {storagepool}
+
+ For example::
+
+ ceph osd tier remove-overlay cold-storage
+
+
+#. Finally, remove the cache tier pool from the backing storage pool. ::
+
+ ceph osd tier remove {storagepool} {cachepool}
+
+ For example::
+
+ ceph osd tier remove cold-storage hot-storage
+
+
+.. _Create a Pool: ../pools#create-a-pool
+.. _Pools - Set Pool Values: ../pools#set-pool-values
+.. _Placing Different Pools on Different OSDs: ../crush-map/#placing-different-pools-on-different-osds
+.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
+.. _CRUSH Maps: ../crush-map
+.. _Absolute Sizing: #absolute-sizing
diff --git a/src/ceph/doc/rados/operations/control.rst b/src/ceph/doc/rados/operations/control.rst
new file mode 100644
index 0000000..1a58076
--- /dev/null
+++ b/src/ceph/doc/rados/operations/control.rst
@@ -0,0 +1,453 @@
+.. index:: control, commands
+
+==================
+ Control Commands
+==================
+
+
+Monitor Commands
+================
+
+Monitor commands are issued using the ceph utility::
+
+ ceph [-m monhost] {command}
+
+The command is usually (though not always) of the form::
+
+ ceph {subsystem} {command}
+
+
+System Commands
+===============
+
+Execute the following to display the current status of the cluster. ::
+
+ ceph -s
+ ceph status
+
+Execute the following to display a running summary of the status of the cluster,
+and major events. ::
+
+ ceph -w
+
+Execute the following to show the monitor quorum, including which monitors are
+participating and which one is the leader. ::
+
+ ceph quorum_status
+
+Execute the following to query the status of a single monitor, including whether
+or not it is in the quorum. ::
+
+ ceph [-m monhost] mon_status
+
+
+Authentication Subsystem
+========================
+
+To add a keyring for an OSD, execute the following::
+
+ ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}
+
+To list the cluster's keys and their capabilities, execute the following::
+
+ ceph auth ls
+
+
+Placement Group Subsystem
+=========================
+
+To display the statistics for all placement groups, execute the following::
+
+ ceph pg dump [--format {format}]
+
+The valid formats are ``plain`` (default) and ``json``.
+
+To display the statistics for all placement groups stuck in a specified state,
+execute the following::
+
+ ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]
+
+
+``--format`` may be ``plain`` (default) or ``json``
+
+``--threshold`` defines how many seconds "stuck" is (default: 300)
+
+**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
+with the most up-to-date data to come back.
+
+**Unclean** Placement groups contain objects that are not replicated the desired number
+of times. They should be recovering.
+
+**Stale** Placement groups are in an unknown state - the OSDs that host them have not
+reported to the monitor cluster in a while (configured by
+``mon_osd_report_timeout``).
+
+Delete "lost" objects or revert them to their prior state, either a previous version
+or delete them if they were just created. ::
+
+ ceph pg {pgid} mark_unfound_lost revert|delete
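+
+For example, to revert the unfound objects of a (hypothetical) placement group
+``2.5``::
+
+   ceph pg 2.5 mark_unfound_lost revert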
+
+
+OSD Subsystem
+=============
+
+Query OSD subsystem status. ::
+
+ ceph osd stat
+
+Write a copy of the most recent OSD map to a file. See
+`osdmaptool`_. ::
+
+ ceph osd getmap -o file
+
+.. _osdmaptool: ../../man/8/osdmaptool
+
+Write a copy of the crush map from the most recent OSD map to
+file. ::
+
+ ceph osd getcrushmap -o file
+
+The foregoing is functionally equivalent to ::
+
+ ceph osd getmap -o /tmp/osdmap
+ osdmaptool /tmp/osdmap --export-crush file
+
+Dump the OSD map. Valid formats for ``-f`` are ``plain`` and ``json``. If no
+``--format`` option is given, the OSD map is dumped as plain text. ::
+
+ ceph osd dump [--format {format}]
+
+Dump the OSD map as a tree with one line per OSD containing weight
+and state. ::
+
+ ceph osd tree [--format {format}]
+
+Find out where a specific object is or would be stored in the system::
+
+ ceph osd map <pool-name> <object-name>
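+
+For example, assuming a pool named ``rbd`` and an object named ``myobject``::
+
+   ceph osd map rbd myobject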
+
+Add or move a new item (OSD) with the given id/name/weight at the specified
+location. ::
+
+ ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]]
+
+Remove an existing item (OSD) from the CRUSH map. ::
+
+ ceph osd crush remove {name}
+
+Remove an existing bucket from the CRUSH map. ::
+
+ ceph osd crush remove {bucket-name}
+
+Move an existing bucket from one position in the hierarchy to another. ::
+
+ ceph osd crush move {id} {loc1} [{loc2} ...]
+
+Set the weight of the item given by ``{name}`` to ``{weight}``. ::
+
+ ceph osd crush reweight {name} {weight}
+
+Mark an OSD as lost. This may result in permanent data loss. Use with caution. ::
+
+ ceph osd lost {id} [--yes-i-really-mean-it]
+
+Create a new OSD. If no UUID is given, it will be set automatically when the OSD
+starts up. ::
+
+ ceph osd create [{uuid}]
+
+Remove the given OSD(s). ::
+
+ ceph osd rm [{id}...]
+
+Query the current max_osd parameter in the OSD map. ::
+
+ ceph osd getmaxosd
+
+Import the given crush map. ::
+
+ ceph osd setcrushmap -i file
+
+Set the ``max_osd`` parameter in the OSD map. This is necessary when
+expanding the storage cluster. ::
+
+   ceph osd setmaxosd {num}
+
+Mark OSD ``{osd-num}`` down. ::
+
+ ceph osd down {osd-num}
+
+Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). ::
+
+ ceph osd out {osd-num}
+
+Mark ``{osd-num}`` in the distribution (i.e. allocated data). ::
+
+ ceph osd in {osd-num}
+
+Set or clear the pause flags in the OSD map. If set, no IO requests
+will be sent to any OSD. Clearing the flags via unpause results in
+resending pending requests. ::
+
+ ceph osd pause
+ ceph osd unpause
+
+Set the weight of ``{osd-num}`` to ``{weight}``. Two OSDs with the
+same weight will receive roughly the same number of I/O requests and
+store approximately the same amount of data. ``ceph osd reweight``
+sets an override weight on the OSD. This value is in the range 0 to 1,
+and forces CRUSH to re-place (1-weight) of the data that would
+otherwise live on this drive. It does not change the weights assigned
+to the buckets above the OSD in the crush map, and is a corrective
+measure in case the normal CRUSH distribution is not working out quite
+right. For instance, if one of your OSDs is at 90% and the others are
+at 50%, you could reduce this weight to try and compensate for it. ::
+
+ ceph osd reweight {osd-num} {weight}
+
+Reweights all the OSDs by reducing the weight of OSDs which are
+heavily overused. By default it will adjust the weights downward on
+OSDs which have 120% of the average utilization, but if you include
+threshold it will use that percentage instead. ::
+
+ ceph osd reweight-by-utilization [threshold]
+
+Describes what reweight-by-utilization would do. ::
+
+ ceph osd test-reweight-by-utilization
+
+Adds/removes the address to/from the blacklist. When adding an address,
+you can specify how long it should be blacklisted in seconds; otherwise,
+it will default to 1 hour. A blacklisted address is prevented from
+connecting to any OSD. Blacklisting is most often used to prevent a
+lagging metadata server from making bad changes to data on the OSDs.
+
+These commands are mostly only useful for failure testing, as
+blacklists are normally maintained automatically and shouldn't need
+manual intervention. ::
+
+ ceph osd blacklist add ADDRESS[:source_port] [TIME]
+ ceph osd blacklist rm ADDRESS[:source_port]
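+
+For example, to blacklist a (hypothetical) client address for one hour::
+
+   ceph osd blacklist add 10.0.0.5 3600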
+
+Creates/deletes a snapshot of a pool. ::
+
+ ceph osd pool mksnap {pool-name} {snap-name}
+ ceph osd pool rmsnap {pool-name} {snap-name}
+
+Creates/deletes/renames a storage pool. ::
+
+ ceph osd pool create {pool-name} pg_num [pgp_num]
+ ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+ ceph osd pool rename {old-name} {new-name}
+
+Changes a pool setting. ::
+
+ ceph osd pool set {pool-name} {field} {value}
+
+Valid fields are:
+
+ * ``size``: Sets the number of copies of data in the pool.
+ * ``pg_num``: The placement group number.
+ * ``pgp_num``: Effective number when calculating pg placement.
+ * ``crush_ruleset``: rule number for mapping placement.
+
+Get the value of a pool setting. ::
+
+ ceph osd pool get {pool-name} {field}
+
+Valid fields are:
+
+ * ``pg_num``: The placement group number.
+ * ``pgp_num``: Effective number of placement groups when calculating placement.
+ * ``lpg_num``: The number of local placement groups.
+ * ``lpgp_num``: The number used for placing the local placement groups.
+
+
+Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. ::
+
+ ceph osd scrub {osd-num}
+
+Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. ::
+
+ ceph osd repair N
+
+Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES``
+in write requests of ``BYTES_PER_WRITE`` each. By default, the test
+writes 1 GB in total in 4-MB increments.
+The benchmark is non-destructive and will not overwrite existing live
+OSD data, but might temporarily affect the performance of clients
+concurrently accessing the OSD. ::
+
+ ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
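+
+For example, to write 100 MB to ``osd.0`` in 1-MB requests::
+
+   ceph tell osd.0 bench 104857600 1048576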
+
+
+MDS Subsystem
+=============
+
+Change configuration parameters on a running mds. ::
+
+ ceph tell mds.{mds-id} injectargs --{switch} {value} [--{switch} {value}]
+
+Example::
+
+ ceph tell mds.0 injectargs --debug_ms 1 --debug_mds 10
+
+The above example enables debug messages.
+
+Display the status of all metadata servers. ::
+
+   ceph mds stat
+
+Mark the active MDS as failed, triggering failover to a standby if
+present. ::
+
+   ceph mds fail 0
+
+.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
+
+
+Mon Subsystem
+=============
+
+Show monitor stats::
+
+ ceph mon stat
+
+ e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
+
+
+The ``quorum`` list at the end lists monitor nodes that are part of the current quorum.
+
+This is also available more directly::
+
+ ceph quorum_status -f json-pretty
+
+.. code-block:: javascript
+
+ {
+ "election_epoch": 6,
+ "quorum": [
+ 0,
+ 1,
+ 2
+ ],
+ "quorum_names": [
+ "a",
+ "b",
+ "c"
+ ],
+ "quorum_leader_name": "a",
+ "monmap": {
+ "epoch": 2,
+ "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+ "modified": "2016-12-26 14:42:09.288066",
+ "created": "2016-12-26 14:42:03.573585",
+ "features": {
+ "persistent": [
+ "kraken"
+ ],
+ "optional": []
+ },
+ "mons": [
+ {
+ "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:40000\/0",
+ "public_addr": "127.0.0.1:40000\/0"
+ },
+ {
+ "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:40001\/0",
+ "public_addr": "127.0.0.1:40001\/0"
+ },
+ {
+ "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:40002\/0",
+ "public_addr": "127.0.0.1:40002\/0"
+ }
+ ]
+ }
+ }
+
+
+The above will block until a quorum is reached.
+
+For a status of just the monitor you connect to (use ``-m HOST:PORT``
+to select)::
+
+ ceph mon_status -f json-pretty
+
+
+.. code-block:: javascript
+
+ {
+ "name": "b",
+ "rank": 1,
+ "state": "peon",
+ "election_epoch": 6,
+ "quorum": [
+ 0,
+ 1,
+ 2
+ ],
+ "features": {
+ "required_con": "9025616074522624",
+ "required_mon": [
+ "kraken"
+ ],
+ "quorum_con": "1152921504336314367",
+ "quorum_mon": [
+ "kraken"
+ ]
+ },
+ "outside_quorum": [],
+ "extra_probe_peers": [],
+ "sync_provider": [],
+ "monmap": {
+ "epoch": 2,
+ "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
+ "modified": "2016-12-26 14:42:09.288066",
+ "created": "2016-12-26 14:42:03.573585",
+ "features": {
+ "persistent": [
+ "kraken"
+ ],
+ "optional": []
+ },
+ "mons": [
+ {
+ "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:40000\/0",
+ "public_addr": "127.0.0.1:40000\/0"
+ },
+ {
+ "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:40001\/0",
+ "public_addr": "127.0.0.1:40001\/0"
+ },
+ {
+ "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:40002\/0",
+ "public_addr": "127.0.0.1:40002\/0"
+ }
+ ]
+ }
+ }
+
+A dump of the monitor state::
+
+ ceph mon dump
+
+ dumped monmap epoch 2
+ epoch 2
+ fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
+ last_changed 2016-12-26 14:42:09.288066
+ created 2016-12-26 14:42:03.573585
+ 0: 127.0.0.1:40000/0 mon.a
+ 1: 127.0.0.1:40001/0 mon.b
+ 2: 127.0.0.1:40002/0 mon.c
+
diff --git a/src/ceph/doc/rados/operations/crush-map-edits.rst b/src/ceph/doc/rados/operations/crush-map-edits.rst
new file mode 100644
index 0000000..5222270
--- /dev/null
+++ b/src/ceph/doc/rados/operations/crush-map-edits.rst
@@ -0,0 +1,654 @@
+Manually editing a CRUSH Map
+============================
+
+.. note:: Manually editing the CRUSH map is considered an advanced
+ administrator operation. All CRUSH changes that are
+ necessary for the overwhelming majority of installations are
+ possible via the standard ceph CLI and do not require manual
+ CRUSH map edits. If you have identified a use case where
+ manual edits *are* necessary, consider contacting the Ceph
+ developers so that future versions of Ceph can make this
+ unnecessary.
+
+To edit an existing CRUSH map:
+
+#. `Get the CRUSH map`_.
+#. `Decompile`_ the CRUSH map.
+#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_.
+#. `Recompile`_ the CRUSH map.
+#. `Set the CRUSH map`_.
+
+To activate CRUSH map rules for a specific pool, identify the common ruleset
+number for those rules and specify that ruleset number for the pool. See `Set
+Pool Values`_ for details.
+
+.. _Get the CRUSH map: #getcrushmap
+.. _Decompile: #decompilecrushmap
+.. _Devices: #crushmapdevices
+.. _Buckets: #crushmapbuckets
+.. _Rules: #crushmaprules
+.. _Recompile: #compilecrushmap
+.. _Set the CRUSH map: #setcrushmap
+.. _Set Pool Values: ../pools#setpoolvalues
+
+.. _getcrushmap:
+
+Get a CRUSH Map
+---------------
+
+To get the CRUSH map for your cluster, execute the following::
+
+ ceph osd getcrushmap -o {compiled-crushmap-filename}
+
+Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since
+the CRUSH map is in a compiled form, you must decompile it first before you can
+edit it.
+
+.. _decompilecrushmap:
+
+Decompile a CRUSH Map
+---------------------
+
+To decompile a CRUSH map, execute the following::
+
+ crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
+
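+Putting the whole cycle together, a minimal sketch (filenames are arbitrary;
+compiling and setting the map are covered later in this document)::
+
+   ceph osd getcrushmap -o crushmap.bin
+   crushtool -d crushmap.bin -o crushmap.txt
+   # edit crushmap.txt with your preferred editor
+   crushtool -c crushmap.txt -o crushmap-new.bin
+   ceph osd setcrushmap -i crushmap-new.bin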
+
+Sections
+--------
+
+There are six main sections to a CRUSH Map.
+
+#. **tunables:** The preamble at the top of the map describes any *tunables*
+ for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These
+ correct for old bugs, optimizations, or other changes in behavior that have
+ been made over the years to improve CRUSH's behavior.
+
+#. **devices:** Devices are individual ``ceph-osd`` daemons that can
+ store data.
+
+#. **types**: Bucket ``types`` define the types of buckets used in
+ your CRUSH hierarchy. Buckets consist of a hierarchical aggregation
+ of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
+ their assigned weights.
+
+#. **buckets:** Once you define bucket types, you must define each node
+ in the hierarchy, its type, and which devices or other nodes it
+   contains.
+
+#. **rules:** Rules define policy about how data is distributed across
+ devices in the hierarchy.
+
+#. **choose_args:** Choose_args are alternative weights associated with
+ the hierarchy that have been adjusted to optimize data placement. A single
+ choose_args map can be used for the entire cluster, or one can be
+ created for each individual pool.
+
+
+.. _crushmapdevices:
+
+CRUSH Map Devices
+-----------------
+
+Devices are individual ``ceph-osd`` daemons that can store data. You
+will normally have one defined here for each OSD daemon in your
+cluster. Devices are identified by an id (a non-negative integer) and
+a name, normally ``osd.N`` where ``N`` is the device id.
+
+Devices may also have a *device class* associated with them (e.g.,
+``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
+crush rule.
+
+::
+
+ # devices
+ device {num} {osd.name} [class {class}]
+
+For example::
+
+ # devices
+ device 0 osd.0 class ssd
+ device 1 osd.1 class hdd
+ device 2 osd.2
+ device 3 osd.3
+
+In most cases, each device maps to a single ``ceph-osd`` daemon. This
+is normally a single storage device, a pair of devices (for example,
+one for data and one for a journal or metadata), or in some cases a
+small RAID device.
+
+
+
+
+
+CRUSH Map Bucket Types
+----------------------
+
+The second list in the CRUSH map defines 'bucket' types. Buckets facilitate
+a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent
+physical locations in a hierarchy. Nodes aggregate other nodes or leaves.
+Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage
+media.
+
+.. tip:: The term "bucket" used in the context of CRUSH means a node in
+ the hierarchy, i.e. a location or a piece of physical hardware. It
+ is a different concept from the term "bucket" when used in the
+ context of RADOS Gateway APIs.
+
+To add a bucket type to the CRUSH map, create a new line under your list of
+bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
+By convention, there is one leaf bucket and it is ``type 0``; however, you may
+give it any name you like (e.g., osd, disk, drive, storage, etc.)::
+
+ #types
+ type {num} {bucket-name}
+
+For example::
+
+ # types
+ type 0 osd
+ type 1 host
+ type 2 chassis
+ type 3 rack
+ type 4 row
+ type 5 pdu
+ type 6 pod
+ type 7 room
+ type 8 datacenter
+ type 9 region
+ type 10 root
+
+
+
+.. _crushmapbuckets:
+
+CRUSH Map Bucket Hierarchy
+--------------------------
+
+The CRUSH algorithm distributes data objects among storage devices according
+to a per-device weight value, approximating a uniform probability distribution.
+CRUSH distributes objects and their replicas according to the hierarchical
+cluster map you define. Your CRUSH map represents the available storage
+devices and the logical elements that contain them.
+
+To map placement groups to OSDs across failure domains, a CRUSH map defines a
+hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH
+map). The purpose of creating a bucket hierarchy is to segregate the
+leaf nodes by their failure domains, such as hosts, chassis, racks, power
+distribution units, pods, rows, rooms, and data centers. With the exception of
+the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and
+you may define it according to your own needs.
+
+We recommend adapting your CRUSH map to your firm's hardware naming conventions
+and using instance names that reflect the physical hardware. Your naming
+practice can make it easier to administer the cluster and troubleshoot
+problems when an OSD and/or other hardware malfunctions and the administrator
+needs access to the physical hardware.
+
+In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
+and two node buckets named ``host`` and ``rack`` respectively.
+
+.. ditaa::
+ +-----------+
+ | {o}rack |
+ | Bucket |
+ +-----+-----+
+ |
+ +---------------+---------------+
+ | |
+ +-----+-----+ +-----+-----+
+ | {o}host | | {o}host |
+ | Bucket | | Bucket |
+ +-----+-----+ +-----+-----+
+ | |
+ +-------+-------+ +-------+-------+
+ | | | |
+ +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
+ | osd | | osd | | osd | | osd |
+ | Bucket | | Bucket | | Bucket | | Bucket |
+ +-----------+ +-----------+ +-----------+ +-----------+
+
+.. note:: The higher numbered ``rack`` bucket type aggregates the lower
+ numbered ``host`` bucket type.
+
+Since leaf nodes reflect storage devices declared under the ``#devices`` list
+at the beginning of the CRUSH map, you do not need to declare them as bucket
+instances. The second lowest bucket type in your hierarchy usually aggregates
+the devices (i.e., it's usually the computer containing the storage media, and
+uses whatever term you prefer to describe it, such as "node", "computer",
+"server," "host", "machine", etc.). In high density environments, it is
+increasingly common to see multiple hosts/nodes per chassis. You should account
+for chassis failure too--e.g., the need to pull a chassis if a node fails may
+result in bringing down numerous hosts/nodes and their OSDs.
+
+When declaring a bucket instance, you must specify its type, give it a unique
+name (string), assign it a unique ID expressed as a negative integer (optional),
+specify a weight relative to the total capacity/capability of its item(s),
+specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``,
+reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items.
+The items may consist of node buckets or leaves. Items may have a weight that
+reflects the relative weight of the item.
+
+You may declare a node bucket with the following syntax::
+
+ [bucket-type] [bucket-name] {
+ id [a unique negative numeric ID]
+ weight [the relative capacity/capability of the item(s)]
+ alg [the bucket type: uniform | list | tree | straw ]
+ hash [the hash type: 0 by default]
+ item [item-name] weight [weight]
+ }
+
+For example, using the diagram above, we would define two host buckets
+and one rack bucket. The OSDs are declared as items within the host buckets::
+
+ host node1 {
+ id -1
+ alg straw
+ hash 0
+ item osd.0 weight 1.00
+ item osd.1 weight 1.00
+ }
+
+ host node2 {
+ id -2
+ alg straw
+ hash 0
+ item osd.2 weight 1.00
+ item osd.3 weight 1.00
+ }
+
+ rack rack1 {
+ id -3
+ alg straw
+ hash 0
+ item node1 weight 2.00
+ item node2 weight 2.00
+ }
+
+.. note:: In the foregoing example, note that the rack bucket does not contain
+ any OSDs. Rather it contains lower level host buckets, and includes the
+ sum total of their weight in the item entry.
+
+.. topic:: Bucket Types
+
+ Ceph supports four bucket types, each representing a tradeoff between
+ performance and reorganization efficiency. If you are unsure of which bucket
+ type to use, we recommend using a ``straw`` bucket. For a detailed
+ discussion of bucket types, refer to
+ `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
+ and more specifically to **Section 3.4**. The bucket types are:
+
+ #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same
+ weight. For example, when firms commission or decommission hardware, they
+ typically do so with many machines that have exactly the same physical
+ configuration (e.g., bulk purchases). When storage devices have exactly
+ the same weight, you may use the ``uniform`` bucket type, which allows
+ CRUSH to map replicas into uniform buckets in constant time. With
+ non-uniform weights, you should use another bucket algorithm.
+
+ #. **List**: List buckets aggregate their content as linked lists. Based on
+ the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm,
+ a list is a natural and intuitive choice for an **expanding cluster**:
+ either an object is relocated to the newest device with some appropriate
+ probability, or it remains on the older devices as before. The result is
+ optimal data migration when items are added to the bucket. Items removed
+ from the middle or tail of the list, however, can result in a significant
+ amount of unnecessary movement, making list buckets most suitable for
+ circumstances in which they **never (or very rarely) shrink**.
+
+ #. **Tree**: Tree buckets use a binary search tree. They are more efficient
+ than list buckets when a bucket contains a larger set of items. Based on
+ the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm,
+ tree buckets reduce the placement time to O(log :sub:`n`), making them
+ suitable for managing much larger sets of devices or nested buckets.
+
+ #. **Straw:** List and Tree buckets use a divide and conquer strategy
+ in a way that either gives certain items precedence (e.g., those
+ at the beginning of a list) or obviates the need to consider entire
+ subtrees of items at all. That improves the performance of the replica
+ placement process, but can also introduce suboptimal reorganization
+      behavior when the contents of a bucket change due to an addition, removal,
+ or re-weighting of an item. The straw bucket type allows all items to
+ fairly “compete” against each other for replica placement through a
+ process analogous to a draw of straws.
+
+.. topic:: Hash
+
+ Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``.
+ Enter ``0`` as your hash setting to select ``rjenkins1``.
+
+
+.. _weightingbucketitems:
+
+.. topic:: Weighting Bucket Items
+
+ Ceph expresses bucket weights as doubles, which allows for fine
+ weighting. A weight is the relative difference between device capacities. We
+ recommend using ``1.00`` as the relative weight for a 1TB storage device.
+ In such a scenario, a weight of ``0.5`` would represent approximately 500GB,
+ and a weight of ``3.00`` would represent approximately 3TB. Higher level
+ buckets have a weight that is the sum total of the leaf items aggregated by
+ the bucket.
+
+ A bucket item weight is one dimensional, but you may also calculate your
+ item weights to reflect the performance of the storage drive. For example,
+ if you have many 1TB drives where some have relatively low data transfer
+ rate and the others have a relatively high data transfer rate, you may
+ weight them differently, even though they have the same capacity (e.g.,
+ a weight of 0.80 for the first set of drives with lower total throughput,
+ and 1.20 for the second set of drives with higher total throughput).
+
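+As a sketch of how weights appear in a bucket declaration (the bucket
+name, id, and device names below are hypothetical), a host holding one
+1TB drive and one 3TB drive would carry an aggregate weight of ``4.00``::
+
+    # hypothetical host bucket; the weights reflect 1TB and 3TB drives
+    host node1 {
+        id -20
+        alg straw
+        hash 0
+        item osd.10 weight 1.00
+        item osd.11 weight 3.00
+    }
+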
+
+.. _crushmaprules:
+
+CRUSH Map Rules
+---------------
+
+CRUSH maps support the notion of 'CRUSH rules', which determine data
+placement for a pool. For large clusters, you will likely create many pools,
+each of which may have its own CRUSH ruleset and rules. The default CRUSH
+map has a rule for each pool, and one ruleset assigned to each of the
+default pools.
+
+.. note:: In most cases, you will not need to modify the default rules. When
+ you create a new pool, its default ruleset is ``0``.
+
+
+CRUSH rules define placement and replication strategies or distribution policies
+that allow you to specify exactly how CRUSH places object replicas. For
+example, you might create a rule selecting a pair of targets for 2-way
+mirroring, another rule for selecting three targets in two different data
+centers for 3-way mirroring, and yet another rule for erasure coding over six
+storage devices. For a detailed discussion of CRUSH rules, refer to
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_,
+and more specifically to **Section 3.2**.
+
+A rule takes the following form::
+
+ rule <rulename> {
+
+ ruleset <ruleset>
+ type [ replicated | erasure ]
+ min_size <min-size>
+ max_size <max-size>
+ step take <bucket-name> [class <device-class>]
+ step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
+ step emit
+ }
+
+
+``ruleset``
+
+:Description: A means of classifying a rule as belonging to a set of rules.
+ Activated by `setting the ruleset in a pool`_.
+
+:Purpose: A component of the rule mask.
+:Type: Integer
+:Required: Yes
+:Default: 0
+
+.. _setting the ruleset in a pool: ../pools#setpoolvalues
+
+
+``type``
+
+:Description: Describes whether the rule is for a replicated pool or an
+              erasure-coded pool (somewhat analogous to RAID).
+
+:Purpose: A component of the rule mask.
+:Type: String
+:Required: Yes
+:Default: ``replicated``
+:Valid Values: Currently only ``replicated`` and ``erasure``
+
+``min_size``
+
+:Description: If a pool makes fewer replicas than this number, CRUSH will
+ **NOT** select this rule.
+
+:Type: Integer
+:Purpose: A component of the rule mask.
+:Required: Yes
+:Default: ``1``
+
+``max_size``
+
+:Description: If a pool makes more replicas than this number, CRUSH will
+ **NOT** select this rule.
+
+:Type: Integer
+:Purpose: A component of the rule mask.
+:Required: Yes
+:Default: 10
+
+
+``step take <bucket-name> [class <device-class>]``
+
+:Description: Takes a bucket name, and begins iterating down the tree.
+ If the ``device-class`` is specified, it must match
+ a class previously used when defining a device. All
+ devices that do not belong to the class are excluded.
+:Purpose: A component of the rule.
+:Required: Yes
+:Example: ``step take data``
+
+
+``step choose firstn {num} type {bucket-type}``
+
+:Description: Selects the number of buckets of the given type. The number is
+ usually the number of replicas in the pool (i.e., pool size).
+
+ - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
+ - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
+              - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|``.
+
+:Purpose: A component of the rule.
+:Prerequisite: Follows ``step take`` or ``step choose``.
+:Example: ``step choose firstn 1 type row``
+
+
+``step chooseleaf firstn {num} type {bucket-type}``
+
+:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf
+ node from the subtree of each bucket in the set of buckets. The
+ number of buckets in the set is usually the number of replicas in
+ the pool (i.e., pool size).
+
+ - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available).
+ - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
+              - If ``{num} < 0``, it means ``pool-num-replicas - |{num}|``.
+
+:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
+:Prerequisite: Follows ``step take`` or ``step choose``.
+:Example: ``step chooseleaf firstn 0 type row``
+
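+For example, with a pool size of 3, ``firstn 0`` selects three buckets,
+``firstn 2`` selects two, and ``firstn -1`` selects ``3 - 1 = 2``
+buckets (the ``ssd-primary`` rule shown later in this section uses
+``firstn -1`` in exactly this way to place the remaining replicas).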
+
+
+``step emit``
+
+:Description: Outputs the current value and empties the stack. Typically used
+ at the end of a rule, but may also be used to pick from different
+ trees in the same rule.
+
+:Purpose: A component of the rule.
+:Prerequisite: Follows ``step choose``.
+:Example: ``step emit``
+
+.. important:: To activate one or more rules with a common ruleset number for
+   a pool, set the ruleset number of the pool.
+
+
+Placing Different Pools on Different OSDs
+=========================================
+
+Suppose you want to have most pools default to OSDs backed by large hard drives,
+but have some pools mapped to OSDs backed by fast solid-state drives (SSDs).
+It's possible to have multiple independent CRUSH hierarchies within the same
+CRUSH map. Define two hierarchies with two different root nodes--one for hard
+disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown
+below::
+
+ device 0 osd.0
+ device 1 osd.1
+ device 2 osd.2
+ device 3 osd.3
+ device 4 osd.4
+ device 5 osd.5
+ device 6 osd.6
+ device 7 osd.7
+
+ host ceph-osd-ssd-server-1 {
+ id -1
+ alg straw
+ hash 0
+ item osd.0 weight 1.00
+ item osd.1 weight 1.00
+ }
+
+ host ceph-osd-ssd-server-2 {
+ id -2
+ alg straw
+ hash 0
+ item osd.2 weight 1.00
+ item osd.3 weight 1.00
+ }
+
+ host ceph-osd-platter-server-1 {
+ id -3
+ alg straw
+ hash 0
+ item osd.4 weight 1.00
+ item osd.5 weight 1.00
+ }
+
+ host ceph-osd-platter-server-2 {
+ id -4
+ alg straw
+ hash 0
+ item osd.6 weight 1.00
+ item osd.7 weight 1.00
+ }
+
+ root platter {
+ id -5
+ alg straw
+ hash 0
+ item ceph-osd-platter-server-1 weight 2.00
+ item ceph-osd-platter-server-2 weight 2.00
+ }
+
+ root ssd {
+ id -6
+ alg straw
+ hash 0
+ item ceph-osd-ssd-server-1 weight 2.00
+ item ceph-osd-ssd-server-2 weight 2.00
+ }
+
+ rule data {
+ ruleset 0
+ type replicated
+ min_size 2
+ max_size 2
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule metadata {
+ ruleset 1
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule rbd {
+ ruleset 2
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule platter {
+ ruleset 3
+ type replicated
+ min_size 0
+ max_size 10
+ step take platter
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule ssd {
+ ruleset 4
+ type replicated
+ min_size 0
+ max_size 4
+ step take ssd
+ step chooseleaf firstn 0 type host
+ step emit
+ }
+
+ rule ssd-primary {
+ ruleset 5
+ type replicated
+ min_size 5
+ max_size 10
+ step take ssd
+ step chooseleaf firstn 1 type host
+ step emit
+ step take platter
+ step chooseleaf firstn -1 type host
+ step emit
+ }
+
+You can then set a pool to use the SSD rule with::
+
+ ceph osd pool set <poolname> crush_ruleset 4
+
+Similarly, using the ``ssd-primary`` rule will cause each placement group in the
+pool to be placed with an SSD as the primary and platters as the replicas.
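+
+Assuming the ``crush_ruleset`` pool key used above is available in your
+release, you can verify the assignment with::
+
+    ceph osd pool get <poolname> crush_ruleset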
+
+
+Tuning CRUSH, the hard way
+--------------------------
+
+If you can ensure that all clients are running recent code, you can
+adjust the tunables by extracting the CRUSH map, modifying the values,
+and reinjecting it into the cluster.
+
+* Extract the latest CRUSH map::
+
+ ceph osd getcrushmap -o /tmp/crush
+
+* Adjust tunables. These values appear to offer the best behavior
+ for both large and small clusters we tested with. You will need to
+ additionally specify the ``--enable-unsafe-tunables`` argument to
+ ``crushtool`` for this to work. Please use this option with
+ extreme care.::
+
+ crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
+
+* Reinject modified map::
+
+ ceph osd setcrushmap -i /tmp/crush.new
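+
+At any point you can decompile a map to plain text for inspection with
+``crushtool``::
+
+    crushtool -d /tmp/crush.new -o /tmp/crush.new.txt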
+
+Legacy values
+-------------
+
+For reference, the legacy values for the CRUSH tunables can be set
+with::
+
+ crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
+
+Again, the special ``--enable-unsafe-tunables`` option is required.
+Further, as noted above, be careful running old versions of the
+``ceph-osd`` daemon after reverting to legacy values as the feature
+bit is not perfectly enforced.
diff --git a/src/ceph/doc/rados/operations/crush-map.rst b/src/ceph/doc/rados/operations/crush-map.rst
new file mode 100644
index 0000000..05fa4ff
--- /dev/null
+++ b/src/ceph/doc/rados/operations/crush-map.rst
@@ -0,0 +1,956 @@
+============
+ CRUSH Maps
+============
+
+The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
+determines how to store and retrieve data by computing data storage locations.
+CRUSH empowers Ceph clients to communicate with OSDs directly rather than
+through a centralized server or broker. With an algorithmically determined
+method of storing and retrieving data, Ceph avoids a single point of failure, a
+performance bottleneck, and a physical limit to its scalability.
+
+CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
+store and retrieve data in OSDs with a uniform distribution of data across the
+cluster. For a detailed discussion of CRUSH, see
+`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_
+
+CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
+'buckets' for aggregating the devices into physical locations, and a list of
+rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By
+reflecting the underlying physical organization of the installation, CRUSH can
+model—and thereby address—potential sources of correlated device failures.
+Typical sources include physical proximity, a shared power source, and a shared
+network. By encoding this information into the cluster map, CRUSH placement
+policies can separate object replicas across different failure domains while
+still maintaining the desired distribution. For example, to address the
+possibility of concurrent failures, it may be desirable to ensure that data
+replicas are on devices using different shelves, racks, power supplies,
+controllers, and/or physical locations.
+
+When you deploy OSDs they are automatically placed within the CRUSH map under a
+``host`` node named with the hostname for the host they are running on. This,
+combined with the default CRUSH failure domain, ensures that replicas or erasure
+code shards are separated across hosts and a single host failure will not
+affect availability. For larger clusters, however, administrators should
+carefully consider their choice of failure domain. Separating replicas
+across racks, for example, is common for mid- to large-sized clusters.
+
+
+CRUSH Location
+==============
+
+The location of an OSD in terms of the CRUSH map's hierarchy is
+referred to as a ``crush location``. This location specifier takes the
+form of a list of key and value pairs describing a position. For
+example, if an OSD is in a particular row, rack, chassis and host, and
+is part of the 'default' CRUSH tree (this is the case for the vast
+majority of clusters), its crush location could be described as::
+
+ root=default row=a rack=a2 chassis=a2a host=a2a1
+
+Note:
+
+#. The order of the keys does not matter.
+#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
+ these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
+ but those types can be customized to be anything appropriate by modifying
+ the CRUSH map.
+#. Not all keys need to be specified. For example, by default, Ceph
+ automatically sets a ``ceph-osd`` daemon's location to be
+ ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).
+
+The crush location for an OSD is normally expressed via the ``crush location``
+config option being set in the ``ceph.conf`` file. Each time the OSD starts,
+it verifies it is in the correct location in the CRUSH map and, if it is not,
+it moves itself. To disable this automatic CRUSH map management, add the
+following to your configuration file in the ``[osd]`` section::
+
+ osd crush update on start = false
+
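+As an illustration, a minimal (hypothetical) entry that pins a daemon's
+location to a specific rack might look like::
+
+    [osd.0]
+    crush location = root=default rack=a2 host=a2a1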
+
+Custom location hooks
+---------------------
+
+A customized location hook can be used to generate a more complete
+crush location on startup. The sample ``ceph-crush-location`` utility
+will generate a CRUSH location string for a given daemon. The
+location is based on, in order of preference:
+
+#. A ``crush location`` option in ceph.conf.
+#. A default of ``root=default host=HOSTNAME`` where the hostname is
+ generated with the ``hostname -s`` command.
+
+This is not useful by itself, as the OSD has the exact same
+behavior. However, the script can be modified to provide additional
+location fields (for example, the rack or datacenter), and then the
+hook enabled via the config option::
+
+ crush location hook = /path/to/customized-ceph-crush-location
+
+This hook is passed several arguments (below) and should output a single line
+to stdout with the CRUSH location description.::
+
+ $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE
+
+where the cluster name is typically 'ceph', the id is the daemon
+identifier (the OSD number), and the daemon type is typically ``osd``.
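+
+A minimal custom hook might look like the following sketch, where the
+rack name would normally be derived from your site inventory (it is
+hard-coded here purely for illustration)::
+
+    #!/bin/sh
+    # Hypothetical CRUSH location hook: emit a single location line.
+    # Derive the rack value from your own inventory system in practice.
+    echo "root=default rack=rack1 host=$(hostname -s)"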
+
+
+CRUSH structure
+===============
+
+The CRUSH map consists of, loosely speaking, a hierarchy describing
+the physical topology of the cluster, and a set of rules defining
+policy about how we place data on those devices. The hierarchy has
+devices (``ceph-osd`` daemons) at the leaves, and internal nodes
+corresponding to other physical features or groupings: hosts, racks,
+rows, datacenters, and so on. The rules describe how replicas are
+placed in terms of that hierarchy (e.g., 'three replicas in different
+racks').
+
+Devices
+-------
+
+Devices are individual ``ceph-osd`` daemons that can store data. You
+will normally have one defined here for each OSD daemon in your
+cluster. Devices are identified by an id (a non-negative integer) and
+a name, normally ``osd.N`` where ``N`` is the device id.
+
+Devices may also have a *device class* associated with them (e.g.,
+``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
+crush rule.
+
+Types and Buckets
+-----------------
+
+A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
+racks, rows, etc. The CRUSH map defines a series of *types* that are
+used to describe these nodes. By default, these types include:
+
+- osd (or device)
+- host
+- chassis
+- rack
+- row
+- pdu
+- pod
+- room
+- datacenter
+- region
+- root
+
+Most clusters make use of only a handful of these types, and others
+can be defined as needed.
+
+The hierarchy is built with devices (normally type ``osd``) at the
+leaves, interior nodes with non-device types, and a root node of type
+``root``. For example,
+
+.. ditaa::
+
+ +-----------------+
+ | {o}root default |
+ +--------+--------+
+ |
+ +---------------+---------------+
+ | |
+ +-------+-------+ +-----+-------+
+ | {o}host foo | | {o}host bar |
+ +-------+-------+ +-----+-------+
+ | |
+ +-------+-------+ +-------+-------+
+ | | | |
+ +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+
+ | osd.0 | | osd.1 | | osd.2 | | osd.3 |
+ +-----------+ +-----------+ +-----------+ +-----------+
+
+Each node (device or bucket) in the hierarchy has a *weight*
+associated with it, indicating the relative proportion of the total
+data that device or hierarchy subtree should store. Weights are set
+at the leaves, indicating the size of the device, and automatically
+sum up the tree from there, such that the weight of the default node
+will be the total of all devices contained beneath it. Normally
+weights are in units of terabytes (TB).
+
+You can get a simple view of the CRUSH hierarchy for your cluster,
+including the weights, with::
+
+ ceph osd crush tree
+
+Rules
+-----
+
+Rules define policy about how data is distributed across the devices
+in the hierarchy.
+
+CRUSH rules define placement and replication strategies or
+distribution policies that allow you to specify exactly how CRUSH
+places object replicas. For example, you might create a rule selecting
+a pair of targets for 2-way mirroring, another rule for selecting
+three targets in two different data centers for 3-way mirroring, and
+yet another rule for erasure coding over six storage devices. For a
+detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
+Scalable, Decentralized Placement of Replicated Data`_, and more
+specifically to **Section 3.2**.
+
+In almost all cases, CRUSH rules can be created via the CLI by
+specifying the *pool type* they will be used for (replicated or
+erasure coded), the *failure domain*, and optionally a *device class*.
+In rare cases rules must be written by hand by manually editing the
+CRUSH map.
+
+You can see what rules are defined for your cluster with::
+
+ ceph osd crush rule ls
+
+You can view the contents of the rules with::
+
+ ceph osd crush rule dump
+
+Device classes
+--------------
+
+Each device can optionally have a *class* associated with it. By
+default, OSDs automatically set their class on startup to either
+``hdd``, ``ssd``, or ``nvme`` based on the type of device they are backed
+by.
+
+The device class for one or more OSDs can be explicitly set with::
+
+ ceph osd crush set-device-class <class> <osd-name> [...]
+
+Once a device class is set, it cannot be changed to another class
+until the old class is unset with::
+
+ ceph osd crush rm-device-class <osd-name> [...]
+
+This allows administrators to set device classes without the class
+being changed on OSD restart or by some other script.
+
+A placement rule that targets a specific device class can be created with::
+
+ ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
+
+A pool can then be changed to use the new rule with::
+
+ ceph osd pool set <pool-name> crush_rule <rule-name>
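+
+For example, assuming a hypothetical pool named ``fastpool``, the two
+steps together look like::
+
+    ceph osd crush rule create-replicated fast-ssd default host ssd
+    ceph osd pool set fastpool crush_rule fast-ssd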
+
+Device classes are implemented by creating a "shadow" CRUSH hierarchy
+for each device class in use that contains only devices of that class.
+Rules can then distribute data over the shadow hierarchy. One nice
+thing about this approach is that it is fully backward compatible with
+old Ceph clients. You can view the CRUSH hierarchy with shadow items
+with::
+
+ ceph osd crush tree --show-shadow
+
+
+Weight sets
+------------
+
+A *weight set* is an alternative set of weights to use when
+calculating data placement. The normal weights associated with each
+device in the CRUSH map are set based on the device size and indicate
+how much data we *should* be storing where. However, because CRUSH is
+based on a pseudorandom placement process, there is always some
+variation from this ideal distribution, the same way that rolling a
+dice sixty times will not result in rolling exactly 10 ones and 10
+sixes. Weight sets allow the cluster to do a numerical optimization
+based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
+a balanced distribution.
+
+There are two types of weight sets supported:
+
+ #. A **compat** weight set is a single alternative set of weights for
+ each device and node in the cluster. This is not well-suited for
+ correcting for all anomalies (for example, placement groups for
+ different pools may be different sizes and have different load
+ levels, but will be mostly treated the same by the balancer).
+ However, compat weight sets have the huge advantage that they are
+ *backward compatible* with previous versions of Ceph, which means
+ that even though weight sets were first introduced in Luminous
+ v12.2.z, older clients (e.g., firefly) can still connect to the
+ cluster when a compat weight set is being used to balance data.
+ #. A **per-pool** weight set is more flexible in that it allows
+ placement to be optimized for each data pool. Additionally,
+ weights can be adjusted for each position of placement, allowing
+    the optimizer to correct for a subtle skew of data toward devices
+    with small weights relative to their peers (an effect that is
+    usually only apparent in very large clusters but which can cause
+ balancing problems).
+
+When weight sets are in use, the weights associated with each node in
+the hierarchy are visible as a separate column (labeled either
+``(compat)`` or the pool name) in the output of the command::
+
+ ceph osd crush tree
+
+When both *compat* and *per-pool* weight sets are in use, data
+placement for a particular pool will use its own per-pool weight set
+if present. If not, it will use the compat weight set if present. If
+neither is present, it will use the normal CRUSH weights.
+
+Although weight sets can be set up and manipulated by hand, it is
+recommended that the *balancer* module be enabled to do so
+automatically.
+
+
+Modifying the CRUSH map
+=======================
+
+.. _addosd:
+
+Add/Move an OSD
+---------------
+
+.. note:: OSDs are normally automatically added to the CRUSH map when
+   the OSD is created. This command is rarely needed.
+
+To add or move an OSD in the CRUSH map of a running cluster::
+
+ ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+``weight``
+
+:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
+:Type: Double
+:Required: Yes
+:Example: ``2.0``
+
+
+``root``
+
+:Description: The root node of the tree in which the OSD resides (normally ``default``).
+:Type: Key/value pair.
+:Required: Yes
+:Example: ``root=default``
+
+
+``bucket-type``
+
+:Description: You may specify the OSD's location in the CRUSH hierarchy.
+:Type: Key/value pairs.
+:Required: No
+:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
+
+
+The following example adds ``osd.0`` to the hierarchy, or moves the
+OSD from a previous location. ::
+
+ ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
+
+
+Adjust OSD weight
+-----------------
+
+.. note:: Normally OSDs automatically add themselves to the CRUSH map
+   with the correct weight when they are created. This command
+   is rarely needed.
+
+To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
+the following::
+
+ ceph osd crush reweight {name} {weight}
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+``weight``
+
+:Description: The CRUSH weight for the OSD.
+:Type: Double
+:Required: Yes
+:Example: ``2.0``
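+
+For example, to set the weight of ``osd.0`` to ``2.0`` (appropriate for
+a hypothetical 2TB device)::
+
+    ceph osd crush reweight osd.0 2.0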
+
+
+.. _removeosd:
+
+Remove an OSD
+-------------
+
+.. note:: OSDs are normally removed from the CRUSH map as part of the
+   ``ceph osd purge`` command. This command is rarely needed.
+
+To remove an OSD from the CRUSH map of a running cluster, execute the
+following::
+
+ ceph osd crush remove {name}
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+Add a Bucket
+------------
+
+.. note:: Buckets are normally implicitly created when an OSD is added
+   that specifies a ``{bucket-type}={bucket-name}`` as part of its
+   location and a bucket with that name does not already exist. This
+   command is typically used when manually adjusting the structure of the
+   hierarchy after OSDs have been created (for example, to move a
+   series of hosts underneath a new rack-level bucket).
+
+To add a bucket in the CRUSH map of a running cluster, execute the
+``ceph osd crush add-bucket`` command::
+
+ ceph osd crush add-bucket {bucket-name} {bucket-type}
+
+Where:
+
+``bucket-name``
+
+:Description: The full name of the bucket.
+:Type: String
+:Required: Yes
+:Example: ``rack12``
+
+
+``bucket-type``
+
+:Description: The type of the bucket. The type must already exist in the hierarchy.
+:Type: String
+:Required: Yes
+:Example: ``rack``
+
+
+The following example adds the ``rack12`` bucket to the hierarchy::
+
+ ceph osd crush add-bucket rack12 rack
+
+Move a Bucket
+-------------
+
+To move a bucket to a different location or position in the CRUSH map
+hierarchy, execute the following::
+
+    ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]
+
+Where:
+
+``bucket-name``
+
+:Description: The name of the bucket to move/reposition.
+:Type: String
+:Required: Yes
+:Example: ``foo-bar-1``
+
+``bucket-type``
+
+:Description: You may specify the bucket's location in the CRUSH hierarchy.
+:Type: Key/value pairs.
+:Required: No
+:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``
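+
+The following example moves the ``foo-bar-1`` bucket to the location
+shown above::
+
+    ceph osd crush move foo-bar-1 datacenter=dc1 room=room1 row=foo rack=bar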
+
+Remove a Bucket
+---------------
+
+To remove a bucket from the CRUSH map hierarchy, execute the following::
+
+ ceph osd crush remove {bucket-name}
+
+.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.
+
+Where:
+
+``bucket-name``
+
+:Description: The name of the bucket that you'd like to remove.
+:Type: String
+:Required: Yes
+:Example: ``rack12``
+
+The following example removes the ``rack12`` bucket from the hierarchy::
+
+ ceph osd crush remove rack12
+
+Creating a compat weight set
+----------------------------
+
+.. note:: This step is normally done automatically by the ``balancer``
+   module when enabled.
+
+To create a *compat* weight set::
+
+ ceph osd crush weight-set create-compat
+
+Weights for the compat weight set can be adjusted with::
+
+ ceph osd crush weight-set reweight-compat {name} {weight}
+
+The compat weight set can be destroyed with::
+
+ ceph osd crush weight-set rm-compat
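+
+For example, to nudge slightly less data onto a hypothetical ``osd.0``::
+
+    ceph osd crush weight-set reweight-compat osd.0 0.9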
+
+Creating per-pool weight sets
+-----------------------------
+
+To create a weight set for a specific pool::
+
+ ceph osd crush weight-set create {pool-name} {mode}
+
+.. note:: Per-pool weight sets require that all servers and daemons
+ run Luminous v12.2.z or later.
+
+Where:
+
+``pool-name``
+
+:Description: The name of a RADOS pool
+:Type: String
+:Required: Yes
+:Example: ``rbd``
+
+``mode``
+
+:Description: Either ``flat`` or ``positional``. A *flat* weight set
+ has a single weight for each device or bucket. A
+ *positional* weight set has a potentially different
+ weight for each position in the resulting placement
+ mapping. For example, if a pool has a replica count of
+ 3, then a positional weight set will have three weights
+ for each device and bucket.
+:Type: String
+:Required: Yes
+:Example: ``flat``
+
+To adjust the weight of an item in a weight set::
+
+ ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
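+
+For a ``positional`` weight set on a pool with three replicas, one
+weight is supplied per position (the values here are hypothetical)::
+
+    ceph osd crush weight-set reweight rbd osd.0 0.9 1.0 1.0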
+
+To list existing weight sets::
+
+ ceph osd crush weight-set ls
+
+To remove a weight set::
+
+ ceph osd crush weight-set rm {pool-name}
+
+Creating a rule for a replicated pool
+-------------------------------------
+
+For a replicated pool, the primary decision when creating the CRUSH
+rule is what the failure domain is going to be. For example, if a
+failure domain of ``host`` is selected, then CRUSH will ensure that
+each replica of the data is stored on a different host. If ``rack``
+is selected, then each replica will be stored in a different rack.
+What failure domain you choose primarily depends on the size of your
+cluster and how your hierarchy is structured.
+
+Normally, the entire cluster hierarchy is nested beneath a root node
+named ``default``. If you have customized your hierarchy, you may
+want to create a rule nested at some other node in the hierarchy. It
+doesn't matter what type is associated with that node (it doesn't have
+to be a ``root`` node).
+
+It is also possible to create a rule that restricts data placement to
+a specific *class* of device. By default, Ceph OSDs automatically
+classify themselves as either ``hdd`` or ``ssd``, depending on the
+underlying type of device being used. These classes can also be
+customized.
+
+To create a replicated rule::
+
+ ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
+
+Where:
+
+``name``
+
+:Description: The name of the rule
+:Type: String
+:Required: Yes
+:Example: ``rbd-rule``
+
+``root``
+
+:Description: The name of the node under which data should be placed.
+:Type: String
+:Required: Yes
+:Example: ``default``
+
+``failure-domain-type``
+
+:Description: The type of CRUSH nodes across which we should separate replicas.
+:Type: String
+:Required: Yes
+:Example: ``rack``
+
+``class``
+
+:Description: The device class data should be placed on.
+:Type: String
+:Required: No
+:Example: ``ssd``
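+
+Putting these together, a rule (with a hypothetical name) that
+separates replicas across racks and restricts placement to SSDs::
+
+    ceph osd crush rule create-replicated rbd-rule default rack ssd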
+
+Creating a rule for an erasure coded pool
+-----------------------------------------
+
+For an erasure-coded pool, the same basic decisions need to be made as
+with a replicated pool: what is the failure domain, what node in the
+hierarchy will data be placed under (usually ``default``), and will
+placement be restricted to a specific device class. Erasure code
+pools are created a bit differently, however, because they need to be
+constructed carefully based on the erasure code being used. For this reason,
+you must include this information in the *erasure code profile*. A CRUSH
+rule will then be created from that either explicitly or automatically when
+the profile is used to create a pool.
+
+The erasure code profiles can be listed with::
+
+ ceph osd erasure-code-profile ls
+
+An existing profile can be viewed with::
+
+ ceph osd erasure-code-profile get {profile-name}
+
+Normally profiles should never be modified; instead, a new profile
+should be created and used when creating a new pool or creating a new
+rule for an existing pool.
+
+An erasure code profile consists of a set of key=value pairs. Most of
+these control the behavior of the erasure code that is encoding data
+in the pool. Those that begin with ``crush-``, however, affect the
+CRUSH rule that is created.
+
+The erasure code profile properties of interest are:
+
+ * **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
+ * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
+ * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
+ * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
+
+Once a profile is defined, you can create a CRUSH rule with::
+
+ ceph osd crush rule create-erasure {name} {profile-name}
+
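+For example, a sketch that defines a hypothetical 4+2 profile and a
+rule built from it::
+
+    ceph osd erasure-code-profile set myprofile \
+        k=4 m=2 crush-failure-domain=rack
+    ceph osd crush rule create-erasure ecrule myprofile
+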
+.. note:: When creating a new pool, it is not actually necessary to
+   explicitly create the rule. If the erasure code profile alone is
+   specified and the rule argument is left off then Ceph will create
+   the CRUSH rule automatically.
+
+Deleting rules
+--------------
+
+Rules that are not in use by pools can be deleted with::
+
+ ceph osd crush rule rm {rule-name}
+
+
+Tunables
+========
+
+Over time, we have made (and continue to make) improvements to the
+CRUSH algorithm used to calculate the placement of data. In order to
+support the change in behavior, we have introduced a series of tunable
+options that control whether the legacy or improved variation of the
+algorithm is used.
+
+In order to use newer tunables, both clients and servers must support
+the new version of CRUSH. For this reason, we have created
+``profiles`` that are named after the Ceph version in which they were
+introduced. For example, the ``firefly`` tunables are first supported
+in the firefly release, and will not work with older (e.g., dumpling)
+clients. Once a given set of tunables is changed from the legacy
+default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent older
+clients who do not support the new CRUSH features from connecting to
+the cluster.
+
+argonaut (legacy)
+-----------------
+
+The legacy CRUSH behavior used by argonaut and older releases works
+fine for most clusters, provided there are not too many OSDs that have
+been marked out.
+
+bobtail (CRUSH_TUNABLES2)
+-------------------------
+
+The bobtail tunable profile fixes a few key misbehaviors:
+
+ * For hierarchies with a small number of devices in the leaf buckets,
+ some PGs map to fewer than the desired number of replicas. This
+ commonly happens for hierarchies with "host" nodes with a small
+ number (1-3) of OSDs nested beneath each one.
+
+ * For large clusters, some small percentages of PGs map to fewer than
+   the desired number of OSDs. This is more prevalent when there are
+ several layers of the hierarchy (e.g., row, rack, host, osd).
+
+ * When some OSDs are marked out, the data tends to get redistributed
+ to nearby OSDs instead of across the entire hierarchy.
+
+The new tunables are:
+
+ * ``choose_local_tries``: Number of local retries. Legacy value is
+ 2, optimal value is 0.
+
+ * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
+ is 0.
+
+ * ``choose_total_tries``: Total number of attempts to choose an item.
+ Legacy value was 19, subsequent testing indicates that a value of
+ 50 is more appropriate for typical clusters. For extremely large
+ clusters, a larger value might be necessary.
+
+ * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
+ will retry, or only try once and allow the original placement to
+ retry. Legacy default is 0, optimal value is 1.
+
+Migration impact:
+
+ * Moving from argonaut to bobtail tunables triggers a moderate amount
+ of data movement. Use caution on a cluster that is already
+ populated with data.
+
+firefly (CRUSH_TUNABLES3)
+-------------------------
+
+The firefly tunable profile fixes a problem
+with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
+mappings with too few results when too many OSDs have been marked out.
+
+The new tunable is:
+
+ * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
+ start with a non-zero value of r, based on how many attempts the
+ parent has already made. Legacy default is 0, but with this value
+ CRUSH is sometimes unable to find a mapping. The optimal value (in
+ terms of computational cost and correctness) is 1.
+
+Migration impact:
+
+ * For existing clusters that have lots of existing data, changing
+ from 0 to 1 will cause a lot of data to move; a value of 4 or 5
+ will allow CRUSH to find a valid mapping but will make less data
+ move.
+
+straw_calc_version tunable (introduced with Firefly too)
+--------------------------------------------------------
+
+There were some problems with the internal weights calculated and
+stored in the CRUSH map for ``straw`` buckets. Specifically, when
+there were items with a CRUSH weight of 0, or a mix of different and
+duplicated weights, CRUSH would distribute data incorrectly (i.e.,
+not in proportion to the weights).
+
+The new tunable is:
+
+ * ``straw_calc_version``: A value of 0 preserves the old, broken
+ internal weight calculation; a value of 1 fixes the behavior.
+
+Migration impact:
+
+ * Moving to straw_calc_version 1 and then adjusting a straw bucket
+ (by adding, removing, or reweighting an item, or by using the
+ reweight-all command) can trigger a small to moderate amount of
+ data movement *if* the cluster has hit one of the problematic
+ conditions.
+
+This tunable option is special because it has absolutely no impact on
+the required kernel version on the client side.
+
+hammer (CRUSH_V4)
+-----------------
+
+Simply switching to the hammer tunable profile does not affect the
+mapping of existing CRUSH maps. However:
+
+ * There is a new bucket type (``straw2``) supported. The new
+ ``straw2`` bucket type fixes several limitations in the original
+ ``straw`` bucket. Specifically, the old ``straw`` buckets would
+ change some mappings that should have changed when a weight was
+ adjusted, while ``straw2`` achieves the original goal of only
+ changing mappings to or from the bucket item whose weight has
+ changed.
+
+ * ``straw2`` is the default for any newly created buckets.
+
+Migration impact:
+
+ * Changing a bucket type from ``straw`` to ``straw2`` will result in
+ a reasonably small amount of data movement, depending on how much
+ the bucket item weights vary from each other. When the weights are
+ all the same no data will move, and when item weights vary
+ significantly there will be more movement.
+
+jewel (CRUSH_TUNABLES5)
+-----------------------
+
+The jewel tunable profile improves the
+overall behavior of CRUSH such that significantly fewer mappings
+change when an OSD is marked out of the cluster.
+
+The new tunable is:
+
+ * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
+ use a better value for an inner loop that greatly reduces the number
+ of mapping changes when an OSD is marked out. The legacy value is 0,
+ while the new value of 1 uses the new approach.
+
+Migration impact:
+
+ * Changing this value on an existing cluster will result in a very
+ large amount of data movement as almost every PG mapping is likely
+ to change.
+
+
+
+
+Which client versions support CRUSH_TUNABLES
+--------------------------------------------
+
+ * argonaut series, v0.48.1 or later
+ * v0.49 or later
+ * Linux kernel version v3.6 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_TUNABLES2
+---------------------------------------------
+
+ * v0.55 or later, including bobtail series (v0.56.x)
+ * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_TUNABLES3
+---------------------------------------------
+
+ * v0.78 (firefly) or later
+ * Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_V4
+--------------------------------------
+
+ * v0.94 (hammer) or later
+ * Linux kernel version v4.1 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_TUNABLES5
+---------------------------------------------
+
+ * v10.0.2 (jewel) or later
+ * Linux kernel version v4.5 or later (for the file system and RBD kernel clients)
+
+Warning when tunables are non-optimal
+-------------------------------------
+
+Starting with version v0.74, Ceph will issue a health warning if the
+current CRUSH tunables don't include all the optimal values from the
+``default`` profile (see below for the meaning of the ``default`` profile).
+To make this warning go away, you have two options:
+
+1. Adjust the tunables on the existing cluster. Note that this will
+ result in some data movement (possibly as much as 10%). This is the
+ preferred route, but should be taken with care on a production cluster
+ where the data movement may affect performance. You can enable optimal
+ tunables with::
+
+ ceph osd crush tunables optimal
+
+ If things go poorly (e.g., too much load) and not very much
+ progress has been made, or there is a client compatibility problem
+ (old kernel cephfs or rbd clients, or pre-bobtail librados
+ clients), you can switch back with::
+
+ ceph osd crush tunables legacy
+
+2. You can make the warning go away without making any changes to CRUSH by
+ adding the following option to your ceph.conf ``[mon]`` section::
+
+ mon warn on legacy crush tunables = false
+
+ For the change to take effect, you will need to restart the monitors, or
+ apply the option to running monitors with::
+
+ ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
+
+
+A few important points
+----------------------
+
+ * Adjusting these values will result in the shift of some PGs between
+ storage nodes. If the Ceph cluster is already storing a lot of
+ data, be prepared for some fraction of the data to move.
+ * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
+ feature bits of new connections as soon as they get
+ the updated map. However, already-connected clients are
+ effectively grandfathered in, and will misbehave if they do not
+ support the new feature.
+ * If the CRUSH tunables are set to non-legacy values and then later
+   changed back to the default values, ``ceph-osd`` daemons will not be
+ required to support the feature. However, the OSD peering process
+ requires examining and understanding old maps. Therefore, you
+ should not run old versions of the ``ceph-osd`` daemon
+ if the cluster has previously used non-legacy CRUSH values, even if
+ the latest version of the map has been switched back to using the
+ legacy defaults.
+
+Tuning CRUSH
+------------
+
+The simplest way to adjust the CRUSH tunables is by changing to a known
+profile. Those are:
+
+ * ``legacy``: the legacy behavior from argonaut and earlier.
+ * ``argonaut``: the legacy values supported by the original argonaut release
+ * ``bobtail``: the values supported by the bobtail release
+ * ``firefly``: the values supported by the firefly release
+ * ``hammer``: the values supported by the hammer release
+ * ``jewel``: the values supported by the jewel release
+ * ``optimal``: the best (i.e., optimal) values of the current version of Ceph
+ * ``default``: the default values of a new cluster installed from
+   scratch. These values, which depend on the current version of Ceph,
+   are hard coded and are generally a mix of optimal and legacy values.
+   These values generally match the ``optimal`` profile of the previous
+   LTS release, or the most recent release for which we expect most
+   users to have up-to-date clients.
+
+You can select a profile on a running cluster with the command::
+
+ ceph osd crush tunables {PROFILE}
+
+Note that this may result in some data movement.
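+
+You can inspect the tunable values currently in effect with::
+
+    ceph osd crush show-tunables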
+
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
+
+
+Primary Affinity
+================
+
+When a Ceph Client reads or writes data, it always contacts the primary OSD in
+the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
+OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
+a slow disk or a slow controller). To prevent performance bottlenecks
+(especially on read operations) while maximizing utilization of your hardware,
+you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
+the OSD as a primary in an acting set. ::
+
+ ceph osd primary-affinity <osd-id> <weight>
+
+Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary).
+You may set an OSD's primary affinity anywhere in the range ``0-1``, where
+``0`` means that the OSD may **NOT** be used as a primary and ``1`` means
+that it may. When the weight is ``< 1``, it is less likely that CRUSH
+will select the Ceph OSD Daemon to act as a primary.
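+
+For example, to make a hypothetical ``osd.2`` half as likely to be
+selected as a primary::
+
+    ceph osd primary-affinity osd.2 0.5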
+
+
+
diff --git a/src/ceph/doc/rados/operations/data-placement.rst b/src/ceph/doc/rados/operations/data-placement.rst
new file mode 100644
index 0000000..27966b0
--- /dev/null
+++ b/src/ceph/doc/rados/operations/data-placement.rst
@@ -0,0 +1,37 @@
+=========================
+ Data Placement Overview
+=========================
+
+Ceph stores, replicates and rebalances data objects across a RADOS cluster
+dynamically. With many different users storing objects in different pools for
+different purposes on countless OSDs, Ceph operations require some data
+placement planning. The main data placement planning concepts in Ceph include:
+
+- **Pools:** Ceph stores data within pools, which are logical groups for storing
+ objects. Pools manage the number of placement groups, the number of replicas,
+ and the ruleset for the pool. To store data in a pool, you must have
+ an authenticated user with permissions for the pool. Ceph can snapshot pools.
+ See `Pools`_ for additional details.
+
+- **Placement Groups:** Ceph maps objects to placement groups (PGs).
+ Placement groups (PGs) are shards or fragments of a logical object pool
+ that place objects as a group into OSDs. Placement groups reduce the amount
+ of per-object metadata when Ceph stores the data in OSDs. A larger number of
+ placement groups (e.g., 100 per OSD) leads to better balancing. See
+ `Placement Groups`_ for additional details.
+
+- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without
+ performance bottlenecks, without limitations to scalability, and without a
+ single point of failure. CRUSH maps provide the physical topology of the
+ cluster to the CRUSH algorithm to determine where the data for an object
+ and its replicas should be stored, and how to do so across failure domains
+ for added data safety among other things. See `CRUSH Maps`_ for additional
+ details.
+
+When you initially set up a test cluster, you can use the default values. Once
+you begin planning for a large Ceph cluster, refer to pools, placement groups
+and CRUSH for data placement operations.
+
+.. _Pools: ../pools
+.. _Placement Groups: ../placement-groups
+.. _CRUSH Maps: ../crush-map
diff --git a/src/ceph/doc/rados/operations/erasure-code-isa.rst b/src/ceph/doc/rados/operations/erasure-code-isa.rst
new file mode 100644
index 0000000..b52933a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-isa.rst
@@ -0,0 +1,105 @@
+=======================
+ISA erasure code plugin
+=======================
+
+The *isa* plugin encapsulates the `ISA
+<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_
+library. It only runs on Intel processors.
+
+Create an isa profile
+=====================
+
+To create a new *isa* erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=isa \
+ technique={reed_sol_van|cauchy} \
+ [k={data-chunks}] \
+ [m={coding-chunks}] \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+              each stored on a different OSD.
+
+:Type: Integer
+:Required: No.
+:Default: 7
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 3
+
+``technique={reed_sol_van|cauchy}``
+
+:Description: The ISA plugin comes in two `Reed Solomon
+ <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_
+ forms. If *reed_sol_van* is set, it is `Vandermonde
+ <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if
+ *cauchy* is set, it is `Cauchy
+ <https://en.wikipedia.org/wiki/Cauchy_matrix>`_.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+              the ruleset. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
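+For example, a minimal profile that relies on the defaults above, and a
+pool created from it (the names are hypothetical)::
+
+    ceph osd erasure-code-profile set isaprofile plugin=isa
+    ceph osd pool create isapool 12 12 erasure isaprofile
+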
diff --git a/src/ceph/doc/rados/operations/erasure-code-jerasure.rst b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst
new file mode 100644
index 0000000..e8da097
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst
@@ -0,0 +1,120 @@
+============================
+Jerasure erasure code plugin
+============================
+
+The *jerasure* plugin is the most generic and flexible plugin; it is
+also the default for Ceph erasure coded pools.
+
+The *jerasure* plugin encapsulates the `Jerasure
+<http://jerasure.org>`_ library. It is
+recommended to read the *jerasure* documentation to get a better
+understanding of the parameters.
+
+Create a jerasure profile
+=========================
+
+To create a new *jerasure* erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=jerasure \
+ k={data-chunks} \
+ m={coding-chunks} \
+ technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+              each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}``
+
+:Description: The most flexible technique is *reed_sol_van*: it is
+              enough to set *k* and *m*. The *cauchy_good* technique
+              can be faster but you need to choose the *packetsize*
+ carefully. All of *reed_sol_r6_op*, *liberation*,
+ *blaum_roth*, *liber8tion* are *RAID6* equivalents in
+ the sense that they can only be configured with *m=2*.
+
+:Type: String
+:Required: No.
+:Default: reed_sol_van
+
+``packetsize={bytes}``
+
+:Description: The encoding will be done on packets of *bytes* size at
+              a time. Choosing the right packet size is difficult. The
+ *jerasure* documentation contains extensive information
+ on this topic.
+
+:Type: Integer
+:Required: No.
+:Default: 2048
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+              the ruleset. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
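+For example, a compact 4+2 profile and a pool that uses it (the names
+are hypothetical)::
+
+    ceph osd erasure-code-profile set jprofile \
+        plugin=jerasure k=4 m=2 crush-failure-domain=host
+    ceph osd pool create jpool 12 12 erasure jprofile
+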
diff --git a/src/ceph/doc/rados/operations/erasure-code-lrc.rst b/src/ceph/doc/rados/operations/erasure-code-lrc.rst
new file mode 100644
index 0000000..447ce23
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-lrc.rst
@@ -0,0 +1,371 @@
+======================================
+Locally repairable erasure code plugin
+======================================
+
+With the *jerasure* plugin, when an erasure coded object is stored on
+multiple OSDs, recovering from the loss of one OSD requires reading
+from all the others. For instance if *jerasure* is configured with
+*k=8* and *m=4*, losing one OSD requires reading from the eleven
+others to repair.
+
+The *lrc* erasure code plugin creates local parity chunks to be able
+to recover using fewer OSDs. For instance if *lrc* is configured with
+*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
+every four OSDs. When a single OSD is lost, it can be recovered with
+only four OSDs instead of eleven.
+
+Erasure code profile examples
+=============================
+
+Reduce recovery bandwidth between hosts
+---------------------------------------
+
+Although it is probably not an interesting use case when all hosts are
+connected to the same switch, reduced bandwidth usage can actually be
+observed.::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ k=4 m=2 l=3 \
+ crush-failure-domain=host
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+Reduce recovery bandwidth between racks
+---------------------------------------
+
+In Firefly the reduced bandwidth will only be observed if the primary
+OSD is in the same rack as the lost chunk.::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ k=4 m=2 l=3 \
+ crush-locality=rack \
+ crush-failure-domain=host
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+Create an lrc profile
+=====================
+
+To create a new lrc erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=lrc \
+ k={data-chunks} \
+ m={coding-chunks} \
+ l={locality} \
+ [crush-root={root}] \
+ [crush-locality={bucket-type}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+              each stored on a different OSD.
+
+:Type: Integer
+:Required: Yes.
+:Example: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding chunks** for each object and store them
+ on different OSDs. The number of coding chunks is also
+ the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: Yes.
+:Example: 2
+
+``l={locality}``
+
+:Description: Group the coding and data chunks into sets of size
+ **locality**. For instance, for **k=4** and **m=2**,
+ when **locality=3** two groups of three are created.
+ Each set can be recovered without reading chunks
+ from another set.
+
+:Type: Integer
+:Required: Yes.
+:Example: 3
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+              the ruleset. For instance **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-locality={bucket-type}``
+
+:Description: The type of the crush bucket in which each set of chunks
+ defined by **l** will be stored. For instance, if it is
+ set to **rack**, each group of **l** chunks will be
+ placed in a different rack. It is used to create a
+ ruleset step such as **step choose rack**. If it is not
+ set, no such grouping is done.
+
+:Type: String
+:Required: No.
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host** no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+Low level plugin configuration
+==============================
+
+The sum of **k** and **m** must be a multiple of the **l** parameter.
+The low level configuration parameters do not impose such a
+restriction, and it may be more convenient to use them for specific
+purposes. It is for instance possible to define two groups, one with 4
+chunks and another with 3 chunks. It is also possible to recursively
+define locality sets, for instance datacenters and racks into
+datacenters. The **k/m/l** parameters are implemented by generating a
+low level configuration.
+
+The *lrc* erasure code plugin recursively applies erasure code
+techniques so that recovering from the loss of some chunks only
+requires a subset of the available chunks, most of the time.
+
+For instance, when three coding steps are described as::
+
+ chunk nr 01234567
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+where *c* are coding chunks calculated from the data chunks *D*, the
+loss of chunk *7* can be recovered with the last four chunks. And the
+loss of chunk *2* can be recovered with the first four
+chunks.
+
+Erasure code profile examples using low level configuration
+===========================================================
+
+Minimal testing
+---------------
+
+This is strictly equivalent to using the default erasure code profile. The *DD*
+implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used
+by default.::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=DD_ \
+ layers='[ [ "DDc", "" ] ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+Reduce recovery bandwidth between hosts
+---------------------------------------
+
+Although it is probably not an interesting use case when all hosts are
+connected to the same switch, reduced bandwidth usage can actually be
+observed. It is equivalent to **k=4**, **m=2** and **l=3** although
+the layout of the chunks is different::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "" ],
+ [ "cDDD____", "" ],
+ [ "____cDDD", "" ],
+ ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+Reduce recovery bandwidth between racks
+---------------------------------------
+
+In Firefly the reduced bandwidth will only be observed if the primary
+OSD is in the same rack as the lost chunk.::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "" ],
+ [ "cDDD____", "" ],
+ [ "____cDDD", "" ],
+ ]' \
+ crush-steps='[
+ [ "choose", "rack", 2 ],
+ [ "chooseleaf", "host", 4 ],
+ ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+Testing with different Erasure Code backends
+--------------------------------------------
+
+LRC now uses jerasure as the default EC backend. It is possible to
+specify the EC backend/algorithm on a per-layer basis using the low
+level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
+is actually an erasure code profile to be used for this layer. The
+example below specifies the ISA backend with the cauchy technique to
+be used in the lrcpool.::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=DD_ \
+ layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+You could also use a different erasure code profile for each
+layer.::
+
+ $ ceph osd erasure-code-profile set LRCprofile \
+ plugin=lrc \
+ mapping=__DD__DD \
+ layers='[
+ [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
+ [ "cDDD____", "plugin=isa" ],
+ [ "____cDDD", "plugin=jerasure" ],
+ ]'
+ $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+
+
+
+Erasure coding and decoding algorithm
+=====================================
+
+The steps found in the layers description::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+are applied in order. For instance, if a 4K object is encoded, it will
+first go through *step 1* and be divided into four 1K chunks (the four
+uppercase D). They are stored in chunks 2, 3, 6 and 7, in that
+order. From these, two coding chunks are calculated (the two lowercase
+c). The coding chunks are stored in chunks 1 and 5, respectively.
+
+The *step 2* re-uses the content created by *step 1* in a similar
+fashion and stores a single coding chunk *c* at position 0. The last four
+chunks, marked with an underscore (*_*) for readability, are ignored.
+
+The *step 3* stores a single coding chunk *c* at position 4. The three
+chunks created by *step 1* are used to compute this coding chunk,
+i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.
+
+If chunk *2* is lost::
+
+ chunk nr 01234567
+
+ step 1 _c D_cDD
+ step 2 cD D____
+ step 3 __ _cDDD
+
+decoding will attempt to recover it by walking the steps in reverse
+order: *step 3* then *step 2* and finally *step 1*.
+
+The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
+and is skipped.
+
+The coding chunk from *step 2*, stored in chunk *0*, allows recovery of
+the content of chunk *2*. There are no more chunks to recover
+and the process stops, without considering *step 1*.
+
+Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
+back chunk *2*.
+
+If chunks *2, 3, 6* are lost::
+
+ chunk nr 01234567
+
+ step 1 _c _c D
+ step 2 cD __ _
+ step 3 __ cD D
+
+The *step 3* can recover the content of chunk *6*::
+
+ chunk nr 01234567
+
+ step 1 _c _cDD
+ step 2 cD ____
+ step 3 __ cDDD
+
+The *step 2* fails to recover and is skipped because there are two
+chunks missing (*2, 3*) and it can only recover from one missing
+chunk.
+
+The coding chunks from *step 1*, stored in chunks *1* and *5*, allow
+recovery of the content of chunks *2* and *3*::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+Controlling crush placement
+===========================
+
+The default crush ruleset selects OSDs that are on different hosts. For instance::
+
+ chunk nr 01234567
+
+ step 1 _cDD_cDD
+ step 2 cDDD____
+ step 3 ____cDDD
+
+needs exactly *8* OSDs, one for each chunk. If the hosts are in two
+adjacent racks, the first four chunks can be placed in the first rack
+and the last four in the second rack, so that recovering from the loss
+of a single OSD does not require using bandwidth between the two
+racks.
+
+For instance::
+
+ crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'
+
+will create a ruleset that will select two crush buckets of type
+*rack* and for each of them choose four OSDs, each of them located in
+different buckets of type *host*.
+
+The ruleset can also be manually crafted for finer control.
diff --git a/src/ceph/doc/rados/operations/erasure-code-profile.rst b/src/ceph/doc/rados/operations/erasure-code-profile.rst
new file mode 100644
index 0000000..ddf772d
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-profile.rst
@@ -0,0 +1,121 @@
+=====================
+Erasure code profiles
+=====================
+
+Erasure code is defined by a **profile** and is used when creating an
+erasure coded pool and the associated crush ruleset.
+
+The **default** erasure code profile (which is created when the Ceph
+cluster is initialized) provides the same level of redundancy as two
+copies but requires 25% less disk space. It is described as a profile
+with **k=2** and **m=1**, meaning the information is spread over three
+OSDs (k+m == 3) and one of them can be lost.
+
+To improve redundancy without increasing raw storage requirements, a
+new profile can be created. For instance, a profile with **k=10** and
+**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an
+object on fourteen (k+m=14) OSDs. The object is first divided into
+**10** chunks (if the object is 10MB, each chunk is 1MB) and **4**
+coding chunks are computed, for recovery (each coding chunk has the
+same size as the data chunk, i.e. 1MB). The raw space overhead is only
+40% and the object will not be lost even if four OSDs break at the
+same time.
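+
+As a sketch, assuming illustrative profile and pool names and placement
+group counts, such a profile could be created with::
+
+    $ ceph osd erasure-code-profile set k10m4profile \
+         k=10 \
+         m=4 \
+         crush-failure-domain=host
+    $ ceph osd pool create k10m4pool 128 128 erasure k10m4profile
+
+Note that with ``crush-failure-domain=host`` this layout needs at least
+fourteen hosts.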
+
+.. _list of available plugins:
+
+.. toctree::
+ :maxdepth: 1
+
+ erasure-code-jerasure
+ erasure-code-isa
+ erasure-code-lrc
+ erasure-code-shec
+
+osd erasure-code-profile set
+============================
+
+To create a new erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ [{directory=directory}] \
+ [{plugin=plugin}] \
+ [{stripe_unit=stripe_unit}] \
+ [{key=value} ...] \
+ [--force]
+
+Where:
+
+``{directory=directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``{plugin=plugin}``
+
+:Description: Use the erasure code **plugin** to compute coding chunks
+ and recover missing chunks. See the `list of available
+ plugins`_ for more information.
+
+:Type: String
+:Required: No.
+:Default: jerasure
+
+``{stripe_unit=stripe_unit}``
+
+:Description: The amount of data in a data chunk, per stripe. For
+ example, a profile with 2 data chunks and stripe_unit=4K
+ would put the range 0-4K in chunk 0, 4K-8K in chunk 1,
+ then 8K-12K in chunk 0 again. This should be a multiple
+ of 4K for best performance. The default value is taken
+ from the monitor config option
+ ``osd_pool_erasure_code_stripe_unit`` when a pool is
+ created. The stripe_width of a pool using this profile
+ will be the number of data chunks multiplied by this
+ stripe_unit (see the example after this list).
+
+:Type: String
+:Required: No.
+
+``{key=value}``
+
+:Description: The semantics of the remaining key/value pairs are defined
+ by the erasure code plugin.
+
+:Type: String
+:Required: No.
+
+``--force``
+
+:Description: Override an existing profile by the same name, and allow
+ setting a non-4K-aligned stripe_unit.
+
+:Type: String
+:Required: No.
+
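+As a minimal sketch of the ``stripe_unit`` setting described above,
+assuming an illustrative profile name and values::
+
+    ceph osd erasure-code-profile set stripeprofile \
+         k=2 m=1 stripe_unit=4K
+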
+osd erasure-code-profile rm
+============================
+
+To remove an erasure code profile::
+
+ ceph osd erasure-code-profile rm {name}
+
+If the profile is referenced by a pool, the deletion will fail.
+
+osd erasure-code-profile get
+============================
+
+To display an erasure code profile::
+
+ ceph osd erasure-code-profile get {name}
+
+osd erasure-code-profile ls
+===========================
+
+To list the names of all erasure code profiles::
+
+ ceph osd erasure-code-profile ls
+
diff --git a/src/ceph/doc/rados/operations/erasure-code-shec.rst b/src/ceph/doc/rados/operations/erasure-code-shec.rst
new file mode 100644
index 0000000..e3bab37
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code-shec.rst
@@ -0,0 +1,144 @@
+========================
+SHEC erasure code plugin
+========================
+
+The *shec* plugin encapsulates the `multiple SHEC
+<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_
+library. It allows Ceph to recover data more efficiently than Reed-Solomon codes.
+
+Create an SHEC profile
+======================
+
+To create a new *shec* erasure code profile::
+
+ ceph osd erasure-code-profile set {name} \
+ plugin=shec \
+ [k={data-chunks}] \
+ [m={coding-chunks}] \
+ [c={durability-estimator}] \
+ [crush-root={root}] \
+ [crush-failure-domain={bucket-type}] \
+ [crush-device-class={device-class}] \
+ [directory={directory}] \
+ [--force]
+
+Where:
+
+``k={data-chunks}``
+
+:Description: Each object is split into **data-chunks** parts,
+ each stored on a different OSD.
+
+:Type: Integer
+:Required: No.
+:Default: 4
+
+``m={coding-chunks}``
+
+:Description: Compute **coding-chunks** for each object and store them on
+ different OSDs. The number of **coding-chunks** does not necessarily
+ equal the number of OSDs that can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 3
+
+``c={durability-estimator}``
+
+:Description: The number of parity chunks, each of which includes each data chunk in its
+ calculation range. The number is used as a **durability estimator**.
+ For instance, if c=2, 2 OSDs can be down without losing data.
+
+:Type: Integer
+:Required: No.
+:Default: 2
+
+``crush-root={root}``
+
+:Description: The name of the crush bucket used for the first step of
+ the ruleset. For instance, **step take default**.
+
+:Type: String
+:Required: No.
+:Default: default
+
+``crush-failure-domain={bucket-type}``
+
+:Description: Ensure that no two chunks are in a bucket with the same
+ failure domain. For instance, if the failure domain is
+ **host**, no two chunks will be stored on the same
+ host. It is used to create a ruleset step such as **step
+ chooseleaf host**.
+
+:Type: String
+:Required: No.
+:Default: host
+
+``crush-device-class={device-class}``
+
+:Description: Restrict placement to devices of a specific class (e.g.,
+ ``ssd`` or ``hdd``), using the crush device class names
+ in the CRUSH map.
+
+:Type: String
+:Required: No.
+:Default:
+
+``directory={directory}``
+
+:Description: Set the **directory** name from which the erasure code
+ plugin is loaded.
+
+:Type: String
+:Required: No.
+:Default: /usr/lib/ceph/erasure-code
+
+``--force``
+
+:Description: Override an existing profile by the same name.
+
+:Type: String
+:Required: No.
+
+Brief description of SHEC's layouts
+===================================
+
+Space Efficiency
+----------------
+
+Space efficiency is the ratio of data chunks to all chunks in an object,
+represented as k/(k+m).
+To improve space efficiency, you should increase k or decrease m.
+
+::
+
+ space efficiency of SHEC(4,3,2) = 4/(4+3) = 0.57
+ SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency
+
+Durability
+----------
+
+The third parameter of SHEC (=c) is a durability estimator, which approximates
+the number of OSDs that can be down without losing data.
+
+``durability estimator of SHEC(4,3,2) = 2``
+
+Recovery Efficiency
+-------------------
+
+Describing the calculation of recovery efficiency is beyond the scope of this
+document, but at a minimum, increasing m without increasing c improves recovery
+efficiency. (Note, however, that space efficiency is sacrificed in this case.)
+
+``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency``
+
+Erasure code profile examples
+=============================
+
+::
+
+ $ ceph osd erasure-code-profile set SHECprofile \
+ plugin=shec \
+ k=8 m=4 c=3 \
+ crush-failure-domain=host
+ $ ceph osd pool create shecpool 256 256 erasure SHECprofile
diff --git a/src/ceph/doc/rados/operations/erasure-code.rst b/src/ceph/doc/rados/operations/erasure-code.rst
new file mode 100644
index 0000000..6ec5a09
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code.rst
@@ -0,0 +1,195 @@
+=============
+ Erasure code
+=============
+
+A Ceph pool is associated with a type that determines how it sustains the
+loss of an OSD (i.e. a disk, since most of the time there is one OSD per disk). The
+default choice when `creating a pool <../pools>`_ is *replicated*,
+meaning every object is copied on multiple disks. The `Erasure Code
+<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used
+instead to save space.
+
+Creating a sample erasure coded pool
+------------------------------------
+
+The simplest erasure coded pool is equivalent to `RAID5
+<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
+requires at least three hosts::
+
+ $ ceph osd pool create ecpool 12 12 erasure
+ pool 'ecpool' created
+ $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+ $ rados --pool ecpool get NYAN -
+ ABCDEFGHI
+
+.. note:: the 12 in *pool create* stands for
+ `the number of placement groups <../pools>`_.
+
+Erasure code profiles
+---------------------
+
+The default erasure code profile sustains the loss of a single OSD. It
+is equivalent to a replicated pool of size two but requires 1.5TB
+instead of 2TB to store 1TB of data. The default profile can be
+displayed with::
+
+ $ ceph osd erasure-code-profile get default
+ k=2
+ m=1
+ plugin=jerasure
+ crush-failure-domain=host
+ technique=reed_sol_van
+
+Choosing the right profile is important because it cannot be modified
+after the pool is created: a new pool with a different profile needs
+to be created and all objects from the previous pool moved to it.
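+
+A hedged sketch of such a migration, assuming illustrative pool and
+profile names (``rados cppool`` has limitations, e.g. with snapshots,
+so verify that it fits your data before relying on it)::
+
+    $ ceph osd erasure-code-profile set newprofile k=4 m=2
+    $ ceph osd pool create newpool 12 12 erasure newprofile
+    $ rados cppool ecpool newpool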
+
+The most important parameters of the profile are *K*, *M* and
+*crush-failure-domain* because they define the storage overhead and
+the data durability. For instance, if the desired architecture must
+sustain the loss of two racks with a storage overhead of 67%,
+the following profile can be defined::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ k=3 \
+ m=2 \
+ crush-failure-domain=rack
+ $ ceph osd pool create ecpool 12 12 erasure myprofile
+ $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+ $ rados --pool ecpool get NYAN -
+ ABCDEFGHI
+
+The *NYAN* object will be divided into three (*K=3*) and two additional
+*chunks* will be created (*M=2*). The value of *M* defines how many
+OSDs can be lost simultaneously without losing any data. The
+*crush-failure-domain=rack* will create a CRUSH ruleset that ensures
+no two *chunks* are stored in the same rack.
+
+.. ditaa::
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ |
+ |
+ v
+ +------+------+
+ +---------------+ encode(3,2) +-----------+
+ | +--+--+---+---+ |
+ | | | | |
+ | +-------+ | +-----+ |
+ | | | | |
+ +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
+ name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 | | 5 |
+ +------+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY | | QGC |
+ +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
+ | | | | |
+ | | v | |
+ | | +--+---+ | |
+ | | | OSD1 | | |
+ | | +------+ | |
+ | | | |
+ | | +------+ | |
+ | +------>| OSD2 | | |
+ | +------+ | |
+ | | |
+ | +------+ | |
+ | | OSD3 |<----+ |
+ | +------+ |
+ | |
+ | +------+ |
+ | | OSD4 |<--------------+
+ | +------+
+ |
+ | +------+
+ +----------------->| OSD5 |
+ +------+
+
+
+More information can be found in the `erasure code profiles
+<../erasure-code-profile>`_ documentation.
+
+
+Erasure Coding with Overwrites
+------------------------------
+
+By default, erasure coded pools only work with applications like RGW that
+perform full object writes and appends.
+
+Since Luminous, partial writes for an erasure coded pool may be
+enabled with a per-pool setting. This lets RBD and Cephfs store their
+data in an erasure coded pool::
+
+ ceph osd pool set ec_pool allow_ec_overwrites true
+
+This can only be enabled on a pool residing on bluestore OSDs, since
+bluestore's checksumming is used to detect bitrot or other corruption
+during deep-scrub. In addition to being unsafe, using filestore with
+ec overwrites yields low performance compared to bluestore.
+
+Erasure coded pools do not support omap, so to use them with RBD and
+Cephfs you must instruct them to store their data in an ec pool, and
+their metadata in a replicated pool. For RBD, this means using the
+erasure coded pool as the ``--data-pool`` during image creation::
+
+ rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
+
+For Cephfs, using an erasure coded pool means setting that pool in
+a `file layout <../../../cephfs/file-layouts>`_.
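+
+A minimal sketch, assuming a filesystem named ``cephfs`` mounted at
+``/mnt/cephfs`` and an erasure coded pool named ``ec_pool``::
+
+    $ ceph fs add_data_pool cephfs ec_pool
+    $ mkdir /mnt/cephfs/ecdir
+    $ setfattr -n ceph.dir.layout.pool -v ec_pool /mnt/cephfs/ecdir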
+
+
+Erasure coded pool and cache tiering
+------------------------------------
+
+Erasure coded pools require more resources than replicated pools and
+lack some functionalities such as omap. To overcome these
+limitations, one can set up a `cache tier <../cache-tiering>`_
+before the erasure coded pool.
+
+For instance, if the pool *hot-storage* is made of fast storage::
+
+ $ ceph osd tier add ecpool hot-storage
+ $ ceph osd tier cache-mode hot-storage writeback
+ $ ceph osd tier set-overlay ecpool hot-storage
+
+will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
+mode, so that every write and read to *ecpool* actually uses
+*hot-storage* and benefits from its flexibility and speed.
+
+More information can be found in the `cache tiering
+<../cache-tiering>`_ documentation.
+
+Glossary
+--------
+
+*chunk*
+ when the encoding function is called, it returns chunks of the same
+ size: data chunks, which can be concatenated to reconstruct the original
+ object, and coding chunks, which can be used to rebuild a lost chunk.
+
+*K*
+ the number of data *chunks*, i.e. the number of *chunks* in which the
+ original object is divided. For instance, if *K* = 2, a 10KB object
+ will be divided into *K* chunks of 5KB each.
+
+*M*
+ the number of coding *chunks*, i.e. the number of additional *chunks*
+ computed by the encoding functions. If there are 2 coding *chunks*,
+ it means 2 OSDs can be out without losing data.
+
+
+Table of content
+----------------
+
+.. toctree::
+ :maxdepth: 1
+
+ erasure-code-profile
+ erasure-code-jerasure
+ erasure-code-isa
+ erasure-code-lrc
+ erasure-code-shec
diff --git a/src/ceph/doc/rados/operations/health-checks.rst b/src/ceph/doc/rados/operations/health-checks.rst
new file mode 100644
index 0000000..c1e2200
--- /dev/null
+++ b/src/ceph/doc/rados/operations/health-checks.rst
@@ -0,0 +1,527 @@
+
+=============
+Health checks
+=============
+
+Overview
+========
+
+There is a finite set of possible health messages that a Ceph cluster can
+raise -- these are defined as *health checks* which have unique identifiers.
+
+The identifier is a terse pseudo-human-readable (i.e. like a variable name)
+string. It is intended to enable tools (such as UIs) to make sense of
+health checks, and present them in a way that reflects their meaning.
+
+This page lists the health checks that are raised by the monitor and manager
+daemons. In addition to these, you may also see health checks that originate
+from MDS daemons (see :doc:`/cephfs/health-messages`), and health checks
+that are defined by ceph-mgr python modules.
+
+Definitions
+===========
+
+
+OSDs
+----
+
+OSD_DOWN
+________
+
+One or more OSDs are marked down. The ceph-osd daemon may have been
+stopped, or peer OSDs may be unable to reach the OSD over the network.
+Common causes include a stopped or crashed daemon, a down host, or a
+network outage.
+
+Verify the host is healthy, the daemon is started, and network is
+functioning. If the daemon has crashed, the daemon log file
+(``/var/log/ceph/ceph-osd.*``) may contain debugging information.
+
+OSD_<crush type>_DOWN
+_____________________
+
+(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN)
+
+All the OSDs within a particular CRUSH subtree are marked down, for example
+all OSDs on a host.
+
+OSD_ORPHAN
+__________
+
+An OSD is referenced in the CRUSH map hierarchy but does not exist.
+
+The OSD can be removed from the CRUSH hierarchy with::
+
+ ceph osd crush rm osd.<id>
+
+OSD_OUT_OF_ORDER_FULL
+_____________________
+
+The utilization thresholds for `backfillfull`, `nearfull`, `full`,
+and/or `failsafe_full` are not ascending. In particular, we expect
+`backfillfull < nearfull`, `nearfull < full`, and `full <
+failsafe_full`.
+
+The thresholds can be adjusted with::
+
+ ceph osd set-backfillfull-ratio <ratio>
+ ceph osd set-nearfull-ratio <ratio>
+ ceph osd set-full-ratio <ratio>
+
+
+OSD_FULL
+________
+
+One or more OSDs has exceeded the `full` threshold and is preventing
+the cluster from servicing writes.
+
+Utilization by pool can be checked with::
+
+ ceph df
+
+The currently defined `full` ratio can be seen with::
+
+ ceph osd dump | grep full_ratio
+
+A short-term workaround to restore write availability is to raise the full
+threshold by a small amount::
+
+ ceph osd set-full-ratio <ratio>
+
+New storage should be added to the cluster by deploying more OSDs or
+existing data should be deleted in order to free up space.
+
+OSD_BACKFILLFULL
+________________
+
+One or more OSDs has exceeded the `backfillfull` threshold, which will
+prevent data from rebalancing to this device. This is
+an early warning that rebalancing may not be able to complete and that
+the cluster is approaching full.
+
+Utilization by pool can be checked with::
+
+ ceph df
+
+OSD_NEARFULL
+____________
+
+One or more OSDs has exceeded the `nearfull` threshold. This is an early
+warning that the cluster is approaching full.
+
+Utilization by pool can be checked with::
+
+ ceph df
+
+OSDMAP_FLAGS
+____________
+
+One or more cluster flags of interest has been set. These flags include:
+
+* *full* - the cluster is flagged as full and cannot service writes
+* *pauserd*, *pausewr* - paused reads or writes
+* *noup* - OSDs are not allowed to start
+* *nodown* - OSD failure reports are being ignored, such that the
+ monitors will not mark OSDs `down`
+* *noin* - OSDs that were previously marked `out` will not be marked
+ back `in` when they start
+* *noout* - down OSDs will not automatically be marked out after the
+ configured interval
+* *nobackfill*, *norecover*, *norebalance* - recovery or data
+ rebalancing is suspended
+* *noscrub*, *nodeep_scrub* - scrubbing is disabled
+* *notieragent* - cache tiering activity is suspended
+
+With the exception of *full*, these flags can be set or cleared with::
+
+ ceph osd set <flag>
+ ceph osd unset <flag>
+
+OSD_FLAGS
+_________
+
+One or more OSDs has a per-OSD flag of interest set. These flags include:
+
+* *noup*: OSD is not allowed to start
+* *nodown*: failure reports for this OSD will be ignored
+* *noin*: if this OSD was previously marked `out` automatically
+ after a failure, it will not be marked `in` when it starts
+* *noout*: if this OSD is down it will not automatically be marked
+ `out` after the configured interval
+
+Per-OSD flags can be set and cleared with::
+
+ ceph osd add-<flag> <osd-id>
+ ceph osd rm-<flag> <osd-id>
+
+For example, ::
+
+ ceph osd rm-nodown osd.123
+
+OLD_CRUSH_TUNABLES
+__________________
+
+The CRUSH map is using very old settings and should be updated. The
+oldest tunables that can be used (i.e., the oldest client version that
+can connect to the cluster) without triggering this health warning is
+determined by the ``mon_crush_min_required_version`` config option.
+See :doc:`/rados/operations/crush-map/#tunables` for more information.
+
+OLD_CRUSH_STRAW_CALC_VERSION
+____________________________
+
+The CRUSH map is using an older, non-optimal method for calculating
+intermediate weight values for ``straw`` buckets.
+
+The CRUSH map should be updated to use the newer method
+(``straw_calc_version=1``). See
+:doc:`/rados/operations/crush-map/#tunables` for more information.
+
+CACHE_POOL_NO_HIT_SET
+_____________________
+
+One or more cache pools is not configured with a *hit set* to track
+utilization, which will prevent the tiering agent from identifying
+cold objects to flush and evict from the cache.
+
+Hit sets can be configured on the cache pool with::
+
+ ceph osd pool set <poolname> hit_set_type <type>
+ ceph osd pool set <poolname> hit_set_period <period-in-seconds>
+ ceph osd pool set <poolname> hit_set_count <number-of-hitsets>
+ ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate>
+
+OSD_NO_SORTBITWISE
+__________________
+
+No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
+been set.
+
+The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
+OSDs can start. You can safely set the flag with::
+
+ ceph osd set sortbitwise
+
+POOL_FULL
+_________
+
+One or more pools has reached its quota and is no longer allowing writes.
+
+Pool quotas and utilization can be seen with::
+
+ ceph df detail
+
+You can either raise the pool quota with::
+
+ ceph osd pool set-quota <poolname> max_objects <num-objects>
+ ceph osd pool set-quota <poolname> max_bytes <num-bytes>
+
+or delete some existing data to reduce utilization.
+
+
+Data health (pools & placement groups)
+--------------------------------------
+
+PG_AVAILABILITY
+_______________
+
+Data availability is reduced, meaning that the cluster is unable to
+service potential read or write requests for some data in the cluster.
+Specifically, one or more PGs is in a state that does not allow IO
+requests to be serviced. Problematic PG states include *peering*,
+*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear
+quickly).
+
+Detailed information about which PGs are affected is available from::
+
+ ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+ ceph tell <pgid> query
+
+PG_DEGRADED
+___________
+
+Data redundancy is reduced for some data, meaning the cluster does not
+have the desired number of replicas for all data (for replicated
+pools) or erasure code fragments (for erasure coded pools).
+Specifically, one or more PGs:
+
+* has the *degraded* or *undersized* flag set, meaning there are not
+ enough instances of that placement group in the cluster;
+* has not had the *clean* flag set for some time.
+
+Detailed information about which PGs are affected is available from::
+
+ ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+ ceph tell <pgid> query
+
+
+PG_DEGRADED_FULL
+________________
+
+Data redundancy may be reduced or at risk for some data due to a lack
+of free space in the cluster. Specifically, one or more PGs has the
+*backfill_toofull* or *recovery_toofull* flag set, meaning that the
+cluster is unable to migrate or recover data because one or more OSDs
+is above the *backfillfull* threshold.
+
+See the discussion for *OSD_BACKFILLFULL* or *OSD_FULL* above for
+steps to resolve this condition.
+
+PG_DAMAGED
+__________
+
+Data scrubbing has discovered some problems with data consistency in
+the cluster. Specifically, one or more PGs has the *inconsistent* or
+*snaptrim_error* flag set, indicating an earlier scrub operation
+found a problem, or has the *repair* flag set, meaning a repair
+for such an inconsistency is currently in progress.
+
+See :doc:`pg-repair` for more information.
+
+OSD_SCRUB_ERRORS
+________________
+
+Recent OSD scrubs have uncovered inconsistencies. This error is generally
+paired with *PG_DAMAGED* (see above).
+
+See :doc:`pg-repair` for more information.
+
+CACHE_POOL_NEAR_FULL
+____________________
+
+A cache tier pool is nearly full. Full in this context is determined
+by the ``target_max_bytes`` and ``target_max_objects`` properties on
+the cache pool. Once the pool reaches the target threshold, write
+requests to the pool may block while data is flushed and evicted
+from the cache, a state that normally leads to very high latencies and
+poor performance.
+
+The cache pool target size can be adjusted with::
+
+ ceph osd pool set <cache-pool-name> target_max_bytes <bytes>
+ ceph osd pool set <cache-pool-name> target_max_objects <objects>
+
+Normal cache flush and evict activity may also be throttled due to reduced
+availability or performance of the base tier, or overall cluster load.
+
+TOO_FEW_PGS
+___________
+
+The number of PGs in use in the cluster is below the configurable
+threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead
+to suboptimal distribution and balance of data across the OSDs in
+the cluster, and similarly reduce overall performance.
+
+This may be an expected condition if data pools have not yet been
+created.
+
+The PG count for existing pools can be increased or new pools can be
+created. Please refer to
+:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for
+more information.
+
+TOO_MANY_PGS
+____________
+
+The number of PGs in use in the cluster is above the configurable
+threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is
+exceeded, the cluster will not allow new pools to be created, pool `pg_num` to
+be increased, or pool replication to be increased (any of which would lead to
+more PGs in the cluster). A large number of PGs can lead
+to higher memory utilization for OSD daemons, slower peering after
+cluster state changes (like OSD restarts, additions, or removals), and
+higher load on the Manager and Monitor daemons.
+
+The simplest way to mitigate the problem is to increase the number of
+OSDs in the cluster by adding more hardware. Note that the OSD count
+used for the purposes of this health check is the number of "in" OSDs,
+so marking "out" OSDs "in" (if there are any) can also help::
+
+ ceph osd in <osd id(s)>
+
+Please refer to
+:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for
+more information.
+
+SMALLER_PGP_NUM
+_______________
+
+One or more pools has a ``pgp_num`` value less than ``pg_num``. This
+is normally an indication that the PG count was increased without
+also increasing ``pgp_num``, which controls placement.
+
+This is sometimes done deliberately to separate out the `split` step
+when the PG count is adjusted from the data migration that is needed
+when ``pgp_num`` is changed.
+
+This is normally resolved by setting ``pgp_num`` to match ``pg_num``,
+triggering the data migration, with::
+
+ ceph osd pool set <pool> pgp_num <pg-num-value>
+
+MANY_OBJECTS_PER_PG
+___________________
+
+One or more pools has an average number of objects per PG that is
+significantly higher than the overall cluster average. The specific
+threshold is controlled by the ``mon_pg_warn_max_object_skew``
+configuration value.
+
+This is usually an indication that the pool(s) containing most of the
+data in the cluster have too few PGs, and/or that other pools that do
+not contain as much data have too many PGs. See the discussion of
+*TOO_MANY_PGS* above.
+
+The threshold can be raised to silence the health warning by adjusting
+the ``mon_pg_warn_max_object_skew`` config option on the monitors.
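+
+For instance, a hedged ``ceph.conf`` sketch raising the threshold (the
+value is illustrative)::
+
+    [mon]
+    mon pg warn max object skew = 20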
+
+POOL_APP_NOT_ENABLED
+____________________
+
+A pool exists that contains one or more objects but has not been
+tagged for use by a particular application.
+
+Resolve this warning by labeling the pool for use by an application. For
+example, if the pool is used by RBD,::
+
+ rbd pool init <poolname>
+
+If the pool is being used by a custom application 'foo', you can also label
+it via the low-level command::
+
+    ceph osd pool application enable <poolname> foo
+
+For more information, see :doc:`pools#associate-pool-to-application`.
+
+POOL_FULL
+_________
+
+One or more pools has reached (or is very close to reaching) its
+quota. The threshold to trigger this error condition is controlled by
+the ``mon_pool_quota_crit_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+ ceph osd pool set-quota <pool> max_bytes <bytes>
+ ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
+
+POOL_NEAR_FULL
+______________
+
+One or more pools is approaching its quota. The threshold to trigger
+this warning condition is controlled by the
+``mon_pool_quota_warn_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+ ceph osd pool set-quota <pool> max_bytes <bytes>
+ ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
+
+OBJECT_MISPLACED
+________________
+
+One or more objects in the cluster is not stored on the node the
+cluster would like it to be stored on. This is an indication that
+data migration due to some recent cluster change has not yet completed.
+
+Misplaced data is not a dangerous condition in and of itself; data
+consistency is never at risk, and old copies of objects are never
+removed until the desired number of new copies (in the desired
+locations) are present.
+
+OBJECT_UNFOUND
+______________
+
+One or more objects in the cluster cannot be found. Specifically, the
+OSDs know that a new or updated copy of an object should exist, but a
+copy of that version of the object has not been found on OSDs that are
+currently online.
+
+Read or write requests to unfound objects will block.
+
+Ideally, a down OSD that has the most recent copy of the unfound object
+can be brought back online. Candidate OSDs can be identified from the
+peering state for the PG(s) responsible for the unfound object::
+
+ ceph tell <pgid> query
+
+If the latest copy of the object is not available, the cluster can be
+told to roll back to a previous version of the object. See
+:doc:`troubleshooting-pg#Unfound-objects` for more information.
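+
+A hedged sketch of that rollback, assuming an illustrative PG id (use
+with care, and consult the linked page first)::
+
+    ceph pg 2.5 mark_unfound_lost revert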
+
+REQUEST_SLOW
+____________
+
+One or more OSD requests is taking a long time to process. This can
+be an indication of extreme load, a slow storage device, or a software
+bug.
+
+The request queue on the OSD(s) in question can be queried with the
+following command, executed from the OSD host::
+
+ ceph daemon osd.<id> ops
+
+A summary of the slowest recent requests can be seen with::
+
+ ceph daemon osd.<id> dump_historic_ops
+
+The location of an OSD can be found with::
+
+ ceph osd find osd.<id>
+
+REQUEST_STUCK
+_____________
+
+One or more OSD requests has been blocked for an extremely long time.
+This is an indication that either the cluster has been unhealthy for
+an extended period of time (e.g., not enough running OSDs) or there is
+some internal problem with the OSD. See the discussion of
+*REQUEST_SLOW* above.
+
+PG_NOT_SCRUBBED
+_______________
+
+One or more PGs has not been scrubbed recently. PGs are normally
+scrubbed every ``mon_scrub_interval`` seconds, and this warning
+triggers when ``mon_warn_not_scrubbed`` such intervals have elapsed
+without a scrub.
+
+PGs will not scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a scrub of a clean PG with::
+
+ ceph pg scrub <pgid>
+
+PG_NOT_DEEP_SCRUBBED
+____________________
+
+One or more PGs has not been deep scrubbed recently. PGs are normally
+deep scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning
+triggers when ``mon_warn_not_deep_scrubbed`` such intervals have elapsed
+without a scrub.
+
+PGs will not (deep) scrub if they are not flagged as *clean*, which may
+happen if they are misplaced or degraded (see *PG_AVAILABILITY* and
+*PG_DEGRADED* above).
+
+You can manually initiate a deep scrub of a clean PG with::
+
+ ceph pg deep-scrub <pgid>
diff --git a/src/ceph/doc/rados/operations/index.rst b/src/ceph/doc/rados/operations/index.rst
new file mode 100644
index 0000000..aacf764
--- /dev/null
+++ b/src/ceph/doc/rados/operations/index.rst
@@ -0,0 +1,90 @@
+====================
+ Cluster Operations
+====================
+
+.. raw:: html
+
+ <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3>
+
+High-level cluster operations consist primarily of starting, stopping, and
+restarting a cluster with the ``ceph`` service; checking the cluster's health;
+and, monitoring an operating cluster.
+
+.. toctree::
+ :maxdepth: 1
+
+ operating
+ health-checks
+ monitoring
+ monitoring-osd-pg
+ user-management
+
+.. raw:: html
+
+ </td><td><h3>Data Placement</h3>
+
+Once you have your cluster up and running, you may begin working with data
+placement. Ceph supports petabyte-scale data storage clusters, with storage
+pools and placement groups that distribute data across the cluster using Ceph's
+CRUSH algorithm.
+
+.. toctree::
+ :maxdepth: 1
+
+ data-placement
+ pools
+ erasure-code
+ cache-tiering
+ placement-groups
+ upmap
+ crush-map
+ crush-map-edits
+
+
+
+.. raw:: html
+
+ </td></tr><tr><td><h3>Low-level Operations</h3>
+
+Low-level cluster operations consist of starting, stopping, and restarting a
+particular daemon within a cluster; changing the settings of a particular
+daemon or subsystem; and, adding a daemon to the cluster or removing a daemon
+from the cluster. The most common use cases for low-level operations include
+growing or shrinking the Ceph cluster and replacing legacy or failed hardware
+with new hardware.
+
+.. toctree::
+ :maxdepth: 1
+
+ add-or-rm-osds
+ add-or-rm-mons
+ Command Reference <control>
+
+
+
+.. raw:: html
+
+ </td><td><h3>Troubleshooting</h3>
+
+Ceph is still on the leading edge, so you may encounter situations that require
+you to evaluate your Ceph configuration and modify your logging and debugging
+settings to identify and remedy issues you are encountering with your cluster.
+
+.. toctree::
+ :maxdepth: 1
+
+ ../troubleshooting/community
+ ../troubleshooting/troubleshooting-mon
+ ../troubleshooting/troubleshooting-osd
+ ../troubleshooting/troubleshooting-pg
+ ../troubleshooting/log-and-debug
+ ../troubleshooting/cpu-profiling
+ ../troubleshooting/memory-profiling
+
+
+
+
+.. raw:: html
+
+ </td></tr></tbody></table>
+
diff --git a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst
new file mode 100644
index 0000000..0107e34
--- /dev/null
+++ b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst
@@ -0,0 +1,617 @@
+=========================
+ Monitoring OSDs and PGs
+=========================
+
+High availability and high reliability require a fault-tolerant approach to
+managing hardware and software issues. Ceph has no single point-of-failure, and
+can service requests for data in a "degraded" mode. Ceph's `data placement`_
+introduces a layer of indirection to ensure that data doesn't bind directly to
+particular OSD addresses. This means that tracking down system faults requires
+finding the `placement group`_ and the underlying OSDs at the root of the problem.
+
+.. tip:: A fault in one part of the cluster may prevent you from accessing a
+ particular object, but that doesn't mean that you cannot access other objects.
+ When you run into a fault, don't panic. Just follow the steps for monitoring
+ your OSDs and placement groups. Then, begin troubleshooting.
+
+Ceph is generally self-repairing. However, when problems persist, monitoring
+OSDs and placement groups will help you identify the problem.
+
+
+Monitoring OSDs
+===============
+
+An OSD's status is either in the cluster (``in``) or out of the cluster
+(``out``); and, it is either up and running (``up``), or it is down and not
+running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
+(you can read and write data) or ``out`` of the cluster. If it was
+``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
+placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
+not assign placement groups to the OSD. If an OSD is ``down``, it should also be
+``out``.
+
+.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
+ will not be in a healthy state.
+
+.. ditaa:: +----------------+ +----------------+
+ | | | |
+ | OSD #n In | | OSD #n Up |
+ | | | |
+ +----------------+ +----------------+
+ ^ ^
+ | |
+ | |
+ v v
+ +----------------+ +----------------+
+ | | | |
+ | OSD #n Out | | OSD #n Down |
+ | | | |
+ +----------------+ +----------------+
+
+If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
+you may notice that the cluster does not always echo back ``HEALTH OK``. Don't
+panic. With respect to OSDs, you should expect that the cluster will **NOT**
+echo ``HEALTH OK`` in a few expected circumstances:
+
+#. You haven't started the cluster yet (it won't respond).
+#. You have just started or restarted the cluster and it's not ready yet,
+ because the placement groups are getting created and the OSDs are in
+ the process of peering.
+#. You just added or removed an OSD.
+#. You have just modified your cluster map.
+
+An important aspect of monitoring OSDs is to ensure that when the cluster
+is up and running, all OSDs that are ``in`` the cluster are ``up`` and
+running, too. To see if all OSDs are running, execute::
+
+ ceph osd stat
+
+The result should tell you the map epoch (eNNNN), the total number of OSDs (x),
+how many are ``up`` (y) and how many are ``in`` (z). ::
+
+ eNNNN: x osds: y up, z in
+
+If the number of OSDs that are ``in`` the cluster is more than the number of
+OSDs that are ``up``, execute the following command to identify the ``ceph-osd``
+daemons that are not running::
+
+ ceph osd tree
+
+::
+
+ dumped osdmap tree epoch 1
+ # id weight type name up/down reweight
+ -1 2 pool openstack
+ -3 2 rack dell-2950-rack-A
+ -2 2 host dell-2950-A1
+ 0 1 osd.0 up 1
+ 1 1 osd.1 down 1
+
+
+.. tip:: The ability to search through a well-designed CRUSH hierarchy may help
+ you troubleshoot your cluster by identifying the physical locations faster.
+
+If an OSD is ``down``, start it::
+
+ sudo systemctl start ceph-osd@1
+
+See `OSD Not Running`_ for problems associated with OSDs that stopped, or won't
+restart.
+
+
+PG Sets
+=======
+
+When CRUSH assigns placement groups to OSDs, it looks at the number of replicas
+for the pool and assigns the placement group to OSDs such that each replica of
+the placement group gets assigned to a different OSD. For example, if the pool
+requires three replicas of a placement group, CRUSH may assign them to
+``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a
+pseudo-random placement that will take into account failure domains you set in
+your `CRUSH map`_, so you will rarely see placement groups assigned to nearest
+neighbor OSDs in a large cluster. We refer to the set of OSDs that should
+contain the replicas of a particular placement group as the **Acting Set**. In
+some cases, an OSD in the Acting Set is ``down`` or otherwise not able to
+service requests for objects in the placement group. When these situations
+arise, don't panic. Common examples include:
+
+- You added or removed an OSD. Then, CRUSH reassigned the placement group to
+ other OSDs--thereby changing the composition of the Acting Set and spawning
+ the migration of data with a "backfill" process.
+- An OSD was ``down``, was restarted, and is now ``recovering``.
+- An OSD in the Acting Set is ``down`` or unable to service requests,
+ and another OSD has temporarily assumed its duties.
+
+Ceph processes a client request using the **Up Set**, which is the set of OSDs
+that will actually handle the requests. In most cases, the Up Set and the Acting
+Set are virtually identical. When they are not, it may indicate that Ceph is
+migrating data, an OSD is recovering, or that there is a problem (i.e., Ceph
+usually echoes a "HEALTH WARN" state with a "stuck stale" message in such
+scenarios).
+
+To retrieve a list of placement groups, execute::
+
+ ceph pg dump
+
+To view which OSDs are within the Acting Set or the Up Set for a given placement
+group, execute::
+
+ ceph pg map {pg-num}
+
+The result should tell you the osdmap epoch (eNNN), the placement group number
+({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set
+(acting[]). ::
+
+ osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2]
+
+.. note:: If the Up Set and Acting Set do not match, this may be an indicator
+ that the cluster is rebalancing itself or that there is a potential problem
+ with the cluster.
+
+
+Peering
+=======
+
+Before you can write data to a placement group, it must be in an ``active``
+state, and it **should** be in a ``clean`` state. For Ceph to determine the
+current state of a placement group, the primary OSD of the placement group
+(i.e., the first OSD in the acting set), peers with the secondary and tertiary
+OSDs to establish agreement on the current state of the placement group
+(assuming a pool with 3 replicas of the PG).
+
+
+.. ditaa:: +---------+ +---------+ +-------+
+ | OSD 1 | | OSD 2 | | OSD 3 |
+ +---------+ +---------+ +-------+
+ | | |
+ | Request To | |
+ | Peer | |
+ |-------------->| |
+ |<--------------| |
+ | Peering |
+ | |
+ | Request To |
+ | Peer |
+ |----------------------------->|
+ |<-----------------------------|
+ | Peering |
+
+The OSDs also report their status to the monitor. See `Configuring Monitor/OSD
+Interaction`_ for details. To troubleshoot peering issues, see `Peering
+Failure`_.
+
+
+Monitoring Placement Group States
+=================================
+
+If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
+you may notice that the cluster does not always echo back ``HEALTH OK``. After
+you check to see if the OSDs are running, you should also check placement group
+states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a
+number of placement group peering-related circumstances:
+
+#. You have just created a pool and placement groups haven't peered yet.
+#. The placement groups are recovering.
+#. You have just added an OSD to or removed an OSD from the cluster.
+#. You have just modified your CRUSH map and your placement groups are migrating.
+#. There is inconsistent data in different replicas of a placement group.
+#. Ceph is scrubbing a placement group's replicas.
+#. Ceph doesn't have enough storage capacity to complete backfilling operations.
+
+If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't
+panic. In many cases, the cluster will recover on its own. In some cases, you
+may need to take action. An important aspect of monitoring placement groups is
+to ensure that when the cluster is up and running, all placement groups are
+``active``, and preferably in the ``clean`` state. To see the status of all
+placement groups, execute::
+
+ ceph pg stat
+
+The result should tell you the placement group map version (vNNNNNN), the total
+number of placement groups (x), and how many placement groups are in a
+particular state such as ``active+clean`` (y). ::
+
+ vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
+
+.. note:: It is common for Ceph to report multiple states for placement groups.
+
+In addition to the placement group states, Ceph will also echo back the amount
+of storage capacity used (aa), the amount of storage capacity remaining (bb),
+and the total storage capacity (cc). These numbers can be important in a
+few cases:
+
+- You are reaching your ``near full ratio`` or ``full ratio``.
+- Your data is not getting distributed across the cluster due to an
+ error in your CRUSH configuration.
+
+
+.. topic:: Placement Group IDs
+
+ Placement group IDs consist of the pool number (not pool name) followed
+ by a period (.) and the placement group ID--a hexadecimal number. You
+ can view pool numbers and their names from the output of ``ceph osd
+ lspools``. For example, the default pool ``rbd`` corresponds to
+ pool number ``0``. A fully qualified placement group ID has the
+ following form::
+
+ {pool-num}.{pg-id}
+
+ And it typically looks like this::
+
+ 0.1f
+
+
+To retrieve a list of placement groups, execute the following::
+
+ ceph pg dump
+
+You can also format the output in JSON format and save it to a file::
+
+ ceph pg dump -o {filename} --format=json
+
+To query a particular placement group, execute the following::
+
+ ceph pg {poolnum}.{pg-id} query
+
+Ceph will output the query in JSON format.
+
+.. code-block:: javascript
+
+ {
+ "state": "active+clean",
+ "up": [
+ 1,
+ 0
+ ],
+ "acting": [
+ 1,
+ 0
+ ],
+ "info": {
+ "pgid": "1.e",
+ "last_update": "4'1",
+ "last_complete": "4'1",
+ "log_tail": "0'0",
+ "last_backfill": "MAX",
+ "purged_snaps": "[]",
+ "history": {
+ "epoch_created": 1,
+ "last_epoch_started": 537,
+ "last_epoch_clean": 537,
+ "last_epoch_split": 534,
+ "same_up_since": 536,
+ "same_interval_since": 536,
+ "same_primary_since": 536,
+ "last_scrub": "4'1",
+ "last_scrub_stamp": "2013-01-25 10:12:23.828174"
+ },
+ "stats": {
+ "version": "4'1",
+ "reported": "536'782",
+ "state": "active+clean",
+ "last_fresh": "2013-01-25 10:12:23.828271",
+ "last_change": "2013-01-25 10:12:23.828271",
+ "last_active": "2013-01-25 10:12:23.828271",
+ "last_clean": "2013-01-25 10:12:23.828271",
+ "last_unstale": "2013-01-25 10:12:23.828271",
+ "mapping_epoch": 535,
+ "log_start": "0'0",
+ "ondisk_log_start": "0'0",
+ "created": 1,
+ "last_epoch_clean": 1,
+ "parent": "0.0",
+ "parent_split_bits": 0,
+ "last_scrub": "4'1",
+ "last_scrub_stamp": "2013-01-25 10:12:23.828174",
+ "log_size": 128,
+ "ondisk_log_size": 128,
+ "stat_sum": {
+ "num_bytes": 205,
+ "num_objects": 1,
+ "num_object_clones": 0,
+ "num_object_copies": 0,
+ "num_objects_missing_on_primary": 0,
+ "num_objects_degraded": 0,
+ "num_objects_unfound": 0,
+ "num_read": 1,
+ "num_read_kb": 0,
+ "num_write": 3,
+ "num_write_kb": 1
+ },
+ "stat_cat_sum": {
+
+ },
+ "up": [
+ 1,
+ 0
+ ],
+ "acting": [
+ 1,
+ 0
+ ]
+ },
+ "empty": 0,
+ "dne": 0,
+ "incomplete": 0
+ },
+ "recovery_state": [
+ {
+ "name": "Started\/Primary\/Active",
+ "enter_time": "2013-01-23 09:35:37.594691",
+ "might_have_unfound": [
+
+ ],
+ "scrub": {
+ "scrub_epoch_start": "536",
+ "scrub_active": 0,
+ "scrub_block_writes": 0,
+ "finalizing_scrub": 0,
+ "scrub_waiting_on": 0,
+ "scrub_waiting_on_whom": [
+
+ ]
+ }
+ },
+ {
+ "name": "Started",
+ "enter_time": "2013-01-23 09:35:31.581160"
+ }
+ ]
+ }
+
+
+
+The following subsections describe common states in greater detail.
+
+Creating
+--------
+
+When you create a pool, Ceph will create the number of placement groups you
+specified. Ceph will echo ``creating`` when it is creating one or more
+placement groups. Once they are created, the OSDs that are part of a placement
+group's Acting Set will peer. Once peering is complete, the placement group
+status should be ``active+clean``, which means a Ceph client can begin writing
+to the placement group.
+
+.. ditaa::
+
+ /-----------\ /-----------\ /-----------\
+ | Creating |------>| Peering |------>| Active |
+ \-----------/ \-----------/ \-----------/
+
+Peering
+-------
+
+When Ceph is peering a placement group, Ceph is bringing the OSDs that
+store the replicas of the placement group into **agreement about the state**
+of the objects and metadata in the placement group. When Ceph completes peering,
+this means that the OSDs that store the placement group agree about the current
+state of the placement group. However, completion of the peering process does
+**NOT** mean that each replica has the latest contents.
+
+.. topic:: Authoritative History
+
+ Ceph will **NOT** acknowledge a write operation to a client, until
+ all OSDs of the acting set persist the write operation. This practice
+ ensures that at least one member of the acting set will have a record
+ of every acknowledged write operation since the last successful
+ peering operation.
+
+ With an accurate record of each acknowledged write operation, Ceph can
+ construct and disseminate a new authoritative history of the placement
+ group--a complete, and fully ordered set of operations that, if performed,
+ would bring an OSD’s copy of a placement group up to date.
+
+
+Active
+------
+
+Once Ceph completes the peering process, a placement group may become
+``active``. The ``active`` state means that the data in the placement group is
+generally available in the primary placement group and the replicas for read
+and write operations.
+
+
+Clean
+-----
+
+When a placement group is in the ``clean`` state, the primary OSD and the
+replica OSDs have successfully peered and there are no stray replicas for the
+placement group. Ceph replicated all objects in the placement group the correct
+number of times.
+
+
+Degraded
+--------
+
+When a client writes an object to the primary OSD, the primary OSD is
+responsible for writing the replicas to the replica OSDs. After the primary OSD
+writes the object to storage, the placement group will remain in a ``degraded``
+state until the primary OSD has received an acknowledgement from the replica
+OSDs that Ceph created the replica objects successfully.
+
+The reason a placement group can be ``active+degraded`` is that an OSD may be
+``active`` even though it doesn't hold all of the objects yet. If an OSD goes
+``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
+The OSDs must peer again when the OSD comes back online. However, a client can
+still write a new object to a ``degraded`` placement group if it is ``active``.
+
+If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
+``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
+to another OSD. The time between being marked ``down`` and being marked ``out``
+is controlled by ``mon osd down out interval``, which is set to ``600`` seconds
+by default.
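+
+A hedged ``ceph.conf`` sketch of this setting, shown at its default::
+
+    [mon]
+    mon osd down out interval = 600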
+
+A placement group can also be ``degraded``, because Ceph cannot find one or more
+objects that Ceph thinks should be in the placement group. While you cannot
+read or write to unfound objects, you can still access all of the other objects
+in the ``degraded`` placement group.
+
+
+Recovering
+----------
+
+Ceph was designed for fault-tolerance at a scale where hardware and software
+problems are ongoing. When an OSD goes ``down``, its contents may fall behind
+the current state of other replicas in the placement groups. When the OSD is
+back ``up``, the contents of the placement groups must be updated to reflect the
+current state. During that time period, the OSD may reflect a ``recovering``
+state.
+
+Recovery is not always trivial, because a hardware failure might cause a
+cascading failure of multiple OSDs. For example, a network switch for a rack or
+cabinet may fail, which can cause the OSDs of a number of host machines to fall
+behind the current state of the cluster. Each one of the OSDs must recover once
+the fault is resolved.
+
+Ceph provides a number of settings to balance the resource contention between
+new service requests and the need to recover data objects and restore the
+placement groups to the current state. The ``osd recovery delay start`` setting
+allows an OSD to restart, re-peer and even process some replay requests before
+starting the recovery process. The ``osd
+recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail,
+restart and re-peer at staggered rates. The ``osd recovery max active`` setting
+limits the number of recovery requests an OSD will entertain simultaneously to
+prevent the OSD from failing to serve requests. The ``osd recovery max chunk`` setting
+limits the size of the recovered data chunks to prevent network congestion.
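+
+A minimal, illustrative ``ceph.conf`` sketch that tunes some of these settings
+might look like the following (the values are examples only, not
+recommendations)::
+
+    [osd]
+    osd recovery delay start = 15
+    osd recovery max active = 5
+    osd recovery max chunk = 1048576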
+
+
+Back Filling
+------------
+
+When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs
+in the cluster to the newly added OSD. Forcing the new OSD to accept the
+reassigned placement groups immediately can put excessive load on the new OSD.
+Back filling the OSD with the placement groups allows this process to begin in
+the background. Once backfilling is complete, the new OSD will begin serving
+requests when it is ready.
+
+During the backfill operations, you may see one of several states:
+``backfill_wait`` indicates that a backfill operation is pending, but is not
+underway yet; ``backfill`` indicates that a backfill operation is underway;
+and, ``backfill_too_full`` indicates that a backfill operation was requested,
+but couldn't be completed due to insufficient storage capacity. When a
+placement group cannot be backfilled, it may be considered ``incomplete``.
+
+Ceph provides a number of settings to manage the load spike associated with
+reassigning placement groups to an OSD (especially a new OSD). By default,
+``osd_max_backfills`` sets the maximum number of concurrent backfills to or from
+an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a backfill
+request if the OSD is approaching its full ratio (90%, by default); this ratio
+can be changed with the ``ceph osd set-backfillfull-ratio`` command.
+If an OSD refuses a backfill request, the ``osd backfill retry interval``
+enables an OSD to retry the request (after 10 seconds, by default). OSDs can
+also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
+intervals (64 and 512, by default).
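+
+For example, an operator who wants to throttle backfill I/O on busy OSDs might
+set the following in ``ceph.conf`` (illustrative values)::
+
+    [osd]
+    osd max backfills = 1
+    osd backfill retry interval = 30
+
+The backfillfull threshold itself can be adjusted at runtime with the
+``ceph osd set-backfillfull-ratio`` command mentioned above.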
+
+
+Remapped
+--------
+
+When the Acting Set that services a placement group changes, the data migrates
+from the old acting set to the new acting set. Because it may take some time
+for the new primary OSD to begin servicing requests, it may ask the old
+primary to continue servicing requests until the placement group migration is
+complete. Once data migration completes, the mapping uses the primary OSD of
+the new acting set.
+
+
+Stale
+-----
+
+While Ceph uses heartbeats to ensure that hosts and daemons are running, the
+``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
+reporting statistics in a timely manner (e.g., a temporary network fault). By
+default, OSD daemons report their placement group, up thru, boot and failure
+statistics every half second (i.e., ``0.5``), which is more frequent than the
+heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
+fails to report to the monitor or if other OSDs have reported the primary OSD
+``down``, the monitors will mark the placement group ``stale``.
+
+When you start your cluster, it is common to see the ``stale`` state until
+the peering process completes. After your cluster has been running for a while,
+seeing placement groups in the ``stale`` state indicates that the primary OSD
+for those placement groups is ``down`` or not reporting placement group statistics
+to the monitor.
+
+
+Identifying Troubled PGs
+========================
+
+As previously noted, a placement group is not necessarily problematic just
+because its state is not ``active+clean``. Generally, when placement groups
+get stuck, it is a sign that Ceph's ability to self-repair may not be
+working. The stuck states include:
+
+- **Unclean**: Placement groups contain objects that are not replicated the
+ desired number of times. They should be recovering.
+- **Inactive**: Placement groups cannot process reads or writes because they
+ are waiting for an OSD with the most up-to-date data to come back ``up``.
+- **Stale**: Placement groups are in an unknown state, because the OSDs that
+ host them have not reported to the monitor cluster in a while (configured
+ by ``mon osd report timeout``).
+
+To identify stuck placement groups, execute the following::
+
+ ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
+
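+For example, to list only the placement groups that are stuck ``inactive``::
+
+    ceph pg dump_stuck inactive
+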
+See `Placement Group Subsystem`_ for additional details. To troubleshoot
+stuck placement groups, see `Troubleshooting PG Errors`_.
+
+
+Finding an Object Location
+==========================
+
+To store object data in the Ceph Object Store, a Ceph client must:
+
+#. Set an object name
+#. Specify a `pool`_
+
+The Ceph client retrieves the latest cluster map and the CRUSH algorithm
+calculates how to map the object to a `placement group`_, and then calculates
+how to assign the placement group to an OSD dynamically. To find the object
+location, all you need is the object name and the pool name. For example::
+
+ ceph osd map {poolname} {object-name}
+
+.. topic:: Exercise: Locate an Object
+
+   As an exercise, let's create an object. Specify an object name, a path to a
+ test file containing some object data and a pool name using the
+ ``rados put`` command on the command line. For example::
+
+ rados put {object-name} {file-path} --pool=data
+ rados put test-object-1 testfile.txt --pool=data
+
+ To verify that the Ceph Object Store stored the object, execute the following::
+
+ rados -p data ls
+
+ Now, identify the object location::
+
+ ceph osd map {pool-name} {object-name}
+ ceph osd map data test-object-1
+
+ Ceph should output the object's location. For example::
+
+ osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]
+
+ To remove the test object, simply delete it using the ``rados rm`` command.
+ For example::
+
+ rados rm test-object-1 --pool=data
+
+
+As the cluster evolves, the object location may change dynamically. One benefit
+of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
+the migration manually. See the `Architecture`_ section for details.
+
+.. _data placement: ../data-placement
+.. _pool: ../pools
+.. _placement group: ../placement-groups
+.. _Architecture: ../../../architecture
+.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running
+.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors
+.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering
+.. _CRUSH map: ../crush-map
+.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/
+.. _Placement Group Subsystem: ../control#placement-group-subsystem
diff --git a/src/ceph/doc/rados/operations/monitoring.rst b/src/ceph/doc/rados/operations/monitoring.rst
new file mode 100644
index 0000000..c291440
--- /dev/null
+++ b/src/ceph/doc/rados/operations/monitoring.rst
@@ -0,0 +1,351 @@
+======================
+ Monitoring a Cluster
+======================
+
+Once you have a running cluster, you may use the ``ceph`` tool to monitor your
+cluster. Monitoring a cluster typically involves checking OSD status, monitor
+status, placement group status and metadata server status.
+
+Using the command line
+======================
+
+Interactive mode
+----------------
+
+To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
+with no arguments. For example::
+
+ ceph
+ ceph> health
+ ceph> status
+ ceph> quorum_status
+ ceph> mon_status
+
+Non-default paths
+-----------------
+
+If you specified non-default locations for your configuration or keyring,
+you may specify their locations::
+
+ ceph -c /path/to/conf -k /path/to/keyring health
+
+Checking a Cluster's Status
+===========================
+
+After you start your cluster, and before you start reading and/or
+writing data, check your cluster's status first.
+
+To check a cluster's status, execute the following::
+
+ ceph status
+
+Or::
+
+ ceph -s
+
+In interactive mode, type ``status`` and press **Enter**. ::
+
+ ceph> status
+
+Ceph will print the cluster status. For example, a tiny Ceph demonstration
+cluster with one of each service may print the following:
+
+::
+
+ cluster:
+ id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20
+ health: HEALTH_OK
+
+ services:
+ mon: 1 daemons, quorum a
+ mgr: x(active)
+ mds: 1/1/1 up {0=a=up:active}
+ osd: 1 osds: 1 up, 1 in
+
+ data:
+ pools: 2 pools, 16 pgs
+ objects: 21 objects, 2246 bytes
+ usage: 546 GB used, 384 GB / 931 GB avail
+ pgs: 16 active+clean
+
+
+.. topic:: How Ceph Calculates Data Usage
+
+ The ``usage`` value reflects the *actual* amount of raw storage used. The
+   The ``xxx GB / xxx GB`` value means the amount available (the lesser
+   number) out of the overall storage capacity of the cluster. The notional
+   number reflects
+ the size of the stored data before it is replicated, cloned or snapshotted.
+ Therefore, the amount of data actually stored typically exceeds the notional
+ amount stored, because Ceph creates replicas of the data and may also use
+ storage capacity for cloning and snapshotting.
+
+
+Watching a Cluster
+==================
+
+In addition to local logging by each daemon, Ceph clusters maintain
+a *cluster log* that records high level events about the whole system.
+This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
+default), but can also be monitored via the command line.
+
+To follow the cluster log, use the following command
+
+::
+
+ ceph -w
+
+Ceph will print the status of the system, followed by each log message as it
+is emitted. For example:
+
+::
+
+ cluster:
+ id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20
+ health: HEALTH_OK
+
+ services:
+ mon: 1 daemons, quorum a
+ mgr: x(active)
+ mds: 1/1/1 up {0=a=up:active}
+ osd: 1 osds: 1 up, 1 in
+
+ data:
+ pools: 2 pools, 16 pgs
+ objects: 21 objects, 2246 bytes
+ usage: 546 GB used, 384 GB / 931 GB avail
+ pgs: 16 active+clean
+
+
+ 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
+ 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
+ 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
+
+
+In addition to using ``ceph -w`` to print log lines as they are emitted,
+use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
+log.
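+
+For example, to show the ten most recent cluster log entries::
+
+    ceph log last 10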
+
+Monitoring Health Checks
+========================
+
+Ceph continuously runs various *health checks* against its own status. When
+a health check fails, this is reflected in the output of ``ceph status`` (or
+``ceph health``). In addition, messages are sent to the cluster log to
+indicate when a check fails, and when the cluster recovers.
+
+For example, when an OSD goes down, the ``health`` section of the status
+output may be updated as follows:
+
+::
+
+ health: HEALTH_WARN
+ 1 osds down
+ Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
+
+At this time, cluster log messages are also emitted to record the failure of the
+health checks:
+
+::
+
+ 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
+ 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
+
+When the OSD comes back online, the cluster log records the cluster's return
+to a healthy state:
+
+::
+
+ 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
+ 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
+ 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
+
+
+Detecting configuration issues
+==============================
+
+In addition to the health checks that Ceph continuously runs on its
+own status, there are some configuration issues that may only be detected
+by an external tool.
+
+Use the `ceph-medic`_ tool to run these additional checks on your Ceph
+cluster's configuration.
+
+Checking a Cluster's Usage Stats
+================================
+
+To check a cluster's data usage and data distribution among pools, you can
+use the ``df`` option. It is similar to Linux ``df``. Execute
+the following::
+
+ ceph df
+
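+On a small test cluster, the output might look something like the following
+(all numbers here are illustrative)::
+
+    GLOBAL:
+        SIZE     AVAIL     RAW USED     %RAW USED
+        931G     840G          91G          9.78
+    POOLS:
+        NAME     ID     USED      %USED     MAX AVAIL     OBJECTS
+        rbd      0      1024M      0.24          420G          24
+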
+The **GLOBAL** section of the output provides an overview of the amount of
+storage your cluster uses for your data.
+
+- **SIZE:** The overall storage capacity of the cluster.
+- **AVAIL:** The amount of free space available in the cluster.
+- **RAW USED:** The amount of raw storage used.
+- **% RAW USED:** The percentage of raw storage used. Use this number in
+ conjunction with the ``full ratio`` and ``near full ratio`` to ensure that
+ you are not reaching your cluster's capacity. See `Storage Capacity`_ for
+ additional details.
+
+The **POOLS** section of the output provides a list of pools and the notional
+usage of each pool. The output from this section **DOES NOT** reflect replicas,
+clones or snapshots. For example, if you store an object with 1MB of data, the
+notional usage will be 1MB, but the actual usage may be 2MB or more depending
+on the number of replicas, clones and snapshots.
+
+- **NAME:** The name of the pool.
+- **ID:** The pool ID.
+- **USED:** The notional amount of data stored in kilobytes, unless the number
+ appends **M** for megabytes or **G** for gigabytes.
+- **%USED:** The notional percentage of storage used per pool.
+- **MAX AVAIL:** An estimate of the notional amount of data that can be written
+ to this pool.
+- **Objects:** The notional number of objects stored per pool.
+
+.. note:: The numbers in the **POOLS** section are notional. They are not
+   inclusive of the number of replicas, snapshots or clones. As a result,
+ the sum of the **USED** and **%USED** amounts will not add up to the
+ **RAW USED** and **%RAW USED** amounts in the **GLOBAL** section of the
+ output.
+
+.. note:: The **MAX AVAIL** value is a complicated function of the
+ replication or erasure code used, the CRUSH rule that maps storage
+ to devices, the utilization of those devices, and the configured
+ mon_osd_full_ratio.
+
+
+
+Checking OSD Status
+===================
+
+You can check OSDs to ensure they are ``up`` and ``in`` by executing::
+
+ ceph osd stat
+
+Or::
+
+ ceph osd dump
+
+You can also view OSDs according to their position in the CRUSH map. ::
+
+ ceph osd tree
+
+Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up
+and their weight. ::
+
+ # id weight type name up/down reweight
+ -1 3 pool default
+ -3 3 rack mainrack
+ -2 3 host osd-host
+ 0 1 osd.0 up 1
+ 1 1 osd.1 up 1
+ 2 1 osd.2 up 1
+
+For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
+
+Checking Monitor Status
+=======================
+
+If your cluster has multiple monitors (likely), you should check the monitor
+quorum status after you start the cluster before reading and/or writing data. A
+quorum must be present when multiple monitors are running. You should also
+check monitor status periodically to ensure that the monitors are running.
+
+To display the monitor map, execute the following::
+
+ ceph mon stat
+
+Or::
+
+ ceph mon dump
+
+To check the quorum status for the monitor cluster, execute the following::
+
+ ceph quorum_status
+
+Ceph will return the quorum status. For example, a Ceph cluster consisting of
+three monitors may return the following:
+
+.. code-block:: javascript
+
+ { "election_epoch": 10,
+ "quorum": [
+ 0,
+ 1,
+ 2],
+ "monmap": { "epoch": 1,
+ "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
+ "modified": "2011-12-12 13:28:27.505520",
+ "created": "2011-12-12 13:28:27.505520",
+ "mons": [
+ { "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:6789\/0"},
+ { "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:6790\/0"},
+ { "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:6791\/0"}
+ ]
+ }
+ }
+
+Checking MDS Status
+===================
+
+Metadata servers provide metadata services for Ceph FS. Metadata servers have
+two sets of states: ``up | down`` and ``active | inactive``. To ensure your
+metadata servers are ``up`` and ``active``, execute the following::
+
+ ceph mds stat
+
+To display details of the metadata cluster, execute the following::
+
+ ceph fs dump
+
+
+Checking Placement Group States
+===============================
+
+Placement groups map objects to OSDs. When you monitor your
+placement groups, you will want them to be ``active`` and ``clean``.
+For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
+
+.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg
+
+
+Using the Admin Socket
+======================
+
+The Ceph admin socket allows you to query a daemon via a socket interface.
+By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon
+via the admin socket, log in to the host running the daemon and use the
+following command::
+
+ ceph daemon {daemon-name}
+ ceph daemon {path-to-socket-file}
+
+For example, the following are equivalent::
+
+ ceph daemon osd.0 foo
+ ceph daemon /var/run/ceph/ceph-osd.0.asok foo
+
+To view the available admin socket commands, execute the following command::
+
+ ceph daemon {daemon-name} help
+
+The admin socket command enables you to show and set your configuration at
+runtime. See `Viewing a Configuration at Runtime`_ for details.
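+
+For example, to dump the running configuration of ``osd.0``, or to change a
+single setting in place (``debug_osd`` is just an example)::
+
+    ceph daemon osd.0 config show
+    ceph daemon osd.0 config set debug_osd 20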
+
+Additionally, you can set configuration values at runtime directly (i.e., the
+admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id}
+injectargs``, which relies on the monitor but doesn't require you to log in
+directly to the host in question).
+
+.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config
+.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
+.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/
diff --git a/src/ceph/doc/rados/operations/operating.rst b/src/ceph/doc/rados/operations/operating.rst
new file mode 100644
index 0000000..791941a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/operating.rst
@@ -0,0 +1,251 @@
+=====================
+ Operating a Cluster
+=====================
+
+.. index:: systemd; operating a cluster
+
+
+Running Ceph with systemd
+==========================
+
+For all distributions that support systemd (CentOS 7, Fedora, Debian 8
+(Jessie) and later, SUSE), Ceph daemons are now managed using native
+systemd files instead of the legacy sysvinit scripts. For example::
+
+ sudo systemctl start ceph.target # start all daemons
+ sudo systemctl status ceph-osd@12 # check status of osd.12
+
+To list the Ceph systemd units on a node, execute::
+
+ sudo systemctl status ceph\*.service ceph\*.target
+
+Starting all Daemons
+--------------------
+
+To start all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo systemctl start ceph.target
+
+
+Stopping all Daemons
+--------------------
+
+To stop all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo systemctl stop ceph\*.service ceph\*.target
+
+
+Starting all Daemons by Type
+----------------------------
+
+To start all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl start ceph-osd.target
+ sudo systemctl start ceph-mon.target
+ sudo systemctl start ceph-mds.target
+
+
+Stopping all Daemons by Type
+----------------------------
+
+To stop all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl stop ceph-mon\*.service ceph-mon.target
+ sudo systemctl stop ceph-osd\*.service ceph-osd.target
+ sudo systemctl stop ceph-mds\*.service ceph-mds.target
+
+
+Starting a Daemon
+-----------------
+
+To start a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl start ceph-osd@{id}
+ sudo systemctl start ceph-mon@{hostname}
+ sudo systemctl start ceph-mds@{hostname}
+
+For example::
+
+ sudo systemctl start ceph-osd@1
+ sudo systemctl start ceph-mon@ceph-server
+ sudo systemctl start ceph-mds@ceph-server
+
+
+Stopping a Daemon
+-----------------
+
+To stop a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo systemctl stop ceph-osd@{id}
+ sudo systemctl stop ceph-mon@{hostname}
+ sudo systemctl stop ceph-mds@{hostname}
+
+For example::
+
+ sudo systemctl stop ceph-osd@1
+ sudo systemctl stop ceph-mon@ceph-server
+ sudo systemctl stop ceph-mds@ceph-server
+
+
+.. index:: Ceph service; Upstart; operating a cluster
+
+
+
+Running Ceph with Upstart
+=========================
+
+When deploying Ceph with ``ceph-deploy`` on Ubuntu Trusty, you may start and
+stop Ceph daemons on a :term:`Ceph Node` using the event-based `Upstart`_.
+Upstart does not require you to define daemon instances in the Ceph
+configuration file.
+
+To list the Ceph Upstart jobs and instances on a node, execute::
+
+ sudo initctl list | grep ceph
+
+See `initctl`_ for additional details.
+
+
+Starting all Daemons
+--------------------
+
+To start all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo start ceph-all
+
+
+Stopping all Daemons
+--------------------
+
+To stop all daemons on a Ceph Node (irrespective of type), execute the
+following::
+
+ sudo stop ceph-all
+
+
+Starting all Daemons by Type
+----------------------------
+
+To start all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo start ceph-osd-all
+ sudo start ceph-mon-all
+ sudo start ceph-mds-all
+
+
+Stopping all Daemons by Type
+----------------------------
+
+To stop all daemons of a particular type on a Ceph Node, execute one of the
+following::
+
+ sudo stop ceph-osd-all
+ sudo stop ceph-mon-all
+ sudo stop ceph-mds-all
+
+
+Starting a Daemon
+-----------------
+
+To start a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo start ceph-osd id={id}
+ sudo start ceph-mon id={hostname}
+ sudo start ceph-mds id={hostname}
+
+For example::
+
+ sudo start ceph-osd id=1
+ sudo start ceph-mon id=ceph-server
+ sudo start ceph-mds id=ceph-server
+
+
+Stopping a Daemon
+-----------------
+
+To stop a specific daemon instance on a Ceph Node, execute one of the
+following::
+
+ sudo stop ceph-osd id={id}
+ sudo stop ceph-mon id={hostname}
+ sudo stop ceph-mds id={hostname}
+
+For example::
+
+ sudo stop ceph-osd id=1
+  sudo stop ceph-mon id=ceph-server
+  sudo stop ceph-mds id=ceph-server
+
+
+.. index:: Ceph service; sysvinit; operating a cluster
+
+
+Running Ceph
+============
+
+Each time you **start**, **restart**, or **stop** Ceph daemons (or your
+entire cluster) you must specify at least one option and one command. You may
+also specify a daemon type or a daemon instance. ::
+
+ {commandline} [options] [commands] [daemons]
+
+
+The ``ceph`` options include:
+
++-----------------+----------+-------------------------------------------------+
+| Option | Shortcut | Description |
++=================+==========+=================================================+
+| ``--verbose`` | ``-v`` | Use verbose logging. |
++-----------------+----------+-------------------------------------------------+
+| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. |
++-----------------+----------+-------------------------------------------------+
+| ``--allhosts``  | ``-a``   | Execute on all nodes in ``ceph.conf``.          |
+| | | Otherwise, it only executes on ``localhost``. |
++-----------------+----------+-------------------------------------------------+
+| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. |
++-----------------+----------+-------------------------------------------------+
+| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. |
++-----------------+----------+-------------------------------------------------+
+| ``--conf`` | ``-c`` | Use an alternate configuration file. |
++-----------------+----------+-------------------------------------------------+
+
+The ``ceph`` commands include:
+
++------------------+------------------------------------------------------------+
+| Command | Description |
++==================+============================================================+
+| ``start`` | Start the daemon(s). |
++------------------+------------------------------------------------------------+
+| ``stop`` | Stop the daemon(s). |
++------------------+------------------------------------------------------------+
+| ``forcestop``    | Force the daemon(s) to stop. Same as ``kill -9``.          |
++------------------+------------------------------------------------------------+
+| ``killall`` | Kill all daemons of a particular type. |
++------------------+------------------------------------------------------------+
+| ``cleanlogs`` | Cleans out the log directory. |
++------------------+------------------------------------------------------------+
+| ``cleanalllogs`` | Cleans out **everything** in the log directory. |
++------------------+------------------------------------------------------------+
+
+For subsystem operations, the ``ceph`` service can target specific daemon types
+by adding a particular daemon type for the ``[daemons]`` option. Daemon types
+include:
+
+- ``mon``
+- ``osd``
+- ``mds``
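+
+For example, on a sysvinit-style deployment the following would be expected to
+start all ``osd`` daemons on the hosts listed in ``ceph.conf`` (a sketch,
+assuming the legacy ``/etc/init.d/ceph`` script is in use)::
+
+    sudo /etc/init.d/ceph -a start osd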
+
+
+
+.. _Valgrind: http://www.valgrind.org/
+.. _Upstart: http://upstart.ubuntu.com/index.html
+.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html
diff --git a/src/ceph/doc/rados/operations/pg-concepts.rst b/src/ceph/doc/rados/operations/pg-concepts.rst
new file mode 100644
index 0000000..636d6bf
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pg-concepts.rst
@@ -0,0 +1,102 @@
+==========================
+ Placement Group Concepts
+==========================
+
+When you execute commands like ``ceph -w``, ``ceph osd dump``, and other
+commands related to placement groups, Ceph may return values using some
+of the following terms:
+
+*Peering*
+ The process of bringing all of the OSDs that store
+ a Placement Group (PG) into agreement about the state
+ of all of the objects (and their metadata) in that PG.
+ Note that agreeing on the state does not mean that
+ they all have the latest contents.
+
+*Acting Set*
+ The ordered list of OSDs who are (or were as of some epoch)
+ responsible for a particular placement group.
+
+*Up Set*
+ The ordered list of OSDs responsible for a particular placement
+ group for a particular epoch according to CRUSH. Normally this
+ is the same as the *Acting Set*, except when the *Acting Set* has
+ been explicitly overridden via ``pg_temp`` in the OSD Map.
+
+*Current Interval* or *Past Interval*
+    A sequence of OSD map epochs during which the *Acting Set* and *Up
+    Set* for a particular placement group do not change.
+
+*Primary*
+    The member (and by convention the first member) of the *Acting Set*
+    that is responsible for coordinating peering, and is
+    the only OSD that will accept client-initiated
+    writes to objects in a placement group.
+
+*Replica*
+ A non-primary OSD in the *Acting Set* for a placement group
+ (and who has been recognized as such and *activated* by the primary).
+
+*Stray*
+ An OSD that is not a member of the current *Acting Set*, but
+ has not yet been told that it can delete its copies of a
+ particular placement group.
+
+*Recovery*
+ Ensuring that copies of all of the objects in a placement group
+ are on all of the OSDs in the *Acting Set*. Once *Peering* has
+ been performed, the *Primary* can start accepting write operations,
+ and *Recovery* can proceed in the background.
+
+*PG Info*
+ Basic metadata about the placement group's creation epoch, the version
+ for the most recent write to the placement group, *last epoch started*,
+ *last epoch clean*, and the beginning of the *current interval*. Any
+ inter-OSD communication about placement groups includes the *PG Info*,
+ such that any OSD that knows a placement group exists (or once existed)
+ also has a lower bound on *last epoch clean* or *last epoch started*.
+
+*PG Log*
+ A list of recent updates made to objects in a placement group.
+ Note that these logs can be truncated after all OSDs
+ in the *Acting Set* have acknowledged up to a certain
+ point.
+
+*Missing Set*
+    Each OSD notes update log entries and, if they imply updates to
+    the contents of an object, adds that object to a list of needed
+    updates. This list is called the *Missing Set* for that ``<OSD,PG>``.
+
+*Authoritative History*
+    A complete and fully ordered set of operations that, if
+ performed, would bring an OSD's copy of a placement group
+ up to date.
+
+*Epoch*
+    A (monotonically increasing) OSD map version number.
+
+*Last Epoch Start*
+ The last epoch at which all nodes in the *Acting Set*
+ for a particular placement group agreed on an
+ *Authoritative History*. At this point, *Peering* is
+ deemed to have been successful.
+
+*up_thru*
+    Before a *Primary* can successfully complete the *Peering* process,
+    it must inform a monitor that it is alive through the current
+    OSD map *Epoch* by having the monitor set its *up_thru* in the osd
+ map. This helps *Peering* ignore previous *Acting Sets* for which
+ *Peering* never completed after certain sequences of failures, such as
+ the second interval below:
+
+ - *acting set* = [A,B]
+ - *acting set* = [A]
+ - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+ - *acting set* = [B] (B restarts, A does not)
+
+*Last Epoch Clean*
+ The last *Epoch* at which all nodes in the *Acting set*
+ for a particular placement group were completely
+ up to date (both placement group logs and object contents).
+ At this point, *recovery* is deemed to have been
+ completed.
diff --git a/src/ceph/doc/rados/operations/pg-repair.rst b/src/ceph/doc/rados/operations/pg-repair.rst
new file mode 100644
index 0000000..0d6692a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pg-repair.rst
@@ -0,0 +1,4 @@
+Repairing PG inconsistencies
+============================
+
+
diff --git a/src/ceph/doc/rados/operations/pg-states.rst b/src/ceph/doc/rados/operations/pg-states.rst
new file mode 100644
index 0000000..0fbd3dc
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pg-states.rst
@@ -0,0 +1,80 @@
+========================
+ Placement Group States
+========================
+
+When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``),
+Ceph will report on the status of the placement groups. A placement group has
+one or more states. The optimum state for placement groups in the placement group
+map is ``active + clean``.
+
+*Creating*
+ Ceph is still creating the placement group.
+
+*Active*
+ Ceph will process requests to the placement group.
+
+*Clean*
+ Ceph replicated all objects in the placement group the correct number of times.
+
+*Down*
+ A replica with necessary data is down, so the placement group is offline.
+
+*Scrubbing*
+ Ceph is checking the placement group for inconsistencies.
+
+*Degraded*
+ Ceph has not replicated some objects in the placement group the correct number of times yet.
+
+*Inconsistent*
+  Ceph detects inconsistencies in one or more replicas of an object in the placement group
+ (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.).
+
+*Peering*
+  The placement group is undergoing the peering process.
+
+*Repair*
+ Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
+
+*Recovering*
+ Ceph is migrating/synchronizing objects and their replicas.
+
+*Forced-Recovery*
+  A high recovery priority for that PG is enforced by the user.
+
+*Backfill*
+ Ceph is scanning and synchronizing the entire contents of a placement group
+ instead of inferring what contents need to be synchronized from the logs of
+ recent operations. *Backfill* is a special case of recovery.
+
+*Forced-Backfill*
+  A high backfill priority for that PG is enforced by the user.
+
+*Wait-backfill*
+ The placement group is waiting in line to start backfill.
+
+*Backfill-toofull*
+ A backfill operation is waiting because the destination OSD is over its
+ full ratio.
+
+*Incomplete*
+ Ceph detects that a placement group is missing information about
+ writes that may have occurred, or does not have any healthy
+ copies. If you see this state, try to start any failed OSDs that may
+  contain the needed information. In the case of an erasure coded pool,
+  temporarily reducing min_size may allow recovery.
+
+*Stale*
+ The placement group is in an unknown state - the monitors have not received
+ an update for it since the placement group mapping changed.
+
+*Remapped*
+ The placement group is temporarily mapped to a different set of OSDs from what
+ CRUSH specified.
+
+*Undersized*
+  The placement group has fewer copies than the configured pool replication level.
+
+*Peered*
+ The placement group has peered, but cannot serve client IO due to not having
+ enough copies to reach the pool's configured min_size parameter. Recovery
+ may occur in this state, so the pg may heal up to min_size eventually.
diff --git a/src/ceph/doc/rados/operations/placement-groups.rst b/src/ceph/doc/rados/operations/placement-groups.rst
new file mode 100644
index 0000000..fee833a
--- /dev/null
+++ b/src/ceph/doc/rados/operations/placement-groups.rst
@@ -0,0 +1,469 @@
+==================
+ Placement Groups
+==================
+
+.. _preselection:
+
+A preselection of pg_num
+========================
+
+When creating a new pool with::
+
+ ceph osd pool create {pool-name} pg_num
+
+it is mandatory to choose the value of ``pg_num`` because it cannot be
+calculated automatically. Here are a few values commonly used:
+
+- Less than 5 OSDs set ``pg_num`` to 128
+
+- Between 5 and 10 OSDs set ``pg_num`` to 512
+
+- Between 10 and 50 OSDs set ``pg_num`` to 1024
+
+- If you have more than 50 OSDs, you need to understand the tradeoffs
+ and how to calculate the ``pg_num`` value by yourself
+
+- To calculate the ``pg_num`` value by yourself, please use the `pgcalc`_ tool
+
+As the number of OSDs increases, choosing the right value for pg_num
+becomes more important because it has a significant influence on the
+behavior of the cluster as well as the durability of the data when
+something goes wrong (i.e. the probability that a catastrophic event
+leads to data loss).
+
+How are Placement Groups used?
+===============================
+
+A placement group (PG) aggregates objects within a pool because
+tracking object placement and object metadata on a per-object basis is
+computationally expensive--i.e., a system with millions of objects
+cannot realistically track placement on a per-object basis.
+
+.. ditaa::
+ /-----\ /-----\ /-----\ /-----\ /-----\
+ | obj | | obj | | obj | | obj | | obj |
+ \-----/ \-----/ \-----/ \-----/ \-----/
+ | | | | |
+ +--------+--------+ +---+----+
+ | |
+ v v
+ +-----------------------+ +-----------------------+
+ | Placement Group #1 | | Placement Group #2 |
+ | | | |
+ +-----------------------+ +-----------------------+
+ | |
+ +------------------------------+
+ |
+ v
+ +-----------------------+
+ | Pool |
+ | |
+ +-----------------------+
+
+The Ceph client will calculate which placement group an object should
+be in. It does this by hashing the object ID and applying an operation
+based on the number of PGs in the defined pool and the ID of the pool.
+See `Mapping PGs to OSDs`_ for details.
+
+The object's contents within a placement group are stored in a set of
+OSDs. For instance, in a replicated pool of size two, each placement
+group will store objects on two OSDs, as shown below.
+
+.. ditaa::
+
+ +-----------------------+ +-----------------------+
+ | Placement Group #1 | | Placement Group #2 |
+ | | | |
+ +-----------------------+ +-----------------------+
+ | | | |
+ v v v v
+ /----------\ /----------\ /----------\ /----------\
+ | | | | | | | |
+ | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 |
+ | | | | | | | |
+ \----------/ \----------/ \----------/ \----------/
+
+
+Should OSD #2 fail, another will be assigned to Placement Group #1 and
+will be filled with copies of all objects in OSD #1. If the pool size
+is changed from two to three, an additional OSD will be assigned to
+the placement group and will receive copies of all objects in the
+placement group.
+
+Placement groups do not own the OSD, they share it with other
+placement groups from the same pool or even other pools. If OSD #2
+fails, the Placement Group #2 will also have to restore copies of
+objects, using OSD #3.
+
+When the number of placement groups increases, the new placement
+groups will be assigned OSDs. The result of the CRUSH function will
+also change and some objects from the former placement groups will be
+copied over to the new Placement Groups and removed from the old ones.
+
+Placement Groups Tradeoffs
+==========================
+
+Data durability and even distribution among all OSDs call for more
+placement groups but their number should be reduced to the minimum to
+save CPU and memory.
+
+.. _data durability:
+
+Data durability
+---------------
+
+After an OSD fails, the risk of data loss increases until the data it
+contained is fully recovered. Let's imagine a scenario that causes
+permanent data loss in a single placement group:
+
+- The OSD fails and all copies of the object it contains are lost.
+  For all objects within the placement group the number of replicas
+  suddenly drops from three to two.
+
+- Ceph starts recovery for this placement group by choosing a new OSD
+  to re-create the third copy of all objects.
+
+- Another OSD, within the same placement group, fails before the new
+  OSD is fully populated with the third copy. Some objects will then
+  only have one surviving copy.
+
+- Ceph picks yet another OSD and keeps copying objects to restore the
+ desired number of copies.
+
+- A third OSD, within the same placement group, fails before recovery
+ is complete. If this OSD contained the only remaining copy of an
+ object, it is permanently lost.
+
+In a cluster containing 10 OSDs with 512 placement groups in a three
+replica pool, CRUSH will give each placement group three OSDs. In the
+end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement
+Groups. When the first OSD fails, the above scenario will therefore
+start recovery for all 150 placement groups at the same time.
+
+The 150 placement groups being recovered are likely to be
+homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
+therefore likely to send copies of objects to all others and also
+receive some new objects to be stored because they became part of a
+new placement group.
+
+The amount of time it takes for this recovery to complete entirely
+depends on the architecture of the Ceph cluster. Let's say each OSD is
+hosted by a 1TB SSD on a single machine, all of them are connected
+to a 10Gb/s switch, and the recovery for a single OSD completes within
+M minutes. If there are two OSDs per machine using spinners with no
+SSD journal and a 1Gb/s switch, it will be at least an order of
+magnitude slower.
+
+In a cluster of this size, the number of placement groups has almost
+no influence on data durability. It could be 128 or 8192 and the
+recovery would not be slower or faster.
+
+However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
+is likely to speed up recovery and therefore improve data durability
+significantly. Each OSD now participates in only ~75 placement groups
+instead of ~150 when there were only 10 OSDs and it will still require
+all 19 remaining OSDs to perform the same amount of object copies in
+order to recover. But where 10 OSDs had to copy approximately 100GB
+each, they now have to copy 50GB each instead. If the network was the
+bottleneck, recovery will happen twice as fast. In other words,
+recovery goes faster when the number of OSDs increases.
+
+If this cluster grows to 40 OSDs, each of them will only host ~35
+placement groups. If an OSD dies, recovery will keep going faster
+unless it is blocked by another bottleneck. However, if this cluster
+grows to 200 OSDs, each of them will only host ~7 placement groups. If
+an OSD dies, recovery will happen among at most ~21 (7 * 3) OSDs
+in these placement groups: recovery will take longer than when there
+were 40 OSDs, meaning the number of placement groups should be
+increased.
+
+No matter how short the recovery time is, there is a chance for a
+second OSD to fail while it is in progress. In the 10 OSDs cluster
+described above, if any of them fail, then ~17 placement groups
+(i.e. ~150 / 9 placement groups being recovered) will only have one
+surviving copy. And if any of the 8 remaining OSDs fail, the last
+objects of two placement groups are likely to be lost (i.e. ~17 / 8
+placement groups with only one remaining copy being recovered).
+
+When the size of the cluster grows to 20 OSDs, the number of Placement
+Groups damaged by the loss of three OSDs drops. The second OSD lost
+will degrade ~4 (i.e. ~75 / 19 placement groups being recovered)
+instead of ~17 and the third OSD lost will only lose data if it is one
+of the four OSDs containing the surviving copy. In other words, if the
+probability of losing one OSD is 0.0001% during the recovery time
+frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 *
+0.0001% in the cluster with 20 OSDs.
+
+In a nutshell, more OSDs mean faster recovery and a lower risk of
+cascading failures leading to the permanent loss of a Placement
+Group. Having 512 or 4096 Placement Groups is roughly equivalent in a
+cluster with less than 50 OSDs as far as data durability is concerned.
+
+Note: It may take a long time for a new OSD added to the cluster to be
+populated with placement groups that were assigned to it. However,
+there is no degradation of any object and it has no impact on the
+durability of the data contained in the cluster.
+
+.. _object distribution:
+
+Object distribution within a pool
+---------------------------------
+
+Ideally objects are evenly distributed in each placement group. Since
+CRUSH computes the placement group for each object, but does not
+actually know how much data is stored in each OSD within this
+placement group, the ratio between the number of placement groups and
+the number of OSDs may influence the distribution of the data
+significantly.
+
+For instance, if there was a single placement group for ten OSDs in a
+three replica pool, only three OSDs would be used because CRUSH would
+have no other choice. When more placement groups are available,
+objects are more likely to be evenly spread among them. CRUSH also
+makes every effort to evenly spread OSDs among all existing Placement
+Groups.
+
+As long as there are one or two orders of magnitude more Placement
+Groups than OSDs, the distribution should be even. For instance, 300
+placement groups for 3 OSDs, 1000 placement groups for 10 OSDs etc.
+
+Uneven data distribution can be caused by factors other than the ratio
+between OSDs and placement groups. Since CRUSH does not take into
+account the size of the objects, a few very large objects may create
+an imbalance. Let's say one million 4K objects totaling 4GB are evenly
+spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10
+= 400MB on each OSD. If one 400MB object is added to the pool, the
+three OSDs supporting the placement group in which the object has been
+placed will be filled with 400MB + 400MB = 800MB while the seven
+others will remain occupied with only 400MB.
+
+.. _resource usage:
+
+Memory, CPU and network usage
+-----------------------------
+
+For each placement group, OSDs and MONs need memory, network and CPU
+at all times and even more during recovery. Sharing this overhead by
+clustering objects within a placement group is one of the main reasons
+they exist.
+
+Minimizing the number of placement groups saves significant amounts of
+resources.
+
+Choosing the number of Placement Groups
+=======================================
+
+If you have more than 50 OSDs, we recommend approximately 50-100
+placement groups per OSD to balance out resource usage, data
+durability and distribution. If you have less than 50 OSDs, choosing
+among the `preselection`_ above is best. For a single pool of objects,
+you can use the following formula to get a baseline::
+
+ (OSDs * 100)
+ Total PGs = ------------
+ pool size
+
+Where **pool size** is either the number of replicas for replicated
+pools or the K+M sum for erasure coded pools (as returned by **ceph
+osd erasure-code-profile get**).
+
+You should then check if the result makes sense with the way you
+designed your Ceph cluster to maximize `data durability`_,
+`object distribution`_ and minimize `resource usage`_.
+
+The result should be **rounded up to the nearest power of two.**
+Rounding up is optional, but recommended for CRUSH to evenly balance
+the number of objects among placement groups.
+
+As an example, for a cluster with 200 OSDs and a pool size of 3
+replicas, you would estimate your number of PGs as follows::
+
+ (200 * 100)
+ ----------- = 6667. Nearest power of 2: 8192
+ 3
+
+When using multiple data pools for storing objects, you need to ensure
+that you balance the number of placement groups per pool with the
+number of placement groups per OSD so that you arrive at a reasonable
+total number of placement groups that provides reasonably low variance
+per OSD without taxing system resources or making the peering process
+too slow.
+
+For instance, a cluster of 10 pools each with 512 placement groups on
+ten OSDs is a total of 5,120 placement groups spread over ten OSDs,
+that is 512 placement groups per OSD. That does not use too many
+resources. However, if 1,000 pools were created with 512 placement
+groups each, the OSDs will handle ~50,000 placement groups each and it
+would require significantly more resources and time for peering.
+
+You may find the `PGCalc`_ tool helpful.
+
+
+.. _setting the number of placement groups:
+
+Set the Number of Placement Groups
+==================================
+
+To set the number of placement groups in a pool, you must specify the
+number of placement groups at the time you create the pool.
+See `Create a Pool`_ for details. Once you have set placement groups for a
+pool, you may increase the number of placement groups (but you cannot
+decrease the number of placement groups). To increase the number of
+placement groups, execute the following::
+
+ ceph osd pool set {pool-name} pg_num {pg_num}
+
+Once you increase the number of placement groups, you must also
+increase the number of placement groups for placement (``pgp_num``)
+before your cluster will rebalance. The ``pgp_num`` will be the number of
+placement groups that will be considered for placement by the CRUSH
+algorithm. Increasing ``pg_num`` splits the placement groups but data
+will not be migrated to the newer placement groups until placement
+groups for placement, i.e. ``pgp_num``, is increased. The ``pgp_num``
+should be equal to the ``pg_num``. To increase the number of
+placement groups for placement, execute the following::
+
+ ceph osd pool set {pool-name} pgp_num {pgp_num}
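+
+For example, to grow a hypothetical pool named ``mypool`` to 128 placement
+groups and then allow the data to rebalance::
+
+    ceph osd pool set mypool pg_num 128
+    ceph osd pool set mypool pgp_num 128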
+
+
+Get the Number of Placement Groups
+==================================
+
+To get the number of placement groups in a pool, execute the following::
+
+ ceph osd pool get {pool-name} pg_num
+
+
+Get a Cluster's PG Statistics
+=============================
+
+To get the statistics for the placement groups in your cluster, execute the following::
+
+ ceph pg dump [--format {format}]
+
+Valid formats are ``plain`` (default) and ``json``.
+
+
+Get Statistics for Stuck PGs
+============================
+
+To get the statistics for all placement groups stuck in a specified state,
+execute the following::
+
+ ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]
+
+**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
+with the most up-to-date data to come up and in.
+
+**Unclean** Placement groups contain objects that are not replicated the desired number
+of times. They should be recovering.
+
+**Stale** Placement groups are in an unknown state - the OSDs that host them have not
+reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``).
+
+Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number
+of seconds the placement group is stuck before including it in the returned statistics
+(default 300 seconds).
+
+
+Get a PG Map
+============
+
+To get the placement group map for a particular placement group, execute the following::
+
+ ceph pg map {pg-id}
+
+For example::
+
+ ceph pg map 1.6c
+
+Ceph will return the placement group map, the placement group, and the OSD status::
+
+ osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
+
+
+Get a PG's Statistics
+=====================
+
+To retrieve statistics for a particular placement group, execute the following::
+
+ ceph pg {pg-id} query
+
+
+Scrub a Placement Group
+=======================
+
+To scrub a placement group, execute the following::
+
+ ceph pg scrub {pg-id}
+
+Ceph checks the primary and any replica nodes, generates a catalog of all objects
+in the placement group and compares them to ensure that no objects are missing
+or mismatched, and their contents are consistent. Assuming the replicas all
+match, a final semantic sweep ensures that all of the snapshot-related object
+metadata is consistent. Errors are reported via logs.
+
+Prioritize backfill/recovery of Placement Group(s)
+====================================================
+
+You may run into a situation where a bunch of placement groups will require
+recovery and/or backfill, and some particular groups hold data more important
+than others (for example, those PGs may hold data for images used by running
+machines and other PGs may be used by inactive machines/less relevant data).
+In that case, you may want to prioritize recovery of those groups so
+performance and/or availability of data stored on those groups is restored
+earlier. To do this (mark particular placement group(s) as prioritized during
+backfill or recovery), execute the following::
+
+ ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+ ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+
+This will cause Ceph to perform recovery or backfill on specified placement
+groups first, before other placement groups. This does not interrupt currently
+ongoing backfills or recovery, but causes specified PGs to be processed
+as soon as possible. If you change your mind or prioritized the wrong groups,
+use::
+
+ ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+ ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]
+
+This will remove the "force" flag from those PGs and they will be processed
+in default order. Again, this doesn't affect currently processed placement
+groups, only those that are still queued.
+
+The "force" flag is cleared automatically after recovery or backfill of group
+is done.
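+
+For example, to prioritize recovery of two hypothetical placement groups and
+then undo it::
+
+    ceph pg force-recovery 2.14 2.1f
+    ceph pg cancel-force-recovery 2.14 2.1f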
+
+Revert Lost
+===========
+
+If the cluster has lost one or more objects, and you have decided to
+abandon the search for the lost data, you must mark the unfound objects
+as ``lost``.
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This is
+possible given unusual combinations of failures that allow the cluster
+to learn about writes that were performed before the writes themselves
+are recovered.
+
+Currently the only supported option is "revert", which will either roll back to
+a previous version of the object or (if it was a new object) forget about it
+entirely. To mark the "unfound" objects as "lost", execute the following::
+
+ ceph pg {pg-id} mark_unfound_lost revert|delete
+
+.. important:: Use this feature with caution, because it may confuse
+ applications that expect the object(s) to exist.
+
+
+.. toctree::
+ :hidden:
+
+ pg-states
+ pg-concepts
+
+
+.. _Create a Pool: ../pools#createpool
+.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds
+.. _pgcalc: http://ceph.com/pgcalc/
diff --git a/src/ceph/doc/rados/operations/pools.rst b/src/ceph/doc/rados/operations/pools.rst
new file mode 100644
index 0000000..7015593
--- /dev/null
+++ b/src/ceph/doc/rados/operations/pools.rst
@@ -0,0 +1,798 @@
+=======
+ Pools
+=======
+
+When you first deploy a cluster without creating a pool, Ceph uses the default
+pools for storing data. A pool provides you with:
+
+- **Resilience**: You can set how many OSDs are allowed to fail without losing data.
+ For replicated pools, it is the desired number of copies/replicas of an object.
+ A typical configuration stores an object and one additional copy
+ (i.e., ``size = 2``), but you can determine the number of copies/replicas.
+ For `erasure coded pools <../erasure-code>`_, it is the number of coding chunks
+ (i.e. ``m=2`` in the **erasure code profile**)
+
+- **Placement Groups**: You can set the number of placement groups for the pool.
+ A typical configuration uses approximately 100 placement groups per OSD to
+ provide optimal balancing without using up too many computing resources. When
+ setting up multiple pools, be careful to ensure you set a reasonable number of
+ placement groups for both the pool and the cluster as a whole.
+
+- **CRUSH Rules**: When you store data in a pool, a CRUSH ruleset mapped to the
+ pool enables CRUSH to identify a rule for the placement of the object
+ and its replicas (or chunks for erasure coded pools) in your cluster.
+ You can create a custom CRUSH rule for your pool.
+
+- **Snapshots**: When you create snapshots with ``ceph osd pool mksnap``,
+ you effectively take a snapshot of a particular pool.
+
+To organize data into pools, you can list, create, and remove pools.
+You can also view the utilization statistics for each pool.
+
+List Pools
+==========
+
+To list your cluster's pools, execute::
+
+ ceph osd lspools
+
+On a freshly installed cluster, only the ``rbd`` pool exists.
+
+
+.. _createpool:
+
+Create a Pool
+=============
+
+Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_.
+Ideally, you should override the default value for the number of placement
+groups in your Ceph configuration file, as the default is NOT ideal.
+For details on placement group numbers refer to `setting the number of placement groups`_
+
+.. note:: Starting with Luminous, all pools need to be associated to the
+ application using the pool. See `Associate Pool to Application`_ below for
+ more information.
+
+For example::
+
+ osd pool default pg num = 100
+ osd pool default pgp num = 100
+
+To create a pool, execute::
+
+ ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] \
+ [crush-rule-name] [expected-num-objects]
+ ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure \
+ [erasure-code-profile] [crush-rule-name] [expected_num_objects]
+
+Where:
+
+``{pool-name}``
+
+:Description: The name of the pool. It must be unique.
+:Type: String
+:Required: Yes.
+
+``{pg-num}``
+
+:Description: The total number of placement groups for the pool. See `Placement
+ Groups`_ for details on calculating a suitable number. The
+ default value ``8`` is NOT suitable for most systems.
+
+:Type: Integer
+:Required: Yes.
+:Default: 8
+
+``{pgp-num}``
+
+:Description: The total number of placement groups for placement purposes. This
+ **should be equal to the total number of placement groups**, except
+ for placement group splitting scenarios.
+
+:Type: Integer
+:Required: Yes. Picks up default or Ceph configuration value if not specified.
+:Default: 8
+
+``{replicated|erasure}``
+
+:Description: The pool type which may either be **replicated** to
+ recover from lost OSDs by keeping multiple copies of the
+ objects or **erasure** to get a kind of
+ `generalized RAID5 <../erasure-code>`_ capability.
+ The **replicated** pools require more
+ raw storage but implement all Ceph operations. The
+ **erasure** pools require less raw storage but only
+ implement a subset of the available operations.
+
+:Type: String
+:Required: No.
+:Default: replicated
+
+``[crush-rule-name]``
+
+:Description: The name of a CRUSH rule to use for this pool. The specified
+ rule must exist.
+
+:Type: String
+:Required: No.
+:Default: For **replicated** pools it is the ruleset specified by the ``osd
+ pool default crush replicated ruleset`` config variable. This
+ ruleset must exist.
+ For **erasure** pools it is ``erasure-code`` if the ``default``
+ `erasure code profile`_ is used or ``{pool-name}`` otherwise. This
+ ruleset will be created implicitly if it doesn't exist already.
+
+
+``[erasure-code-profile=profile]``
+
+.. _erasure code profile: ../erasure-code-profile
+
+:Description: For **erasure** pools only. Use the `erasure code profile`_. It
+ must be an existing profile as defined by
+ **osd erasure-code-profile set**.
+
+:Type: String
+:Required: No.
+
+When you create a pool, set the number of placement groups to a reasonable value
+(e.g., ``100``). Consider the total number of placement groups per OSD too.
+Placement groups are computationally expensive, so performance will degrade when
+you have many pools with many placement groups (e.g., 50 pools with 100
+placement groups each). The point of diminishing returns depends upon the power
+of the OSD host.
+
+See `Placement Groups`_ for details on calculating an appropriate number of
+placement groups for your pool.
+
+.. _Placement Groups: ../placement-groups
+
+``[expected-num-objects]``
+
+:Description: The expected number of objects for this pool. Setting this value
+              (together with a negative **filestore merge threshold**) causes
+              PG folder splitting to happen at pool creation time, avoiding
+              the latency impact of runtime folder splitting.
+
+:Type: Integer
+:Required: No.
+:Default: 0, no splitting at the pool creation time.
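+
+For example, a minimal replicated pool and a minimal erasure coded pool might
+be created as follows (the pool names and PG counts here are illustrative)::
+
+    ceph osd pool create mypool 128 128 replicated
+    ceph osd pool create ecpool 128 128 erasure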
+
+Associate Pool to Application
+=============================
+
+Pools need to be associated with an application before use. Pools that will be
+used with CephFS or pools that are automatically created by RGW are
+automatically associated. Pools that are intended for use with RBD should be
+initialized using the ``rbd`` tool (see `Block Device Commands`_ for more
+information).
+
+For other cases, you can manually associate a free-form application name with
+a pool::
+
+ ceph osd pool application enable {pool-name} {application-name}
+
+.. note:: CephFS uses the application name ``cephfs``, RBD uses the
+ application name ``rbd``, and RGW uses the application name ``rgw``.
+
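+For example, a free-form name for a hypothetical custom application could be
+enabled like this (both names are illustrative)::
+
+    ceph osd pool application enable mypool myapp
+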
+Set Pool Quotas
+===============
+
+You can set pool quotas for the maximum number of bytes and/or the maximum
+number of objects per pool. ::
+
+ ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}]
+
+For example::
+
+ ceph osd pool set-quota data max_objects 10000
+
+To remove a quota, set its value to ``0``.
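+
+For example, to clear the object quota set above::
+
+    ceph osd pool set-quota data max_objects 0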
+
+
+Delete a Pool
+=============
+
+To delete a pool, execute::
+
+ ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
+
+
+To remove a pool, the ``mon_allow_pool_delete`` flag must be set to ``true``
+in the monitors' configuration. Otherwise, the monitors will refuse to remove
+the pool.
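+
+For example, one common approach is to enable deletion temporarily at runtime
+and revert it afterwards (the pool name is illustrative)::
+
+    ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
+    ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
+    ceph tell mon.\* injectargs '--mon-allow-pool-delete=false'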
+
+See `Monitor Configuration`_ for more information.
+
+.. _Monitor Configuration: ../../configuration/mon-config-ref
+
+If you created your own rulesets and rules for a pool you created, you should
+consider removing them when you no longer need your pool::
+
+ ceph osd pool get {pool-name} crush_ruleset
+
+If the ruleset was "123", for example, you can check the other pools like so::
+
+ ceph osd dump | grep "^pool" | grep "crush_ruleset 123"
+
+If no other pools use that custom ruleset, then it's safe to delete that
+ruleset from the cluster.
+
+If you created users with permissions strictly for a pool that no longer
+exists, you should consider deleting those users too::
+
+ ceph auth ls | grep -C 5 {pool-name}
+ ceph auth del {user}
+
+
+Rename a Pool
+=============
+
+To rename a pool, execute::
+
+ ceph osd pool rename {current-pool-name} {new-pool-name}
+
+If you rename a pool and you have per-pool capabilities for an authenticated
+user, you must update the user's capabilities (i.e., caps) with the new pool
+name.
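+
+For example, if a hypothetical user ``client.foo`` had capabilities scoped to
+the old pool name, the caps might be refreshed like this::
+
+    ceph auth caps client.foo mon 'allow r' osd 'allow rw pool={new-pool-name}'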
+
+.. note:: Version ``0.48`` Argonaut and above.
+
+Show Pool Statistics
+====================
+
+To show a pool's utilization statistics, execute::
+
+ rados df
+
+
+Make a Snapshot of a Pool
+=========================
+
+To make a snapshot of a pool, execute::
+
+ ceph osd pool mksnap {pool-name} {snap-name}
+
+.. note:: Version ``0.48`` Argonaut and above.
+
+
+Remove a Snapshot of a Pool
+===========================
+
+To remove a snapshot of a pool, execute::
+
+ ceph osd pool rmsnap {pool-name} {snap-name}
+
+.. note:: Version ``0.48`` Argonaut and above.
+
+.. _setpoolvalues:
+
+
+Set Pool Values
+===============
+
+To set a value to a pool, execute the following::
+
+ ceph osd pool set {pool-name} {key} {value}
+
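+For example, to enable aggressive inline compression on a hypothetical
+BlueStore-backed pool named ``mypool``::
+
+    ceph osd pool set mypool compression_mode aggressive
+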
+You may set values for the following keys:
+
+.. _compression_algorithm:
+
+``compression_algorithm``
+
+:Description: Sets inline compression algorithm to use for underlying BlueStore.
+ This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression algorithm``.
+
+:Type: String
+:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd``
+
+``compression_mode``
+
+:Description: Sets the policy for the inline compression algorithm for underlying BlueStore.
+ This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression mode``.
+
+:Type: String
+:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
+
+``compression_min_blob_size``
+
+:Description: Chunks smaller than this are never compressed.
+ This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression min blob *``.
+
+:Type: Unsigned Integer
+
+``compression_max_blob_size``
+
+:Description: Chunks larger than this are broken into smaller blobs of at most
+              ``compression_max_blob_size`` bytes before being compressed.
+
+:Type: Unsigned Integer
+
+.. _size:
+
+``size``
+
+:Description: Sets the number of replicas for objects in the pool.
+ See `Set the Number of Object Replicas`_ for further details.
+ Replicated pools only.
+
+:Type: Integer
+
+.. _min_size:
+
+``min_size``
+
+:Description: Sets the minimum number of replicas required for I/O.
+ See `Set the Number of Object Replicas`_ for further details.
+ Replicated pools only.
+
+:Type: Integer
+:Version: ``0.54`` and above
+
+.. _pg_num:
+
+``pg_num``
+
+:Description: The effective number of placement groups to use when calculating
+ data placement.
+:Type: Integer
+:Valid Range: Greater than the current ``pg_num`` value.
+
+.. _pgp_num:
+
+``pgp_num``
+
+:Description: The effective number of placement groups for placement to use
+ when calculating data placement.
+
+:Type: Integer
+:Valid Range: Equal to or less than ``pg_num``.
+
+.. _crush_ruleset:
+
+``crush_ruleset``
+
+:Description: The ruleset to use for mapping object placement in the cluster.
+:Type: Integer
+
+.. _allow_ec_overwrites:
+
+``allow_ec_overwrites``
+
+:Description: Whether writes to an erasure coded pool can update part
+              of an object, so CephFS and RBD can use it. See
+              `Erasure Coding with Overwrites`_ for more details.
+:Type: Boolean
+:Version: ``12.2.0`` and above
+
+.. _hashpspool:
+
+``hashpspool``
+
+:Description: Set/Unset HASHPSPOOL flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``0.48`` Argonaut and above.
+
+.. _nodelete:
+
+``nodelete``
+
+:Description: Set/Unset NODELETE flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``FIXME``
+
+.. _nopgchange:
+
+``nopgchange``
+
+:Description: Set/Unset NOPGCHANGE flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``FIXME``
+
+.. _nosizechange:
+
+``nosizechange``
+
+:Description: Set/Unset NOSIZECHANGE flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+:Version: Version ``FIXME``
+
+.. _write_fadvise_dontneed:
+
+``write_fadvise_dontneed``
+
+:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+
+.. _noscrub:
+
+``noscrub``
+
+:Description: Set/Unset NOSCRUB flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+
+.. _nodeep-scrub:
+
+``nodeep-scrub``
+
+:Description: Set/Unset NODEEP_SCRUB flag on a given pool.
+:Type: Integer
+:Valid Range: 1 sets flag, 0 unsets flag
+
+.. _hit_set_type:
+
+``hit_set_type``
+
+:Description: Enables hit set tracking for cache pools.
+ See `Bloom Filter`_ for additional information.
+
+:Type: String
+:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
+:Default: ``bloom``. Other values are for testing.
+
+.. _hit_set_count:
+
+``hit_set_count``
+
+:Description: The number of hit sets to store for cache pools. The higher
+ the number, the more RAM consumed by the ``ceph-osd`` daemon.
+
+:Type: Integer
+:Valid Range: ``1``. Agent doesn't handle > 1 yet.
+
+.. _hit_set_period:
+
+``hit_set_period``
+
+:Description: The duration of a hit set period in seconds for cache pools.
+ The higher the number, the more RAM consumed by the
+ ``ceph-osd`` daemon.
+
+:Type: Integer
+:Example: ``3600`` 1hr
+
+.. _hit_set_fpp:
+
+``hit_set_fpp``
+
+:Description: The false positive probability for the ``bloom`` hit set type.
+ See `Bloom Filter`_ for additional information.
+
+:Type: Double
+:Valid Range: 0.0 - 1.0
+:Default: ``0.05``
+
+.. _cache_target_dirty_ratio:
+
+``cache_target_dirty_ratio``
+
+:Description: The percentage of the cache pool containing modified (dirty)
+ objects before the cache tiering agent will flush them to the
+ backing storage pool.
+
+:Type: Double
+:Default: ``.4``
+
+.. _cache_target_dirty_high_ratio:
+
+``cache_target_dirty_high_ratio``
+
+:Description: The percentage of the cache pool containing modified (dirty)
+ objects before the cache tiering agent will flush them to the
+ backing storage pool with a higher speed.
+
+:Type: Double
+:Default: ``.6``
+
+.. _cache_target_full_ratio:
+
+``cache_target_full_ratio``
+
+:Description: The percentage of the cache pool containing unmodified (clean)
+ objects before the cache tiering agent will evict them from the
+ cache pool.
+
+:Type: Double
+:Default: ``.8``
+
+.. _target_max_bytes:
+
+``target_max_bytes``
+
+:Description: Ceph will begin flushing or evicting objects when the
+ ``max_bytes`` threshold is triggered.
+
+:Type: Integer
+:Example: ``1000000000000`` #1-TB
+
+.. _target_max_objects:
+
+``target_max_objects``
+
+:Description: Ceph will begin flushing or evicting objects when the
+ ``max_objects`` threshold is triggered.
+
+:Type: Integer
+:Example: ``1000000`` #1M objects
+
+
+``hit_set_grade_decay_rate``
+
+:Description: Temperature decay rate between two successive hit_sets
+:Type: Integer
+:Valid Range: 0 - 100
+:Default: ``20``
+
+
+``hit_set_search_last_n``
+
+:Description: Count at most N appearances in hit_sets for temperature calculation
+:Type: Integer
+:Valid Range: 0 - hit_set_count
+:Default: ``1``
+
+
+.. _cache_min_flush_age:
+
+``cache_min_flush_age``
+
+:Description: The time (in seconds) before the cache tiering agent will flush
+ an object from the cache pool to the storage pool.
+
+:Type: Integer
+:Example: ``600`` 10min
+
+.. _cache_min_evict_age:
+
+``cache_min_evict_age``
+
+:Description: The time (in seconds) before the cache tiering agent will evict
+ an object from the cache pool.
+
+:Type: Integer
+:Example: ``1800`` 30min
+
+.. _fast_read:
+
+``fast_read``
+
+:Description: On an erasure coded pool, if this flag is turned on, the read
+              request issues sub-reads to all shards and waits until it
+              receives enough shards to decode before serving the client. With
+              the jerasure and isa erasure plugins, once the first K replies
+              return, the client's request is served immediately using the
+              data decoded from these replies. This trades some resources for
+              better performance. Currently this flag is only supported for
+              erasure coded pools.
+
+:Type: Boolean
+:Defaults: ``0``
+
+.. _scrub_min_interval:
+
+``scrub_min_interval``
+
+:Description: The minimum interval in seconds for pool scrubbing when
+              load is low. If it is 0, the value ``osd_scrub_min_interval``
+              from config is used.
+
+:Type: Double
+:Default: ``0``
+
+.. _scrub_max_interval:
+
+``scrub_max_interval``
+
+:Description: The maximum interval in seconds for pool scrubbing
+              irrespective of cluster load. If it is 0, the value
+              ``osd_scrub_max_interval`` from config is used.
+
+:Type: Double
+:Default: ``0``
+
+.. _deep_scrub_interval:
+
+``deep_scrub_interval``
+
+:Description: The interval in seconds for pool "deep" scrubbing. If it
+              is 0, the value ``osd_deep_scrub_interval`` from config is used.
+
+:Type: Double
+:Default: ``0``
+
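+As an end-to-end illustration, several of the keys above could be combined to
+configure a hypothetical cache-tier pool named ``hot-pool``::
+
+    ceph osd pool set hot-pool hit_set_type bloom
+    ceph osd pool set hot-pool hit_set_period 3600
+    ceph osd pool set hot-pool target_max_bytes 1000000000000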
+
+Get Pool Values
+===============
+
+To get a value from a pool, execute the following::
+
+ ceph osd pool get {pool-name} {key}
+
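+For example, to check the replica count of a hypothetical pool named
+``mypool``::
+
+    ceph osd pool get mypool size
+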
+You may get values for the following keys:
+
+``size``
+
+:Description: see size_
+
+:Type: Integer
+
+``min_size``
+
+:Description: see min_size_
+
+:Type: Integer
+:Version: ``0.54`` and above
+
+``pg_num``
+
+:Description: see pg_num_
+
+:Type: Integer
+
+
+``pgp_num``
+
+:Description: see pgp_num_
+
+:Type: Integer
+:Valid Range: Equal to or less than ``pg_num``.
+
+
+``crush_ruleset``
+
+:Description: see crush_ruleset_
+
+
+``hit_set_type``
+
+:Description: see hit_set_type_
+
+:Type: String
+:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
+
+``hit_set_count``
+
+:Description: see hit_set_count_
+
+:Type: Integer
+
+
+``hit_set_period``
+
+:Description: see hit_set_period_
+
+:Type: Integer
+
+
+``hit_set_fpp``
+
+:Description: see hit_set_fpp_
+
+:Type: Double
+
+
+``cache_target_dirty_ratio``
+
+:Description: see cache_target_dirty_ratio_
+
+:Type: Double
+
+
+``cache_target_dirty_high_ratio``
+
+:Description: see cache_target_dirty_high_ratio_
+
+:Type: Double
+
+
+``cache_target_full_ratio``
+
+:Description: see cache_target_full_ratio_
+
+:Type: Double
+
+
+``target_max_bytes``
+
+:Description: see target_max_bytes_
+
+:Type: Integer
+
+
+``target_max_objects``
+
+:Description: see target_max_objects_
+
+:Type: Integer
+
+
+``cache_min_flush_age``
+
+:Description: see cache_min_flush_age_
+
+:Type: Integer
+
+
+``cache_min_evict_age``
+
+:Description: see cache_min_evict_age_
+
+:Type: Integer
+
+
+``fast_read``
+
+:Description: see fast_read_
+
+:Type: Boolean
+
+
+``scrub_min_interval``
+
+:Description: see scrub_min_interval_
+
+:Type: Double
+
+
+``scrub_max_interval``
+
+:Description: see scrub_max_interval_
+
+:Type: Double
+
+
+``deep_scrub_interval``
+
+:Description: see deep_scrub_interval_
+
+:Type: Double
+
+
+Set the Number of Object Replicas
+=================================
+
+To set the number of object replicas on a replicated pool, execute the following::
+
+ ceph osd pool set {poolname} size {num-replicas}
+
+.. important:: The ``{num-replicas}`` includes the object itself.
+ If you want the object and two copies of the object for a total of
+ three instances of the object, specify ``3``.
+
+For example::
+
+ ceph osd pool set data size 3
+
+You may execute this command for each pool. **Note:** An object might accept
+I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum
+number of required replicas for I/O, you should use the ``min_size`` setting.
+For example::
+
+ ceph osd pool set data min_size 2
+
+This ensures that no object in the data pool will receive I/O with fewer than
+``min_size`` replicas.
+
+
+Get the Number of Object Replicas
+=================================
+
+To get the number of object replicas, execute the following::
+
+ ceph osd dump | grep 'replicated size'
+
+Ceph will list the pools, with the ``replicated size`` attribute highlighted.
+By default, Ceph creates two replicas of an object (a total of three copies,
+or a size of ``3``).
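+
+Illustrative output for one pool (the exact fields vary by release and
+cluster)::
+
+    pool 1 'data' replicated size 3 min_size 2 crush_rule 0 ...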
+
+
+
+.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
+.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
+.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups
+.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites
+.. _Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool
+
diff --git a/src/ceph/doc/rados/operations/upmap.rst b/src/ceph/doc/rados/operations/upmap.rst
new file mode 100644
index 0000000..58f6322
--- /dev/null
+++ b/src/ceph/doc/rados/operations/upmap.rst
@@ -0,0 +1,75 @@
+Using the pg-upmap
+==================
+
+Starting in Luminous v12.2.z there is a new *pg-upmap* exception table
+in the OSDMap that allows the cluster to explicitly map specific PGs to
+specific OSDs. This allows the cluster to fine-tune the data
+distribution to, in most cases, perfectly distribute PGs across OSDs.
+
+The key caveat to this new mechanism is that it requires that all
+clients understand the new *pg-upmap* structure in the OSDMap.
+
+Enabling
+--------
+
+To allow use of the feature, you must tell the cluster that it only
+needs to support luminous (and newer) clients with::
+
+ ceph osd set-require-min-compat-client luminous
+
+This command will fail if any pre-luminous clients or daemons are
+connected to the monitors. You can see what client versions are in
+use with::
+
+ ceph features
+
+A word of caution
+-----------------
+
+This is a new feature and not very user friendly. At the time of this
+writing we are working on a new `balancer` module for ceph-mgr that
+will eventually do all of this automatically.
+
+Until then, the offline optimization procedure below is the recommended
+way to use *pg-upmap*.
+
+Offline optimization
+--------------------
+
+Upmap entries are updated with an offline optimizer built into ``osdmaptool``.
+
+#. Grab the latest copy of your osdmap::
+
+ ceph osd getmap -o om
+
+#. Run the optimizer::
+
+ osdmaptool om --upmap out.txt [--upmap-pool <pool>] [--upmap-max <max-count>] [--upmap-deviation <max-deviation>]
+
+ It is highly recommended that optimization be done for each pool
+ individually, or for sets of similarly-utilized pools. You can
+ specify the ``--upmap-pool`` option multiple times. "Similar pools"
+ means pools that are mapped to the same devices and store the same
+ kind of data (e.g., RBD image pools, yes; RGW index pool and RGW
+ data pool, no).
+
+ The ``max-count`` value is the maximum number of upmap entries to
+ identify in the run. The default is 100, but you may want to make
+ this a smaller number so that the tool completes more quickly (but
+ does less work). If it cannot find any additional changes to make
+ it will stop early (i.e., when the pool distribution is perfect).
+
+   The ``max-deviation`` value defaults to ``0.01`` (i.e., 1%). If an OSD
+   utilization varies from the average by less than this amount it
+   will be considered perfect.
+
+#. The proposed changes are written to the output file ``out.txt`` in
+ the example above. These are normal ceph CLI commands that can be
+ run to apply the changes to the cluster. This can be done with::
+
+ source out.txt
+
+The above steps can be repeated as many times as necessary to achieve
+a perfect distribution of PGs for each set of pools.
+
+You can see some (gory) details about what the tool is doing by
+passing ``--debug-osd 10`` to ``osdmaptool``.
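+
+Putting the steps together, a hypothetical optimization pass for a single
+pool named ``rbd`` might look like this::
+
+    ceph osd getmap -o om
+    osdmaptool om --upmap out.txt --upmap-pool rbd --upmap-max 50
+    source out.txt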
diff --git a/src/ceph/doc/rados/operations/user-management.rst b/src/ceph/doc/rados/operations/user-management.rst
new file mode 100644
index 0000000..8a35a50
--- /dev/null
+++ b/src/ceph/doc/rados/operations/user-management.rst
@@ -0,0 +1,665 @@
+=================
+ User Management
+=================
+
+This document describes :term:`Ceph Client` users, and their authentication and
+authorization with the :term:`Ceph Storage Cluster`. Users are either
+individuals or system actors such as applications, which use Ceph clients to
+interact with the Ceph Storage Cluster daemons.
+
+.. ditaa:: +-----+
+ | {o} |
+ | |
+ +--+--+ /---------\ /---------\
+ | | Ceph | | Ceph |
+ ---+---*----->| |<------------->| |
+ | uses | Clients | | Servers |
+ | \---------/ \---------/
+ /--+--\
+ | |
+ | |
+ actor
+
+
+When Ceph runs with authentication and authorization enabled (enabled by
+default), you must specify a user name and a keyring containing the secret key
+of the specified user (usually via the command line). If you do not specify a
+user name, Ceph will use ``client.admin`` as the default user name. If you do
+not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting
+in the Ceph configuration. For example, if you execute the ``ceph health``
+command without specifying a user or keyring::
+
+ ceph health
+
+Ceph interprets the command like this::
+
+ ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health
+
+Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid
+re-entry of the user name and secret.
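+
+For example, a hypothetical session might export the arguments once and then
+run commands without repeating them::
+
+    export CEPH_ARGS='--id foo --keyring /path/to/keyring'
+    ceph health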
+
+For details on configuring the Ceph Storage Cluster to use authentication,
+see `Cephx Config Reference`_. For details on the architecture of Cephx, see
+`Architecture - High Availability Authentication`_.
+
+
+Background
+==========
+
+Irrespective of the type of Ceph client (e.g., Block Device, Object Storage,
+Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_.
+Ceph users must have access to pools in order to read and write data.
+Additionally, Ceph users must have execute permissions to use Ceph's
+administrative commands. The following concepts will help you understand Ceph
+user management.
+
+
+User
+----
+
+A user is either an individual or a system actor such as an application.
+Creating users allows you to control who (or what) can access your Ceph Storage
+Cluster, its pools, and the data within pools.
+
+Ceph has the notion of a ``type`` of user. For the purposes of user management,
+the type will always be ``client``. Ceph identifies users in period (.)
+delimited form consisting of the user type and the user ID: for example,
+``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing
+is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol,
+but they are not clients. Distinguishing the user type helps to distinguish
+between client users and other users--streamlining access control, user
+monitoring and traceability.
+
+Sometimes Ceph's user type may seem confusing, because the Ceph command line
+allows you to specify a user with or without the type, depending upon your
+command line usage. If you specify ``--user`` or ``--id``, you can omit the
+type. So ``client.user1`` can be entered simply as ``user1``. If you specify
+``--name`` or ``-n``, you must specify the type and name, such as
+``client.user1``. We recommend using the type and name as a best practice
+wherever possible.
+
+.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage
+ user or a Ceph Filesystem user. The Ceph Object Gateway uses a Ceph Storage
+ Cluster user to communicate between the gateway daemon and the storage
+ cluster, but the gateway has its own user management functionality for end
+ users. The Ceph Filesystem uses POSIX semantics. The user space associated
+ with the Ceph Filesystem is not the same as a Ceph Storage Cluster user.
+
+
+
+Authorization (Capabilities)
+----------------------------
+
+Ceph uses the term "capabilities" (caps) to describe authorizing an
+authenticated user to exercise the functionality of the monitors, OSDs and
+metadata servers. Capabilities can also restrict access to data within a pool or
+a namespace within a pool. A Ceph administrative user sets a user's
+capabilities when creating or updating a user.
+
+Capability syntax follows the form::
+
+ {daemon-type} '{capspec}[, {capspec} ...]'
+
+- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access
+ settings or ``profile {name}``. For example::
+
+ mon 'allow rwx'
+ mon 'profile osd'
+
+- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``,
+ ``class-write`` access settings or ``profile {name}``. Additionally, OSD
+ capabilities also allow for pool and namespace settings. ::
+
+ osd 'allow {access} [pool={pool-name} [namespace={namespace-name}]]'
+ osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]]'
+
+- **Metadata Server Caps:** For administrators, use ``allow *``. For all
+ other users, such as CephFS clients, consult :doc:`/cephfs/client-auth`
+
+
+.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the
+ Ceph Storage Cluster, so it is not represented as a Ceph Storage
+ Cluster daemon type.
+
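+For example, a cap specification that restricts a client to read-write access
+within a single namespace of one pool might look like this (the pool and
+namespace names are illustrative)::
+
+    mon 'allow r' osd 'allow rw pool=liverpool namespace=ns1'
+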
+The following entries describe each capability.
+
+``allow``
+
+:Description: Precedes access settings for a daemon. Implies ``rw``
+ for MDS only.
+
+
+``r``
+
+:Description: Gives the user read access. Required with monitors to retrieve
+ the CRUSH map.
+
+
+``w``
+
+:Description: Gives the user write access to objects.
+
+
+``x``
+
+:Description: Gives the user the capability to call class methods
+ (i.e., both read and write) and to conduct ``auth``
+ operations on monitors.
+
+
+``class-read``
+
+:Description: Gives the user the capability to call class read methods.
+              Subset of ``x``.
+
+
+``class-write``
+
+:Description: Gives the user the capability to call class write methods.
+ Subset of ``x``.
+
+
+``*``
+
+:Description: Gives the user read, write and execute permissions for a
+ particular daemon/pool, and the ability to execute
+ admin commands.
+
+
+``profile osd`` (Monitor only)
+
+:Description: Gives a user permissions to connect as an OSD to other OSDs or
+ monitors. Conferred on OSDs to enable OSDs to handle replication
+ heartbeat traffic and status reporting.
+
+
+``profile mds`` (Monitor only)
+
+:Description: Gives a user permissions to connect as a MDS to other MDSs or
+ monitors.
+
+
+``profile bootstrap-osd`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap an OSD. Conferred on
+ deployment tools such as ``ceph-disk``, ``ceph-deploy``, etc.
+ so that they have permissions to add keys, etc. when
+ bootstrapping an OSD.
+
+
+``profile bootstrap-mds`` (Monitor only)
+
+:Description: Gives a user permissions to bootstrap a metadata server.
+ Conferred on deployment tools such as ``ceph-deploy``, etc.
+ so they have permissions to add keys, etc. when bootstrapping
+ a metadata server.
+
+``profile rbd`` (Monitor and OSD)
+
+:Description: Gives a user permissions to manipulate RBD images. When used
+ as a Monitor cap, it provides the minimal privileges required
+ by an RBD client application. When used as an OSD cap, it
+ provides read-write access to an RBD client application.
+
+``profile rbd-read-only`` (OSD only)
+
+:Description: Gives a user read-only permissions to an RBD image.
+
+
+Pool
+----
+
+A pool is a logical partition where users store data.
+In Ceph deployments, it is common to create a pool as a logical partition for
+similar types of data. For example, when deploying Ceph as a backend for
+OpenStack, a typical deployment would have pools for volumes, images, backups
+and virtual machines, and users such as ``client.glance``, ``client.cinder``,
+etc.
+
+
+Namespace
+---------
+
+Objects within a pool can be associated to a namespace--a logical group of
+objects within the pool. A user's access to a pool can be associated with a
+namespace such that reads and writes by the user take place only within the
+namespace. Objects written to a namespace within the pool can only be accessed
+by users who have access to the namespace.
+
+.. note:: Namespaces are primarily useful for applications written on top of
+ ``librados`` where the logical grouping can alleviate the need to create
+ different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various
+ metadata objects.
+
+The rationale for namespaces is that pools can be a computationally expensive
+method of segregating data sets for the purposes of authorizing separate sets
+of users. For example, a pool should have ~100 placement groups per OSD. So an
+exemplary cluster with 1000 OSDs would have 100,000 placement groups for one
+pool. Each pool would create another 100,000 placement groups in the exemplary
+cluster. By contrast, writing an object to a namespace simply associates the
+namespace to the object name without the computational overhead of a separate
+pool. Rather than creating a separate pool for a user or set of users, you may
+use a namespace. **Note:** Only available using ``librados`` at this time.
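+
+As a quick illustration, the ``rados`` CLI can write and list objects within
+a namespace (the pool, namespace, and object names here are hypothetical)::
+
+    rados -p mypool -N ns1 put myobject ./localfile
+    rados -p mypool -N ns1 ls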
+
+
+Managing Users
+==============
+
+User management functionality provides Ceph Storage Cluster administrators with
+the ability to create, update and delete users directly in the Ceph Storage
+Cluster.
+
+When you create or delete users in the Ceph Storage Cluster, you may need to
+distribute keys to clients so that they can be added to keyrings. See `Keyring
+Management`_ for details.
+
+
+List Users
+----------
+
+To list the users in your cluster, execute the following::
+
+ ceph auth ls
+
+Ceph will list out all users in your cluster. For example, in a two-node
+exemplary cluster, ``ceph auth ls`` will output something that looks like
+this::
+
+ installed auth entries:
+
+ osd.0
+ key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
+ caps: [mon] allow profile osd
+ caps: [osd] allow *
+ osd.1
+ key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
+ caps: [mon] allow profile osd
+ caps: [osd] allow *
+ client.admin
+ key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
+ caps: [mds] allow
+ caps: [mon] allow *
+ caps: [osd] allow *
+ client.bootstrap-mds
+ key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww==
+ caps: [mon] allow profile bootstrap-mds
+ client.bootstrap-osd
+ key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw==
+ caps: [mon] allow profile bootstrap-osd
+
+
+Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a
+user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type
+``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user).
+Note also that each entry has a ``key: <value>`` entry, and one or more
+``caps:`` entries.
+
+You may use the ``-o {filename}`` option with ``ceph auth ls`` to
+save the output to a file.
+
+
+Get a User
+----------
+
+To retrieve a specific user, key and capabilities, execute the
+following::
+
+ ceph auth get {TYPE.ID}
+
+For example::
+
+ ceph auth get client.admin
+
+You may also use the ``-o {filename}`` option with ``ceph auth get`` to
+save the output to a file. Developers may also execute the following::
+
+ ceph auth export {TYPE.ID}
+
+The ``auth export`` command is identical to ``auth get``, but also prints
+out the internal ``auid``, which is not relevant to end users.
+
+
+
+Add a User
+----------
+
+Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and
+any capabilities included in the command you use to create the user.
+
+A user's key enables the user to authenticate with the Ceph Storage Cluster.
+The user's capabilities authorize the user to read, write, or execute on Ceph
+monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``).
+
+There are a few ways to add a user:
+
+- ``ceph auth add``: This command is the canonical way to add a user. It
+ will create the user, generate a key and add any specified capabilities.
+
+- ``ceph auth get-or-create``: This command is often the most convenient way
+ to create a user, because it returns a keyfile format with the user name
+ (in brackets) and the key. If the user already exists, this command
+ simply returns the user name and key in the keyfile format. You may use the
+ ``-o {filename}`` option to save the output to a file.
+
+- ``ceph auth get-or-create-key``: This command is a convenient way to create
+ a user and return the user's key (only). This is useful for clients that
+ need the key only (e.g., libvirt). If the user already exists, this command
+ simply returns the key. You may use the ``-o {filename}`` option to save the
+ output to a file.
+
+When creating client users, you may create a user with no capabilities. A user
+with no capabilities is useless beyond mere authentication, because the client
+cannot retrieve the cluster map from the monitor. However, you can create a
+user with no capabilities if you wish to defer adding capabilities later using
+the ``ceph auth caps`` command.
+
+A typical user has at least read capabilities on the Ceph monitor and
+read and write capability on Ceph OSDs. Additionally, a user's OSD permissions
+are often restricted to accessing a particular pool. ::
+
+ ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring
+ ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key
+
+
+.. important:: If you provide a user with capabilities to OSDs, but you DO NOT
+ restrict access to particular pools, the user will have access to ALL
+ pools in the cluster!
+
+
+.. _modify-user-capabilities:
+
+Modify User Capabilities
+------------------------
+
+The ``ceph auth caps`` command allows you to specify a user and change the
+user's capabilities. Setting new capabilities will overwrite current capabilities.
+To view current capabilities run ``ceph auth get USERTYPE.USERID``. To add
+capabilities, you should also specify the existing capabilities when using the following form::
+
+ ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]']
+
+For example::
+
+ ceph auth get client.john
+ ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool'
+ ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool'
+ ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'
+
+To remove a capability, you may reset the capability. If you want the user
+to have no access to a particular daemon that was previously set, specify
+an empty string. For example::
+
+ ceph auth caps client.ringo mon ' ' osd ' '
+
+See `Authorization (Capabilities)`_ for additional details on capabilities.
+
+
+Delete a User
+-------------
+
+To delete a user, use ``ceph auth del``::
+
+ ceph auth del {TYPE}.{ID}
+
+Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``,
+and ``{ID}`` is the user name or ID of the daemon.
+
+
+Print a User's Key
+------------------
+
+To print a user's authentication key to standard output, execute the following::
+
+ ceph auth print-key {TYPE}.{ID}
+
+Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``,
+and ``{ID}`` is the user name or ID of the daemon.
+
+Printing a user's key is useful when you need to populate client
+software with a user's key (e.g., libvirt). ::
+
+ mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user`
+
+
+Import a User(s)
+----------------
+
+To import one or more users, use ``ceph auth import`` and
+specify a keyring::
+
+ ceph auth import -i /path/to/keyring
+
+For example::
+
+ sudo ceph auth import -i /etc/ceph/ceph.keyring
+
+
+.. note:: The Ceph Storage Cluster will add new users, their keys and their
+   capabilities and will update existing users, their keys and their
+   capabilities.
+
+
+Keyring Management
+==================
+
+When you access Ceph via a Ceph client, the Ceph client will look for a local
+keyring. Ceph presets the ``keyring`` setting with the following four keyring
+names by default so you don't have to set them in your Ceph configuration file
+unless you want to override the defaults (not recommended):
+
+- ``/etc/ceph/$cluster.$name.keyring``
+- ``/etc/ceph/$cluster.keyring``
+- ``/etc/ceph/keyring``
+- ``/etc/ceph/keyring.bin``
+
+The ``$cluster`` metavariable is your Ceph cluster name as defined by the
+name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name
+is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user
+type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``).
+
+.. note:: When executing commands that read or write to ``/etc/ceph``, you may
+ need to use ``sudo`` to execute the command as ``root``.
+
+After you create a user (e.g., ``client.ringo``), you must get the key and add
+it to a keyring on a Ceph client so that the user can access the Ceph Storage
+Cluster.
+
+The `User Management`_ section details how to list, get, add, modify and delete
+users directly in the Ceph Storage Cluster. However, Ceph also provides the
+``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client.
+
+
+Create a Keyring
+----------------
+
+When you use the procedures in the `Managing Users`_ section to create users,
+you need to provide user keys to the Ceph client(s) so that the Ceph client
+can retrieve the key for the specified user and authenticate with the Ceph
+Storage Cluster. Ceph clients access keyrings to look up a user name and
+retrieve the user's key.
+
+The ``ceph-authtool`` utility allows you to create a keyring. To create an
+empty keyring, use ``--create-keyring`` or ``-C``. For example::
+
+ ceph-authtool --create-keyring /path/to/keyring
+
+When creating a keyring with multiple users, we recommend using the cluster name
+(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the
+``/etc/ceph`` directory so that the ``keyring`` configuration default setting
+will pick up the filename without requiring you to specify it in the local copy
+of your Ceph configuration file. For example, create ``ceph.keyring`` by
+executing the following::
+
+ sudo ceph-authtool -C /etc/ceph/ceph.keyring
+
+When creating a keyring with a single user, we recommend using the cluster name,
+the user type and the user name and saving it in the ``/etc/ceph`` directory.
+For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user.
+
+To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means
+the file will have ``rw`` permissions for the ``root`` user only, which is
+appropriate when the keyring contains administrator keys. However, if you
+intend to use the keyring for a particular user or group of users, ensure
+that you execute ``chown`` or ``chmod`` to establish appropriate keyring
+ownership and access.
+
+
+Add a User to a Keyring
+-----------------------
+
+When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a
+User`_ procedure to retrieve a user, key and capabilities and save the user to a
+keyring.
+
+When you only want to use one user per keyring, the `Get a User`_ procedure with
+the ``-o`` option will save the output in the keyring file format. For example,
+to create a keyring for the ``client.admin`` user, execute the following::
+
+ sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
+
+Notice that we use the recommended file format for an individual user.
+
+When you want to import users to a keyring, you can use ``ceph-authtool``
+to specify the destination keyring and the source keyring.
+For example::
+
+ sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
+
+
+Create a User
+-------------
+
+Ceph provides the `Add a User`_ function to create a user directly in the Ceph
+Storage Cluster. However, you can also create a user, keys and capabilities
+directly on a Ceph client keyring. Then, you can import the user to the Ceph
+Storage Cluster. For example::
+
+ sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring
+
+See `Authorization (Capabilities)`_ for additional details on capabilities.
+
+You can also create a keyring and add a new user to the keyring simultaneously.
+For example::
+
+ sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key
+
+In the foregoing scenarios, the new user ``client.ringo`` is only in the
+keyring. To add the new user to the Ceph Storage Cluster, you must still add
+the new user to the Ceph Storage Cluster. ::
+
+ sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring
+
+
+Modify a User
+-------------
+
+To modify the capabilities of a user record in a keyring, specify the keyring,
+and the user followed by the capabilities. For example::
+
+ sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx'
+
+To propagate the change to the Ceph Storage Cluster, you must import the
+updated user entry from the keyring into the Ceph Storage Cluster. ::
+
+ sudo ceph auth import -i /etc/ceph/ceph.keyring
+
+See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user
+from a keyring.
+
+You may also `Modify User Capabilities`_ directly in the cluster, store the
+results to a keyring file; then, import the keyring into your main
+``ceph.keyring`` file.
+
+
+Command Line Usage
+==================
+
+Ceph supports the following usage for user name and secret:
+
+``--id`` | ``--user``
+
+:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or
+ ``client.admin``, ``client.user1``). The ``id``, ``name`` and
+ ``-n`` options enable you to specify the ID portion of the user
+ name (e.g., ``admin``, ``user1``, ``foo``, etc.). You can specify
+ the user with the ``--id`` and omit the type. For example,
+ to specify user ``client.foo`` enter the following::
+
+ ceph --id foo --keyring /path/to/keyring health
+ ceph --user foo --keyring /path/to/keyring health
+
+
+``--name`` | ``-n``
+
+:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or
+              ``client.admin``, ``client.user1``). The ``--name`` and ``-n``
+              options enable you to specify the fully qualified user name.
+ You must specify the user type (typically ``client``) with the
+ user ID. For example::
+
+ ceph --name client.foo --keyring /path/to/keyring health
+ ceph -n client.foo --keyring /path/to/keyring health
+
+
+``--keyring``
+
+:Description: The path to the keyring containing one or more user names and
+              secrets. The ``--secret`` option provides the same functionality,
+ but it does not work with Ceph RADOS Gateway, which uses
+ ``--secret`` for another purpose. You may retrieve a keyring with
+ ``ceph auth get-or-create`` and store it locally. This is a
+ preferred approach, because you can switch user names without
+ switching the keyring path. For example::
+
+ sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage
+
+
+.. _pools: ../pools
+
+
+Limitations
+===========
+
+The ``cephx`` protocol authenticates Ceph clients and servers to each other. It
+is not intended to handle authentication of human users or application programs
+run on their behalf. If that effect is required to handle your access control
+needs, you must have another mechanism, which is likely to be specific to the
+front end used to access the Ceph object store. This other mechanism has the
+role of ensuring that only acceptable users and programs are able to run on the
+machine that Ceph will permit to access its object store.
+
+The keys used to authenticate Ceph clients and servers are typically stored in
+a plain text file with appropriate permissions in a trusted host.
+
+.. important:: Storing keys in plaintext files has security shortcomings, but
+ they are difficult to avoid, given the basic authentication methods Ceph
+ uses in the background. Those setting up Ceph systems should be aware of
+ these shortcomings.
+
+In particular, arbitrary user machines, especially portable machines, should not
+be configured to interact directly with Ceph, since that mode of use would
+require the storage of a plaintext authentication key on an insecure machine.
+Anyone who stole that machine or obtained surreptitious access to it could
+obtain the key that will allow them to authenticate their own machines to Ceph.
+
+Rather than permitting potentially insecure machines to access a Ceph object
+store directly, users should be required to sign in to a trusted machine in
+your environment using a method that provides sufficient security for your
+purposes. That trusted machine will store the plaintext Ceph keys for the
+human users. A future version of Ceph may address these particular
+authentication issues more fully.
+
+At the moment, none of the Ceph authentication protocols provide secrecy for
+messages in transit. Thus, an eavesdropper on the wire can hear and understand
+all data sent between clients and servers in Ceph, even if it cannot create or
+alter them. Further, Ceph does not include options to encrypt user data in the
+object store. Users can hand-encrypt and store their own data in the Ceph
+object store, of course, but Ceph provides no features to perform object
+encryption itself. Those storing sensitive data in Ceph should consider
+encrypting their data before providing it to the Ceph system.
+
+
+.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication
+.. _Cephx Config Reference: ../../configuration/auth-config-ref
diff --git a/src/ceph/doc/rados/troubleshooting/community.rst b/src/ceph/doc/rados/troubleshooting/community.rst
new file mode 100644
index 0000000..9faad13
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/community.rst
@@ -0,0 +1,29 @@
+====================
+ The Ceph Community
+====================
+
+The Ceph community is an excellent source of information and help. For
+operational issues with Ceph releases we recommend you `subscribe to the
+ceph-users email list`_. When you no longer want to receive emails, you can
+`unsubscribe from the ceph-users email list`_.
+
+You may also `subscribe to the ceph-devel email list`_. You should do so if
+your issue is:
+
+- Likely related to a bug
+- Related to a development release package
+- Related to a development testing package
+- Related to your own builds
+
+If you no longer want to receive emails from the ``ceph-devel`` email list, you
+may `unsubscribe from the ceph-devel email list`_.
+
+.. tip:: The Ceph community is growing rapidly, and community members can help
+ you if you provide them with detailed information about your problem. You
+ can attach the output of the ``ceph report`` command to help people understand your issues.
+
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
+.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
+.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file
diff --git a/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst
new file mode 100644
index 0000000..159f799
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst
@@ -0,0 +1,67 @@
+===============
+ CPU Profiling
+===============
+
+If you built Ceph from source and compiled Ceph for use with `oprofile`_
+you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details.
+
+
+Initializing oprofile
+=====================
+
+The first time you use ``oprofile`` you need to initialize it. Locate the
+``vmlinux`` image corresponding to the kernel you are now running. ::
+
+ ls /boot
+ sudo opcontrol --init
+ sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6
+
+
+Starting oprofile
+=================
+
+To start ``oprofile`` execute the following command::
+
+ opcontrol --start
+
+Once you start ``oprofile``, you may run some tests with Ceph.
+
+
+Stopping oprofile
+=================
+
+To stop ``oprofile`` execute the following command::
+
+ opcontrol --stop
+
+
+Retrieving oprofile Results
+===========================
+
+To retrieve the top ``cmon`` results, execute the following command::
+
+ opreport -gal ./cmon | less
+
+
+To retrieve the top ``cmon`` results with call graphs attached, execute the
+following command::
+
+ opreport -cal ./cmon | less
+
+.. important:: After reviewing results, you should reset ``oprofile`` before
+ running it again. Resetting ``oprofile`` removes data from the session
+ directory.
+
+
+Resetting oprofile
+==================
+
+To reset ``oprofile``, execute the following command::
+
+ sudo opcontrol --reset
+
+.. important:: You should reset ``oprofile`` after analyzing data so that
+ you do not commingle results from different tests.
+
+.. _oprofile: http://oprofile.sourceforge.net/about/
+.. _Installing Oprofile: ../../../dev/cpu-profiler
diff --git a/src/ceph/doc/rados/troubleshooting/index.rst b/src/ceph/doc/rados/troubleshooting/index.rst
new file mode 100644
index 0000000..80d14f3
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/index.rst
@@ -0,0 +1,19 @@
+=================
+ Troubleshooting
+=================
+
+Ceph is still on the leading edge, so you may encounter situations that require
+you to examine your configuration, modify your logging output, troubleshoot
+monitors and OSDs, profile memory and CPU usage, and reach out to the
+Ceph community for help.
+
+.. toctree::
+ :maxdepth: 1
+
+ community
+ log-and-debug
+ troubleshooting-mon
+ troubleshooting-osd
+ troubleshooting-pg
+ memory-profiling
+ cpu-profiling
diff --git a/src/ceph/doc/rados/troubleshooting/log-and-debug.rst b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst
new file mode 100644
index 0000000..c91f272
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst
@@ -0,0 +1,550 @@
+=======================
+ Logging and Debugging
+=======================
+
+Typically, when you add debugging to your Ceph configuration, you do so at
+runtime. You can also add Ceph debug logging to your Ceph configuration file if
+you are encountering issues when starting your cluster. You may view Ceph log
+files under ``/var/log/ceph`` (the default location).
+
+.. tip:: When debug output slows down your system, the latency can hide
+ race conditions.
+
+Logging is resource intensive. If you are encountering a problem in a specific
+area of your cluster, enable logging for that area of the cluster. For example,
+if your OSDs are running fine, but your metadata servers are not, you should
+start by enabling debug logging for the specific metadata server instance(s)
+giving you trouble. Enable logging for each subsystem as needed.
+
+.. important:: Verbose logging can generate over 1GB of data per hour. If your
+ OS disk reaches its capacity, the node will stop working.
+
+If you enable or increase the rate of Ceph logging, ensure that you have
+sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for
+details on rotating log files. When your system is running well, remove
+unnecessary debugging settings to ensure your cluster runs optimally. Logging
+debug output messages is relatively slow, and a waste of resources when
+operating your cluster.
+
+See `Subsystem, Log and Debug Settings`_ for details on available settings.
+
+Runtime
+=======
+
+If you would like to see the configuration settings at runtime, you must log
+in to a host with a running daemon and execute the following::
+
+ ceph daemon {daemon-name} config show | less
+
+For example::
+
+ ceph daemon osd.0 config show | less
+
+To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the
+``ceph tell`` command to inject arguments into the runtime configuration::
+
+ ceph tell {daemon-type}.{daemon id or *} injectargs --{name} {value} [--{name} {value}]
+
+Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply
+the runtime setting to all daemons of a particular type with ``*``, or specify
+a specific daemon's ID. For example, to increase
+debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following::
+
+ ceph tell osd.0 injectargs --debug-osd 0/5
+
+The ``ceph tell`` command goes through the monitors. If you cannot bind to the
+monitor, you can still make the change by logging into the host of the daemon
+whose configuration you'd like to change using ``ceph daemon``.
+For example::
+
+ sudo ceph daemon osd.0 config set debug_osd 0/5
+
+See `Subsystem, Log and Debug Settings`_ for details on available settings.
+
+
+Boot Time
+=========
+
+To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must
+add settings to your Ceph configuration file. Subsystems common to each daemon
+may be set under ``[global]`` in your configuration file. Subsystems for
+particular daemons are set under the daemon section in your configuration file
+(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example::
+
+ [global]
+ debug ms = 1/5
+
+ [mon]
+ debug mon = 20
+ debug paxos = 1/5
+ debug auth = 2
+
+ [osd]
+ debug osd = 1/5
+ debug filestore = 1/5
+ debug journal = 1
+ debug monc = 5/20
+
+ [mds]
+ debug mds = 1
+ debug mds balancer = 1
+
+
+See `Subsystem, Log and Debug Settings`_ for details.
+
+
+Accelerating Log Rotation
+=========================
+
+If your OS disk is relatively full, you can accelerate log rotation by modifying
+the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting
+after the rotation frequency to accelerate log rotation (via cronjob) if your
+logs exceed the size setting. For example, the default setting looks like
+this::
+
+ rotate 7
+ weekly
+ compress
+ sharedscripts
+
+Modify it by adding a ``size`` setting. ::
+
+ rotate 7
+ weekly
+ size 500M
+ compress
+ sharedscripts
+
+Then, start the crontab editor for your user space. ::
+
+ crontab -e
+
+Finally, add an entry to check the ``/etc/logrotate.d/ceph`` file. ::
+
+ 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1
+
+The preceding example checks the ``/etc/logrotate.d/ceph`` file every 30 minutes.
+
+
+Valgrind
+========
+
+Debugging may also require you to track down memory and threading issues.
+You can run a single daemon, a type of daemon, or the whole cluster with
+Valgrind. You should only use Valgrind when developing or debugging Ceph.
+Valgrind is computationally expensive, and will slow down your system otherwise.
+Valgrind messages are logged to ``stderr``.
+
+
+Subsystem, Log and Debug Settings
+=================================
+
+In most cases, you will enable debug logging output via subsystems.
+
+Ceph Subsystems
+---------------
+
+Each subsystem has a logging level for its output logs, and for its logs
+in-memory. You may set different values for each of these subsystems by setting
+a log file level and a memory level for debug logging. Ceph's logging levels
+operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is
+verbose [#]_ . In general, the logs in-memory are not sent to the output log unless:
+
+- a fatal signal is raised or
+- an ``assert`` in source code is triggered or
+- upon request. Please consult the `document on admin socket <http://docs.ceph.com/docs/master/man/8/ceph/#daemon>`_ for more details.
+
+A debug logging setting can take a single value for the log level and the
+memory level, which sets them both as the same value. For example, if you
+specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level
+of ``5``. You may also specify them separately. The first setting is the log
+level, and the second setting is the memory level. You must separate them with
+a forward slash (/). For example, if you want to set the ``ms`` subsystem's
+debug logging level to ``1`` and its memory level to ``5``, you would specify it
+as ``debug ms = 1/5``. For example:
+
+
+
+.. code-block:: ini
+
+ debug {subsystem} = {log-level}/{memory-level}
+ #for example
+ debug mds balancer = 1/20
+
+
+The following table provides a list of Ceph subsystems and their default log and
+memory levels. Once you complete your logging efforts, restore the subsystems
+to their default level or to a level suitable for normal operations.
+
+
++--------------------+-----------+--------------+
+| Subsystem | Log Level | Memory Level |
++====================+===========+==============+
+| ``default`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``lockdep`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``context`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``crush`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``mds`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``mds balancer`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``mds locker`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``mds log`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``mds log expire`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``mds migrator`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``buffer`` | 0 | 0 |
++--------------------+-----------+--------------+
+| ``timer`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``filer`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``objecter`` | 0 | 0 |
++--------------------+-----------+--------------+
+| ``rados`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``rbd`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``journaler`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``objectcacher`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``client`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``osd`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``optracker`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``objclass`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``filestore`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``journal`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``ms`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``mon`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``monc`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``paxos`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``tp`` | 0 | 5 |
++--------------------+-----------+--------------+
+| ``auth`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``finisher`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``heartbeatmap`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``perfcounter`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``rgw`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``javaclient`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``asok`` | 1 | 5 |
++--------------------+-----------+--------------+
+| ``throttle`` | 1 | 5 |
++--------------------+-----------+--------------+
+
+
+Logging Settings
+----------------
+
+Logging and debugging settings are not required in a Ceph configuration file,
+but you may override default settings as needed. Ceph supports the following
+settings:
+
+
+``log file``
+
+:Description: The location of the logging file for your cluster.
+:Type: String
+:Required: No
+:Default: ``/var/log/ceph/$cluster-$name.log``
+
+
+``log max new``
+
+:Description: The maximum number of new log files.
+:Type: Integer
+:Required: No
+:Default: ``1000``
+
+
+``log max recent``
+
+:Description: The maximum number of recent events to include in a log file.
+:Type: Integer
+:Required: No
+:Default: ``1000000``
+
+
+``log to stderr``
+
+:Description: Determines if logging messages should appear in ``stderr``.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``err to stderr``
+
+:Description: Determines if error messages should appear in ``stderr``.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``log to syslog``
+
+:Description: Determines if logging messages should appear in ``syslog``.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``err to syslog``
+
+:Description: Determines if error messages should appear in ``syslog``.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``log flush on exit``
+
+:Description: Determines if Ceph should flush the log files after exit.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``clog to monitors``
+
+:Description: Determines if ``clog`` messages should be sent to monitors.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``clog to syslog``
+
+:Description: Determines if ``clog`` messages should be sent to syslog.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``mon cluster log to syslog``
+
+:Description: Determines if the cluster log should be output to the syslog.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``mon cluster log file``
+
+:Description: The location of the cluster's log file.
+:Type: String
+:Required: No
+:Default: ``/var/log/ceph/$cluster.log``
+
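+As a minimal example, overriding a few of these defaults in ``ceph.conf``
+might look like this (values are illustrative):
+
+.. code-block:: ini
+
+	[global]
+	log file = /var/log/ceph/$cluster-$name.log
+	log to syslog = true
+	err to syslog = true
+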
+
+
+OSD
+---
+
+
+``osd debug drop ping probability``
+
+:Description: ?
+:Type: Double
+:Required: No
+:Default: 0
+
+
+``osd debug drop ping duration``
+
+:Description:
+:Type: Integer
+:Required: No
+:Default: 0
+
+``osd debug drop pg create probability``
+
+:Description:
+:Type: Integer
+:Required: No
+:Default: 0
+
+``osd debug drop pg create duration``
+
+:Description: ?
+:Type: Double
+:Required: No
+:Default: 1
+
+
+``osd tmapput sets uses tmap``
+
+:Description: Uses ``tmap``. For debug only.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``osd min pg log entries``
+
+:Description: The minimum number of log entries for placement groups.
+:Type: 32-bit Unsigned Integer
+:Required: No
+:Default: 1000
+
+
+``osd op log threshold``
+
+:Description: The number of operations log messages to display at once.
+:Type: Integer
+:Required: No
+:Default: 5
+
+
+
+Filestore
+---------
+
+``filestore debug omap check``
+
+:Description: Debugging check on synchronization. This is an expensive operation.
+:Type: Boolean
+:Required: No
+:Default: 0
+
+
+MDS
+---
+
+
+``mds debug scatterstat``
+
+:Description: Ceph will assert that various recursive stat invariants are true
+ (for developers only).
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``mds debug frag``
+
+:Description: Ceph will verify directory fragmentation invariants when
+ convenient (developers only).
+
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``mds debug auth pins``
+
+:Description: The debug auth pin invariants (for developers only).
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``mds debug subtrees``
+
+:Description: The debug subtree invariants (for developers only).
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+
+RADOS Gateway
+-------------
+
+
+``rgw log nonexistent bucket``
+
+:Description: Whether to log operations on non-existent buckets.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``rgw log object name``
+
+:Description: The logging format for an object name. See ``man date`` for format codes (only a subset is supported).
+:Type: String
+:Required: No
+:Default: ``%Y-%m-%d-%H-%i-%n``
+
+
+``rgw log object name utc``
+
+:Description: Whether a logged object name includes a UTC timestamp.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+
+``rgw enable ops log``
+
+:Description: Enables logging of every RGW operation.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``rgw enable usage log``
+
+:Description: Enable logging of RGW's bandwidth usage.
+:Type: Boolean
+:Required: No
+:Default: ``true``
+
+
+``rgw usage log flush threshold``
+
+:Description: Threshold to flush pending log data.
+:Type: Integer
+:Required: No
+:Default: ``1024``
+
+
+``rgw usage log tick interval``
+
+:Description: The interval, in seconds, at which pending usage log data is flushed.
+:Type: Integer
+:Required: No
+:Default: 30
+
+
+``rgw intent log object name``
+
+:Description:
+:Type: String
+:Required: No
+:Default: ``%Y-%m-%d-%i-%n``
+
+
+``rgw intent log object name utc``
+
+:Description: Include a UTC timestamp in the intent log object name.
+:Type: Boolean
+:Required: No
+:Default: ``false``
+
+.. [#] In some rare cases there are levels greater than 20; these are extremely verbose.
diff --git a/src/ceph/doc/rados/troubleshooting/memory-profiling.rst b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst
new file mode 100644
index 0000000..e2396e2
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst
@@ -0,0 +1,142 @@
+==================
+ Memory Profiling
+==================
+
+Ceph MON, OSD and MDS can generate heap profiles using
+``tcmalloc``. To generate heap profiles, ensure you have
+``google-perftools`` installed::
+
+ sudo apt-get install google-perftools
+
+The profiler dumps output to your ``log file`` directory (i.e.,
+``/var/log/ceph``). See `Logging and Debugging`_ for details.
+To view the profiler logs with Google's performance tools, execute the
+following::
+
+ google-pprof --text {path-to-daemon} {log-path/filename}
+
+For example::
+
+ $ ceph tell osd.0 heap start_profiler
+ $ ceph tell osd.0 heap dump
+ osd.0 tcmalloc heap stats:------------------------------------------------
+ MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application
+ MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist
+ MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist
+ MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist
+ MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists
+ MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata
+ MALLOC: ------------
+ MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap)
+ MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped)
+ MALLOC: ------------
+ MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used
+ MALLOC:
+ MALLOC: 231 Spans in use
+ MALLOC: 56 Thread heaps in use
+ MALLOC: 8192 Tcmalloc page size
+ ------------------------------------------------
+ Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
+ Bytes released to the OS take up virtual address space but no physical memory.
+ $ google-pprof --text \
+ /usr/bin/ceph-osd \
+ /var/log/ceph/ceph-osd.0.profile.0001.heap
+ Total: 3.7 MB
+ 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry
+ 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create
+ 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe
+ 0.0 0.4% 99.2% 0.0 0.6% decode_message
+ ...
+
+Another heap dump on the same daemon will add another file. It is
+convenient to compare to a previous heap dump to show what has grown
+in the interval. For instance::
+
+ $ google-pprof --text --base out/osd.0.profile.0001.heap \
+ ceph-osd out/osd.0.profile.0003.heap
+ Total: 0.2 MB
+ 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry
+ 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create
+ 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op
+ 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate
+
+Refer to `Google Heap Profiler`_ for additional details.
+
+Once you have the heap profiler installed, start your cluster and
+begin using the heap profiler. You may enable or disable the heap
+profiler at runtime, or ensure that it runs continuously. For the
+following commandline usage, replace ``{daemon-type}`` with ``mon``,
+``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or
+the MON or MDS id.
+
+
+Starting the Profiler
+---------------------
+
+To start the heap profiler, execute the following::
+
+ ceph tell {daemon-type}.{daemon-id} heap start_profiler
+
+For example::
+
+ ceph tell osd.1 heap start_profiler
+
+Alternatively the profiler can be started when the daemon starts
+running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in
+the environment.
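+
+For example (assuming you start the daemon manually in the foreground;
+adjust this to your init system)::
+
+	CEPH_HEAP_PROFILER_INIT=true ceph-osd -i 0 -f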
+
+Printing Stats
+--------------
+
+To print out statistics, execute the following::
+
+ ceph tell {daemon-type}.{daemon-id} heap stats
+
+For example::
+
+ ceph tell osd.0 heap stats
+
+.. note:: Printing stats does not require the profiler to be running and does
+ not dump the heap allocation information to a file.
+
+
+Dumping Heap Information
+------------------------
+
+To dump heap information, execute the following::
+
+ ceph tell {daemon-type}.{daemon-id} heap dump
+
+For example::
+
+ ceph tell mds.a heap dump
+
+.. note:: Dumping heap information only works when the profiler is running.
+
+
+Releasing Memory
+----------------
+
+To release memory that ``tcmalloc`` has allocated but which is not being used by
+the Ceph daemon itself, execute the following::
+
+	ceph tell {daemon-type}.{daemon-id} heap release
+
+For example::
+
+ ceph tell osd.2 heap release
+
+
+Stopping the Profiler
+---------------------
+
+To stop the heap profiler, execute the following::
+
+ ceph tell {daemon-type}.{daemon-id} heap stop_profiler
+
+For example::
+
+ ceph tell osd.0 heap stop_profiler
+
+.. _Logging and Debugging: ../log-and-debug
+.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html
diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst
new file mode 100644
index 0000000..89fb94c
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst
@@ -0,0 +1,567 @@
+=================================
+ Troubleshooting Monitors
+=================================
+
+.. index:: monitor, high availability
+
+When a cluster encounters monitor-related troubles there's a tendency to
+panic, and sometimes with good reason. Keep in mind that losing a monitor,
+or a bunch of them, doesn't necessarily mean that your cluster is down, as
+long as a majority is up, running, and has a formed quorum.
+Regardless of how bad the situation is, the first thing you should do is
+calm down, take a breath, and work through our initial troubleshooting questions.
+
+
+Initial Troubleshooting
+========================
+
+
+**Are the monitors running?**
+
+ First of all, we need to make sure the monitors are running. You would be
+ amazed by how often people forget to run the monitors, or forget to restart
+ them after an upgrade. There's no shame in that, but let's try not to lose a
+ couple of hours chasing an issue that is not there.
+
+**Are you able to connect to the monitor's servers?**
+
+ It doesn't happen often, but sometimes there are ``iptables`` rules that
+ block access to monitor servers or monitor ports. These are usually
+ leftovers from monitor stress-testing that were forgotten at some point.
+ Try ssh'ing into the server and, if that succeeds, try connecting to the
+ monitor's port using your tool of choice (telnet, nc, ...).
+
+**Does ceph -s run and obtain a reply from the cluster?**
+
+ If the answer is yes then your cluster is up and running. One thing you
+ can take for granted is that the monitors will only answer to a ``status``
+ request if there is a formed quorum.
+
+ If ``ceph -s`` blocks, however, without obtaining a reply from the cluster
+ or showing a lot of ``fault`` messages, then it is likely that your monitors
+ are either down completely or only a portion is up -- a portion that is not
+ enough to form a quorum (keep in mind that a quorum is formed by a majority
+ of monitors).
+
+**What if ceph -s doesn't finish?**
+
+ If you haven't gone through all the steps so far, please go back and do so.
+
+ For those running on Emperor 0.72-rc1 and forward, you will be able to
+ contact each monitor individually asking them for their status, regardless
+ of a quorum being formed. This can be achieved using ``ceph ping mon.ID``,
+ ID being the monitor's identifier. You should perform this for each monitor
+ in the cluster. In section `Understanding mon_status`_ we will explain how
+ to interpret the output of this command.
+
+ For the rest of you who don't tread on the bleeding edge, you will need to
+ ssh into the server and use the monitor's admin socket. Please jump to
+ `Using the monitor's admin socket`_.
+
+For other specific issues, keep on reading.
+
+
+Using the monitor's admin socket
+=================================
+
+The admin socket allows you to interact with a given daemon directly using a
+Unix socket file. This file can be found in your monitor's ``run`` directory.
+By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``
+but this can vary if you defined it otherwise. If you don't find it there,
+please check your ``ceph.conf`` for an alternative path or run::
+
+ ceph-conf --name mon.ID --show-config-value admin_socket
+
+Please bear in mind that the admin socket will only be available while the
+monitor is running. When the monitor is properly shut down, the admin socket
+will be removed. If however the monitor is not running and the admin socket
+still persists, it is likely that the monitor was shut down improperly.
+Regardless, if the monitor is not running, you will not be able to use the
+admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``.
+
+Accessing the admin socket is as simple as telling the ``ceph`` tool to use
+the ``asok`` file. In pre-Dumpling Ceph, this can be achieved by::
+
+ ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command>
+
+while in Dumpling and beyond you can use the alternate (and recommended)
+format::
+
+ ceph daemon mon.<id> <command>
+
+Using ``help`` as the command to the ``ceph`` tool will show you the
+supported commands available through the admin socket. Please take a look
+at ``config get``, ``config show``, ``mon_status`` and ``quorum_status``,
+as those can be enlightening when troubleshooting a monitor.
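+
+For example, to inspect a monitor named ``mon.a`` through its admin socket::
+
+	ceph daemon mon.a mon_status
+	ceph daemon mon.a quorum_status
+	ceph daemon mon.a config get debug_mon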
+
+
+Understanding mon_status
+=========================
+
+``mon_status`` can be obtained through the ``ceph`` tool when you have
+a formed quorum, or via the admin socket if you don't. This command will
+output a multitude of information about the monitor, including the same
+output you would get with ``quorum_status``.
+
+Take the following example of ``mon_status``::
+
+
+ { "name": "c",
+ "rank": 2,
+ "state": "peon",
+ "election_epoch": 38,
+ "quorum": [
+ 1,
+ 2],
+ "outside_quorum": [],
+ "extra_probe_peers": [],
+ "sync_provider": [],
+ "monmap": { "epoch": 3,
+ "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
+ "modified": "2013-10-30 04:12:01.945629",
+ "created": "2013-10-29 14:14:41.914786",
+ "mons": [
+ { "rank": 0,
+ "name": "a",
+ "addr": "127.0.0.1:6789\/0"},
+ { "rank": 1,
+ "name": "b",
+ "addr": "127.0.0.1:6790\/0"},
+ { "rank": 2,
+ "name": "c",
+ "addr": "127.0.0.1:6795\/0"}]}}
+
+A couple of things are obvious: we have three monitors in the monmap (*a*, *b*
+and *c*), the quorum is formed by only two monitors, and *c* is in the quorum
+as a *peon*.
+
+Which monitor is out of the quorum?
+
+ The answer would be **a**.
+
+Why?
+
+ Take a look at the ``quorum`` set. We have two monitors in this set: *1*
+ and *2*. These are not monitor names. These are monitor ranks, as established
+ in the current monmap. We are missing the monitor with rank 0, and according
+ to the monmap that would be ``mon.a``.
+
+By the way, how are ranks established?
+
+ Ranks are (re)calculated whenever you add or remove monitors and follow a
+ simple rule: the **lower** the ``IP:PORT`` combination, the **lower** the
+ rank. In this case, considering that ``127.0.0.1:6789`` is lower than all
+ the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0.
+
+Most Common Monitor Issues
+===========================
+
+Have Quorum but at least one Monitor is down
+---------------------------------------------
+
+When this happens, depending on the version of Ceph you are running,
+you should be seeing something similar to::
+
+ $ ceph health detail
+ [snip]
+ mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)
+
+How to troubleshoot this?
+
+ First, make sure ``mon.a`` is running.
+
+ Second, make sure you are able to connect to ``mon.a``'s server from the
+ other monitors' servers. Check the ports as well. Check ``iptables`` on
+ all your monitor nodes and make sure you are not dropping/rejecting
+ connections.
+
+ If this initial troubleshooting doesn't solve your problems, then it's
+ time to go deeper.
+
+ First, check the problematic monitor's ``mon_status`` via the admin
+ socket as explained in `Using the monitor's admin socket`_ and
+ `Understanding mon_status`_.
+
+ Considering the monitor is out of the quorum, its state should be one of
+ ``probing``, ``electing`` or ``synchronizing``. If it happens to be either
+ ``leader`` or ``peon``, then the monitor believes it is in the quorum, while
+ the remaining cluster is sure it is not; or maybe it got into the quorum
+ while we were troubleshooting the monitor, so check your ``ceph -s`` again
+ just to make sure. Proceed if the monitor is not yet in the quorum.
+
+What if the state is ``probing``?
+
+ This means the monitor is still looking for the other monitors. Every time
+ you start a monitor, the monitor will stay in this state for some time
+ while trying to find the rest of the monitors specified in the ``monmap``.
+ The time a monitor will spend in this state can vary. For instance, when on
+ a single-monitor cluster, the monitor will pass through the probing state
+ almost instantaneously, since there are no other monitors around. On a
+ multi-monitor cluster, the monitors will stay in this state until they
+ find enough monitors to form a quorum -- this means that if you have 2 out
+ of 3 monitors down, the one remaining monitor will stay in this state
+ indefinitely until you bring one of the other monitors up.
+
+ If you have a quorum, however, the monitor should be able to find the
+ remaining monitors pretty fast, as long as they can be reached. If your
+ monitor is stuck probing and you have gone through all the communication
+ troubleshooting, then there is a fair chance that the monitor is trying
+ to reach the other monitors on a wrong address. ``mon_status`` outputs the
+ ``monmap`` known to the monitor: check if the other monitors' locations
+ match reality. If they don't, jump to
+ `Recovering a Monitor's Broken monmap`_; if they do, then it may be related
+ to severe clock skews amongst the monitor nodes and you should refer to
+ `Clock Skews`_ first, but if that doesn't solve your problem then it is
+ the time to prepare some logs and reach out to the community (please refer
+ to `Preparing your logs`_ on how to best prepare your logs).
+
+
+What if state is ``electing``?
+
+ This means the monitor is in the middle of an election. These should be
+ fast to complete, but at times the monitors can get stuck electing. This
+ is usually a sign of a clock skew among the monitor nodes; jump to
+ `Clock Skews`_ for more information. If all your clocks are properly
+ synchronized, it is best if you prepare some logs and reach out to the
+ community. This is not a state that is likely to persist, and aside from
+ (*really*) old bugs there is no obvious reason besides clock skews for
+ why this would happen.
+
+What if state is ``synchronizing``?
+
+ This means the monitor is synchronizing with the rest of the cluster in
+ order to join the quorum. The smaller your monitor store is, the faster the
+ synchronization completes, so if you have a big store it may
+ take a while. Don't worry, it should finish soon enough.
+
+ However, if you notice that the monitor jumps from ``synchronizing`` to
+ ``electing`` and then back to ``synchronizing``, then you do have a
+ problem: the cluster state is advancing (i.e., generating new maps) way
+ too fast for the synchronization process to keep up. This used to be a
+ problem in early Cuttlefish, but since then the synchronization process has
+ been substantially refactored and enhanced to avoid this behavior. If this
+ happens in later versions let us know. And bring some logs
+ (see `Preparing your logs`_).
+
+What if state is ``leader`` or ``peon``?
+
+ This should not happen. There is a chance this might happen however, and
+ it has a lot to do with clock skews -- see `Clock Skews`_. If you are not
+ suffering from clock skews, then please prepare your logs (see
+ `Preparing your logs`_) and reach out to us.
+
+
+Recovering a Monitor's Broken monmap
+-------------------------------------
+
+This is what a ``monmap`` usually looks like, depending on the number of
+monitors::
+
+
+ epoch 3
+ fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
+ last_changed 2013-10-30 04:12:01.945629
+ created 2013-10-29 14:14:41.914786
+ 0: 127.0.0.1:6789/0 mon.a
+ 1: 127.0.0.1:6790/0 mon.b
+ 2: 127.0.0.1:6795/0 mon.c
+
+This may not be what you have however. For instance, some versions of
+early Cuttlefish had a bug that could cause your ``monmap`` to be
+nullified: completely filled with zeros. Not even ``monmaptool`` would
+be able to read such a map, because it cannot make sense of all zeros.
+Some other times, you may end up with a monitor
+with a severely outdated monmap, thus being unable to find the remaining
+monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
+then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
+``mon.b``; you will end up with a totally different monmap from the one
+``mon.c`` knows).
+
+In these situations, you have two possible solutions:
+
+Scrap the monitor and create a new one
+
+ You should only take this route if you are positive that you won't
+ lose the information kept by that monitor; that you have other monitors
+ and that they are running just fine so that your new monitor is able
+ to synchronize from the remaining monitors. Keep in mind that destroying
+ a monitor, if there are no other copies of its contents, may lead to
+ loss of data.
+
+Inject a monmap into the monitor
+
+ Usually the safest path. You should grab the monmap from the remaining
+ monitors and inject it into the monitor with the corrupted/lost monmap.
+
+ These are the basic steps:
+
+ 1. Is there a formed quorum? If so, grab the monmap from the quorum::
+
+ $ ceph mon getmap -o /tmp/monmap
+
+ 2. No quorum? Grab the monmap directly from another monitor (this
+ assumes the monitor you are grabbing the monmap from has id ID-FOO
+ and has been stopped)::
+
+ $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap
+
+ 3. Stop the monitor you are going to inject the monmap into.
+
+ 4. Inject the monmap::
+
+ $ ceph-mon -i ID --inject-monmap /tmp/monmap
+
+ 5. Start the monitor
+
+ Please keep in mind that the ability to inject monmaps is a powerful
+ feature that can cause havoc with your monitors if misused as it will
+ overwrite the latest, existing monmap kept by the monitor.
+
+
+Clock Skews
+------------
+
+Monitors can be severely affected by significant clock skews across the
+monitor nodes. This usually translates into weird behavior with no obvious
+cause. To avoid such issues, you should run a clock synchronization tool
+on your monitor nodes.
+
+
+What's the maximum tolerated clock skew?
+
+ By default the monitors will allow clocks to drift up to ``0.05 seconds``.
+
+
+Can I increase the maximum tolerated clock skew?
+
+ This value is configurable via the ``mon clock drift allowed`` option, and
+ although you *CAN* increase it, it doesn't mean you *SHOULD*. The clock skew
+ mechanism is in place because clock-skewed monitors may not behave properly.
+ We, as developers and QA aficionados, are comfortable with the current
+ default value, as it will alert the user before the monitors get out of
+ hand. Changing this value without testing it first may cause unforeseen
+ effects on the stability of the monitors and overall cluster health,
+ although there is no risk of data loss.
+
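+ If, after testing, you do decide to raise it, a hypothetical ``ceph.conf``
+ override might look like this (``0.10`` is an illustrative value)::
+
+	[global]
+	mon clock drift allowed = 0.10
+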
+
+How do I know there's a clock skew?
+
+ The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health
+ detail`` should show something in the form of::
+
+ mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)
+
+ That means that ``mon.c`` has been flagged as suffering from a clock skew.
+
+
+What should I do if there's a clock skew?
+
+ Synchronize your clocks. Running an NTP client may help. If you are already
+ using one and you hit this sort of issue, check whether you are using an
+ NTP server that is remote to your network, and consider hosting your own
+ NTP server on your network. This last option tends to reduce the amount of
+ issues with monitor clock skews.
+
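+ As a quick sanity check (assuming a recent Ceph release, which provides
+ ``ceph time-sync-status``, and an ``ntpd``-based setup), you can compare
+ what the monitors and your NTP daemon report::
+
+	ceph time-sync-status
+	ntpq -p
+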
+
+Client Can't Connect or Mount
+------------------------------
+
+Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
+``iptables``. The rule rejects all clients trying to connect to the host except
+for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in
+place, clients connecting from a separate node will fail to mount with a timeout
+error. You need to address ``iptables`` rules that reject clients trying to
+connect to Ceph daemons. For example, you would need to address rules that look
+like this appropriately::
+
+ REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
+
+You may also need to add rules to IP tables on your Ceph hosts to ensure
+that clients can access the ports associated with your Ceph monitors (i.e., port
+6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For
+example::
+
+ iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT
+
+Monitor Store Failures
+======================
+
+Symptoms of store corruption
+----------------------------
+
+A Ceph monitor stores the `cluster map`_ in a key/value store such as LevelDB.
+If a monitor fails due to key/value store corruption, the following error
+messages might be found in the monitor log::
+
+ Corruption: error in middle of record
+
+or::
+
+ Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb
+
+Recovery using healthy monitor(s)
+---------------------------------
+
+If there are any survivors, we can always `replace`_ the corrupted one with a
+new one. After booting up, the new joiner will sync up with a healthy
+peer, and once it is fully synced, it will be able to serve the clients.
+
+Recovery using OSDs
+-------------------
+
+But what if all monitors fail at the same time? Since users are encouraged to
+deploy at least three monitors in a Ceph cluster, the chance of simultaneous
+failure is low. But unplanned power-downs in a data center with improperly
+configured disk/fs settings could fail the underlying filesystem, and hence
+kill all the monitors. In this case, we can recover the monitor store with the
+information stored in OSDs::
+
+ ms=/tmp/mon-store
+ mkdir $ms
+ # collect the cluster map from OSDs
+ for host in $hosts; do
+ rsync -avz $ms user@$host:$ms
+ rm -rf $ms
+ ssh user@$host <<EOF
+ for osd in /var/lib/ceph/osd/ceph-*; do
+ ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
+ done
+ EOF
+ rsync -avz user@$host:$ms $ms
+ done
+ # rebuild the monitor store from the collected map, if the cluster does not
+ # use cephx authentication, we can skip the following steps to update the
+ # keyring with the caps, and there is no need to pass the "--keyring" option.
+ # i.e. just use "ceph-monstore-tool /tmp/mon-store rebuild" instead
+ ceph-authtool /path/to/admin.keyring -n mon. \
+ --cap mon 'allow *'
+ ceph-authtool /path/to/admin.keyring -n client.admin \
+ --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
+ ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring
+ # backup corrupted store.db just in case
+ mv /var/lib/ceph/mon/mon.0/store.db /var/lib/ceph/mon/mon.0/store.db.corrupted
+ mv /tmp/mon-store/store.db /var/lib/ceph/mon/mon.0/store.db
+ chown -R ceph:ceph /var/lib/ceph/mon/mon.0/store.db
+
+The steps above:
+
+#. collect the map from all OSD hosts,
+#. rebuild the store,
+#. fill the entities in the keyring file with appropriate caps, and
+#. replace the corrupted store on ``mon.0`` with the recovered copy.
+
+Known limitations
+~~~~~~~~~~~~~~~~~
+
+The following information is not recoverable using the steps above:
+
+- **some added keyrings**: all the OSD keyrings added using the ``ceph auth add``
+ command are recovered from the OSDs' copies, and the ``client.admin`` keyring is
+ imported using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings
+ are missing in the recovered monitor store. You might need to re-add them manually.
+
+- **pg settings**: the ``full ratio`` and ``nearfull ratio`` settings configured using
+ ``ceph pg set_full_ratio`` and ``ceph pg set_nearfull_ratio`` will be lost.
+
+- **MDS Maps**: the MDS maps are lost.
+
+
+Everything Failed! Now What?
+=============================
+
+Reaching out for help
+----------------------
+
+You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net)
+and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make
+sure you have grabbed your logs and have them ready if someone asks: the faster
+the interaction and the lower the response latency, the better use everyone
+makes of their time.
+
+
+Preparing your logs
+---------------------
+
+Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We
+may want them. However, your logs may not have the necessary information. If
+you don't find your monitor logs at their default location, you can check
+where they should be by running::
+
+ ceph-conf --name mon.FOO --show-config-value log_file
+
+The amount of information in the logs is determined by the debug levels being
+enforced by your configuration files. If you have not enforced a specific
+debug level then Ceph is using the default levels and your logs may not
+contain important information to track down your issue.
+A first step in getting relevant information into your logs will be to raise
+debug levels. In this case we will be interested in the information from the
+monitor.
+Similarly to what happens on other components, different parts of the monitor
+will output their debug information on different subsystems.
+
+You will have to raise the debug levels of those subsystems more closely
+related to your issue. This may not be an easy task for someone unfamiliar
+with troubleshooting Ceph. For most situations, setting the following options
+on your monitors will be enough to pinpoint a potential source of the issue::
+
+ debug mon = 10
+ debug ms = 1
+
+If we find that these debug levels are not enough, there's a chance we may
+ask you to raise them or even define other debug subsystems to obtain information
+from -- but at least we started off with some useful information, instead
+of a massively empty log without much to go on with.
+
+Do I need to restart a monitor to adjust debug levels?
+------------------------------------------------------
+
+No. You may do it in one of two ways:
+
+You have quorum
+
+ Either inject the debug option into the monitor you want to debug::
+
+ ceph tell mon.FOO injectargs --debug_mon 10/10
+
+ or into all monitors at once::
+
+ ceph tell mon.* injectargs --debug_mon 10/10
+
+No quorum
+
+ Use the monitor's admin socket and directly adjust the configuration
+ options::
+
+ ceph daemon mon.FOO config set debug_mon 10/10
+
+
+Going back to default values is as easy as rerunning the above commands
+using the debug level ``1/10`` instead. You can check your current
+values using the admin socket and the following commands::
+
+ ceph daemon mon.FOO config show
+
+or::
+
+ ceph daemon mon.FOO config get 'OPTION_NAME'
+
+
+Reproduced the problem with appropriate debug levels. Now what?
+----------------------------------------------------------------
+
+Ideally you would send us only the relevant portions of your logs.
+We realise that figuring out the corresponding portion may not be the
+easiest of tasks. Therefore, we won't hold it against you if you provide the
+full log, but common sense should be employed. If your log has hundreds
+of thousands of lines, it may get tricky to go through the whole thing,
+especially if we are not aware at which point, whatever your issue is,
+happened. For instance, when reproducing, keep in mind to write down the
+current time and date and to extract the relevant portions of your logs
+based on that.
+
+Finally, you should reach out to us on the mailing lists, on IRC or file
+a new issue on the `tracker`_.
+
+.. _cluster map: ../../architecture#cluster-map
+.. _replace: ../operation/add-or-rm-mons
+.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new
diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
new file mode 100644
index 0000000..88307fe
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -0,0 +1,536 @@
+======================
+ Troubleshooting OSDs
+======================
+
+Before troubleshooting your OSDs, check your monitors and network first. If
+you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
+a health status, it means that the monitors have a quorum.
+If you don't have a monitor quorum or if there are errors with the monitor
+status, `address the monitor issues first <../troubleshooting-mon>`_.
+Check your networks to ensure they
+are running properly, because networks may have a significant impact on OSD
+operation and performance.
+
+
+
+Obtaining Data About OSDs
+=========================
+
+A good first step in troubleshooting your OSDs is to obtain information in
+addition to the information you collected while `monitoring your OSDs`_
+(e.g., ``ceph osd tree``).
+
+
+Ceph Logs
+---------
+
+If you haven't changed the default path, you can find Ceph log files at
+``/var/log/ceph``::
+
+ ls /var/log/ceph
+
+If you don't get enough log detail, you can change your logging level. See
+`Logging and Debugging`_ for details to ensure that Ceph performs adequately
+under high logging volume.
+
+
+Admin Socket
+------------
+
+Use the admin socket tool to retrieve runtime information. For details, list
+the sockets for your Ceph processes::
+
+ ls /var/run/ceph
+
+Then, execute the following, replacing ``{daemon-name}`` with an actual
+daemon (e.g., ``osd.0``)::
+
+ ceph daemon osd.0 help
+
+Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``)::
+
+ ceph daemon {socket-file} help
+
+
+The admin socket, among other things, allows you to:
+
+- List your configuration at runtime
+- Dump historic operations
+- Dump the operation priority queue state
+- Dump operations in flight
+- Dump perfcounters
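+
+For example, assuming a daemon named ``osd.0``::
+
+	ceph daemon osd.0 config show
+	ceph daemon osd.0 dump_historic_ops
+	ceph daemon osd.0 dump_ops_in_flight
+	ceph daemon osd.0 perf dump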
+
+
+Display Freespace
+-----------------
+
+Filesystem issues may arise. To display your filesystem's free space, execute
+``df``. ::
+
+ df -h
+
+Execute ``df --help`` for additional usage.
+
+
+I/O Statistics
+--------------
+
+Use `iostat`_ to identify I/O-related issues. ::
+
+ iostat -x
+
+
+Diagnostic Messages
+-------------------
+
+To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
+or ``tail``. For example::
+
+ dmesg | grep scsi
+
+
+Stopping w/out Rebalancing
+==========================
+
+Periodically, you may need to perform maintenance on a subset of your cluster,
+or resolve a problem that affects a failure domain (e.g., a rack). If you do not
+want CRUSH to automatically rebalance the cluster as you stop OSDs for
+maintenance, set the cluster to ``noout`` first::
+
+ ceph osd set noout
+
+Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
+failure domain that requires maintenance work. ::
+
+ stop ceph-osd id={num}
+
+.. note:: Placement groups within the OSDs you stop will become ``degraded``
+ while you are addressing issues within the failure domain.
+
+Once you have completed your maintenance, restart the OSDs. ::
+
+ start ceph-osd id={num}
+
+Finally, you must unset the cluster from ``noout``. ::
+
+ ceph osd unset noout
+
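+The ``stop``/``start`` commands above assume Upstart; on systemd-based
+distributions the equivalent would be, for example::
+
+	sudo systemctl stop ceph-osd@{num}
+	sudo systemctl start ceph-osd@{num}
+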
+
+
+.. _osd-not-running:
+
+OSD Not Running
+===============
+
+Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
+allow it to rejoin the cluster and recover.
+
+An OSD Won't Start
+------------------
+
+If you start your cluster and an OSD won't start, check the following:
+
+- **Configuration File:** If you were not able to get OSDs running from
+ a new installation, check your configuration file to ensure it conforms
+ (e.g., ``host`` not ``hostname``, etc.).
+
+- **Check Paths:** Check the paths in your configuration, and the actual
+ paths themselves for data and journals. If you separate the OSD data from
+ the journal data and there are errors in your configuration file or in the
+ actual mounts, you may have trouble starting OSDs. If you want to store the
+ journal on a block device, you should partition your journal disk and assign
+ one partition per OSD.
+
+- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
+ hitting the default maximum number of threads (e.g., usually 32k), especially
+ during recovery. You can use ``sysctl`` to see whether raising the maximum
+ number of threads to the maximum allowed (i.e., 4194303) will help.
+ For example::
+
+ sysctl -w kernel.pid_max=4194303
+
+ If increasing the maximum thread count resolves the issue, you can make it
+ permanent by including a ``kernel.pid_max`` setting in the
+ ``/etc/sysctl.conf`` file. For example::
+
+ kernel.pid_max = 4194303
+
+- **Kernel Version:** Identify the kernel version and distribution you
+ are using. Ceph uses some third party tools by default, which may be
+ buggy or may conflict with certain distributions and/or kernel
+ versions (e.g., Google perftools). Check the `OS recommendations`_
+ to ensure you have addressed any issues related to your kernel.
+
+- **Segment Fault:** If there is a segmentation fault, turn your logging up
+ (if it is not already), and try again. If it segfaults again,
+ contact the ceph-devel email list and provide your Ceph configuration
+ file, your monitor output and the contents of your log file(s).
+
+
+
+An OSD Failed
+-------------
+
+When a ``ceph-osd`` process dies, the monitor will learn about the failure
+from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
+command::
+
+ ceph health
+ HEALTH_WARN 1/3 in osds are down
+
+Specifically, you will get a warning whenever there are ``ceph-osd``
+processes that are marked ``in`` and ``down``. You can identify which
+``ceph-osds`` are ``down`` with::
+
+ ceph health detail
+ HEALTH_WARN 1/3 in osds are down
+ osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+
+If there is a disk
+failure or other fault preventing ``ceph-osd`` from functioning or
+restarting, an error message should be present in its log file in
+``/var/log/ceph``.
+
+If the daemon stopped because of a heartbeat failure, the underlying
+kernel file system may be unresponsive. Check ``dmesg`` output for disk
+or other kernel errors.
+
+If the problem is a software error (failed assertion or other
+unexpected error), it should be reported to the `ceph-devel`_ email list.
+
+
+No Free Drive Space
+-------------------
+
+Ceph prevents you from writing to a full OSD so that you don't lose data.
+In an operational cluster, you should receive a warning when your cluster
+is getting near its full ratio. The ``mon osd full ratio`` defaults to
+``0.95``, or 95% of capacity before it stops clients from writing data.
+The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
+capacity when it blocks backfills from starting. The
+``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
+when it generates a health warning.
+
+Full cluster issues usually arise when testing how Ceph handles an OSD
+failure on a small cluster. When one node has a high percentage of the
+cluster's data, the cluster can easily exceed its nearfull and full ratios
+immediately. If you are testing how Ceph reacts to OSD failures on a small
+cluster, you should leave ample free disk space and consider temporarily
+lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
+``mon osd nearfull ratio``.
+
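+For example, on a small test cluster you might temporarily lower these
+thresholds in ``ceph.conf`` (illustrative values; restore the defaults
+afterwards)::
+
+	[global]
+	mon osd full ratio = .80
+	mon osd backfillfull ratio = .75
+	mon osd nearfull ratio = .70
+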
+Full ``ceph-osds`` will be reported by ``ceph health``::
+
+ ceph health
+ HEALTH_WARN 1 nearfull osd(s)
+
+Or::
+
+ ceph health detail
+ HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
+ osd.3 is full at 97%
+ osd.4 is backfill full at 91%
+ osd.2 is near full at 87%
+
+The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
+the cluster to redistribute data to the newly available storage.
+
+If you cannot start an OSD because it is full, you may delete some data by deleting
+some placement group directories in the full OSD.
+
+.. important:: If you choose to delete a placement group directory on a full OSD,
+ **DO NOT** delete the same placement group directory on another full OSD, or
+ **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
+ at least one OSD.
+
+See `Monitor Config Reference`_ for additional details.
+
+
+OSDs are Slow/Unresponsive
+==========================
+
+A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
+have eliminated other troubleshooting possibilities before delving into OSD
+performance issues. For example, ensure that your network(s) is working properly
+and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
+
+.. tip:: Newer versions of Ceph provide better recovery handling by preventing
+ recovering OSDs from using up system resources, so that ``up`` and ``in``
+ OSDs are not unavailable or otherwise slow.
+
+
+Networking Issues
+-----------------
+
+Ceph is a distributed storage system, so it depends upon networks to peer with
+OSDs, replicate objects, recover from faults and check heartbeats. Networking
+issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
+details.
+
+Ensure that Ceph processes and Ceph-dependent processes are connected and/or
+listening. ::
+
+ netstat -a | grep ceph
+ netstat -l | grep ceph
+ sudo netstat -p | grep ceph
+
+Check network statistics. ::
+
+ netstat -s
+
+
+Drive Configuration
+-------------------
+
+A storage drive should only support one OSD. Sequential read and sequential
+write throughput can bottleneck if other processes share the drive, including
+journals, operating systems, monitors, other OSDs and non-Ceph processes.
+
+Ceph acknowledges writes *after* journaling, so fast SSDs are an
+attractive option to accelerate the response time--particularly when
+using the ``XFS`` or ``ext4`` filesystems. By contrast, the ``btrfs``
+filesystem can write and journal simultaneously. (Note, however, that
+we recommend against using ``btrfs`` for production deployments.)
+
+.. note:: Partitioning a drive does not change its total throughput or
+ sequential read/write limits. Running a journal in a separate partition
+ may help, but you should prefer a separate physical drive.
+
+
+Bad Sectors / Fragmented Disk
+-----------------------------
+
+Check your disks for bad sectors and fragmentation. This can cause total throughput
+to drop substantially.
+
+
+Co-resident Monitors/OSDs
+-------------------------
+
+Monitors are generally light-weight processes, but they do lots of ``fsync()``,
+which can interfere with other workloads, particularly if monitors run on the
+same drive as your OSDs. Additionally, if you run monitors on the same host as
+the OSDs, you may incur performance issues related to:
+
+- Running an older kernel (pre-3.0)
+- Running Argonaut with an old ``glibc``
+- Running a kernel with no syncfs(2) syscall.
+
+In these cases, multiple OSDs running on the same host can drag each other down
+by doing lots of commits. That often leads to bursty writes.
+
+
+Co-resident Processes
+---------------------
+
+Spinning up co-resident processes such as a cloud-based solution, virtual
+machines and other applications that write data to Ceph while operating on the
+same hardware as OSDs can introduce significant OSD latency. Generally, we
+recommend optimizing a host for use with Ceph and using other hosts for other
+processes. The practice of separating Ceph operations from other applications
+may help improve performance and may streamline troubleshooting and maintenance.
+
+
+Logging Levels
+--------------
+
+If you turned logging levels up to track an issue and then forgot to turn
+logging levels back down, the OSD may be putting a lot of logs onto the disk. If
+you intend to keep logging levels high, you may consider mounting a drive to the
+default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
+
+
+Recovery Throttling
+-------------------
+
+Depending upon your configuration, Ceph may reduce recovery rates to maintain
+performance or it may increase recovery rates to the point that recovery
+impacts OSD performance. Check to see if the OSD is recovering.
+
+
+Kernel Version
+--------------
+
+Check the kernel version you are running. Older kernels may not receive
+new backports that Ceph depends upon for better performance.
+
+
+Kernel Issues with SyncFS
+-------------------------
+
+Try running one OSD per host to see if performance improves. Old kernels
+might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
+
+
+Filesystem Issues
+-----------------
+
+Currently, we recommend deploying clusters with XFS.
+
+We recommend against using btrfs or ext4. The btrfs filesystem has
+many attractive features, but bugs in the filesystem may lead to
+performance issues and spurious ENOSPC errors. We do not recommend
+ext4 because xattr size limitations break our support for long object
+names (needed for RGW).
+
+For more information, see `Filesystem Recommendations`_.
+
+.. _Filesystem Recommendations: ../configuration/filesystem-recommendations
+
+
+Insufficient RAM
+----------------
+
+We recommend 1GB of RAM per OSD daemon. You may notice that during normal
+operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
+Unused RAM makes it tempting to use the excess RAM for co-resident applications,
+VMs and so forth. However, when OSDs go into recovery mode, their memory
+utilization spikes. If there is no RAM available, the OSD performance will slow
+considerably.
+
+
+Old Requests or Slow Requests
+-----------------------------
+
+If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
+complaining about requests that are taking too long. The warning threshold
+defaults to 30 seconds, and is configurable via the ``osd op complaint time``
+option. When this happens, the cluster log will receive messages.
+
+Legacy versions of Ceph complain about ``old requests``::
+
+ osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+New versions of Ceph complain about ``slow requests``::
+
+ {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+ {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+
+Possible causes include:
+
+- A bad drive (check ``dmesg`` output)
+- A kernel file system bug (check ``dmesg`` output)
+- An overloaded cluster (check system load, iostat, etc.)
+- A bug in the ``ceph-osd`` daemon.
+
+Possible solutions:
+
+- Remove VMs or cloud solutions from Ceph hosts
+- Upgrade your kernel
+- Upgrade Ceph
+- Restart OSDs
+
+Debugging Slow Requests
+-----------------------
+
+If you run "ceph daemon osd.<id> dump_historic_ops" or "dump_ops_in_flight",
+you will see a set of operations and a list of events each operation went
+through. These are briefly described below.
+
+Events from the Messenger layer:
+
+- header_read: when the messenger first started reading the message off the wire
+- throttled: when the messenger tried to acquire memory throttle space to read
+ the message into memory
+- all_read: when the messenger finished reading the message off the wire
+- dispatched: when the messenger gave the message to the OSD
+- Initiated: <This is identical to header_read. The existence of both is a
+ historical oddity.
+
+Events from the OSD as it prepares operations
+
+- queued_for_pg: the op has been put into the queue for processing by its PG
+- reached_pg: the PG has started doing the op
+- waiting for \*: the op is waiting for some other work to complete before it
+ can proceed (a new OSDMap; for its object target to scrub; for the PG to
+ finish peering; all as specified in the message)
+- started: the op has been accepted as something the OSD should actually do
+ (reasons not to do it: failed security/permission checks; out-of-date local
+ state; etc) and is now actually being performed
+- waiting for subops from: the op has been sent to replica OSDs
+
+Events from the FileStore
+
+- commit_queued_for_journal_write: the op has been given to the FileStore
+- write_thread_in_journal_buffer: the op is in the journal's buffer and waiting
+ to be persisted (as the next disk write)
+- journaled_completion_queued: the op was journaled to disk and its callback
+ queued for invocation
+
+Events from the OSD after stuff has been given to local disk
+
+- op_commit: the op has been committed (ie, written to journal) by the
+ primary OSD
+- op_applied: The op has been write()'en to the backing FS (ie, applied in
+ memory but not flushed out to disk) on the primary
+- sub_op_applied: op_applied, but for a replica's "subop"
+- sub_op_committed: op_committed, but for a replica's subop (only for EC pools)
+- sub_op_commit_rec/sub_op_apply_rec from <X>: the primary marks this when it
+ hears about the above, but for a particular replica <X>
+- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops)
+
+Many of these events are seemingly redundant, but cross important boundaries in
+the internal code (such as passing data across locks into new threads).
+
+Flapping OSDs
+=============
+
+We recommend using both a public (front-end) network and a cluster (back-end)
+network so that you can better meet the capacity requirements of object
+replication. Another advantage is that you can run a cluster network such that
+it is not connected to the internet, thereby preventing some denial of service
+attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+However, if the cluster (back-end) network fails or develops significant latency
+while the public (front-end) network operates optimally, OSDs currently do not
+handle this situation well. What happens is that OSDs mark each other ``down``
+on the monitor, while marking themselves ``up``. We call this scenario
+``flapping``.
+
+If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
+then ``up`` again), you can force the monitors to stop the flapping with::
+
+ ceph osd set noup # prevent OSDs from getting marked up
+ ceph osd set nodown # prevent OSDs from getting marked down
+
+These flags are recorded in the osdmap structure::
+
+ ceph osd dump | grep flags
+ flags no-up,no-down
+
+You can clear the flags with::
+
+ ceph osd unset noup
+ ceph osd unset nodown
+
+Two other flags are supported, ``noin`` and ``noout``, which prevent
+booting OSDs from being marked ``in`` (allocated data) or protect OSDs
+from eventually being marked ``out`` (regardless of what the current value for
+``mon osd down out interval`` is).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
+ sense that once the flags are cleared, the action they were blocking
+ should occur shortly after. The ``noin`` flag, on the other hand,
+ prevents OSDs from being marked ``in`` on boot, and any daemons that
+ started while the flag was set will remain that way.
+
+
+
+
+
+
+.. _iostat: http://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging: ../log-and-debug
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _Monitor Config Reference: ../../configuration/mon-config-ref
+.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
+.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
+.. _OS recommendations: ../../../start/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org
diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst
new file mode 100644
index 0000000..4241fee
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst
@@ -0,0 +1,668 @@
+=====================
+ Troubleshooting PGs
+=====================
+
+Placement Groups Never Get Clean
+================================
+
+When you create a cluster and your cluster remains in ``active``,
+``active+remapped`` or ``active+degraded`` status and never achieves an
+``active+clean`` status, you likely have a problem with your configuration.
+
+You may need to review settings in the `Pool, PG and CRUSH Config Reference`_
+and make appropriate adjustments.
+
+As a general rule, you should run your cluster with more than one OSD and a
+pool size greater than 1 object replica.
+
+One Node Cluster
+----------------
+
+Ceph no longer provides documentation for operating on a single node, because
+you would never deploy a system designed for distributed computing on a single
+node. Additionally, mounting client kernel modules on a single node containing a
+Ceph daemon may cause a deadlock due to issues with the Linux kernel itself
+(unless you use VMs for the clients). You can experiment with Ceph in a 1-node
+configuration, in spite of the limitations as described herein.
+
+If you are trying to create a cluster on a single node, you must change the
+default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning
+``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
+file before you create your monitors and OSDs. This tells Ceph that an OSD
+can peer with another OSD on the same host. If you are trying to set up a
+1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``,
+Ceph will try to peer the PGs of one OSD with the PGs of another OSD on
+another node, chassis, rack, row, or even datacenter depending on the setting.
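+
+For example, the relevant entry in your Ceph configuration file would look
+like this (a minimal sketch; the rest of the file is unchanged)::
+
+ [global]
+ osd crush chooseleaf type = 0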
+
+.. tip:: DO NOT mount kernel clients directly on the same node as your
+ Ceph Storage Cluster, because kernel conflicts can arise. However, you
+ can mount kernel clients within virtual machines (VMs) on a single node.
+
+If you are creating OSDs using a single disk, you must create directories
+for the data manually first. For example::
+
+ mkdir /var/local/osd0 /var/local/osd1
+ ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
+ ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
+
+
+Fewer OSDs than Replicas
+------------------------
+
+If you have brought up two OSDs to an ``up`` and ``in`` state, but you still
+don't see ``active + clean`` placement groups, you may have an
+``osd pool default size`` set to greater than ``2``.
+
+There are a few ways to address this situation. If you want to operate your
+cluster in an ``active + degraded`` state with two replicas, you can set the
+``osd pool default min size`` to ``2`` so that you can write objects in
+an ``active + degraded`` state. You may also set the ``osd pool default size``
+setting to ``2`` so that you only have two stored replicas (the original and
+one replica), in which case the cluster should achieve an ``active + clean``
+state.
+
+.. note:: You can make the changes at runtime. If you make the changes in
+ your Ceph configuration file, you may need to restart your cluster.
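+
+For example, to apply these settings at runtime to an existing pool
+(``mypool`` is a hypothetical pool name)::
+
+ ceph osd pool set mypool size 2
+ ceph osd pool set mypool min_size 2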
+
+
+Pool Size = 1
+-------------
+
+If you have the ``osd pool default size`` set to ``1``, you will only have
+one copy of the object. OSDs rely on other OSDs to tell them which objects
+they should have. If a first OSD has a copy of an object and there is no
+second copy, then no second OSD can tell the first OSD that it should have
+that copy. For each placement group mapped to the first OSD (see
+``ceph pg dump``), you can force the first OSD to notice the placement groups
+it needs by running::
+
+ ceph osd force-create-pg <pgid>
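+
+To list the placement groups mapped to a particular OSD without reading
+through the full ``ceph pg dump`` output, you can use (``osd.0`` here is
+only an example)::
+
+ ceph pg ls-by-osd 0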
+
+
+CRUSH Map Errors
+----------------
+
+Another candidate for placement groups remaining unclean involves errors
+in your CRUSH map.
+
+
+Stuck Placement Groups
+======================
+
+It is normal for placement groups to enter states like "degraded" or "peering"
+following a failure. These states typically indicate normal progression through
+the failure recovery process. However, if a placement group stays in one of
+these states for a long time, this may be an indication of a larger problem.
+For this reason, the monitor will warn when placement groups get "stuck" in a
+non-optimal state. Specifically, we check for:
+
+* ``inactive`` - The placement group has not been ``active`` for too long
+ (i.e., it hasn't been able to service read/write requests).
+
+* ``unclean`` - The placement group has not been ``clean`` for too long
+ (i.e., it hasn't been able to completely recover from a previous failure).
+
+* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
+ indicating that all nodes storing this placement group may be ``down``.
+
+You can explicitly list stuck placement groups with one of::
+
+ ceph pg dump_stuck stale
+ ceph pg dump_stuck inactive
+ ceph pg dump_stuck unclean
+
+For stuck ``stale`` placement groups, it is normally a matter of getting the
+right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
+groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
+stuck ``unclean`` placement groups, there is usually something preventing
+recovery from completing, like unfound objects (see
+:ref:`failures-osd-unfound`).
+
+
+
+.. _failures-osd-peering:
+
+Placement Group Down - Peering Failure
+======================================
+
+In certain cases, the ``ceph-osd`` `Peering` process can run into
+problems, preventing a PG from becoming active and usable. For
+example, ``ceph health`` might report::
+
+ ceph health detail
+ HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
+ ...
+ pg 0.5 is down+peering
+ pg 1.4 is down+peering
+ ...
+ osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
+
+We can query the cluster to determine exactly why the PG is marked ``down`` with::
+
+ ceph pg 0.5 query
+
+.. code-block:: javascript
+
+ { "state": "down+peering",
+ ...
+ "recovery_state": [
+ { "name": "Started\/Primary\/Peering\/GetInfo",
+ "enter_time": "2012-03-06 14:40:16.169679",
+ "requested_info_from": []},
+ { "name": "Started\/Primary\/Peering",
+ "enter_time": "2012-03-06 14:40:16.169659",
+ "probing_osds": [
+ 0,
+ 1],
+ "blocked": "peering is blocked due to down osds",
+ "down_osds_we_would_probe": [
+ 1],
+ "peering_blocked_by": [
+ { "osd": 1,
+ "current_lost_at": 0,
+ "comment": "starting or marking this osd lost may let us proceed"}]},
+ { "name": "Started",
+ "enter_time": "2012-03-06 14:40:16.169513"}
+ ]
+ }
+
+The ``recovery_state`` section tells us that peering is blocked due to
+down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd``
+and things will recover.
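+
+On systemd-based systems, starting the daemon might look like this (the
+exact unit name depends on how the cluster was deployed)::
+
+ systemctl start ceph-osd@1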
+
+Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
+failure), we can tell the cluster that it is ``lost`` and to cope as
+best it can.
+
+.. important:: This is dangerous in that the cluster cannot
+ guarantee that the other copies of the data are consistent
+ and up to date.
+
+To instruct Ceph to continue anyway::
+
+ ceph osd lost 1
+
+Recovery will proceed.
+
+
+.. _failures-osd-unfound:
+
+Unfound Objects
+===============
+
+Under certain combinations of failures Ceph may complain about
+``unfound`` objects::
+
+ ceph health detail
+ HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
+ pg 2.4 is active+degraded, 78 unfound
+
+This means that the storage cluster knows that some objects (or newer
+copies of existing objects) exist, but it hasn't found copies of them.
+One example of how this might come about for a PG whose data is on ceph-osds
+1 and 2:
+
+* 1 goes down
+* 2 handles some writes, alone
+* 1 comes up
+* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
+* Before the new objects are copied, 2 goes down.
+
+Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that
+has a copy. In this case, IO to those objects will block, and the
+cluster will hope that the failed node comes back soon; this is
+assumed to be preferable to returning an IO error to the user.
+
+First, you can identify which objects are unfound with::
+
+ ceph pg 2.4 list_missing [starting offset, in json]
+
+.. code-block:: javascript
+
+ { "offset": { "oid": "",
+ "key": "",
+ "snapid": 0,
+ "hash": 0,
+ "max": 0},
+ "num_missing": 0,
+ "num_unfound": 0,
+ "objects": [
+ { "oid": "object 1",
+ "key": "",
+ "hash": 0,
+ "max": 0 },
+ ...
+ ],
+ "more": 0}
+
+If there are too many objects to list in a single result, the ``more``
+field will be true and you can query for more. (Eventually the
+command line tool will hide this from you, but not yet.)
+
+Second, you can identify which OSDs have been probed or might contain
+data::
+
+ ceph pg 2.4 query
+
+.. code-block:: javascript
+
+ "recovery_state": [
+ { "name": "Started\/Primary\/Active",
+ "enter_time": "2012-03-06 15:15:46.713212",
+ "might_have_unfound": [
+ { "osd": 1,
+ "status": "osd is down"}]},
+
+In this case, for example, the cluster knows that ``osd.1`` might have
+data, but it is ``down``. The full range of possible states includes:
+
+* already probed
+* querying
+* OSD is down
+* not queried (yet)
+
+Sometimes it simply takes some time for the cluster to query possible
+locations.
+
+It is possible that there are other locations where the object can
+exist that are not listed. For example, if a ceph-osd is stopped and
+taken out of the cluster, the cluster fully recovers, and due to some
+future set of failures ends up with an unfound object, it won't
+consider the long-departed ceph-osd as a potential location.
+(This scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This, again, is
+possible given unusual combinations of failures that allow the cluster
+to learn about writes that were performed before the writes themselves
+are recovered. To mark the "unfound" objects as "lost"::
+
+ ceph pg 2.5 mark_unfound_lost revert|delete
+
+The final argument specifies how the cluster should deal with
+lost objects.
+
+The "delete" option will forget about them entirely.
+
+The "revert" option (not available for erasure coded pools) will
+either roll back to a previous version of the object or (if it was a
+new object) forget about it entirely. Use this with caution, as it
+may confuse applications that expected the object to exist.
+
+
+Homeless Placement Groups
+=========================
+
+It is possible for all OSDs that had copies of a given placement group to fail.
+If that's the case, that subset of the object store is unavailable, and the
+monitor will receive no status updates for those placement groups. To detect
+this situation, the monitor marks any placement group whose primary OSD has
+failed as ``stale``. For example::
+
+ ceph health
+ HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+
+You can identify which placement groups are ``stale``, and what the last OSDs to
+store them were, with::
+
+ ceph health detail
+ HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+ ...
+ pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+ ...
+ osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+ osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
+ osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
+
+If we want to get placement group 2.5 back online, for example, this tells us that
+it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
+daemons will allow the cluster to recover that placement group (and, presumably,
+many others).
+
+
+Only a Few OSDs Receive Data
+============================
+
+If you have many nodes in your cluster and only a few of them receive data,
+`check`_ the number of placement groups in your pool. Since placement groups get
+mapped to OSDs, a small number of placement groups will not distribute across
+your cluster. Try creating a pool with a placement group count that is a
+multiple of the number of OSDs. See `Placement Groups`_ for details. The default
+placement group count for pools is often too low to distribute data evenly,
+but you can change it `here`_.
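+
+For example, to raise the placement group count on an existing pool
+(``mypool`` and ``128`` are illustrative values only; see `Placement Groups`_
+for guidance on choosing a count)::
+
+ ceph osd pool set mypool pg_num 128
+ ceph osd pool set mypool pgp_num 128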
+
+
+Can't Write Data
+================
+
+If your cluster is up, but some OSDs are down and you cannot write data,
+check to ensure that you have the minimum number of OSDs running for the
+placement group. If you don't have the minimum number of OSDs running,
+Ceph will not allow you to write data because there is no guarantee
+that Ceph can replicate your data. See ``osd pool default min size``
+in the `Pool, PG and CRUSH Config Reference`_ for details.
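+
+You can check the value on an existing pool, and lower it if appropriate
+(``mypool`` is a placeholder name; lowering ``min_size`` trades safety for
+availability)::
+
+ ceph osd pool get mypool min_size
+ ceph osd pool set mypool min_size 1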
+
+
+PGs Inconsistent
+================
+
+If you see an ``active + clean + inconsistent`` state, this usually indicates
+an error detected during scrubbing. As always, we can identify the inconsistent
+placement group(s) with::
+
+ $ ceph health detail
+ HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
+ pg 0.6 is active+clean+inconsistent, acting [0,1,2]
+ 2 scrub errors
+
+Or if you prefer inspecting the output in a programmatic way::
+
+ $ rados list-inconsistent-pg rbd
+ ["0.6"]
+
+There is only one consistent state, but in the worst case we could find
+different inconsistencies in more than one object. If an object named ``foo``
+in PG ``0.6`` is truncated, we will have::
+
+ $ rados list-inconsistent-obj 0.6 --format=json-pretty
+
+.. code-block:: javascript
+
+ {
+ "epoch": 14,
+ "inconsistents": [
+ {
+ "object": {
+ "name": "foo",
+ "nspace": "",
+ "locator": "",
+ "snap": "head",
+ "version": 1
+ },
+ "errors": [
+ "data_digest_mismatch",
+ "size_mismatch"
+ ],
+ "union_shard_errors": [
+ "data_digest_mismatch_oi",
+ "size_mismatch_oi"
+ ],
+ "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
+ "shards": [
+ {
+ "osd": 0,
+ "errors": [],
+ "size": 968,
+ "omap_digest": "0xffffffff",
+ "data_digest": "0xe978e67f"
+ },
+ {
+ "osd": 1,
+ "errors": [],
+ "size": 968,
+ "omap_digest": "0xffffffff",
+ "data_digest": "0xe978e67f"
+ },
+ {
+ "osd": 2,
+ "errors": [
+ "data_digest_mismatch_oi",
+ "size_mismatch_oi"
+ ],
+ "size": 0,
+ "omap_digest": "0xffffffff",
+ "data_digest": "0xffffffff"
+ }
+ ]
+ }
+ ]
+ }
+
+In this case, we can learn from the output:
+
+* The only inconsistent object is named ``foo``, and it is its head that has
+ inconsistencies.
+* The inconsistencies fall into two categories:
+
+ * ``errors``: these errors indicate inconsistencies between shards without a
+ determination of which shard(s) are bad. Check for the ``errors`` in the
+ `shards` array, if available, to pinpoint the problem.
+
+ * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is
+ different from the ones of OSD.0 and OSD.1
+ * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while
+ the size reported by OSD.0 and OSD.1 is 968.
+ * ``union_shard_errors``: the union of all shard specific ``errors`` in
+ ``shards`` array. The ``errors`` are set for the given shard that has the
+ problem. They include errors like ``read_error``. The ``errors`` ending in
+ ``oi`` indicate a comparison with ``selected_object_info``. Look at the
+ ``shards`` array to determine which shard has which error(s).
+
+ * ``data_digest_mismatch_oi``: the digest stored in the object-info is not
+ ``0xffffffff``, which is calculated from the shard read from OSD.2
+ * ``size_mismatch_oi``: the size stored in the object-info is different
+ from the one read from OSD.2. The latter is 0.
+
+You can repair the inconsistent placement group by executing::
+
+ ceph pg repair {placement-group-ID}
+
+This command overwrites the `bad` copies with the `authoritative` ones. In most cases,
+Ceph is able to choose authoritative copies from all available replicas using
+some predefined criteria. But this does not always work. For example, the stored
+data digest could be missing, and the calculated digest will be ignored when
+choosing the authoritative copies. So, please use the above command with caution.
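+
+To re-verify the placement group before or after a repair, you can trigger
+a fresh deep scrub of it::
+
+ ceph pg deep-scrub 0.6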
+
+If ``read_error`` is listed in the ``errors`` attribute of a shard, the
+inconsistency is likely due to disk errors. You might want to check your disk
+used by that OSD.
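+
+For example, you might inspect the kernel log and the SMART health data of
+the disk backing that OSD (a sketch; the device name is a placeholder, and
+``smartctl`` comes from the separate smartmontools package)::
+
+ dmesg -T | grep sdb     # look for kernel I/O errors on the device
+ smartctl -a /dev/sdb    # check the drive's SMART health data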
+
+If you receive ``active + clean + inconsistent`` states periodically due to
+clock skew, you may consider configuring your `NTP`_ daemons on your
+monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
+`Clock Settings`_ for additional details.
+
+
+Erasure Coded PGs are not active+clean
+======================================
+
+When CRUSH fails to find enough OSDs to map to a PG, it will show
+``2147483647``, which is ``ITEM_NONE``, meaning ``no OSD found``. For instance::
+
+ [2,1,6,0,5,8,2147483647,7,4]
+
+Not enough OSDs
+---------------
+
+If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
+9, that is what it will show. You can either create another erasure
+coded pool that requires fewer OSDs::
+
+ ceph osd erasure-code-profile set myprofile k=5 m=3
+ ceph osd pool create erasurepool 16 16 erasure myprofile
+
+or add new OSDs, and the PG will automatically use them.
+
+CRUSH constraints cannot be satisfied
+-------------------------------------
+
+If the cluster has enough OSDs, it is possible that the CRUSH ruleset
+imposes constraints that cannot be satisfied. If there are 10 OSDs on
+two hosts and the CRUSH ruleset requires that no two OSDs from the
+same host are used in the same PG, the mapping may fail because only
+two OSDs will be found. You can check the constraint by displaying the
+ruleset::
+
+ $ ceph osd crush rule ls
+ [
+ "replicated_ruleset",
+ "erasurepool"]
+ $ ceph osd crush rule dump erasurepool
+ { "rule_id": 1,
+ "rule_name": "erasurepool",
+ "ruleset": 1,
+ "type": 3,
+ "min_size": 3,
+ "max_size": 20,
+ "steps": [
+ { "op": "take",
+ "item": -1,
+ "item_name": "default"},
+ { "op": "chooseleaf_indep",
+ "num": 0,
+ "type": "host"},
+ { "op": "emit"}]}
+
+
+You can resolve the problem by creating a new pool in which PGs are allowed
+to have OSDs residing on the same host with::
+
+ ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
+ ceph osd pool create erasurepool 16 16 erasure myprofile
+
+CRUSH gives up too soon
+-----------------------
+
+If the Ceph cluster has just enough OSDs to map the PG (for instance a
+cluster with a total of 9 OSDs and an erasure coded pool that requires
+9 OSDs per PG), it is possible that CRUSH gives up before finding a
+mapping. It can be resolved by:
+
+* lowering the erasure coded pool requirements to use fewer OSDs per PG
+  (this requires the creation of another pool, as erasure code profiles
+  cannot be dynamically modified).
+
+* adding more OSDs to the cluster (this does not require the erasure
+  coded pool to be modified; it will become clean automatically).
+
+* using a hand-made CRUSH ruleset that tries more times to find a good
+  mapping. This can be done by setting ``set_choose_tries`` to a value
+  greater than the default.
+
+You should first verify the problem with ``crushtool`` after
+extracting the crushmap from the cluster, so that your experiments do not
+modify the Ceph cluster and work only on local files::
+
+ $ ceph osd crush rule dump erasurepool
+ { "rule_name": "erasurepool",
+ "ruleset": 1,
+ "type": 3,
+ "min_size": 3,
+ "max_size": 20,
+ "steps": [
+ { "op": "take",
+ "item": -1,
+ "item_name": "default"},
+ { "op": "chooseleaf_indep",
+ "num": 0,
+ "type": "host"},
+ { "op": "emit"}]}
+ $ ceph osd getcrushmap > crush.map
+ got crush map from osdmap epoch 13
+ $ crushtool -i crush.map --test --show-bad-mappings \
+ --rule 1 \
+ --num-rep 9 \
+ --min-x 1 --max-x $((1024 * 1024))
+ bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
+ bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
+ bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
+
+Where ``--num-rep`` is the number of OSDs the erasure code crush
+ruleset needs, ``--rule`` is the value of the ``ruleset`` field
+displayed by ``ceph osd crush rule dump``. The test will try mapping
+one million values (i.e. the range defined by ``[--min-x,--max-x]``)
+and must display at least one bad mapping. If it outputs nothing it
+means all mappings are successful and you can stop right there: the
+problem is elsewhere.
+
+The crush ruleset can be edited by decompiling the crush map::
+
+ $ crushtool --decompile crush.map > crush.txt
+
+and adding the following line to the ruleset::
+
+ step set_choose_tries 100
+
+The relevant part of the ``crush.txt`` file should look something
+like::
+
+ rule erasurepool {
+ ruleset 1
+ type erasure
+ min_size 3
+ max_size 20
+ step set_chooseleaf_tries 5
+ step set_choose_tries 100
+ step take default
+ step chooseleaf indep 0 type host
+ step emit
+ }
+
+It can then be compiled and tested again::
+
+ $ crushtool --compile crush.txt -o better-crush.map
+
+When all mappings succeed, a histogram of the number of tries that
+were necessary to find all of them can be displayed with the
+``--show-choose-tries`` option of ``crushtool``::
+
+ $ crushtool -i better-crush.map --test --show-bad-mappings \
+ --show-choose-tries \
+ --rule 1 \
+ --num-rep 9 \
+ --min-x 1 --max-x $((1024 * 1024))
+ ...
+ 11: 42
+ 12: 44
+ 13: 54
+ 14: 45
+ 15: 35
+ 16: 34
+ 17: 30
+ 18: 25
+ 19: 19
+ 20: 22
+ 21: 20
+ 22: 17
+ 23: 13
+ 24: 16
+ 25: 13
+ 26: 11
+ 27: 11
+ 28: 13
+ 29: 11
+ 30: 10
+ 31: 6
+ 32: 5
+ 33: 10
+ 34: 3
+ 35: 7
+ 36: 5
+ 37: 2
+ 38: 5
+ 39: 5
+ 40: 2
+ 41: 5
+ 42: 4
+ 43: 1
+ 44: 2
+ 45: 2
+ 46: 3
+ 47: 1
+ 48: 0
+ ...
+ 102: 0
+ 103: 1
+ 104: 0
+ ...
+
+It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, and so on. The highest
+number of tries is the minimum value of ``set_choose_tries`` that prevents bad
+mappings (i.e. 103 in the above output, because it did not take more than 103
+tries for any PG to be mapped).
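+
+Once you have found a value of ``set_choose_tries`` that avoids bad mappings,
+the modified map can be injected back into the cluster::
+
+ ceph osd setcrushmap -i better-crush.map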
+
+.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
+.. _here: ../../configuration/pool-pg-config-ref
+.. _Placement Groups: ../../operations/placement-groups
+.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
+.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol
+.. _The Network Time Protocol: http://www.ntp.org/
+.. _Clock Settings: ../../configuration/mon-config-ref/#clock
+
+