summaryrefslogtreecommitdiffstats
path: root/src/ceph/doc/rados/configuration/osd-config-ref.rst
diff options
context:
space:
mode:
Diffstat (limited to 'src/ceph/doc/rados/configuration/osd-config-ref.rst')
-rw-r--r--src/ceph/doc/rados/configuration/osd-config-ref.rst1105
1 files changed, 1105 insertions, 0 deletions
diff --git a/src/ceph/doc/rados/configuration/osd-config-ref.rst b/src/ceph/doc/rados/configuration/osd-config-ref.rst
new file mode 100644
index 0000000..fae7078
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/osd-config-ref.rst
@@ -0,0 +1,1105 @@
+======================
+ OSD Config Reference
+======================
+
+.. index:: OSD; configuration
+
+You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
+Daemons can use the default values and a very minimal configuration. A minimal
+Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
+uses default values for nearly everything else.
+
+Ceph OSD Daemons are numerically identified in incremental fashion, beginning
+with ``0`` using the following convention. ::
+
+ osd.0
+ osd.1
+ osd.2
+
+In a configuration file, you may specify settings for all Ceph OSD Daemons in
+the cluster by adding configuration settings to the ``[osd]`` section of your
+configuration file. To add settings directly to a specific Ceph OSD Daemon
+(e.g., ``host``), enter it in an OSD-specific section of your configuration
+file. For example:
+
+.. code-block:: ini
+
+ [osd]
+ osd journal size = 1024
+
+ [osd.0]
+ host = osd-host-a
+
+ [osd.1]
+ host = osd-host-b
+
+
+.. index:: OSD; config settings
+
+General Settings
+================
+
+The following settings provide an Ceph OSD Daemon's ID, and determine paths to
+data and journals. Ceph deployment scripts typically generate the UUID
+automatically. We **DO NOT** recommend changing the default paths for data or
+journals, as it makes it more problematic to troubleshoot Ceph later.
+
+The journal size should be at least twice the product of the expected drive
+speed multiplied by ``filestore max sync interval``. However, the most common
+practice is to partition the journal drive (often an SSD), and mount it such
+that Ceph uses the entire partition for the journal.
+
+
+``osd uuid``
+
+:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
+:Type: UUID
+:Default: The UUID.
+:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
+ applies to the entire cluster.
+
+
+``osd data``
+
+:Description: The path to the OSDs data. You must create the directory when
+ deploying Ceph. You should mount a drive for OSD data at this
+ mount point. We do not recommend changing the default.
+
+:Type: String
+:Default: ``/var/lib/ceph/osd/$cluster-$id``
+
+
+``osd max write size``
+
+:Description: The maximum size of a write in megabytes.
+:Type: 32-bit Integer
+:Default: ``90``
+
+
+``osd client message size cap``
+
+:Description: The largest client data message allowed in memory.
+:Type: 64-bit Unsigned Integer
+:Default: 500MB default. ``500*1024L*1024L``
+
+
+``osd class dir``
+
+:Description: The class path for RADOS class plug-ins.
+:Type: String
+:Default: ``$libdir/rados-classes``
+
+
+.. index:: OSD; file system
+
+File System Settings
+====================
+Ceph builds and mounts file systems which are used for Ceph OSDs.
+
+``osd mkfs options {fs-type}``
+
+:Description: Options used when creating a new Ceph OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``-f -i 2048``
+:Default for other file systems: {empty string}
+
+For example::
+ ``osd mkfs options xfs = -f -d agcount=24``
+
+``osd mount options {fs-type}``
+
+:Description: Options used when mounting a Ceph OSD of type {fs-type}.
+
+:Type: String
+:Default for xfs: ``rw,noatime,inode64``
+:Default for other file systems: ``rw, noatime``
+
+For example::
+ ``osd mount options xfs = rw, noatime, inode64, logbufs=8``
+
+
+.. index:: OSD; journal settings
+
+Journal Settings
+================
+
+By default, Ceph expects that you will store an Ceph OSD Daemons journal with
+the following path::
+
+ /var/lib/ceph/osd/$cluster-$id/journal
+
+Without performance optimization, Ceph stores the journal on the same disk as
+the Ceph OSD Daemons data. An Ceph OSD Daemon optimized for performance may use
+a separate disk to store journal data (e.g., a solid state drive delivers high
+performance journaling).
+
+Ceph's default ``osd journal size`` is 0, so you will need to set this in your
+``ceph.conf`` file. A journal size should find the product of the ``filestore
+max sync interval`` and the expected throughput, and multiply the product by
+two (2)::
+
+ osd journal size = {2 * (expected throughput * filestore max sync interval)}
+
+The expected throughput number should include the expected disk throughput
+(i.e., sustained data transfer rate), and network throughput. For example,
+a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()``
+of the disk and network throughput should provide a reasonable expected
+throughput. Some users just start off with a 10GB journal size. For
+example::
+
+ osd journal size = 10000
+
+
+``osd journal``
+
+:Description: The path to the OSD's journal. This may be a path to a file or a
+ block device (such as a partition of an SSD). If it is a file,
+ you must create the directory to contain it. We recommend using a
+ drive separate from the ``osd data`` drive.
+
+:Type: String
+:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``
+
+
+``osd journal size``
+
+:Description: The size of the journal in megabytes. If this is 0, and the
+ journal is a block device, the entire block device is used.
+ Since v0.54, this is ignored if the journal is a block device,
+ and the entire block device is used.
+
+:Type: 32-bit Integer
+:Default: ``5120``
+:Recommended: Begin with 1GB. Should be at least twice the product of the
+ expected speed multiplied by ``filestore max sync interval``.
+
+
+See `Journal Config Reference`_ for additional details.
+
+
+Monitor OSD Interaction
+=======================
+
+Ceph OSD Daemons check each other's heartbeats and report to monitors
+periodically. Ceph can use default values in many cases. However, if your
+network has latency issues, you may need to adopt longer intervals. See
+`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.
+
+
+Data Placement
+==============
+
+See `Pool & PG Config Reference`_ for details.
+
+
+.. index:: OSD; scrubbing
+
+Scrubbing
+=========
+
+In addition to making multiple copies of objects, Ceph insures data integrity by
+scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
+object storage layer. For each placement group, Ceph generates a catalog of all
+objects and compares each primary object and its replicas to ensure that no
+objects are missing or mismatched. Light scrubbing (daily) checks the object
+size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
+to ensure data integrity.
+
+Scrubbing is important for maintaining data integrity, but it can reduce
+performance. You can adjust the following settings to increase or decrease
+scrubbing operations.
+
+
+``osd max scrubs``
+
+:Description: The maximum number of simultaneous scrub operations for
+ a Ceph OSD Daemon.
+
+:Type: 32-bit Int
+:Default: ``1``
+
+``osd scrub begin hour``
+
+:Description: The time of day for the lower bound when a scheduled scrub can be
+ performed.
+:Type: Integer in the range of 0 to 24
+:Default: ``0``
+
+
+``osd scrub end hour``
+
+:Description: The time of day for the upper bound when a scheduled scrub can be
+ performed. Along with ``osd scrub begin hour``, they define a time
+ window, in which the scrubs can happen. But a scrub will be performed
+ no matter the time window allows or not, as long as the placement
+ group's scrub interval exceeds ``osd scrub max interval``.
+:Type: Integer in the range of 0 to 24
+:Default: ``24``
+
+
+``osd scrub during recovery``
+
+:Description: Allow scrub during recovery. Setting this to ``false`` will disable
+ scheduling new scrub (and deep--scrub) while there is active recovery.
+ Already running scrubs will be continued. This might be useful to reduce
+ load on busy clusters.
+:Type: Boolean
+:Default: ``true``
+
+
+``osd scrub thread timeout``
+
+:Description: The maximum time in seconds before timing out a scrub thread.
+:Type: 32-bit Integer
+:Default: ``60``
+
+
+``osd scrub finalize thread timeout``
+
+:Description: The maximum time in seconds before timing out a scrub finalize
+ thread.
+
+:Type: 32-bit Integer
+:Default: ``60*10``
+
+
+``osd scrub load threshold``
+
+:Description: The maximum load. Ceph will not scrub when the system load
+ (as defined by ``getloadavg()``) is higher than this number.
+ Default is ``0.5``.
+
+:Type: Float
+:Default: ``0.5``
+
+
+``osd scrub min interval``
+
+:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon
+ when the Ceph Storage Cluster load is low.
+
+:Type: Float
+:Default: Once per day. ``60*60*24``
+
+
+``osd scrub max interval``
+
+:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
+ irrespective of cluster load.
+
+:Type: Float
+:Default: Once per week. ``7*60*60*24``
+
+
+``osd scrub chunk min``
+
+:Description: The minimal number of object store chunks to scrub during single operation.
+ Ceph blocks writes to single chunk during scrub.
+
+:Type: 32-bit Integer
+:Default: 5
+
+
+``osd scrub chunk max``
+
+:Description: The maximum number of object store chunks to scrub during single operation.
+
+:Type: 32-bit Integer
+:Default: 25
+
+
+``osd scrub sleep``
+
+:Description: Time to sleep before scrubbing next group of chunks. Increasing this value will slow
+ down whole scrub operation while client operations will be less impacted.
+
+:Type: Float
+:Default: 0
+
+
+``osd deep scrub interval``
+
+:Description: The interval for "deep" scrubbing (fully reading all data). The
+ ``osd scrub load threshold`` does not affect this setting.
+
+:Type: Float
+:Default: Once per week. ``60*60*24*7``
+
+
+``osd scrub interval randomize ratio``
+
+:Description: Add a random delay to ``osd scrub min interval`` when scheduling
+ the next scrub job for a placement group. The delay is a random
+ value less than ``osd scrub min interval`` \*
+ ``osd scrub interval randomized ratio``. So the default setting
+ practically randomly spreads the scrubs out in the allowed time
+ window of ``[1, 1.5]`` \* ``osd scrub min interval``.
+:Type: Float
+:Default: ``0.5``
+
+``osd deep scrub stride``
+
+:Description: Read size when doing a deep scrub.
+:Type: 32-bit Integer
+:Default: 512 KB. ``524288``
+
+
+.. index:: OSD; operations settings
+
+Operations
+==========
+
+Operations settings allow you to configure the number of threads for servicing
+requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
+By default, Ceph uses two threads with a 30 second timeout and a 30 second
+complaint time if an operation doesn't complete within those time parameters.
+You can set operations priority weights between client operations and
+recovery operations to ensure optimal performance during recovery.
+
+
+``osd op threads``
+
+:Description: The number of threads to service Ceph OSD Daemon operations.
+ Set to ``0`` to disable it. Increasing the number may increase
+ the request processing rate.
+
+:Type: 32-bit Integer
+:Default: ``2``
+
+
+``osd op queue``
+
+:Description: This sets the type of queue to be used for prioritizing ops
+ in the OSDs. Both queues feature a strict sub-queue which is
+ dequeued before the normal queue. The normal queue is different
+ between implementations. The original PrioritizedQueue (``prio``) uses a
+ token bucket system which when there are sufficient tokens will
+ dequeue high priority queues first. If there are not enough
+ tokens available, queues are dequeued low priority to high priority.
+ The WeightedPriorityQueue (``wpq``) dequeues all priorities in
+ relation to their priorities to prevent starvation of any queue.
+ WPQ should help in cases where a few OSDs are more overloaded
+ than others. The new mClock based OpClassQueue
+ (``mclock_opclass``) prioritizes operations based on which class
+ they belong to (recovery, scrub, snaptrim, client op, osd subop).
+ And, the mClock based ClientQueue (``mclock_client``) also
+ incorporates the client identifier in order to promote fairness
+ between clients. See `QoS Based on mClock`_. Requires a restart.
+
+:Type: String
+:Valid Choices: prio, wpq, mclock_opclass, mclock_client
+:Default: ``prio``
+
+
+``osd op queue cut off``
+
+:Description: This selects which priority ops will be sent to the strict
+ queue verses the normal queue. The ``low`` setting sends all
+ replication ops and higher to the strict queue, while the ``high``
+ option sends only replication acknowledgement ops and higher to
+ the strict queue. Setting this to ``high`` should help when a few
+ OSDs in the cluster are very busy especially when combined with
+ ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
+ handling replication traffic could starve primary client traffic
+ on these OSDs without these settings. Requires a restart.
+
+:Type: String
+:Valid Choices: low, high
+:Default: ``low``
+
+
+``osd client op priority``
+
+:Description: The priority set for client operations. It is relative to
+ ``osd recovery op priority``.
+
+:Type: 32-bit Integer
+:Default: ``63``
+:Valid Range: 1-63
+
+
+``osd recovery op priority``
+
+:Description: The priority set for recovery operations. It is relative to
+ ``osd client op priority``.
+
+:Type: 32-bit Integer
+:Default: ``3``
+:Valid Range: 1-63
+
+
+``osd scrub priority``
+
+:Description: The priority set for scrub operations. It is relative to
+ ``osd client op priority``.
+
+:Type: 32-bit Integer
+:Default: ``5``
+:Valid Range: 1-63
+
+
+``osd snap trim priority``
+
+:Description: The priority set for snap trim operations. It is relative to
+ ``osd client op priority``.
+
+:Type: 32-bit Integer
+:Default: ``5``
+:Valid Range: 1-63
+
+
+``osd op thread timeout``
+
+:Description: The Ceph OSD Daemon operation thread timeout in seconds.
+:Type: 32-bit Integer
+:Default: ``15``
+
+
+``osd op complaint time``
+
+:Description: An operation becomes complaint worthy after the specified number
+ of seconds have elapsed.
+
+:Type: Float
+:Default: ``30``
+
+
+``osd disk threads``
+
+:Description: The number of disk threads, which are used to perform background
+ disk intensive OSD operations such as scrubbing and snap
+ trimming.
+
+:Type: 32-bit Integer
+:Default: ``1``
+
+``osd disk thread ioprio class``
+
+:Description: Warning: it will only be used if both ``osd disk thread
+ ioprio class`` and ``osd disk thread ioprio priority`` are
+ set to a non default value. Sets the ioprio_set(2) I/O
+ scheduling ``class`` for the disk thread. Acceptable
+ values are ``idle``, ``be`` or ``rt``. The ``idle``
+ class means the disk thread will have lower priority
+ than any other thread in the OSD. This is useful to slow
+ down scrubbing on an OSD that is busy handling client
+ operations. ``be`` is the default and is the same
+ priority as all other threads in the OSD. ``rt`` means
+ the disk thread will have precendence over all other
+ threads in the OSD. Note: Only works with the Linux Kernel
+ CFQ scheduler. Since Jewel scrubbing is no longer carried
+ out by the disk iothread, see osd priority options instead.
+:Type: String
+:Default: the empty string
+
+``osd disk thread ioprio priority``
+
+:Description: Warning: it will only be used if both ``osd disk thread
+ ioprio class`` and ``osd disk thread ioprio priority`` are
+ set to a non default value. It sets the ioprio_set(2)
+ I/O scheduling ``priority`` of the disk thread ranging
+ from 0 (highest) to 7 (lowest). If all OSDs on a given
+ host were in class ``idle`` and compete for I/O
+ (i.e. due to controller congestion), it can be used to
+ lower the disk thread priority of one OSD to 7 so that
+ another OSD with priority 0 can have priority.
+ Note: Only works with the Linux Kernel CFQ scheduler.
+:Type: Integer in the range of 0 to 7 or -1 if not to be used.
+:Default: ``-1``
+
+``osd op history size``
+
+:Description: The maximum number of completed operations to track.
+:Type: 32-bit Unsigned Integer
+:Default: ``20``
+
+
+``osd op history duration``
+
+:Description: The oldest completed operation to track.
+:Type: 32-bit Unsigned Integer
+:Default: ``600``
+
+
+``osd op log threshold``
+
+:Description: How many operations logs to display at once.
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+QoS Based on mClock
+-------------------
+
+Ceph's use of mClock is currently in the experimental phase and should
+be approached with an exploratory mindset.
+
+Core Concepts
+`````````````
+
+The QoS support of Ceph is implemented using a queueing scheduler
+based on `the dmClock algorithm`_. This algorithm allocates the I/O
+resources of the Ceph cluster in proportion to weights, and enforces
+the constraits of minimum reservation and maximum limitation, so that
+the services can compete for the resources fairly. Currently the
+*mclock_opclass* operation queue divides Ceph services involving I/O
+resources into following buckets:
+
+- client op: the iops issued by client
+- osd subop: the iops issued by primary OSD
+- snap trim: the snap trimming related requests
+- pg recovery: the recovery related requests
+- pg scrub: the scrub related requests
+
+And the resources are partitioned using following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity or system
+ oversubscribed.
+
+In Ceph operations are graded with "cost". And the resources allocated
+for serving various services are consumed by these "costs". So, for
+example, the more reservation a services has, the more resource it is
+guaranteed to possess, as long as it requires. Assuming there are 2
+services: recovery and client ops:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that the recovery won't get more than 5
+requests per second serviced, even if it requires so (see CURRENT
+IMPLEMENTATION NOTE below), and no other services are competing with
+it. But if the clients start to issue large amount of I/O requests,
+neither will they exhaust all the I/O resources. 1 request per second
+is always allocated for recovery jobs as long as there are any such
+requests. So the recovery jobs won't be starved even in a cluster with
+high load. And in the meantime, the client ops can enjoy a larger
+portion of the I/O resource, because its weight is "9", while its
+competitor "1". In the case of client ops, it is not clamped by the
+limit setting, so it can make use of all the resources if there is no
+recovery ongoing.
+
+Along with *mclock_opclass* another mclock operation queue named
+*mclock_client* is available. It divides operations based on category
+but also divides them based on the client making the request. This
+helps not only manage the distribution of resources spent on different
+classes of operations but also tries to insure fairness among clients.
+
+CURRENT IMPLEMENTATION NOTE: the current experimental implementation
+does not enforce the limit values. As a first approximation we decided
+not to prevent operations that would otherwise enter the operation
+sequencer from doing so.
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per
+second. The weight, however, does not technically have a unit and the
+weights are relative to one another. So if one class of requests has a
+weight of 1 and another a weight of 9, then the latter class of
+requests should get 9 executed at a 9 to 1 ratio as the first class.
+However that will only happen once the reservations are met and those
+values include the operations executed under the reservation phase.
+
+Even though the weights do not have units, one must be careful in
+choosing their values due how the algorithm assigns weight tags to
+requests. If the weight is *W*, then for a given class of requests,
+the next one that comes in will have a weight tag of *1/W* plus the
+previous weight tag or the current time, whichever is larger. That
+means if *W* is sufficiently large and therefore *1/W* is sufficiently
+small, the calculated tag may never be assigned as it will get a value
+of the current time. The ultimate lesson is that values for weight
+should not be too large. They should be under the number of requests
+one expects to ve serviced each second.
+
+Caveats
+```````
+
+There are some factors that can reduce the impact of the mClock op
+queues within Ceph. First, requests to an OSD are sharded by their
+placement group identifier. Each shard has its own mClock queue and
+these queues neither interact nor share information among them. The
+number of shards can be controlled with the configuration options
+``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
+``osd_op_num_shards_ssd``. A lower number of shards will increase the
+impact of the mClock queues, but may have other deliterious effects.
+
+Second, requests are transferred from the operation queue to the
+operation sequencer, in which they go through the phases of
+execution. The operation queue is where mClock resides and mClock
+determines the next op to transfer to the operation sequencer. The
+number of operations allowed in the operation sequencer is a complex
+issue. In general we want to keep enough operations in the sequencer
+so it's always getting work done on some operations while it's waiting
+for disk and network access to complete on other operations. On the
+other hand, once an operation is transferred to the operation
+sequencer, mClock no longer has control over it. Therefore to maximize
+the impact of mClock, we want to keep as few operations in the
+operation sequencer as possible. So we have an inherent tension.
+
+The configuration options that influence the number of operations in
+the operation sequencer are ``bluestore_throttle_bytes``,
+``bluestore_throttle_deferred_bytes``,
+``bluestore_throttle_cost_per_io``,
+``bluestore_throttle_cost_per_io_hdd``, and
+``bluestore_throttle_cost_per_io_ssd``.
+
+A third factor that affects the impact of the mClock algorithm is that
+we're using a distributed system, where requests are made to multiple
+OSDs and each OSD has (can have) multiple shards. Yet we're currently
+using the mClock algorithm, which is not distributed (note: dmClock is
+the distributed version of mClock).
+
+Various organizations and individuals are currently experimenting with
+mClock as it exists in this code base along with their modifications
+to the code base. We hope you'll share you're experiences with your
+mClock and dmClock experiments in the ceph-devel mailing list.
+
+
+``osd push per object cost``
+
+:Description: the overhead for serving a push op
+
+:Type: Unsigned Integer
+:Default: 1000
+
+``osd recovery max chunk``
+
+:Description: the maximum total size of data chunks a recovery op can carry.
+
+:Type: Unsigned Integer
+:Default: 8 MiB
+
+
+``osd op queue mclock client op res``
+
+:Description: the reservation of client op.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock client op wgt``
+
+:Description: the weight of client op.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock client op lim``
+
+:Description: the limit of client op.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop res``
+
+:Description: the reservation of osd subop.
+
+:Type: Float
+:Default: 1000.0
+
+
+``osd op queue mclock osd subop wgt``
+
+:Description: the weight of osd subop.
+
+:Type: Float
+:Default: 500.0
+
+
+``osd op queue mclock osd subop lim``
+
+:Description: the limit of osd subop.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap res``
+
+:Description: the reservation of snap trimming.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock snap wgt``
+
+:Description: the weight of snap trimming.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock snap lim``
+
+:Description: the limit of snap trimming.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock recov res``
+
+:Description: the reservation of recovery.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock recov wgt``
+
+:Description: the weight of recovery.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock recov lim``
+
+:Description: the limit of recovery.
+
+:Type: Float
+:Default: 0.001
+
+
+``osd op queue mclock scrub res``
+
+:Description: the reservation of scrub jobs.
+
+:Type: Float
+:Default: 0.0
+
+
+``osd op queue mclock scrub wgt``
+
+:Description: the weight of scrub jobs.
+
+:Type: Float
+:Default: 1.0
+
+
+``osd op queue mclock scrub lim``
+
+:Description: the limit of scrub jobs.
+
+:Type: Float
+:Default: 0.001
+
+.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
+
+
+.. index:: OSD; backfilling
+
+Backfilling
+===========
+
+When you add or remove Ceph OSD Daemons to a cluster, the CRUSH algorithm will
+want to rebalance the cluster by moving placement groups to or from Ceph OSD
+Daemons to restore the balance. The process of migrating placement groups and
+the objects they contain can reduce the cluster's operational performance
+considerably. To maintain operational performance, Ceph performs this migration
+with 'backfilling', which allows Ceph to set backfill operations to a lower
+priority than requests to read or write data.
+
+
+``osd max backfills``
+
+:Description: The maximum number of backfills allowed to or from a single OSD.
+:Type: 64-bit Unsigned Integer
+:Default: ``1``
+
+
+``osd backfill scan min``
+
+:Description: The minimum number of objects per backfill scan.
+
+:Type: 32-bit Integer
+:Default: ``64``
+
+
+``osd backfill scan max``
+
+:Description: The maximum number of objects per backfill scan.
+
+:Type: 32-bit Integer
+:Default: ``512``
+
+
+``osd backfill retry interval``
+
+:Description: The number of seconds to wait before retrying backfill requests.
+:Type: Double
+:Default: ``10.0``
+
+.. index:: OSD; osdmap
+
+OSD Map
+=======
+
+OSD maps reflect the OSD daemons operating in the cluster. Over time, the
+number of map epochs increases. Ceph provides some settings to ensure that
+Ceph performs well as the OSD map grows larger.
+
+
+``osd map dedup``
+
+:Description: Enable removing duplicates in the OSD map.
+:Type: Boolean
+:Default: ``true``
+
+
+``osd map cache size``
+
+:Description: The number of OSD maps to keep cached.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+``osd map cache bl size``
+
+:Description: The size of the in-memory OSD map cache in OSD daemons.
+:Type: 32-bit Integer
+:Default: ``50``
+
+
+``osd map cache bl inc size``
+
+:Description: The size of the in-memory OSD map cache incrementals in
+ OSD daemons.
+
+:Type: 32-bit Integer
+:Default: ``100``
+
+
+``osd map message max``
+
+:Description: The maximum map entries allowed per MOSDMap message.
+:Type: 32-bit Integer
+:Default: ``100``
+
+
+
+.. index:: OSD; recovery
+
+Recovery
+========
+
+When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD
+begins peering with other Ceph OSD Daemons before writes can occur. See
+`Monitoring OSDs and PGs`_ for details.
+
+If a Ceph OSD Daemon crashes and comes back online, usually it will be out of
+sync with other Ceph OSD Daemons containing more recent versions of objects in
+the placement groups. When this happens, the Ceph OSD Daemon goes into recovery
+mode and seeks to get the latest copy of the data and bring its map back up to
+date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects
+and placement groups may be significantly out of date. Also, if a failure domain
+went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at
+the same time. This can make the recovery process time consuming and resource
+intensive.
+
+To maintain operational performance, Ceph performs recovery with limitations on
+the number recovery requests, threads and object chunk sizes which allows Ceph
+perform well in a degraded state.
+
+
+``osd recovery delay start``
+
+:Description: After peering completes, Ceph will delay for the specified number
+ of seconds before starting to recover objects.
+
+:Type: Float
+:Default: ``0``
+
+
+``osd recovery max active``
+
+:Description: The number of active recovery requests per OSD at one time. More
+ requests will accelerate recovery, but the requests places an
+ increased load on the cluster.
+
+:Type: 32-bit Integer
+:Default: ``3``
+
+
+``osd recovery max chunk``
+
+:Description: The maximum size of a recovered chunk of data to push.
+:Type: 64-bit Unsigned Integer
+:Default: ``8 << 20``
+
+
+``osd recovery max single start``
+
+:Description: The maximum number of recovery operations per OSD that will be
+ newly started when an OSD is recovering.
+:Type: 64-bit Unsigned Integer
+:Default: ``1``
+
+
+``osd recovery thread timeout``
+
+:Description: The maximum time in seconds before timing out a recovery thread.
+:Type: 32-bit Integer
+:Default: ``30``
+
+
+``osd recover clone overlap``
+
+:Description: Preserves clone overlap during recovery. Should always be set
+ to ``true``.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``osd recovery sleep``
+
+:Description: Time in seconds to sleep before next recovery or backfill op.
+ Increasing this value will slow down recovery operation while
+ client operations will be less impacted.
+
+:Type: Float
+:Default: ``0``
+
+
+``osd recovery sleep hdd``
+
+:Description: Time in seconds to sleep before next recovery or backfill op
+ for HDDs.
+
+:Type: Float
+:Default: ``0.1``
+
+
+``osd recovery sleep ssd``
+
+:Description: Time in seconds to sleep before next recovery or backfill op
+ for SSDs.
+
+:Type: Float
+:Default: ``0``
+
+
+``osd recovery sleep hybrid``
+
+:Description: Time in seconds to sleep before next recovery or backfill op
+ when osd data is on HDD and osd journal is on SSD.
+
+:Type: Float
+:Default: ``0.025``
+
+Tiering
+=======
+
+``osd agent max ops``
+
+:Description: The maximum number of simultaneous flushing ops per tiering agent
+ in the high speed mode.
+:Type: 32-bit Integer
+:Default: ``4``
+
+
+``osd agent max low ops``
+
+:Description: The maximum number of simultaneous flushing ops per tiering agent
+ in the low speed mode.
+:Type: 32-bit Integer
+:Default: ``2``
+
+See `cache target dirty high ratio`_ for when the tiering agent flushes dirty
+objects within the high speed mode.
+
+Miscellaneous
+=============
+
+
+``osd snap trim thread timeout``
+
+:Description: The maximum time in seconds before timing out a snap trim thread.
+:Type: 32-bit Integer
+:Default: ``60*60*1``
+
+
+``osd backlog thread timeout``
+
+:Description: The maximum time in seconds before timing out a backlog thread.
+:Type: 32-bit Integer
+:Default: ``60*60*1``
+
+
+``osd default notify timeout``
+
+:Description: The OSD default notification timeout (in seconds).
+:Type: 32-bit Unsigned Integer
+:Default: ``30``
+
+
+``osd check for log corruption``
+
+:Description: Check log files for corruption. Can be computationally expensive.
+:Type: Boolean
+:Default: ``false``
+
+
+``osd remove thread timeout``
+
+:Description: The maximum time in seconds before timing out a remove OSD thread.
+:Type: 32-bit Integer
+:Default: ``60*60``
+
+
+``osd command thread timeout``
+
+:Description: The maximum time in seconds before timing out a command thread.
+:Type: 32-bit Integer
+:Default: ``10*60``
+
+
+``osd command max records``
+
+:Description: Limits the number of lost objects to return.
+:Type: 32-bit Integer
+:Default: ``256``
+
+
+``osd auto upgrade tmap``
+
+:Description: Uses ``tmap`` for ``omap`` on old objects.
+:Type: Boolean
+:Default: ``true``
+
+
+``osd tmapput sets users tmap``
+
+:Description: Uses ``tmap`` for debugging only.
+:Type: Boolean
+:Default: ``false``
+
+
+``osd fast fail on connection refused``
+
+:Description: If this option is enabled, crashed OSDs are marked down
+ immediately by connected peers and MONs (assuming that the
+ crashed OSD host survives). Disable it to restore old
+ behavior, at the expense of possible long I/O stalls when
+ OSDs crash in the middle of I/O operations.
+:Type: Boolean
+:Default: ``true``
+
+
+
+.. _pool: ../../operations/pools
+.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
+.. _Pool & PG Config Reference: ../pool-pg-config-ref
+.. _Journal Config Reference: ../journal-ref
+.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio