Diffstat (limited to 'src/ceph/doc/rados/configuration')
17 files changed, 5807 insertions, 0 deletions
diff --git a/src/ceph/doc/rados/configuration/auth-config-ref.rst b/src/ceph/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 0000000..eb14fa4 --- /dev/null +++ b/src/ceph/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,432 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. + + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``ceph-deploy`` to create a cluster (easiest). For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +ceph-deploy +----------- + +When you deploy a cluster with ``ceph-deploy``, you do not have to bootstrap the +monitor manually or create the ``client.admin`` user or keyring. The steps you +execute in the `Storage Cluster Quick Start`_ will invoke ``ceph-deploy`` to do +that for you. + +When you execute ``ceph-deploy new {initial-monitor(s)}``, Ceph will create a +monitor keyring for you (only used to bootstrap monitors), and it will generate +an initial Ceph configuration file for you, which contains the following +authentication settings, indicating that Ceph enables authentication by +default:: + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +When you execute ``ceph-deploy mon create-initial``, Ceph will bootstrap the +initial monitor(s), retrieve a ``ceph.client.admin.keyring`` file containing the +key for the ``client.admin`` user. Additionally, it will also retrieve keyrings +that give ``ceph-deploy`` and ``ceph-disk`` utilities the ability to prepare and +activate OSDs and metadata servers. + +When you execute ``ceph-deploy admin {node-name}`` (**note:** Ceph must be +installed first), you are pushing a Ceph configuration file and the +``ceph.client.admin.keyring`` to the ``/etc/ceph`` directory of the node. You +will be able to execute Ceph administrative functions as ``root`` on the command +line of that node. + + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. 
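To confirm what a running cluster actually enforces, you can query a monitor's live configuration through its admin socket. A minimal sketch, assuming a monitor named ``mon.a`` is running on the local host::

    ceph daemon mon.a config get auth_cluster_required
    ceph daemon mon.a config get auth_service_required
    ceph daemon mon.a config get auth_client_required

Each command prints the active value (``cephx`` or ``none``) for that setting.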
+ + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. Create a ``client.admin`` key, and save a copy of the key for your client + host:: + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. :: + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following:: + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:: + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:: + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = none + auth service required = none + auth client required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth cluster required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, and ``ceph-mds``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth service required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. 
+ + +``auth client required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. For Cuttlefish and later releases using ``ceph-deploy``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). +If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. + +You may use ``ceph-deploy admin`` to perform this task. See `Create an Admin +Host`_ for details. To perform this step manually, execute the following:: + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e., a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``ceph-deploy``) may generate +daemon keyrings in the same way as generating user keyrings. By default, Ceph +stores daemon keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mon 'allow profile mds' osd 'allow rwx'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +In Ceph Bobtail and subsequent versions, we prefer that Ceph authenticate all +ongoing messages between the entities using the session key set up for that +initial authentication.
However, Argonaut and earlier Ceph daemons do not know +how to perform ongoing message authentication. To maintain backward +compatibility (e.g., running both Bobtail and Argonaut daemons in the same +cluster), message signing is **off** by default. If you are running Bobtail or +later daemons exclusively, configure Ceph to require signatures. + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between the client and +Ceph, and you can enable/disable signatures for messages between Ceph daemons. + + +``cephx require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx cluster require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx service require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx sign messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they cannot be spoofed. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth service ticket ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +Backward Compatibility +====================== + +For Cuttlefish and earlier releases, see `Cephx`_. + +In Ceph Argonaut v0.48 and earlier versions, if you enable ``cephx`` +authentication, Ceph only authenticates the initial communication between the +client and daemon; Ceph does not authenticate the subsequent messages they send +to each other, which has security implications. In Ceph Bobtail and subsequent +versions, Ceph authenticates all ongoing messages between the entities using the +session key set up for that initial authentication. + +We identified a backward compatibility issue between Argonaut v0.48 (and prior +versions) and Bobtail (and subsequent versions). During testing, if you +attempted to use Argonaut (and earlier) daemons with Bobtail (and later) +daemons, the Argonaut daemons did not know how to perform ongoing message +authentication, while the Bobtail versions of the daemons insisted on +authenticating message traffic subsequent to the initial +request/response--making it impossible for Argonaut (and prior) daemons to +interoperate with Bobtail (and subsequent) daemons. + +We have addressed this potential problem by providing a means for Argonaut (and +prior) systems to interact with Bobtail (and subsequent) systems. Here's how it +works: by default, the newer systems will not insist on seeing signatures from +older systems that do not know how to perform them, but will simply accept such +messages without authenticating them. This new default behavior provides the +advantage of allowing two different releases to interact. **We do not recommend +this as a long term solution**.
Allowing newer daemons to forgo +authentication has the unfortunate security effect that an attacker with control +of some of your machines or some access to your network can disable session +security simply by claiming to be unable to sign messages. + +.. note:: Even if you don't actually run any old versions of Ceph, + the attacker may be able to force some messages to be accepted unsigned in the + default scenario. While running Cephx with the default scenario, Ceph still + authenticates the initial communication, but you lose desirable session security. + +If you know that you are not running older versions of Ceph, or you are willing +to accept that old servers and new servers will not be able to interoperate, you +can eliminate this security risk. If you do so, any Ceph system that is new +enough to support session authentication and that has Cephx enabled will reject +unsigned messages. To preclude new servers from interacting with old servers, +include the following in the ``[global]`` section of your `Ceph +configuration`_ file directly below the line that specifies the use of Cephx +for authentication:: + + cephx require signatures = true ; everywhere possible + +You can also selectively require signatures for cluster internal +communications only, separate from client-facing service:: + + cephx cluster require signatures = true ; for cluster-internal communication + cephx service require signatures = true ; for client-facing service + +An option to make a client require signatures from the cluster is not +yet implemented. + +**We recommend migrating all daemons to the newer versions and enabling the +foregoing flag** at the nearest practical time so that you may avail yourself +of the enhanced authentication. + +.. note:: Ceph kernel modules do not support signatures yet. + + +.. _Storage Cluster Quick Start: ../../../start/quick-ceph-deploy/ +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Cephx: http://docs.ceph.com/docs/cuttlefish/rados/configuration/auth-config-ref/ +.. _Ceph configuration: ../ceph-conf +.. _Create an Admin Host: ../../deployment/ceph-deploy-admin +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/src/ceph/doc/rados/configuration/bluestore-config-ref.rst b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 0000000..8d8ace6 --- /dev/null +++ b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,297 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage +device. The storage device is normally partitioned into two parts: + +#. A small partition is formatted with XFS and contains basic metadata + for the OSD. This *data directory* includes information about the + OSD (its identifier, which cluster it belongs to, and its private + keyring). + +#. The rest of the device is normally a large partition, managed directly + by BlueStore, that contains all of the actual data. This *primary device* + is normally identified by a ``block`` symlink in the data directory.
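To check how an existing OSD is laid out, you can inspect its data directory. A short sketch, assuming a hypothetical OSD with ID ``0`` in a cluster named ``ceph``::

    # The object store backend of this OSD ("bluestore" or "filestore").
    cat /var/lib/ceph/osd/ceph-0/type

    # The primary device is the 'block' symlink in the data directory.
    ls -l /var/lib/ceph/osd/ceph-0/block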
+ +It is also possible to deploy BlueStore across two additional devices: + +* A *WAL device* can be used for BlueStore's internal journal or + write-ahead log. It is identified by the ``block.wal`` symlink in + the data directory. It is only useful to provision a WAL device if the + device is faster than the primary device (e.g., when it is on an SSD + and the primary device is an HDD). +* A *DB device* can be used for storing BlueStore's internal metadata. + BlueStore (or rather, the embedded RocksDB) will put as much + metadata as it can on the DB device to improve performance. If the + DB device fills up, metadata will spill back onto the primary device + (where it would have been otherwise). Again, it is only helpful to + provision a DB device if it is faster than the primary device. + +If there is only a small amount of fast storage available (e.g., less +than a gigabyte), we recommend using it as a WAL device. If there is +more, provisioning a DB device makes more sense. The BlueStore +journal will always be placed on the fastest device available, so +using a DB device will provide the same benefit that the WAL device +would while *also* allowing additional metadata to be stored there (if +it will fit). + +A single-device BlueStore OSD can be provisioned with:: + + ceph-disk prepare --bluestore <device> + +To specify a WAL device and/or DB device, :: + + ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device> + +Cache size +========== + +The amount of memory consumed by each OSD for BlueStore's cache is +determined by the ``bluestore_cache_size`` configuration option. If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD currently do the best they can +to stick to the budgeted memory. Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +generally some overhead due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and +``bluestore_cache_kv_max``. The fraction of the cache devoted to data +is 1.0 minus the meta and kv ratios. The memory devoted to kv +metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max`` +since our testing indicates there are diminishing returns beyond a +certain point. + +``bluestore_cache_size`` + +:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead. +:Type: Integer +:Required: Yes +:Default: ``0`` + +``bluestore_cache_size_hdd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD. +:Type: Integer +:Required: Yes +:Default: ``1 * 1024 * 1024 * 1024`` (1 GB) + +``bluestore_cache_size_ssd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
+:Type: Integer +:Required: Yes +:Default: ``3 * 1024 * 1024 * 1024`` (3 GB) + +``bluestore_cache_meta_ratio`` + +:Description: The ratio of cache devoted to metadata. +:Type: Floating point +:Required: Yes +:Default: ``.01`` + +``bluestore_cache_kv_ratio`` + +:Description: The ratio of cache devoted to key/value data (rocksdb). +:Type: Floating point +:Required: Yes +:Default: ``.99`` + +``bluestore_cache_kv_max`` + +:Description: The maximum amount of cache devoted to key/value data (rocksdb). +:Type: Floating point +:Required: Yes +:Default: ``512 * 1024 * 1024`` (512 MB) + + +Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. + +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example, :: + + ceph osd pool set <pool-name> csum_type <algorithm> + +``bluestore_csum_type`` + +:Description: The default checksum algorithm to use. +:Type: String +:Required: Yes +:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64`` +:Default: ``crc32c`` + + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :doc:`/api/librados/#rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). + +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option.
Pool properties can be +set with:: + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +``bluestore compression algorithm`` + +:Description: The default compressor to use (if any) if the per-pool property + ``compression_algorithm`` is not set. Note that zstd is *not* + recommended for bluestore due to high CPU overhead when + compressing small amounts of data. +:Type: String +:Required: No +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` +:Default: ``snappy`` + +``bluestore compression mode`` + +:Description: The default policy for using compression if the per-pool property + ``compression_mode`` is not set. ``none`` means never use + compression. ``passive`` means use compression when + `clients hint`_ that data is compressible. ``aggressive`` means + use compression unless clients hint that data is not compressible. + ``force`` means use compression under all circumstances even if + the clients hint that the data is not compressible. +:Type: String +:Required: No +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` +:Default: ``none`` + +``bluestore compression required ratio`` + +:Description: The ratio of the size of the data chunk after + compression relative to the original size must be at + least this small in order to store the compressed + version. + +:Type: Floating point +:Required: No +:Default: .875 + +``bluestore compression min blob size`` + +:Description: Chunks smaller than this are never compressed. + The per-pool property ``compression_min_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression min blob size hdd`` + +:Description: Default value of ``bluestore compression min blob size`` + for rotational media. + +:Type: Unsigned Integer +:Required: No +:Default: 128K + +``bluestore compression min blob size ssd`` + +:Description: Default value of ``bluestore compression min blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 8K + +``bluestore compression max blob size`` + +:Description: Chunks larger than this are broken into smaller blobs sizing + ``bluestore compression max blob size`` before being compressed. + The per-pool property ``compression_max_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression max blob size hdd`` + +:Description: Default value of ``bluestore compression max blob size`` + for rotational media. + +:Type: Unsigned Integer +:Required: No +:Default: 512K + +``bluestore compression max blob size ssd`` + +:Description: Default value of ``bluestore compression max blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 64K + +.. _clients hint: ../../api/librados/#rados_set_alloc_hint diff --git a/src/ceph/doc/rados/configuration/ceph-conf.rst b/src/ceph/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 0000000..df88452 --- /dev/null +++ b/src/ceph/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,629 @@ +================== + Configuring Ceph +================== + +When you start the Ceph service, the initialization process activates a series +of daemons that run in the background. 
A :term:`Ceph Storage Cluster` runs +two types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Ceph Storage Clusters that support the :term:`Ceph Filesystem` run at least one +:term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support :term:`Ceph +Object Storage` run Ceph Gateway daemons (``radosgw``). For your convenience, +each daemon has a series of default values (*i.e.*, many are set by +``ceph/src/common/config_opts.h``). You may override these settings with a Ceph +configuration file. + + +.. _ceph-conf-file: + +The Configuration File +====================== + +When you start a Ceph Storage Cluster, each daemon looks for a Ceph +configuration file (i.e., ``ceph.conf`` by default) that provides the cluster's +configuration settings. For manual deployments, you need to create a Ceph +configuration file. For tools that create configuration files for you (*e.g.*, +``ceph-deploy``, Chef, etc.), you may use the information contained herein as a +reference. The Ceph configuration file defines: + +- Cluster Identity +- Authentication settings +- Cluster membership +- Host names +- Host addresses +- Paths to keyrings +- Paths to journals +- Paths to data +- Other runtime options + +The default Ceph configuration file locations in sequential order include: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/ceph.conf`` +#. ``~/.ceph/config`` +#. ``./ceph.conf`` (*i.e.,* in the current working directory) + + +The Ceph configuration file uses an *ini* style syntax. You can add comments +by preceding comments with a pound sign (#) or a semi-colon (;). For example: + +.. code-block:: ini + + # <--A number (#) sign precedes a comment. + ; A comment may be anything. + # Comments always follow a semi-colon (;) or a pound (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + + +.. _ceph-conf-settings: + +Config Sections +=============== + +The configuration file can configure all Ceph daemons in a Ceph Storage Cluster, +or all Ceph daemons of a particular type. To configure a series of daemons, the +settings must be included under the processes that will receive the +configuration as follows: + +``[global]`` + +:Description: Settings under ``[global]`` affect all daemons in a Ceph Storage + Cluster. + +:Example: ``auth supported = cephx`` + +``[osd]`` + +:Description: Settings under ``[osd]`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``[global]``. + +:Example: ``osd journal size = 1000`` + +``[mon]`` + +:Description: Settings under ``[mon]`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``[global]``. + +:Example: ``mon addr = 10.0.0.101:6789`` + + +``[mds]`` + +:Description: Settings under ``[mds]`` affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``[global]``. + +:Example: ``host = myserver01`` + +``[client]`` + +:Description: Settings under ``[client]`` affect all Ceph Clients + (e.g., mounted Ceph Filesystems, mounted Ceph Block Devices, + etc.). + +:Example: ``log file = /var/log/ceph/radosgw.log`` + + +Global settings affect all instances of all daemon in the Ceph Storage Cluster. +Use the ``[global]`` setting for values that are common for all daemons in the +Ceph Storage Cluster. 
You can override each ``[global]`` setting by: + +#. Changing the setting in a particular process type + (*e.g.,* ``[osd]``, ``[mon]``, ``[mds]`` ). + +#. Changing the setting in a particular process (*e.g.,* ``[osd.1]`` ). + +Overriding a global setting affects all child processes, except those that +you specifically override in a particular daemon. + +A typical global setting involves activating authentication. For example: + +.. code-block:: ini + + [global] + #Enable authentication between hosts within the cluster. + #v 0.54 and earlier + auth supported = cephx + + #v 0.55 and after + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + + +You can specify settings that apply to a particular type of daemon. When you +specify settings under ``[osd]``, ``[mon]`` or ``[mds]`` without specifying a +particular instance, the setting will apply to all OSDs, monitors or metadata +daemons respectively. + +A typical daemon-wide setting involves setting journal sizes, filestore +settings, etc. For example: + +.. code-block:: ini + + [osd] + osd journal size = 1000 + + +You may specify settings for particular instances of a daemon. You may specify +an instance by entering its type, delimited by a period (.) and by the instance +ID. The instance ID for a Ceph OSD Daemon is always numeric, but it may be +alphanumeric for Ceph Monitors and Ceph Metadata Servers. + +.. code-block:: ini + + [osd.1] + # settings affect osd.1 only. + + [mon.a] + # settings affect mon.a only. + + [mds.b] + # settings affect mds.b only. + + +If the daemon you specify is a Ceph Gateway client, specify the daemon and the +instance, delimited by a period (.). For example:: + + [client.radosgw.instance-name] + # settings affect client.radosgw.instance-name only. + + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration dramatically. When a +metavariable is set in a configuration value, Ceph expands the metavariable into +a concrete value. Metavariables are very powerful when used within the +``[global]``, ``[osd]``, ``[mon]``, ``[mds]`` or ``[client]`` sections of your +configuration file. Ceph metavariables are similar to Bash shell expansion. + +Ceph supports the following metavariables: + + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to one of ``mds``, ``osd``, or ``mon``, depending on the + type of the instant daemon. + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon identifier. For ``osd.0``, this would be + ``0``; for ``mds.a``, it would be ``a``. + +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name of the instant daemon. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +may run one ``ceph-osd`` for each drive. 
Ideally, you will have a node for a +particular type of process. For example, some nodes may run ``ceph-osd`` +daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may +run ``ceph-mon`` daemons. + +Each node has a name identified by the ``host`` setting. Monitors also specify +a network address and port (i.e., domain name or IP address) identified by the +``addr`` setting. A basic configuration file will typically specify only +minimal settings for each instance of monitor daemons. For example: + +.. code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + + +.. important:: The ``host`` setting is the short name of the node (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the node. Do not use ``host`` + settings for anything other than initial monitors unless you are deploying + Ceph manually. You **MUST NOT** specify ``host`` under individual daemons + when using deployment tools like ``chef`` or ``ceph-deploy``, as those tools + will enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +See the `Network Configuration Reference`_ for a detailed discussion about +configuring a network for use with Ceph. + + +Monitors +======== + +Ceph production clusters typically deploy with a minimum 3 :term:`Ceph Monitor` +daemons to ensure high availability should a monitor instance crash. At least +three (3) monitors ensures that the Paxos algorithm can determine which version +of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors may interrupt data service availability. + +Ceph Monitors typically listen on port ``6789``. For example: + +.. code-block:: ini + + [mon.a] + host = hostName + mon addr = 150.140.130.120:6789 + +By default, Ceph expects that you will store a monitor's data under the +following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +For Bobtail (v 0.56) and beyond, you should expressly enable or disable +authentication in the ``[global]`` section of your Ceph configuration file. :: + + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + +Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. + +.. important:: When upgrading, we recommend expressly disabling authentication + first, then perform the upgrade. Once the upgrade is complete, re-enable + authentication. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node +has one OSD daemon running a filestore on one storage drive. A typical +deployment specifies a journal size. For example: + +.. code-block:: ini + + [osd] + osd journal size = 10000 + + [osd.0] + host = {hostname} #manual deployments only. 
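As a rough journal-sizing sketch: the journal guidance elsewhere in these documents suggests at least twice the product of the expected drive throughput and ``filestore max sync interval``. Assuming a hypothetical 100 MB/s drive and the default 5-second sync interval::

    osd journal size = 2 * (expected throughput * filestore max sync interval)
                     = 2 * (100 MB/s * 5 s)
                     = 1000 MB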
+ + +By default, Ceph expects that you will store a Ceph OSD Daemon's data with the +following path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/osd/ceph-0 + +You may override this path using the ``osd data`` setting. We don't recommend +changing the default location. Create the default directory on your OSD host. + +:: + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd data`` path ideally leads to a mount point with a hard disk that is +separate from the hard disk storing and running the operating system and +daemons. If the OSD is for a disk other than the OS disk, prepare it for +use with Ceph, and mount it to the directory you just created:: + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running +:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and no +longer tested.) + +See the `OSD Config Reference`_ for additional configuration details. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. You do not have to provide any +settings. However, if you have network latency issues, you may wish to modify +the settings. + +See `Configuring Monitor/OSD Interaction`_ for additional details. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +Sometimes you may encounter issues with Ceph that require +modifying logging output and using Ceph's debugging. See `Debugging and +Logging`_ for details on log rotation. + +.. _Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. _ceph-runtime-config: + +Runtime Changes +=============== + +Ceph allows you to make changes to the configuration of a ``ceph-osd``, +``ceph-mon``, or ``ceph-mds`` daemon at runtime. This capability is quite +useful for increasing/decreasing logging output, enabling/disabling debug +settings, and even for runtime optimization. The following reflects runtime +configuration usage:: + + ceph tell {daemon-type}.{id or *} injectargs --{name} {value} [--{name} {value}] + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID (i.e., its number or letter). For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 injectargs --debug-osd 20 --debug-ms 1 + +In your ``ceph.conf`` file, you may use spaces when specifying a +setting name. When specifying a setting name on the command line, +ensure that you use an underscore or hyphen (``_`` or ``-``) between +terms (e.g., ``debug osd`` becomes ``--debug-osd``). 
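If you are logged in to the host where a daemon runs, the same effect can be achieved through the daemon's admin socket. A sketch, assuming ``osd.0`` runs locally::

    ceph daemon osd.0 config set debug_osd 20/20   # raise log and memory levels
    ceph daemon osd.0 config set debug_osd 1/5     # lower them again afterwards

As with ``injectargs``, settings changed this way do not persist across a daemon restart; to make them permanent, add them to ``ceph.conf``.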
+ + +Viewing a Configuration at Runtime +================================== + +If your Ceph Storage Cluster is running, and you would like to see the +configuration settings from a running daemon, execute the following:: + + ceph daemon {daemon-type}.{id} config show | less + +If you are on a machine where osd.0 is running, the command would be:: + + ceph daemon osd.0 config show | less + +Reading Configuration Metadata at Runtime +========================================= + +Information about the available configuration options is available via +the ``config help`` command: + +:: + + ceph daemon {daemon-type}.{id} config help | less + + +This metadata is primarily intended to be used when integrating other +software with Ceph, such as graphical user interfaces. The output is +a list of JSON objects, for example: + +:: + + { + "name": "mon_host", + "type": "std::string", + "level": "basic", + "desc": "list of hosts or addresses to search for a monitor", + "long_desc": "This is a comma, whitespace, or semicolon separated list of IP addresses or hostnames. Hostnames are resolved via DNS and all A or AAAA records are included in the search list.", + "default": "", + "daemon_default": "", + "tags": [], + "services": [ + "common" + ], + "see_also": [], + "enum_values": [], + "min": "", + "max": "" + } + +type +____ + +The type of the setting, given as a C++ type name. + +level +_____ + +One of `basic`, `advanced`, `dev`. The `dev` options are not intended +for use outside of development and testing. + +desc +____ + +A short description -- this is a sentence fragment suitable for display +in small spaces like a single line in a list. + +long_desc +_________ + +A full description of what the setting does, this may be as long as needed. + +default +_______ + +The default value, if any. + +daemon_default +______________ + +An alternative default used for daemons (services) as opposed to clients. + +tags +____ + +A list of strings indicating topics to which this setting relates. Examples +of tags are `performance` and `networking`. + +services +________ + +A list of strings indicating which Ceph services the setting relates to, such +as `osd`, `mds`, `mon`. For settings that are relevant to any Ceph client +or server, `common` is used. + +see_also +________ + +A list of strings indicating other configuration options that may also +be of interest to a user setting this option. + +enum_values +___________ + +Optional: a list of strings indicating the valid settings. + +min, max +________ + +Optional: upper and lower (inclusive) bounds on valid settings. + + + + +Running Multiple Clusters +========================= + +With Ceph, you can run multiple Ceph Storage Clusters on the same hardware. +Running multiple clusters provides a higher level of isolation compared to +using different pools on the same cluster with different CRUSH rulesets. A +separate cluster will have separate monitor, OSD and metadata server processes. +When running Ceph with default settings, the default cluster name is ``ceph``, +which means you would save your Ceph configuration file with the file name +``ceph.conf`` in the ``/etc/ceph`` default directory. + +See `ceph-deploy new`_ for details. +.. _ceph-deploy new:../ceph-deploy-new + +When you run multiple clusters, you must name your cluster and save the Ceph +configuration file with the name of the cluster. For example, a cluster named +``openstack`` will have a Ceph configuration file with the file name +``openstack.conf`` in the ``/etc/ceph`` default directory. + +.. 
important:: Cluster names must consist of letters a-z and digits 0-9 only. + +Separate clusters imply separate data disks and journals, which are not shared +between clusters. Referring to `Metavariables`_, the ``$cluster`` metavariable +evaluates to the cluster name (i.e., ``openstack`` in the foregoing example). +Various settings use the ``$cluster`` metavariable, including: + +- ``keyring`` +- ``admin socket`` +- ``log file`` +- ``pid file`` +- ``mon data`` +- ``mon cluster log file`` +- ``osd data`` +- ``osd journal`` +- ``mds data`` +- ``rgw data`` + +See `General Settings`_, `OSD Settings`_, `Monitor Settings`_, `MDS Settings`_, +`RGW Settings`_ and `Log Settings`_ for relevant path defaults that use the +``$cluster`` metavariable. + +.. _General Settings: ../general-config-ref +.. _OSD Settings: ../osd-config-ref +.. _Monitor Settings: ../mon-config-ref +.. _MDS Settings: ../../../cephfs/mds-config-ref +.. _RGW Settings: ../../../radosgw/config-ref/ +.. _Log Settings: ../../troubleshooting/log-and-debug + + +When creating default directories or files, you should use the cluster +name at the appropriate places in the path. For example:: + + sudo mkdir /var/lib/ceph/osd/openstack-0 + sudo mkdir /var/lib/ceph/mon/openstack-a + +.. important:: When running monitors on the same host, you should use + different ports. By default, monitors use port 6789. If you already + have monitors using port 6789, use a different port for your other cluster(s). + +To invoke a cluster other than the default ``ceph`` cluster, use the +``-c {filename}.conf`` option with the ``ceph`` command. For example:: + + ceph -c {cluster-name}.conf health + ceph -c openstack.conf health + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Network Configuration Reference: ../network-config-ref +.. _OSD Config Reference: ../osd-config-ref +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _ceph-deploy new: ../../deployment/ceph-deploy-new#naming-a-cluster diff --git a/src/ceph/doc/rados/configuration/demo-ceph.conf b/src/ceph/doc/rados/configuration/demo-ceph.conf new file mode 100644 index 0000000..ba86d53 --- /dev/null +++ b/src/ceph/doc/rados/configuration/demo-ceph.conf @@ -0,0 +1,31 @@ +[global] +fsid = {cluster-id} +mon initial members = {hostname}[, {hostname}] +mon host = {ip-address}[, {ip-address}] + +#All clusters have a front-side public network. +#If you have two NICs, you can configure a back side cluster +#network for OSD object replication, heartbeats, backfilling, +#recovery, etc. +public network = {network}[, {network}] +#cluster network = {network}[, {network}] + +#Clusters require authentication by default. +auth cluster required = cephx +auth service required = cephx +auth client required = cephx + +#Choose reasonable numbers for your journals, number of replicas +#and placement groups. +osd journal size = {n} +osd pool default size = {n} # Write an object n times. +osd pool default min size = {n} # Allow writing n copies in a degraded state. +osd pool default pg num = {n} +osd pool default pgp num = {n} + +#Choose a reasonable crush leaf type. +#0 for a 1-node cluster. +#1 for a multi node cluster in a single rack +#2 for a multi node, multi chassis cluster with multiple hosts in a chassis +#3 for a multi node cluster with hosts across racks, etc. +osd crush chooseleaf type = {n}
\ No newline at end of file diff --git a/src/ceph/doc/rados/configuration/filestore-config-ref.rst b/src/ceph/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 0000000..4dff60c --- /dev/null +++ b/src/ceph/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,365 @@ +============================ + Filestore Config Reference +============================ + + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``0`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are an important aspect of your configuration. +Some file systems have limits on the number of bytes stored in XATTRs. +Additionally, in some cases, the filesystem may not be as fast as an alternative +method of storing XATTRs. The following settings may help improve performance +by using a method of storing XATTRs that is extrinsic to the underlying filesystem. + +Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided +by the underlying file system, if it does not impose a size limit. If +there is a size limit (4KB total on ext4, for instance), some Ceph +XATTRs will be stored in a key/value database when either the +``filestore max inline xattr size`` or ``filestore max inline +xattrs`` threshold is reached. + + +``filestore max inline xattr size`` + +:Description: The maximum size of an XATTR stored in the filesystem (i.e., XFS, + btrfs, ext4, etc.) per object. Should not be larger than the + filesystem can handle. Default value of 0 means to use the value + specific to the underlying filesystem. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore max inline xattr size xfs`` + +:Description: The maximum size of an XATTR stored in the XFS filesystem. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore max inline xattr size btrfs`` + +:Description: The maximum size of an XATTR stored in the btrfs filesystem. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore max inline xattr size other`` + +:Description: The maximum size of an XATTR stored in other filesystems. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore max inline xattrs`` + +:Description: The maximum number of XATTRs stored in the filesystem per object. + Default value of 0 means to use the value specific to the + underlying filesystem. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore max inline xattrs xfs`` + +:Description: The maximum number of XATTRs stored in the XFS filesystem per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore max inline xattrs btrfs`` + +:Description: The maximum number of XATTRs stored in the btrfs filesystem per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore max inline xattrs other`` + +:Description: The maximum number of XATTRs stored in other filesystems per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +..
index:: filestore; synchronization + +Synchronization Intervals +========================= + +Periodically, the filestore needs to quiesce writes and synchronize the +filesystem, which creates a consistent commit point. It can then free journal +entries up to the commit point. Synchronizing more frequently tends to reduce +the time required to perform synchronization, and reduces the amount of data +that needs to remain in the journal. Less frequent synchronization allows the +backing filesystem to coalesce small writes and metadata updates more +optimally--potentially resulting in more efficient synchronization. + + +``filestore max sync interval`` + +:Description: The maximum interval in seconds for synchronizing the filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore min sync interval`` + +:Description: The minimum interval in seconds for synchronizing the filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The filestore flusher forces data from large writes to be written out using +``sync file range`` before the sync in order to (hopefully) reduce the cost of +the eventual sync. In practice, disabling 'filestore flusher' seems to improve +performance in some cases. + + +``filestore flusher`` + +:Description: Enables the filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore flusher max fds`` + +:Description: Sets the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore sync flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore fsync flushes journal data`` + +:Description: Flush journal data during filesystem synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings provide limits on the size of filestore queue. + +``filestore queue max ops`` + +:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. +:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore queue max bytes`` + +:Description: The maximum number of bytes for an operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + + + +.. index:: filestore; timeouts + +Timeouts +======== + + +``filestore op threads`` + +:Description: The number of filesystem operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore op thread timeout`` + +:Description: The timeout for a filesystem operation thread (in seconds). +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore op thread suicide timeout`` + +:Description: The timeout for a commit operation before cancelling the commit (in seconds). +:Type: Integer +:Required: No +:Default: ``180`` + + +.. index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore btrfs snap`` + +:Description: Enable snapshots for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +``filestore btrfs clone range`` + +:Description: Enable cloning ranges for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +.. 
index:: filestore; journal + +Journal +======= + + +``filestore journal parallel`` + +:Description: Enables parallel journaling, default for btrfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore journal writeahead`` + +:Description: Enables writeahead journaling, default for xfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore journal trailing`` + +:Description: Deprecated, never use. +:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore merge threshold`` + +:Description: Min number of files in a subdir before merging into parent + NOTE: A negative value means to disable subdir merging +:Type: Integer +:Required: No +:Default: ``10`` + + +``filestore split multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files in a subdirectory before + splitting into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore split rand factor`` + +:Description: A random factor added to the split threshold to avoid + too many filestore splits occurring at once. See + ``filestore split multiple`` for details. + This can only be changed for an existing osd offline, + via ceph-objectstore-tool's apply-layout-settings command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore update to`` + +:Description: Limits filestore auto upgrade to specified version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore blackhole`` + +:Description: Drop any new transactions on the floor. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore dump file`` + +:Description: File onto which store transaction dumps. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore kill at`` + +:Description: inject a failure at the n'th opportunity +:Type: String +:Required: No +:Default: ``false`` + + +``filestore fail eio`` + +:Description: Fail/Crash on eio. +:Type: Boolean +:Required: No +:Default: ``true`` + diff --git a/src/ceph/doc/rados/configuration/general-config-ref.rst b/src/ceph/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 0000000..ca09ee5 --- /dev/null +++ b/src/ceph/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,66 @@ +========================== + General Config Reference +========================== + + +``fsid`` + +:Description: The filesystem ID. One per cluster. +:Type: UUID +:Required: No. +:Default: N/A. Usually generated by deployment tools. + + +``admin socket`` + +:Description: The socket for executing administrative commands on a daemon, + irrespective of whether Ceph Monitors have established a quorum. + +:Type: String +:Required: No +:Default: ``/var/run/ceph/$cluster-$name.asok`` + + +``pid file`` + +:Description: The file in which the mon, osd or mds will write its + PID. For instance, ``/var/run/$cluster/$type.$id.pid`` + will create /var/run/ceph/mon.a.pid for the ``mon`` with + id ``a`` running in the ``ceph`` cluster. The ``pid + file`` is removed when the daemon stops gracefully. If + the process is not daemonized (i.e. runs with the ``-f`` + or ``-d`` option), the ``pid file`` is not created. +:Type: String +:Required: No +:Default: No + + +``chdir`` + +:Description: The directory Ceph daemons change to once they are + up and running. Default ``/`` directory recommended. 
diff --git a/src/ceph/doc/rados/configuration/index.rst b/src/ceph/doc/rados/configuration/index.rst
new file mode 100644
index 0000000..48b58ef
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/index.rst
@@ -0,0 +1,64 @@
+===============
+ Configuration
+===============
+
+Ceph can run with a cluster containing thousands of Object Storage Devices
+(OSDs). A minimal system will have at least two OSDs for data replication. To
+configure OSD clusters, you must provide settings in the configuration file.
+Ceph provides default values for many settings, which you can override in the
+configuration file. Additionally, you can make runtime modifications to the
+configuration using command-line utilities.
+
+When Ceph starts, it activates three daemons:
+
+- ``ceph-mon`` (mandatory)
+- ``ceph-osd`` (mandatory)
+- ``ceph-mds`` (mandatory for CephFS only)
+
+Each process, daemon or utility loads the host's configuration file. A process
+may have information about more than one daemon instance (*i.e.,* multiple
+contexts). A daemon or utility only has information about a single daemon
+instance (a single context).
+
+.. note:: Ceph can run on a single host for evaluation purposes.
+
+
+.. raw:: html
+
+   <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3>
+
+For general object store configuration, refer to the following:
+
+.. toctree::
+   :maxdepth: 1
+
+   Storage devices <storage-devices>
+   ceph-conf
+
+
+.. raw:: html
+
+   </td><td><h3>Reference</h3>
+
+To optimize the performance of your cluster, refer to the following:
+
+.. toctree::
+   :maxdepth: 1
+
+   Network Settings <network-config-ref>
+   Auth Settings <auth-config-ref>
+   Monitor Settings <mon-config-ref>
+   mon-lookup-dns
+   Heartbeat Settings <mon-osd-interaction>
+   OSD Settings <osd-config-ref>
+   BlueStore Settings <bluestore-config-ref>
+   FileStore Settings <filestore-config-ref>
+   Journal Settings <journal-ref>
+   Pool, PG & CRUSH Settings <pool-pg-config-ref>
+   Messaging Settings <ms-ref>
+   General Settings <general-config-ref>
+
+
+.. raw:: html
+
+   </td></tr></tbody></table>
diff --git a/src/ceph/doc/rados/configuration/journal-ref.rst b/src/ceph/doc/rados/configuration/journal-ref.rst
new file mode 100644
index 0000000..97300f4
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/journal-ref.rst
@@ -0,0 +1,116 @@
+==========================
+ Journal Config Reference
+==========================
+
+.. index:: journal; journal configuration
+
+Ceph OSDs use a journal for two reasons: speed and consistency.
+
+- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes
+  quickly. Ceph writes small, random i/o to the journal sequentially, which
+  tends to speed up bursty workloads by allowing the backing filesystem more
+  time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead
+  to spiky performance with short spurts of high-speed writes followed by
+  periods without any write progress as the filesystem catches up to the
+  journal.
+
+- **Consistency:** Ceph OSD Daemons require a filesystem interface that
+  guarantees atomic compound operations. Ceph OSD Daemons write a description
+  of the operation to the journal and apply the operation to the filesystem.
+  This enables atomic updates to an object (for example, placement group
+  metadata). Every few seconds--between ``filestore max sync interval`` and
+  ``filestore min sync interval``--the Ceph OSD Daemon stops writes and
+  synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to
+  trim operations from the journal and reuse the space. On failure, Ceph
+  OSD Daemons replay the journal starting after the last synchronization
+  operation.
+
+Ceph OSD Daemons support the following journal settings:
+
+
+``journal dio``
+
+:Description: Enables direct i/o to the journal. Requires ``journal block
+              align`` set to ``true``.
+
+:Type: Boolean
+:Required: Yes when using ``aio``.
+:Default: ``true``
+
+
+``journal aio``
+
+.. versionchanged:: 0.61 Cuttlefish
+
+:Description: Enables using ``libaio`` for asynchronous writes to the journal.
+              Requires ``journal dio`` set to ``true``.
+
+:Type: Boolean
+:Required: No.
+:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``.
+
+
+``journal block align``
+
+:Description: Block aligns write operations. Required for ``dio`` and ``aio``.
+:Type: Boolean
+:Required: Yes when using ``dio`` and ``aio``.
+:Default: ``true``
+
+
+``journal max write bytes``
+
+:Description: The maximum number of bytes the journal will write at
+              any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``10 << 20``
+
+
+``journal max write entries``
+
+:Description: The maximum number of entries the journal will write at
+              any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``100``
+
+
+``journal queue max ops``
+
+:Description: The maximum number of operations allowed in the queue at
+              any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``500``
+
+
+``journal queue max bytes``
+
+:Description: The maximum number of bytes allowed in the queue at
+              any one time.
+
+:Type: Integer
+:Required: No
+:Default: ``10 << 20``
+
+
+``journal align min size``
+
+:Description: Align data payloads greater than the specified minimum.
+:Type: Integer
+:Required: No
+:Default: ``64 << 10``
+
+
+``journal zero on create``
+
+:Description: Causes the file store to overwrite the entire journal with
+              ``0``'s during ``mkfs``.
+:Type: Boolean
+:Required: No
+:Default: ``false``
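+
+As an illustrative sketch (these are the documented defaults written out
+explicitly, not tuning advice), the journal settings live under ``[osd]``:
+
+.. code-block:: ini
+
+    [osd]
+    journal dio = true
+    journal aio = true
+    journal max write bytes = 10485760
+    journal max write entries = 100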
diff --git a/src/ceph/doc/rados/configuration/mon-config-ref.rst b/src/ceph/doc/rados/configuration/mon-config-ref.rst
new file mode 100644
index 0000000..6c8e92b
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/mon-config-ref.rst
@@ -0,0 +1,1222 @@
+==========================
+ Monitor Config Reference
+==========================
+
+Understanding how to configure a :term:`Ceph Monitor` is an important part of
+building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters
+have at least one monitor**. A monitor configuration usually remains fairly
+consistent, but you can add, remove or replace a monitor in a cluster. See
+`Adding/Removing a Monitor`_ and `Add/Remove a Monitor (ceph-deploy)`_ for
+details.
+
+
+.. index:: Ceph Monitor; Paxos
+
+Background
+==========
+
+Ceph Monitors maintain a "master copy" of the :term:`cluster map`, which means a
+:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD
+Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and
+retrieving a current cluster map. Before Ceph Clients can read from or write to
+Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor
+first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph
+Client can compute the location for any object. The ability to compute object
+locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a
+very important aspect of Ceph's high scalability and performance. See
+`Scalability and High Availability`_ for additional details.
+
+The primary role of the Ceph Monitor is to maintain a master copy of the cluster
+map. Ceph Monitors also provide authentication and logging services. Ceph
+Monitors write all changes in the monitor services to a single Paxos instance,
+and Paxos writes the changes to a key/value store for strong consistency. Ceph
+Monitors can query the most recent version of the cluster map during sync
+operations. Ceph Monitors leverage the key/value store's snapshots and iterators
+(using leveldb) to perform store-wide synchronization.
+
+.. ditaa::
+
+ /-------------\               /-------------\
+ |   Monitor   | Write Changes |    Paxos    |
+ |    cCCC     +-------------->+    cCCC     |
+ |             |               |             |
+ +-------------+               \------+------/
+ |    Auth     |                      |
+ +-------------+                      | Write Changes
+ |     Log     |                      |
+ +-------------+                      v
+ | Monitor Map |               /------+------\
+ +-------------+               | Key / Value |
+ |   OSD Map   |               |    Store    |
+ +-------------+               |    cCCC     |
+ |   PG Map    |               \------+------/
+ +-------------+                      ^
+ |   MDS Map   |                      | Read Changes
+ +-------------+                      |
+ |    cCCC     |*---------------------+
+ \-------------/
+
+
+.. deprecated:: 0.58
+
+In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for
+each service and store the map as a file.
+
+.. index:: Ceph Monitor; cluster map
+
+Cluster Maps
+------------
+
+The cluster map is a composite of maps, including the monitor map, the OSD map,
+the placement group map and the metadata server map. The cluster map tracks a
+number of important things: which processes are ``in`` the Ceph Storage Cluster;
+which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running
+or ``down``; whether the placement groups are ``active`` or ``inactive``, and
+``clean`` or in some other state; and other details that reflect the current
+state of the cluster such as the total amount of storage space, and the amount
+of storage used.
+
+When there is a significant change in the state of the cluster--e.g., a Ceph OSD
+Daemon goes down, a placement group falls into a degraded state, etc.--the
+cluster map gets updated to reflect the current state of the cluster.
+Additionally, the Ceph Monitor also maintains a history of the prior states of
+the cluster. The monitor map, OSD map, placement group map and metadata server
+map each maintain a history of their map versions. We call each version an
+"epoch."
+
+When operating your Ceph Storage Cluster, keeping track of these states is an
+important part of your system administration duties. See `Monitoring a Cluster`_
+and `Monitoring OSDs and PGs`_ for additional details.
+
+.. index:: high availability; quorum
+
+Monitor Quorum
+--------------
+
+Our Configuring Ceph section provides a trivial `Ceph configuration file`_ that
+provides for one monitor in the test cluster. A cluster will run fine with a
+single monitor; however, **a single monitor is a single-point-of-failure**. To
+ensure high availability in a production Ceph Storage Cluster, you should run
+Ceph with multiple monitors so that the failure of a single monitor **WILL NOT**
+bring down your entire cluster.
+
+When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability,
+Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map.
+A consensus requires a majority of monitors running to establish a quorum for
+consensus about the cluster map (e.g., 1 out of 1; 2 out of 3; 3 out of 5;
+4 out of 6; etc.).
+
+``mon force quorum join``
+
+:Description: Force the monitor to join the quorum even if it has been
+              previously removed from the map.
+:Type: Boolean
+:Default: ``False``
+
+.. index:: Ceph Monitor; consistency
+
+Consistency
+-----------
+
+When you add monitor settings to your Ceph configuration file, you need to be
+aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes
+strict consistency requirements** for a Ceph monitor when discovering another
+Ceph Monitor within the cluster. Whereas Ceph Clients and other Ceph daemons
+use the Ceph configuration file to discover monitors, monitors discover each
+other using the monitor map (monmap), not the Ceph configuration file.
+
+A Ceph Monitor always refers to the local copy of the monmap when discovering
+other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the
+Ceph configuration file avoids errors that could break the cluster (e.g., typos
+in ``ceph.conf`` when specifying a monitor address or port). Since monitors use
+monmaps for discovery and they share monmaps with clients and other Ceph
+daemons, **the monmap provides monitors with a strict guarantee that their
+consensus is valid.**
+
+Strict consistency also applies to updates to the monmap. As with any other
+updates on the Ceph Monitor, changes to the monmap always run through a
+distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on
+each update to the monmap, such as adding or removing a Ceph Monitor, to ensure
+that each monitor in the quorum has the same version of the monmap. Updates to
+the monmap are incremental so that Ceph Monitors have the latest agreed upon
+version, and a set of previous versions. Maintaining a history enables a Ceph
+Monitor that has an older version of the monmap to catch up with the current
+state of the Ceph Storage Cluster.
+
+If Ceph Monitors discovered each other through the Ceph configuration file
+instead of through the monmap, it would introduce additional risks because the
+Ceph configuration files are not updated and distributed automatically. Ceph
+Monitors might inadvertently use an older Ceph configuration file, fail to
+recognize a Ceph Monitor, fall out of a quorum, or develop a situation where
+`Paxos`_ is not able to determine the current state of the system accurately.
+
+
+.. index:: Ceph Monitor; bootstrapping monitors
+
+Bootstrapping Monitors
+----------------------
+
+In most configuration and deployment cases, tools that deploy Ceph may help
+bootstrap the Ceph Monitors by generating a monitor map for you (e.g.,
+``ceph-deploy``, etc). A Ceph Monitor requires a few explicit
+settings:
+
+- **Filesystem ID**: The ``fsid`` is the unique identifier for your
+  object store. Since you can run multiple clusters on the same
+  hardware, you must specify the unique ID of the object store when
+  bootstrapping a monitor. Deployment tools usually do this for you
+  (e.g., ``ceph-deploy`` can call a tool like ``uuidgen``), but you
+  may specify the ``fsid`` manually too.
+
+- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within
+  the cluster. It is an alphanumeric value, and by convention the identifier
+  usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This
+  can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.),
+  by a deployment tool, or using the ``ceph`` command line.
+
+- **Keys**: The monitor must have secret keys. A deployment tool such as
+  ``ceph-deploy`` usually does this for you, but you may
+  perform this step manually too. See `Monitor Keyrings`_ for details.
+
+For additional details on bootstrapping, see `Bootstrapping a Monitor`_.
+
+.. index:: Ceph Monitor; configuring monitors
+
+Configuring Monitors
+====================
+
+To apply configuration settings to the entire cluster, enter the configuration
+settings under ``[global]``. To apply configuration settings to all monitors in
+your cluster, enter the configuration settings under ``[mon]``. To apply
+configuration settings to specific monitors, specify the monitor instance
+(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation.
+
+.. code-block:: ini
+
+    [global]
+
+    [mon]
+
+    [mon.a]
+
+    [mon.b]
+
+    [mon.c]
+
+
+Minimum Configuration
+---------------------
+
+The bare minimum monitor settings for a Ceph monitor via the Ceph configuration
+file include a hostname and a monitor address for each monitor. You can configure
+these under ``[mon]`` or under the entry for a specific monitor.
+
+.. code-block:: ini
+
+    [mon]
+    mon host = hostname1,hostname2,hostname3
+    mon addr = 10.0.0.10:6789,10.0.0.11:6789,10.0.0.12:6789
+
+
+.. code-block:: ini
+
+    [mon.a]
+    host = hostname1
+    mon addr = 10.0.0.10:6789
+
+See the `Network Configuration Reference`_ for details.
+
+.. note:: This minimum configuration for monitors assumes that a deployment
+   tool generates the ``fsid`` and the ``mon.`` key for you.
+
+Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP address of
+the monitors. However, if you decide to change the monitor's IP address, you
+must follow a specific procedure. See `Changing a Monitor's IP Address`_ for
+details.
+
+Monitors can also be found by clients using DNS SRV records. See
+`Monitor lookup through DNS`_ for details.
+
+Cluster ID
+----------
+
+Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it
+usually appears under the ``[global]`` section of the configuration file.
+Deployment tools usually generate the ``fsid`` and store it in the monitor map,
+so the value may not appear in a configuration file. The ``fsid`` makes it
+possible to run daemons for multiple clusters on the same hardware.
+
+``fsid``
+
+:Description: The cluster ID. One per cluster.
+:Type: UUID
+:Required: Yes.
+:Default: N/A. May be generated by a deployment tool if not specified.
+
+.. note:: Do not set this value if you use a deployment tool that does
+   it for you.
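+
+For example, a manually specified ``fsid`` might look like the following
+sketch; the UUID shown is a placeholder (generate your own, e.g. with
+``uuidgen``):
+
+.. code-block:: ini
+
+    [global]
+    fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993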
+
+
+.. index:: Ceph Monitor; initial members
+
+Initial Members
+---------------
+
+We recommend running a production Ceph Storage Cluster with at least three Ceph
+Monitors to ensure high availability. When you run multiple monitors, you may
+specify the initial monitors that must be members of the cluster in order to
+establish a quorum. This may reduce the time it takes for your cluster to come
+online.
+
+.. code-block:: ini
+
+    [mon]
+    mon initial members = a,b,c
+
+
+``mon initial members``
+
+:Description: The IDs of initial monitors in a cluster during startup. If
+              specified, Ceph requires an odd number of monitors to form an
+              initial quorum (e.g., 3).
+
+:Type: String
+:Default: None
+
+.. note:: A *majority* of monitors in your cluster must be able to reach
+   each other in order to establish a quorum. You can decrease the initial
+   number of monitors to establish a quorum with this setting.
+
+.. index:: Ceph Monitor; data path
+
+Data
+----
+
+Ceph provides a default path where Ceph Monitors store data. For optimal
+performance in a production Ceph Storage Cluster, we recommend running Ceph
+Monitors on separate hosts and drives from Ceph OSD Daemons. Because leveldb
+uses ``mmap()`` for writing the data, Ceph Monitors flush their data from
+memory to disk very often, which can interfere with Ceph OSD Daemon workloads
+if the data store is co-located with the OSD Daemons.
+
+In Ceph versions 0.58 and earlier, Ceph Monitors store their data in files. This
+approach allows users to inspect monitor data with common tools like ``ls``
+and ``cat``. However, it doesn't provide strong consistency.
+
+In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value
+pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents
+recovering Ceph Monitors from running corrupted versions through Paxos, and it
+enables multiple modification operations in one single atomic batch, among other
+advantages.
+
+Generally, we do not recommend changing the default data location. If you modify
+the default location, we recommend that you make it uniform across Ceph Monitors
+by setting it in the ``[mon]`` section of the configuration file.
+
+
+``mon data``
+
+:Description: The monitor's data location.
+:Type: String
+:Default: ``/var/lib/ceph/mon/$cluster-$id``
+
+
+``mon data size warn``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log when the monitor's data
+              store goes over 15GB.
+:Type: Integer
+:Default: ``15*1024*1024*1024``
+
+
+``mon data avail warn``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log when the available disk
+              space of the monitor's data store is lower than or equal to this
+              percentage.
+:Type: Integer
+:Default: 30
+
+
+``mon data avail crit``
+
+:Description: Issue a ``HEALTH_ERR`` in cluster log when the available disk
+              space of the monitor's data store is lower than or equal to this
+              percentage.
+:Type: Integer
+:Default: 5
+
+
+``mon warn on cache pools without hit sets``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if a cache pool does not
+              have the hit set type set.
+              See `hit set type <../operations/pools#hit-set-type>`_ for more
+              details.
+:Type: Boolean
+:Default: True
+
+
+``mon warn on crush straw calc version zero``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH's
+              ``straw_calc_version`` is zero. See
+              `CRUSH map tunables <../operations/crush-map#tunables>`_ for
+              details.
+:Type: Boolean
+:Default: True
+
+
+``mon warn on legacy crush tunables``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if
+              CRUSH tunables are too old (older than ``mon_min_crush_required_version``).
+:Type: Boolean
+:Default: True
+
+
+``mon crush min required version``
+
+:Description: The minimum tunable profile version required by the cluster.
+              See
+              `CRUSH map tunables <../operations/crush-map#tunables>`_ for
+              details.
+:Type: String
+:Default: ``firefly``
+
+
+``mon warn on osd down out interval zero``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if
+              ``mon osd down out interval`` is zero. Having this option set to
+              zero on the leader acts much like the ``noout`` flag. It's hard
+              to figure out what's going wrong with clusters without the
+              ``noout`` flag set but acting like that just the same, so we
+              report a warning in this case.
+:Type: Boolean
+:Default: True
+
+
+``mon cache target full warn ratio``
+
+:Description: Position between a pool's ``cache_target_full`` and
+              ``target_max_objects`` where we start warning.
+:Type: Float
+:Default: ``0.66``
+
+
+``mon health data update interval``
+
+:Description: How often (in seconds) a monitor in quorum shares its health
+              status with its peers. (A negative number disables it.)
+:Type: Float
+:Default: ``60``
+
+
+``mon health to clog``
+
+:Description: Enable sending a health summary to the cluster log periodically.
+:Type: Boolean
+:Default: True
+
+
+``mon health to clog tick interval``
+
+:Description: How often (in seconds) the monitor sends a health summary to the
+              cluster log (a non-positive number disables it). If the current
+              health summary is empty or identical to the last one, the monitor
+              will not send it to the cluster log.
+:Type: Integer
+:Default: 3600
+
+
+``mon health to clog interval``
+
+:Description: How often (in seconds) the monitor sends a health summary to the
+              cluster log (a non-positive number disables it). The monitor will
+              always send the summary to the cluster log whether or not the
+              summary changes.
+:Type: Integer
+:Default: 60
+
+
+
+.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning
+
+Storage Capacity
+----------------
+
+When a Ceph Storage Cluster gets close to its maximum capacity (i.e., ``mon osd
+full ratio``), Ceph prevents you from writing to or reading from Ceph OSD
+Daemons as a safety measure to prevent data loss. Therefore, letting a
+production Ceph Storage Cluster approach its full ratio is not a good practice,
+because it sacrifices high availability. The default full ratio is ``.95``, or
+95% of capacity. This is a very aggressive setting for a test cluster with a
+small number of OSDs.
+
+.. tip:: When monitoring your cluster, be alert to warnings related to the
+   ``nearfull`` ratio. A failure of one or more OSDs could then result in a
+   temporary service disruption. Consider adding more OSDs to increase storage
+   capacity.
+
+A common scenario for test clusters involves a system administrator removing a
+Ceph OSD Daemon from the Ceph Storage Cluster to watch the cluster rebalance;
+then, removing another Ceph OSD Daemon, and so on until the Ceph Storage Cluster
+eventually reaches the full ratio and locks up. We recommend a bit of capacity
+planning even with a test cluster. Planning enables you to gauge how much spare
+capacity you will need in order to maintain high availability. Ideally, you want
+to plan for a series of Ceph OSD Daemon failures where the cluster can recover
+to an ``active + clean`` state without replacing those Ceph OSD Daemons
+immediately. You can run a cluster in an ``active + degraded`` state, but this
+is not ideal for normal operating conditions.
+
+The following diagram depicts a simplistic Ceph Storage Cluster containing 33
+Ceph Nodes with one Ceph OSD Daemon per host, each Ceph OSD Daemon reading from
+and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum
+actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph
+Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow
+Ceph Clients to read and write data. So the Ceph Storage Cluster's operating
+capacity is 95TB, not 99TB.
+
+.. ditaa::
+
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+ | Rack 1 |  | Rack 2 |  | Rack 3 |  | Rack 4 |  | Rack 5 |  | Rack 6 |
+ | cCCC   |  | cF00   |  | cCCC   |  | cCCC   |  | cCCC   |  | cCCC   |
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+ | OSD 1  |  | OSD 7  |  | OSD 13 |  | OSD 19 |  | OSD 25 |  | OSD 31 |
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+ | OSD 2  |  | OSD 8  |  | OSD 14 |  | OSD 20 |  | OSD 26 |  | OSD 32 |
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+ | OSD 3  |  | OSD 9  |  | OSD 15 |  | OSD 21 |  | OSD 27 |  | OSD 33 |
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+ | OSD 4  |  | OSD 10 |  | OSD 16 |  | OSD 22 |  | OSD 28 |  | Spare  |
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+ | OSD 5  |  | OSD 11 |  | OSD 17 |  | OSD 23 |  | OSD 29 |  | Spare  |
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+ | OSD 6  |  | OSD 12 |  | OSD 18 |  | OSD 24 |  | OSD 30 |  | Spare  |
+ +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
+
+It is normal in such a cluster for one or two OSDs to fail. A less frequent but
+reasonable scenario involves a rack's router or power supply failing, which
+brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario,
+you should still strive for a cluster that can remain operational and achieve an
+``active + clean`` state--even if that means adding a few hosts with additional
+OSDs in short order. If your capacity utilization is too high, you may not lose
+data, but you could still sacrifice data availability while resolving an outage
+within a failure domain if capacity utilization of the cluster exceeds the full
+ratio. For this reason, we recommend at least some rough capacity planning.
+
+Identify two numbers for your cluster:
+
+#. The number of OSDs.
+#. The total capacity of the cluster.
+
+If you divide the total capacity of your cluster by the number of OSDs in your
+cluster, you will find the mean capacity of an OSD within your cluster.
+Consider multiplying that number by the number of OSDs you expect will fail
+simultaneously during normal operations (a relatively small number). Finally,
+multiply the capacity of the cluster by the full ratio to arrive at a maximum
+operating capacity; then, subtract the amount of data on the OSDs you expect
+to fail to arrive at a reasonable full ratio. Repeat the foregoing process
+with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at a
+reasonable number for a near full ratio.
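+
+As a worked sketch using the illustrative 33-OSD cluster above (numbers are
+rounded, and the two-OSD failure assumption is an example, not a rule)::
+
+    33 OSDs x 3TB           =  99TB   total capacity
+    99TB / 33 OSDs          =   3TB   mean capacity per OSD
+    99TB x 0.95             = ~94TB   maximum operating capacity
+    ~94TB - (2 OSDs x 3TB)  = ~88TB   planning target that tolerates
+                                      two simultaneous OSD failures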
+
+.. code-block:: ini
+
+    [global]
+
+    mon osd full ratio = .80
+    mon osd backfillfull ratio = .75
+    mon osd nearfull ratio = .70
+
+
+``mon osd full ratio``
+
+:Description: The percentage of disk space used before an OSD is
+              considered ``full``.
+
+:Type: Float
+:Default: ``.95``
+
+
+``mon osd backfillfull ratio``
+
+:Description: The percentage of disk space used before an OSD is
+              considered too ``full`` to backfill.
+
+:Type: Float
+:Default: ``.90``
+
+
+``mon osd nearfull ratio``
+
+:Description: The percentage of disk space used before an OSD is
+              considered ``nearfull``.
+
+:Type: Float
+:Default: ``.85``
+
+
+.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you
+   may have a problem with the CRUSH weight for the nearfull OSDs.
+
+.. index:: heartbeat
+
+Heartbeat
+---------
+
+Ceph monitors know about the cluster by requiring reports from each OSD, and by
+receiving reports from OSDs about the status of their neighboring OSDs. Ceph
+provides reasonable default settings for monitor/OSD interaction; however, you
+may modify them as needed. See `Monitor/OSD Interaction`_ for details.
+
+
+.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization
+
+Monitor Store Synchronization
+-----------------------------
+
+When you run a production cluster with multiple monitors (recommended), each
+monitor checks to see if a neighboring monitor has a more recent version of the
+cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers
+higher than the most current epoch in the map of the instant monitor).
+Periodically, one monitor in the cluster may fall behind the other monitors to
+the point where it must leave the quorum, synchronize to retrieve the most
+current information about the cluster, and then rejoin the quorum. For the
+purposes of synchronization, monitors may assume one of three roles:
+
+#. **Leader**: The `Leader` is the first monitor to achieve the most recent
+   Paxos version of the cluster map.
+
+#. **Provider**: The `Provider` is a monitor that has the most recent version
+   of the cluster map, but wasn't the first to achieve the most recent version.
+
+#. **Requester:** A `Requester` is a monitor that has fallen behind the leader
+   and must synchronize in order to retrieve the most recent information about
+   the cluster before it can rejoin the quorum.
+
+These roles enable a leader to delegate synchronization duties to a provider,
+which prevents synchronization requests from overloading the leader--improving
+performance. In the following diagram, the requester has learned that it has
+fallen behind the other monitors. The requester asks the leader to synchronize,
+and the leader tells the requester to synchronize with a provider.
+
+
+.. ditaa:: +-----------+          +---------+          +----------+
+           | Requester |          | Leader  |          | Provider |
+           +-----------+          +---------+          +----------+
+                  |                    |                     |
+                  |                    |                     |
+                  | Ask to Synchronize |                     |
+                  |------------------->|                     |
+                  |                    |                     |
+                  |<-------------------|                     |
+                  | Tell Requester to  |                     |
+                  | Sync with Provider |                     |
+                  |                    |                     |
+                  |               Synchronize                |
+                  |--------------------+-------------------->|
+                  |                    |                     |
+                  |<-------------------+---------------------|
+                  |         Send Chunk to Requester          |
+                  |         (repeat as necessary)            |
+                  |     Requester Acks Chunk to Provider     |
+                  |--------------------+-------------------->|
+                  |                                          |
+                  |   Sync Complete                          |
+                  |    Notification                          |
+                  |------------------->|
+                  |                    |
+                  |<-------------------|
+                  |        Ack         |
+                  |                    |
+
+
+Synchronization always occurs when a new monitor joins the cluster. During
+runtime operations, monitors may receive updates to the cluster map at different
+times. This means the leader and provider roles may migrate from one monitor to
+another. If this happens while synchronizing (e.g., a provider falls behind the
+leader), the provider can terminate synchronization with a requester.
+
+Once synchronization is complete, Ceph requires trimming across the cluster.
+Trimming requires that the placement groups are ``active + clean``.
+
+
+``mon sync trim timeout``
+
+:Description:
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync heartbeat timeout``
+
+:Description:
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync heartbeat interval``
+
+:Description:
+:Type: Double
+:Default: ``5.0``
+
+
+``mon sync backoff timeout``
+
+:Description:
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync timeout``
+
+:Description: Number of seconds the monitor will wait for the next update
+              message from its sync provider before it gives up and bootstraps
+              again.
+:Type: Double
+:Default: ``30.0``
+
+
+``mon sync max retries``
+
+:Description:
+:Type: Integer
+:Default: ``5``
+
+
+``mon sync max payload size``
+
+:Description: The maximum size for a sync payload (in bytes).
+:Type: 32-bit Integer
+:Default: ``1045676``
+
+
+``paxos max join drift``
+
+:Description: The maximum Paxos iterations before we must first sync the
+              monitor data stores. When a monitor finds that its peer is too
+              far ahead of it, it will first sync with data stores before moving
+              on.
+:Type: Integer
+:Default: ``10``
+
+``paxos stash full interval``
+
+:Description: How often (in commits) to stash a full copy of the PaxosService
+              state. Currently this setting only affects the ``mds``, ``mon``,
+              ``auth`` and ``mgr`` PaxosServices.
+:Type: Integer
+:Default: 25
+
+``paxos propose interval``
+
+:Description: Gather updates for this time interval before proposing
+              a map update.
+:Type: Double
+:Default: ``1.0``
+
+
+``paxos min``
+
+:Description: The minimum number of paxos states to keep around.
+:Type: Integer
+:Default: 500
+
+
+``paxos min wait``
+
+:Description: The minimum amount of time to gather updates after a period of
+              inactivity.
+:Type: Double
+:Default: ``0.05``
+
+
+``paxos trim min``
+
+:Description: The number of extra proposals tolerated before trimming.
+:Type: Integer
+:Default: 250
+
+
+``paxos trim max``
+
+:Description: The maximum number of extra proposals to trim at a time.
+:Type: Integer
+:Default: 500
+
+
+``paxos service trim min``
+
+:Description: The minimum number of versions to trigger a trim (0 disables it).
+:Type: Integer
+:Default: 250
+
+
+``paxos service trim max``
+
+:Description: The maximum number of versions to trim during a single proposal (0 disables it).
+:Type: Integer
+:Default: 500
+
+
+``mon max log epochs``
+
+:Description: The maximum number of log epochs to trim during a single proposal.
+:Type: Integer
+:Default: 500
+
+
+``mon max pgmap epochs``
+
+:Description: The maximum number of pgmap epochs to trim during a single proposal.
+:Type: Integer
+:Default: 500
+
+
+``mon mds force trim to``
+
+:Description: Force the monitor to trim mdsmaps to this point (0 disables it;
+              dangerous, use with care).
+:Type: Integer
+:Default: 0
+
+
+``mon osd force trim to``
+
+:Description: Force the monitor to trim osdmaps to this point, even if there
+              are PGs that are not clean at the specified epoch (0 disables it;
+              dangerous, use with care).
+:Type: Integer
+:Default: 0
+
+``mon osd cache size``
+
+:Description: The size of the osdmap cache, so as not to rely on the
+              underlying store's cache.
+:Type: Integer
+:Default: 10
+
+
+``mon election timeout``
+
+:Description: On the election proposer, the maximum waiting time for all ACKs,
+              in seconds.
+:Type: Float
+:Default: ``5``
+
+
+``mon lease``
+
+:Description: The length (in seconds) of the lease on the monitor's versions.
+:Type: Float
+:Default: ``5``
+
+
+``mon lease renew interval factor``
+
+:Description: ``mon lease`` \* ``mon lease renew interval factor`` will be the
+              interval for the Leader to renew the other monitors' leases. The
+              factor should be less than ``1.0``.
+:Type: Float
+:Default: ``0.6``
+
+
+``mon lease ack timeout factor``
+
+:Description: The Leader will wait ``mon lease`` \* ``mon lease ack timeout factor``
+              for the Providers to acknowledge the lease extension.
+:Type: Float
+:Default: ``2.0``
+
+
+``mon accept timeout factor``
+
+:Description: The Leader will wait ``mon lease`` \* ``mon accept timeout factor``
+              for the Requester(s) to accept a Paxos update. It is also used
+              during the Paxos recovery phase for similar purposes.
+:Type: Float
+:Default: ``2.0``
+
+
+``mon min osdmap epochs``
+
+:Description: Minimum number of OSD map epochs to keep at all times.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+``mon max pgmap epochs``
+
+:Description: Maximum number of PG map epochs the monitor should keep.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+``mon max log epochs``
+
+:Description: Maximum number of Log epochs the monitor should keep.
+:Type: 32-bit Integer
+:Default: ``500``
+
+
+
+.. index:: Ceph Monitor; clock
+
+Clock
+-----
+
+Ceph daemons pass critical messages to each other, which must be processed
+before daemons reach a timeout threshold. If the clocks in Ceph monitors
+are not synchronized, it can lead to a number of anomalies. For example:
+
+- Daemons ignoring received messages (e.g., timestamps outdated).
+- Timeouts triggered too soon or too late when a message wasn't received
+  in time.
+
+See `Monitor Store Synchronization`_ for details.
+
+
+.. tip:: You SHOULD install NTP on your Ceph monitor hosts to
+   ensure that the monitor cluster operates with synchronized clocks.
+
+Clock drift may still be noticeable with NTP even though the discrepancy is not
+yet harmful. Ceph's clock drift / clock skew warnings may get triggered even
+though NTP maintains a reasonable level of synchronization. Increasing the
+allowed clock drift may be tolerable under such circumstances; however, a number
+of factors such as workload, network latency, configuring overrides to default
+timeouts and the `Monitor Store Synchronization`_ settings may influence
+the level of acceptable clock drift without compromising Paxos guarantees.
+
+Ceph provides the following tunable options to allow you to find
+acceptable values.
+
+
+``clock offset``
+
+:Description: How much to offset the system clock. See ``Clock.cc`` for details.
+:Type: Double
+:Default: ``0``
+
+
+.. deprecated:: 0.58
+
+``mon tick interval``
+
+:Description: A monitor's tick interval in seconds.
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+``mon clock drift allowed``
+
+:Description: The clock drift in seconds allowed between monitors.
+:Type: Float
+:Default: ``.050``
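+
+For example, if NTP keeps your monitors within tens of milliseconds but skew
+warnings persist, you might relax the threshold slightly. A hedged sketch; the
+value is illustrative, not a recommendation:
+
+.. code-block:: ini
+
+    [mon]
+    mon clock drift allowed = .100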
+
+
+``mon clock drift warn backoff``
+
+:Description: Exponential backoff for clock drift warnings.
+:Type: Float
+:Default: ``5``
+
+
+``mon timecheck interval``
+
+:Description: The time check interval (clock drift check) in seconds
+              for the Leader.
+
+:Type: Float
+:Default: ``300.0``
+
+
+``mon timecheck skew interval``
+
+:Description: The time check interval (clock drift check) in seconds, for the
+              Leader, when in the presence of a skew.
+:Type: Float
+:Default: ``30.0``
+
+
+Client
+------
+
+``mon client hunt interval``
+
+:Description: The client will try a new monitor every ``N`` seconds until it
+              establishes a connection.
+
+:Type: Double
+:Default: ``3.0``
+
+
+``mon client ping interval``
+
+:Description: The client will ping the monitor every ``N`` seconds.
+:Type: Double
+:Default: ``10.0``
+
+
+``mon client max log entries per message``
+
+:Description: The maximum number of log entries a monitor will generate
+              per client message.
+
+:Type: Integer
+:Default: ``1000``
+
+
+``mon client bytes``
+
+:Description: The amount of client message data allowed in memory (in bytes).
+:Type: 64-bit Integer Unsigned
+:Default: ``100ul << 20``
+
+
+Pool settings
+=============
+
+Since version v0.94 there is support for pool flags, which allow or disallow
+changes to be made to pools.
+
+Monitors can also disallow removal of pools if configured that way.
+
+``mon allow pool delete``
+
+:Description: Whether the monitors should allow pools to be removed, regardless
+              of what the pool flags say.
+:Type: Boolean
+:Default: ``false``
+
+``osd pool default flag hashpspool``
+
+:Description: Set the hashpspool flag on new pools.
+:Type: Boolean
+:Default: ``true``
+
+``osd pool default flag nodelete``
+
+:Description: Set the nodelete flag on new pools. Prevents removal of pools
+              that have this flag set.
+:Type: Boolean
+:Default: ``false``
+
+``osd pool default flag nopgchange``
+
+:Description: Set the nopgchange flag on new pools. Does not allow the number
+              of PGs to be changed for a pool.
+:Type: Boolean
+:Default: ``false``
+
+``osd pool default flag nosizechange``
+
+:Description: Set the nosizechange flag on new pools. Does not allow the size
+              of a pool to be changed.
+:Type: Boolean
+:Default: ``false``
+
+For more information about the pool flags see `Pool values`_.
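+
+For instance, a conservative, hedged sketch that guards pools against
+accidental removal might look like this:
+
+.. code-block:: ini
+
+    [global]
+    mon allow pool delete = false
+    osd pool default flag nodelete = true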
+
+Miscellaneous
+=============
+
+
+``mon max osd``
+
+:Description: The maximum number of OSDs allowed in the cluster.
+:Type: 32-bit Integer
+:Default: ``10000``
+
+``mon globalid prealloc``
+
+:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster.
+:Type: 32-bit Integer
+:Default: ``100``
+
+``mon subscribe interval``
+
+:Description: The refresh interval (in seconds) for subscriptions. The
+              subscription mechanism enables obtaining the cluster maps
+              and log information.
+
+:Type: Double
+:Default: ``300``
+
+
+``mon stat smooth intervals``
+
+:Description: Ceph will smooth statistics over the last ``N`` PG maps.
+:Type: Integer
+:Default: ``2``
+
+
+``mon probe timeout``
+
+:Description: Number of seconds the monitor will wait to find peers before bootstrapping.
+:Type: Double
+:Default: ``2.0``
+
+
+``mon daemon bytes``
+
+:Description: The message memory cap for metadata server and OSD messages (in bytes).
+:Type: 64-bit Integer Unsigned
+:Default: ``400ul << 20``
+
+
+``mon max log entries per event``
+
+:Description: The maximum number of log entries per event.
+:Type: Integer
+:Default: ``4096``
+
+
+``mon osd prime pg temp``
+
+:Description: Enables or disables priming the PGMap with the previous OSDs when
+              an ``out`` OSD comes back into the cluster. With the ``true``
+              setting, clients will continue to use the previous OSDs until the
+              newly ``in`` OSDs have peered for that PG.
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd prime pg temp max time``
+
+:Description: How much time in seconds the monitor should spend trying to prime the
+              PGMap when an out OSD comes back into the cluster.
+:Type: Float
+:Default: ``0.5``
+
+
+``mon osd prime pg temp max time estimate``
+
+:Description: Maximum estimate of time spent on each PG before we prime all PGs
+              in parallel.
+:Type: Float
+:Default: ``0.25``
+
+
+``mon osd allow primary affinity``
+
+:Description: Allow ``primary_affinity`` to be set in the osdmap.
+:Type: Boolean
+:Default: False
+
+
+``mon osd pool ec fast read``
+
+:Description: Whether to turn on fast read for the pool. It will be used as
+              the default setting of newly created erasure pools if ``fast_read``
+              is not specified at create time.
+:Type: Boolean
+:Default: False
+
+
+``mon mds skip sanity``
+
+:Description: Skip safety assertions on FSMap (in case of bugs where we want to
+              continue anyway). The monitor terminates if the FSMap sanity check
+              fails, but we can disable it by enabling this option.
+:Type: Boolean
+:Default: False
+
+
+``mon max mdsmap epochs``
+
+:Description: The maximum number of mdsmap epochs to trim during a single proposal.
+:Type: Integer
+:Default: 500
+
+
+``mon config key max entry size``
+
+:Description: The maximum size of a config-key entry (in bytes).
+:Type: Integer
+:Default: 4096
+
+
+``mon scrub interval``
+
+:Description: How often (in seconds) the monitor scrubs its store by comparing
+              the stored checksums with the computed ones for all stored
+              keys.
+:Type: Integer
+:Default: 3600*24
+
+
+``mon scrub max keys``
+
+:Description: The maximum number of keys to scrub each time.
+:Type: Integer
+:Default: 100
+
+
+``mon compact on start``
+
+:Description: Compact the database used as the Ceph Monitor store on
+              ``ceph-mon`` start. A manual compaction helps to shrink the
+              monitor database and improve its performance if the regular
+              compaction fails to work.
+:Type: Boolean
+:Default: False
+
+
+``mon compact on bootstrap``
+
+:Description: Compact the database used as the Ceph Monitor store on
+              bootstrap. Monitors start probing each other to create
+              a quorum after bootstrap. If a monitor times out before joining
+              the quorum, it will start over and bootstrap itself again.
+:Type: Boolean
+:Default: False
+
+
+``mon compact on trim``
+
+:Description: Compact a certain prefix (including paxos) when we trim its old states.
+:Type: Boolean
+:Default: True
+
+
+``mon cpu threads``
+
+:Description: Number of threads for performing CPU intensive work on the monitor.
+:Type: Integer
+:Default: 4
+
+
+``mon osd mapping pgs per chunk``
+
+:Description: We calculate the mapping from placement group to OSDs in chunks.
+              This option specifies the number of placement groups per chunk.
+:Type: Integer
+:Default: 4096
+
+
+``mon osd max split count``
+
+:Description: Largest number of PGs per "involved" OSD to let split create.
+              When we increase the ``pg_num`` of a pool, the placement groups
+              will be split on all OSDs serving that pool. We want to avoid
+              extreme multipliers on PG splits.
+:Type: Integer
+:Default: 300
+
+
+``mon session timeout``
+
+:Description: The monitor will terminate inactive sessions that stay idle over
+              this time limit.
+:Type: Integer
+:Default: 300
+
+
+
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
+.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys
+.. _Ceph configuration file: ../ceph-conf/#monitors
+.. _Network Configuration Reference: ../network-config-ref
+.. _Monitor lookup through DNS: ../mon-lookup-dns
+.. _ACID: http://en.wikipedia.org/wiki/ACID
+.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons
+.. _Add/Remove a Monitor (ceph-deploy): ../../deployment/ceph-deploy-mon
+.. _Monitoring a Cluster: ../../operations/monitoring
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg
+.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap
+.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address
+.. _Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
+.. _Pool values: ../../operations/pools/#set-pool-values
diff --git a/src/ceph/doc/rados/configuration/mon-lookup-dns.rst b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst
new file mode 100644
index 0000000..e32b320
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst
@@ -0,0 +1,51 @@
+===============================
+Looking up Monitors through DNS
+===============================
+
+Since version 11.0.0, RADOS supports looking up Monitors through DNS.
+
+This way, daemons and clients do not require a *mon host* configuration
+directive in their ceph.conf configuration file.
+
+Using DNS SRV TCP records, clients are able to look up the monitors.
+
+This allows for less configuration on clients and monitors. Using a DNS update,
+clients and daemons can be made aware of changes in the monitor topology.
+
+By default, clients and daemons will look for the TCP service called *ceph-mon*,
+which is configured by the *mon_dns_srv_name* configuration directive.
+
+
+``mon dns srv name``
+
+:Description: The service name used when querying the DNS for the monitor
+              hosts/addresses.
+:Type: String
+:Default: ``ceph-mon``
+
+Example
+-------
+
+When the DNS search domain is set to *example.com*, a DNS zone file might
+contain the following elements.
+
+First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA).
+
+::
+
+    mon1.example.com. AAAA 2001:db8::100
+    mon2.example.com. AAAA 2001:db8::200
+    mon3.example.com. AAAA 2001:db8::300
+
+::
+
+    mon1.example.com. A 192.168.0.1
+    mon2.example.com. A 192.168.0.2
+    mon3.example.com. A 192.168.0.3
+
+
+With those records in place, we can create the SRV TCP records with the name
+*ceph-mon* pointing to the three Monitors.
+
+::
+
+    _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon1.example.com.
+    _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon2.example.com.
+    _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon3.example.com.
+
+In this case the Monitors are running on port *6789*, and their priority and
+weight are *10* and *60*, respectively.
+
+The current implementation in clients and daemons will *only* respect the
+priority set in SRV records, and they will only connect to the monitors with
+the lowest-numbered priority. Targets with the same priority will be
+selected at random.
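+
+If your DNS uses a different service name, you can point Ceph at it. A hedged
+sketch; *cephmon-internal* is a hypothetical service name, not a convention
+(the SRV records would then be named *_cephmon-internal._tcp.example.com.*):
+
+.. code-block:: ini
+
+    [global]
+    mon dns srv name = cephmon-internal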
diff --git a/src/ceph/doc/rados/configuration/mon-osd-interaction.rst b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst
new file mode 100644
index 0000000..e335ff0
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst
@@ -0,0 +1,408 @@
+=====================================
+ Configuring Monitor/OSD Interaction
+=====================================
+
+.. index:: heartbeat
+
+After you have completed your initial Ceph configuration, you may deploy and run
+Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the
+:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage
+Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring
+reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph
+OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph
+Monitor doesn't receive reports, or if it receives reports of changes in the
+Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph
+Cluster Map`.
+
+Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon
+interaction. However, you may override the defaults. The following sections
+describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of
+monitoring the Ceph Storage Cluster.
+
+.. index:: heartbeat interval
+
+OSDs Check Heartbeats
+=====================
+
+Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6
+seconds. You can change the heartbeat interval by adding an ``osd heartbeat
+interval`` setting under the ``[osd]`` section of your Ceph configuration file,
+or by setting the value at runtime. If a neighboring Ceph OSD Daemon doesn't
+show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may
+consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph
+Monitor, which will update the Ceph Cluster Map. You may change this grace
+period by adding an ``osd heartbeat grace`` setting under the ``[mon]``
+and ``[osd]`` or ``[global]`` section of your Ceph configuration file,
+or by setting the value at runtime.
+
+
+.. ditaa:: +---------+          +---------+
+           |  OSD 1  |          |  OSD 2  |
+           +---------+          +---------+
+                |                    |
+                |----+ Heartbeat     |
+                |    | Interval      |
+                |<---+ Exceeded      |
+                |                    |
+                |       Check        |
+                |     Heartbeat      |
+                |------------------->|
+                |                    |
+                |<-------------------|
+                |   Heart Beating    |
+                |                    |
+                |----+ Heartbeat     |
+                |    | Interval      |
+                |<---+ Exceeded      |
+                |                    |
+                |       Check        |
+                |     Heartbeat      |
+                |------------------->|
+                |                    |
+                |----+ Grace         |
+                |    | Period        |
+                |<---+ Exceeded      |
+                |                    |
+                |----+ Mark          |
+                |    | OSD 2         |
+                |<---+ Down          |
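+
+As a sketch, both settings can be pinned explicitly in ``ceph.conf`` (the
+values shown are the documented defaults; note that ``osd heartbeat grace``
+must be visible to both monitors and OSDs, hence ``[global]``):
+
+.. code-block:: ini
+
+    [global]
+    osd heartbeat grace = 20
+
+    [osd]
+    osd heartbeat interval = 6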
+
+
+.. index:: OSD down report
+
+OSDs Report Down OSDs
+=====================
+
+By default, two Ceph OSD Daemons from different hosts must report to the Ceph
+Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors
+acknowledge that the reported Ceph OSD Daemon is ``down``. But there is a chance
+that all the OSDs reporting the failure are hosted in a rack with a bad switch
+which has trouble connecting to another OSD. To avoid this sort of false alarm,
+we consider the peers reporting a failure a proxy for a potential "subcluster"
+within the overall cluster that is similarly laggy. This is clearly not true in
+all cases, but will sometimes help us localize the grace correction to a subset
+of the system that is unhappy. ``mon osd reporter subtree level`` is used to
+group the peers into the "subcluster" by their common ancestor type in the CRUSH
+map. By default, only two reports from different subtrees are required to report
+another Ceph OSD Daemon ``down``. You can change the number of reporters from
+unique subtrees and the common ancestor type required to report a Ceph OSD
+Daemon ``down`` to a Ceph Monitor by adding ``mon osd min down reporters``
+and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of
+your Ceph configuration file, or by setting the value at runtime.
+
+
+.. ditaa:: +---------+     +---------+      +---------+
+           |  OSD 1  |     |  OSD 2  |      | Monitor |
+           +---------+     +---------+      +---------+
+                |               |                |
+                | OSD 3 Is Down |                |
+                |---------------+--------------->|
+                |               |                |
+                |               |                |
+                |               | OSD 3 Is Down  |
+                |               |--------------->|
+                |               |                |
+                |               |                |
+                |               |                |---------+ Mark
+                |               |                |         | OSD 3
+                |               |                |<--------+ Down
+
+
+.. index:: peering failure
+
+OSDs Report Peering Failure
+===========================
+
+If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its
+Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for
+the most recent copy of the cluster map every 30 seconds. You can change the
+Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval``
+setting under the ``[osd]`` section of your Ceph configuration file, or by
+setting the value at runtime.
+
+.. ditaa:: +---------+     +---------+     +-------+     +---------+
+           |  OSD 1  |     |  OSD 2  |     | OSD 3 |     | Monitor |
+           +---------+     +---------+     +-------+     +---------+
+                |               |              |              |
+                |  Request To   |              |              |
+                |     Peer      |              |              |
+                |-------------->|              |              |
+                |<--------------|              |              |
+                |    Peering                   |              |
+                |                              |              |
+                |  Request To                  |              |
+                |     Peer                     |              |
+                |----------------------------->|              |
+                |                                             |
+                |----+ OSD Monitor                            |
+                |    | Heartbeat                              |
+                |<---+ Interval Exceeded                      |
+                |                                             |
+                |         Failed to Peer with OSD 3           |
+                |-------------------------------------------->|
+                |<--------------------------------------------|
+                |          Receive New Cluster Map            |
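+
+For example, to poll the monitor less frequently while failing to peer, you
+could raise the interval; the value below is illustrative only:
+
+.. code-block:: ini
+
+    [osd]
+    osd mon heartbeat interval = 60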
+
+
+.. index:: OSD status
+
+OSDs Report Their Status
+========================
+
+If a Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will
+consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout``
+elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor within 5 seconds of
+a reportable event such as a failure, a change in placement group stats, a
+change in ``up_thru``, or a boot. You can change the Ceph OSD
+Daemon minimum report interval by adding an ``osd mon report interval min``
+setting under the ``[osd]`` section of your Ceph configuration file, or by
+setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph
+Monitor every 120 seconds irrespective of whether any notable changes occur.
+You can change the Ceph Monitor report interval by adding an ``osd mon report
+interval max`` setting under the ``[osd]`` section of your Ceph configuration
+file, or by setting the value at runtime.
+
+
+.. ditaa:: +---------+          +---------+
+           |  OSD 1  |          | Monitor |
+           +---------+          +---------+
+                |                    |
+                |----+ Report Min    |
+                |    | Interval      |
+                |<---+ Exceeded      |
+                |                    |
+                |----+ Reportable    |
+                |    | Event         |
+                |<---+ Occurs        |
+                |                    |
+                |     Report To      |
+                |      Monitor       |
+                |------------------->|
+                |                    |
+                |----+ Report Max    |
+                |    | Interval      |
+                |<---+ Exceeded      |
+                |                    |
+                |     Report To      |
+                |      Monitor       |
+                |------------------->|
+                |                    |
+                |----+ Monitor       |
+                |    | Fails         |
+                |<---+               |
+                                     |
+                     +----+ Monitor OSD
+                     |    | Report Timeout
+                     |<---+ Exceeded
+                     |
+                     +----+ Mark
+                     |    | OSD 1
+                     |<---+ Down
+
+
+
+
+Configuration Settings
+======================
+
+When modifying heartbeat settings, you should include them in the ``[global]``
+section of your configuration file.
+
+.. index:: monitor heartbeat
+
+Monitor Settings
+----------------
+
+``mon osd min up ratio``
+
+:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will
+              mark Ceph OSD Daemons ``down``.
+
+:Type: Double
+:Default: ``.3``
+
+
+``mon osd min in ratio``
+
+:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will
+              mark Ceph OSD Daemons ``out``.
+
+:Type: Double
+:Default: ``.75``
+
+
+``mon osd laggy halflife``
+
+:Description: The number of seconds laggy estimates will decay.
+:Type: Integer
+:Default: ``60*60``
+
+
+``mon osd laggy weight``
+
+:Description: The weight for new samples in laggy estimation decay.
+:Type: Double
+:Default: ``0.3``
+
+
+
+``mon osd laggy max interval``
+
+:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds).
+              The monitor uses an adaptive approach to evaluate the
+              ``laggy_interval`` of a certain OSD. This value will be used to
+              calculate the grace time for that OSD.
+:Type: Integer
+:Default: 300
+
+``mon osd adjust heartbeat grace``
+
+:Description: If set to ``true``, Ceph will scale based on laggy estimations.
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd adjust down out interval``
+
+:Description: If set to ``true``, Ceph will scale based on laggy estimations.
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd auto mark in``
+
+:Description: Ceph will mark any booting Ceph OSD Daemons as ``in``
+              the Ceph Storage Cluster.
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mon osd auto mark auto out in``
+
+:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out``
+              of the Ceph Storage Cluster as ``in`` the cluster.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd auto mark new in``
+
+:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the
+              Ceph Storage Cluster.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``mon osd down out interval``
+
+:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon
+              ``down`` and ``out`` if it doesn't respond.
+
+:Type: 32-bit Integer
+:Default: ``600``
+
+
+``mon osd down out subtree limit``
+
+:Description: The smallest :term:`CRUSH` unit type that Ceph will **not**
+              automatically mark out. For instance, if set to ``host`` and if
+              all OSDs of a host are down, Ceph will not automatically mark out
+              these OSDs.
+
+:Type: String
+:Default: ``rack``
+
+
+``mon osd report timeout``
+
+:Description: The grace period in seconds before declaring
+              unresponsive Ceph OSD Daemons ``down``.
+
+:Type: 32-bit Integer
+:Default: ``900``
+
+``mon osd min down reporters``
+
+:Description: The minimum number of Ceph OSD Daemons required to report a
+              ``down`` Ceph OSD Daemon.
+
+:Type: 32-bit Integer
+:Default: ``2``
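+
+Putting a few of the monitor-side settings above together, a hedged sketch
+(illustrative values, not recommendations):
+
+.. code-block:: ini
+
+    [mon]
+    mon osd down out interval = 600
+    mon osd down out subtree limit = host
+    mon osd min down reporters = 3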
:Type: 32-bit Integer
:Default: ``2``


``mon osd reporter subtree level``

:Description: The CRUSH level of the parent bucket in which distinct failure
              reporters are counted. OSDs send failure reports to a monitor if
              they find that a peer is not responsive, and the monitor marks
              the reported OSD ``down`` (and eventually ``out``) after a grace
              period.
:Type: String
:Default: ``host``


.. index:: OSD heartbeat

OSD Settings
------------

``osd heartbeat address``

:Description: A Ceph OSD Daemon's network address for heartbeats.
:Type: Address
:Default: The host address.


``osd heartbeat interval``

:Description: How often a Ceph OSD Daemon pings its peers (in seconds).
:Type: 32-bit Integer
:Default: ``6``


``osd heartbeat grace``

:Description: The elapsed time without a heartbeat after which the Ceph Storage
              Cluster considers a Ceph OSD Daemon ``down``.
              This setting must be set in both the ``[mon]`` and ``[osd]``
              sections (or in ``[global]``) so that it is read by both the MON
              and OSD daemons.
:Type: 32-bit Integer
:Default: ``20``


``osd mon heartbeat interval``

:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no
              Ceph OSD Daemon peers.

:Type: 32-bit Integer
:Default: ``30``


``osd mon report interval max``

:Description: The maximum time in seconds that a Ceph OSD Daemon can wait before
              it must report to a Ceph Monitor.

:Type: 32-bit Integer
:Default: ``120``


``osd mon report interval min``

:Description: The minimum number of seconds a Ceph OSD Daemon may wait
              from startup or another reportable event before reporting
              to a Ceph Monitor.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: Should be less than ``osd mon report interval max``


``osd mon ack timeout``

:Description: The number of seconds to wait for a Ceph Monitor to acknowledge a
              request for statistics.

:Type: 32-bit Integer
:Default: ``30``
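
As an illustration, a cluster on a congested or lossy network might lengthen
the heartbeat grace period. This is only a sketch (the value ``60`` is an
arbitrary example); note that ``osd heartbeat grace`` must be visible to both
monitors and OSDs, so it is placed in ``[global]``:

.. code-block:: ini

    [global]
    osd heartbeat grace = 60
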
diff --git a/src/ceph/doc/rados/configuration/ms-ref.rst b/src/ceph/doc/rados/configuration/ms-ref.rst
new file mode 100644
index 0000000..55d009e
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/ms-ref.rst
@@ -0,0 +1,154 @@
===========
 Messaging
===========

General Settings
================

``ms tcp nodelay``

:Description: Disables Nagle's algorithm on messenger TCP sessions.
:Type: Boolean
:Required: No
:Default: ``true``


``ms initial backoff``

:Description: The initial time to wait before reconnecting on a fault.
:Type: Double
:Required: No
:Default: ``.2``


``ms max backoff``

:Description: The maximum time to wait before reconnecting on a fault.
:Type: Double
:Required: No
:Default: ``15.0``


``ms nocrc``

:Description: Disables CRC on network messages. May increase performance if
              CPU-limited.
:Type: Boolean
:Required: No
:Default: ``false``


``ms die on bad msg``

:Description: Debug option; do not configure.
:Type: Boolean
:Required: No
:Default: ``false``


``ms dispatch throttle bytes``

:Description: Throttles the total size of messages waiting to be dispatched.
:Type: 64-bit Unsigned Integer
:Required: No
:Default: ``100 << 20``


``ms bind ipv6``

:Description: Enable if you want your daemons to bind to IPv6 addresses instead
              of IPv4 ones. (Not required if you specify a daemon or cluster IP.)
:Type: Boolean
:Required: No
:Default: ``false``


``ms rwthread stack bytes``

:Description: Debug option for stack size; do not configure.
:Type: 64-bit Unsigned Integer
:Required: No
:Default: ``1024 << 10``


``ms tcp read timeout``

:Description: Controls how long (in seconds) the messenger will wait before
              closing an idle connection.
:Type: 64-bit Unsigned Integer
:Required: No
:Default: ``900``


``ms inject socket failures``

:Description: Debug option; do not configure.
:Type: 64-bit Unsigned Integer
:Required: No
:Default: ``0``

Async messenger options
=======================


``ms async transport type``

:Description: Transport type used by the Async Messenger. Can be ``posix``,
              ``dpdk`` or ``rdma``. ``posix`` uses standard TCP/IP networking
              and is the default. Other transports may be experimental and
              support may be limited.
:Type: String
:Required: No
:Default: ``posix``


``ms async op threads``

:Description: Initial number of worker threads used by each Async Messenger
              instance. Should be at least equal to the highest number of
              replicas, but you can decrease it if you are low on CPU core
              count and/or you host a lot of OSDs on a single server.
:Type: 64-bit Unsigned Integer
:Required: No
:Default: ``3``


``ms async max op threads``

:Description: Maximum number of worker threads used by each Async Messenger
              instance. Set to lower values when your machine has a limited
              CPU count, and increase it when your CPUs are underutilized
              (i.e. one or more of the CPUs are constantly at 100% load during
              I/O operations).
:Type: 64-bit Unsigned Integer
:Required: No
:Default: ``5``


``ms async set affinity``

:Description: Set to ``true`` to bind Async Messenger workers to particular CPU cores.
:Type: Boolean
:Required: No
:Default: ``true``


``ms async affinity cores``

:Description: When ``ms async set affinity`` is ``true``, this string specifies
              how Async Messenger workers are bound to CPU cores. For example,
              "0,2" will bind workers #1 and #2 to CPU cores #0 and #2,
              respectively. NOTE: when manually setting affinity, make sure not
              to assign workers to processors that are virtual CPUs created as
              an effect of Hyperthreading or similar technology, because they
              are slower than regular CPU cores.
:Type: String
:Required: No
:Default: ``(empty)``


``ms async send inline``

:Description: Send messages directly from the thread that generated them instead
              of queuing and sending from the Async Messenger thread. This
              option is known to decrease performance on systems with a lot of
              CPU cores, so it's disabled by default.
:Type: Boolean
:Required: No
:Default: ``false``
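
For example, a CPU-constrained host running many OSDs might lower the Async
Messenger worker counts described above. The values here are an illustrative
sketch, not a recommendation:

.. code-block:: ini

    [global]
    ms async op threads = 2
    ms async max op threads = 3
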
diff --git a/src/ceph/doc/rados/configuration/network-config-ref.rst b/src/ceph/doc/rados/configuration/network-config-ref.rst
new file mode 100644
index 0000000..2d7f9d6
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/network-config-ref.rst
@@ -0,0 +1,494 @@
=================================
 Network Configuration Reference
=================================

Network configuration is critical for building a high performance :term:`Ceph
Storage Cluster`. The Ceph Storage Cluster does not perform request routing or
dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make
requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication
on behalf of Ceph Clients, which means replication and other factors impose
additional loads on Ceph Storage Cluster networks.

Our Quick Start configurations provide a trivial `Ceph configuration file`_ that
sets monitor IP addresses and daemon host names only. Unless you specify a
cluster network, Ceph assumes a single "public" network. Ceph functions just
fine with a public network only, but you may see significant performance
improvement with a second "cluster" network in a large cluster.

We recommend running a Ceph Storage Cluster with two networks: a public
(front-side) network and a cluster (back-side) network. To support two networks,
each :term:`Ceph Node` will need to have more than one NIC. See `Hardware
Recommendations - Networks`_ for additional details.

.. ditaa::
                               +-------------+
                               | Ceph Client |
                               +----*--*-----+
                                    |  ^
                            Request |  : Response
                                    v  |
 /----------------------------------*--*-------------------------------------\
 |                              Public Network                               |
 \---*--*------------*--*-------------*--*------------*--*------------*--*---/
     ^  ^            ^  ^             ^  ^            ^  ^            ^  ^
     |  |            |  |             |  |            |  |            |  |
     |  :            |  :             |  :            |  :            |  :
     v  v            v  v             v  v            v  v            v  v
 +---*--*---+    +---*--*---+     +---*--*---+    +---*--*---+    +---*--*---+
 | Ceph MON |    | Ceph MDS |     | Ceph OSD |    | Ceph OSD |    | Ceph OSD |
 +----------+    +----------+     +---*--*---+    +---*--*---+    +---*--*---+
                                      ^  ^            ^  ^            ^  ^
    The cluster network relieves      |  |            |  |            |  |
    OSD replication and heartbeat     |  :            |  :            |  :
    traffic from the public network.  v  v            v  v            v  v
 /------------------------------------*--*------------*--*------------*--*---\
 |                              cCCC Cluster Network                         |
 \---------------------------------------------------------------------------/


There are several reasons to consider operating two separate networks:

#. **Performance:** Ceph OSD Daemons handle data replication for the Ceph
   Clients. When Ceph OSD Daemons replicate data more than once, the network
   load between Ceph OSD Daemons easily dwarfs the network load between Ceph
   Clients and the Ceph Storage Cluster. This can introduce latency and
   create a performance problem. Recovery and rebalancing can
   also introduce significant latency on the public network. See
   `Scalability and High Availability`_ for additional details on how Ceph
   replicates data. See `Monitor / OSD Interaction`_ for details on heartbeat
   traffic.

#. **Security**: While most people are generally civil, a very tiny segment of
   the population likes to engage in what's known as a Denial of Service (DoS)
   attack. When traffic between Ceph OSD Daemons gets disrupted, placement
   groups may no longer reflect an ``active + clean`` state, which may prevent
   users from reading and writing data. A great way to defeat this type of
   attack is to maintain a completely separate cluster network that doesn't
   connect directly to the internet. Also, consider using `Message Signatures`_
   to defeat spoofing attacks.


IP Tables
=========

By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may
configure this range at your discretion. Before configuring your IP tables,
check the default ``iptables`` configuration::

    sudo iptables -L

Some Linux distributions include rules that reject all inbound requests
except SSH from all network interfaces. For example::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You will need to delete these rules on both your public and cluster networks
initially, and replace them with appropriate rules when you are ready to
harden the ports on your Ceph Nodes.


Monitor IP Tables
-----------------

Ceph Monitors listen on port ``6789`` by default. Additionally, Ceph Monitors
always operate on the public network.
When you add the rule using the example +below, make sure you replace ``{iface}`` with the public network interface +(e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP address of the +public network and ``{netmask}`` with the netmask for the public network. :: + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS IP Tables +------------- + +A :term:`Ceph Metadata Server` listens on the first available port on the public +network beginning at port 6800. Note that this behavior is not deterministic, so +if you are running more than one OSD or MDS on the same host, or if you restart +the daemons within a short window of time, the daemons will bind to higher +ports. You should open the entire 6800-7300 range by default. When you add the +rule using the example below, make sure you replace ``{iface}`` with the public +network interface (e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP +address of the public network and ``{netmask}`` with the netmask of the public +network. + +For example:: + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#. Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. + +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example:: + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial `Ceph configuration file`_ that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. 
You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + ... + public network = {public-network/netmask} + + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + ... + cluster network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + + +Ceph Daemons +============ + +Ceph has one network configuration requirement that applies to all daemons: the +Ceph configuration file **MUST** specify the ``host`` for each daemon. Ceph also +requires that a Ceph configuration file specify the monitor IP address and its +port. + +.. important:: Some deployment tools (e.g., ``ceph-deploy``, Chef) may create a + configuration file for you. **DO NOT** set these values if the deployment + tool does it for you. + +.. tip:: The ``host`` setting is the short name of the host (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the host. + + +.. code-block:: ini + + [mon.a] + + host = {hostname} + mon addr = {ip-address}:6789 + + [osd.0] + host = {hostname} + + +You do not have to set the host IP address for a daemon. If you have a static IP +configuration and both public and cluster networks running, the Ceph +configuration file may specify the IP address of the host for each daemon. To +set a static IP address for a daemon, the following option(s) should appear in +the daemon instance sections of your ``ceph.conf`` file. + +.. code-block:: ini + + [osd.0] + public addr = {host-public-ip-address} + cluster addr = {host-cluster-ip-address} + + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single NIC in a + cluster with two networks. However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the number of the OSD with one NIC. 
   Additionally, the public
   network and cluster network must be able to route traffic to each other,
   which we don't recommend for security reasons.


Network Config Settings
=======================

Network configuration settings are not required. Ceph assumes a public network
with all hosts operating on it unless you specifically configure a cluster
network.


Public Network
--------------

The public network configuration allows you to specifically define IP addresses
and subnets for the public network. You may specifically assign static IP
addresses or override ``public network`` settings using the ``public addr``
setting for a specific daemon.

``public network``

:Description: The IP address and netmask of the public (front-side) network
              (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify
              comma-delimited subnets.

:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]``
:Required: No
:Default: N/A


``public addr``

:Description: The IP address for the public (front-side) network.
              Set for each daemon.

:Type: IP Address
:Required: No
:Default: N/A



Cluster Network
---------------

The cluster network configuration allows you to declare a cluster network, and
specifically define IP addresses and subnets for the cluster network. You may
specifically assign static IP addresses or override ``cluster network``
settings using the ``cluster addr`` setting for specific OSD daemons.


``cluster network``

:Description: The IP address and netmask of the cluster (back-side) network
              (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify
              comma-delimited subnets.

:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]``
:Required: No
:Default: N/A


``cluster addr``

:Description: The IP address for the cluster (back-side) network.
              Set for each daemon.

:Type: Address
:Required: No
:Default: N/A


Bind
----

Bind settings set the default port ranges Ceph OSD and MDS daemons use. The
default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration
allows you to use the configured port range.

You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4
addresses.


``ms bind port min``

:Description: The minimum port number to which an OSD or MDS daemon will bind.
:Type: 32-bit Integer
:Default: ``6800``
:Required: No


``ms bind port max``

:Description: The maximum port number to which an OSD or MDS daemon will bind.
:Type: 32-bit Integer
:Default: ``7300``
:Required: No


``ms bind ipv6``

:Description: Enables Ceph daemons to bind to IPv6 addresses. Currently the
              messenger uses *either* IPv4 or IPv6, but it cannot do both.
:Type: Boolean
:Default: ``false``
:Required: No

``public bind addr``

:Description: In some dynamic deployments the Ceph MON daemon might bind
              to an IP address locally that is different from the ``public addr``
              advertised to other peers in the network. The environment must ensure
              that routing rules are set correctly. If ``public bind addr`` is set,
              the Ceph MON daemon will bind to it locally and use ``public addr``
              in the monmaps to advertise its address to peers. This behavior is
              limited to the MON daemon.

:Type: IP Address
:Required: No
:Default: N/A



Hosts
-----

Ceph expects at least one monitor declared in the Ceph configuration file, with
a ``mon addr`` setting under each declared monitor.
Ceph expects a ``host``
setting under each declared monitor, metadata server and OSD in the Ceph
configuration file. Optionally, a monitor can be assigned a priority; if
priorities are specified, clients will always connect to the monitor with the
lowest priority value.


``mon addr``

:Description: A list of ``{hostname}:{port}`` entries that clients can use to
              connect to a Ceph monitor. If not set, Ceph searches ``[mon.*]``
              sections.

:Type: String
:Required: No
:Default: N/A

``mon priority``

:Description: The priority of the declared monitor. The lower the value, the
              more preferred the monitor is when a client selects a monitor to
              connect to the cluster.

:Type: Unsigned 16-bit Integer
:Required: No
:Default: 0

``host``

:Description: The hostname. Use this setting for specific daemon instances
              (e.g., ``[osd.0]``).

:Type: String
:Required: Yes, for daemon instances.
:Default: ``localhost``

.. tip:: Do not use ``localhost``. To get your host name, execute
         ``hostname -s`` on your command line and use the name of your host
         (to the first period, not the fully-qualified domain name).

.. important:: You should not specify any value for ``host`` when using a third
               party deployment system that retrieves the host name for you.



TCP
---

Ceph disables TCP buffering by default.


``ms tcp nodelay``

:Description: Ceph enables ``ms tcp nodelay`` so that each request is sent
              immediately (no buffering). Disabling `Nagle's algorithm`_
              increases network traffic, which can introduce latency. If you
              experience large numbers of small packets, you may try
              disabling ``ms tcp nodelay``.

:Type: Boolean
:Required: No
:Default: ``true``



``ms tcp rcvbuf``

:Description: The size of the socket buffer on the receiving end of a network
              connection. Disabled by default.

:Type: 32-bit Integer
:Required: No
:Default: ``0``



``ms tcp read timeout``

:Description: If a client or daemon makes a request to another Ceph daemon and
              does not drop an unused connection, the ``ms tcp read timeout``
              defines the connection as idle after the specified number
              of seconds.

:Type: Unsigned 64-bit Integer
:Required: No
:Default: ``900`` (15 minutes)
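
For instance, if you observe large numbers of small packets and want to
experiment with re-enabling TCP buffering as described above, the change is a
single setting (a sketch only; the default of ``true`` is appropriate for most
clusters):

.. code-block:: ini

    [global]
    ms tcp nodelay = false
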
.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks
.. _Ceph configuration file: ../../../start/quick-ceph-deploy/#create-a-cluster
.. _hardware recommendations: ../../../start/hardware-recommendations
.. _Monitor / OSD Interaction: ../mon-osd-interaction
.. _Message Signatures: ../auth-config-ref#signatures
.. _CIDR: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing
.. _Nagle's Algorithm: http://en.wikipedia.org/wiki/Nagle's_algorithm
diff --git a/src/ceph/doc/rados/configuration/osd-config-ref.rst b/src/ceph/doc/rados/configuration/osd-config-ref.rst
new file mode 100644
index 0000000..fae7078
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/osd-config-ref.rst
@@ -0,0 +1,1105 @@
======================
 OSD Config Reference
======================

.. index:: OSD; configuration

You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD
Daemons can use the default values and a very minimal configuration. A minimal
Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and
uses default values for nearly everything else.

Ceph OSD Daemons are numerically identified in incremental fashion, beginning
with ``0`` using the following convention::

    osd.0
    osd.1
    osd.2

In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
(e.g., ``host``), enter it in an OSD-specific section of your configuration
file. For example:

.. code-block:: ini

    [osd]
    osd journal size = 1024

    [osd.0]
    host = osd-host-a

    [osd.1]
    host = osd-host-b


.. index:: OSD; config settings

General Settings
================

The following settings provide a Ceph OSD Daemon's ID, and determine paths to
data and journals. Ceph deployment scripts typically generate the UUID
automatically. We **DO NOT** recommend changing the default paths for data or
journals, as it makes it more problematic to troubleshoot Ceph later.

The journal size should be at least twice the product of the expected drive
speed and ``filestore max sync interval``. However, the most common
practice is to partition the journal drive (often an SSD), and mount it such
that Ceph uses the entire partition for the journal.


``osd uuid``

:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon.
:Type: UUID
:Default: The UUID.
:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid``
       applies to the entire cluster.


``osd data``

:Description: The path to the OSD's data. You must create the directory when
              deploying Ceph. You should mount a drive for OSD data at this
              mount point. We do not recommend changing the default.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id``


``osd max write size``

:Description: The maximum size of a write in megabytes.
:Type: 32-bit Integer
:Default: ``90``


``osd client message size cap``

:Description: The largest client data message allowed in memory.
:Type: 64-bit Unsigned Integer
:Default: 500MB default. ``500*1024L*1024L``


``osd class dir``

:Description: The class path for RADOS class plug-ins.
:Type: String
:Default: ``$libdir/rados-classes``


.. index:: OSD; file system

File System Settings
====================

Ceph builds and mounts file systems which are used for Ceph OSDs.

``osd mkfs options {fs-type}``

:Description: Options used when creating a new Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``-f -i 2048``
:Default for other file systems: {empty string}

For example::

  osd mkfs options xfs = -f -d agcount=24

``osd mount options {fs-type}``

:Description: Options used when mounting a Ceph OSD of type {fs-type}.

:Type: String
:Default for xfs: ``rw,noatime,inode64``
:Default for other file systems: ``rw, noatime``

For example::

  osd mount options xfs = rw, noatime, inode64, logbufs=8


.. index:: OSD; journal settings

Journal Settings
================

By default, Ceph expects that you will store a Ceph OSD Daemon's journal at
the following path::

    /var/lib/ceph/osd/$cluster-$id/journal

Without performance optimization, Ceph stores the journal on the same disk as
the Ceph OSD Daemon's data. A Ceph OSD Daemon optimized for performance may use
a separate disk to store journal data (e.g., a solid state drive delivers high
performance journaling).
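
For example, to place the journal for ``osd.0`` on a dedicated SSD partition
using the ``osd journal`` setting described below, you might use something like
the following sketch (the device path is hypothetical; substitute your own):

.. code-block:: ini

    [osd.0]
    osd journal = /dev/sdb1
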
Ceph's default ``osd journal size`` is 0, so you will need to set this in your
``ceph.conf`` file. To size a journal, take the product of ``filestore
max sync interval`` and the expected throughput, and multiply the product by
two (2)::

    osd journal size = {2 * (expected throughput * filestore max sync interval)}

The expected throughput number should include the expected disk throughput
(i.e., sustained data transfer rate) and network throughput. For example,
a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()``
of the disk and network throughput should provide a reasonable expected
throughput. Some users just start off with a 10GB journal size. For
example::

    osd journal size = 10000


``osd journal``

:Description: The path to the OSD's journal. This may be a path to a file or a
              block device (such as a partition of an SSD). If it is a file,
              you must create the directory to contain it. We recommend using a
              drive separate from the ``osd data`` drive.

:Type: String
:Default: ``/var/lib/ceph/osd/$cluster-$id/journal``


``osd journal size``

:Description: The size of the journal in megabytes. If this is 0, and the
              journal is a block device, the entire block device is used.
              Since v0.54, this is ignored if the journal is a block device,
              and the entire block device is used.

:Type: 32-bit Integer
:Default: ``5120``
:Recommended: Begin with 1GB. Should be at least twice the product of the
              expected speed and ``filestore max sync interval``.


See `Journal Config Reference`_ for additional details.


Monitor OSD Interaction
=======================

Ceph OSD Daemons check each other's heartbeats and report to monitors
periodically. Ceph can use default values in many cases. However, if your
network has latency issues, you may need to adopt longer intervals. See
`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats.


Data Placement
==============

See `Pool & PG Config Reference`_ for details.


.. index:: OSD; scrubbing

Scrubbing
=========

In addition to making multiple copies of objects, Ceph ensures data integrity by
scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the
object storage layer. For each placement group, Ceph generates a catalog of all
objects and compares each primary object and its replicas to ensure that no
objects are missing or mismatched. Light scrubbing (daily) checks the object
size and attributes. Deep scrubbing (weekly) reads the data and uses checksums
to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce
performance. You can adjust the following settings to increase or decrease
scrubbing operations.
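
As an illustration, the sketch below uses the settings documented in this
section to confine scheduled scrubs to a nightly window and to avoid starting
new scrubs during recovery. The hours are placeholders for your own quiet
period:

.. code-block:: ini

    [osd]
    osd scrub begin hour = 1
    osd scrub end hour = 7
    osd scrub during recovery = false
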
``osd max scrubs``

:Description: The maximum number of simultaneous scrub operations for
              a Ceph OSD Daemon.

:Type: 32-bit Int
:Default: ``1``

``osd scrub begin hour``

:Description: The time of day for the lower bound when a scheduled scrub can be
              performed.
:Type: Integer in the range of 0 to 24
:Default: ``0``


``osd scrub end hour``

:Description: The time of day for the upper bound when a scheduled scrub can be
              performed. Along with ``osd scrub begin hour``, it defines a time
              window in which scrubs can happen. However, a scrub will be
              performed regardless of the time window whenever the placement
              group's scrub interval exceeds ``osd scrub max interval``.
:Type: Integer in the range of 0 to 24
:Default: ``24``


``osd scrub during recovery``

:Description: Allow scrub during recovery. Setting this to ``false`` will disable
              scheduling new scrubs (and deep scrubs) while there is active
              recovery. Already running scrubs will be continued. This might be
              useful to reduce load on busy clusters.
:Type: Boolean
:Default: ``true``


``osd scrub thread timeout``

:Description: The maximum time in seconds before timing out a scrub thread.
:Type: 32-bit Integer
:Default: ``60``


``osd scrub finalize thread timeout``

:Description: The maximum time in seconds before timing out a scrub finalize
              thread.

:Type: 32-bit Integer
:Default: ``60*10``


``osd scrub load threshold``

:Description: The maximum load. Ceph will not scrub when the system load
              (as defined by ``getloadavg()``) is higher than this number.
              Default is ``0.5``.

:Type: Float
:Default: ``0.5``


``osd scrub min interval``

:Description: The minimum interval in seconds for scrubbing the Ceph OSD Daemon
              when the Ceph Storage Cluster load is low.

:Type: Float
:Default: Once per day. ``60*60*24``


``osd scrub max interval``

:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon
              irrespective of cluster load.

:Type: Float
:Default: Once per week. ``7*60*60*24``


``osd scrub chunk min``

:Description: The minimum number of object store chunks to scrub during a
              single operation. Ceph blocks writes to a single chunk during
              scrub.

:Type: 32-bit Integer
:Default: ``5``


``osd scrub chunk max``

:Description: The maximum number of object store chunks to scrub during a
              single operation.

:Type: 32-bit Integer
:Default: ``25``


``osd scrub sleep``

:Description: Time to sleep before scrubbing the next group of chunks.
              Increasing this value will slow down the whole scrub operation,
              while client operations will be less impacted.

:Type: Float
:Default: ``0``


``osd deep scrub interval``

:Description: The interval for "deep" scrubbing (fully reading all data). The
              ``osd scrub load threshold`` does not affect this setting.

:Type: Float
:Default: Once per week. ``60*60*24*7``


``osd scrub interval randomize ratio``

:Description: Add a random delay to ``osd scrub min interval`` when scheduling
              the next scrub job for a placement group. The delay is a random
              value less than ``osd scrub min interval`` \* ``osd scrub
              interval randomize ratio``. So the default setting practically
              randomly spreads the scrubs out in the allowed time
              window of ``[1, 1.5]`` \* ``osd scrub min interval``.
:Type: Float
:Default: ``0.5``

``osd deep scrub stride``

:Description: Read size when doing a deep scrub.
:Type: 32-bit Integer
:Default: 512 KB. ``524288``


.. index:: OSD; operations settings

Operations
==========

Operations settings allow you to configure the number of threads for servicing
requests. If you set ``osd op threads`` to ``0``, it disables multi-threading.
By default, Ceph uses two threads with a 30 second timeout and a 30 second
complaint time if an operation doesn't complete within those time parameters.
You can set operations priority weights between client operations and
recovery operations to ensure optimal performance during recovery.
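
For example, to favor client I/O over recovery I/O more strongly than the
defaults do, you could widen the gap between the two priority settings
described below (a sketch, not a recommendation):

.. code-block:: ini

    [osd]
    osd client op priority = 63
    osd recovery op priority = 1
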
``osd op threads``

:Description: The number of threads to service Ceph OSD Daemon operations.
              Set to ``0`` to disable it. Increasing the number may increase
              the request processing rate.

:Type: 32-bit Integer
:Default: ``2``


``osd op queue``

:Description: This sets the type of queue to be used for prioritizing ops
              in the OSDs. Both queues feature a strict sub-queue which is
              dequeued before the normal queue. The normal queue is different
              between implementations. The original PrioritizedQueue (``prio``) uses a
              token bucket system which, when there are sufficient tokens, will
              dequeue high priority queues first. If there are not enough
              tokens available, queues are dequeued from low priority to high
              priority. The WeightedPriorityQueue (``wpq``) dequeues all
              priorities in relation to their priorities to prevent starvation
              of any queue. WPQ should help in cases where a few OSDs are more
              overloaded than others. The new mClock-based OpClassQueue
              (``mclock_opclass``) prioritizes operations based on which class
              they belong to (recovery, scrub, snaptrim, client op, osd subop).
              The mClock-based ClientQueue (``mclock_client``) also
              incorporates the client identifier in order to promote fairness
              between clients. See `QoS Based on mClock`_. Requires a restart.

:Type: String
:Valid Choices: prio, wpq, mclock_opclass, mclock_client
:Default: ``prio``


``osd op queue cut off``

:Description: This selects which priority ops will be sent to the strict
              queue versus the normal queue. The ``low`` setting sends all
              replication ops and higher to the strict queue, while the ``high``
              option sends only replication acknowledgement ops and higher to
              the strict queue. Setting this to ``high`` should help when a few
              OSDs in the cluster are very busy, especially when combined with
              ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy
              handling replication traffic could starve primary client traffic
              on these OSDs without these settings. Requires a restart.

:Type: String
:Valid Choices: low, high
:Default: ``low``


``osd client op priority``

:Description: The priority set for client operations. It is relative to
              ``osd recovery op priority``.

:Type: 32-bit Integer
:Default: ``63``
:Valid Range: 1-63


``osd recovery op priority``

:Description: The priority set for recovery operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``3``
:Valid Range: 1-63


``osd scrub priority``

:Description: The priority set for scrub operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd snap trim priority``

:Description: The priority set for snap trim operations. It is relative to
              ``osd client op priority``.

:Type: 32-bit Integer
:Default: ``5``
:Valid Range: 1-63


``osd op thread timeout``

:Description: The Ceph OSD Daemon operation thread timeout in seconds.
:Type: 32-bit Integer
:Default: ``15``


``osd op complaint time``

:Description: An operation becomes complaint worthy after the specified number
              of seconds have elapsed.

:Type: Float
:Default: ``30``


``osd disk threads``

:Description: The number of disk threads, which are used to perform background
              disk intensive OSD operations such as scrubbing and snap
              trimming.

:Type: 32-bit Integer
:Default: ``1``

``osd disk thread ioprio class``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. Sets the ioprio_set(2) I/O
              scheduling ``class`` for the disk thread.
              Acceptable
              values are ``idle``, ``be`` or ``rt``. The ``idle``
              class means the disk thread will have lower priority
              than any other thread in the OSD. This is useful to slow
              down scrubbing on an OSD that is busy handling client
              operations. ``be`` is the default and is the same
              priority as all other threads in the OSD. ``rt`` means
              the disk thread will have precedence over all other
              threads in the OSD. Note: Only works with the Linux Kernel
              CFQ scheduler. Since Jewel, scrubbing is no longer carried
              out by the disk iothread; see the osd priority options instead.
:Type: String
:Default: the empty string

``osd disk thread ioprio priority``

:Description: Warning: it will only be used if both ``osd disk thread
              ioprio class`` and ``osd disk thread ioprio priority`` are
              set to a non default value. It sets the ioprio_set(2)
              I/O scheduling ``priority`` of the disk thread ranging
              from 0 (highest) to 7 (lowest). If all OSDs on a given
              host were in class ``idle`` and compete for I/O
              (i.e. due to controller congestion), it can be used to
              lower the disk thread priority of one OSD to 7 so that
              another OSD with priority 0 can have priority.
              Note: Only works with the Linux Kernel CFQ scheduler.
:Type: Integer in the range of 0 to 7 or -1 if not to be used.
:Default: ``-1``

``osd op history size``

:Description: The maximum number of completed operations to track.
:Type: 32-bit Unsigned Integer
:Default: ``20``


``osd op history duration``

:Description: The oldest completed operation to track.
:Type: 32-bit Unsigned Integer
:Default: ``600``


``osd op log threshold``

:Description: How many operation logs to display at once.
:Type: 32-bit Integer
:Default: ``5``


QoS Based on mClock
-------------------

Ceph's use of mClock is currently in the experimental phase and should
be approached with an exploratory mindset.

Core Concepts
`````````````

The QoS support of Ceph is implemented using a queueing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_opclass* operation queue divides Ceph services involving I/O
resources into the following buckets:

- client op: the iops issued by clients
- osd subop: the iops issued by a primary OSD
- snap trim: the snap trimming related requests
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests

And the resources are partitioned using the following three sets of tags. In
other words, the share of each type of service is controlled by three tags:

#. reservation: the minimum IOPS allocated for the service.
#. limitation: the maximum IOPS allocated for the service.
#. weight: the proportional share of capacity if there is extra capacity, or
   the system is oversubscribed.

In Ceph, operations are graded with a "cost". The resources allocated
for serving the various services are consumed by these "costs". So, for
example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires it. Assuming there are 2
services: recovery and client ops:

- recovery: (r:1, l:5, w:1)
- client ops: (r:2, l:0, w:9)

The settings above ensure that the recovery won't get more than 5
requests per second serviced, even if it requires so (see CURRENT
IMPLEMENTATION NOTE below), and no other services are competing with
it. But if the clients start to issue a large amount of I/O requests,
they will not exhaust all the I/O resources either. 1 request per second
is always allocated for recovery jobs as long as there are any such
requests. So the recovery jobs won't be starved even in a cluster with
high load. And in the meantime, the client ops can enjoy a larger
portion of the I/O resource, because its weight is "9", while its
competitor's is "1". In the case of client ops, it is not clamped by the
limit setting, so it can make use of all the resources if there is no
recovery ongoing.
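
Expressed with the ``mclock_opclass`` queue options documented later in this
section, that example would look roughly like the following sketch. The tags
are in requests per second, and (as noted below) the current implementation
does not actually enforce the limit values:

.. code-block:: ini

    [osd]
    osd op queue = mclock_opclass
    osd op queue mclock recov res = 1.0
    osd op queue mclock recov lim = 5.0
    osd op queue mclock recov wgt = 1.0
    osd op queue mclock client op res = 2.0
    osd op queue mclock client op lim = 0.0
    osd op queue mclock client op wgt = 9.0
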
Along with *mclock_opclass* another mclock operation queue named
*mclock_client* is available. It divides operations based on category
but also divides them based on the client making the request. This
helps not only manage the distribution of resources spent on different
classes of operations but also tries to ensure fairness among clients.

CURRENT IMPLEMENTATION NOTE: the current experimental implementation
does not enforce the limit values. As a first approximation we decided
not to prevent operations that would otherwise enter the operation
sequencer from doing so.

Subtleties of mClock
````````````````````

The reservation and limit values have a unit of requests per
second. The weight, however, does not technically have a unit and the
weights are relative to one another. So if one class of requests has a
weight of 1 and another a weight of 9, then requests of the latter
class should be executed at a 9 to 1 ratio relative to the first class.
However, that will only happen once the reservations are met and those
values include the operations executed under the reservation phase.

Even though the weights do not have units, one must be careful in
choosing their values due to how the algorithm assigns weight tags to
requests. If the weight is *W*, then for a given class of requests,
the next one that comes in will have a weight tag of *1/W* plus the
previous weight tag or the current time, whichever is larger. That
means if *W* is sufficiently large and therefore *1/W* is sufficiently
small, the calculated tag may never be assigned as it will get a value
of the current time. The ultimate lesson is that values for weight
should not be too large. They should be under the number of requests
one expects to be serviced each second.

Caveats
```````

There are some factors that can reduce the impact of the mClock op
queues within Ceph. First, requests to an OSD are sharded by their
placement group identifier. Each shard has its own mClock queue and
these queues neither interact nor share information among them. The
number of shards can be controlled with the configuration options
``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
``osd_op_num_shards_ssd``. A lower number of shards will increase the
impact of the mClock queues, but may have other deleterious effects.

Second, requests are transferred from the operation queue to the
operation sequencer, in which they go through the phases of
execution. The operation queue is where mClock resides and mClock
determines the next op to transfer to the operation sequencer.
The +number of operations allowed in the operation sequencer is a complex +issue. In general we want to keep enough operations in the sequencer +so it's always getting work done on some operations while it's waiting +for disk and network access to complete on other operations. On the +other hand, once an operation is transferred to the operation +sequencer, mClock no longer has control over it. Therefore to maximize +the impact of mClock, we want to keep as few operations in the +operation sequencer as possible. So we have an inherent tension. + +The configuration options that influence the number of operations in +the operation sequencer are ``bluestore_throttle_bytes``, +``bluestore_throttle_deferred_bytes``, +``bluestore_throttle_cost_per_io``, +``bluestore_throttle_cost_per_io_hdd``, and +``bluestore_throttle_cost_per_io_ssd``. + +A third factor that affects the impact of the mClock algorithm is that +we're using a distributed system, where requests are made to multiple +OSDs and each OSD has (can have) multiple shards. Yet we're currently +using the mClock algorithm, which is not distributed (note: dmClock is +the distributed version of mClock). + +Various organizations and individuals are currently experimenting with +mClock as it exists in this code base along with their modifications +to the code base. We hope you'll share you're experiences with your +mClock and dmClock experiments in the ceph-devel mailing list. + + +``osd push per object cost`` + +:Description: the overhead for serving a push op + +:Type: Unsigned Integer +:Default: 1000 + +``osd recovery max chunk`` + +:Description: the maximum total size of data chunks a recovery op can carry. + +:Type: Unsigned Integer +:Default: 8 MiB + + +``osd op queue mclock client op res`` + +:Description: the reservation of client op. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock client op wgt`` + +:Description: the weight of client op. + +:Type: Float +:Default: 500.0 + + +``osd op queue mclock client op lim`` + +:Description: the limit of client op. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock osd subop res`` + +:Description: the reservation of osd subop. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock osd subop wgt`` + +:Description: the weight of osd subop. + +:Type: Float +:Default: 500.0 + + +``osd op queue mclock osd subop lim`` + +:Description: the limit of osd subop. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock snap res`` + +:Description: the reservation of snap trimming. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock snap wgt`` + +:Description: the weight of snap trimming. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock snap lim`` + +:Description: the limit of snap trimming. + +:Type: Float +:Default: 0.001 + + +``osd op queue mclock recov res`` + +:Description: the reservation of recovery. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock recov wgt`` + +:Description: the weight of recovery. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock recov lim`` + +:Description: the limit of recovery. + +:Type: Float +:Default: 0.001 + + +``osd op queue mclock scrub res`` + +:Description: the reservation of scrub jobs. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock scrub wgt`` + +:Description: the weight of scrub jobs. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock scrub lim`` + +:Description: the limit of scrub jobs. + +:Type: Float +:Default: 0.001 + +.. 
_the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf + + +.. index:: OSD; backfilling + +Backfilling +=========== + +When you add or remove Ceph OSD Daemons to a cluster, the CRUSH algorithm will +want to rebalance the cluster by moving placement groups to or from Ceph OSD +Daemons to restore the balance. The process of migrating placement groups and +the objects they contain can reduce the cluster's operational performance +considerably. To maintain operational performance, Ceph performs this migration +with 'backfilling', which allows Ceph to set backfill operations to a lower +priority than requests to read or write data. + + +``osd max backfills`` + +:Description: The maximum number of backfills allowed to or from a single OSD. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd backfill scan min`` + +:Description: The minimum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``64`` + + +``osd backfill scan max`` + +:Description: The maximum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``512`` + + +``osd backfill retry interval`` + +:Description: The number of seconds to wait before retrying backfill requests. +:Type: Double +:Default: ``10.0`` + +.. index:: OSD; osdmap + +OSD Map +======= + +OSD maps reflect the OSD daemons operating in the cluster. Over time, the +number of map epochs increases. Ceph provides some settings to ensure that +Ceph performs well as the OSD map grows larger. + + +``osd map dedup`` + +:Description: Enable removing duplicates in the OSD map. +:Type: Boolean +:Default: ``true`` + + +``osd map cache size`` + +:Description: The number of OSD maps to keep cached. +:Type: 32-bit Integer +:Default: ``500`` + + +``osd map cache bl size`` + +:Description: The size of the in-memory OSD map cache in OSD daemons. +:Type: 32-bit Integer +:Default: ``50`` + + +``osd map cache bl inc size`` + +:Description: The size of the in-memory OSD map cache incrementals in + OSD daemons. + +:Type: 32-bit Integer +:Default: ``100`` + + +``osd map message max`` + +:Description: The maximum map entries allowed per MOSDMap message. +:Type: 32-bit Integer +:Default: ``100`` + + + +.. index:: OSD; recovery + +Recovery +======== + +When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD +begins peering with other Ceph OSD Daemons before writes can occur. See +`Monitoring OSDs and PGs`_ for details. + +If a Ceph OSD Daemon crashes and comes back online, usually it will be out of +sync with other Ceph OSD Daemons containing more recent versions of objects in +the placement groups. When this happens, the Ceph OSD Daemon goes into recovery +mode and seeks to get the latest copy of the data and bring its map back up to +date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects +and placement groups may be significantly out of date. Also, if a failure domain +went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at +the same time. This can make the recovery process time consuming and resource +intensive. + +To maintain operational performance, Ceph performs recovery with limitations on +the number recovery requests, threads and object chunk sizes which allows Ceph +perform well in a degraded state. + + +``osd recovery delay start`` + +:Description: After peering completes, Ceph will delay for the specified number + of seconds before starting to recover objects. 
+ +:Type: Float +:Default: ``0`` + + +``osd recovery max active`` + +:Description: The number of active recovery requests per OSD at one time. More + requests will accelerate recovery, but the requests places an + increased load on the cluster. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd recovery max chunk`` + +:Description: The maximum size of a recovered chunk of data to push. +:Type: 64-bit Unsigned Integer +:Default: ``8 << 20`` + + +``osd recovery max single start`` + +:Description: The maximum number of recovery operations per OSD that will be + newly started when an OSD is recovering. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd recovery thread timeout`` + +:Description: The maximum time in seconds before timing out a recovery thread. +:Type: 32-bit Integer +:Default: ``30`` + + +``osd recover clone overlap`` + +:Description: Preserves clone overlap during recovery. Should always be set + to ``true``. + +:Type: Boolean +:Default: ``true`` + + +``osd recovery sleep`` + +:Description: Time in seconds to sleep before next recovery or backfill op. + Increasing this value will slow down recovery operation while + client operations will be less impacted. + +:Type: Float +:Default: ``0`` + + +``osd recovery sleep hdd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for HDDs. + +:Type: Float +:Default: ``0.1`` + + +``osd recovery sleep ssd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd recovery sleep hybrid`` + +:Description: Time in seconds to sleep before next recovery or backfill op + when osd data is on HDD and osd journal is on SSD. + +:Type: Float +:Default: ``0.025`` + +Tiering +======= + +``osd agent max ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the high speed mode. +:Type: 32-bit Integer +:Default: ``4`` + + +``osd agent max low ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the low speed mode. +:Type: 32-bit Integer +:Default: ``2`` + +See `cache target dirty high ratio`_ for when the tiering agent flushes dirty +objects within the high speed mode. + +Miscellaneous +============= + + +``osd snap trim thread timeout`` + +:Description: The maximum time in seconds before timing out a snap trim thread. +:Type: 32-bit Integer +:Default: ``60*60*1`` + + +``osd backlog thread timeout`` + +:Description: The maximum time in seconds before timing out a backlog thread. +:Type: 32-bit Integer +:Default: ``60*60*1`` + + +``osd default notify timeout`` + +:Description: The OSD default notification timeout (in seconds). +:Type: 32-bit Unsigned Integer +:Default: ``30`` + + +``osd check for log corruption`` + +:Description: Check log files for corruption. Can be computationally expensive. +:Type: Boolean +:Default: ``false`` + + +``osd remove thread timeout`` + +:Description: The maximum time in seconds before timing out a remove OSD thread. +:Type: 32-bit Integer +:Default: ``60*60`` + + +``osd command thread timeout`` + +:Description: The maximum time in seconds before timing out a command thread. +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd command max records`` + +:Description: Limits the number of lost objects to return. +:Type: 32-bit Integer +:Default: ``256`` + + +``osd auto upgrade tmap`` + +:Description: Uses ``tmap`` for ``omap`` on old objects. 
:Type: Boolean
:Default: ``true``


``osd tmapput sets users tmap``

:Description: Uses ``tmap`` for debugging only.
:Type: Boolean
:Default: ``false``


``osd fast fail on connection refused``

:Description: If this option is enabled, crashed OSDs are marked down
              immediately by connected peers and MONs (assuming that the
              crashed OSD host survives). Disable it to restore old
              behavior, at the expense of possible long I/O stalls when
              OSDs crash in the middle of I/O operations.
:Type: Boolean
:Default: ``true``



.. _pool: ../../operations/pools
.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Pool & PG Config Reference: ../pool-pg-config-ref
.. _Journal Config Reference: ../journal-ref
.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio
diff --git a/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst
new file mode 100644
index 0000000..89a3707
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst
@@ -0,0 +1,270 @@
======================================
 Pool, PG and CRUSH Config Reference
======================================

.. index:: pools; configuration

When you create pools and set the number of placement groups for the pool, Ceph
uses default values when you don't specifically override the defaults. **We
recommend** overriding some of the defaults. Specifically, we recommend setting
a pool's replica size and overriding the default number of placement groups. You
can specifically set these values when running `pool`_ commands. You can also
override the defaults by adding new ones in the ``[global]`` section of your
Ceph configuration file.


.. literalinclude:: pool-pg.conf
   :language: ini



``mon max pool pg num``

:Description: The maximum number of placement groups per pool.
:Type: Integer
:Default: ``65536``


``mon pg create interval``

:Description: Number of seconds between PG creation in the same
              Ceph OSD Daemon.

:Type: Float
:Default: ``30.0``


``mon pg stuck threshold``

:Description: Number of seconds after which PGs can be considered as
              being stuck.

:Type: 32-bit Integer
:Default: ``300``

``mon pg min inactive``

:Description: Issue a ``HEALTH_ERR`` in the cluster log if the number of PGs
              that stay inactive longer than ``mon_pg_stuck_threshold`` exceeds
              this setting. A non-positive number disables this check, so the
              cluster will never go into ERR for this reason.
:Type: Integer
:Default: ``1``


``mon pg warn min per osd``

:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number
              of PGs per (in) OSD is under this number. (a non-positive number
              disables this)
:Type: Integer
:Default: ``30``


``mon pg warn max per osd``

:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number
              of PGs per (in) OSD is above this number. (a non-positive number
              disables this)
:Type: Integer
:Default: ``300``


``mon pg warn min objects``

:Description: Do not warn if the total number of objects in the cluster is
              below this number.
:Type: Integer
:Default: ``1000``


``mon pg warn min pool objects``

:Description: Do not warn on pools whose object number is below this number.
:Type: Integer
:Default: ``1000``


``mon pg check down all threshold``

:Description: The percentage threshold of ``down`` OSDs above which we check
              all PGs for stale ones.

``mon pg warn max object skew``

:Description: Issue a ``HEALTH_WARN`` in the cluster log if the average object
              count of a certain pool is greater than
              ``mon pg warn max object skew`` times the average object count
              of all pools. (A non-positive number disables this.)
:Type: Float
:Default: ``10``


``mon delta reset interval``

:Description: Seconds of inactivity before we reset the PG delta to 0. We keep
              track of the delta of the used space of each pool so that, for
              example, it is easier to understand the progress of recovery or
              the performance of a cache tier. If no activity is reported for
              a certain pool, the history of deltas for that pool is simply
              reset.
:Type: Integer
:Default: ``10``


``mon osd max op age``

:Description: Maximum op age before we get concerned (make it a power of 2).
              A ``HEALTH_WARN`` will be issued if a request has been blocked
              longer than this limit.
:Type: Float
:Default: ``32.0``


``osd pg bits``

:Description: Placement group bits per Ceph OSD Daemon.
:Type: 32-bit Integer
:Default: ``6``


``osd pgp bits``

:Description: The number of bits per Ceph OSD Daemon for PGPs.
:Type: 32-bit Integer
:Default: ``6``


``osd crush chooseleaf type``

:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses
              ordinal rank rather than name.

:Type: 32-bit Integer
:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons.


``osd crush initial weight``

:Description: The initial CRUSH weight for OSDs newly added to the CRUSH map.

:Type: Double
:Default: ``the size of the newly added OSD in TB``. By default, the initial
          CRUSH weight for a newly added OSD is set to its volume size in TB.
          See `Weighting Bucket Items`_ for details.


``osd pool default crush replicated ruleset``

:Description: The default CRUSH ruleset to use when creating a replicated pool.
:Type: 8-bit Integer
:Default: ``CEPH_DEFAULT_CRUSH_REPLICATED_RULESET``, which means "pick
          a ruleset with the lowest numerical ID and use that". This is to
          make pool creation work in the absence of ruleset 0.


``osd pool erasure code stripe unit``

:Description: Sets the default size, in bytes, of a chunk of an object
              stripe for erasure coded pools. Every object of size S
              will be stored as N stripes, with each data chunk
              receiving ``stripe unit`` bytes. Each stripe of ``N *
              stripe unit`` bytes will be encoded/decoded
              individually. This option is overridden by the
              ``stripe_unit`` setting in an erasure code profile.

:Type: Unsigned 32-bit Integer
:Default: ``4096``


``osd pool default size``

:Description: Sets the number of replicas for objects in the pool. The default
              value is the same as
              ``ceph osd pool set {pool-name} size {size}``.

:Type: 32-bit Integer
:Default: ``3``


``osd pool default min size``

:Description: Sets the minimum number of written replicas for objects in the
              pool in order to acknowledge a write operation to the client.
              If the minimum is not met, Ceph will not acknowledge the write
              to the client. This setting ensures a minimum number of replicas
              when operating in ``degraded`` mode.

:Type: 32-bit Integer
:Default: ``0``, which means no particular minimum. If ``0``, the effective
          minimum is ``size - (size / 2)``.
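As a worked example of the default above: with a hypothetical pool of
``size = 4`` and ``min size`` left at ``0``, the effective minimum is
``4 - (4 / 2) = 2``, so writes are acknowledged as long as two replicas can be
written. To make that minimum explicit (illustrative values only)::

    [global]
    osd pool default size = 4
    osd pool default min size = 2
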

``osd pool default pg num``

:Description: The default number of placement groups for a pool. The default
              value is the same as ``pg_num`` with ``mkpool``.

:Type: 32-bit Integer
:Default: ``8``


``osd pool default pgp num``

:Description: The default number of placement groups for placement (PGP) for a
              pool. The default value is the same as ``pgp_num`` with
              ``mkpool``. PG and PGP should be equal (for now).

:Type: 32-bit Integer
:Default: ``8``


``osd pool default flags``

:Description: The default flags for new pools.
:Type: 32-bit Integer
:Default: ``0``


``osd max pgls``

:Description: The maximum number of placement groups to list. A client
              requesting a large number can tie up the Ceph OSD Daemon.

:Type: Unsigned 64-bit Integer
:Default: ``1024``
:Note: Default should be fine.


``osd min pg log entries``

:Description: The minimum number of placement group logs to maintain
              when trimming log files.

:Type: 32-bit Unsigned Integer
:Default: ``1000``


``osd default data pool replay window``

:Description: The time (in seconds) for an OSD to wait for a client to replay
              a request.

:Type: 32-bit Integer
:Default: ``45``


``osd max pg per osd hard ratio``

:Description: The ratio of the number of PGs per OSD allowed by the cluster
              before the OSD refuses to create new PGs. An OSD stops creating
              new PGs if the number of PGs it serves exceeds
              ``osd max pg per osd hard ratio`` \* ``mon max pg per osd``.

:Type: Float
:Default: ``2``
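As a worked example (the ``mon max pg per osd`` value here is hypothetical):
with the default hard ratio of ``2`` and ``mon max pg per osd = 250``, an OSD
refuses to create new PGs once it already serves ``2 * 250 = 500`` of them::

    [global]
    mon max pg per osd = 250           # hypothetical cluster-wide target
    osd max pg per osd hard ratio = 2  # hard stop at 2 * 250 = 500 PGs per OSD
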
.. _pool: ../../operations/pools
.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems
diff --git a/src/ceph/doc/rados/configuration/pool-pg.conf b/src/ceph/doc/rados/configuration/pool-pg.conf
new file mode 100644
index 0000000..5f1b3b7
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/pool-pg.conf
@@ -0,0 +1,20 @@
[global]

    # By default, Ceph makes 3 replicas of objects. If you want to make four
    # copies of an object (a primary copy and three replica copies), reset
    # the default value as shown in 'osd pool default size'.
    # If you want to allow Ceph to write fewer copies in a degraded state,
    # set 'osd pool default min size' to a number less than the
    # 'osd pool default size' value.

    osd pool default size = 4  # Write an object 4 times.
    osd pool default min size = 1 # Allow writing one copy in a degraded state.

    # Ensure you have a realistic number of placement groups. We recommend
    # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
    # divided by the number of replicas (i.e., osd pool default size). So for
    # 10 OSDs and osd pool default size = 4, we'd recommend approximately
    # (100 * 10) / 4 = 250.

    osd pool default pg num = 250
    osd pool default pgp num = 250
diff --git a/src/ceph/doc/rados/configuration/storage-devices.rst b/src/ceph/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 0000000..83c0c9b
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,83 @@
=================
 Storage Devices
=================

There are two Ceph daemons that store data on disk:

* **Ceph OSDs** (or Object Storage Daemons) are where most of the
  data is stored in Ceph. Generally speaking, each OSD is backed by
  a single storage device, like a traditional hard disk (HDD) or
  solid state disk (SSD). OSDs can also be backed by a combination
  of devices, like an HDD for most data and an SSD (or partition of
  an SSD) for some metadata. The number of OSDs in a cluster is
  generally a function of how much data will be stored, how big each
  storage device will be, and the level and type of redundancy
  (replication or erasure coding).
* **Ceph Monitor** daemons manage critical cluster state like cluster
  membership and authentication information. For smaller clusters a
  few gigabytes is all that is needed, although for larger clusters
  the monitor database can reach tens or possibly hundreds of
  gigabytes.


OSD Backends
============

There are two ways that OSDs can manage the data they store. Starting
with the Luminous 12.2.z release, the new default (and recommended) backend is
*BlueStore*. Prior to Luminous, the default (and only option) was
*FileStore*.

BlueStore
---------

BlueStore is a special-purpose storage backend designed specifically
for managing data on disk for Ceph OSD workloads. It is motivated by
experience supporting and managing OSDs using FileStore over the
last ten years. Key BlueStore features include:

* Direct management of storage devices. BlueStore consumes raw block
  devices or partitions. This avoids any intervening layers of
  abstraction (such as local file systems like XFS) that may limit
  performance or add complexity.
* Metadata management with RocksDB. We embed RocksDB's key/value database
  in order to manage internal metadata, such as the mapping from object
  names to block locations on disk.
* Full data and metadata checksumming. By default all data and
  metadata written to BlueStore is protected by one or more
  checksums. No data or metadata will be read from disk or returned
  to the user without being verified.
* Inline compression. Data written may be optionally compressed
  before being written to disk.
* Multi-device metadata tiering. BlueStore allows its internal
  journal (write-ahead log) to be written to a separate, high-speed
  device (like an SSD, NVMe, or NVDIMM) to increase performance. If
  a significant amount of faster storage is available, internal
  metadata can also be stored on the faster device.
* Efficient copy-on-write. RBD and CephFS snapshots rely on a
  copy-on-write *clone* mechanism that is implemented efficiently in
  BlueStore. This results in efficient IO both for regular snapshots
  and for erasure coded pools (which rely on cloning to implement
  efficient two-phase commits).

For more information, see :doc:`bluestore-config-ref`.

FileStore
---------

FileStore is the legacy approach to storing objects in Ceph. It
relies on a standard file system (normally XFS) in combination with a
key/value database (traditionally LevelDB, now RocksDB) for some
metadata.

FileStore is well-tested and widely used in production but suffers
from many performance deficiencies due to its overall design and
reliance on a traditional file system for storing object data.

Although FileStore is generally capable of functioning on most
POSIX-compatible file systems (including btrfs and ext4), we only
recommend that XFS be used. Both btrfs and ext4 have known bugs and
deficiencies and their use may lead to data loss. By default all Ceph
provisioning tools will use XFS.

For more information, see :doc:`filestore-config-ref`.
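As a minimal sketch of selecting a backend (``osd objectstore`` is the
relevant setting; apply it before an OSD is created, since an existing OSD's
backend cannot be switched simply by changing this value)::

    [osd]
    # Use the BlueStore backend for newly created OSDs (the Luminous default).
    osd objectstore = bluestore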