Diffstat (limited to 'src/ceph/doc/rados/configuration')
17 files changed, 0 insertions, 5807 deletions
diff --git a/src/ceph/doc/rados/configuration/auth-config-ref.rst b/src/ceph/doc/rados/configuration/auth-config-ref.rst deleted file mode 100644 index eb14fa4..0000000 --- a/src/ceph/doc/rados/configuration/auth-config-ref.rst +++ /dev/null @@ -1,432 +0,0 @@ -======================== - Cephx Config Reference -======================== - -The ``cephx`` protocol is enabled by default. Cryptographic authentication has -some computational costs, though they should generally be quite low. If the -network environment connecting your client and server hosts is very safe and -you cannot afford authentication, you can turn it off. **This is not generally -recommended**. - -.. note:: If you disable authentication, you are at risk of a man-in-the-middle - attack altering your client/server messages, which could lead to disastrous - security effects. - -For creating users, see `User Management`_. For details on the architecture -of Cephx, see `Architecture - High Availability Authentication`_. - - -Deployment Scenarios -==================== - -There are two main scenarios for deploying a Ceph cluster, which impact -how you initially configure Cephx. Most first time Ceph users use -``ceph-deploy`` to create a cluster (easiest). For clusters using -other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need -to use the manual procedures or configure your deployment tool to -bootstrap your monitor(s). - -ceph-deploy ------------ - -When you deploy a cluster with ``ceph-deploy``, you do not have to bootstrap the -monitor manually or create the ``client.admin`` user or keyring. The steps you -execute in the `Storage Cluster Quick Start`_ will invoke ``ceph-deploy`` to do -that for you. - -When you execute ``ceph-deploy new {initial-monitor(s)}``, Ceph will create a -monitor keyring for you (only used to bootstrap monitors), and it will generate -an initial Ceph configuration file for you, which contains the following -authentication settings, indicating that Ceph enables authentication by -default:: - - auth_cluster_required = cephx - auth_service_required = cephx - auth_client_required = cephx - -When you execute ``ceph-deploy mon create-initial``, Ceph will bootstrap the -initial monitor(s), retrieve a ``ceph.client.admin.keyring`` file containing the -key for the ``client.admin`` user. Additionally, it will also retrieve keyrings -that give ``ceph-deploy`` and ``ceph-disk`` utilities the ability to prepare and -activate OSDs and metadata servers. - -When you execute ``ceph-deploy admin {node-name}`` (**note:** Ceph must be -installed first), you are pushing a Ceph configuration file and the -``ceph.client.admin.keyring`` to the ``/etc/ceph`` directory of the node. You -will be able to execute Ceph administrative functions as ``root`` on the command -line of that node. - - -Manual Deployment ------------------ - -When you deploy a cluster manually, you have to bootstrap the monitor manually -and create the ``client.admin`` user and keyring. To bootstrap monitors, follow -the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are -the logical steps you must perform when using third party deployment tools like -Chef, Puppet, Juju, etc. - - -Enabling/Disabling Cephx -======================== - -Enabling Cephx requires that you have deployed keys for your monitors, -OSDs and metadata servers. If you are simply toggling Cephx on / off, -you do not have to repeat the bootstrapping procedures. 
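Before toggling authentication on or off, it can be useful to confirm which settings a running daemon is actually using. The following is a minimal sketch, assuming you run it on a monitor host and that the monitor is named ``mon.a``; adjust the daemon name for your cluster::

    ceph daemon mon.a config show | grep auth_

Because this queries the daemon's admin socket, it reports the values the daemon is currently running with rather than whatever happens to be written in ``ceph.conf``.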
- - -Enabling Cephx --------------- - -When ``cephx`` is enabled, Ceph will look for the keyring in the default search -path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override -this location by adding a ``keyring`` option in the ``[global]`` section of -your `Ceph configuration`_ file, but this is not recommended. - -Execute the following procedures to enable ``cephx`` on a cluster with -authentication disabled. If you (or your deployment utility) have already -generated the keys, you may skip the steps related to generating keys. - -#. Create a ``client.admin`` key, and save a copy of the key for your client - host:: - - ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring - - **Warning:** This will clobber any existing - ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a - deployment tool has already done it for you. Be careful! - -#. Create a keyring for your monitor cluster and generate a monitor - secret key. :: - - ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' - -#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's - ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, - use the following:: - - cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring - -#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:: - - ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring - -#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:: - - ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mds/ceph-{$id}/keyring - -#. Enable ``cephx`` authentication by setting the following options in the - ``[global]`` section of your `Ceph configuration`_ file:: - - auth cluster required = cephx - auth service required = cephx - auth client required = cephx - - -#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. - -For details on bootstrapping a monitor manually, see `Manual Deployment`_. - - - -Disabling Cephx ---------------- - -The following procedure describes how to disable Cephx. If your cluster -environment is relatively safe, you can offset the computation expense of -running authentication. **We do not recommend it.** However, it may be easier -during setup and/or troubleshooting to temporarily disable authentication. - -#. Disable ``cephx`` authentication by setting the following options in the - ``[global]`` section of your `Ceph configuration`_ file:: - - auth cluster required = none - auth service required = none - auth client required = none - - -#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. - - -Configuration Settings -====================== - -Enablement ----------- - - -``auth cluster required`` - -:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, - ``ceph-osd``, and ``ceph-mds``) must authenticate with - each other. Valid settings are ``cephx`` or ``none``. - -:Type: String -:Required: No -:Default: ``cephx``. - - -``auth service required`` - -:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients - to authenticate with the Ceph Storage Cluster in order to access - Ceph services. Valid settings are ``cephx`` or ``none``. - -:Type: String -:Required: No -:Default: ``cephx``. 
- - -``auth client required`` - -:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to - authenticate with the Ceph Client. Valid settings are ``cephx`` - or ``none``. - -:Type: String -:Required: No -:Default: ``cephx``. - - -.. index:: keys; keyring - -Keys ----- - -When you run Ceph with authentication enabled, ``ceph`` administrative commands -and Ceph Clients require authentication keys to access the Ceph Storage Cluster. - -The most common way to provide these keys to the ``ceph`` administrative -commands and clients is to include a Ceph keyring under the ``/etc/ceph`` -directory. For Cuttlefish and later releases using ``ceph-deploy``, the filename -is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). -If you include the keyring under the ``/etc/ceph`` directory, you don't need to -specify a ``keyring`` entry in your Ceph configuration file. - -We recommend copying the Ceph Storage Cluster's keyring file to nodes where you -will run administrative commands, because it contains the ``client.admin`` key. - -You may use ``ceph-deploy admin`` to perform this task. See `Create an Admin -Host`_ for details. To perform this step manually, execute the following:: - - sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring - -.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set - (e.g., ``chmod 644``) on your client machine. - -You may specify the key itself in the Ceph configuration file using the ``key`` -setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. - - -``keyring`` - -:Description: The path to the keyring file. -:Type: String -:Required: No -:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` - - -``keyfile`` - -:Description: The path to a key file (i.e,. a file containing only the key). -:Type: String -:Required: No -:Default: None - - -``key`` - -:Description: The key (i.e., the text string of the key itself). Not recommended. -:Type: String -:Required: No -:Default: None - - -Daemon Keyrings ---------------- - -Administrative users or deployment tools (e.g., ``ceph-deploy``) may generate -daemon keyrings in the same way as generating user keyrings. By default, Ceph -stores daemons keyrings inside their data directory. The default keyring -locations, and the capabilities necessary for the daemon to function, are shown -below. - -``ceph-mon`` - -:Location: ``$mon_data/keyring`` -:Capabilities: ``mon 'allow *'`` - -``ceph-osd`` - -:Location: ``$osd_data/keyring`` -:Capabilities: ``mon 'allow profile osd' osd 'allow *'`` - -``ceph-mds`` - -:Location: ``$mds_data/keyring`` -:Capabilities: ``mds 'allow' mon 'allow profile mds' osd 'allow rwx'`` - -``radosgw`` - -:Location: ``$rgw_data/keyring`` -:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` - - -.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no - capabilities, and is not part of the cluster ``auth`` database. - -The daemon data directory locations default to directories of the form:: - - /var/lib/ceph/$type/$cluster-$id - -For example, ``osd.12`` would be:: - - /var/lib/ceph/osd/ceph-12 - -You can override these locations, but it is not recommended. - - -.. index:: signatures - -Signatures ----------- - -In Ceph Bobtail and subsequent versions, we prefer that Ceph authenticate all -ongoing messages between the entities using the session key set up for that -initial authentication. 
However, Argonaut and earlier Ceph daemons do not know -how to perform ongoing message authentication. To maintain backward -compatibility (e.g., running both Botbail and Argonaut daemons in the same -cluster), message signing is **off** by default. If you are running Bobtail or -later daemons exclusively, configure Ceph to require signatures. - -Like other parts of Ceph authentication, Ceph provides fine-grained control so -you can enable/disable signatures for service messages between the client and -Ceph, and you can enable/disable signatures for messages between Ceph daemons. - - -``cephx require signatures`` - -:Description: If set to ``true``, Ceph requires signatures on all message - traffic between the Ceph Client and the Ceph Storage Cluster, and - between daemons comprising the Ceph Storage Cluster. - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``cephx cluster require signatures`` - -:Description: If set to ``true``, Ceph requires signatures on all message - traffic between Ceph daemons comprising the Ceph Storage Cluster. - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``cephx service require signatures`` - -:Description: If set to ``true``, Ceph requires signatures on all message - traffic between Ceph Clients and the Ceph Storage Cluster. - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``cephx sign messages`` - -:Description: If the Ceph version supports message signing, Ceph will sign - all messages so they cannot be spoofed. - -:Type: Boolean -:Default: ``true`` - - -Time to Live ------------- - -``auth service ticket ttl`` - -:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for - authentication, the Ceph Storage Cluster assigns the ticket a - time to live. - -:Type: Double -:Default: ``60*60`` - - -Backward Compatibility -====================== - -For Cuttlefish and earlier releases, see `Cephx`_. - -In Ceph Argonaut v0.48 and earlier versions, if you enable ``cephx`` -authentication, Ceph only authenticates the initial communication between the -client and daemon; Ceph does not authenticate the subsequent messages they send -to each other, which has security implications. In Ceph Bobtail and subsequent -versions, Ceph authenticates all ongoing messages between the entities using the -session key set up for that initial authentication. - -We identified a backward compatibility issue between Argonaut v0.48 (and prior -versions) and Bobtail (and subsequent versions). During testing, if you -attempted to use Argonaut (and earlier) daemons with Bobtail (and later) -daemons, the Argonaut daemons did not know how to perform ongoing message -authentication, while the Bobtail versions of the daemons insist on -authenticating message traffic subsequent to the initial -request/response--making it impossible for Argonaut (and prior) daemons to -interoperate with Bobtail (and subsequent) daemons. - -We have addressed this potential problem by providing a means for Argonaut (and -prior) systems to interact with Bobtail (and subsequent) systems. Here's how it -works: by default, the newer systems will not insist on seeing signatures from -older systems that do not know how to perform them, but will simply accept such -messages without authenticating them. This new default behavior provides the -advantage of allowing two different releases to interact. **We do not recommend -this as a long term solution**. 
Allowing newer daemons to forgo ongoing -authentication has the unfortunate security effect that an attacker with control -of some of your machines or some access to your network can disable session -security simply by claiming to be unable to sign messages. - -.. note:: Even if you don't actually run any old versions of Ceph, - the attacker may be able to force some messages to be accepted unsigned in the - default scenario. While running Cephx with the default scenario, Ceph still - authenticates the initial communication, but you lose desirable session security. - -If you know that you are not running older versions of Ceph, or you are willing -to accept that old servers and new servers will not be able to interoperate, you -can eliminate this security risk. If you do so, any Ceph system that is new -enough to support session authentication and that has Cephx enabled will reject -unsigned messages. To preclude new servers from interacting with old servers, -include the following in the ``[global]`` section of your `Ceph -configuration`_ file directly below the line that specifies the use of Cephx -for authentication:: - - cephx require signatures = true ; everywhere possible - -You can also selectively require signatures for cluster internal -communications only, separate from client-facing service:: - - cephx cluster require signatures = true ; for cluster-internal communication - cephx service require signatures = true ; for client-facing service - -An option to make a client require signatures from the cluster is not -yet implemented. - -**We recommend migrating all daemons to the newer versions and enabling the -foregoing flag** at the nearest practical time so that you may avail yourself -of the enhanced authentication. - -.. note:: Ceph kernel modules do not support signatures yet. - - -.. _Storage Cluster Quick Start: ../../../start/quick-ceph-deploy/ -.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping -.. _Operating a Cluster: ../../operations/operating -.. _Manual Deployment: ../../../install/manual-deployment -.. _Cephx: http://docs.ceph.com/docs/cuttlefish/rados/configuration/auth-config-ref/ -.. _Ceph configuration: ../ceph-conf -.. _Create an Admin Host: ../../deployment/ceph-deploy-admin -.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication -.. _User Management: ../../operations/user-management diff --git a/src/ceph/doc/rados/configuration/bluestore-config-ref.rst b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst deleted file mode 100644 index 8d8ace6..0000000 --- a/src/ceph/doc/rados/configuration/bluestore-config-ref.rst +++ /dev/null @@ -1,297 +0,0 @@ -========================== -BlueStore Config Reference -========================== - -Devices -======= - -BlueStore manages either one, two, or (in certain cases) three storage -devices. - -In the simplest case, BlueStore consumes a single (primary) storage -device. The storage device is normally partitioned into two parts: - -#. A small partition is formatted with XFS and contains basic metadata - for the OSD. This *data directory* includes information about the - OSD (its identifier, which cluster it belongs to, and its private - keyring. - -#. The rest of the device is normally a large partition occupying the - rest of the device that is managed directly by BlueStore contains - all of the actual data. This *primary device* is normally identifed - by a ``block`` symlink in data directory. 
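To see how an existing OSD was provisioned, you can inspect its data directory. This is a sketch that assumes the default data path and an OSD with id ``0``; the ``block.wal`` and ``block.db`` symlinks discussed below appear only if separate devices were specified::

    ls -l /var/lib/ceph/osd/ceph-0/
    # Expect the small metadata files (the OSD's identifier, cluster fsid,
    # keyring, ...) plus a 'block' symlink to the partition BlueStore manages.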
- -It is also possible to deploy BlueStore across two additional devices: - -* A *WAL device* can be used for BlueStore's internal journal or - write-ahead log. It is identified by the ``block.wal`` symlink in - the data directory. It is only useful to use a WAL device if the - device is faster than the primary device (e.g., when it is on an SSD - and the primary device is an HDD). -* A *DB device* can be used for storing BlueStore's internal metadata. - BlueStore (or rather, the embedded RocksDB) will put as much - metadata as it can on the DB device to improve performance. If the - DB device fills up, metadata will spill back onto the primary device - (where it would have been otherwise). Again, it is only helpful to - provision a DB device if it is faster than the primary device. - -If there is only a small amount of fast storage available (e.g., less -than a gigabyte), we recommend using it as a WAL device. If there is -more, provisioning a DB device makes more sense. The BlueStore -journal will always be placed on the fastest device available, so -using a DB device will provide the same benefit that the WAL device -would while *also* allowing additional metadata to be stored there (if -it will fix). - -A single-device BlueStore OSD can be provisioned with:: - - ceph-disk prepare --bluestore <device> - -To specify a WAL device and/or DB device, :: - - ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block-db <db-device> - -Cache size -========== - -The amount of memory consumed by each OSD for BlueStore's cache is -determined by the ``bluestore_cache_size`` configuration option. If -that config option is not set (i.e., remains at 0), there is a -different default value that is used depending on whether an HDD or -SSD is used for the primary device (set by the -``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config -options). - -BlueStore and the rest of the Ceph OSD does the best it can currently -to stick to the budgeted memory. Note that on top of the configured -cache size, there is also memory consumed by the OSD itself, and -generally some overhead due to memory fragmentation and other -allocator overhead. - -The configured cache memory budget can be used in a few different ways: - -* Key/Value metadata (i.e., RocksDB's internal cache) -* BlueStore metadata -* BlueStore data (i.e., recently read or written object data) - -Cache memory usage is governed by the following options: -``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and -``bluestore_cache_kv_max``. The fraction of the cache devoted to data -is 1.0 minus the meta and kv ratios. The memory devoted to kv -metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max`` -since our testing indicates there are diminishing returns beyond a -certain point. - -``bluestore_cache_size`` - -:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead. -:Type: Integer -:Required: Yes -:Default: ``0`` - -``bluestore_cache_size_hdd`` - -:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD. -:Type: Integer -:Required: Yes -:Default: ``1 * 1024 * 1024 * 1024`` (1 GB) - -``bluestore_cache_size_ssd`` - -:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD. 
-:Type: Integer -:Required: Yes -:Default: ``3 * 1024 * 1024 * 1024`` (3 GB) - -``bluestore_cache_meta_ratio`` - -:Description: The ratio of cache devoted to metadata. -:Type: Floating point -:Required: Yes -:Default: ``.01`` - -``bluestore_cache_kv_ratio`` - -:Description: The ratio of cache devoted to key/value data (rocksdb). -:Type: Floating point -:Required: Yes -:Default: ``.99`` - -``bluestore_cache_kv_max`` - -:Description: The maximum amount of cache devoted to key/value data (rocksdb). -:Type: Floating point -:Required: Yes -:Default: ``512 * 1024*1024`` (512 MB) - - -Checksums -========= - -BlueStore checksums all metadata and data written to disk. Metadata -checksumming is handled by RocksDB and uses `crc32c`. Data -checksumming is done by BlueStore and can make use of `crc32c`, -`xxhash32`, or `xxhash64`. The default is `crc32c` and should be -suitable for most purposes. - -Full data checksumming does increase the amount of metadata that -BlueStore must store and manage. When possible, e.g., when clients -hint that data is written and read sequentially, BlueStore will -checksum larger blocks, but in many cases it must store a checksum -value (usually 4 bytes) for every 4 kilobyte block of data. - -It is possible to use a smaller checksum value by truncating the -checksum to two or one byte, reducing the metadata overhead. The -trade-off is that the probability that a random error will not be -detected is higher with a smaller checksum, going from about one if -four billion with a 32-bit (4 byte) checksum to one is 65,536 for a -16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. -The smaller checksum values can be used by selecting `crc32c_16` or -`crc32c_8` as the checksum algorithm. - -The *checksum algorithm* can be set either via a per-pool -``csum_type`` property or the global config option. For example, :: - - ceph osd pool set <pool-name> csum_type <algorithm> - -``bluestore_csum_type`` - -:Description: The default checksum algorithm to use. -:Type: String -:Required: Yes -:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64`` -:Default: ``crc32c`` - - -Inline Compression -================== - -BlueStore supports inline compression using `snappy`, `zlib`, or -`lz4`. Please note that the `lz4` compression plugin is not -distributed in the official release. - -Whether data in BlueStore is compressed is determined by a combination -of the *compression mode* and any hints associated with a write -operation. The modes are: - -* **none**: Never compress data. -* **passive**: Do not compress data unless the write operation as a - *compressible* hint set. -* **aggressive**: Compress data unless the write operation as an - *incompressible* hint set. -* **force**: Try to compress data no matter what. - -For more information about the *compressible* and *incompressible* IO -hints, see :doc:`/api/librados/#rados_set_alloc_hint`. - -Note that regardless of the mode, if the size of the data chunk is not -reduced sufficiently it will not be used and the original -(uncompressed) data will be stored. For example, if the ``bluestore -compression required ratio`` is set to ``.7`` then the compressed data -must be 70% of the size of the original (or smaller). - -The *compression mode*, *compression algorithm*, *compression required -ratio*, *min blob size*, and *max blob size* can be set either via a -per-pool property or a global config option. 
Pool properties can be -set with:: - - ceph osd pool set <pool-name> compression_algorithm <algorithm> - ceph osd pool set <pool-name> compression_mode <mode> - ceph osd pool set <pool-name> compression_required_ratio <ratio> - ceph osd pool set <pool-name> compression_min_blob_size <size> - ceph osd pool set <pool-name> compression_max_blob_size <size> - -``bluestore compression algorithm`` - -:Description: The default compressor to use (if any) if the per-pool property - ``compression_algorithm`` is not set. Note that zstd is *not* - recommended for bluestore due to high CPU overhead when - compressing small amounts of data. -:Type: String -:Required: No -:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` -:Default: ``snappy`` - -``bluestore compression mode`` - -:Description: The default policy for using compression if the per-pool property - ``compression_mode`` is not set. ``none`` means never use - compression. ``passive`` means use compression when - `clients hint`_ that data is compressible. ``aggressive`` means - use compression unless clients hint that data is not compressible. - ``force`` means use compression under all circumstances even if - the clients hint that the data is not compressible. -:Type: String -:Required: No -:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` -:Default: ``none`` - -``bluestore compression required ratio`` - -:Description: The ratio of the size of the data chunk after - compression relative to the original size must be at - least this small in order to store the compressed - version. - -:Type: Floating point -:Required: No -:Default: .875 - -``bluestore compression min blob size`` - -:Description: Chunks smaller than this are never compressed. - The per-pool property ``compression_min_blob_size`` overrides - this setting. - -:Type: Unsigned Integer -:Required: No -:Default: 0 - -``bluestore compression min blob size hdd`` - -:Description: Default value of ``bluestore compression min blob size`` - for rotational media. - -:Type: Unsigned Integer -:Required: No -:Default: 128K - -``bluestore compression min blob size ssd`` - -:Description: Default value of ``bluestore compression min blob size`` - for non-rotational (solid state) media. - -:Type: Unsigned Integer -:Required: No -:Default: 8K - -``bluestore compression max blob size`` - -:Description: Chunks larger than this are broken into smaller blobs sizing - ``bluestore compression max blob size`` before being compressed. - The per-pool property ``compression_max_blob_size`` overrides - this setting. - -:Type: Unsigned Integer -:Required: No -:Default: 0 - -``bluestore compression max blob size hdd`` - -:Description: Default value of ``bluestore compression max blob size`` - for rotational media. - -:Type: Unsigned Integer -:Required: No -:Default: 512K - -``bluestore compression max blob size ssd`` - -:Description: Default value of ``bluestore compression max blob size`` - for non-rotational (solid state) media. - -:Type: Unsigned Integer -:Required: No -:Default: 64K - -.. _clients hint: ../../api/librados/#rados_set_alloc_hint diff --git a/src/ceph/doc/rados/configuration/ceph-conf.rst b/src/ceph/doc/rados/configuration/ceph-conf.rst deleted file mode 100644 index df88452..0000000 --- a/src/ceph/doc/rados/configuration/ceph-conf.rst +++ /dev/null @@ -1,629 +0,0 @@ -================== - Configuring Ceph -================== - -When you start the Ceph service, the initialization process activates a series -of daemons that run in the background. 
A :term:`Ceph Storage Cluster` runs -two types of daemons: - -- :term:`Ceph Monitor` (``ceph-mon``) -- :term:`Ceph OSD Daemon` (``ceph-osd``) - -Ceph Storage Clusters that support the :term:`Ceph Filesystem` run at least one -:term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support :term:`Ceph -Object Storage` run Ceph Gateway daemons (``radosgw``). For your convenience, -each daemon has a series of default values (*i.e.*, many are set by -``ceph/src/common/config_opts.h``). You may override these settings with a Ceph -configuration file. - - -.. _ceph-conf-file: - -The Configuration File -====================== - -When you start a Ceph Storage Cluster, each daemon looks for a Ceph -configuration file (i.e., ``ceph.conf`` by default) that provides the cluster's -configuration settings. For manual deployments, you need to create a Ceph -configuration file. For tools that create configuration files for you (*e.g.*, -``ceph-deploy``, Chef, etc.), you may use the information contained herein as a -reference. The Ceph configuration file defines: - -- Cluster Identity -- Authentication settings -- Cluster membership -- Host names -- Host addresses -- Paths to keyrings -- Paths to journals -- Paths to data -- Other runtime options - -The default Ceph configuration file locations in sequential order include: - -#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` - environment variable) -#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) -#. ``/etc/ceph/ceph.conf`` -#. ``~/.ceph/config`` -#. ``./ceph.conf`` (*i.e.,* in the current working directory) - - -The Ceph configuration file uses an *ini* style syntax. You can add comments -by preceding comments with a pound sign (#) or a semi-colon (;). For example: - -.. code-block:: ini - - # <--A number (#) sign precedes a comment. - ; A comment may be anything. - # Comments always follow a semi-colon (;) or a pound (#) on each line. - # The end of the line terminates a comment. - # We recommend that you provide comments in your configuration file(s). - - -.. _ceph-conf-settings: - -Config Sections -=============== - -The configuration file can configure all Ceph daemons in a Ceph Storage Cluster, -or all Ceph daemons of a particular type. To configure a series of daemons, the -settings must be included under the processes that will receive the -configuration as follows: - -``[global]`` - -:Description: Settings under ``[global]`` affect all daemons in a Ceph Storage - Cluster. - -:Example: ``auth supported = cephx`` - -``[osd]`` - -:Description: Settings under ``[osd]`` affect all ``ceph-osd`` daemons in - the Ceph Storage Cluster, and override the same setting in - ``[global]``. - -:Example: ``osd journal size = 1000`` - -``[mon]`` - -:Description: Settings under ``[mon]`` affect all ``ceph-mon`` daemons in - the Ceph Storage Cluster, and override the same setting in - ``[global]``. - -:Example: ``mon addr = 10.0.0.101:6789`` - - -``[mds]`` - -:Description: Settings under ``[mds]`` affect all ``ceph-mds`` daemons in - the Ceph Storage Cluster, and override the same setting in - ``[global]``. - -:Example: ``host = myserver01`` - -``[client]`` - -:Description: Settings under ``[client]`` affect all Ceph Clients - (e.g., mounted Ceph Filesystems, mounted Ceph Block Devices, - etc.). - -:Example: ``log file = /var/log/ceph/radosgw.log`` - - -Global settings affect all instances of all daemon in the Ceph Storage Cluster. -Use the ``[global]`` setting for values that are common for all daemons in the -Ceph Storage Cluster. 
You can override each ``[global]`` setting by: - -#. Changing the setting in a particular process type - (*e.g.,* ``[osd]``, ``[mon]``, ``[mds]`` ). - -#. Changing the setting in a particular process (*e.g.,* ``[osd.1]`` ). - -Overriding a global setting affects all child processes, except those that -you specifically override in a particular daemon. - -A typical global setting involves activating authentication. For example: - -.. code-block:: ini - - [global] - #Enable authentication between hosts within the cluster. - #v 0.54 and earlier - auth supported = cephx - - #v 0.55 and after - auth cluster required = cephx - auth service required = cephx - auth client required = cephx - - -You can specify settings that apply to a particular type of daemon. When you -specify settings under ``[osd]``, ``[mon]`` or ``[mds]`` without specifying a -particular instance, the setting will apply to all OSDs, monitors or metadata -daemons respectively. - -A typical daemon-wide setting involves setting journal sizes, filestore -settings, etc. For example: - -.. code-block:: ini - - [osd] - osd journal size = 1000 - - -You may specify settings for particular instances of a daemon. You may specify -an instance by entering its type, delimited by a period (.) and by the instance -ID. The instance ID for a Ceph OSD Daemon is always numeric, but it may be -alphanumeric for Ceph Monitors and Ceph Metadata Servers. - -.. code-block:: ini - - [osd.1] - # settings affect osd.1 only. - - [mon.a] - # settings affect mon.a only. - - [mds.b] - # settings affect mds.b only. - - -If the daemon you specify is a Ceph Gateway client, specify the daemon and the -instance, delimited by a period (.). For example:: - - [client.radosgw.instance-name] - # settings affect client.radosgw.instance-name only. - - - -.. _ceph-metavariables: - -Metavariables -============= - -Metavariables simplify Ceph Storage Cluster configuration dramatically. When a -metavariable is set in a configuration value, Ceph expands the metavariable into -a concrete value. Metavariables are very powerful when used within the -``[global]``, ``[osd]``, ``[mon]``, ``[mds]`` or ``[client]`` sections of your -configuration file. Ceph metavariables are similar to Bash shell expansion. - -Ceph supports the following metavariables: - - -``$cluster`` - -:Description: Expands to the Ceph Storage Cluster name. Useful when running - multiple Ceph Storage Clusters on the same hardware. - -:Example: ``/etc/ceph/$cluster.keyring`` -:Default: ``ceph`` - - -``$type`` - -:Description: Expands to one of ``mds``, ``osd``, or ``mon``, depending on the - type of the instant daemon. - -:Example: ``/var/lib/ceph/$type`` - - -``$id`` - -:Description: Expands to the daemon identifier. For ``osd.0``, this would be - ``0``; for ``mds.a``, it would be ``a``. - -:Example: ``/var/lib/ceph/$type/$cluster-$id`` - - -``$host`` - -:Description: Expands to the host name of the instant daemon. - - -``$name`` - -:Description: Expands to ``$type.$id``. -:Example: ``/var/run/ceph/$cluster-$name.asok`` - -``$pid`` - -:Description: Expands to daemon pid. -:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` - - -.. _ceph-conf-common-settings: - -Common Settings -=============== - -The `Hardware Recommendations`_ section provides some hardware guidelines for -configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph -Node` to run multiple daemons. For example, a single node with multiple drives -may run one ``ceph-osd`` for each drive. 
Ideally, you will have a node for a -particular type of process. For example, some nodes may run ``ceph-osd`` -daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may -run ``ceph-mon`` daemons. - -Each node has a name identified by the ``host`` setting. Monitors also specify -a network address and port (i.e., domain name or IP address) identified by the -``addr`` setting. A basic configuration file will typically specify only -minimal settings for each instance of monitor daemons. For example: - -.. code-block:: ini - - [global] - mon_initial_members = ceph1 - mon_host = 10.0.0.1 - - -.. important:: The ``host`` setting is the short name of the node (i.e., not - an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on - the command line to retrieve the name of the node. Do not use ``host`` - settings for anything other than initial monitors unless you are deploying - Ceph manually. You **MUST NOT** specify ``host`` under individual daemons - when using deployment tools like ``chef`` or ``ceph-deploy``, as those tools - will enter the appropriate values for you in the cluster map. - - -.. _ceph-network-config: - -Networks -======== - -See the `Network Configuration Reference`_ for a detailed discussion about -configuring a network for use with Ceph. - - -Monitors -======== - -Ceph production clusters typically deploy with a minimum 3 :term:`Ceph Monitor` -daemons to ensure high availability should a monitor instance crash. At least -three (3) monitors ensures that the Paxos algorithm can determine which version -of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph -Monitors in the quorum. - -.. note:: You may deploy Ceph with a single monitor, but if the instance fails, - the lack of other monitors may interrupt data service availability. - -Ceph Monitors typically listen on port ``6789``. For example: - -.. code-block:: ini - - [mon.a] - host = hostName - mon addr = 150.140.130.120:6789 - -By default, Ceph expects that you will store a monitor's data under the -following path:: - - /var/lib/ceph/mon/$cluster-$id - -You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding -directory. With metavariables fully expressed and a cluster named "ceph", the -foregoing directory would evaluate to:: - - /var/lib/ceph/mon/ceph-a - -For additional details, see the `Monitor Config Reference`_. - -.. _Monitor Config Reference: ../mon-config-ref - - -.. _ceph-osd-config: - - -Authentication -============== - -.. versionadded:: Bobtail 0.56 - -For Bobtail (v 0.56) and beyond, you should expressly enable or disable -authentication in the ``[global]`` section of your Ceph configuration file. :: - - auth cluster required = cephx - auth service required = cephx - auth client required = cephx - -Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. - -.. important:: When upgrading, we recommend expressly disabling authentication - first, then perform the upgrade. Once the upgrade is complete, re-enable - authentication. - -.. _Cephx Config Reference: ../auth-config-ref - - -.. _ceph-monitor-config: - - -OSDs -==== - -Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node -has one OSD daemon running a filestore on one storage drive. A typical -deployment specifies a journal size. For example: - -.. code-block:: ini - - [osd] - osd journal size = 10000 - - [osd.0] - host = {hostname} #manual deployments only. 
- - -By default, Ceph expects that you will store a Ceph OSD Daemon's data with the -following path:: - - /var/lib/ceph/osd/$cluster-$id - -You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding -directory. With metavariables fully expressed and a cluster named "ceph", the -foregoing directory would evaluate to:: - - /var/lib/ceph/osd/ceph-0 - -You may override this path using the ``osd data`` setting. We don't recommend -changing the default location. Create the default directory on your OSD host. - -:: - - ssh {osd-host} - sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} - -The ``osd data`` path ideally leads to a mount point with a hard disk that is -separate from the hard disk storing and running the operating system and -daemons. If the OSD is for a disk other than the OS disk, prepare it for -use with Ceph, and mount it to the directory you just created:: - - ssh {new-osd-host} - sudo mkfs -t {fstype} /dev/{disk} - sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} - -We recommend using the ``xfs`` file system when running -:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and no -longer tested.) - -See the `OSD Config Reference`_ for additional configuration details. - - -Heartbeats -========== - -During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons -and report their findings to the Ceph Monitor. You do not have to provide any -settings. However, if you have network latency issues, you may wish to modify -the settings. - -See `Configuring Monitor/OSD Interaction`_ for additional details. - - -.. _ceph-logging-and-debugging: - -Logs / Debugging -================ - -Sometimes you may encounter issues with Ceph that require -modifying logging output and using Ceph's debugging. See `Debugging and -Logging`_ for details on log rotation. - -.. _Debugging and Logging: ../../troubleshooting/log-and-debug - - -Example ceph.conf -================= - -.. literalinclude:: demo-ceph.conf - :language: ini - -.. _ceph-runtime-config: - -Runtime Changes -=============== - -Ceph allows you to make changes to the configuration of a ``ceph-osd``, -``ceph-mon``, or ``ceph-mds`` daemon at runtime. This capability is quite -useful for increasing/decreasing logging output, enabling/disabling debug -settings, and even for runtime optimization. The following reflects runtime -configuration usage:: - - ceph tell {daemon-type}.{id or *} injectargs --{name} {value} [--{name} {value}] - -Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply -the runtime setting to all daemons of a particular type with ``*``, or specify -a specific daemon's ID (i.e., its number or letter). For example, to increase -debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: - - ceph tell osd.0 injectargs --debug-osd 20 --debug-ms 1 - -In your ``ceph.conf`` file, you may use spaces when specifying a -setting name. When specifying a setting name on the command line, -ensure that you use an underscore or hyphen (``_`` or ``-``) between -terms (e.g., ``debug osd`` becomes ``--debug-osd``). 
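To turn the extra logging back down on every OSD at once, use the ``*`` wildcard described above. The values here are illustrative; check your cluster's normal levels with ``config show`` before relying on them::

    ceph tell osd.* injectargs --debug-osd 0 --debug-ms 0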
- - -Viewing a Configuration at Runtime -================================== - -If your Ceph Storage Cluster is running, and you would like to see the -configuration settings from a running daemon, execute the following:: - - ceph daemon {daemon-type}.{id} config show | less - -If you are on a machine where osd.0 is running, the command would be:: - - ceph daemon osd.0 config show | less - -Reading Configuration Metadata at Runtime -========================================= - -Information about the available configuration options is available via -the ``config help`` command: - -:: - - ceph daemon {daemon-type}.{id} config help | less - - -This metadata is primarily intended to be used when integrating other -software with Ceph, such as graphical user interfaces. The output is -a list of JSON objects, for example: - -:: - - { - "name": "mon_host", - "type": "std::string", - "level": "basic", - "desc": "list of hosts or addresses to search for a monitor", - "long_desc": "This is a comma, whitespace, or semicolon separated list of IP addresses or hostnames. Hostnames are resolved via DNS and all A or AAAA records are included in the search list.", - "default": "", - "daemon_default": "", - "tags": [], - "services": [ - "common" - ], - "see_also": [], - "enum_values": [], - "min": "", - "max": "" - } - -type -____ - -The type of the setting, given as a C++ type name. - -level -_____ - -One of `basic`, `advanced`, `dev`. The `dev` options are not intended -for use outside of development and testing. - -desc -____ - -A short description -- this is a sentence fragment suitable for display -in small spaces like a single line in a list. - -long_desc -_________ - -A full description of what the setting does, this may be as long as needed. - -default -_______ - -The default value, if any. - -daemon_default -______________ - -An alternative default used for daemons (services) as opposed to clients. - -tags -____ - -A list of strings indicating topics to which this setting relates. Examples -of tags are `performance` and `networking`. - -services -________ - -A list of strings indicating which Ceph services the setting relates to, such -as `osd`, `mds`, `mon`. For settings that are relevant to any Ceph client -or server, `common` is used. - -see_also -________ - -A list of strings indicating other configuration options that may also -be of interest to a user setting this option. - -enum_values -___________ - -Optional: a list of strings indicating the valid settings. - -min, max -________ - -Optional: upper and lower (inclusive) bounds on valid settings. - - - - -Running Multiple Clusters -========================= - -With Ceph, you can run multiple Ceph Storage Clusters on the same hardware. -Running multiple clusters provides a higher level of isolation compared to -using different pools on the same cluster with different CRUSH rulesets. A -separate cluster will have separate monitor, OSD and metadata server processes. -When running Ceph with default settings, the default cluster name is ``ceph``, -which means you would save your Ceph configuration file with the file name -``ceph.conf`` in the ``/etc/ceph`` default directory. - -See `ceph-deploy new`_ for details. -.. _ceph-deploy new:../ceph-deploy-new - -When you run multiple clusters, you must name your cluster and save the Ceph -configuration file with the name of the cluster. For example, a cluster named -``openstack`` will have a Ceph configuration file with the file name -``openstack.conf`` in the ``/etc/ceph`` default directory. - -.. 
important:: Cluster names must consist of letters a-z and digits 0-9 only. - -Separate clusters imply separate data disks and journals, which are not shared -between clusters. Referring to `Metavariables`_, the ``$cluster`` metavariable -evaluates to the cluster name (i.e., ``openstack`` in the foregoing example). -Various settings use the ``$cluster`` metavariable, including: - -- ``keyring`` -- ``admin socket`` -- ``log file`` -- ``pid file`` -- ``mon data`` -- ``mon cluster log file`` -- ``osd data`` -- ``osd journal`` -- ``mds data`` -- ``rgw data`` - -See `General Settings`_, `OSD Settings`_, `Monitor Settings`_, `MDS Settings`_, -`RGW Settings`_ and `Log Settings`_ for relevant path defaults that use the -``$cluster`` metavariable. - -.. _General Settings: ../general-config-ref -.. _OSD Settings: ../osd-config-ref -.. _Monitor Settings: ../mon-config-ref -.. _MDS Settings: ../../../cephfs/mds-config-ref -.. _RGW Settings: ../../../radosgw/config-ref/ -.. _Log Settings: ../../troubleshooting/log-and-debug - - -When creating default directories or files, you should use the cluster -name at the appropriate places in the path. For example:: - - sudo mkdir /var/lib/ceph/osd/openstack-0 - sudo mkdir /var/lib/ceph/mon/openstack-a - -.. important:: When running monitors on the same host, you should use - different ports. By default, monitors use port 6789. If you already - have monitors using port 6789, use a different port for your other cluster(s). - -To invoke a cluster other than the default ``ceph`` cluster, use the -``-c {filename}.conf`` option with the ``ceph`` command. For example:: - - ceph -c {cluster-name}.conf health - ceph -c openstack.conf health - - -.. _Hardware Recommendations: ../../../start/hardware-recommendations -.. _Network Configuration Reference: ../network-config-ref -.. _OSD Config Reference: ../osd-config-ref -.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction -.. _ceph-deploy new: ../../deployment/ceph-deploy-new#naming-a-cluster diff --git a/src/ceph/doc/rados/configuration/demo-ceph.conf b/src/ceph/doc/rados/configuration/demo-ceph.conf deleted file mode 100644 index ba86d53..0000000 --- a/src/ceph/doc/rados/configuration/demo-ceph.conf +++ /dev/null @@ -1,31 +0,0 @@ -[global] -fsid = {cluster-id} -mon initial members = {hostname}[, {hostname}] -mon host = {ip-address}[, {ip-address}] - -#All clusters have a front-side public network. -#If you have two NICs, you can configure a back side cluster -#network for OSD object replication, heart beats, backfilling, -#recovery, etc. -public network = {network}[, {network}] -#cluster network = {network}[, {network}] - -#Clusters require authentication by default. -auth cluster required = cephx -auth service required = cephx -auth client required = cephx - -#Choose reasonable numbers for your journals, number of replicas -#and placement groups. -osd journal size = {n} -osd pool default size = {n} # Write an object n times. -osd pool default min size = {n} # Allow writing n copy in a degraded state. -osd pool default pg num = {n} -osd pool default pgp num = {n} - -#Choose a reasonable crush leaf type. -#0 for a 1-node cluster. -#1 for a multi node cluster in a single rack -#2 for a multi node, multi chassis cluster with multiple hosts in a chassis -#3 for a multi node cluster with hosts across racks, etc. -osd crush chooseleaf type = {n}
\ No newline at end of file diff --git a/src/ceph/doc/rados/configuration/filestore-config-ref.rst b/src/ceph/doc/rados/configuration/filestore-config-ref.rst deleted file mode 100644 index 4dff60c..0000000 --- a/src/ceph/doc/rados/configuration/filestore-config-ref.rst +++ /dev/null @@ -1,365 +0,0 @@ -============================ - Filestore Config Reference -============================ - - -``filestore debug omap check`` - -:Description: Debugging check on synchronization. Expensive. For debugging only. -:Type: Boolean -:Required: No -:Default: ``0`` - - -.. index:: filestore; extended attributes - -Extended Attributes -=================== - -Extended Attributes (XATTRs) are an important aspect in your configuration. -Some file systems have limits on the number of bytes stored in XATTRS. -Additionally, in some cases, the filesystem may not be as fast as an alternative -method of storing XATTRs. The following settings may help improve performance -by using a method of storing XATTRs that is extrinsic to the underlying filesystem. - -Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided -by the underlying file system, if it does not impose a size limit. If -there is a size limit (4KB total on ext4, for instance), some Ceph -XATTRs will be stored in an key/value database when either the -``filestore max inline xattr size`` or ``filestore max inline -xattrs`` threshold is reached. - - -``filestore max inline xattr size`` - -:Description: The maximimum size of an XATTR stored in the filesystem (i.e., XFS, - btrfs, ext4, etc.) per object. Should not be larger than the - filesytem can handle. Default value of 0 means to use the value - specific to the underlying filesystem. -:Type: Unsigned 32-bit Integer -:Required: No -:Default: ``0`` - - -``filestore max inline xattr size xfs`` - -:Description: The maximimum size of an XATTR stored in the XFS filesystem. - Only used if ``filestore max inline xattr size`` == 0. -:Type: Unsigned 32-bit Integer -:Required: No -:Default: ``65536`` - - -``filestore max inline xattr size btrfs`` - -:Description: The maximimum size of an XATTR stored in the btrfs filesystem. - Only used if ``filestore max inline xattr size`` == 0. -:Type: Unsigned 32-bit Integer -:Required: No -:Default: ``2048`` - - -``filestore max inline xattr size other`` - -:Description: The maximimum size of an XATTR stored in other filesystems. - Only used if ``filestore max inline xattr size`` == 0. -:Type: Unsigned 32-bit Integer -:Required: No -:Default: ``512`` - - -``filestore max inline xattrs`` - -:Description: The maximum number of XATTRs stored in the filesystem per object. - Default value of 0 means to use the value specific to the - underlying filesystem. -:Type: 32-bit Integer -:Required: No -:Default: ``0`` - - -``filestore max inline xattrs xfs`` - -:Description: The maximum number of XATTRs stored in the XFS filesystem per object. - Only used if ``filestore max inline xattrs`` == 0. -:Type: 32-bit Integer -:Required: No -:Default: ``10`` - - -``filestore max inline xattrs btrfs`` - -:Description: The maximum number of XATTRs stored in the btrfs filesystem per object. - Only used if ``filestore max inline xattrs`` == 0. -:Type: 32-bit Integer -:Required: No -:Default: ``10`` - - -``filestore max inline xattrs other`` - -:Description: The maximum number of XATTRs stored in other filesystems per object. - Only used if ``filestore max inline xattrs`` == 0. -:Type: 32-bit Integer -:Required: No -:Default: ``2`` - -.. 
index:: filestore; synchronization - -Synchronization Intervals -========================= - -Periodically, the filestore needs to quiesce writes and synchronize the -filesystem, which creates a consistent commit point. It can then free journal -entries up to the commit point. Synchronizing more frequently tends to reduce -the time required to perform synchronization, and reduces the amount of data -that needs to remain in the journal. Less frequent synchronization allows the -backing filesystem to coalesce small writes and metadata updates more -optimally--potentially resulting in more efficient synchronization. - - -``filestore max sync interval`` - -:Description: The maximum interval in seconds for synchronizing the filestore. -:Type: Double -:Required: No -:Default: ``5`` - - -``filestore min sync interval`` - -:Description: The minimum interval in seconds for synchronizing the filestore. -:Type: Double -:Required: No -:Default: ``.01`` - - -.. index:: filestore; flusher - -Flusher -======= - -The filestore flusher forces data from large writes to be written out using -``sync file range`` before the sync in order to (hopefully) reduce the cost of -the eventual sync. In practice, disabling 'filestore flusher' seems to improve -performance in some cases. - - -``filestore flusher`` - -:Description: Enables the filestore flusher. -:Type: Boolean -:Required: No -:Default: ``false`` - -.. deprecated:: v.65 - -``filestore flusher max fds`` - -:Description: Sets the maximum number of file descriptors for the flusher. -:Type: Integer -:Required: No -:Default: ``512`` - -.. deprecated:: v.65 - -``filestore sync flush`` - -:Description: Enables the synchronization flusher. -:Type: Boolean -:Required: No -:Default: ``false`` - -.. deprecated:: v.65 - -``filestore fsync flushes journal data`` - -:Description: Flush journal data during filesystem synchronization. -:Type: Boolean -:Required: No -:Default: ``false`` - - -.. index:: filestore; queue - -Queue -===== - -The following settings provide limits on the size of filestore queue. - -``filestore queue max ops`` - -:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. -:Type: Integer -:Required: No. Minimal impact on performance. -:Default: ``50`` - - -``filestore queue max bytes`` - -:Description: The maximum number of bytes for an operation. -:Type: Integer -:Required: No -:Default: ``100 << 20`` - - - - -.. index:: filestore; timeouts - -Timeouts -======== - - -``filestore op threads`` - -:Description: The number of filesystem operation threads that execute in parallel. -:Type: Integer -:Required: No -:Default: ``2`` - - -``filestore op thread timeout`` - -:Description: The timeout for a filesystem operation thread (in seconds). -:Type: Integer -:Required: No -:Default: ``60`` - - -``filestore op thread suicide timeout`` - -:Description: The timeout for a commit operation before cancelling the commit (in seconds). -:Type: Integer -:Required: No -:Default: ``180`` - - -.. index:: filestore; btrfs - -B-Tree Filesystem -================= - - -``filestore btrfs snap`` - -:Description: Enable snapshots for a ``btrfs`` filestore. -:Type: Boolean -:Required: No. Only used for ``btrfs``. -:Default: ``true`` - - -``filestore btrfs clone range`` - -:Description: Enable cloning ranges for a ``btrfs`` filestore. -:Type: Boolean -:Required: No. Only used for ``btrfs``. -:Default: ``true`` - - -.. 
index:: filestore; journal - -Journal -======= - - -``filestore journal parallel`` - -:Description: Enables parallel journaling, default for btrfs. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore journal writeahead`` - -:Description: Enables writeahead journaling, default for xfs. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore journal trailing`` - -:Description: Deprecated, never use. -:Type: Boolean -:Required: No -:Default: ``false`` - - -Misc -==== - - -``filestore merge threshold`` - -:Description: Min number of files in a subdir before merging into parent - NOTE: A negative value means to disable subdir merging -:Type: Integer -:Required: No -:Default: ``10`` - - -``filestore split multiple`` - -:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` - is the maximum number of files in a subdirectory before - splitting into child directories. - -:Type: Integer -:Required: No -:Default: ``2`` - - -``filestore split rand factor`` - -:Description: A random factor added to the split threshold to avoid - too many filestore splits occurring at once. See - ``filestore split multiple`` for details. - This can only be changed for an existing osd offline, - via ceph-objectstore-tool's apply-layout-settings command. - -:Type: Unsigned 32-bit Integer -:Required: No -:Default: ``20`` - - -``filestore update to`` - -:Description: Limits filestore auto upgrade to specified version. -:Type: Integer -:Required: No -:Default: ``1000`` - - -``filestore blackhole`` - -:Description: Drop any new transactions on the floor. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore dump file`` - -:Description: File onto which store transaction dumps. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore kill at`` - -:Description: inject a failure at the n'th opportunity -:Type: String -:Required: No -:Default: ``false`` - - -``filestore fail eio`` - -:Description: Fail/Crash on eio. -:Type: Boolean -:Required: No -:Default: ``true`` - diff --git a/src/ceph/doc/rados/configuration/general-config-ref.rst b/src/ceph/doc/rados/configuration/general-config-ref.rst deleted file mode 100644 index ca09ee5..0000000 --- a/src/ceph/doc/rados/configuration/general-config-ref.rst +++ /dev/null @@ -1,66 +0,0 @@ -========================== - General Config Reference -========================== - - -``fsid`` - -:Description: The filesystem ID. One per cluster. -:Type: UUID -:Required: No. -:Default: N/A. Usually generated by deployment tools. - - -``admin socket`` - -:Description: The socket for executing administrative commands on a daemon, - irrespective of whether Ceph Monitors have established a quorum. - -:Type: String -:Required: No -:Default: ``/var/run/ceph/$cluster-$name.asok`` - - -``pid file`` - -:Description: The file in which the mon, osd or mds will write its - PID. For instance, ``/var/run/$cluster/$type.$id.pid`` - will create /var/run/ceph/mon.a.pid for the ``mon`` with - id ``a`` running in the ``ceph`` cluster. The ``pid - file`` is removed when the daemon stops gracefully. If - the process is not daemonized (i.e. runs with the ``-f`` - or ``-d`` option), the ``pid file`` is not created. -:Type: String -:Required: No -:Default: No - - -``chdir`` - -:Description: The directory Ceph daemons change to once they are - up and running. Default ``/`` directory recommended. 
- -:Type: String -:Required: No -:Default: ``/`` - - -``max open files`` - -:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets - the ``max open fds`` at the OS level (i.e., the max # of file - descriptors). It helps prevents Ceph OSD Daemons from running out - of file descriptors. - -:Type: 64-bit Integer -:Required: No -:Default: ``0`` - - -``fatal signal handlers`` - -:Description: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, - FPE, XCPU, XFSZ, SYS signals to generate a useful log message - -:Type: Boolean -:Default: ``true`` diff --git a/src/ceph/doc/rados/configuration/index.rst b/src/ceph/doc/rados/configuration/index.rst deleted file mode 100644 index 48b58ef..0000000 --- a/src/ceph/doc/rados/configuration/index.rst +++ /dev/null @@ -1,64 +0,0 @@ -=============== - Configuration -=============== - -Ceph can run with a cluster containing thousands of Object Storage Devices -(OSDs). A minimal system will have at least two OSDs for data replication. To -configure OSD clusters, you must provide settings in the configuration file. -Ceph provides default values for many settings, which you can override in the -configuration file. Additionally, you can make runtime modification to the -configuration using command-line utilities. - -When Ceph starts, it activates three daemons: - -- ``ceph-mon`` (mandatory) -- ``ceph-osd`` (mandatory) -- ``ceph-mds`` (mandatory for cephfs only) - -Each process, daemon or utility loads the host's configuration file. A process -may have information about more than one daemon instance (*i.e.,* multiple -contexts). A daemon or utility only has information about a single daemon -instance (a single context). - -.. note:: Ceph can run on a single host for evaluation purposes. - - -.. raw:: html - - <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> - -For general object store configuration, refer to the following: - -.. toctree:: - :maxdepth: 1 - - Storage devices <storage-devices> - ceph-conf - - -.. raw:: html - - </td><td><h3>Reference</h3> - -To optimize the performance of your cluster, refer to the following: - -.. toctree:: - :maxdepth: 1 - - Network Settings <network-config-ref> - Auth Settings <auth-config-ref> - Monitor Settings <mon-config-ref> - mon-lookup-dns - Heartbeat Settings <mon-osd-interaction> - OSD Settings <osd-config-ref> - BlueStore Settings <bluestore-config-ref> - FileStore Settings <filestore-config-ref> - Journal Settings <journal-ref> - Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> - Messaging Settings <ms-ref> - General Settings <general-config-ref> - - -.. raw:: html - - </td></tr></tbody></table> diff --git a/src/ceph/doc/rados/configuration/journal-ref.rst b/src/ceph/doc/rados/configuration/journal-ref.rst deleted file mode 100644 index 97300f4..0000000 --- a/src/ceph/doc/rados/configuration/journal-ref.rst +++ /dev/null @@ -1,116 +0,0 @@ -========================== - Journal Config Reference -========================== - -.. index:: journal; journal configuration - -Ceph OSDs use a journal for two reasons: speed and consistency. - -- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes - quickly. Ceph writes small, random i/o to the journal sequentially, which - tends to speed up bursty workloads by allowing the backing filesystem more - time to coalesce writes. 
The Ceph OSD Daemon's journal, however, can lead - to spiky performance with short spurts of high-speed writes followed by - periods without any write progress as the filesystem catches up to the - journal. - -- **Consistency:** Ceph OSD Daemons require a filesystem interface that - guarantees atomic compound operations. Ceph OSD Daemons write a description - of the operation to the journal and apply the operation to the filesystem. - This enables atomic updates to an object (for example, placement group - metadata). Every few seconds--between ``filestore max sync interval`` and - ``filestore min sync interval``--the Ceph OSD Daemon stops writes and - synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to - trim operations from the journal and reuse the space. On failure, Ceph - OSD Daemons replay the journal starting after the last synchronization - operation. - -Ceph OSD Daemons support the following journal settings: - - -``journal dio`` - -:Description: Enables direct i/o to the journal. Requires ``journal block - align`` set to ``true``. - -:Type: Boolean -:Required: Yes when using ``aio``. -:Default: ``true`` - - - -``journal aio`` - -.. versionchanged:: 0.61 Cuttlefish - -:Description: Enables using ``libaio`` for asynchronous writes to the journal. - Requires ``journal dio`` set to ``true``. - -:Type: Boolean -:Required: No. -:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``. - - -``journal block align`` - -:Description: Block aligns write operations. Required for ``dio`` and ``aio``. -:Type: Boolean -:Required: Yes when using ``dio`` and ``aio``. -:Default: ``true`` - - -``journal max write bytes`` - -:Description: The maximum number of bytes the journal will write at - any one time. - -:Type: Integer -:Required: No -:Default: ``10 << 20`` - - -``journal max write entries`` - -:Description: The maximum number of entries the journal will write at - any one time. - -:Type: Integer -:Required: No -:Default: ``100`` - - -``journal queue max ops`` - -:Description: The maximum number of operations allowed in the queue at - any one time. - -:Type: Integer -:Required: No -:Default: ``500`` - - -``journal queue max bytes`` - -:Description: The maximum number of bytes allowed in the queue at - any one time. - -:Type: Integer -:Required: No -:Default: ``10 << 20`` - - -``journal align min size`` - -:Description: Align data payloads greater than the specified minimum. -:Type: Integer -:Required: No -:Default: ``64 << 10`` - - -``journal zero on create`` - -:Description: Causes the file store to overwrite the entire journal with - ``0``'s during ``mkfs``. -:Type: Boolean -:Required: No -:Default: ``false`` diff --git a/src/ceph/doc/rados/configuration/mon-config-ref.rst b/src/ceph/doc/rados/configuration/mon-config-ref.rst deleted file mode 100644 index 6c8e92b..0000000 --- a/src/ceph/doc/rados/configuration/mon-config-ref.rst +++ /dev/null @@ -1,1222 +0,0 @@ -========================== - Monitor Config Reference -========================== - -Understanding how to configure a :term:`Ceph Monitor` is an important part of -building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters -have at least one monitor**. A monitor configuration usually remains fairly -consistent, but you can add, remove or replace a monitor in a cluster. See -`Adding/Removing a Monitor`_ and `Add/Remove a Monitor (ceph-deploy)`_ for -details. - - -.. 
index:: Ceph Monitor; Paxos - -Background -========== - -Ceph Monitors maintain a "master copy" of the :term:`cluster map`, which means a -:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD -Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and -retrieving a current cluster map. Before Ceph Clients can read from or write to -Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor -first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph -Client can compute the location for any object. The ability to compute object -locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a -very important aspect of Ceph's high scalability and performance. See -`Scalability and High Availability`_ for additional details. - -The primary role of the Ceph Monitor is to maintain a master copy of the cluster -map. Ceph Monitors also provide authentication and logging services. Ceph -Monitors write all changes in the monitor services to a single Paxos instance, -and Paxos writes the changes to a key/value store for strong consistency. Ceph -Monitors can query the most recent version of the cluster map during sync -operations. Ceph Monitors leverage the key/value store's snapshots and iterators -(using leveldb) to perform store-wide synchronization. - -.. ditaa:: - - /-------------\ /-------------\ - | Monitor | Write Changes | Paxos | - | cCCC +-------------->+ cCCC | - | | | | - +-------------+ \------+------/ - | Auth | | - +-------------+ | Write Changes - | Log | | - +-------------+ v - | Monitor Map | /------+------\ - +-------------+ | Key / Value | - | OSD Map | | Store | - +-------------+ | cCCC | - | PG Map | \------+------/ - +-------------+ ^ - | MDS Map | | Read Changes - +-------------+ | - | cCCC |*---------------------+ - \-------------/ - - -.. deprecated:: version 0.58 - -In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for -each service and store the map as a file. - -.. index:: Ceph Monitor; cluster map - -Cluster Maps ------------- - -The cluster map is a composite of maps, including the monitor map, the OSD map, -the placement group map and the metadata server map. The cluster map tracks a -number of important things: which processes are ``in`` the Ceph Storage Cluster; -which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running -or ``down``; whether, the placement groups are ``active`` or ``inactive``, and -``clean`` or in some other state; and, other details that reflect the current -state of the cluster such as the total amount of storage space, and the amount -of storage used. - -When there is a significant change in the state of the cluster--e.g., a Ceph OSD -Daemon goes down, a placement group falls into a degraded state, etc.--the -cluster map gets updated to reflect the current state of the cluster. -Additionally, the Ceph Monitor also maintains a history of the prior states of -the cluster. The monitor map, OSD map, placement group map and metadata server -map each maintain a history of their map versions. We call each version an -"epoch." - -When operating your Ceph Storage Cluster, keeping track of these states is an -important part of your system administration duties. See `Monitoring a Cluster`_ -and `Monitoring OSDs and PGs`_ for additional details. - -.. 
index:: high availability; quorum - -Monitor Quorum --------------- - -Our Configuring ceph section provides a trivial `Ceph configuration file`_ that -provides for one monitor in the test cluster. A cluster will run fine with a -single monitor; however, **a single monitor is a single-point-of-failure**. To -ensure high availability in a production Ceph Storage Cluster, you should run -Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** -bring down your entire cluster. - -When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, -Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. -A consensus requires a majority of monitors running to establish a quorum for -consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; -etc.). - -``mon force quorum join`` - -:Description: Force monitor to join quorum even if it has been previously removed from the map -:Type: Boolean -:Default: ``False`` - -.. index:: Ceph Monitor; consistency - -Consistency ------------ - -When you add monitor settings to your Ceph configuration file, you need to be -aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes -strict consistency requirements** for a Ceph monitor when discovering another -Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons -use the Ceph configuration file to discover monitors, monitors discover each -other using the monitor map (monmap), not the Ceph configuration file. - -A Ceph Monitor always refers to the local copy of the monmap when discovering -other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the -Ceph configuration file avoids errors that could break the cluster (e.g., typos -in ``ceph.conf`` when specifying a monitor address or port). Since monitors use -monmaps for discovery and they share monmaps with clients and other Ceph -daemons, **the monmap provides monitors with a strict guarantee that their -consensus is valid.** - -Strict consistency also applies to updates to the monmap. As with any other -updates on the Ceph Monitor, changes to the monmap always run through a -distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on -each update to the monmap, such as adding or removing a Ceph Monitor, to ensure -that each monitor in the quorum has the same version of the monmap. Updates to -the monmap are incremental so that Ceph Monitors have the latest agreed upon -version, and a set of previous versions. Maintaining a history enables a Ceph -Monitor that has an older version of the monmap to catch up with the current -state of the Ceph Storage Cluster. - -If Ceph Monitors discovered each other through the Ceph configuration file -instead of through the monmap, it would introduce additional risks because the -Ceph configuration files are not updated and distributed automatically. Ceph -Monitors might inadvertently use an older Ceph configuration file, fail to -recognize a Ceph Monitor, fall out of a quorum, or develop a situation where -`Paxos`_ is not able to determine the current state of the system accurately. - - -.. index:: Ceph Monitor; bootstrapping monitors - -Bootstrapping Monitors ----------------------- - -In most configuration and deployment cases, tools that deploy Ceph may help -bootstrap the Ceph Monitors by generating a monitor map for you (e.g., -``ceph-deploy``, etc). 
A Ceph Monitor requires a few explicit -settings: - -- **Filesystem ID**: The ``fsid`` is the unique identifier for your - object store. Since you can run multiple clusters on the same - hardware, you must specify the unique ID of the object store when - bootstrapping a monitor. Deployment tools usually do this for you - (e.g., ``ceph-deploy`` can call a tool like ``uuidgen``), but you - may specify the ``fsid`` manually too. - -- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within - the cluster. It is an alphanumeric value, and by convention the identifier - usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This - can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), - by a deployment tool, or using the ``ceph`` commandline. - -- **Keys**: The monitor must have secret keys. A deployment tool such as - ``ceph-deploy`` usually does this for you, but you may - perform this step manually too. See `Monitor Keyrings`_ for details. - -For additional details on bootstrapping, see `Bootstrapping a Monitor`_. - -.. index:: Ceph Monitor; configuring monitors - -Configuring Monitors -==================== - -To apply configuration settings to the entire cluster, enter the configuration -settings under ``[global]``. To apply configuration settings to all monitors in -your cluster, enter the configuration settings under ``[mon]``. To apply -configuration settings to specific monitors, specify the monitor instance -(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. - -.. code-block:: ini - - [global] - - [mon] - - [mon.a] - - [mon.b] - - [mon.c] - - -Minimum Configuration ---------------------- - -The bare minimum monitor settings for a Ceph monitor via the Ceph configuration -file include a hostname and a monitor address for each monitor. You can configure -these under ``[mon]`` or under the entry for a specific monitor. - -.. code-block:: ini - - [mon] - mon host = hostname1,hostname2,hostname3 - mon addr = 10.0.0.10:6789,10.0.0.11:6789,10.0.0.12:6789 - - -.. code-block:: ini - - [mon.a] - host = hostname1 - mon addr = 10.0.0.10:6789 - -See the `Network Configuration Reference`_ for details. - -.. note:: This minimum configuration for monitors assumes that a deployment - tool generates the ``fsid`` and the ``mon.`` key for you. - -Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP address of -the monitors. However, if you decide to change the monitor's IP address, you -must follow a specific procedure. See `Changing a Monitor's IP Address`_ for -details. - -Monitors can also be found by clients using DNS SRV records. See `Monitor lookup through DNS`_ for details. - -Cluster ID ----------- - -Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it -usually appears under the ``[global]`` section of the configuration file. -Deployment tools usually generate the ``fsid`` and store it in the monitor map, -so the value may not appear in a configuration file. The ``fsid`` makes it -possible to run daemons for multiple clusters on the same hardware. - -``fsid`` - -:Description: The cluster ID. One per cluster. -:Type: UUID -:Required: Yes. -:Default: N/A. May be generated by a deployment tool if not specified. - -.. note:: Do not set this value if you use a deployment tool that does - it for you. - - -.. 
index:: Ceph Monitor; initial members - -Initial Members ---------------- - -We recommend running a production Ceph Storage Cluster with at least three Ceph -Monitors to ensure high availability. When you run multiple monitors, you may -specify the initial monitors that must be members of the cluster in order to -establish a quorum. This may reduce the time it takes for your cluster to come -online. - -.. code-block:: ini - - [mon] - mon initial members = a,b,c - - -``mon initial members`` - -:Description: The IDs of initial monitors in a cluster during startup. If - specified, Ceph requires an odd number of monitors to form an - initial quorum (e.g., 3). - -:Type: String -:Default: None - -.. note:: A *majority* of monitors in your cluster must be able to reach - each other in order to establish a quorum. You can decrease the initial - number of monitors to establish a quorum with this setting. - -.. index:: Ceph Monitor; data path - -Data ----- - -Ceph provides a default path where Ceph Monitors store data. For optimal -performance in a production Ceph Storage Cluster, we recommend running Ceph -Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb is using -``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk -very often, which can interfere with Ceph OSD Daemon workloads if the data -store is co-located with the OSD Daemons. - -In Ceph versions 0.58 and earlier, Ceph Monitors store their data in files. This -approach allows users to inspect monitor data with common tools like ``ls`` -and ``cat``. However, it doesn't provide strong consistency. - -In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value -pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents -recovering Ceph Monitors from running corrupted versions through Paxos, and it -enables multiple modification operations in one single atomic batch, among other -advantages. - -Generally, we do not recommend changing the default data location. If you modify -the default location, we recommend that you make it uniform across Ceph Monitors -by setting it in the ``[mon]`` section of the configuration file. - - -``mon data`` - -:Description: The monitor's data location. -:Type: String -:Default: ``/var/lib/ceph/mon/$cluster-$id`` - - -``mon data size warn`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log when the monitor's data - store goes over 15GB. -:Type: Integer -:Default: 15*1024*1024*1024* - - -``mon data avail warn`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log when the available disk - space of monitor's data store is lower or equal to this - percentage. -:Type: Integer -:Default: 30 - - -``mon data avail crit`` - -:Description: Issue a ``HEALTH_ERR`` in cluster log when the available disk - space of monitor's data store is lower or equal to this - percentage. -:Type: Integer -:Default: 5 - - -``mon warn on cache pools without hit sets`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if a cache pool does not - have the hitset type set set. - See `hit set type <../operations/pools#hit-set-type>`_ for more - details. -:Type: Boolean -:Default: True - - -``mon warn on crush straw calc version zero`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH's - ``straw_calc_version`` is zero. See - `CRUSH map tunables <../operations/crush-map#tunables>`_ for - details. 
-:Type: Boolean -:Default: True - - -``mon warn on legacy crush tunables`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if - CRUSH tunables are too old (older than ``mon_min_crush_required_version``) -:Type: Boolean -:Default: True - - -``mon crush min required version`` - -:Description: The minimum tunable profile version required by the cluster. - See - `CRUSH map tunables <../operations/crush-map#tunables>`_ for - details. -:Type: String -:Default: ``firefly`` - - -``mon warn on osd down out interval zero`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if - ``mon osd down out interval`` is zero. Having this option set to - zero on the leader acts much like the ``noout`` flag. It's hard - to figure out what's going wrong with clusters witout the - ``noout`` flag set but acting like that just the same, so we - report a warning in this case. -:Type: Boolean -:Default: True - - -``mon cache target full warn ratio`` - -:Description: Position between pool's ``cache_target_full`` and - ``target_max_object`` where we start warning -:Type: Float -:Default: ``0.66`` - - -``mon health data update interval`` - -:Description: How often (in seconds) the monitor in quorum shares its health - status with its peers. (negative number disables it) -:Type: Float -:Default: ``60`` - - -``mon health to clog`` - -:Description: Enable sending health summary to cluster log periodically. -:Type: Boolean -:Default: True - - -``mon health to clog tick interval`` - -:Description: How often (in seconds) the monitor send health summary to cluster - log (a non-positive number disables it). If current health summary - is empty or identical to the last time, monitor will not send it - to cluster log. -:Type: Integer -:Default: 3600 - - -``mon health to clog interval`` - -:Description: How often (in seconds) the monitor send health summary to cluster - log (a non-positive number disables it). Monitor will always - send the summary to cluster log no matter if the summary changes - or not. -:Type: Integer -:Default: 60 - - - -.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning - -Storage Capacity ----------------- - -When a Ceph Storage Cluster gets close to its maximum capacity (i.e., ``mon osd -full ratio``), Ceph prevents you from writing to or reading from Ceph OSD -Daemons as a safety measure to prevent data loss. Therefore, letting a -production Ceph Storage Cluster approach its full ratio is not a good practice, -because it sacrifices high availability. The default full ratio is ``.95``, or -95% of capacity. This a very aggressive setting for a test cluster with a small -number of OSDs. - -.. tip:: When monitoring your cluster, be alert to warnings related to the - ``nearfull`` ratio. This means that a failure of some OSDs could result - in a temporary service disruption if one or more OSDs fails. Consider adding - more OSDs to increase storage capacity. - -A common scenario for test clusters involves a system administrator removing a -Ceph OSD Daemon from the Ceph Storage Cluster to watch the cluster rebalance; -then, removing another Ceph OSD Daemon, and so on until the Ceph Storage Cluster -eventually reaches the full ratio and locks up. We recommend a bit of capacity -planning even with a test cluster. Planning enables you to gauge how much spare -capacity you will need in order to maintain high availability. 
Ideally, you want -to plan for a series of Ceph OSD Daemon failures where the cluster can recover -to an ``active + clean`` state without replacing those Ceph OSD Daemons -immediately. You can run a cluster in an ``active + degraded`` state, but this -is not ideal for normal operating conditions. - -The following diagram depicts a simplistic Ceph Storage Cluster containing 33 -Ceph Nodes with one Ceph OSD Daemon per host, each Ceph OSD Daemon reading from -and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum -actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph -Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow -Ceph Clients to read and write data. So the Ceph Storage Cluster's operating -capacity is 95TB, not 99TB. - -.. ditaa:: - - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | - | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - -It is normal in such a cluster for one or two OSDs to fail. A less frequent but -reasonable scenario involves a rack's router or power supply failing, which -brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, -you should still strive for a cluster that can remain operational and achieve an -``active + clean`` state--even if that means adding a few hosts with additional -OSDs in short order. If your capacity utilization is too high, you may not lose -data, but you could still sacrifice data availability while resolving an outage -within a failure domain if capacity utilization of the cluster exceeds the full -ratio. For this reason, we recommend at least some rough capacity planning. - -Identify two numbers for your cluster: - -#. The number of OSDs. -#. The total capacity of the cluster - -If you divide the total capacity of your cluster by the number of OSDs in your -cluster, you will find the mean average capacity of an OSD within your cluster. -Consider multiplying that number by the number of OSDs you expect will fail -simultaneously during normal operations (a relatively small number). Finally -multiply the capacity of the cluster by the full ratio to arrive at a maximum -operating capacity; then, subtract the number of amount of data from the OSDs -you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing -process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at -a reasonable number for a near full ratio. - -.. 
code-block:: ini - - [global] - - mon osd full ratio = .80 - mon osd backfillfull ratio = .75 - mon osd nearfull ratio = .70 - - -``mon osd full ratio`` - -:Description: The percentage of disk space used before an OSD is - considered ``full``. - -:Type: Float -:Default: ``.95`` - - -``mon osd backfillfull ratio`` - -:Description: The percentage of disk space used before an OSD is - considered too ``full`` to backfill. - -:Type: Float -:Default: ``.90`` - - -``mon osd nearfull ratio`` - -:Description: The percentage of disk space used before an OSD is - considered ``nearfull``. - -:Type: Float -:Default: ``.85`` - - -.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you - may have a problem with the CRUSH weight for the nearfull OSDs. - -.. index:: heartbeat - -Heartbeat ---------- - -Ceph monitors know about the cluster by requiring reports from each OSD, and by -receiving reports from OSDs about the status of their neighboring OSDs. Ceph -provides reasonable default settings for monitor/OSD interaction; however, you -may modify them as needed. See `Monitor/OSD Interaction`_ for details. - - -.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization - -Monitor Store Synchronization ------------------------------ - -When you run a production cluster with multiple monitors (recommended), each -monitor checks to see if a neighboring monitor has a more recent version of the -cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers -higher than the most current epoch in the map of the instant monitor). -Periodically, one monitor in the cluster may fall behind the other monitors to -the point where it must leave the quorum, synchronize to retrieve the most -current information about the cluster, and then rejoin the quorum. For the -purposes of synchronization, monitors may assume one of three roles: - -#. **Leader**: The `Leader` is the first monitor to achieve the most recent - Paxos version of the cluster map. - -#. **Provider**: The `Provider` is a monitor that has the most recent version - of the cluster map, but wasn't the first to achieve the most recent version. - -#. **Requester:** A `Requester` is a monitor that has fallen behind the leader - and must synchronize in order to retrieve the most recent information about - the cluster before it can rejoin the quorum. - -These roles enable a leader to delegate synchronization duties to a provider, -which prevents synchronization requests from overloading the leader--improving -performance. In the following diagram, the requester has learned that it has -fallen behind the other monitors. The requester asks the leader to synchronize, -and the leader tells the requester to synchronize with a provider. - - -.. ditaa:: +-----------+ +---------+ +----------+ - | Requester | | Leader | | Provider | - +-----------+ +---------+ +----------+ - | | | - | | | - | Ask to Synchronize | | - |------------------->| | - | | | - |<-------------------| | - | Tell Requester to | | - | Sync with Provider | | - | | | - | Synchronize | - |--------------------+-------------------->| - | | | - |<-------------------+---------------------| - | Send Chunk to Requester | - | (repeat as necessary) | - | Requester Acks Chuck to Provider | - |--------------------+-------------------->| - | | - | Sync Complete | - | Notification | - |------------------->| - | | - |<-------------------| - | Ack | - | | - - -Synchronization always occurs when a new monitor joins the cluster. 
During -runtime operations, monitors may receive updates to the cluster map at different -times. This means the leader and provider roles may migrate from one monitor to -another. If this happens while synchronizing (e.g., a provider falls behind the -leader), the provider can terminate synchronization with a requester. - -Once synchronization is complete, Ceph requires trimming across the cluster. -Trimming requires that the placement groups are ``active + clean``. - - -``mon sync trim timeout`` - -:Description: -:Type: Double -:Default: ``30.0`` - - -``mon sync heartbeat timeout`` - -:Description: -:Type: Double -:Default: ``30.0`` - - -``mon sync heartbeat interval`` - -:Description: -:Type: Double -:Default: ``5.0`` - - -``mon sync backoff timeout`` - -:Description: -:Type: Double -:Default: ``30.0`` - - -``mon sync timeout`` - -:Description: Number of seconds the monitor will wait for the next update - message from its sync provider before it gives up and bootstrap - again. -:Type: Double -:Default: ``30.0`` - - -``mon sync max retries`` - -:Description: -:Type: Integer -:Default: ``5`` - - -``mon sync max payload size`` - -:Description: The maximum size for a sync payload (in bytes). -:Type: 32-bit Integer -:Default: ``1045676`` - - -``paxos max join drift`` - -:Description: The maximum Paxos iterations before we must first sync the - monitor data stores. When a monitor finds that its peer is too - far ahead of it, it will first sync with data stores before moving - on. -:Type: Integer -:Default: ``10`` - -``paxos stash full interval`` - -:Description: How often (in commits) to stash a full copy of the PaxosService state. - Current this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` - PaxosServices. -:Type: Integer -:Default: 25 - -``paxos propose interval`` - -:Description: Gather updates for this time interval before proposing - a map update. -:Type: Double -:Default: ``1.0`` - - -``paxos min`` - -:Description: The minimum number of paxos states to keep around -:Type: Integer -:Default: 500 - - -``paxos min wait`` - -:Description: The minimum amount of time to gather updates after a period of - inactivity. -:Type: Double -:Default: ``0.05`` - - -``paxos trim min`` - -:Description: Number of extra proposals tolerated before trimming -:Type: Integer -:Default: 250 - - -``paxos trim max`` - -:Description: The maximum number of extra proposals to trim at a time -:Type: Integer -:Default: 500 - - -``paxos service trim min`` - -:Description: The minimum amount of versions to trigger a trim (0 disables it) -:Type: Integer -:Default: 250 - - -``paxos service trim max`` - -:Description: The maximum amount of versions to trim during a single proposal (0 disables it) -:Type: Integer -:Default: 500 - - -``mon max log epochs`` - -:Description: The maximum amount of log epochs to trim during a single proposal -:Type: Integer -:Default: 500 - - -``mon max pgmap epochs`` - -:Description: The maximum amount of pgmap epochs to trim during a single proposal -:Type: Integer -:Default: 500 - - -``mon mds force trim to`` - -:Description: Force monitor to trim mdsmaps to this point (0 disables it. - dangerous, use with care) -:Type: Integer -:Default: 0 - - -``mon osd force trim to`` - -:Description: Force monitor to trim osdmaps to this point, even if there is - PGs not clean at the specified epoch (0 disables it. 
dangerous, - use with care) -:Type: Integer -:Default: 0 - -``mon osd cache size`` - -:Description: The size of osdmaps cache, not to rely on underlying store's cache -:Type: Integer -:Default: 10 - - -``mon election timeout`` - -:Description: On election proposer, maximum waiting time for all ACKs in seconds. -:Type: Float -:Default: ``5`` - - -``mon lease`` - -:Description: The length (in seconds) of the lease on the monitor's versions. -:Type: Float -:Default: ``5`` - - -``mon lease renew interval factor`` - -:Description: ``mon lease`` \* ``mon lease renew interval factor`` will be the - interval for the Leader to renew the other monitor's leases. The - factor should be less than ``1.0``. -:Type: Float -:Default: ``0.6`` - - -``mon lease ack timeout factor`` - -:Description: The Leader will wait ``mon lease`` \* ``mon lease ack timeout factor`` - for the Providers to acknowledge the lease extension. -:Type: Float -:Default: ``2.0`` - - -``mon accept timeout factor`` - -:Description: The Leader will wait ``mon lease`` \* ``mon accept timeout factor`` - for the Requester(s) to accept a Paxos update. It is also used - during the Paxos recovery phase for similar purposes. -:Type: Float -:Default: ``2.0`` - - -``mon min osdmap epochs`` - -:Description: Minimum number of OSD map epochs to keep at all times. -:Type: 32-bit Integer -:Default: ``500`` - - -``mon max pgmap epochs`` - -:Description: Maximum number of PG map epochs the monitor should keep. -:Type: 32-bit Integer -:Default: ``500`` - - -``mon max log epochs`` - -:Description: Maximum number of Log epochs the monitor should keep. -:Type: 32-bit Integer -:Default: ``500`` - - - -.. index:: Ceph Monitor; clock - -Clock ------ - -Ceph daemons pass critical messages to each other, which must be processed -before daemons reach a timeout threshold. If the clocks in Ceph monitors -are not synchronized, it can lead to a number of anomalies. For example: - -- Daemons ignoring received messages (e.g., timestamps outdated) -- Timeouts triggered too soon/late when a message wasn't received in time. - -See `Monitor Store Synchronization`_ for details. - - -.. tip:: You SHOULD install NTP on your Ceph monitor hosts to - ensure that the monitor cluster operates with synchronized clocks. - -Clock drift may still be noticeable with NTP even though the discrepancy is not -yet harmful. Ceph's clock drift / clock skew warnings may get triggered even -though NTP maintains a reasonable level of synchronization. Increasing your -clock drift may be tolerable under such circumstances; however, a number of -factors such as workload, network latency, configuring overrides to default -timeouts and the `Monitor Store Synchronization`_ settings may influence -the level of acceptable clock drift without compromising Paxos guarantees. - -Ceph provides the following tunable options to allow you to find -acceptable values. - - -``clock offset`` - -:Description: How much to offset the system clock. See ``Clock.cc`` for details. -:Type: Double -:Default: ``0`` - - -.. deprecated:: 0.58 - -``mon tick interval`` - -:Description: A monitor's tick interval in seconds. -:Type: 32-bit Integer -:Default: ``5`` - - -``mon clock drift allowed`` - -:Description: The clock drift in seconds allowed between monitors. 
-:Type: Float -:Default: ``.050`` - - -``mon clock drift warn backoff`` - -:Description: Exponential backoff for clock drift warnings -:Type: Float -:Default: ``5`` - - -``mon timecheck interval`` - -:Description: The time check interval (clock drift check) in seconds - for the Leader. - -:Type: Float -:Default: ``300.0`` - - -``mon timecheck skew interval`` - -:Description: The time check interval (clock drift check) in seconds when in - presence of a skew in seconds for the Leader. -:Type: Float -:Default: ``30.0`` - - -Client ------- - -``mon client hunt interval`` - -:Description: The client will try a new monitor every ``N`` seconds until it - establishes a connection. - -:Type: Double -:Default: ``3.0`` - - -``mon client ping interval`` - -:Description: The client will ping the monitor every ``N`` seconds. -:Type: Double -:Default: ``10.0`` - - -``mon client max log entries per message`` - -:Description: The maximum number of log entries a monitor will generate - per client message. - -:Type: Integer -:Default: ``1000`` - - -``mon client bytes`` - -:Description: The amount of client message data allowed in memory (in bytes). -:Type: 64-bit Integer Unsigned -:Default: ``100ul << 20`` - - -Pool settings -============= -Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. - -Monitors can also disallow removal of pools if configured that way. - -``mon allow pool delete`` - -:Description: If the monitors should allow pools to be removed. Regardless of what the pool flags say. -:Type: Boolean -:Default: ``false`` - -``osd pool default flag hashpspool`` - -:Description: Set the hashpspool flag on new pools -:Type: Boolean -:Default: ``true`` - -``osd pool default flag nodelete`` - -:Description: Set the nodelete flag on new pools. Prevents allow pool removal with this flag in any way. -:Type: Boolean -:Default: ``false`` - -``osd pool default flag nopgchange`` - -:Description: Set the nopgchange flag on new pools. Does not allow the number of PGs to be changed for a pool. -:Type: Boolean -:Default: ``false`` - -``osd pool default flag nosizechange`` - -:Description: Set the nosizechange flag on new pools. Does not allow the size to be changed of pool. -:Type: Boolean -:Default: ``false`` - -For more information about the pool flags see `Pool values`_. - -Miscellaneous -============= - - -``mon max osd`` - -:Description: The maximum number of OSDs allowed in the cluster. -:Type: 32-bit Integer -:Default: ``10000`` - -``mon globalid prealloc`` - -:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. -:Type: 32-bit Integer -:Default: ``100`` - -``mon subscribe interval`` - -:Description: The refresh interval (in seconds) for subscriptions. The - subscription mechanism enables obtaining the cluster maps - and log information. - -:Type: Double -:Default: ``300`` - - -``mon stat smooth intervals`` - -:Description: Ceph will smooth statistics over the last ``N`` PG maps. -:Type: Integer -:Default: ``2`` - - -``mon probe timeout`` - -:Description: Number of seconds the monitor will wait to find peers before bootstrapping. -:Type: Double -:Default: ``2.0`` - - -``mon daemon bytes`` - -:Description: The message memory cap for metadata server and OSD messages (in bytes). -:Type: 64-bit Integer Unsigned -:Default: ``400ul << 20`` - - -``mon max log entries per event`` - -:Description: The maximum number of log entries per event. 
-:Type: Integer -:Default: ``4096`` - - -``mon osd prime pg temp`` - -:Description: Enables or disables priming the PGMap with the previous OSDs when an out - OSD comes back into the cluster. When set to ``true``, clients will - continue to use the previous OSDs until the newly ``in`` OSDs have - peered for that PG. -:Type: Boolean -:Default: ``true`` - - -``mon osd prime pg temp max time`` - -:Description: How much time (in seconds) the monitor should spend trying to prime the - PGMap when an out OSD comes back into the cluster. -:Type: Float -:Default: ``0.5`` - - -``mon osd prime pg temp max time estimate`` - -:Description: Maximum estimate of time spent on each PG before we prime all PGs - in parallel. -:Type: Float -:Default: ``0.25`` - - -``mon osd allow primary affinity`` - -:Description: Allow ``primary_affinity`` to be set in the osdmap. -:Type: Boolean -:Default: False - - -``mon osd pool ec fast read`` - -:Description: Whether to turn on fast read on the pool or not. It will be used as - the default setting of newly created erasure pools if ``fast_read`` - is not specified at create time. -:Type: Boolean -:Default: False - - -``mon mds skip sanity`` - -:Description: Skip safety assertions on FSMap (in case of bugs where we want to - continue anyway). The monitor terminates if the FSMap sanity check - fails, but this behaviour can be disabled by enabling this option. -:Type: Boolean -:Default: False - - -``mon max mdsmap epochs`` - -:Description: The maximum number of mdsmap epochs to trim during a single proposal. -:Type: Integer -:Default: 500 - - -``mon config key max entry size`` - -:Description: The maximum size of a config-key entry (in bytes). -:Type: Integer -:Default: 4096 - - -``mon scrub interval`` - -:Description: How often (in seconds) the monitor scrubs its store by comparing - the stored checksums with the computed checksums of all the stored - keys. -:Type: Integer -:Default: 3600*24 - - -``mon scrub max keys`` - -:Description: The maximum number of keys to scrub each time. -:Type: Integer -:Default: 100 - - -``mon compact on start`` - -:Description: Compact the database used as the Ceph Monitor store on - ``ceph-mon`` start. A manual compaction helps to shrink the - monitor database and improve its performance if the regular - compaction fails to work. -:Type: Boolean -:Default: False - - -``mon compact on bootstrap`` - -:Description: Compact the database used as the Ceph Monitor store - on bootstrap. Monitors start probing each other to create - a quorum after bootstrap. If a monitor times out before joining the - quorum, it will start over and bootstrap itself again. -:Type: Boolean -:Default: False - - -``mon compact on trim`` - -:Description: Compact a certain prefix (including paxos) when we trim its old states. -:Type: Boolean -:Default: True - - -``mon cpu threads`` - -:Description: Number of threads for performing CPU-intensive work on the monitor. -:Type: Integer -:Default: 4 - - -``mon osd mapping pgs per chunk`` - -:Description: We calculate the mapping from placement group to OSDs in chunks. - This option specifies the number of placement groups per chunk. -:Type: Integer -:Default: 4096 - - -``mon osd max split count`` - -:Description: Largest number of PGs per "involved" OSD to let split create. - When we increase the ``pg_num`` of a pool, the placement groups - will be split on all OSDs serving that pool. We want to avoid - extreme multipliers on PG splits. 
-:Type: Integer -:Default: 300 - - ``mon session timeout`` - -:Description: The monitor will terminate inactive sessions that stay idle beyond this - time limit. -:Type: Integer -:Default: 300 - - - -.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science) -.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys -.. _Ceph configuration file: ../ceph-conf/#monitors -.. _Network Configuration Reference: ../network-config-ref -.. _Monitor lookup through DNS: ../mon-lookup-dns -.. _ACID: http://en.wikipedia.org/wiki/ACID -.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons -.. _Add/Remove a Monitor (ceph-deploy): ../../deployment/ceph-deploy-mon -.. _Monitoring a Cluster: ../../operations/monitoring -.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg -.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap -.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address -.. _Monitor/OSD Interaction: ../mon-osd-interaction -.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability -.. _Pool values: ../../operations/pools/#set-pool-values diff --git a/src/ceph/doc/rados/configuration/mon-lookup-dns.rst b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst deleted file mode 100644 index e32b320..0000000 --- a/src/ceph/doc/rados/configuration/mon-lookup-dns.rst +++ /dev/null @@ -1,51 +0,0 @@ -=============================== -Looking up Monitors through DNS -=============================== - -Since version 11.0.0, RADOS supports looking up Monitors through DNS. - -This way, daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file. - -Using DNS SRV TCP records, clients are able to look up the monitors. - -This allows for less configuration on clients and monitors. Using a DNS update, clients and daemons can be made aware of changes in the monitor topology. - -By default, clients and daemons will look for the TCP service called *ceph-mon*, which is configured by the *mon_dns_srv_name* configuration directive. - - -``mon dns srv name`` - -:Description: The service name used when querying the DNS for the monitor hosts/addresses. -:Type: String -:Default: ``ceph-mon`` - -Example ------- -When the DNS search domain is set to *example.com*, a DNS zone file might contain the following elements. - -First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). - -:: - - mon1.example.com. AAAA 2001:db8::100 - mon2.example.com. AAAA 2001:db8::200 - mon3.example.com. AAAA 2001:db8::300 - -:: - - mon1.example.com. A 192.168.0.1 - mon2.example.com. A 192.168.0.2 - mon3.example.com. A 192.168.0.3 - - -With those records in place, we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. - -:: - - _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon1.example.com. - _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon2.example.com. - _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon3.example.com. - -In this case the Monitors are running on port *6789*, and their priority and weight are *10* and *60*, respectively. - -The current implementation in clients and daemons will *only* respect the priority set in SRV records, and they will only connect to the monitors with the lowest-numbered priority. Targets with the same priority will be selected at random. 
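As a rough illustration of the lookup behaviour described above, a minimal ``ceph.conf`` sketch follows: a client or daemon relying on DNS SRV lookup needs no *mon host* or *mon addr* entries and only has to override *mon_dns_srv_name* when the zone uses a non-default service name. The service name ``ceph-mon-prod`` below is a hypothetical placeholder, not a value taken from this reference.

.. code-block:: ini

    [global]
    # Monitors are discovered via _ceph-mon-prod._tcp SRV records in the
    # DNS search domain, so no "mon host" or "mon addr" entries are needed.
    mon dns srv name = ceph-mon-prod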
diff --git a/src/ceph/doc/rados/configuration/mon-osd-interaction.rst b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst deleted file mode 100644 index e335ff0..0000000 --- a/src/ceph/doc/rados/configuration/mon-osd-interaction.rst +++ /dev/null @@ -1,408 +0,0 @@ -===================================== - Configuring Monitor/OSD Interaction -===================================== - -.. index:: heartbeat - -After you have completed your initial Ceph configuration, you may deploy and run -Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the -:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage -Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring -reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph -OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph -Monitor doesn't receive reports, or if it receives reports of changes in the -Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph -Cluster Map`. - -Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon -interaction. However, you may override the defaults. The following sections -describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of -monitoring the Ceph Storage Cluster. - -.. index:: heartbeat interval - -OSDs Check Heartbeats -===================== - -Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 -seconds. You can change the heartbeat interval by adding an ``osd heartbeat -interval`` setting under the ``[osd]`` section of your Ceph configuration file, -or by setting the value at runtime. If a neighboring Ceph OSD Daemon doesn't -show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may -consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph -Monitor, which will update the Ceph Cluster Map. You may change this grace -period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` -and ``[osd]`` or ``[global]`` section of your Ceph configuration file, -or by setting the value at runtime. - - -.. ditaa:: +---------+ +---------+ - | OSD 1 | | OSD 2 | - +---------+ +---------+ - | | - |----+ Heartbeat | - | | Interval | - |<---+ Exceeded | - | | - | Check | - | Heartbeat | - |------------------->| - | | - |<-------------------| - | Heart Beating | - | | - |----+ Heartbeat | - | | Interval | - |<---+ Exceeded | - | | - | Check | - | Heartbeat | - |------------------->| - | | - |----+ Grace | - | | Period | - |<---+ Exceeded | - | | - |----+ Mark | - | | OSD 2 | - |<---+ Down | - - -.. index:: OSD down report - -OSDs Report Down OSDs -===================== - -By default, two Ceph OSD Daemons from different hosts must report to the Ceph -Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors -acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance -that all the OSDs reporting the failure are hosted in a rack with a bad switch -which has trouble connecting to another OSD. To avoid this sort of false alarm, -we consider the peers reporting a failure a proxy for a potential "subcluster" -over the overall cluster that is similarly laggy. This is clearly not true in -all cases, but will sometimes help us localize the grace correction to a subset -of the system that is unhappy. ``mon osd reporter subtree level`` is used to -group the peers into the "subcluster" by their common ancestor type in CRUSH -map. 
By default, only two reports from different subtree are required to report -another Ceph OSD Daemon ``down``. You can change the number of reporters from -unique subtrees and the common ancestor type required to report a Ceph OSD -Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` -and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of -your Ceph configuration file, or by setting the value at runtime. - - -.. ditaa:: +---------+ +---------+ +---------+ - | OSD 1 | | OSD 2 | | Monitor | - +---------+ +---------+ +---------+ - | | | - | OSD 3 Is Down | | - |---------------+--------------->| - | | | - | | | - | | OSD 3 Is Down | - | |--------------->| - | | | - | | | - | | |---------+ Mark - | | | | OSD 3 - | | |<--------+ Down - - -.. index:: peering failure - -OSDs Report Peering Failure -=========================== - -If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its -Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for -the most recent copy of the cluster map every 30 seconds. You can change the -Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` -setting under the ``[osd]`` section of your Ceph configuration file, or by -setting the value at runtime. - -.. ditaa:: +---------+ +---------+ +-------+ +---------+ - | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | - +---------+ +---------+ +-------+ +---------+ - | | | | - | Request To | | | - | Peer | | | - |-------------->| | | - |<--------------| | | - | Peering | | - | | | - | Request To | | - | Peer | | - |----------------------------->| | - | | - |----+ OSD Monitor | - | | Heartbeat | - |<---+ Interval Exceeded | - | | - | Failed to Peer with OSD 3 | - |-------------------------------------------->| - |<--------------------------------------------| - | Receive New Cluster Map | - - -.. index:: OSD status - -OSDs Report Their Status -======================== - -If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will -consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` -elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable -event such as a failure, a change in placement group stats, a change in -``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD -Daemon minimum report interval by adding an ``osd mon report interval min`` -setting under the ``[osd]`` section of your Ceph configuration file, or by -setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph -Monitor every 120 seconds irrespective of whether any notable changes occur. -You can change the Ceph Monitor report interval by adding an ``osd mon report -interval max`` setting under the ``[osd]`` section of your Ceph configuration -file, or by setting the value at runtime. - - -.. 
ditaa:: +---------+ +---------+ - | OSD 1 | | Monitor | - +---------+ +---------+ - | | - |----+ Report Min | - | | Interval | - |<---+ Exceeded | - | | - |----+ Reportable | - | | Event | - |<---+ Occurs | - | | - | Report To | - | Monitor | - |------------------->| - | | - |----+ Report Max | - | | Interval | - |<---+ Exceeded | - | | - | Report To | - | Monitor | - |------------------->| - | | - |----+ Monitor | - | | Fails | - |<---+ | - +----+ Monitor OSD - | | Report Timeout - |<---+ Exceeded - | - +----+ Mark - | | OSD 1 - |<---+ Down - - - - -Configuration Settings -====================== - -When modifying heartbeat settings, you should include them in the ``[global]`` -section of your configuration file. - -.. index:: monitor heartbeat - -Monitor Settings ----------------- - -``mon osd min up ratio`` - -:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will - mark Ceph OSD Daemons ``down``. - -:Type: Double -:Default: ``.3`` - - -``mon osd min in ratio`` - -:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will - mark Ceph OSD Daemons ``out``. - -:Type: Double -:Default: ``.75`` - - -``mon osd laggy halflife`` - -:Description: The number of seconds laggy estimates will decay. -:Type: Integer -:Default: ``60*60`` - - -``mon osd laggy weight`` - -:Description: The weight for new samples in laggy estimation decay. -:Type: Double -:Default: ``0.3`` - - - -``mon osd laggy max interval`` - -:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds). - Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of - a certain OSD. This value will be used to calculate the grace time for - that OSD. -:Type: Integer -:Default: 300 - -``mon osd adjust heartbeat grace`` - -:Description: If set to ``true``, Ceph will scale based on laggy estimations. -:Type: Boolean -:Default: ``true`` - - -``mon osd adjust down out interval`` - -:Description: If set to ``true``, Ceph will scaled based on laggy estimations. -:Type: Boolean -:Default: ``true`` - - -``mon osd auto mark in`` - -:Description: Ceph will mark any booting Ceph OSD Daemons as ``in`` - the Ceph Storage Cluster. - -:Type: Boolean -:Default: ``false`` - - -``mon osd auto mark auto out in`` - -:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out`` - of the Ceph Storage Cluster as ``in`` the cluster. - -:Type: Boolean -:Default: ``true`` - - -``mon osd auto mark new in`` - -:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the - Ceph Storage Cluster. - -:Type: Boolean -:Default: ``true`` - - -``mon osd down out interval`` - -:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon - ``down`` and ``out`` if it doesn't respond. - -:Type: 32-bit Integer -:Default: ``600`` - - -``mon osd down out subtree limit`` - -:Description: The smallest :term:`CRUSH` unit type that Ceph will **not** - automatically mark out. For instance, if set to ``host`` and if - all OSDs of a host are down, Ceph will not automatically mark out - these OSDs. - -:Type: String -:Default: ``rack`` - - -``mon osd report timeout`` - -:Description: The grace period in seconds before declaring - unresponsive Ceph OSD Daemons ``down``. - -:Type: 32-bit Integer -:Default: ``900`` - -``mon osd min down reporters`` - -:Description: The minimum number of Ceph OSD Daemons required to report a - ``down`` Ceph OSD Daemon. 
- -:Type: 32-bit Integer -:Default: ``2`` - - -``mon osd reporter subtree level`` - -:Description: In which level of parent bucket the reporters are counted. The OSDs - send failure reports to monitor if they find its peer is not responsive. - And monitor mark the reported OSD out and then down after a grace period. -:Type: String -:Default: ``host`` - - -.. index:: OSD hearbeat - -OSD Settings ------------- - -``osd heartbeat address`` - -:Description: An Ceph OSD Daemon's network address for heartbeats. -:Type: Address -:Default: The host address. - - -``osd heartbeat interval`` - -:Description: How often an Ceph OSD Daemon pings its peers (in seconds). -:Type: 32-bit Integer -:Default: ``6`` - - -``osd heartbeat grace`` - -:Description: The elapsed time when a Ceph OSD Daemon hasn't shown a heartbeat - that the Ceph Storage Cluster considers it ``down``. - This setting has to be set in both the [mon] and [osd] or [global] - section so that it is read by both the MON and OSD daemons. -:Type: 32-bit Integer -:Default: ``20`` - - -``osd mon heartbeat interval`` - -:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no - Ceph OSD Daemon peers. - -:Type: 32-bit Integer -:Default: ``30`` - - -``osd mon report interval max`` - -:Description: The maximum time in seconds that a Ceph OSD Daemon can wait before - it must report to a Ceph Monitor. - -:Type: 32-bit Integer -:Default: ``120`` - - -``osd mon report interval min`` - -:Description: The minimum number of seconds a Ceph OSD Daemon may wait - from startup or another reportable event before reporting - to a Ceph Monitor. - -:Type: 32-bit Integer -:Default: ``5`` -:Valid Range: Should be less than ``osd mon report interval max`` - - -``osd mon ack timeout`` - -:Description: The number of seconds to wait for a Ceph Monitor to acknowledge a - request for statistics. - -:Type: 32-bit Integer -:Default: ``30`` diff --git a/src/ceph/doc/rados/configuration/ms-ref.rst b/src/ceph/doc/rados/configuration/ms-ref.rst deleted file mode 100644 index 55d009e..0000000 --- a/src/ceph/doc/rados/configuration/ms-ref.rst +++ /dev/null @@ -1,154 +0,0 @@ -=========== - Messaging -=========== - -General Settings -================ - -``ms tcp nodelay`` - -:Description: Disables nagle's algorithm on messenger tcp sessions. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``ms initial backoff`` - -:Description: The initial time to wait before reconnecting on a fault. -:Type: Double -:Required: No -:Default: ``.2`` - - -``ms max backoff`` - -:Description: The maximum time to wait before reconnecting on a fault. -:Type: Double -:Required: No -:Default: ``15.0`` - - -``ms nocrc`` - -:Description: Disables crc on network messages. May increase performance if cpu limited. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``ms die on bad msg`` - -:Description: Debug option; do not configure. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``ms dispatch throttle bytes`` - -:Description: Throttles total size of messages waiting to be dispatched. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``100 << 20`` - - -``ms bind ipv6`` - -:Description: Enable if you want your daemons to bind to IPv6 address instead of IPv4 ones. (Not required if you specify a daemon or cluster IP.) -:Type: Boolean -:Required: No -:Default: ``false`` - - -``ms rwthread stack bytes`` - -:Description: Debug option for stack size; do not configure. 
-:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``1024 << 10`` - - -``ms tcp read timeout`` - -:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``900`` - - -``ms inject socket failures`` - -:Description: Debug option; do not configure. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``0`` - -Async messenger options -======================= - - -``ms async transport type`` - -:Description: Transport type used by Async Messenger. Can be ``posix``, ``dpdk`` - or ``rdma``. Posix uses standard TCP/IP networking and is default. - Other transports may be experimental and support may be limited. -:Type: String -:Required: No -:Default: ``posix`` - - -``ms async op threads`` - -:Description: Initial number of worker threads used by each Async Messenger instance. - Should be at least equal to highest number of replicas, but you can - decrease it if you are low on CPU core count and/or you host a lot of - OSDs on single server. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``3`` - - -``ms async max op threads`` - -:Description: Maximum number of worker threads used by each Async Messenger instance. - Set to lower values when your machine has limited CPU count, and increase - when your CPUs are underutilized (i. e. one or more of CPUs are - constantly on 100% load during I/O operations). -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``5`` - - -``ms async set affinity`` - -:Description: Set to true to bind Async Messenger workers to particular CPU cores. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``ms async affinity cores`` - -:Description: When ``ms async set affinity`` is true, this string specifies how Async - Messenger workers are bound to CPU cores. For example, "0,2" will bind - workers #1 and #2 to CPU cores #0 and #2, respectively. - NOTE: when manually setting affinity, make sure to not assign workers to - processors that are virtual CPUs created as an effect of Hyperthreading - or similar technology, because they are slower than regular CPU cores. -:Type: String -:Required: No -:Default: ``(empty)`` - - -``ms async send inline`` - -:Description: Send messages directly from the thread that generated them instead of - queuing and sending from Async Messenger thread. This option is known - to decrease performance on systems with a lot of CPU cores, so it's - disabled by default. -:Type: Boolean -:Required: No -:Default: ``false`` - - diff --git a/src/ceph/doc/rados/configuration/network-config-ref.rst b/src/ceph/doc/rados/configuration/network-config-ref.rst deleted file mode 100644 index 2d7f9d6..0000000 --- a/src/ceph/doc/rados/configuration/network-config-ref.rst +++ /dev/null @@ -1,494 +0,0 @@ -================================= - Network Configuration Reference -================================= - -Network configuration is critical for building a high performance :term:`Ceph -Storage Cluster`. The Ceph Storage Cluster does not perform request routing or -dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make -requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication -on behalf of Ceph Clients, which means replication and other factors impose -additional loads on Ceph Storage Cluster networks. - -Our Quick Start configurations provide a trivial `Ceph configuration file`_ that -sets monitor IP addresses and daemon host names only. 
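-For illustration only, such a minimal file might look like the following
-sketch. The values are placeholders (``fsid``, ``mon initial members`` and
-``mon host`` are standard monitor bootstrap settings; substitute your own
-cluster ID, hostnames and addresses):
-
-.. code-block:: ini
-
-    [global]
-    fsid = {cluster-uuid}
-    mon initial members = {hostname}[, {hostname}]
-    mon host = {ip-address}[, {ip-address}]
-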
Unless you specify a -cluster network, Ceph assumes a single "public" network. Ceph functions just -fine with a public network only, but you may see significant performance -improvement with a second "cluster" network in a large cluster. - -We recommend running a Ceph Storage Cluster with two networks: a public -(front-side) network and a cluster (back-side) network. To support two networks, -each :term:`Ceph Node` will need to have more than one NIC. See `Hardware -Recommendations - Networks`_ for additional details. - -.. ditaa:: - +-------------+ - | Ceph Client | - +----*--*-----+ - | ^ - Request | : Response - v | - /----------------------------------*--*-------------------------------------\ - | Public Network | - \---*--*------------*--*-------------*--*------------*--*------------*--*---/ - ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ - | | | | | | | | | | - | : | : | : | : | : - v v v v v v v v v v - +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ - | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | - +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ - ^ ^ ^ ^ ^ ^ - The cluster network relieves | | | | | | - OSD replication and heartbeat | : | : | : - traffic from the public network. v v v v v v - /------------------------------------*--*------------*--*------------*--*---\ - | cCCC Cluster Network | - \---------------------------------------------------------------------------/ - - -There are several reasons to consider operating two separate networks: - -#. **Performance:** Ceph OSD Daemons handle data replication for the Ceph - Clients. When Ceph OSD Daemons replicate data more than once, the network - load between Ceph OSD Daemons easily dwarfs the network load between Ceph - Clients and the Ceph Storage Cluster. This can introduce latency and - create a performance problem. Recovery and rebalancing can - also introduce significant latency on the public network. See - `Scalability and High Availability`_ for additional details on how Ceph - replicates data. See `Monitor / OSD Interaction`_ for details on heartbeat - traffic. - -#. **Security**: While most people are generally civil, a very tiny segment of - the population likes to engage in what's known as a Denial of Service (DoS) - attack. When traffic between Ceph OSD Daemons gets disrupted, placement - groups may no longer reflect an ``active + clean`` state, which may prevent - users from reading and writing data. A great way to defeat this type of - attack is to maintain a completely separate cluster network that doesn't - connect directly to the internet. Also, consider using `Message Signatures`_ - to defeat spoofing attacks. - - -IP Tables -========= - -By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may -configure this range at your discretion. Before configuring your IP tables, -check the default ``iptables`` configuration. - - sudo iptables -L - -Some Linux distributions include rules that reject all inbound requests -except SSH from all network interfaces. For example:: - - REJECT all -- anywhere anywhere reject-with icmp-host-prohibited - -You will need to delete these rules on both your public and cluster networks -initially, and replace them with appropriate rules when you are ready to -harden the ports on your Ceph Nodes. - - -Monitor IP Tables ------------------ - -Ceph Monitors listen on port ``6789`` by default. Additionally, Ceph Monitors -always operate on the public network. 
When you add the rule using the example -below, make sure you replace ``{iface}`` with the public network interface -(e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP address of the -public network and ``{netmask}`` with the netmask for the public network. :: - - sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT - - -MDS IP Tables -------------- - -A :term:`Ceph Metadata Server` listens on the first available port on the public -network beginning at port 6800. Note that this behavior is not deterministic, so -if you are running more than one OSD or MDS on the same host, or if you restart -the daemons within a short window of time, the daemons will bind to higher -ports. You should open the entire 6800-7300 range by default. When you add the -rule using the example below, make sure you replace ``{iface}`` with the public -network interface (e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP -address of the public network and ``{netmask}`` with the netmask of the public -network. - -For example:: - - sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT - - -OSD IP Tables -------------- - -By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node -beginning at port 6800. Note that this behavior is not deterministic, so if you -are running more than one OSD or MDS on the same host, or if you restart the -daemons within a short window of time, the daemons will bind to higher ports. -Each Ceph OSD Daemon on a Ceph Node may use up to four ports: - -#. One for talking to clients and monitors. -#. One for sending data to other OSDs. -#. Two for heartbeating on each interface. - -.. ditaa:: - /---------------\ - | OSD | - | +---+----------------+-----------+ - | | Clients & Monitors | Heartbeat | - | +---+----------------+-----------+ - | | - | +---+----------------+-----------+ - | | Data Replication | Heartbeat | - | +---+----------------+-----------+ - | cCCC | - \---------------/ - -When a daemon fails and restarts without letting go of the port, the restarted -daemon will bind to a new port. You should open the entire 6800-7300 port range -to handle this possibility. - -If you set up separate public and cluster networks, you must add rules for both -the public network and the cluster network, because clients will connect using -the public network and other Ceph OSD Daemons will connect using the cluster -network. When you add the rule using the example below, make sure you replace -``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), -``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the -public or cluster network. For example:: - - sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT - -.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the - Ceph OSD Daemons, you can consolidate the public network configuration step. - - -Ceph Networks -============= - -To configure Ceph networks, you must add a network configuration to the -``[global]`` section of the configuration file. Our 5-minute Quick Start -provides a trivial `Ceph configuration file`_ that assumes one public network -with client and server on the same network and subnet. Ceph functions just fine -with a public network only. However, Ceph allows you to establish much more -specific criteria, including multiple IP network and subnet masks for your -public network. 
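-For example, a sketch of a ``[global]`` section that declares two
-comma-delimited public subnets (the subnets shown are placeholders; see the
-``public network`` setting below for the accepted format):
-
-.. code-block:: ini
-
-    [global]
-    public network = 10.0.0.0/24, 10.0.1.0/24
-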
You can also establish a separate cluster network to handle OSD -heartbeat, object replication and recovery traffic. Don't confuse the IP -addresses you set in your configuration with the public-facing IP addresses -network clients may use to access your service. Typical internal IP networks are -often ``192.168.0.0`` or ``10.0.0.0``. - -.. tip:: If you specify more than one IP address and subnet mask for - either the public or the cluster network, the subnets within the network - must be capable of routing to each other. Additionally, make sure you - include each IP address/subnet in your IP tables and open ports for them - as necessary. - -.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). - -When you have configured your networks, you may restart your cluster or restart -each daemon. Ceph daemons bind dynamically, so you do not have to restart the -entire cluster at once if you change your network configuration. - - -Public Network --------------- - -To configure a public network, add the following option to the ``[global]`` -section of your Ceph configuration file. - -.. code-block:: ini - - [global] - ... - public network = {public-network/netmask} - - -Cluster Network ---------------- - -If you declare a cluster network, OSDs will route heartbeat, object replication -and recovery traffic over the cluster network. This may improve performance -compared to using a single network. To configure a cluster network, add the -following option to the ``[global]`` section of your Ceph configuration file. - -.. code-block:: ini - - [global] - ... - cluster network = {cluster-network/netmask} - -We prefer that the cluster network is **NOT** reachable from the public network -or the Internet for added security. - - -Ceph Daemons -============ - -Ceph has one network configuration requirement that applies to all daemons: the -Ceph configuration file **MUST** specify the ``host`` for each daemon. Ceph also -requires that a Ceph configuration file specify the monitor IP address and its -port. - -.. important:: Some deployment tools (e.g., ``ceph-deploy``, Chef) may create a - configuration file for you. **DO NOT** set these values if the deployment - tool does it for you. - -.. tip:: The ``host`` setting is the short name of the host (i.e., not - an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on - the command line to retrieve the name of the host. - - -.. code-block:: ini - - [mon.a] - - host = {hostname} - mon addr = {ip-address}:6789 - - [osd.0] - host = {hostname} - - -You do not have to set the host IP address for a daemon. If you have a static IP -configuration and both public and cluster networks running, the Ceph -configuration file may specify the IP address of the host for each daemon. To -set a static IP address for a daemon, the following option(s) should appear in -the daemon instance sections of your ``ceph.conf`` file. - -.. code-block:: ini - - [osd.0] - public addr = {host-public-ip-address} - cluster addr = {host-cluster-ip-address} - - -.. topic:: One NIC OSD in a Two Network Cluster - - Generally, we do not recommend deploying an OSD host with a single NIC in a - cluster with two networks. However, you may accomplish this by forcing the - OSD host to operate on the public network by adding a ``public addr`` entry - to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` - refers to the number of the OSD with one NIC. 
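-   For example, a sketch with the usual placeholder values, forcing the
-   single-NIC OSD onto the public network::
-
-      [osd.n]
-      public addr = {host-public-ip-address}
-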
Additionally, the public - network and cluster network must be able to route traffic to each other, - which we don't recommend for security reasons. - - -Network Config Settings -======================= - -Network configuration settings are not required. Ceph assumes a public network -with all hosts operating on it unless you specifically configure a cluster -network. - - -Public Network --------------- - -The public network configuration allows you specifically define IP addresses -and subnets for the public network. You may specifically assign static IP -addresses or override ``public network`` settings using the ``public addr`` -setting for a specific daemon. - -``public network`` - -:Description: The IP address and netmask of the public (front-side) network - (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify - comma-delimited subnets. - -:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` -:Required: No -:Default: N/A - - -``public addr`` - -:Description: The IP address for the public (front-side) network. - Set for each daemon. - -:Type: IP Address -:Required: No -:Default: N/A - - - -Cluster Network ---------------- - -The cluster network configuration allows you to declare a cluster network, and -specifically define IP addresses and subnets for the cluster network. You may -specifically assign static IP addresses or override ``cluster network`` -settings using the ``cluster addr`` setting for specific OSD daemons. - - -``cluster network`` - -:Description: The IP address and netmask of the cluster (back-side) network - (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify - comma-delimited subnets. - -:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` -:Required: No -:Default: N/A - - -``cluster addr`` - -:Description: The IP address for the cluster (back-side) network. - Set for each daemon. - -:Type: Address -:Required: No -:Default: N/A - - -Bind ----- - -Bind settings set the default port ranges Ceph OSD and MDS daemons use. The -default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration -allows you to use the configured port range. - -You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 -addresses. - - -``ms bind port min`` - -:Description: The minimum port number to which an OSD or MDS daemon will bind. -:Type: 32-bit Integer -:Default: ``6800`` -:Required: No - - -``ms bind port max`` - -:Description: The maximum port number to which an OSD or MDS daemon will bind. -:Type: 32-bit Integer -:Default: ``7300`` -:Required: No. - - -``ms bind ipv6`` - -:Description: Enables Ceph daemons to bind to IPv6 addresses. Currently the - messenger *either* uses IPv4 or IPv6, but it cannot do both. -:Type: Boolean -:Default: ``false`` -:Required: No - -``public bind addr`` - -:Description: In some dynamic deployments the Ceph MON daemon might bind - to an IP address locally that is different from the ``public addr`` - advertised to other peers in the network. The environment must ensure - that routing rules are set correclty. If ``public bind addr`` is set - the Ceph MON daemon will bind to it locally and use ``public addr`` - in the monmaps to advertise its address to peers. This behavior is limited - to the MON daemon. - -:Type: IP Address -:Required: No -:Default: N/A - - - -Hosts ------ - -Ceph expects at least one monitor declared in the Ceph configuration file, with -a ``mon addr`` setting under each declared monitor. 
Ceph expects a ``host`` -setting under each declared monitor, metadata server and OSD in the Ceph -configuration file. Optionally, a monitor can be assigned with a priority, and -the clients will always connect to the monitor with lower value of priority if -specified. - - -``mon addr`` - -:Description: A list of ``{hostname}:{port}`` entries that clients can use to - connect to a Ceph monitor. If not set, Ceph searches ``[mon.*]`` - sections. - -:Type: String -:Required: No -:Default: N/A - -``mon priority`` - -:Description: The priority of the declared monitor, the lower value the more - prefered when a client selects a monitor when trying to connect - to the cluster. - -:Type: Unsigned 16-bit Integer -:Required: No -:Default: 0 - -``host`` - -:Description: The hostname. Use this setting for specific daemon instances - (e.g., ``[osd.0]``). - -:Type: String -:Required: Yes, for daemon instances. -:Default: ``localhost`` - -.. tip:: Do not use ``localhost``. To get your host name, execute - ``hostname -s`` on your command line and use the name of your host - (to the first period, not the fully-qualified domain name). - -.. important:: You should not specify any value for ``host`` when using a third - party deployment system that retrieves the host name for you. - - - -TCP ---- - -Ceph disables TCP buffering by default. - - -``ms tcp nodelay`` - -:Description: Ceph enables ``ms tcp nodelay`` so that each request is sent - immediately (no buffering). Disabling `Nagle's algorithm`_ - increases network traffic, which can introduce latency. If you - experience large numbers of small packets, you may try - disabling ``ms tcp nodelay``. - -:Type: Boolean -:Required: No -:Default: ``true`` - - - -``ms tcp rcvbuf`` - -:Description: The size of the socket buffer on the receiving end of a network - connection. Disable by default. - -:Type: 32-bit Integer -:Required: No -:Default: ``0`` - - - -``ms tcp read timeout`` - -:Description: If a client or daemon makes a request to another Ceph daemon and - does not drop an unused connection, the ``ms tcp read timeout`` - defines the connection as idle after the specified number - of seconds. - -:Type: Unsigned 64-bit Integer -:Required: No -:Default: ``900`` 15 minutes. - - - -.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability -.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks -.. _Ceph configuration file: ../../../start/quick-ceph-deploy/#create-a-cluster -.. _hardware recommendations: ../../../start/hardware-recommendations -.. _Monitor / OSD Interaction: ../mon-osd-interaction -.. _Message Signatures: ../auth-config-ref#signatures -.. _CIDR: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing -.. _Nagle's Algorithm: http://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/src/ceph/doc/rados/configuration/osd-config-ref.rst b/src/ceph/doc/rados/configuration/osd-config-ref.rst deleted file mode 100644 index fae7078..0000000 --- a/src/ceph/doc/rados/configuration/osd-config-ref.rst +++ /dev/null @@ -1,1105 +0,0 @@ -====================== - OSD Config Reference -====================== - -.. index:: OSD; configuration - -You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD -Daemons can use the default values and a very minimal configuration. A minimal -Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and -uses default values for nearly everything else. 
- -Ceph OSD Daemons are numerically identified in incremental fashion, beginning -with ``0`` using the following convention. :: - - osd.0 - osd.1 - osd.2 - -In a configuration file, you may specify settings for all Ceph OSD Daemons in -the cluster by adding configuration settings to the ``[osd]`` section of your -configuration file. To add settings directly to a specific Ceph OSD Daemon -(e.g., ``host``), enter it in an OSD-specific section of your configuration -file. For example: - -.. code-block:: ini - - [osd] - osd journal size = 1024 - - [osd.0] - host = osd-host-a - - [osd.1] - host = osd-host-b - - -.. index:: OSD; config settings - -General Settings -================ - -The following settings provide an Ceph OSD Daemon's ID, and determine paths to -data and journals. Ceph deployment scripts typically generate the UUID -automatically. We **DO NOT** recommend changing the default paths for data or -journals, as it makes it more problematic to troubleshoot Ceph later. - -The journal size should be at least twice the product of the expected drive -speed multiplied by ``filestore max sync interval``. However, the most common -practice is to partition the journal drive (often an SSD), and mount it such -that Ceph uses the entire partition for the journal. - - -``osd uuid`` - -:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon. -:Type: UUID -:Default: The UUID. -:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid`` - applies to the entire cluster. - - -``osd data`` - -:Description: The path to the OSDs data. You must create the directory when - deploying Ceph. You should mount a drive for OSD data at this - mount point. We do not recommend changing the default. - -:Type: String -:Default: ``/var/lib/ceph/osd/$cluster-$id`` - - -``osd max write size`` - -:Description: The maximum size of a write in megabytes. -:Type: 32-bit Integer -:Default: ``90`` - - -``osd client message size cap`` - -:Description: The largest client data message allowed in memory. -:Type: 64-bit Unsigned Integer -:Default: 500MB default. ``500*1024L*1024L`` - - -``osd class dir`` - -:Description: The class path for RADOS class plug-ins. -:Type: String -:Default: ``$libdir/rados-classes`` - - -.. index:: OSD; file system - -File System Settings -==================== -Ceph builds and mounts file systems which are used for Ceph OSDs. - -``osd mkfs options {fs-type}`` - -:Description: Options used when creating a new Ceph OSD of type {fs-type}. - -:Type: String -:Default for xfs: ``-f -i 2048`` -:Default for other file systems: {empty string} - -For example:: - ``osd mkfs options xfs = -f -d agcount=24`` - -``osd mount options {fs-type}`` - -:Description: Options used when mounting a Ceph OSD of type {fs-type}. - -:Type: String -:Default for xfs: ``rw,noatime,inode64`` -:Default for other file systems: ``rw, noatime`` - -For example:: - ``osd mount options xfs = rw, noatime, inode64, logbufs=8`` - - -.. index:: OSD; journal settings - -Journal Settings -================ - -By default, Ceph expects that you will store an Ceph OSD Daemons journal with -the following path:: - - /var/lib/ceph/osd/$cluster-$id/journal - -Without performance optimization, Ceph stores the journal on the same disk as -the Ceph OSD Daemons data. An Ceph OSD Daemon optimized for performance may use -a separate disk to store journal data (e.g., a solid state drive delivers high -performance journaling). 
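-For illustration, a sketch that places the journal of one OSD on a separate
-SSD partition or file (the path is a placeholder; ``osd journal`` may point
-to either a file or a block device, as described below)::
-
-   [osd.0]
-   osd journal = {path-to-ssd-partition-or-file}
-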
- -Ceph's default ``osd journal size`` is 0, so you will need to set this in your -``ceph.conf`` file. A journal size should find the product of the ``filestore -max sync interval`` and the expected throughput, and multiply the product by -two (2):: - - osd journal size = {2 * (expected throughput * filestore max sync interval)} - -The expected throughput number should include the expected disk throughput -(i.e., sustained data transfer rate), and network throughput. For example, -a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()`` -of the disk and network throughput should provide a reasonable expected -throughput. Some users just start off with a 10GB journal size. For -example:: - - osd journal size = 10000 - - -``osd journal`` - -:Description: The path to the OSD's journal. This may be a path to a file or a - block device (such as a partition of an SSD). If it is a file, - you must create the directory to contain it. We recommend using a - drive separate from the ``osd data`` drive. - -:Type: String -:Default: ``/var/lib/ceph/osd/$cluster-$id/journal`` - - -``osd journal size`` - -:Description: The size of the journal in megabytes. If this is 0, and the - journal is a block device, the entire block device is used. - Since v0.54, this is ignored if the journal is a block device, - and the entire block device is used. - -:Type: 32-bit Integer -:Default: ``5120`` -:Recommended: Begin with 1GB. Should be at least twice the product of the - expected speed multiplied by ``filestore max sync interval``. - - -See `Journal Config Reference`_ for additional details. - - -Monitor OSD Interaction -======================= - -Ceph OSD Daemons check each other's heartbeats and report to monitors -periodically. Ceph can use default values in many cases. However, if your -network has latency issues, you may need to adopt longer intervals. See -`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. - - -Data Placement -============== - -See `Pool & PG Config Reference`_ for details. - - -.. index:: OSD; scrubbing - -Scrubbing -========= - -In addition to making multiple copies of objects, Ceph insures data integrity by -scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the -object storage layer. For each placement group, Ceph generates a catalog of all -objects and compares each primary object and its replicas to ensure that no -objects are missing or mismatched. Light scrubbing (daily) checks the object -size and attributes. Deep scrubbing (weekly) reads the data and uses checksums -to ensure data integrity. - -Scrubbing is important for maintaining data integrity, but it can reduce -performance. You can adjust the following settings to increase or decrease -scrubbing operations. - - -``osd max scrubs`` - -:Description: The maximum number of simultaneous scrub operations for - a Ceph OSD Daemon. - -:Type: 32-bit Int -:Default: ``1`` - -``osd scrub begin hour`` - -:Description: The time of day for the lower bound when a scheduled scrub can be - performed. -:Type: Integer in the range of 0 to 24 -:Default: ``0`` - - -``osd scrub end hour`` - -:Description: The time of day for the upper bound when a scheduled scrub can be - performed. Along with ``osd scrub begin hour``, they define a time - window, in which the scrubs can happen. But a scrub will be performed - no matter the time window allows or not, as long as the placement - group's scrub interval exceeds ``osd scrub max interval``. 
-:Type: Integer in the range of 0 to 24 -:Default: ``24`` - - -``osd scrub during recovery`` - -:Description: Allow scrub during recovery. Setting this to ``false`` will disable - scheduling new scrub (and deep--scrub) while there is active recovery. - Already running scrubs will be continued. This might be useful to reduce - load on busy clusters. -:Type: Boolean -:Default: ``true`` - - -``osd scrub thread timeout`` - -:Description: The maximum time in seconds before timing out a scrub thread. -:Type: 32-bit Integer -:Default: ``60`` - - -``osd scrub finalize thread timeout`` - -:Description: The maximum time in seconds before timing out a scrub finalize - thread. - -:Type: 32-bit Integer -:Default: ``60*10`` - - -``osd scrub load threshold`` - -:Description: The maximum load. Ceph will not scrub when the system load - (as defined by ``getloadavg()``) is higher than this number. - Default is ``0.5``. - -:Type: Float -:Default: ``0.5`` - - -``osd scrub min interval`` - -:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon - when the Ceph Storage Cluster load is low. - -:Type: Float -:Default: Once per day. ``60*60*24`` - - -``osd scrub max interval`` - -:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon - irrespective of cluster load. - -:Type: Float -:Default: Once per week. ``7*60*60*24`` - - -``osd scrub chunk min`` - -:Description: The minimal number of object store chunks to scrub during single operation. - Ceph blocks writes to single chunk during scrub. - -:Type: 32-bit Integer -:Default: 5 - - -``osd scrub chunk max`` - -:Description: The maximum number of object store chunks to scrub during single operation. - -:Type: 32-bit Integer -:Default: 25 - - -``osd scrub sleep`` - -:Description: Time to sleep before scrubbing next group of chunks. Increasing this value will slow - down whole scrub operation while client operations will be less impacted. - -:Type: Float -:Default: 0 - - -``osd deep scrub interval`` - -:Description: The interval for "deep" scrubbing (fully reading all data). The - ``osd scrub load threshold`` does not affect this setting. - -:Type: Float -:Default: Once per week. ``60*60*24*7`` - - -``osd scrub interval randomize ratio`` - -:Description: Add a random delay to ``osd scrub min interval`` when scheduling - the next scrub job for a placement group. The delay is a random - value less than ``osd scrub min interval`` \* - ``osd scrub interval randomized ratio``. So the default setting - practically randomly spreads the scrubs out in the allowed time - window of ``[1, 1.5]`` \* ``osd scrub min interval``. -:Type: Float -:Default: ``0.5`` - -``osd deep scrub stride`` - -:Description: Read size when doing a deep scrub. -:Type: 32-bit Integer -:Default: 512 KB. ``524288`` - - -.. index:: OSD; operations settings - -Operations -========== - -Operations settings allow you to configure the number of threads for servicing -requests. If you set ``osd op threads`` to ``0``, it disables multi-threading. -By default, Ceph uses two threads with a 30 second timeout and a 30 second -complaint time if an operation doesn't complete within those time parameters. -You can set operations priority weights between client operations and -recovery operations to ensure optimal performance during recovery. - - -``osd op threads`` - -:Description: The number of threads to service Ceph OSD Daemon operations. - Set to ``0`` to disable it. Increasing the number may increase - the request processing rate. 
- -:Type: 32-bit Integer -:Default: ``2`` - - -``osd op queue`` - -:Description: This sets the type of queue to be used for prioritizing ops - in the OSDs. Both queues feature a strict sub-queue which is - dequeued before the normal queue. The normal queue is different - between implementations. The original PrioritizedQueue (``prio``) uses a - token bucket system which when there are sufficient tokens will - dequeue high priority queues first. If there are not enough - tokens available, queues are dequeued low priority to high priority. - The WeightedPriorityQueue (``wpq``) dequeues all priorities in - relation to their priorities to prevent starvation of any queue. - WPQ should help in cases where a few OSDs are more overloaded - than others. The new mClock based OpClassQueue - (``mclock_opclass``) prioritizes operations based on which class - they belong to (recovery, scrub, snaptrim, client op, osd subop). - And, the mClock based ClientQueue (``mclock_client``) also - incorporates the client identifier in order to promote fairness - between clients. See `QoS Based on mClock`_. Requires a restart. - -:Type: String -:Valid Choices: prio, wpq, mclock_opclass, mclock_client -:Default: ``prio`` - - -``osd op queue cut off`` - -:Description: This selects which priority ops will be sent to the strict - queue verses the normal queue. The ``low`` setting sends all - replication ops and higher to the strict queue, while the ``high`` - option sends only replication acknowledgement ops and higher to - the strict queue. Setting this to ``high`` should help when a few - OSDs in the cluster are very busy especially when combined with - ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy - handling replication traffic could starve primary client traffic - on these OSDs without these settings. Requires a restart. - -:Type: String -:Valid Choices: low, high -:Default: ``low`` - - -``osd client op priority`` - -:Description: The priority set for client operations. It is relative to - ``osd recovery op priority``. - -:Type: 32-bit Integer -:Default: ``63`` -:Valid Range: 1-63 - - -``osd recovery op priority`` - -:Description: The priority set for recovery operations. It is relative to - ``osd client op priority``. - -:Type: 32-bit Integer -:Default: ``3`` -:Valid Range: 1-63 - - -``osd scrub priority`` - -:Description: The priority set for scrub operations. It is relative to - ``osd client op priority``. - -:Type: 32-bit Integer -:Default: ``5`` -:Valid Range: 1-63 - - -``osd snap trim priority`` - -:Description: The priority set for snap trim operations. It is relative to - ``osd client op priority``. - -:Type: 32-bit Integer -:Default: ``5`` -:Valid Range: 1-63 - - -``osd op thread timeout`` - -:Description: The Ceph OSD Daemon operation thread timeout in seconds. -:Type: 32-bit Integer -:Default: ``15`` - - -``osd op complaint time`` - -:Description: An operation becomes complaint worthy after the specified number - of seconds have elapsed. - -:Type: Float -:Default: ``30`` - - -``osd disk threads`` - -:Description: The number of disk threads, which are used to perform background - disk intensive OSD operations such as scrubbing and snap - trimming. - -:Type: 32-bit Integer -:Default: ``1`` - -``osd disk thread ioprio class`` - -:Description: Warning: it will only be used if both ``osd disk thread - ioprio class`` and ``osd disk thread ioprio priority`` are - set to a non default value. Sets the ioprio_set(2) I/O - scheduling ``class`` for the disk thread. 
Acceptable - values are ``idle``, ``be`` or ``rt``. The ``idle`` - class means the disk thread will have lower priority - than any other thread in the OSD. This is useful to slow - down scrubbing on an OSD that is busy handling client - operations. ``be`` is the default and is the same - priority as all other threads in the OSD. ``rt`` means - the disk thread will have precendence over all other - threads in the OSD. Note: Only works with the Linux Kernel - CFQ scheduler. Since Jewel scrubbing is no longer carried - out by the disk iothread, see osd priority options instead. -:Type: String -:Default: the empty string - -``osd disk thread ioprio priority`` - -:Description: Warning: it will only be used if both ``osd disk thread - ioprio class`` and ``osd disk thread ioprio priority`` are - set to a non default value. It sets the ioprio_set(2) - I/O scheduling ``priority`` of the disk thread ranging - from 0 (highest) to 7 (lowest). If all OSDs on a given - host were in class ``idle`` and compete for I/O - (i.e. due to controller congestion), it can be used to - lower the disk thread priority of one OSD to 7 so that - another OSD with priority 0 can have priority. - Note: Only works with the Linux Kernel CFQ scheduler. -:Type: Integer in the range of 0 to 7 or -1 if not to be used. -:Default: ``-1`` - -``osd op history size`` - -:Description: The maximum number of completed operations to track. -:Type: 32-bit Unsigned Integer -:Default: ``20`` - - -``osd op history duration`` - -:Description: The oldest completed operation to track. -:Type: 32-bit Unsigned Integer -:Default: ``600`` - - -``osd op log threshold`` - -:Description: How many operations logs to display at once. -:Type: 32-bit Integer -:Default: ``5`` - - -QoS Based on mClock -------------------- - -Ceph's use of mClock is currently in the experimental phase and should -be approached with an exploratory mindset. - -Core Concepts -````````````` - -The QoS support of Ceph is implemented using a queueing scheduler -based on `the dmClock algorithm`_. This algorithm allocates the I/O -resources of the Ceph cluster in proportion to weights, and enforces -the constraits of minimum reservation and maximum limitation, so that -the services can compete for the resources fairly. Currently the -*mclock_opclass* operation queue divides Ceph services involving I/O -resources into following buckets: - -- client op: the iops issued by client -- osd subop: the iops issued by primary OSD -- snap trim: the snap trimming related requests -- pg recovery: the recovery related requests -- pg scrub: the scrub related requests - -And the resources are partitioned using following three sets of tags. In other -words, the share of each type of service is controlled by three tags: - -#. reservation: the minimum IOPS allocated for the service. -#. limitation: the maximum IOPS allocated for the service. -#. weight: the proportional share of capacity if extra capacity or system - oversubscribed. - -In Ceph operations are graded with "cost". And the resources allocated -for serving various services are consumed by these "costs". So, for -example, the more reservation a services has, the more resource it is -guaranteed to possess, as long as it requires. 
Assuming there are 2 -services: recovery and client ops: - -- recovery: (r:1, l:5, w:1) -- client ops: (r:2, l:0, w:9) - -The settings above ensure that the recovery won't get more than 5 -requests per second serviced, even if it requires so (see CURRENT -IMPLEMENTATION NOTE below), and no other services are competing with -it. But if the clients start to issue large amount of I/O requests, -neither will they exhaust all the I/O resources. 1 request per second -is always allocated for recovery jobs as long as there are any such -requests. So the recovery jobs won't be starved even in a cluster with -high load. And in the meantime, the client ops can enjoy a larger -portion of the I/O resource, because its weight is "9", while its -competitor "1". In the case of client ops, it is not clamped by the -limit setting, so it can make use of all the resources if there is no -recovery ongoing. - -Along with *mclock_opclass* another mclock operation queue named -*mclock_client* is available. It divides operations based on category -but also divides them based on the client making the request. This -helps not only manage the distribution of resources spent on different -classes of operations but also tries to insure fairness among clients. - -CURRENT IMPLEMENTATION NOTE: the current experimental implementation -does not enforce the limit values. As a first approximation we decided -not to prevent operations that would otherwise enter the operation -sequencer from doing so. - -Subtleties of mClock -```````````````````` - -The reservation and limit values have a unit of requests per -second. The weight, however, does not technically have a unit and the -weights are relative to one another. So if one class of requests has a -weight of 1 and another a weight of 9, then the latter class of -requests should get 9 executed at a 9 to 1 ratio as the first class. -However that will only happen once the reservations are met and those -values include the operations executed under the reservation phase. - -Even though the weights do not have units, one must be careful in -choosing their values due how the algorithm assigns weight tags to -requests. If the weight is *W*, then for a given class of requests, -the next one that comes in will have a weight tag of *1/W* plus the -previous weight tag or the current time, whichever is larger. That -means if *W* is sufficiently large and therefore *1/W* is sufficiently -small, the calculated tag may never be assigned as it will get a value -of the current time. The ultimate lesson is that values for weight -should not be too large. They should be under the number of requests -one expects to ve serviced each second. - -Caveats -``````` - -There are some factors that can reduce the impact of the mClock op -queues within Ceph. First, requests to an OSD are sharded by their -placement group identifier. Each shard has its own mClock queue and -these queues neither interact nor share information among them. The -number of shards can be controlled with the configuration options -``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and -``osd_op_num_shards_ssd``. A lower number of shards will increase the -impact of the mClock queues, but may have other deliterious effects. - -Second, requests are transferred from the operation queue to the -operation sequencer, in which they go through the phases of -execution. The operation queue is where mClock resides and mClock -determines the next op to transfer to the operation sequencer. 
The -number of operations allowed in the operation sequencer is a complex -issue. In general we want to keep enough operations in the sequencer -so it's always getting work done on some operations while it's waiting -for disk and network access to complete on other operations. On the -other hand, once an operation is transferred to the operation -sequencer, mClock no longer has control over it. Therefore to maximize -the impact of mClock, we want to keep as few operations in the -operation sequencer as possible. So we have an inherent tension. - -The configuration options that influence the number of operations in -the operation sequencer are ``bluestore_throttle_bytes``, -``bluestore_throttle_deferred_bytes``, -``bluestore_throttle_cost_per_io``, -``bluestore_throttle_cost_per_io_hdd``, and -``bluestore_throttle_cost_per_io_ssd``. - -A third factor that affects the impact of the mClock algorithm is that -we're using a distributed system, where requests are made to multiple -OSDs and each OSD has (can have) multiple shards. Yet we're currently -using the mClock algorithm, which is not distributed (note: dmClock is -the distributed version of mClock). - -Various organizations and individuals are currently experimenting with -mClock as it exists in this code base along with their modifications -to the code base. We hope you'll share you're experiences with your -mClock and dmClock experiments in the ceph-devel mailing list. - - -``osd push per object cost`` - -:Description: the overhead for serving a push op - -:Type: Unsigned Integer -:Default: 1000 - -``osd recovery max chunk`` - -:Description: the maximum total size of data chunks a recovery op can carry. - -:Type: Unsigned Integer -:Default: 8 MiB - - -``osd op queue mclock client op res`` - -:Description: the reservation of client op. - -:Type: Float -:Default: 1000.0 - - -``osd op queue mclock client op wgt`` - -:Description: the weight of client op. - -:Type: Float -:Default: 500.0 - - -``osd op queue mclock client op lim`` - -:Description: the limit of client op. - -:Type: Float -:Default: 1000.0 - - -``osd op queue mclock osd subop res`` - -:Description: the reservation of osd subop. - -:Type: Float -:Default: 1000.0 - - -``osd op queue mclock osd subop wgt`` - -:Description: the weight of osd subop. - -:Type: Float -:Default: 500.0 - - -``osd op queue mclock osd subop lim`` - -:Description: the limit of osd subop. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock snap res`` - -:Description: the reservation of snap trimming. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock snap wgt`` - -:Description: the weight of snap trimming. - -:Type: Float -:Default: 1.0 - - -``osd op queue mclock snap lim`` - -:Description: the limit of snap trimming. - -:Type: Float -:Default: 0.001 - - -``osd op queue mclock recov res`` - -:Description: the reservation of recovery. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock recov wgt`` - -:Description: the weight of recovery. - -:Type: Float -:Default: 1.0 - - -``osd op queue mclock recov lim`` - -:Description: the limit of recovery. - -:Type: Float -:Default: 0.001 - - -``osd op queue mclock scrub res`` - -:Description: the reservation of scrub jobs. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock scrub wgt`` - -:Description: the weight of scrub jobs. - -:Type: Float -:Default: 1.0 - - -``osd op queue mclock scrub lim`` - -:Description: the limit of scrub jobs. - -:Type: Float -:Default: 0.001 - -.. 
_the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf - - -.. index:: OSD; backfilling - -Backfilling -=========== - -When you add or remove Ceph OSD Daemons to a cluster, the CRUSH algorithm will -want to rebalance the cluster by moving placement groups to or from Ceph OSD -Daemons to restore the balance. The process of migrating placement groups and -the objects they contain can reduce the cluster's operational performance -considerably. To maintain operational performance, Ceph performs this migration -with 'backfilling', which allows Ceph to set backfill operations to a lower -priority than requests to read or write data. - - -``osd max backfills`` - -:Description: The maximum number of backfills allowed to or from a single OSD. -:Type: 64-bit Unsigned Integer -:Default: ``1`` - - -``osd backfill scan min`` - -:Description: The minimum number of objects per backfill scan. - -:Type: 32-bit Integer -:Default: ``64`` - - -``osd backfill scan max`` - -:Description: The maximum number of objects per backfill scan. - -:Type: 32-bit Integer -:Default: ``512`` - - -``osd backfill retry interval`` - -:Description: The number of seconds to wait before retrying backfill requests. -:Type: Double -:Default: ``10.0`` - -.. index:: OSD; osdmap - -OSD Map -======= - -OSD maps reflect the OSD daemons operating in the cluster. Over time, the -number of map epochs increases. Ceph provides some settings to ensure that -Ceph performs well as the OSD map grows larger. - - -``osd map dedup`` - -:Description: Enable removing duplicates in the OSD map. -:Type: Boolean -:Default: ``true`` - - -``osd map cache size`` - -:Description: The number of OSD maps to keep cached. -:Type: 32-bit Integer -:Default: ``500`` - - -``osd map cache bl size`` - -:Description: The size of the in-memory OSD map cache in OSD daemons. -:Type: 32-bit Integer -:Default: ``50`` - - -``osd map cache bl inc size`` - -:Description: The size of the in-memory OSD map cache incrementals in - OSD daemons. - -:Type: 32-bit Integer -:Default: ``100`` - - -``osd map message max`` - -:Description: The maximum map entries allowed per MOSDMap message. -:Type: 32-bit Integer -:Default: ``100`` - - - -.. index:: OSD; recovery - -Recovery -======== - -When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD -begins peering with other Ceph OSD Daemons before writes can occur. See -`Monitoring OSDs and PGs`_ for details. - -If a Ceph OSD Daemon crashes and comes back online, usually it will be out of -sync with other Ceph OSD Daemons containing more recent versions of objects in -the placement groups. When this happens, the Ceph OSD Daemon goes into recovery -mode and seeks to get the latest copy of the data and bring its map back up to -date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects -and placement groups may be significantly out of date. Also, if a failure domain -went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at -the same time. This can make the recovery process time consuming and resource -intensive. - -To maintain operational performance, Ceph performs recovery with limitations on -the number recovery requests, threads and object chunk sizes which allows Ceph -perform well in a degraded state. - - -``osd recovery delay start`` - -:Description: After peering completes, Ceph will delay for the specified number - of seconds before starting to recover objects. 
- -:Type: Float -:Default: ``0`` - - -``osd recovery max active`` - -:Description: The number of active recovery requests per OSD at one time. More - requests will accelerate recovery, but the requests places an - increased load on the cluster. - -:Type: 32-bit Integer -:Default: ``3`` - - -``osd recovery max chunk`` - -:Description: The maximum size of a recovered chunk of data to push. -:Type: 64-bit Unsigned Integer -:Default: ``8 << 20`` - - -``osd recovery max single start`` - -:Description: The maximum number of recovery operations per OSD that will be - newly started when an OSD is recovering. -:Type: 64-bit Unsigned Integer -:Default: ``1`` - - -``osd recovery thread timeout`` - -:Description: The maximum time in seconds before timing out a recovery thread. -:Type: 32-bit Integer -:Default: ``30`` - - -``osd recover clone overlap`` - -:Description: Preserves clone overlap during recovery. Should always be set - to ``true``. - -:Type: Boolean -:Default: ``true`` - - -``osd recovery sleep`` - -:Description: Time in seconds to sleep before next recovery or backfill op. - Increasing this value will slow down recovery operation while - client operations will be less impacted. - -:Type: Float -:Default: ``0`` - - -``osd recovery sleep hdd`` - -:Description: Time in seconds to sleep before next recovery or backfill op - for HDDs. - -:Type: Float -:Default: ``0.1`` - - -``osd recovery sleep ssd`` - -:Description: Time in seconds to sleep before next recovery or backfill op - for SSDs. - -:Type: Float -:Default: ``0`` - - -``osd recovery sleep hybrid`` - -:Description: Time in seconds to sleep before next recovery or backfill op - when osd data is on HDD and osd journal is on SSD. - -:Type: Float -:Default: ``0.025`` - -Tiering -======= - -``osd agent max ops`` - -:Description: The maximum number of simultaneous flushing ops per tiering agent - in the high speed mode. -:Type: 32-bit Integer -:Default: ``4`` - - -``osd agent max low ops`` - -:Description: The maximum number of simultaneous flushing ops per tiering agent - in the low speed mode. -:Type: 32-bit Integer -:Default: ``2`` - -See `cache target dirty high ratio`_ for when the tiering agent flushes dirty -objects within the high speed mode. - -Miscellaneous -============= - - -``osd snap trim thread timeout`` - -:Description: The maximum time in seconds before timing out a snap trim thread. -:Type: 32-bit Integer -:Default: ``60*60*1`` - - -``osd backlog thread timeout`` - -:Description: The maximum time in seconds before timing out a backlog thread. -:Type: 32-bit Integer -:Default: ``60*60*1`` - - -``osd default notify timeout`` - -:Description: The OSD default notification timeout (in seconds). -:Type: 32-bit Unsigned Integer -:Default: ``30`` - - -``osd check for log corruption`` - -:Description: Check log files for corruption. Can be computationally expensive. -:Type: Boolean -:Default: ``false`` - - -``osd remove thread timeout`` - -:Description: The maximum time in seconds before timing out a remove OSD thread. -:Type: 32-bit Integer -:Default: ``60*60`` - - -``osd command thread timeout`` - -:Description: The maximum time in seconds before timing out a command thread. -:Type: 32-bit Integer -:Default: ``10*60`` - - -``osd command max records`` - -:Description: Limits the number of lost objects to return. -:Type: 32-bit Integer -:Default: ``256`` - - -``osd auto upgrade tmap`` - -:Description: Uses ``tmap`` for ``omap`` on old objects. 
-:Type: Boolean -:Default: ``true`` - - -``osd tmapput sets users tmap`` - -:Description: Uses ``tmap`` for debugging only. -:Type: Boolean -:Default: ``false`` - - -``osd fast fail on connection refused`` - -:Description: If this option is enabled, crashed OSDs are marked down - immediately by connected peers and MONs (assuming that the - crashed OSD host survives). Disable it to restore old - behavior, at the expense of possible long I/O stalls when - OSDs crash in the middle of I/O operations. -:Type: Boolean -:Default: ``true`` - - - -.. _pool: ../../operations/pools -.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction -.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering -.. _Pool & PG Config Reference: ../pool-pg-config-ref -.. _Journal Config Reference: ../journal-ref -.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio diff --git a/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst deleted file mode 100644 index 89a3707..0000000 --- a/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst +++ /dev/null @@ -1,270 +0,0 @@ -====================================== - Pool, PG and CRUSH Config Reference -====================================== - -.. index:: pools; configuration - -When you create pools and set the number of placement groups for the pool, Ceph -uses default values when you don't specifically override the defaults. **We -recommend** overridding some of the defaults. Specifically, we recommend setting -a pool's replica size and overriding the default number of placement groups. You -can specifically set these values when running `pool`_ commands. You can also -override the defaults by adding new ones in the ``[global]`` section of your -Ceph configuration file. - - -.. literalinclude:: pool-pg.conf - :language: ini - - - -``mon max pool pg num`` - -:Description: The maximum number of placement groups per pool. -:Type: Integer -:Default: ``65536`` - - -``mon pg create interval`` - -:Description: Number of seconds between PG creation in the same - Ceph OSD Daemon. - -:Type: Float -:Default: ``30.0`` - - -``mon pg stuck threshold`` - -:Description: Number of seconds after which PGs can be considered as - being stuck. - -:Type: 32-bit Integer -:Default: ``300`` - -``mon pg min inactive`` - -:Description: Issue a ``HEALTH_ERR`` in cluster log if the number of PGs stay - inactive longer than ``mon_pg_stuck_threshold`` exceeds this - setting. A non-positive number means disabled, never go into ERR. -:Type: Integer -:Default: ``1`` - - -``mon pg warn min per osd`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number - of PGs per (in) OSD is under this number. (a non-positive number - disables this) -:Type: Integer -:Default: ``30`` - - -``mon pg warn max per osd`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number - of PGs per (in) OSD is above this number. (a non-positive number - disables this) -:Type: Integer -:Default: ``300`` - - -``mon pg warn min objects`` - -:Description: Do not warn if the total number of objects in cluster is below - this number -:Type: Integer -:Default: ``1000`` - - -``mon pg warn min pool objects`` - -:Description: Do not warn on pools whose object number is below this number -:Type: Integer -:Default: ``1000`` - - -``mon pg check down all threshold`` - -:Description: Threshold of down OSDs percentage after which we check all PGs - for stale ones. 
-
-``mon pg warn min objects``
-
-:Description: Do not warn if the total number of objects in the cluster is below
-              this number.
-:Type: Integer
-:Default: ``1000``
-
-
-``mon pg warn min pool objects``
-
-:Description: Do not warn on pools whose object count is below this number.
-:Type: Integer
-:Default: ``1000``
-
-
-``mon pg check down all threshold``
-
-:Description: Percentage of down OSDs above which we check all PGs
-              for stale ones.
-:Type: Float
-:Default: ``0.5``
-
-
-``mon pg warn max object skew``
-
-:Description: Issue a ``HEALTH_WARN`` in the cluster log if the average object
-              count of a certain pool is greater than ``mon pg warn max object skew``
-              times the average object count of all pools. (A non-positive
-              number disables this.)
-:Type: Float
-:Default: ``10``
-
-
-``mon delta reset interval``
-
-:Description: Seconds of inactivity before we reset the PG delta to 0. We keep
-              track of the delta of the used space of each pool so that, for
-              example, it is easier to understand the progress of recovery or
-              the performance of the cache tier. If no activity is reported
-              for a certain pool, we simply reset the history of deltas for
-              that pool.
-:Type: Integer
-:Default: ``10``
-
-
-``mon osd max op age``
-
-:Description: Maximum op age before we get concerned (make it a power of 2).
-              A ``HEALTH_WARN`` will be issued if a request has been blocked
-              longer than this limit.
-:Type: Float
-:Default: ``32.0``
-
-
-``osd pg bits``
-
-:Description: Placement group bits per Ceph OSD Daemon.
-:Type: 32-bit Integer
-:Default: ``6``
-
-
-``osd pgp bits``
-
-:Description: The number of bits per Ceph OSD Daemon for PGPs.
-:Type: 32-bit Integer
-:Default: ``6``
-
-
-``osd crush chooseleaf type``
-
-:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses
-              ordinal rank rather than name.
-
-:Type: 32-bit Integer
-:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons.
-
-
-``osd crush initial weight``
-
-:Description: The initial CRUSH weight for OSDs newly added to the CRUSH map.
-
-:Type: Double
-:Default: ``the size of the newly added OSD in TB``. By default, the initial
-          CRUSH weight of a newly added OSD is set to its volume size in TB.
-          See `Weighting Bucket Items`_ for details.
-
-
-``osd pool default crush replicated ruleset``
-
-:Description: The default CRUSH ruleset to use when creating a replicated pool.
-:Type: 8-bit Integer
-:Default: ``CEPH_DEFAULT_CRUSH_REPLICATED_RULESET``, which means "pick the
-          ruleset with the lowest numerical ID and use that". This is to
-          make pool creation work in the absence of ruleset 0.
-
-
-``osd pool erasure code stripe unit``
-
-:Description: Sets the default size, in bytes, of a chunk of an object
-              stripe for erasure coded pools. Every object of size S
-              will be stored as N stripes, with each data chunk
-              receiving ``stripe unit`` bytes. Each stripe of ``N *
-              stripe unit`` bytes will be encoded/decoded
-              individually. This option is overridden by the
-              ``stripe_unit`` setting in an erasure code profile.
-
-:Type: Unsigned 32-bit Integer
-:Default: ``4096``
-
-
-``osd pool default size``
-
-:Description: Sets the number of replicas for objects in the pool. The default
-              value is the same as
-              ``ceph osd pool set {pool-name} size {size}``.
-
-:Type: 32-bit Integer
-:Default: ``3``
-
-
-``osd pool default min size``
-
-:Description: Sets the minimum number of written replicas for objects in the
-              pool required to acknowledge a write operation to the client.
-              If the minimum is not met, Ceph will not acknowledge the write
-              to the client. This setting ensures a minimum number of replicas
-              when operating in ``degraded`` mode.
-
-:Type: 32-bit Integer
-:Default: ``0``, which means no particular minimum. If ``0``, the
-          minimum is ``size - (size / 2)``.
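-
-For example (the values below are illustrative only), a cluster that keeps three
-replicas and should still acknowledge writes while one replica is unavailable
-could set::
-
-    [global]
-    osd pool default size = 3
-    # With the default of 0, the implicit minimum would already be
-    # size - (size / 2) = 3 - 1 = 2 (integer division); setting it
-    # explicitly just makes the intent obvious.
-    osd pool default min size = 2
-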
-
-
-``osd pool default pg num``
-
-:Description: The default number of placement groups for a pool. The default
-              value is the same as ``pg_num`` with ``mkpool``.
-
-:Type: 32-bit Integer
-:Default: ``8``
-
-
-``osd pool default pgp num``
-
-:Description: The default number of placement groups for placement for a pool.
-              The default value is the same as ``pgp_num`` with ``mkpool``.
-              PG and PGP should be equal (for now).
-
-:Type: 32-bit Integer
-:Default: ``8``
-
-
-``osd pool default flags``
-
-:Description: The default flags for new pools.
-:Type: 32-bit Integer
-:Default: ``0``
-
-
-``osd max pgls``
-
-:Description: The maximum number of placement groups to list. A client
-              requesting a large number can tie up the Ceph OSD Daemon.
-
-:Type: Unsigned 64-bit Integer
-:Default: ``1024``
-:Note: Default should be fine.
-
-
-``osd min pg log entries``
-
-:Description: The minimum number of placement group logs to maintain
-              when trimming log files.
-
-:Type: 32-bit Unsigned Integer
-:Default: ``1000``
-
-
-``osd default data pool replay window``
-
-:Description: The time (in seconds) for an OSD to wait for a client to replay
-              a request.
-
-:Type: 32-bit Integer
-:Default: ``45``
-
-``osd max pg per osd hard ratio``
-
-:Description: The ratio of the number of PGs per OSD allowed by the cluster
-              before the OSD refuses to create new PGs. An OSD stops creating
-              new PGs if the number of PGs it serves exceeds
-              ``osd max pg per osd hard ratio`` \* ``mon max pg per osd``.
-
-:Type: Float
-:Default: ``2``
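-
-As a hypothetical illustration (not a recommendation), if ``mon max pg per osd``
-were raised to 300 while the hard ratio stays at its default of ``2``, an OSD
-would refuse to create new PGs once it serves more than 2 * 300 = 600 of them::
-
-    [global]
-    mon max pg per osd = 300
-    osd max pg per osd hard ratio = 2
-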
-
-.. _pool: ../../operations/pools
-.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering
-.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems
diff --git a/src/ceph/doc/rados/configuration/pool-pg.conf b/src/ceph/doc/rados/configuration/pool-pg.conf
deleted file mode 100644
index 5f1b3b7..0000000
--- a/src/ceph/doc/rados/configuration/pool-pg.conf
+++ /dev/null
@@ -1,20 +0,0 @@
-[global]
-
-    # By default, Ceph makes 3 replicas of objects. If you want to make four
-    # copies of an object (a primary copy and three replica copies) instead of
-    # the default, reset the default values as shown in 'osd pool default size'.
-    # If you want to allow Ceph to write a lesser number of copies in a degraded
-    # state, set 'osd pool default min size' to a number less than the
-    # 'osd pool default size' value.
-
-    osd pool default size = 4  # Write an object 4 times.
-    osd pool default min size = 1 # Allow writing one copy in a degraded state.
-
-    # Ensure you have a realistic number of placement groups. We recommend
-    # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
-    # divided by the number of replicas (i.e., osd pool default size). So for
-    # 10 OSDs and osd pool default size = 4, we'd recommend approximately
-    # (100 * 10) / 4 = 250.
-
-    osd pool default pg num = 250
-    osd pool default pgp num = 250
diff --git a/src/ceph/doc/rados/configuration/storage-devices.rst b/src/ceph/doc/rados/configuration/storage-devices.rst
deleted file mode 100644
index 83c0c9b..0000000
--- a/src/ceph/doc/rados/configuration/storage-devices.rst
+++ /dev/null
@@ -1,83 +0,0 @@
-=================
- Storage Devices
-=================
-
-There are two Ceph daemons that store data on disk:
-
-* **Ceph OSDs** (or Object Storage Daemons) are where most of the
-  data is stored in Ceph. Generally speaking, each OSD is backed by
-  a single storage device, like a traditional hard disk (HDD) or
-  solid state disk (SSD). OSDs can also be backed by a combination
-  of devices, like an HDD for most data and an SSD (or partition of an
-  SSD) for some metadata. The number of OSDs in a cluster is
-  generally a function of how much data will be stored, how big each
-  storage device will be, and the level and type of redundancy
-  (replication or erasure coding).
-* **Ceph Monitor** daemons manage critical cluster state like cluster
-  membership and authentication information. For smaller clusters a
-  few gigabytes is all that is needed, although for larger clusters
-  the monitor database can reach tens or possibly hundreds of
-  gigabytes.
-
-
-OSD Backends
-============
-
-There are two ways that OSDs can manage the data they store. Starting
-with the Luminous 12.2.z release, the new default (and recommended) backend is
-*BlueStore*. Prior to Luminous, the default (and only option) was
-*FileStore*.
-
-BlueStore
----------
-
-BlueStore is a special-purpose storage backend designed specifically
-for managing data on disk for Ceph OSD workloads. It is motivated by
-experience supporting and managing OSDs using FileStore over the
-last ten years. Key BlueStore features include:
-
-* Direct management of storage devices. BlueStore consumes raw block
-  devices or partitions. This avoids any intervening layers of
-  abstraction (such as local file systems like XFS) that may limit
-  performance or add complexity.
-* Metadata management with RocksDB. We embed RocksDB's key/value database
-  in order to manage internal metadata, such as the mapping from object
-  names to block locations on disk.
-* Full data and metadata checksumming. By default all data and
-  metadata written to BlueStore is protected by one or more
-  checksums. No data or metadata will be read from disk or returned
-  to the user without being verified.
-* Inline compression. Data written may be optionally compressed
-  before being written to disk.
-* Multi-device metadata tiering. BlueStore allows its internal
-  journal (write-ahead log) to be written to a separate, high-speed
-  device (like an SSD, NVMe, or NVDIMM) to increase performance. If
-  a significant amount of faster storage is available, internal
-  metadata can also be stored on the faster device.
-* Efficient copy-on-write. RBD and CephFS snapshots rely on a
-  copy-on-write *clone* mechanism that is implemented efficiently in
-  BlueStore. This results in efficient I/O both for regular snapshots
-  and for erasure coded pools (which rely on cloning to implement
-  efficient two-phase commits).
-
-For more information, see :doc:`bluestore-config-ref`.
-
-FileStore
----------
-
-FileStore is the legacy approach to storing objects in Ceph. It
-relies on a standard file system (normally XFS) in combination with a
-key/value database (traditionally LevelDB, now RocksDB) for some
-metadata.
-
-FileStore is well-tested and widely used in production but suffers
-from many performance deficiencies due to its overall design and
-reliance on a traditional file system for storing object data.
-
-Although FileStore is generally capable of functioning on most
-POSIX-compatible file systems (including btrfs and ext4), we only
-recommend that XFS be used. Both btrfs and ext4 have known bugs and
-deficiencies and their use may lead to data loss. By default all Ceph
-provisioning tools will use XFS.
-
-For more information, see :doc:`filestore-config-ref`.
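-
-As a minimal sketch (assuming the ``osd objectstore`` option, which is not part
-of this reference, selects the backend used when an OSD is created), a cluster
-that wants to pin the backend for newly created OSDs could set it explicitly;
-deployment tools normally handle this, and changing the option does not convert
-existing OSDs::
-
-    [osd]
-    # "bluestore" is the default and recommended backend as of Luminous;
-    # "filestore" remains available for legacy deployments.
-    osd objectstore = bluestore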