author | Qiaowei Ren <qiaowei.ren@intel.com> | 2018-03-01 14:38:11 +0800 |
---|---|---|
committer | Qiaowei Ren <qiaowei.ren@intel.com> | 2018-03-01 14:38:11 +0800 |
commit | 7da45d65be36d36b880cc55c5036e96c24b53f00 (patch) | |
tree | d4f944eb4f8f8de50a9a7584ffa408dc3a3185b2 /src/ceph/doc/rados/operations | |
parent | 691462d09d0987b47e112d6ee8740375df3c51b2 (diff) |
remove ceph code
This patch removes initial ceph code, due to license issue.
Change-Id: I092d44f601cdf34aed92300fe13214925563081c
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Diffstat (limited to 'src/ceph/doc/rados/operations')
25 files changed, 0 insertions, 8382 deletions
diff --git a/src/ceph/doc/rados/operations/add-or-rm-mons.rst b/src/ceph/doc/rados/operations/add-or-rm-mons.rst deleted file mode 100644 index 0cdc431..0000000 --- a/src/ceph/doc/rados/operations/add-or-rm-mons.rst +++ /dev/null @@ -1,370 +0,0 @@ -========================== - Adding/Removing Monitors -========================== - -When you have a cluster up and running, you may add or remove monitors -from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_ -or `Monitor Bootstrap`_. - -Adding Monitors -=============== - -Ceph monitors are light-weight processes that maintain a master copy of the -cluster map. You can run a cluster with 1 monitor. We recommend at least 3 -monitors for a production cluster. Ceph monitors use a variation of the -`Paxos`_ protocol to establish consensus about maps and other critical -information across the cluster. Due to the nature of Paxos, Ceph requires -a majority of monitors running to establish a quorum (thus establishing -consensus). - -It is advisable to run an odd-number of monitors but not mandatory. An -odd-number of monitors has a higher resiliency to failures than an -even-number of monitors. For instance, on a 2 monitor deployment, no -failures can be tolerated in order to maintain a quorum; with 3 monitors, -one failure can be tolerated; in a 4 monitor deployment, one failure can -be tolerated; with 5 monitors, two failures can be tolerated. This is -why an odd-number is advisable. Summarizing, Ceph needs a majority of -monitors to be running (and able to communicate with each other), but that -majority can be achieved using a single monitor, or 2 out of 2 monitors, -2 out of 3, 3 out of 4, etc. - -For an initial deployment of a multi-node Ceph cluster, it is advisable to -deploy three monitors, increasing the number two at a time if a valid need -for more than three exists. 
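The majority arithmetic described above (2 of 2, 2 of 3, 3 of 4, and so on) can be sketched in a few lines of Python. This is an illustration only: the helper names ``quorum_size`` and ``tolerated_failures`` are hypothetical and are not part of any Ceph tool or API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors` (hypothetical helper)."""
    if monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority is still running."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Print the quorum size and failure tolerance for 1 to 5 monitors.
    for n in range(1, 6):
        print(f"{n} monitor(s): quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under these assumptions, going from 3 to 4 monitors raises the quorum size without raising the number of tolerated failures, which is the reason the text advises an odd number.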
- -Since monitors are light-weight, it is possible to run them on the same -host as an OSD; however, we recommend running them on separate hosts, -because fsync issues with the kernel may impair performance. - -.. note:: A *majority* of monitors in your cluster must be able to - reach each other in order to establish a quorum. - -Deploy your Hardware --------------------- - -If you are adding a new host when adding a new monitor, see `Hardware -Recommendations`_ for details on minimum recommendations for monitor hardware. -To add a monitor host to your cluster, first make sure you have an up-to-date -version of Linux installed (typically Ubuntu 14.04 or RHEL 7). - -Add your monitor host to a rack in your cluster, connect it to the network -and ensure that it has network connectivity. - -.. _Hardware Recommendations: ../../../start/hardware-recommendations - -Install the Required Software ------------------------------ - -For manually deployed clusters, you must install Ceph packages -manually. See `Installing Packages`_ for details. -You should configure SSH to a user with password-less authentication -and root permissions. - -.. _Installing Packages: ../../../install/install-storage-cluster - - -.. _Adding a Monitor (Manual): - -Adding a Monitor (Manual) -------------------------- - -This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map -and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If -this results in only two monitor daemons, you may add more monitors by -repeating this procedure until you have a sufficient number of ``ceph-mon`` -daemons to achieve a quorum. - -At this point you should define your monitor's id. Traditionally, monitors -have been named with single letters (``a``, ``b``, ``c``, ...), but you are -free to define the id as you see fit. 
For the purpose of this document, -please take into account that ``{mon-id}`` should be the id you chose, -without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` -on ``mon.a``). - -#. Create the default directory on the machine that will host your - new monitor. :: - - ssh {new-mon-host} - sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} - -#. Create a temporary directory ``{tmp}`` to keep the files needed during - this process. This directory should be different from the monitor's default - directory created in the previous step, and can be removed after all the - steps are executed. :: - - mkdir {tmp} - -#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to - the retrieved keyring, and ``{key-filename}`` is the name of the file - containing the retrieved monitor key. :: - - ceph auth get mon. -o {tmp}/{key-filename} - -#. Retrieve the monitor map, where ``{tmp}`` is the path to - the retrieved monitor map, and ``{map-filename}`` is the name of the file - containing the retrieved monitor map. :: - - ceph mon getmap -o {tmp}/{map-filename} - -#. Prepare the monitor's data directory created in the first step. You must - specify the path to the monitor map so that you can retrieve the - information about a quorum of monitors and their ``fsid``. You must also - specify a path to the monitor keyring:: - - sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} - - -#. Start the new monitor and it will automatically join the cluster. - The daemon needs to know which address to bind to, either via - ``--public-addr {ip:port}`` or by setting ``mon addr`` in the - appropriate section of ``ceph.conf``. For example:: - - ceph-mon -i {mon-id} --public-addr {ip:port} - - -Removing Monitors -================= - -When you remove monitors from a cluster, consider that Ceph monitors use -Paxos to establish consensus about the master cluster map.
You must have -a sufficient number of monitors to establish a quorum for consensus about -the cluster map. - -.. _Removing a Monitor (Manual): - -Removing a Monitor (Manual) ---------------------------- - -This procedure removes a ``ceph-mon`` daemon from your cluster. If this -procedure results in only two monitor daemons, you may add or remove another -monitor until you have a number of ``ceph-mon`` daemons that can achieve a -quorum. - -#. Stop the monitor. :: - - service ceph -a stop mon.{mon-id} - -#. Remove the monitor from the cluster. :: - - ceph mon remove {mon-id} - -#. Remove the monitor entry from ``ceph.conf``. - - -Removing Monitors from an Unhealthy Cluster -------------------------------------------- - -This procedure removes a ``ceph-mon`` daemon from an unhealthy -cluster, for example a cluster where the monitors cannot form a -quorum. - - -#. Stop all ``ceph-mon`` daemons on all monitor hosts. :: - - ssh {mon-host} - service ceph stop mon || stop ceph-mon-all - # and repeat for all mons - -#. Identify a surviving monitor and log in to that host. :: - - ssh {mon-host} - -#. Extract a copy of the monmap file. :: - - ceph-mon -i {mon-id} --extract-monmap {map-path} - # in most cases, that's - ceph-mon -i `hostname` --extract-monmap /tmp/monmap - -#. Remove the non-surviving or problematic monitors. For example, if - you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where - only ``mon.a`` will survive, follow the example below:: - - monmaptool {map-path} --rm {mon-id} - # for example, - monmaptool /tmp/monmap --rm b - monmaptool /tmp/monmap --rm c - -#. Inject the surviving map with the removed monitors into the - surviving monitor(s). For example, to inject a map into monitor - ``mon.a``, follow the example below:: - - ceph-mon -i {mon-id} --inject-monmap {map-path} - # for example, - ceph-mon -i a --inject-monmap /tmp/monmap - -#. Start only the surviving monitors. - -#. Verify the monitors form a quorum (``ceph -s``). - -#. 
You may wish to archive the removed monitors' data directory in - ``/var/lib/ceph/mon`` in a safe location, or delete it if you are - confident the remaining monitors are healthy and are sufficiently - redundant. - -.. _Changing a Monitor's IP address: - -Changing a Monitor's IP Address -=============================== - -.. important:: Existing monitors are not supposed to change their IP addresses. - -Monitors are critical components of a Ceph cluster, and they need to maintain a -quorum for the whole system to work properly. To establish a quorum, the -monitors need to discover each other. Ceph has strict requirements for -discovering monitors. - -Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. -However, monitors discover each other using the monitor map, not ``ceph.conf``. -For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you -need to obtain the current monmap for the cluster when creating a new monitor, -as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The -following sections explain the consistency requirements for Ceph monitors, and a -few safe ways to change a monitor's IP address. - - -Consistency Requirements ------------------------- - -A monitor always refers to the local copy of the monmap when discovering other -monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids -errors that could break the cluster (e.g., typos in ``ceph.conf`` when -specifying a monitor address or port). Since monitors use monmaps for discovery -and they share monmaps with clients and other Ceph daemons, the monmap provides -monitors with a strict guarantee that their consensus is valid. - -Strict consistency also applies to updates to the monmap. As with any other -updates on the monitor, changes to the monmap always run through a distributed -consensus algorithm called `Paxos`_. 
The monitors must agree on each update to -the monmap, such as adding or removing a monitor, to ensure that each monitor in -the quorum has the same version of the monmap. Updates to the monmap are -incremental so that monitors have the latest agreed upon version, and a set of -previous versions, allowing a monitor that has an older version of the monmap to -catch up with the current state of the cluster. - -If monitors discovered each other through the Ceph configuration file instead of -through the monmap, it would introduce additional risks because the Ceph -configuration files are not updated and distributed automatically. Monitors -might inadvertently use an older ``ceph.conf`` file, fail to recognize a -monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able -to determine the current state of the system accurately. Consequently, making -changes to an existing monitor's IP address must be done with great care. - - -Changing a Monitor's IP address (The Right Way) ------------------------------------------------ - -Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to -ensure that other monitors in the cluster will receive the update. To change a -monitor's IP address, you must add a new monitor with the IP address you want -to use (as described in `Adding a Monitor (Manual)`_), ensure that the new -monitor successfully joins the quorum; then, remove the monitor that uses the -old IP address. Then, update the ``ceph.conf`` file to ensure that clients and -other daemons know the IP address of the new monitor. - -For example, lets assume there are three monitors in place, such as :: - - [mon.a] - host = host01 - addr = 10.0.0.1:6789 - [mon.b] - host = host02 - addr = 10.0.0.2:6789 - [mon.c] - host = host03 - addr = 10.0.0.3:6789 - -To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the -steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. 
Ensure -that ``mon.d`` is running before removing ``mon.c``, or it will break the -quorum. Remove ``mon.c`` as described in `Removing a Monitor (Manual)`_. Moving -all three monitors would thus require repeating this process as many times as -needed. - - -Changing a Monitor's IP address (The Messy Way) ------------------------------------------------ - -There may come a time when the monitors must be moved to a different network, a -different part of the datacenter or a different datacenter altogether. While it -is possible to do it, the process becomes a bit more hazardous. - -In such a case, the solution is to generate a new monmap with updated IP -addresses for all the monitors in the cluster, and inject the new map on each -individual monitor. This is not the most user-friendly approach, but we do not -expect this to be something that needs to be done every other week. As is -clearly stated at the top of this section, monitors are not supposed to change -IP addresses. - -Using the previous monitor configuration as an example, assume you want to move -all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these -networks are unable to communicate. Use the following procedure: - -#. Retrieve the monitor map, where ``{tmp}`` is the path to - the retrieved monitor map, and ``{filename}`` is the name of the file - containing the retrieved monitor map. :: - - ceph mon getmap -o {tmp}/{filename} - -#. The following example demonstrates the contents of the monmap. :: - - $ monmaptool --print {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - epoch 1 - fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 - last_changed 2012-12-17 02:46:41.591248 - created 2012-12-17 02:46:41.591248 - 0: 10.0.0.1:6789/0 mon.a - 1: 10.0.0.2:6789/0 mon.b - 2: 10.0.0.3:6789/0 mon.c - -#. Remove the existing monitors.
:: - - $ monmaptool --rm a --rm b --rm c {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - monmaptool: removing a - monmaptool: removing b - monmaptool: removing c - monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) - -#. Add the new monitor locations. :: - - $ monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) - -#. Check new contents. :: - - $ monmaptool --print {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - epoch 1 - fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 - last_changed 2012-12-17 02:46:41.591248 - created 2012-12-17 02:46:41.591248 - 0: 10.1.0.1:6789/0 mon.a - 1: 10.1.0.2:6789/0 mon.b - 2: 10.1.0.3:6789/0 mon.c - -At this point, we assume the monitors (and stores) are installed at the new -location. The next step is to propagate the modified monmap to the new -monitors, and inject the modified monmap into each new monitor. - -#. First, make sure to stop all your monitors. Injection must be done while - the daemon is not running. - -#. Inject the monmap. :: - - ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} - -#. Restart the monitors. - -After this step, migration to the new location is complete and -the monitors should operate successfully. - - -.. _Manual Deployment: ../../../install/manual-deployment -.. _Monitor Bootstrap: ../../../dev/mon-bootstrap -.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science) diff --git a/src/ceph/doc/rados/operations/add-or-rm-osds.rst b/src/ceph/doc/rados/operations/add-or-rm-osds.rst deleted file mode 100644 index 59ce4c7..0000000 --- a/src/ceph/doc/rados/operations/add-or-rm-osds.rst +++ /dev/null @@ -1,366 +0,0 @@ -====================== - Adding/Removing OSDs -====================== - -When you have a cluster up and running, you may add OSDs or remove OSDs -from the cluster at runtime. 
- -Adding OSDs -=========== - -When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an -OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a -host machine. If your host has multiple storage drives, you may map one -``ceph-osd`` daemon for each drive. - -Generally, it's a good idea to check the capacity of your cluster to see if you -are reaching the upper end of its capacity. As your cluster reaches its ``near -full`` ratio, you should add one or more OSDs to expand your cluster's capacity. - -.. warning:: Do not let your cluster reach its ``full ratio`` before - adding an OSD. OSD failures that occur after the cluster reaches - its ``near full`` ratio may cause the cluster to exceed its - ``full ratio``. - -Deploy your Hardware --------------------- - -If you are adding a new host when adding a new OSD, see `Hardware -Recommendations`_ for details on minimum recommendations for OSD hardware. To -add an OSD host to your cluster, first make sure you have an up-to-date version -of Linux installed, and you have made some initial preparations for your -storage drives. See `Filesystem Recommendations`_ for details. - -Add your OSD host to a rack in your cluster, connect it to the network -and ensure that it has network connectivity. See the `Network Configuration -Reference`_ for details. - -.. _Hardware Recommendations: ../../../start/hardware-recommendations -.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations -.. _Network Configuration Reference: ../../configuration/network-config-ref - -Install the Required Software ------------------------------ - -For manually deployed clusters, you must install Ceph packages -manually. See `Installing Ceph (Manual)`_ for details. -You should configure SSH to a user with password-less authentication -and root permissions. - -.. 
_Installing Ceph (Manual): ../../../install - - -Adding an OSD (Manual) ----------------------- - -This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, -and configures the cluster to distribute data to the OSD. If your host has -multiple drives, you may add an OSD for each drive by repeating this procedure. - -To add an OSD, create a data directory for it, mount a drive to that directory, -add the OSD to the cluster, and then add it to the CRUSH map. - -When you add the OSD to the CRUSH map, consider the weight you give to the new -OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger -hard drives than older hosts in the cluster (i.e., they may have greater -weight). - -.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives - of dissimilar size, you can adjust their weights. However, for best - performance, consider a CRUSH hierarchy with drives of the same type/size. - -#. Create the OSD. If no UUID is given, it will be set automatically when the - OSD starts up. The following command will output the OSD number, which you - will need for subsequent steps. :: - - ceph osd create [{uuid} [{id}]] - - If the optional parameter {id} is given it will be used as the OSD id. - Note, in this case the command may fail if the number is already in use. - - .. warning:: In general, explicitly specifying {id} is not recommended. - IDs are allocated as an array, and skipping entries consumes some extra - memory. This can become significant if there are large gaps and/or - clusters are large. If {id} is not specified, the smallest available is - used. - -#. Create the default directory on your new OSD. :: - - ssh {new-osd-host} - sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} - - -#. 
If the OSD is for a drive other than the OS drive, prepare it - for use with Ceph, and mount it to the directory you just created:: - - ssh {new-osd-host} - sudo mkfs -t {fstype} /dev/{drive} - sudo mount -o user_xattr /dev/{drive} /var/lib/ceph/osd/ceph-{osd-number} - - -#. Initialize the OSD data directory. :: - - ssh {new-osd-host} - ceph-osd -i {osd-num} --mkfs --mkkey - - The directory must be empty before you can run ``ceph-osd``. - -#. Register the OSD authentication key. The value of ``ceph`` for - ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your - cluster name differs from ``ceph``, use your cluster name instead. :: - - ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring - - -#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The - ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy - wherever you wish. If you specify at least one bucket, the command - will place the OSD into the most specific bucket you specify, *and* it will - move that bucket underneath any other buckets you specify. **Important:** If - you specify only the root bucket, the command will attach the OSD directly - to the root, but CRUSH rules expect OSDs to be inside of hosts. - - For Argonaut (v 0.48), execute the following:: - - ceph osd crush add {id} {name} {weight} [{bucket-type}={bucket-name} ...] - - For Bobtail (v 0.56) and later releases, execute the following:: - - ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] - - You may also decompile the CRUSH map, add the OSD to the device list, add the - host as a bucket (if it's not already in the CRUSH map), add the device as an - item in the host, assign it a weight, recompile it and set it. See - `Add/Move an OSD`_ for details. - - -.. topic:: Argonaut (v0.48) Best Practices - - To limit impact on user I/O performance, add an OSD to the CRUSH map - with an initial weight of ``0``.
Then, ramp up the CRUSH weight a - little bit at a time. For example, to ramp by increments of ``0.2``, - start with:: - - ceph osd crush reweight {osd-id} .2 - - and allow migration to complete before reweighting to ``0.4``, - ``0.6``, and so on until the desired CRUSH weight is reached. - - To limit the impact of OSD failures, you can set:: - - mon osd down out interval = 0 - - which prevents down OSDs from automatically being marked out, and then - ramp them down manually with:: - - ceph osd reweight {osd-num} .8 - - Again, wait for the cluster to finish migrating data, and then adjust - the weight further until you reach a weight of 0. Note that this - setting prevents the cluster from automatically re-replicating data after - a failure, so please ensure that sufficient monitoring is in place for - an administrator to intervene promptly. - - Note that this practice will no longer be necessary in Bobtail and - subsequent releases. - - -Replacing an OSD ----------------- - -When disks fail, or if an administrator wants to reprovision OSDs with a new -backend, for instance when switching from FileStore to BlueStore, OSDs need to -be replaced. Unlike `Removing the OSD`_, the replaced OSD's id and CRUSH map entry -need to be kept intact after the OSD is destroyed for replacement. - -#. Destroy the OSD first:: - - ceph osd destroy {id} --yes-i-really-mean-it - -#. Zap a disk for the new OSD, if the disk was used before for other purposes. - It's not necessary for a new disk:: - - ceph-disk zap /dev/sdX - -#. Prepare the disk for replacement by using the previously destroyed OSD id:: - - ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen` - -#. And activate the OSD:: - - ceph-disk activate /dev/sdX1 - - -Starting the OSD ----------------- - -After you add an OSD to Ceph, the OSD is in your configuration. However, -it is not yet running. The OSD is ``down`` and ``in``. You must start -your new OSD before it can begin receiving data.
You may use -``service ceph`` from your admin host or start the OSD from its host -machine. - -For Ubuntu Trusty use Upstart. :: - - sudo start ceph-osd id={osd-num} - -For all other distros use systemd. :: - - sudo systemctl start ceph-osd@{osd-num} - - -Once you start your OSD, it is ``up`` and ``in``. - - -Observe the Data Migration --------------------------- - -Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing -the server by migrating placement groups to your new OSD. You can observe this -process with the `ceph`_ tool. :: - - ceph -w - -You should see the placement group states change from ``active+clean`` to -``active, some degraded objects``, and finally ``active+clean`` when migration -completes. (Control-c to exit.) - - -.. _Add/Move an OSD: ../crush-map#addosd -.. _ceph: ../monitoring - - - -Removing OSDs (Manual) -====================== - -When you want to reduce the size of a cluster or replace hardware, you may -remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd`` -daemon for one storage drive within a host machine. If your host has multiple -storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. -Generally, it's a good idea to check the capacity of your cluster to see if you -are reaching the upper end of its capacity. Ensure that when you remove an OSD -that your cluster is not at its ``near full`` ratio. - -.. warning:: Do not let your cluster reach its ``full ratio`` when - removing an OSD. Removing OSDs could cause the cluster to reach - or exceed its ``full ratio``. - - -Take the OSD out of the Cluster ------------------------------------ - -Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it -out of the cluster so that Ceph can begin rebalancing and copying its data to -other OSDs. 
:: - - ceph osd out {osd-num} - - -Observe the Data Migration --------------------------- - -Once you have taken your OSD ``out`` of the cluster, Ceph will begin -rebalancing the cluster by migrating placement groups out of the OSD you -removed. You can observe this process with the `ceph`_ tool. :: - - ceph -w - -You should see the placement group states change from ``active+clean`` to -``active, some degraded objects``, and finally ``active+clean`` when migration -completes. (Control-c to exit.) - -.. note:: Sometimes, typically in a "small" cluster with few hosts (for - instance with a small testing cluster), the fact to take ``out`` the - OSD can spawn a CRUSH corner case where some PGs remain stuck in the - ``active+remapped`` state. If you are in this case, you should mark - the OSD ``in`` with: - - ``ceph osd in {osd-num}`` - - to come back to the initial state and then, instead of marking ``out`` - the OSD, set its weight to 0 with: - - ``ceph osd crush reweight osd.{osd-num} 0`` - - After that, you can observe the data migration which should come to its - end. The difference between marking ``out`` the OSD and reweighting it - to 0 is that in the first case the weight of the bucket which contains - the OSD is not changed whereas in the second case the weight of the bucket - is updated (and decreased of the OSD weight). The reweight command could - be sometimes favoured in the case of a "small" cluster. - - - -Stopping the OSD ----------------- - -After you take an OSD out of the cluster, it may still be running. -That is, the OSD may be ``up`` and ``out``. You must stop -your OSD before you remove it from the configuration. :: - - ssh {osd-host} - sudo systemctl stop ceph-osd@{osd-num} - -Once you stop your OSD, it is ``down``. - - -Removing the OSD ----------------- - -This procedure removes an OSD from a cluster map, removes its authentication -key, removes the OSD from the OSD map, and removes the OSD from the -``ceph.conf`` file. 
If your host has multiple drives, you may need to remove an -OSD for each drive by repeating this procedure. - -#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH - map, removes its authentication key. And it is removed from the OSD map as - well. Please note the `purge subcommand`_ is introduced in Luminous, for older - versions, please see below :: - - ceph osd purge {id} --yes-i-really-mean-it - -#. Navigate to the host where you keep the master copy of the cluster's - ``ceph.conf`` file. :: - - ssh {admin-host} - cd /etc/ceph - vim ceph.conf - -#. Remove the OSD entry from your ``ceph.conf`` file (if it exists). :: - - [osd.1] - host = {hostname} - -#. From the host where you keep the master copy of the cluster's ``ceph.conf`` file, - copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of other - hosts in your cluster. - -If your Ceph cluster is older than Luminous, instead of using ``ceph osd purge``, -you need to perform this step manually: - - -#. Remove the OSD from the CRUSH map so that it no longer receives data. You may - also decompile the CRUSH map, remove the OSD from the device list, remove the - device as an item in the host bucket or remove the host bucket (if it's in the - CRUSH map and you intend to remove the host), recompile the map and set it. - See `Remove an OSD`_ for details. :: - - ceph osd crush remove {name} - -#. Remove the OSD authentication key. :: - - ceph auth del osd.{osd-num} - - The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. - If your cluster name differs from ``ceph``, use your cluster name instead. - -#. Remove the OSD. :: - - ceph osd rm {osd-num} - #for example - ceph osd rm 1 - - -.. _Remove an OSD: ../crush-map#removeosd -.. 
_purge subcommand: /man/8/ceph#osd diff --git a/src/ceph/doc/rados/operations/cache-tiering.rst b/src/ceph/doc/rados/operations/cache-tiering.rst deleted file mode 100644 index 322c6ff..0000000 --- a/src/ceph/doc/rados/operations/cache-tiering.rst +++ /dev/null @@ -1,461 +0,0 @@ -=============== - Cache Tiering -=============== - -A cache tier provides Ceph Clients with better I/O performance for a subset of -the data stored in a backing storage tier. Cache tiering involves creating a -pool of relatively fast/expensive storage devices (e.g., solid state drives) -configured to act as a cache tier, and a backing pool of either erasure-coded -or relatively slower/cheaper devices configured to act as an economical storage -tier. The Ceph objecter handles where to place the objects and the tiering -agent determines when to flush objects from the cache to the backing storage -tier. So the cache tier and the backing storage tier are completely transparent -to Ceph clients. - - -.. ditaa:: - +-------------+ - | Ceph Client | - +------+------+ - ^ - Tiering is | - Transparent | Faster I/O - to Ceph | +---------------+ - Client Ops | | | - | +----->+ Cache Tier | - | | | | - | | +-----+---+-----+ - | | | ^ - v v | | Active Data in Cache Tier - +------+----+--+ | | - | Objecter | | | - +-----------+--+ | | - ^ | | Inactive Data in Storage Tier - | v | - | +-----+---+-----+ - | | | - +----->| Storage Tier | - | | - +---------------+ - Slower I/O - - -The cache tiering agent handles the migration of data between the cache tier -and the backing storage tier automatically. However, admins have the ability to -configure how this migration takes place. There are two main scenarios: - -- **Writeback Mode:** When admins configure tiers with ``writeback`` mode, Ceph - clients write data to the cache tier and receive an ACK from the cache tier. - In time, the data written to the cache tier migrates to the storage tier - and gets flushed from the cache tier. 
Conceptually, the cache tier is - overlaid "in front" of the backing storage tier. When a Ceph client needs - data that resides in the storage tier, the cache tiering agent migrates the - data to the cache tier on read, then it is sent to the Ceph client. - Thereafter, the Ceph client can perform I/O using the cache tier, until the - data becomes inactive. This is ideal for mutable data (e.g., photo/video - editing, transactional data, etc.). - -- **Read-proxy Mode:** This mode will use any objects that already - exist in the cache tier, but if an object is not present in the - cache the request will be proxied to the base tier. This is useful - for transitioning from ``writeback`` mode to a disabled cache as it - allows the workload to function properly while the cache is drained, - without adding any new objects to the cache. - -A word of caution -================= - -Cache tiering will *degrade* performance for most workloads. Users should use -extreme caution before using this feature. - -* *Workload dependent*: Whether a cache will improve performance is - highly dependent on the workload. Because there is a cost - associated with moving objects into or out of the cache, it can only - be effective when there is a *large skew* in the access pattern in - the data set, such that most of the requests touch a small number of - objects. The cache pool should be large enough to capture the - working set for your workload to avoid thrashing. - -* *Difficult to benchmark*: Most benchmarks that users run to measure - performance will show terrible performance with cache tiering, in - part because very few of them skew requests toward a small set of - objects, it can take a long time for the cache to "warm up," and - because the warm-up cost can be high. - -* *Usually slower*: For workloads that are not cache tiering-friendly, - performance is often slower than a normal RADOS pool without cache - tiering enabled. 
- -* *librados object enumeration*: The librados-level object enumeration - API is not meant to be coherent in the presence of the cache. If - your application is using librados directly and relies on object - enumeration, cache tiering will probably not work as expected. - (This is not a problem for RGW, RBD, or CephFS.) - -* *Complexity*: Enabling cache tiering means that a lot of additional - machinery and complexity within the RADOS cluster is being used. - This increases the probability that you will encounter a bug in the system - that other users have not yet encountered and will put your deployment at a - higher level of risk. - -Known Good Workloads --------------------- - -* *RGW time-skewed*: If the RGW workload is such that almost all read - operations are directed at recently written objects, a simple cache - tiering configuration that destages recently written objects from - the cache to the base tier after a configurable period can work - well. - -Known Bad Workloads -------------------- - -The following configurations are *known to work poorly* with cache -tiering. - -* *RBD with replicated cache and erasure-coded base*: This is a common - request, but usually does not perform well. Even reasonably skewed - workloads still send some small writes to cold objects, and because - small writes are not yet supported by the erasure-coded pool, entire - (usually 4 MB) objects must be migrated into the cache in order to - satisfy a small (often 4 KB) write. Only a handful of users have - successfully deployed this configuration, and it only works for them - because their data is extremely cold (backups) and they are not in - any way sensitive to performance. - -* *RBD with replicated cache and base*: RBD with a replicated base - tier does better than when the base is erasure coded, but it is - still highly dependent on the amount of skew in the workload, and - very difficult to validate.
The user will need to have a good - understanding of their workload and will need to tune the cache - tiering parameters carefully. - - -Setting Up Pools -================ - -To set up cache tiering, you must have two pools. One will act as the -backing storage and the other will act as the cache. - - -Setting Up a Backing Storage Pool ---------------------------------- - -Setting up a backing storage pool typically involves one of two scenarios: - -- **Standard Storage**: In this scenario, the pool stores multiple copies - of an object in the Ceph Storage Cluster. - -- **Erasure Coding:** In this scenario, the pool uses erasure coding to - store data much more efficiently with a small performance tradeoff. - -In the standard storage scenario, you can setup a CRUSH ruleset to establish -the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD -Daemons perform optimally when all storage drives in the ruleset are of the -same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ -for details on creating a ruleset. Once you have created a ruleset, create -a backing storage pool. - -In the erasure coding scenario, the pool creation arguments will generate the -appropriate ruleset automatically. See `Create a Pool`_ for details. - -In subsequent examples, we will refer to the backing storage pool -as ``cold-storage``. - - -Setting Up a Cache Pool ------------------------ - -Setting up a cache pool follows the same procedure as the standard storage -scenario, but with this difference: the drives for the cache tier are typically -high performance drives that reside in their own servers and have their own -ruleset. When setting up a ruleset, it should take account of the hosts that -have the high performance drives while omitting the hosts that don't. See -`Placing Different Pools on Different OSDs`_ for details. - - -In subsequent examples, we will refer to the cache pool as ``hot-storage`` and -the backing pool as ``cold-storage``. 
- -For cache tier configuration and default values, see -`Pools - Set Pool Values`_. - - -Creating a Cache Tier -===================== - -Setting up a cache tier involves associating a backing storage pool with -a cache pool :: - - ceph osd tier add {storagepool} {cachepool} - -For example :: - - ceph osd tier add cold-storage hot-storage - -To set the cache mode, execute the following:: - - ceph osd tier cache-mode {cachepool} {cache-mode} - -For example:: - - ceph osd tier cache-mode hot-storage writeback - -The cache tiers overlay the backing storage tier, so they require one -additional step: you must direct all client traffic from the storage pool to -the cache pool. To direct client traffic directly to the cache pool, execute -the following:: - - ceph osd tier set-overlay {storagepool} {cachepool} - -For example:: - - ceph osd tier set-overlay cold-storage hot-storage - - -Configuring a Cache Tier -======================== - -Cache tiers have several configuration options. You may set -cache tier configuration options with the following usage:: - - ceph osd pool set {cachepool} {key} {value} - -See `Pools - Set Pool Values`_ for details. - - -Target Size and Type --------------------- - -Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``:: - - ceph osd pool set {cachepool} hit_set_type bloom - -For example:: - - ceph osd pool set hot-storage hit_set_type bloom - -The ``hit_set_count`` and ``hit_set_period`` define how much time each HitSet -should cover, and how many such HitSets to store. :: - - ceph osd pool set {cachepool} hit_set_count 12 - ceph osd pool set {cachepool} hit_set_period 14400 - ceph osd pool set {cachepool} target_max_bytes 1000000000000 - -.. note:: A larger ``hit_set_count`` results in more RAM consumed by - the ``ceph-osd`` process. - -Binning accesses over time allows Ceph to determine whether a Ceph client -accessed an object at least once, or more than once over a time period -("age" vs "temperature"). 
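As a rough illustration of the two HitSet settings above, the sketch below multiplies them to estimate the total time span that the stored HitSets cover. The function name ``hit_set_window_seconds`` is hypothetical, and the calculation assumes the HitSets are consecutive and do not overlap; it is plain arithmetic, not a Ceph API.

```python
def hit_set_window_seconds(hit_set_count: int, hit_set_period: int) -> int:
    """Estimate the total time span (in seconds) covered by all stored HitSets.

    hit_set_count  -- how many HitSets are kept (the ``hit_set_count`` pool value)
    hit_set_period -- how many seconds each HitSet covers (``hit_set_period``)

    Assumption: HitSets are back-to-back, so the window is count * period.
    """
    if hit_set_count < 0 or hit_set_period < 0:
        raise ValueError("hit_set_count and hit_set_period must be non-negative")
    return hit_set_count * hit_set_period


if __name__ == "__main__":
    # Values taken from the example commands above.
    window = hit_set_window_seconds(12, 14400)
    print(f"{window} seconds = {window / 3600} hours")
```

With the example values above (12 HitSets of 14400 seconds each), the estimate is 172800 seconds, or 48 hours of access history.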
- -The ``min_read_recency_for_promote`` defines how many HitSets to check for the -existence of an object when handling a read operation. The checking result is -used to decide whether to promote the object asynchronously. Its value should be -between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted. -If it's set to 1, the current HitSet is checked, and the object is promoted -only if it is in the current HitSet. For the other values, the exact -number of archive HitSets is checked. The object is promoted if the object is -found in any of the most recent ``min_read_recency_for_promote`` HitSets. - -A similar parameter can be set for the write operation, which is -``min_write_recency_for_promote``. :: - - ceph osd pool set {cachepool} min_read_recency_for_promote 2 - ceph osd pool set {cachepool} min_write_recency_for_promote 2 - -.. note:: The longer the period and the higher the - ``min_read_recency_for_promote`` and - ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd`` - daemon consumes. In particular, when the agent is active to flush - or evict cache objects, all ``hit_set_count`` HitSets are loaded - into RAM. - - -Cache Sizing ------------- - -The cache tiering agent performs two main functions: - -- **Flushing:** The agent identifies modified (or dirty) objects and forwards - them to the storage pool for long-term storage. - -- **Evicting:** The agent identifies objects that haven't been modified - (or clean) and evicts the least recently used among them from the cache. - - -Absolute Sizing -~~~~~~~~~~~~~~~ - -The cache tiering agent can flush or evict objects based upon the total number -of bytes or the total number of objects.
To specify a maximum number of bytes, -execute the following:: - - ceph osd pool set {cachepool} target_max_bytes {#bytes} - -For example, to flush or evict at 1 TiB, execute the following:: - - ceph osd pool set hot-storage target_max_bytes 1099511627776 - - -To specify the maximum number of objects, execute the following:: - - ceph osd pool set {cachepool} target_max_objects {#objects} - -For example, to flush or evict at 1M objects, execute the following:: - - ceph osd pool set hot-storage target_max_objects 1000000 - -.. note:: Ceph is not able to determine the size of a cache pool automatically, so - an absolute size must be configured here; otherwise, flushing and - evicting will not work. If you specify both limits, the cache tiering - agent will begin flushing or evicting when either threshold is triggered. - -.. note:: Client requests are blocked only when ``target_max_bytes`` or - ``target_max_objects`` is reached. - -Relative Sizing -~~~~~~~~~~~~~~~ - -The cache tiering agent can flush or evict objects relative to the size of the -cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in -`Absolute sizing`_). When the cache pool consists of a certain percentage of -modified (or dirty) objects, the cache tiering agent will flush them to the -storage pool. To set the ``cache_target_dirty_ratio``, execute the following:: - - ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0} - -For example, setting the value to ``0.4`` will begin flushing modified -(dirty) objects when they reach 40% of the cache pool's capacity:: - - ceph osd pool set hot-storage cache_target_dirty_ratio 0.4 - -When dirty objects reach a higher percentage of the cache pool's capacity, the -agent flushes them at a higher speed.
To set the ``cache_target_dirty_high_ratio``:: - - ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0} - -For example, setting the value to ``0.6`` will begin aggressively flushing dirty objects -when they reach 60% of the cache pool's capacity. The value should be set -between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``:: - - ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6 - -When the cache pool reaches a certain percentage of its capacity, the cache -tiering agent will evict objects to maintain free capacity. To set the -``cache_target_full_ratio``, execute the following:: - - ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0} - -For example, setting the value to ``0.8`` will begin evicting unmodified -(clean) objects when they reach 80% of the cache pool's capacity:: - - ceph osd pool set hot-storage cache_target_full_ratio 0.8 - - -Cache Age ---------- - -You can specify the minimum age of an object before the cache tiering agent -flushes a recently modified (or dirty) object to the backing storage pool:: - - ceph osd pool set {cachepool} cache_min_flush_age {#seconds} - -For example, to flush modified (or dirty) objects after 10 minutes, execute -the following:: - - ceph osd pool set hot-storage cache_min_flush_age 600 - -You can specify the minimum age of an object before it will be evicted from -the cache tier:: - - ceph osd pool set {cachepool} cache_min_evict_age {#seconds} - -For example, to evict objects after 30 minutes, execute the following:: - - ceph osd pool set hot-storage cache_min_evict_age 1800 - - -Removing a Cache Tier -===================== - -Removing a cache tier differs depending on whether it is a writeback -cache or a read-only cache. - - -Removing a Read-Only Cache --------------------------- - -Since a read-only cache does not have modified data, you can disable -and remove it without losing any recent changes to objects in the cache. - -#. Change the cache-mode to ``none`` to disable it.
:: - - ceph osd tier cache-mode {cachepool} none - - For example:: - - ceph osd tier cache-mode hot-storage none - -#. Remove the cache pool from the backing pool. :: - - ceph osd tier remove {storagepool} {cachepool} - - For example:: - - ceph osd tier remove cold-storage hot-storage - - - -Removing a Writeback Cache --------------------------- - -Since a writeback cache may have modified data, you must take steps to ensure -that you do not lose any recent changes to objects in the cache before you -disable and remove it. - - -#. Change the cache mode to ``forward`` so that new and modified objects will - flush to the backing storage pool. :: - - ceph osd tier cache-mode {cachepool} forward - - For example:: - - ceph osd tier cache-mode hot-storage forward - - -#. Ensure that the cache pool has been flushed. This may take a few minutes:: - - rados -p {cachepool} ls - - If the cache pool still has objects, you can flush them manually. - For example:: - - rados -p {cachepool} cache-flush-evict-all - - -#. Remove the overlay so that clients will not direct traffic to the cache. :: - - ceph osd tier remove-overlay {storagetier} - - For example:: - - ceph osd tier remove-overlay cold-storage - - -#. Finally, remove the cache tier pool from the backing storage pool. :: - - ceph osd tier remove {storagepool} {cachepool} - - For example:: - - ceph osd tier remove cold-storage hot-storage - - -.. _Create a Pool: ../pools#create-a-pool -.. _Pools - Set Pool Values: ../pools#set-pool-values -.. _Placing Different Pools on Different OSDs: ../crush-map/#placing-different-pools-on-different-osds -.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter -.. _CRUSH Maps: ../crush-map -.. _Absolute Sizing: #absolute-sizing diff --git a/src/ceph/doc/rados/operations/control.rst b/src/ceph/doc/rados/operations/control.rst deleted file mode 100644 index 1a58076..0000000 --- a/src/ceph/doc/rados/operations/control.rst +++ /dev/null @@ -1,453 +0,0 @@ -.. 
index:: control, commands - -================== - Control Commands -================== - - -Monitor Commands -================ - -Monitor commands are issued using the ceph utility:: - - ceph [-m monhost] {command} - -The command is usually (though not always) of the form:: - - ceph {subsystem} {command} - - -System Commands -=============== - -Execute the following to display the current status of the cluster. :: - - ceph -s - ceph status - -Execute the following to display a running summary of the status of the cluster, -and major events. :: - - ceph -w - -Execute the following to show the monitor quorum, including which monitors are -participating and which one is the leader. :: - - ceph quorum_status - -Execute the following to query the status of a single monitor, including whether -or not it is in the quorum. :: - - ceph [-m monhost] mon_status - - -Authentication Subsystem -======================== - -To add a keyring for an OSD, execute the following:: - - ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} - -To list the cluster's keys and their capabilities, execute the following:: - - ceph auth ls - - -Placement Group Subsystem -========================= - -To display the statistics for all placement groups, execute the following:: - - ceph pg dump [--format {format}] - -The valid formats are ``plain`` (default) and ``json``. - -To display the statistics for all placement groups stuck in a specified state, -execute the following:: - - ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] - - -``--format`` may be ``plain`` (default) or ``json`` - -``--threshold`` defines how many seconds "stuck" is (default: 300) - -**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD -with the most up-to-date data to come back. - -**Unclean** Placement groups contain objects that are not replicated the desired number -of times. They should be recovering. 
- -**Stale** Placement groups are in an unknown state - the OSDs that host them have not -reported to the monitor cluster in a while (configured by -``mon_osd_report_timeout``). - -Delete "lost" objects, or revert them to their prior state: either a previous version -or, if they were just created, deletion. :: - - ceph pg {pgid} mark_unfound_lost revert|delete - - -OSD Subsystem -============= - -Query OSD subsystem status. :: - - ceph osd stat - -Write a copy of the most recent OSD map to a file. See -`osdmaptool`_. :: - - ceph osd getmap -o file - -.. _osdmaptool: ../../man/8/osdmaptool - -Write a copy of the crush map from the most recent OSD map to -a file. :: - - ceph osd getcrushmap -o file - -The foregoing is functionally equivalent to :: - - ceph osd getmap -o /tmp/osdmap - osdmaptool /tmp/osdmap --export-crush file - -Dump the OSD map. Valid formats for ``-f`` are ``plain`` and ``json``. If no -``--format`` option is given, the OSD map is dumped as plain text. :: - - ceph osd dump [--format {format}] - -Dump the OSD map as a tree with one line per OSD containing weight -and state. :: - - ceph osd tree [--format {format}] - -Find out where a specific object is or would be stored in the system:: - - ceph osd map <pool-name> <object-name> - -Add or move a new item (OSD) with the given id/name/weight at the specified -location. :: - - ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] - -Remove an existing item (OSD) from the CRUSH map. :: - - ceph osd crush remove {name} - -Remove an existing bucket from the CRUSH map. :: - - ceph osd crush remove {bucket-name} - -Move an existing bucket from one position in the hierarchy to another. :: - - ceph osd crush move {id} {loc1} [{loc2} ...] - -Set the weight of the item given by ``{name}`` to ``{weight}``. :: - - ceph osd crush reweight {name} {weight} - -Mark an OSD as lost. This may result in permanent data loss. Use with caution. :: - - ceph osd lost {id} [--yes-i-really-mean-it] - -Create a new OSD.
If no UUID is given, it will be set automatically when the OSD -starts up. :: - - ceph osd create [{uuid}] - -Remove the given OSD(s). :: - - ceph osd rm [{id}...] - -Query the current max_osd parameter in the OSD map. :: - - ceph osd getmaxosd - -Import the given crush map. :: - - ceph osd setcrushmap -i file - -Set the ``max_osd`` parameter in the OSD map. This is necessary when -expanding the storage cluster. :: - - ceph osd setmaxosd - -Mark OSD ``{osd-num}`` down. :: - - ceph osd down {osd-num} - -Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). :: - - ceph osd out {osd-num} - -Mark ``{osd-num}`` in the distribution (i.e. allocated data). :: - - ceph osd in {osd-num} - -Set or clear the pause flags in the OSD map. If set, no IO requests -will be sent to any OSD. Clearing the flags via unpause results in -resending pending requests. :: - - ceph osd pause - ceph osd unpause - -Set the weight of ``{osd-num}`` to ``{weight}``. Two OSDs with the -same weight will receive roughly the same number of I/O requests and -store approximately the same amount of data. ``ceph osd reweight`` -sets an override weight on the OSD. This value is in the range 0 to 1, -and forces CRUSH to re-place (1-weight) of the data that would -otherwise live on this drive. It does not change the weights assigned -to the buckets above the OSD in the crush map, and is a corrective -measure in case the normal CRUSH distribution is not working out quite -right. For instance, if one of your OSDs is at 90% and the others are -at 50%, you could reduce this weight to try and compensate for it. :: - - ceph osd reweight {osd-num} {weight} - -Reweights all the OSDs by reducing the weight of OSDs which are -heavily overused. By default it will adjust the weights downward on -OSDs which have 120% of the average utilization, but if you include -threshold it will use that percentage instead. 
:: - - ceph osd reweight-by-utilization [threshold] - -Describes what reweight-by-utilization would do. :: - - ceph osd test-reweight-by-utilization - -Adds/removes the address to/from the blacklist. When adding an address, -you can specify how long it should be blacklisted in seconds; otherwise, -it will default to 1 hour. A blacklisted address is prevented from -connecting to any OSD. Blacklisting is most often used to prevent a -lagging metadata server from making bad changes to data on the OSDs. - -These commands are mostly only useful for failure testing, as -blacklists are normally maintained automatically and shouldn't need -manual intervention. :: - - ceph osd blacklist add ADDRESS[:source_port] [TIME] - ceph osd blacklist rm ADDRESS[:source_port] - -Creates/deletes a snapshot of a pool. :: - - ceph osd pool mksnap {pool-name} {snap-name} - ceph osd pool rmsnap {pool-name} {snap-name} - -Creates/deletes/renames a storage pool. :: - - ceph osd pool create {pool-name} pg_num [pgp_num] - ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] - ceph osd pool rename {old-name} {new-name} - -Changes a pool setting. :: - - ceph osd pool set {pool-name} {field} {value} - -Valid fields are: - - * ``size``: Sets the number of copies of data in the pool. - * ``pg_num``: The placement group number. - * ``pgp_num``: Effective number when calculating pg placement. - * ``crush_ruleset``: rule number for mapping placement. - -Get the value of a pool setting. :: - - ceph osd pool get {pool-name} {field} - -Valid fields are: - - * ``pg_num``: The placement group number. - * ``pgp_num``: Effective number of placement groups when calculating placement. - * ``lpg_num``: The number of local placement groups. - * ``lpgp_num``: The number used for placing the local placement groups. - - -Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. :: - - ceph osd scrub {osd-num} - -Sends a repair command to OSD.N. 
To send the command to all OSDs, use ``*``. :: - - ceph osd repair N - -Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES`` -in write requests of ``BYTES_PER_WRITE`` each. By default, the test -writes 1 GB in total in 4-MB increments. -The benchmark is non-destructive and will not overwrite existing live -OSD data, but might temporarily affect the performance of clients -concurrently accessing the OSD. :: - - ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] - - -MDS Subsystem -============= - -Change configuration parameters on a running mds. :: - - ceph tell mds.{mds-id} injectargs --{switch} {value} [--{switch} {value}] - -Example:: - - ceph tell mds.0 injectargs --debug_ms 1 --debug_mds 10 - -Enables debug messages. :: - - ceph mds stat - -Displays the status of all metadata servers. :: - - ceph mds fail 0 - -Marks the active MDS as failed, triggering failover to a standby if present. - -.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap - - -Mon Subsystem -============= - -Show monitor stats:: - - ceph mon stat - - e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c - - -The ``quorum`` list at the end lists monitor nodes that are part of the current quorum. - -This is also available more directly:: - - ceph quorum_status -f json-pretty - -.. 
code-block:: javascript - - { - "election_epoch": 6, - "quorum": [ - 0, - 1, - 2 - ], - "quorum_names": [ - "a", - "b", - "c" - ], - "quorum_leader_name": "a", - "monmap": { - "epoch": 2, - "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", - "modified": "2016-12-26 14:42:09.288066", - "created": "2016-12-26 14:42:03.573585", - "features": { - "persistent": [ - "kraken" - ], - "optional": [] - }, - "mons": [ - { - "rank": 0, - "name": "a", - "addr": "127.0.0.1:40000\/0", - "public_addr": "127.0.0.1:40000\/0" - }, - { - "rank": 1, - "name": "b", - "addr": "127.0.0.1:40001\/0", - "public_addr": "127.0.0.1:40001\/0" - }, - { - "rank": 2, - "name": "c", - "addr": "127.0.0.1:40002\/0", - "public_addr": "127.0.0.1:40002\/0" - } - ] - } - } - - -The above will block until a quorum is reached. - -For a status of just the monitor you connect to (use ``-m HOST:PORT`` -to select):: - - ceph mon_status -f json-pretty - - -.. code-block:: javascript - - { - "name": "b", - "rank": 1, - "state": "peon", - "election_epoch": 6, - "quorum": [ - 0, - 1, - 2 - ], - "features": { - "required_con": "9025616074522624", - "required_mon": [ - "kraken" - ], - "quorum_con": "1152921504336314367", - "quorum_mon": [ - "kraken" - ] - }, - "outside_quorum": [], - "extra_probe_peers": [], - "sync_provider": [], - "monmap": { - "epoch": 2, - "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", - "modified": "2016-12-26 14:42:09.288066", - "created": "2016-12-26 14:42:03.573585", - "features": { - "persistent": [ - "kraken" - ], - "optional": [] - }, - "mons": [ - { - "rank": 0, - "name": "a", - "addr": "127.0.0.1:40000\/0", - "public_addr": "127.0.0.1:40000\/0" - }, - { - "rank": 1, - "name": "b", - "addr": "127.0.0.1:40001\/0", - "public_addr": "127.0.0.1:40001\/0" - }, - { - "rank": 2, - "name": "c", - "addr": "127.0.0.1:40002\/0", - "public_addr": "127.0.0.1:40002\/0" - } - ] - } - } - -A dump of the monitor state:: - - ceph mon dump - - dumped monmap epoch 2 - epoch 2 - fsid 
ba807e74-b64f-4b72-b43f-597dfe60ddbc - last_changed 2016-12-26 14:42:09.288066 - created 2016-12-26 14:42:03.573585 - 0: 127.0.0.1:40000/0 mon.a - 1: 127.0.0.1:40001/0 mon.b - 2: 127.0.0.1:40002/0 mon.c - diff --git a/src/ceph/doc/rados/operations/crush-map-edits.rst b/src/ceph/doc/rados/operations/crush-map-edits.rst deleted file mode 100644 index 5222270..0000000 --- a/src/ceph/doc/rados/operations/crush-map-edits.rst +++ /dev/null @@ -1,654 +0,0 @@ -Manually editing a CRUSH Map -============================ - -.. note:: Manually editing the CRUSH map is considered an advanced - administrator operation. All CRUSH changes that are - necessary for the overwhelming majority of installations are - possible via the standard ceph CLI and do not require manual - CRUSH map edits. If you have identified a use case where - manual edits *are* necessary, consider contacting the Ceph - developers so that future versions of Ceph can make this - unnecessary. - -To edit an existing CRUSH map: - -#. `Get the CRUSH map`_. -#. `Decompile`_ the CRUSH map. -#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. -#. `Recompile`_ the CRUSH map. -#. `Set the CRUSH map`_. - -To activate CRUSH map rules for a specific pool, identify the common ruleset -number for those rules and specify that ruleset number for the pool. See `Set -Pool Values`_ for details. - -.. _Get the CRUSH map: #getcrushmap -.. _Decompile: #decompilecrushmap -.. _Devices: #crushmapdevices -.. _Buckets: #crushmapbuckets -.. _Rules: #crushmaprules -.. _Recompile: #compilecrushmap -.. _Set the CRUSH map: #setcrushmap -.. _Set Pool Values: ../pools#setpoolvalues - -.. _getcrushmap: - -Get a CRUSH Map ---------------- - -To get the CRUSH map for your cluster, execute the following:: - - ceph osd getcrushmap -o {compiled-crushmap-filename} - -Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since -the CRUSH map is in a compiled form, you must decompile it first before you can -edit it. - -.. 
_decompilecrushmap: - -Decompile a CRUSH Map ---------------------- - -To decompile a CRUSH map, execute the following:: - - crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} - - -Sections --------- - -There are six main sections to a CRUSH Map. - -#. **tunables:** The preamble at the top of the map describes any *tunables* - for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These - correct for old bugs, optimizations, or other changes in behavior that have - been made over the years to improve CRUSH's behavior. - -#. **devices:** Devices are individual ``ceph-osd`` daemons that can - store data. - -#. **types:** Bucket ``types`` define the types of buckets used in - your CRUSH hierarchy. Buckets consist of a hierarchical aggregation - of storage locations (e.g., rows, racks, chassis, hosts, etc.) and - their assigned weights. - -#. **buckets:** Once you define bucket types, you must define each node - in the hierarchy, its type, and which devices or other nodes it - contains. - -#. **rules:** Rules define policy about how data is distributed across - devices in the hierarchy. - -#. **choose_args:** Choose_args are alternative weights associated with - the hierarchy that have been adjusted to optimize data placement. A single - choose_args map can be used for the entire cluster, or one can be - created for each individual pool. - - -.. _crushmapdevices: - -CRUSH Map Devices ------------------ - -Devices are individual ``ceph-osd`` daemons that can store data. You -will normally have one defined here for each OSD daemon in your -cluster. Devices are identified by an id (a non-negative integer) and -a name, normally ``osd.N`` where ``N`` is the device id. - -Devices may also have a *device class* associated with them (e.g., -``hdd`` or ``ssd``), allowing them to be conveniently targeted by a -crush rule.
- -:: - - # devices - device {num} {osd.name} [class {class}] - -For example:: - - # devices - device 0 osd.0 class ssd - device 1 osd.1 class hdd - device 2 osd.2 - device 3 osd.3 - -In most cases, each device maps to a single ``ceph-osd`` daemon. This -is normally a single storage device, a pair of devices (for example, -one for data and one for a journal or metadata), or in some cases a -small RAID device. - - - - - -CRUSH Map Bucket Types ----------------------- - -The second list in the CRUSH map defines 'bucket' types. Buckets facilitate -a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent -physical locations in a hierarchy. Nodes aggregate other nodes or leaves. -Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage -media. - -.. tip:: The term "bucket" used in the context of CRUSH means a node in - the hierarchy, i.e. a location or a piece of physical hardware. It - is a different concept from the term "bucket" when used in the - context of RADOS Gateway APIs. - -To add a bucket type to the CRUSH map, create a new line under your list of -bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. -By convention, there is one leaf bucket and it is ``type 0``; however, you may -give it any name you like (e.g., osd, disk, drive, storage, etc.):: - - #types - type {num} {bucket-name} - -For example:: - - # types - type 0 osd - type 1 host - type 2 chassis - type 3 rack - type 4 row - type 5 pdu - type 6 pod - type 7 room - type 8 datacenter - type 9 region - type 10 root - - - -.. _crushmapbuckets: - -CRUSH Map Bucket Hierarchy --------------------------- - -The CRUSH algorithm distributes data objects among storage devices according -to a per-device weight value, approximating a uniform probability distribution. -CRUSH distributes objects and their replicas according to the hierarchical -cluster map you define. 
Your CRUSH map represents the available storage -devices and the logical elements that contain them. - -To map placement groups to OSDs across failure domains, a CRUSH map defines a -hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH -map). The purpose of creating a bucket hierarchy is to segregate the -leaf nodes by their failure domains, such as hosts, chassis, racks, power -distribution units, pods, rows, rooms, and data centers. With the exception of -the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and -you may define it according to your own needs. - -We recommend adapting your CRUSH map to your firm's hardware naming conventions -and using instance names that reflect the physical hardware. Your naming -practice can make it easier to administer the cluster and troubleshoot -problems when an OSD and/or other hardware malfunctions and the administrator -needs access to physical hardware. - -In the following example, the bucket hierarchy has a leaf bucket named ``osd``, -and two node buckets named ``host`` and ``rack`` respectively. - -.. ditaa:: - +-----------+ - | {o}rack | - | Bucket | - +-----+-----+ - | - +---------------+---------------+ - | | - +-----+-----+ +-----+-----+ - | {o}host | | {o}host | - | Bucket | | Bucket | - +-----+-----+ +-----+-----+ - | | - +-------+-------+ +-------+-------+ - | | | | - +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ - | osd | | osd | | osd | | osd | - | Bucket | | Bucket | | Bucket | | Bucket | - +-----------+ +-----------+ +-----------+ +-----------+ - -.. note:: The higher numbered ``rack`` bucket type aggregates the lower - numbered ``host`` bucket type. - -Since leaf nodes reflect storage devices declared under the ``#devices`` list -at the beginning of the CRUSH map, you do not need to declare them as bucket -instances.
The second lowest bucket type in your hierarchy usually aggregates -the devices (i.e., it's usually the computer containing the storage media, and -uses whatever term you prefer to describe it, such as "node", "computer", -"server," "host", "machine", etc.). In high density environments, it is -increasingly common to see multiple hosts/nodes per chassis. You should account -for chassis failure too--e.g., the need to pull a chassis if a node fails may -result in bringing down numerous hosts/nodes and their OSDs. - -When declaring a bucket instance, you must specify its type, give it a unique -name (string), assign it a unique ID expressed as a negative integer (optional), -specify a weight relative to the total capacity/capability of its item(s), -specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``, -reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. -The items may consist of node buckets or leaves. Items may have a weight that -reflects the relative weight of the item. - -You may declare a node bucket with the following syntax:: - - [bucket-type] [bucket-name] { - id [a unique negative numeric ID] - weight [the relative capacity/capability of the item(s)] - alg [the bucket type: uniform | list | tree | straw ] - hash [the hash type: 0 by default] - item [item-name] weight [weight] - } - -For example, using the diagram above, we would define two host buckets -and one rack bucket. The OSDs are declared as items within the host buckets:: - - host node1 { - id -1 - alg straw - hash 0 - item osd.0 weight 1.00 - item osd.1 weight 1.00 - } - - host node2 { - id -2 - alg straw - hash 0 - item osd.2 weight 1.00 - item osd.3 weight 1.00 - } - - rack rack1 { - id -3 - alg straw - hash 0 - item node1 weight 2.00 - item node2 weight 2.00 - } - -.. note:: In the foregoing example, note that the rack bucket does not contain - any OSDs. 
Rather it contains lower level host buckets, and includes the - sum total of their weight in the item entry. - -.. topic:: Bucket Types - - Ceph supports four bucket types, each representing a tradeoff between - performance and reorganization efficiency. If you are unsure of which bucket - type to use, we recommend using a ``straw`` bucket. For a detailed - discussion of bucket types, refer to - `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, - and more specifically to **Section 3.4**. The bucket types are: - - #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same - weight. For example, when firms commission or decommission hardware, they - typically do so with many machines that have exactly the same physical - configuration (e.g., bulk purchases). When storage devices have exactly - the same weight, you may use the ``uniform`` bucket type, which allows - CRUSH to map replicas into uniform buckets in constant time. With - non-uniform weights, you should use another bucket algorithm. - - #. **List**: List buckets aggregate their content as linked lists. Based on - the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, - a list is a natural and intuitive choice for an **expanding cluster**: - either an object is relocated to the newest device with some appropriate - probability, or it remains on the older devices as before. The result is - optimal data migration when items are added to the bucket. Items removed - from the middle or tail of the list, however, can result in a significant - amount of unnecessary movement, making list buckets most suitable for - circumstances in which they **never (or very rarely) shrink**. - - #. **Tree**: Tree buckets use a binary search tree. They are more efficient - than list buckets when a bucket contains a larger set of items. 
Based on - the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, - tree buckets reduce the placement time to O(log n), making them - suitable for managing much larger sets of devices or nested buckets. - - #. **Straw:** List and Tree buckets use a divide and conquer strategy - in a way that either gives certain items precedence (e.g., those - at the beginning of a list) or obviates the need to consider entire - subtrees of items at all. That improves the performance of the replica - placement process, but can also introduce suboptimal reorganization - behavior when the contents of a bucket change due to an addition, removal, - or re-weighting of an item. The straw bucket type allows all items to - fairly “compete” against each other for replica placement through a - process analogous to a draw of straws. - -.. topic:: Hash - - Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. - Enter ``0`` as your hash setting to select ``rjenkins1``. - - -.. _weightingbucketitems: - -.. topic:: Weighting Bucket Items - - Ceph expresses bucket weights as doubles, which allows for fine - weighting. A weight is the relative difference between device capacities. We - recommend using ``1.00`` as the relative weight for a 1TB storage device. - In such a scenario, a weight of ``0.5`` would represent approximately 500GB, - and a weight of ``3.00`` would represent approximately 3TB. Higher level - buckets have a weight that is the sum total of the leaf items aggregated by - the bucket. - - A bucket item weight is one dimensional, but you may also calculate your - item weights to reflect the performance of the storage drive.
For example, - if you have many 1TB drives where some have relatively low data transfer - rate and the others have a relatively high data transfer rate, you may - weight them differently, even though they have the same capacity (e.g., - a weight of 0.80 for the first set of drives with lower total throughput, - and 1.20 for the second set of drives with higher total throughput). - - -.. _crushmaprules: - -CRUSH Map Rules ---------------- - -CRUSH maps support the notion of 'CRUSH rules', which are the rules that -determine data placement for a pool. For large clusters, you will likely create -many pools where each pool may have its own CRUSH ruleset and rules. The default -CRUSH map has a rule for each pool, and one ruleset assigned to each of the -default pools. - -.. note:: In most cases, you will not need to modify the default rules. When - you create a new pool, its default ruleset is ``0``. - - -CRUSH rules define placement and replication strategies or distribution policies -that allow you to specify exactly how CRUSH places object replicas. For -example, you might create a rule selecting a pair of targets for 2-way -mirroring, another rule for selecting three targets in two different data -centers for 3-way mirroring, and yet another rule for erasure coding over six -storage devices. For a detailed discussion of CRUSH rules, refer to -`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, -and more specifically to **Section 3.2**. - -A rule takes the following form:: - - rule <rulename> { - - ruleset <ruleset> - type [ replicated | erasure ] - min_size <min-size> - max_size <max-size> - step take <bucket-name> [class <device-class>] - step [choose|chooseleaf] [firstn|indep] <N> <bucket-type> - step emit - } - - -``ruleset`` - -:Description: A means of classifying a rule as belonging to a set of rules. - Activated by `setting the ruleset in a pool`_. - -:Purpose: A component of the rule mask. 
-:Type: Integer -:Required: Yes -:Default: 0 - -.. _setting the ruleset in a pool: ../pools#setpoolvalues - - -``type`` - -:Description: Describes a rule for either a storage drive (replicated) - or a RAID. - -:Purpose: A component of the rule mask. -:Type: String -:Required: Yes -:Default: ``replicated`` -:Valid Values: Currently only ``replicated`` and ``erasure`` - -``min_size`` - -:Description: If a pool makes fewer replicas than this number, CRUSH will - **NOT** select this rule. - -:Type: Integer -:Purpose: A component of the rule mask. -:Required: Yes -:Default: ``1`` - -``max_size`` - -:Description: If a pool makes more replicas than this number, CRUSH will - **NOT** select this rule. - -:Type: Integer -:Purpose: A component of the rule mask. -:Required: Yes -:Default: 10 - - -``step take <bucket-name> [class <device-class>]`` - -:Description: Takes a bucket name, and begins iterating down the tree. - If the ``device-class`` is specified, it must match - a class previously used when defining a device. All - devices that do not belong to the class are excluded. -:Purpose: A component of the rule. -:Required: Yes -:Example: ``step take data`` - - -``step choose firstn {num} type {bucket-type}`` - -:Description: Selects the number of buckets of the given type. The number is - usually the number of replicas in the pool (i.e., pool size). - - - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). - - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. - - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. - -:Purpose: A component of the rule. -:Prerequisite: Follows ``step take`` or ``step choose``. -:Example: ``step choose firstn 1 type row`` - - -``step chooseleaf firstn {num} type {bucket-type}`` - -:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf - node from the subtree of each bucket in the set of buckets. 
The - number of buckets in the set is usually the number of replicas in - the pool (i.e., pool size). - - - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). - - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. - - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. - -:Purpose: A component of the rule. Usage removes the need to select a device using two steps. -:Prerequisite: Follows ``step take`` or ``step choose``. -:Example: ``step chooseleaf firstn 0 type row`` - - - -``step emit`` - -:Description: Outputs the current value and empties the stack. Typically used - at the end of a rule, but may also be used to pick from different - trees in the same rule. - -:Purpose: A component of the rule. -:Prerequisite: Follows ``step choose``. -:Example: ``step emit`` - -.. important:: To activate one or more rules with a common ruleset number to a - pool, set the ruleset number of the pool. - - -Placing Different Pools on Different OSDS: -========================================== - -Suppose you want to have most pools default to OSDs backed by large hard drives, -but have some pools mapped to OSDs backed by fast solid-state drives (SSDs). -It's possible to have multiple independent CRUSH hierarchies within the same -CRUSH map. 
Define two hierarchies with two different root nodes--one for hard -disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown -below:: - - device 0 osd.0 - device 1 osd.1 - device 2 osd.2 - device 3 osd.3 - device 4 osd.4 - device 5 osd.5 - device 6 osd.6 - device 7 osd.7 - - host ceph-osd-ssd-server-1 { - id -1 - alg straw - hash 0 - item osd.0 weight 1.00 - item osd.1 weight 1.00 - } - - host ceph-osd-ssd-server-2 { - id -2 - alg straw - hash 0 - item osd.2 weight 1.00 - item osd.3 weight 1.00 - } - - host ceph-osd-platter-server-1 { - id -3 - alg straw - hash 0 - item osd.4 weight 1.00 - item osd.5 weight 1.00 - } - - host ceph-osd-platter-server-2 { - id -4 - alg straw - hash 0 - item osd.6 weight 1.00 - item osd.7 weight 1.00 - } - - root platter { - id -5 - alg straw - hash 0 - item ceph-osd-platter-server-1 weight 2.00 - item ceph-osd-platter-server-2 weight 2.00 - } - - root ssd { - id -6 - alg straw - hash 0 - item ceph-osd-ssd-server-1 weight 2.00 - item ceph-osd-ssd-server-2 weight 2.00 - } - - rule data { - ruleset 0 - type replicated - min_size 2 - max_size 2 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule metadata { - ruleset 1 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule rbd { - ruleset 2 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule platter { - ruleset 3 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule ssd { - ruleset 4 - type replicated - min_size 0 - max_size 4 - step take ssd - step chooseleaf firstn 0 type host - step emit - } - - rule ssd-primary { - ruleset 5 - type replicated - min_size 5 - max_size 10 - step take ssd - step chooseleaf firstn 1 type host - step emit - step take platter - step chooseleaf firstn -1 type host - step emit - } - -You can then 
set a pool to use the SSD rule by:: - - ceph osd pool set <poolname> crush_ruleset 4 - -Similarly, using the ``ssd-primary`` rule will cause each placement group in the -pool to be placed with an SSD as the primary and platters as the replicas. - - -Tuning CRUSH, the hard way --------------------------- - -If you can ensure that all clients are running recent code, you can -adjust the tunables by extracting the CRUSH map, modifying the values, -and reinjecting it into the cluster. - -* Extract the latest CRUSH map:: - - ceph osd getcrushmap -o /tmp/crush - -* Adjust tunables. These values appear to offer the best behavior - for both large and small clusters we tested with. You will need to - additionally specify the ``--enable-unsafe-tunables`` argument to - ``crushtool`` for this to work. Please use this option with - extreme care.:: - - crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new - -* Reinject modified map:: - - ceph osd setcrushmap -i /tmp/crush.new - -Legacy values -------------- - -For reference, the legacy values for the CRUSH tunables can be set -with:: - - crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy - -Again, the special ``--enable-unsafe-tunables`` option is required. -Further, as noted above, be careful running old versions of the -``ceph-osd`` daemon after reverting to legacy values as the feature -bit is not perfectly enforced. 
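The extract, adjust, and reinject sequence from "Tuning CRUSH, the hard way" can be sketched as a small command builder. This is only an illustrative sketch: the tunable names and values are copied from the commands above, while the function names, the default file paths as parameters, and the position of ``--enable-unsafe-tunables`` on the command line are assumptions. Nothing is executed; the commands are only printed.

```python
import shlex

# Values taken from the commands shown in the text above.
TESTED = {
    "choose-local-tries": 0,
    "choose-local-fallback-tries": 0,
    "choose-total-tries": 50,
}
LEGACY = {
    "choose-local-tries": 2,
    "choose-local-fallback-tries": 5,
    "choose-total-tries": 19,
    "chooseleaf-descend-once": 0,
    "chooseleaf-vary-r": 0,
}


def crushtool_args(values, src, dst):
    """Build the crushtool argument list that rewrites tunables from src into dst."""
    # Assumption: the flag is placed right after the input file; the text only
    # says it must be given, not where.
    args = ["crushtool", "-i", src, "--enable-unsafe-tunables"]
    for name, value in values.items():
        args += [f"--set-{name}", str(value)]
    return args + ["-o", dst]


def tuning_commands(values, src="/tmp/crush", dst="/tmp/crush.new"):
    """Return the three steps (extract, adjust, reinject) as argument lists."""
    return [
        ["ceph", "osd", "getcrushmap", "-o", src],
        crushtool_args(values, src, dst),
        ["ceph", "osd", "setcrushmap", "-i", dst],
    ]


for command in tuning_commands(TESTED):
    print(shlex.join(command))
```

Passing ``LEGACY`` instead of ``TESTED`` yields the legacy ``crushtool`` invocation. Review the printed commands before running any of them by hand, since the warning about using unsafe tunables with extreme care still applies.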
diff --git a/src/ceph/doc/rados/operations/crush-map.rst b/src/ceph/doc/rados/operations/crush-map.rst deleted file mode 100644 index 05fa4ff..0000000 --- a/src/ceph/doc/rados/operations/crush-map.rst +++ /dev/null @@ -1,956 +0,0 @@ -============ - CRUSH Maps -============ - -The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm -determines how to store and retrieve data by computing data storage locations. -CRUSH empowers Ceph clients to communicate with OSDs directly rather than -through a centralized server or broker. With an algorithmically determined -method of storing and retrieving data, Ceph avoids a single point of failure, a -performance bottleneck, and a physical limit to its scalability. - -CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly -store and retrieve data in OSDs with a uniform distribution of data across the -cluster. For a detailed discussion of CRUSH, see -`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ - -CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of -'buckets' for aggregating the devices into physical locations, and a list of -rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By -reflecting the underlying physical organization of the installation, CRUSH can -model—and thereby address—potential sources of correlated device failures. -Typical sources include physical proximity, a shared power source, and a shared -network. By encoding this information into the cluster map, CRUSH placement -policies can separate object replicas across different failure domains while -still maintaining the desired distribution. For example, to address the -possibility of concurrent failures, it may be desirable to ensure that data -replicas are on devices using different shelves, racks, power supplies, -controllers, and/or physical locations. 
- -When you deploy OSDs they are automatically placed within the CRUSH map under a -``host`` node named with the hostname for the host they are running on. This, -combined with the default CRUSH failure domain, ensures that replicas or erasure -code shards are separated across hosts and a single host failure will not -affect availability. For larger clusters, however, administrators should carefully consider their choice of failure domain. Separating replicas across racks, -for example, is common for mid- to large-sized clusters. - - -CRUSH Location -============== - -The location of an OSD in terms of the CRUSH map's hierarchy is -referred to as a ``crush location``. This location specifier takes the -form of a list of key and value pairs describing a position. For -example, if an OSD is in a particular row, rack, chassis and host, and -is part of the 'default' CRUSH tree (this is the case for the vast -majority of clusters), its crush location could be described as:: - - root=default row=a rack=a2 chassis=a2a host=a2a1 - -Note: - -#. Note that the order of the keys does not matter. -#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default - these include root, datacenter, room, row, pod, pdu, rack, chassis and host, - but those types can be customized to be anything appropriate by modifying - the CRUSH map. -#. Not all keys need to be specified. For example, by default, Ceph - automatically sets a ``ceph-osd`` daemon's location to be - ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). - -The crush location for an OSD is normally expressed via the ``crush location`` -config option being set in the ``ceph.conf`` file. Each time the OSD starts, -it verifies it is in the correct location in the CRUSH map and, if it is not, -it moves itself.
To disable this automatic CRUSH map management, add the -following to your configuration file in the ``[osd]`` section:: - - osd crush update on start = false - - -Custom location hooks ---------------------- - -A customized location hook can be used to generate a more complete -crush location on startup. The sample ``ceph-crush-location`` utility -will generate a CRUSH location string for a given daemon. The -location is based on, in order of preference: - -#. A ``crush location`` option in ceph.conf. -#. A default of ``root=default host=HOSTNAME`` where the hostname is - generated with the ``hostname -s`` command. - -This is not useful by itself, as the OSD itself has the exact same -behavior. However, the script can be modified to provide additional -location fields (for example, the rack or datacenter), and then the -hook enabled via the config option:: - - crush location hook = /path/to/customized-ceph-crush-location - -This hook is passed several arguments (below) and should output a single line -to stdout with the CRUSH location description.:: - - $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE - -where the cluster name is typically 'ceph', the id is the daemon -identifier (the OSD number), and the daemon type is typically ``osd``. - - -CRUSH structure -=============== - -The CRUSH map consists of, loosely speaking, a hierarchy describing -the physical topology of the cluster, and a set of rules defining -policy about how we place data on those devices. The hierarchy has -devices (``ceph-osd`` daemons) at the leaves, and internal nodes -corresponding to other physical features or groupings: hosts, racks, -rows, datacenters, and so on. The rules describe how replicas are -placed in terms of that hierarchy (e.g., 'three replicas in different -racks'). - -Devices -------- - -Devices are individual ``ceph-osd`` daemons that can store data. You -will normally have one defined here for each OSD daemon in your -cluster. 
Devices are identified by an id (a non-negative integer) and -a name, normally ``osd.N`` where ``N`` is the device id. - -Devices may also have a *device class* associated with them (e.g., -``hdd`` or ``ssd``), allowing them to be conveniently targeted by a -crush rule. - -Types and Buckets ------------------ - -A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, -racks, rows, etc. The CRUSH map defines a series of *types* that are -used to describe these nodes. By default, these types include: - -- osd (or device) -- host -- chassis -- rack -- row -- pdu -- pod -- room -- datacenter -- region -- root - -Most clusters make use of only a handful of these types, and others -can be defined as needed. - -The hierarchy is built with devices (normally type ``osd``) at the -leaves, interior nodes with non-device types, and a root node of type -``root``. For example, - -.. ditaa:: - - +-----------------+ - | {o}root default | - +--------+--------+ - | - +---------------+---------------+ - | | - +-------+-------+ +-----+-------+ - | {o}host foo | | {o}host bar | - +-------+-------+ +-----+-------+ - | | - +-------+-------+ +-------+-------+ - | | | | - +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ - | osd.0 | | osd.1 | | osd.2 | | osd.3 | - +-----------+ +-----------+ +-----------+ +-----------+ - -Each node (device or bucket) in the hierarchy has a *weight* -associated with it, indicating the relative proportion of the total -data that device or hierarchy subtree should store. Weights are set -at the leaves, indicating the size of the device, and automatically -sum up the tree from there, such that the weight of the default node -will be the total of all devices contained beneath it. Normally -weights are in units of terabytes (TB).
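The way weights roll up from the leaves can be illustrated with a short sketch. It models the ``default``/``foo``/``bar`` hierarchy from the diagram as nested dictionaries. The dictionary layout, the function name, and the device sizes (1, 1, 2 and 4 TB) are assumptions chosen for illustration and are not taken from a real cluster.

```python
def subtree_weight(node):
    """Return a device's own weight, or the sum of the weights beneath a bucket."""
    if "children" in node:
        return sum(subtree_weight(child) for child in node["children"])
    return node["weight"]


# Hypothetical device sizes in TB; only the shape of the tree follows the diagram.
tree = {
    "name": "default",
    "children": [
        {"name": "foo", "children": [
            {"name": "osd.0", "weight": 1.0},
            {"name": "osd.1", "weight": 1.0},
        ]},
        {"name": "bar", "children": [
            {"name": "osd.2", "weight": 2.0},
            {"name": "osd.3", "weight": 4.0},
        ]},
    ],
}

total = subtree_weight(tree)
for host in tree["children"]:
    weight = subtree_weight(host)
    # The share is the relative proportion of data the subtree should store.
    print(f"{host['name']}: weight {weight}, share {weight / total:.2f}")
```

With these assumed sizes, host ``bar`` carries three quarters of the total weight and so should receive roughly three quarters of the data.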
- -You can get a simple view of the CRUSH hierarchy for your cluster, -including the weights, with:: - - ceph osd crush tree - -Rules ------ - -Rules define policy about how data is distributed across the devices -in the hierarchy. - -CRUSH rules define placement and replication strategies or -distribution policies that allow you to specify exactly how CRUSH -places object replicas. For example, you might create a rule selecting -a pair of targets for 2-way mirroring, another rule for selecting -three targets in two different data centers for 3-way mirroring, and -yet another rule for erasure coding over six storage devices. For a -detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, -Scalable, Decentralized Placement of Replicated Data`_, and more -specifically to **Section 3.2**. - -In almost all cases, CRUSH rules can be created via the CLI by -specifying the *pool type* they will be used for (replicated or -erasure coded), the *failure domain*, and optionally a *device class*. -In rare cases rules must be written by hand by manually editing the -CRUSH map. - -You can see what rules are defined for your cluster with:: - - ceph osd crush rule ls - -You can view the contents of the rules with:: - - ceph osd crush rule dump - -Device classes --------------- - -Each device can optionally have a *class* associated with it. By -default, OSDs automatically set their class on startup to either -``hdd``, ``ssd``, or ``nvme`` based on the type of device they are backed -by. - -The device class for one or more OSDs can be explicitly set with:: - - ceph osd crush set-device-class <class> <osd-name> [...] - -Once a device class is set, it cannot be changed to another class -until the old class is unset with:: - - ceph osd crush rm-device-class <osd-name> [...] - -This allows administrators to set device classes without the class -being changed on OSD restart or by some other script.
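The set-then-unset rule for device classes can be pictured with a toy model. This is not Ceph code: the class name, method names, and error type are hypothetical, and treating a repeated assignment of the same class as a no-op is an assumption. It only mirrors the rule stated above, namely that a class must be removed before a different one can be assigned.

```python
class DeviceClassTable:
    """Toy in-memory model of per-OSD device classes (hypothetical, not Ceph's API)."""

    def __init__(self):
        self._classes = {}

    def set_device_class(self, device_class, *osds):
        # Refuse the whole request if any OSD already has a different class.
        for osd in osds:
            current = self._classes.get(osd)
            if current is not None and current != device_class:
                raise ValueError(
                    f"{osd} already has class {current!r}; remove it first"
                )
        for osd in osds:
            self._classes[osd] = device_class

    def rm_device_class(self, *osds):
        for osd in osds:
            self._classes.pop(osd, None)

    def get(self, osd):
        return self._classes.get(osd)


table = DeviceClassTable()
table.set_device_class("hdd", "osd.0")
try:
    table.set_device_class("ssd", "osd.0")  # rejected: class is already set
except ValueError as error:
    print(error)
table.rm_device_class("osd.0")
table.set_device_class("ssd", "osd.0")      # accepted after the removal
print(table.get("osd.0"))
```

Because an explicit assignment is never overwritten silently in this model, a restart or a second script cannot change the class without first removing it, which is the property the text describes.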
- -A placement rule that targets a specific device class can be created with:: - - ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> - -A pool can then be changed to use the new rule with:: - - ceph osd pool set <pool-name> crush_rule <rule-name> - -Device classes are implemented by creating a "shadow" CRUSH hierarchy -for each device class in use that contains only devices of that class. -Rules can then distribute data over the shadow hierarchy. One nice -thing about this approach is that it is fully backward compatible with -old Ceph clients. You can view the CRUSH hierarchy with shadow items -with:: - - ceph osd crush tree --show-shadow - - -Weight sets ------------- - -A *weight set* is an alternative set of weights to use when -calculating data placement. The normal weights associated with each -device in the CRUSH map are set based on the device size and indicate -how much data we *should* be storing where. However, because CRUSH is -based on a pseudorandom placement process, there is always some -variation from this ideal distribution, the same way that rolling a -die sixty times will not result in rolling exactly 10 ones and 10 -sixes. Weight sets allow the cluster to do a numerical optimization -based on the specifics of your cluster (hierarchy, pools, etc.) to achieve -a balanced distribution. - -There are two types of weight sets supported: - - #. A **compat** weight set is a single alternative set of weights for - each device and node in the cluster. This is not well-suited for - correcting for all anomalies (for example, placement groups for - different pools may be different sizes and have different load - levels, but will be mostly treated the same by the balancer).
- However, compat weight sets have the huge advantage that they are - *backward compatible* with previous versions of Ceph, which means - that even though weight sets were first introduced in Luminous - v12.2.z, older clients (e.g., firefly) can still connect to the - cluster when a compat weight set is being used to balance data. - #. A **per-pool** weight set is more flexible in that it allows - placement to be optimized for each data pool. Additionally, - weights can be adjusted for each position of placement, allowing - the optimizer to correct for a subtle skew of data toward devices - with small weights relative to their peers (an effect that is - usually only apparent in very large clusters but which can cause - balancing problems). - -When weight sets are in use, the weights associated with each node in -the hierarchy are visible as a separate column (labeled either -``(compat)`` or the pool name) from the command:: - - ceph osd crush tree - -When both *compat* and *per-pool* weight sets are in use, data -placement for a particular pool will use its own per-pool weight set -if present. If not, it will use the compat weight set if present. If -neither is present, it will use the normal CRUSH weights. - -Although weight sets can be set up and manipulated by hand, it is -recommended that the *balancer* module be enabled to do so -automatically. - - -Modifying the CRUSH map -======================= - -.. _addosd: - -Add/Move an OSD ---------------- - -.. note: OSDs are normally automatically added to the CRUSH map when - the OSD is created. This command is rarely needed. - -To add or move an OSD in the CRUSH map of a running cluster:: - - ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] - -Where: - -``name`` - -:Description: The full name of the OSD. -:Type: String -:Required: Yes -:Example: ``osd.0`` - - -``weight`` - -:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
-:Type: Double -:Required: Yes -:Example: ``2.0`` - - -``root`` - -:Description: The root node of the tree in which the OSD resides (normally ``default``) -:Type: Key/value pair. -:Required: Yes -:Example: ``root=default`` - - -``bucket-type`` - -:Description: You may specify the OSD's location in the CRUSH hierarchy. -:Type: Key/value pairs. -:Required: No -:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` - - -The following example adds ``osd.0`` to the hierarchy, or moves the -OSD from a previous location. :: - - ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 - - -Adjust OSD weight ------------------ - -.. note: Normally OSDs automatically add themselves to the CRUSH map - with the correct weight when they are created. This command - is rarely needed. - -To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute -the following:: - - ceph osd crush reweight {name} {weight} - -Where: - -``name`` - -:Description: The full name of the OSD. -:Type: String -:Required: Yes -:Example: ``osd.0`` - - -``weight`` - -:Description: The CRUSH weight for the OSD. -:Type: Double -:Required: Yes -:Example: ``2.0`` - - -.. _removeosd: - -Remove an OSD -------------- - -.. note: OSDs are normally removed from the CRUSH as part of the - ``ceph osd purge`` command. This command is rarely needed. - -To remove an OSD from the CRUSH map of a running cluster, execute the -following:: - - ceph osd crush remove {name} - -Where: - -``name`` - -:Description: The full name of the OSD. -:Type: String -:Required: Yes -:Example: ``osd.0`` - - -Add a Bucket ------------- - -.. note: Buckets are normally implicitly created when an OSD is added - that specifies a ``{bucket-type}={bucket-name}`` as part of its - location and a bucket with that name does not already exist. 
This - command is typically used when manually adjusting the structure of the - hierarchy after OSDs have been created (for example, to move a - series of hosts underneath a new rack-level bucket). - -To add a bucket in the CRUSH map of a running cluster, execute the -``ceph osd crush add-bucket`` command:: - - ceph osd crush add-bucket {bucket-name} {bucket-type} - -Where: - -``bucket-name`` - -:Description: The full name of the bucket. -:Type: String -:Required: Yes -:Example: ``rack12`` - - -``bucket-type`` - -:Description: The type of the bucket. The type must already exist in the hierarchy. -:Type: String -:Required: Yes -:Example: ``rack`` - - -The following example adds the ``rack12`` bucket to the hierarchy:: - - ceph osd crush add-bucket rack12 rack - -Move a Bucket -------------- - -To move a bucket to a different location or position in the CRUSH map -hierarchy, execute the following:: - - ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] - -Where: - -``bucket-name`` - -:Description: The name of the bucket to move/reposition. -:Type: String -:Required: Yes -:Example: ``foo-bar-1`` - -``bucket-type`` - -:Description: You may specify the bucket's location in the CRUSH hierarchy. -:Type: Key/value pairs. -:Required: No -:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` - -Remove a Bucket ---------------- - -To remove a bucket from the CRUSH map hierarchy, execute the following:: - - ceph osd crush remove {bucket-name} - -.. note:: A bucket must be empty before removing it from the CRUSH hierarchy. - -Where: - -``bucket-name`` - -:Description: The name of the bucket that you'd like to remove. -:Type: String -:Required: Yes -:Example: ``rack12`` - -The following example removes the ``rack12`` bucket from the hierarchy:: - - ceph osd crush remove rack12 - -Creating a compat weight set ----------------------------- - -.. note: This step is normally done automatically by the ``balancer`` - module when enabled. 
- -To create a *compat* weight set:: - - ceph osd crush weight-set create-compat - -Weights for the compat weight set can be adjusted with:: - - ceph osd crush weight-set reweight-compat {name} {weight} - -The compat weight set can be destroyed with:: - - ceph osd crush weight-set rm-compat - -Creating per-pool weight sets ------------------------------ - -To create a weight set for a specific pool,:: - - ceph osd crush weight-set create {pool-name} {mode} - -.. note:: Per-pool weight sets require that all servers and daemons - run Luminous v12.2.z or later. - -Where: - -``pool-name`` - -:Description: The name of a RADOS pool -:Type: String -:Required: Yes -:Example: ``rbd`` - -``mode`` - -:Description: Either ``flat`` or ``positional``. A *flat* weight set - has a single weight for each device or bucket. A - *positional* weight set has a potentially different - weight for each position in the resulting placement - mapping. For example, if a pool has a replica count of - 3, then a positional weight set will have three weights - for each device and bucket. -:Type: String -:Required: Yes -:Example: ``flat`` - -To adjust the weight of an item in a weight set:: - - ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} - -To list existing weight sets,:: - - ceph osd crush weight-set ls - -To remove a weight set,:: - - ceph osd crush weight-set rm {pool-name} - -Creating a rule for a replicated pool -------------------------------------- - -For a replicated pool, the primary decision when creating the CRUSH -rule is what the failure domain is going to be. For example, if a -failure domain of ``host`` is selected, then CRUSH will ensure that -each replica of the data is stored on a different host. If ``rack`` -is selected, then each replica will be stored in a different rack. -What failure domain you choose primarily depends on the size of your -cluster and how your hierarchy is structured. 
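What a failure domain guarantees can be expressed as a simple check. The sketch below parses crush location strings in the ``key=value`` form described under "CRUSH Location" and tests whether a set of replicas lands in distinct buckets of a given type. The function names and the three example locations are hypothetical. This only validates a placement after the fact; it is not how CRUSH itself selects devices.

```python
def parse_location(text):
    """Turn 'root=default rack=a1 host=a1a' into a dict; key order does not matter."""
    return dict(pair.split("=", 1) for pair in text.split())


def separated_by(locations, failure_domain):
    """Return True if every location sits in a different bucket of the given type."""
    buckets = [location[failure_domain] for location in locations]
    return len(set(buckets)) == len(buckets)


# Hypothetical locations of the three OSDs holding one object's replicas.
replicas = [
    parse_location("root=default rack=a1 host=a1a"),
    parse_location("root=default rack=a1 host=a1b"),
    parse_location("host=a2a rack=a2 root=default"),
]

print(separated_by(replicas, "host"))  # each replica is on its own host
print(separated_by(replicas, "rack"))  # two replicas share rack a1
```

In this example the placement satisfies a ``host`` failure domain but not a ``rack`` one: losing rack ``a1`` would take out two of the three replicas. That is the kind of trade-off to weigh when choosing the failure domain for a rule.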
- -Normally, the entire cluster hierarchy is nested beneath a root node -named ``default``. If you have customized your hierarchy, you may -want to create a rule nested at some other node in the hierarchy. It -doesn't matter what type is associated with that node (it doesn't have -to be a ``root`` node). - -It is also possible to create a rule that restricts data placement to -a specific *class* of device. By default, Ceph OSDs automatically -classify themselves as either ``hdd`` or ``ssd``, depending on the -underlying type of device being used. These classes can also be -customized. - -To create a replicated rule,:: - - ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] - -Where: - -``name`` - -:Description: The name of the rule -:Type: String -:Required: Yes -:Example: ``rbd-rule`` - -``root`` - -:Description: The name of the node under which data should be placed. -:Type: String -:Required: Yes -:Example: ``default`` - -``failure-domain-type`` - -:Description: The type of CRUSH nodes across which we should separate replicas. -:Type: String -:Required: Yes -:Example: ``rack`` - -``class`` - -:Description: The device class data should be placed on. -:Type: String -:Required: No -:Example: ``ssd`` - -Creating a rule for an erasure coded pool ------------------------------------------ - -For an erasure-coded pool, the same basic decisions need to be made as -with a replicated pool: what is the failure domain, what node in the -hierarchy will data be placed under (usually ``default``), and will -placement be restricted to a specific device class. Erasure code -pools are created a bit differently, however, because they need to be -constructed carefully based on the erasure code being used. For this reason, -you must include this information in the *erasure code profile*. A CRUSH -rule will then be created from that either explicitly or automatically when -the profile is used to create a pool. 
- -The erasure code profiles can be listed with:: - - ceph osd erasure-code-profile ls - -An existing profile can be viewed with:: - - ceph osd erasure-code-profile get {profile-name} - -Normally profiles should never be modified; instead, a new profile -should be created and used when creating a new pool or creating a new -rule for an existing pool. - -An erasure code profile consists of a set of key=value pairs. Most of -these control the behavior of the erasure code that is encoding data -in the pool. Those that begin with ``crush-``, however, affect the -CRUSH rule that is created. - -The erasure code profile properties of interest are: - - * **crush-root**: the name of the CRUSH node to place data under [default: ``default``]. - * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``]. - * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used]. - * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. - -Once a profile is defined, you can create a CRUSH rule with:: - - ceph osd crush rule create-erasure {name} {profile-name} - -.. note:: When creating a new pool, it is not actually necessary to - explicitly create the rule. If the erasure code profile alone is - specified and the rule argument is left off then Ceph will create - the CRUSH rule automatically. - -Deleting rules --------------- - -Rules that are not in use by pools can be deleted with:: - - ceph osd crush rule rm {rule-name} - - -Tunables -======== - -Over time, we have made (and continue to make) improvements to the -CRUSH algorithm used to calculate the placement of data. In order to -support the change in behavior, we have introduced a series of tunable -options that control whether the legacy or improved variation of the -algorithm is used.
- -In order to use newer tunables, both clients and servers must support -the new version of CRUSH. For this reason, we have created -``profiles`` that are named after the Ceph version in which they were -introduced. For example, the ``firefly`` tunables are first supported -in the firefly release, and will not work with older (e.g., dumpling) -clients. Once a given set of tunables are changed from the legacy -default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older -clients who do not support the new CRUSH features from connecting to -the cluster. - -argonaut (legacy) ------------------ - -The legacy CRUSH behavior used by argonaut and older releases works -fine for most clusters, provided there are not too many OSDs that have -been marked out. - -bobtail (CRUSH_TUNABLES2) -------------------------- - -The bobtail tunable profile fixes a few key misbehaviors: - - * For hierarchies with a small number of devices in the leaf buckets, - some PGs map to fewer than the desired number of replicas. This - commonly happens for hierarchies with "host" nodes with a small - number (1-3) of OSDs nested beneath each one. - - * For large clusters, some small percentages of PGs map to less than - the desired number of OSDs. This is more prevalent when there are - several layers of the hierarchy (e.g., row, rack, host, osd). - - * When some OSDs are marked out, the data tends to get redistributed - to nearby OSDs instead of across the entire hierarchy. - -The new tunables are: - - * ``choose_local_tries``: Number of local retries. Legacy value is - 2, optimal value is 0. - - * ``choose_local_fallback_tries``: Legacy value is 5, optimal value - is 0. - - * ``choose_total_tries``: Total number of attempts to choose an item. - Legacy value was 19, subsequent testing indicates that a value of - 50 is more appropriate for typical clusters. For extremely large - clusters, a larger value might be necessary. 
- - * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt - will retry, or only try once and allow the original placement to - retry. Legacy default is 0, optimal value is 1. - -Migration impact: - - * Moving from argonaut to bobtail tunables triggers a moderate amount - of data movement. Use caution on a cluster that is already - populated with data. - -firefly (CRUSH_TUNABLES3) -------------------------- - -The firefly tunable profile fixes a problem -with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG -mappings with too few results when too many OSDs have been marked out. - -The new tunable is: - - * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will - start with a non-zero value of r, based on how many attempts the - parent has already made. Legacy default is 0, but with this value - CRUSH is sometimes unable to find a mapping. The optimal value (in - terms of computational cost and correctness) is 1. - -Migration impact: - - * For existing clusters that have lots of existing data, changing - from 0 to 1 will cause a lot of data to move; a value of 4 or 5 - will allow CRUSH to find a valid mapping but will make less data - move. - -straw_calc_version tunable (introduced with Firefly too) --------------------------------------------------------- - -There were some problems with the internal weights calculated and -stored in the CRUSH map for ``straw`` buckets. Specifically, when -there were items with a CRUSH weight of 0 or both a mix of weights and -some duplicated weights CRUSH would distribute data incorrectly (i.e., -not in proportion to the weights). - -The new tunable is: - - * ``straw_calc_version``: A value of 0 preserves the old, broken - internal weight calculation; a value of 1 fixes the behavior. 
- -Migration impact: - - * Moving to straw_calc_version 1 and then adjusting a straw bucket - (by adding, removing, or reweighting an item, or by using the - reweight-all command) can trigger a small to moderate amount of - data movement *if* the cluster has hit one of the problematic - conditions. - -This tunable option is special because it has absolutely no impact -on the kernel version required on the client side. - -hammer (CRUSH_V4) ------------------ - -The hammer tunable profile does not affect the -mapping of existing CRUSH maps simply by changing the profile. However: - - * There is a new bucket type (``straw2``) supported. The new - ``straw2`` bucket type fixes several limitations in the original - ``straw`` bucket. Specifically, the old ``straw`` buckets would - change some mappings that should not have changed when a weight was - adjusted, while ``straw2`` achieves the original goal of only - changing mappings to or from the bucket item whose weight has - changed. - - * ``straw2`` is the default for any newly created buckets. - -Migration impact: - - * Changing a bucket type from ``straw`` to ``straw2`` will result in - a reasonably small amount of data movement, depending on how much - the bucket item weights vary from each other. When the weights are - all the same no data will move, and when item weights vary - significantly there will be more movement. - -jewel (CRUSH_TUNABLES5) ------------------------ - -The jewel tunable profile improves the -overall behavior of CRUSH such that significantly fewer mappings -change when an OSD is marked out of the cluster. - -The new tunable is: - - * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will - use a better value for an inner loop that greatly reduces the number - of mapping changes when an OSD is marked out. The legacy value is 0, - while the new value of 1 uses the new approach.
- -Migration impact: - - * Changing this value on an existing cluster will result in a very - large amount of data movement as almost every PG mapping is likely - to change. - - - - -Which client versions support CRUSH_TUNABLES --------------------------------------------- - - * argonaut series, v0.48.1 or later - * v0.49 or later - * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_TUNABLES2 ---------------------------------------------- - - * v0.55 or later, including bobtail series (v0.56.x) - * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_TUNABLES3 ---------------------------------------------- - - * v0.78 (firefly) or later - * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_V4 --------------------------------------- - - * v0.94 (hammer) or later - * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_TUNABLES5 ---------------------------------------------- - - * v10.0.2 (jewel) or later - * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) - -Warning when tunables are non-optimal -------------------------------------- - -Starting with version v0.74, Ceph will issue a health warning if the -current CRUSH tunables don't include all the optimal values from the -``default`` profile (see below for the meaning of the ``default`` profile). -To make this warning go away, you have two options: - -1. Adjust the tunables on the existing cluster. Note that this will - result in some data movement (possibly as much as 10%). This is the - preferred route, but should be taken with care on a production cluster - where the data movement may affect performance. 
You can enable optimal - tunables with:: - - ceph osd crush tunables optimal - - If things go poorly (e.g., too much load) and not very much - progress has been made, or there is a client compatibility problem - (old kernel cephfs or rbd clients, or pre-bobtail librados - clients), you can switch back with:: - - ceph osd crush tunables legacy - -2. You can make the warning go away without making any changes to CRUSH by - adding the following option to your ceph.conf ``[mon]`` section:: - - mon warn on legacy crush tunables = false - - For the change to take effect, you will need to restart the monitors, or - apply the option to running monitors with:: - - ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables - - -A few important points ----------------------- - - * Adjusting these values will result in the shift of some PGs between - storage nodes. If the Ceph cluster is already storing a lot of - data, be prepared for some fraction of the data to move. - * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the - feature bits of new connections as soon as they get - the updated map. However, already-connected clients are - effectively grandfathered in, and will misbehave if they do not - support the new feature. - * If the CRUSH tunables are set to non-legacy values and then later - changed back to the default values, ``ceph-osd`` daemons will not be - required to support the feature. However, the OSD peering process - requires examining and understanding old maps. Therefore, you - should not run old versions of the ``ceph-osd`` daemon - if the cluster has previously used non-legacy CRUSH values, even if - the latest version of the map has been switched back to using the - legacy defaults. - -Tuning CRUSH ------------- - -The simplest way to adjust the crush tunables is by changing to a known -profile. Those are: - - * ``legacy``: the legacy behavior from argonaut and earlier.
- * ``argonaut``: the legacy values supported by the original argonaut release - * ``bobtail``: the values supported by the bobtail release - * ``firefly``: the values supported by the firefly release - * ``hammer``: the values supported by the hammer release - * ``jewel``: the values supported by the jewel release - * ``optimal``: the best (i.e., optimal) values of the current version of Ceph - * ``default``: the default values of a new cluster installed from - scratch. These values, which depend on the current version of Ceph, - are hard coded and are generally a mix of optimal and legacy values. - These values generally match the ``optimal`` profile of the previous - LTS release, or the most recent release for which we generally expect - more users to have up-to-date clients. - -You can select a profile on a running cluster with the command:: - - ceph osd crush tunables {PROFILE} - -Note that this may result in some data movement. - - -.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf - - -Primary Affinity -================ - -When a Ceph Client reads or writes data, it always contacts the primary OSD in -the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an -OSD is not well suited to act as a primary compared to other OSDs (e.g., it has -a slow disk or a slow controller). To prevent performance bottlenecks -(especially on read operations) while maximizing utilization of your hardware, -you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use -the OSD as a primary in an acting set. :: - - ceph osd primary-affinity <osd-id> <weight> - -Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You -may set the OSD primary range from ``0-1``, where ``0`` means that the OSD may -**NOT** be used as a primary and ``1`` means that an OSD may be used as a -primary.
When the weight is ``< 1``, it is less likely that CRUSH will select -the Ceph OSD Daemon to act as a primary. - - - diff --git a/src/ceph/doc/rados/operations/data-placement.rst b/src/ceph/doc/rados/operations/data-placement.rst deleted file mode 100644 index 27966b0..0000000 --- a/src/ceph/doc/rados/operations/data-placement.rst +++ /dev/null @@ -1,37 +0,0 @@ -========================= - Data Placement Overview -========================= - -Ceph stores, replicates and rebalances data objects across a RADOS cluster -dynamically. With many different users storing objects in different pools for -different purposes on countless OSDs, Ceph operations require some data -placement planning. The main data placement planning concepts in Ceph include: - -- **Pools:** Ceph stores data within pools, which are logical groups for storing - objects. Pools manage the number of placement groups, the number of replicas, - and the ruleset for the pool. To store data in a pool, you must have - an authenticated user with permissions for the pool. Ceph can snapshot pools. - See `Pools`_ for additional details. - -- **Placement Groups:** Ceph maps objects to placement groups (PGs). - Placement groups (PGs) are shards or fragments of a logical object pool - that place objects as a group into OSDs. Placement groups reduce the amount - of per-object metadata when Ceph stores the data in OSDs. A larger number of - placement groups (e.g., 100 per OSD) leads to better balancing. See - `Placement Groups`_ for additional details. - -- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without - performance bottlenecks, without limitations to scalability, and without a - single point of failure. CRUSH maps provide the physical topology of the - cluster to the CRUSH algorithm to determine where the data for an object - and its replicas should be stored, and how to do so across failure domains - for added data safety among other things. See `CRUSH Maps`_ for additional - details. 
- -When you initially set up a test cluster, you can use the default values. Once -you begin planning for a large Ceph cluster, refer to pools, placement groups -and CRUSH for data placement operations. - -.. _Pools: ../pools -.. _Placement Groups: ../placement-groups -.. _CRUSH Maps: ../crush-map diff --git a/src/ceph/doc/rados/operations/erasure-code-isa.rst b/src/ceph/doc/rados/operations/erasure-code-isa.rst deleted file mode 100644 index b52933a..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-isa.rst +++ /dev/null @@ -1,105 +0,0 @@ -======================= -ISA erasure code plugin -======================= - -The *isa* plugin encapsulates the `ISA -<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_ -library. It only runs on Intel processors. - -Create an isa profile -===================== - -To create a new *isa* erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=isa \ - technique={reed_sol_van|cauchy} \ - [k={data-chunks}] \ - [m={coding-chunks}] \ - [crush-root={root}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: No. -:Default: 7 - -``m={coding-chunks}`` - -:Description: Compute **coding chunks** for each object and store them - on different OSDs. The number of coding chunks is also - the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: No. -:Default: 3 - -``technique={reed_sol_van|cauchy}`` - -:Description: The ISA plugin comes in two `Reed Solomon - <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_ - forms. If *reed_sol_van* is set, it is `Vandermonde - <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if - *cauchy* is set, it is `Cauchy - <https://en.wikipedia.org/wiki/Cauchy_matrix>`_. 
- -:Type: String -:Required: No. -:Default: reed_sol_van - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For instance **step take default**. - -:Type: String -:Required: No. -:Default: default - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - -``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. - diff --git a/src/ceph/doc/rados/operations/erasure-code-jerasure.rst b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst deleted file mode 100644 index e8da097..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-jerasure.rst +++ /dev/null @@ -1,120 +0,0 @@ -============================ -Jerasure erasure code plugin -============================ - -The *jerasure* plugin is the most generic and flexible plugin; it is -also the default for Ceph erasure coded pools. - -The *jerasure* plugin encapsulates the `Jerasure -<http://jerasure.org>`_ library. It is -recommended to read the *jerasure* documentation to get a better -understanding of the parameters.
- -Create a jerasure profile -========================= - -To create a new *jerasure* erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=jerasure \ - k={data-chunks} \ - m={coding-chunks} \ - technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \ - [crush-root={root}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: Yes. -:Example: 4 - -``m={coding-chunks}`` - -:Description: Compute **coding chunks** for each object and store them - on different OSDs. The number of coding chunks is also - the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: Yes. -:Example: 2 - -``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}`` - -:Description: The most flexible technique is *reed_sol_van*: it is - enough to set *k* and *m*. The *cauchy_good* technique - can be faster but you need to choose the *packetsize* - carefully. All of *reed_sol_r6_op*, *liberation*, - *blaum_roth*, *liber8tion* are *RAID6* equivalents in - the sense that they can only be configured with *m=2*. - -:Type: String -:Required: No. -:Default: reed_sol_van - -``packetsize={bytes}`` - -:Description: The encoding will be done on packets of *bytes* size at - a time. Choosing the right packet size is difficult. The - *jerasure* documentation contains extensive information - on this topic. - -:Type: Integer -:Required: No. -:Default: 2048 - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For instance **step take default**. - -:Type: String -:Required: No.
-:Default: default - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - - ``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. - diff --git a/src/ceph/doc/rados/operations/erasure-code-lrc.rst b/src/ceph/doc/rados/operations/erasure-code-lrc.rst deleted file mode 100644 index 447ce23..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-lrc.rst +++ /dev/null @@ -1,371 +0,0 @@ -====================================== -Locally repairable erasure code plugin -====================================== - -With the *jerasure* plugin, when an erasure coded object is stored on -multiple OSDs, recovering from the loss of one OSD requires reading -from all the others. For instance if *jerasure* is configured with -*k=8* and *m=4*, losing one OSD requires reading from the eleven -others to repair. - -The *lrc* erasure code plugin creates local parity chunks to be able -to recover using fewer OSDs. For instance if *lrc* is configured with -*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for -every four OSDs. When a single OSD is lost, it can be recovered with -only four OSDs instead of eleven.
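The chunk arithmetic in the *lrc* introduction can be checked with a short Python sketch. The function name `lrc_layout` and the dictionary keys are hypothetical, chosen for illustration only. The counts follow the figures given in this document (one extra parity chunk per group of *l*, and *k + m* a multiple of *l*); this is not taken from the plugin's source.

```python
def lrc_layout(k: int, m: int, l: int) -> dict:
    """Chunk counts for an lrc profile, using the figures stated in the text.

    Hypothetical helper for illustration; not part of Ceph.
    """
    if (k + m) % l != 0:
        # The document states that k + m must be a multiple of l.
        raise ValueError("k + m must be a multiple of l")
    local_parity = (k + m) // l  # one additional parity chunk per group of l
    return {
        "total_chunks": k + m + local_parity,
        "local_parity_chunks": local_parity,
        # Without local parity, the text counts "all the others".
        "jerasure_reads": k + m - 1,
        # With local parity, a single lost OSD is repaired from l OSDs.
        "lrc_reads": l,
    }


print(lrc_layout(8, 4, 4))
print(lrc_layout(4, 2, 3))
```

With *k=8*, *m=4*, *l=4* this gives eleven reads without local parity and four with it, as described above. With *k=4*, *m=2*, *l=3* it gives eight chunks in total, which is the width of the ``__DD__DD`` mapping used in the low level examples.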
- -Erasure code profile examples -============================= - -Reduce recovery bandwidth between hosts ---------------------------------------- - -Although it is probably not an interesting use case when all hosts are -connected to the same switch, reduced bandwidth usage can actually be -observed.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - k=4 m=2 l=3 \ - crush-failure-domain=host - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - -Reduce recovery bandwidth between racks ---------------------------------------- - -In Firefly the reduced bandwidth will only be observed if the primary -OSD is in the same rack as the lost chunk.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - k=4 m=2 l=3 \ - crush-locality=rack \ - crush-failure-domain=host - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - -Create an lrc profile -===================== - -To create a new lrc erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=lrc \ - k={data-chunks} \ - m={coding-chunks} \ - l={locality} \ - [crush-root={root}] \ - [crush-locality={bucket-type}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: Yes. -:Example: 4 - -``m={coding-chunks}`` - -:Description: Compute **coding chunks** for each object and store them - on different OSDs. The number of coding chunks is also - the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: Yes. -:Example: 2 - -``l={locality}`` - -:Description: Group the coding and data chunks into sets of size - **locality**. For instance, for **k=4** and **m=2**, - when **locality=3** two groups of three are created. - Each set can be recovered without reading chunks - from another set. 
- -:Type: Integer -:Required: Yes. -:Example: 3 - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For instance **step take default**. - -:Type: String -:Required: No. -:Default: default - -``crush-locality={bucket-type}`` - -:Description: The type of the crush bucket in which each set of chunks - defined by **l** will be stored. For instance, if it is - set to **rack**, each group of **l** chunks will be - placed in a different rack. It is used to create a - ruleset step such as **step choose rack**. If it is not - set, no such grouping is done. - -:Type: String -:Required: No. - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - -``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. - -Low level plugin configuration -============================== - -The sum of **k** and **m** must be a multiple of the **l** parameter. -The low level configuration parameters do not impose such a -restriction and it may be more convenient to use them for specific -purposes. It is for instance possible to define two groups, one with 4 -chunks and another with 3 chunks.
It is also possible to recursively -define locality sets, for instance datacenters and racks within -datacenters. The **k/m/l** parameters are implemented by generating a low level -configuration. - -The *lrc* erasure code plugin recursively applies erasure code -techniques so that recovering from the loss of some chunks only -requires a subset of the available chunks, most of the time. - -For instance, when three coding steps are described as:: - - chunk nr 01234567 - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -where *c* are coding chunks calculated from the data chunks *D*, the -loss of chunk *7* can be recovered with the last four chunks. And the -loss of chunk *2* can be recovered with the first four -chunks. - -Erasure code profile examples using low level configuration -=========================================================== - -Minimal testing ---------------- - -It is strictly equivalent to using the default erasure code profile. The *DD* -implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used -by default.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=DD_ \ - layers='[ [ "DDc", "" ] ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - -Reduce recovery bandwidth between hosts ---------------------------------------- - -Although it is probably not an interesting use case when all hosts are -connected to the same switch, reduced bandwidth usage can actually be -observed.
It is equivalent to **k=4**, **m=2** and **l=3** although -the layout of the chunks is different:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=__DD__DD \ - layers='[ - [ "_cDD_cDD", "" ], - [ "cDDD____", "" ], - [ "____cDDD", "" ], - ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - -Reduce recovery bandwidth between racks ---------------------------------------- - -In Firefly the reduced bandwidth will only be observed if the primary -OSD is in the same rack as the lost chunk.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=__DD__DD \ - layers='[ - [ "_cDD_cDD", "" ], - [ "cDDD____", "" ], - [ "____cDDD", "" ], - ]' \ - crush-steps='[ - [ "choose", "rack", 2 ], - [ "chooseleaf", "host", 4 ], - ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - -Testing with different Erasure Code backends --------------------------------------------- - -LRC now uses jerasure as the default EC backend. It is possible to -specify the EC backend/algorithm on a per layer basis using the low -level configuration. The second argument in layers='[ [ "DDc", "" ] ]' -is actually an erasure code profile to be used for this level. 
The -example below specifies the ISA backend with the cauchy technique to -be used in the lrcpool.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=DD_ \ - layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - -You could also use a different erasure code profile for each -layer.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=__DD__DD \ - layers='[ - [ "_cDD_cDD", "plugin=isa technique=cauchy" ], - [ "cDDD____", "plugin=isa" ], - [ "____cDDD", "plugin=jerasure" ], - ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - - -Erasure coding and decoding algorithm -===================================== - -The steps found in the layers description:: - - chunk nr 01234567 - - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -are applied in order. For instance, if a 4K object is encoded, it will -first go through *step 1* and be divided into four 1K chunks (the four -uppercase D). They are stored in the chunks 2, 3, 6 and 7, in -order. From these, two coding chunks are calculated (the two lowercase -c). The coding chunks are stored in the chunks 1 and 5, respectively. - -The *step 2* re-uses the content created by *step 1* in a similar -fashion and stores a single coding chunk *c* at position 0. The last four -chunks, marked with an underscore (*_*) for readability, are ignored. - -The *step 3* stores a single coding chunk *c* at position 4. The three -chunks created by *step 1* are used to compute this coding chunk, -i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. - -If chunk *2* is lost:: - - chunk nr 01234567 - - step 1 _c D_cDD - step 2 cD D____ - step 3 __ _cDDD - -decoding will attempt to recover it by walking the steps in reverse -order: *step 3* then *step 2* and finally *step 1*. - -The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) -and is skipped.
- -The coding chunk from *step 2*, stored in chunk *0*, allows it to -recover the content of chunk *2*. There are no more chunks to recover -and the process stops, without considering *step 1*. - -Recovering chunk *2* requires reading chunks *0, 1, 3* and writing -back chunk *2*. - -If chunks *2, 3, 6* are lost:: - - chunk nr 01234567 - - step 1 _c _c D - step 2 cD __ _ - step 3 __ cD D - -The *step 3* can recover the content of chunk *6*:: - - chunk nr 01234567 - - step 1 _c _cDD - step 2 cD ____ - step 3 __ cDDD - -The *step 2* fails to recover and is skipped because there are two -chunks missing (*2, 3*) and it can only recover from one missing -chunk. - -The coding chunks from *step 1*, stored in chunks *1, 5*, allow it to -recover the content of chunks *2, 3*:: - - chunk nr 01234567 - - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -Controlling crush placement -=========================== - -The default crush ruleset provides OSDs that are on different hosts. For instance:: - - chunk nr 01234567 - - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -needs exactly *8* OSDs, one for each chunk. If the hosts are in two -adjacent racks, the first four chunks can be placed in the first rack -and the last four in the second rack, so that recovering from the loss -of a single OSD does not require using bandwidth between the two -racks. - -For instance:: - - crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]' - -will create a ruleset that will select two crush buckets of type -*rack* and for each of them choose four OSDs, each of them located in -different buckets of type *host*. - -The ruleset can also be manually crafted for finer control. 
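The reverse-order decoding walk described under "Erasure coding and decoding algorithm" can be sketched as a small planner. This is an illustrative model only, not the plugin's code: ``plan_recovery`` and ``LAYERS`` are hypothetical names, and the sketch assumes that a layer can rebuild as many missing chunks as it has coding chunks (``c``), which holds for Reed-Solomon style codes. It plans which step repairs which chunk; it does not encode or decode any data.

```python
# Layer strings copied from the example profile: 'D' = data chunk,
# 'c' = coding chunk, '_' = chunk ignored by that layer.
LAYERS = ["_cDD_cDD", "cDDD____", "____cDDD"]


def plan_recovery(layers, lost):
    """Walk the layers in reverse order and plan which step repairs what.

    Returns (actions, still_lost) where each action is a tuple
    (step number, chunks recovered, chunks read to recover them).
    Assumption: a layer can repair at most as many chunks as it has 'c' entries.
    """
    lost = set(lost)
    actions = []
    for step in range(len(layers), 0, -1):
        if not lost:
            break  # nothing left to recover, remaining steps are not considered
        layer = layers[step - 1]
        members = {i for i, ch in enumerate(layer) if ch != "_"}
        coding = layer.count("c")
        missing = members & lost
        if not missing or len(missing) > coding:
            continue  # layer knows nothing about the loss, or cannot repair it
        actions.append((step, sorted(missing), sorted(members - lost)))
        lost -= missing
    return actions, lost


if __name__ == "__main__":
    for case in ([2], [2, 3, 6]):
        print(case, plan_recovery(LAYERS, case))
```

For the single-chunk case the planner selects *step 2* and reads chunks 0, 1 and 3, matching the walkthrough above; for chunks 2, 3 and 6 it uses *step 3* first and then falls back to *step 1*.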
diff --git a/src/ceph/doc/rados/operations/erasure-code-profile.rst b/src/ceph/doc/rados/operations/erasure-code-profile.rst deleted file mode 100644 index ddf772d..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-profile.rst +++ /dev/null @@ -1,121 +0,0 @@ -===================== -Erasure code profiles -===================== - -Erasure code is defined by a **profile** and is used when creating an -erasure coded pool and the associated crush ruleset. - -The **default** erasure code profile (which is created when the Ceph -cluster is initialized) provides the same level of redundancy as two -copies but requires 25% less disk space. It is described as a profile -with **k=2** and **m=1**, meaning the information is spread over three -OSD (k+m == 3) and one of them can be lost. - -To improve redundancy without increasing raw storage requirements, a -new profile can be created. For instance, a profile with **k=10** and -**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an -object on fourteen (k+m=14) OSDs. The object is first divided in -**10** chunks (if the object is 10MB, each chunk is 1MB) and **4** -coding chunks are computed, for recovery (each coding chunk has the -same size as the data chunk, i.e. 1MB). The raw space overhead is only -40% and the object will not be lost even if four OSDs break at the -same time. - -.. _list of available plugins: - -.. toctree:: - :maxdepth: 1 - - erasure-code-jerasure - erasure-code-isa - erasure-code-lrc - erasure-code-shec - -osd erasure-code-profile set -============================ - -To create a new erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - [{directory=directory}] \ - [{plugin=plugin}] \ - [{stripe_unit=stripe_unit}] \ - [{key=value} ...] \ - [--force] - -Where: - -``{directory=directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. 
-:Default: /usr/lib/ceph/erasure-code - -``{plugin=plugin}`` - -:Description: Use the erasure code **plugin** to compute coding chunks - and recover missing chunks. See the `list of available - plugins`_ for more information. - -:Type: String -:Required: No. -:Default: jerasure - -``{stripe_unit=stripe_unit}`` - -:Description: The amount of data in a data chunk, per stripe. For - example, a profile with 2 data chunks and stripe_unit=4K - would put the range 0-4K in chunk 0, 4K-8K in chunk 1, - then 8K-12K in chunk 0 again. This should be a multiple - of 4K for best performance. The default value is taken - from the monitor config option - ``osd_pool_erasure_code_stripe_unit`` when a pool is - created. The stripe_width of a pool using this profile - will be the number of data chunks multiplied by this - stripe_unit. - -:Type: String -:Required: No. - -``{key=value}`` - -:Description: The semantic of the remaining key/value pairs is defined - by the erasure code plugin. - -:Type: String -:Required: No. - -``--force`` - -:Description: Override an existing profile by the same name, and allow - setting a non-4K-aligned stripe_unit. - -:Type: String -:Required: No. - -osd erasure-code-profile rm -============================ - -To remove an erasure code profile:: - - ceph osd erasure-code-profile rm {name} - -If the profile is referenced by a pool, the deletion will fail. 
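The ``stripe_unit`` layout described for ``osd erasure-code-profile set`` comes down to simple arithmetic. The sketch below only restates that description; ``chunk_for_offset`` and ``stripe_width`` are hypothetical helper names and are not part of Ceph.

```python
def chunk_for_offset(offset, k, stripe_unit):
    """Index of the data chunk that holds the byte at `offset`.

    Data is laid out round-robin over the k data chunks in units of
    `stripe_unit` bytes, as described for the stripe_unit option.
    """
    return (offset // stripe_unit) % k


def stripe_width(k, stripe_unit):
    """Stripe width of a pool: number of data chunks times stripe_unit."""
    return k * stripe_unit


if __name__ == "__main__":
    four_k = 4096
    # Two data chunks with stripe_unit=4K, as in the description above.
    for offset in (0, four_k, 2 * four_k):
        print(offset, "-> chunk", chunk_for_offset(offset, 2, four_k))
    print("stripe_width:", stripe_width(2, four_k))
```

With two data chunks and a 4K stripe unit, offsets in 0-4K map to chunk 0, 4K-8K to chunk 1, and 8K-12K to chunk 0 again, giving a stripe width of 8K.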
- -osd erasure-code-profile get -============================ - -To display an erasure code profile:: - - ceph osd erasure-code-profile get {name} - -osd erasure-code-profile ls -=========================== - -To list the names of all erasure code profiles:: - - ceph osd erasure-code-profile ls - diff --git a/src/ceph/doc/rados/operations/erasure-code-shec.rst b/src/ceph/doc/rados/operations/erasure-code-shec.rst deleted file mode 100644 index e3bab37..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-shec.rst +++ /dev/null @@ -1,144 +0,0 @@ -======================== -SHEC erasure code plugin -======================== - -The *shec* plugin encapsulates the `multiple SHEC -<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_ -library. It allows ceph to recover data more efficiently than Reed Solomon codes. - -Create an SHEC profile -====================== - -To create a new *shec* erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=shec \ - [k={data-chunks}] \ - [m={coding-chunks}] \ - [c={durability-estimator}] \ - [crush-root={root}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data-chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: No. -:Default: 4 - -``m={coding-chunks}`` - -:Description: Compute **coding-chunks** for each object and store them on - different OSDs. The number of **coding-chunks** does not necessarily - equal the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: No. -:Default: 3 - -``c={durability-estimator}`` - -:Description: The number of parity chunks each of which includes each data chunk in its - calculation range. The number is used as a **durability estimator**. - For instance, if c=2, 2 OSDs can be down without losing data. - -:Type: Integer -:Required: No. 
-:Default: 2 - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For instance **step take default**. - -:Type: String -:Required: No. -:Default: default - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - -``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. - -Brief description of SHEC's layouts -=================================== - -Space Efficiency ----------------- - -Space efficiency is the ratio of data chunks to all chunks in an object, -represented as k/(k+m). -In order to improve space efficiency, you should increase k or decrease m. - -:: - - space efficiency of SHEC(4,3,2) = 4/(4+3) = 0.57 - SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency - -Durability ----------- - -The third parameter of SHEC (=c) is a durability estimator, which approximates -the number of OSDs that can be down without losing data. - -``durability estimator of SHEC(4,3,2) = 2`` - -Recovery Efficiency -------------------- - -Describing calculation of recovery efficiency is beyond the scope of this document, -but at least increasing m without increasing c achieves improvement of recovery efficiency. 
-(However, we must pay attention to the sacrifice of space efficiency in this case.) - -``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency`` - -Erasure code profile examples -============================= - -:: - - $ ceph osd erasure-code-profile set SHECprofile \ - plugin=shec \ - k=8 m=4 c=3 \ - crush-failure-domain=host - $ ceph osd pool create shecpool 256 256 erasure SHECprofile diff --git a/src/ceph/doc/rados/operations/erasure-code.rst b/src/ceph/doc/rados/operations/erasure-code.rst deleted file mode 100644 index 6ec5a09..0000000 --- a/src/ceph/doc/rados/operations/erasure-code.rst +++ /dev/null @@ -1,195 +0,0 @@ -============= - Erasure code -============= - -A Ceph pool is associated to a type to sustain the loss of an OSD -(i.e. a disk since most of the time there is one OSD per disk). The -default choice when `creating a pool <../pools>`_ is *replicated*, -meaning every object is copied on multiple disks. The `Erasure Code -<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used -instead to save space. - -Creating a sample erasure coded pool ------------------------------------- - -The simplest erasure coded pool is equivalent to `RAID5 -<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and -requires at least three hosts:: - - $ ceph osd pool create ecpool 12 12 erasure - pool 'ecpool' created - $ echo ABCDEFGHI | rados --pool ecpool put NYAN - - $ rados --pool ecpool get NYAN - - ABCDEFGHI - -.. note:: the 12 in *pool create* stands for - `the number of placement groups <../pools>`_. - -Erasure code profiles ---------------------- - -The default erasure code profile sustains the loss of a single OSD. It -is equivalent to a replicated pool of size two but requires 1.5TB -instead of 2TB to store 1TB of data. 
The default profile can be -displayed with:: - - $ ceph osd erasure-code-profile get default - k=2 - m=1 - plugin=jerasure - crush-failure-domain=host - technique=reed_sol_van - -Choosing the right profile is important because it cannot be modified -after the pool is created: a new pool with a different profile needs -to be created and all objects from the previous pool moved to the new one. - -The most important parameters of the profile are *K*, *M* and -*crush-failure-domain* because they define the storage overhead and -the data durability. For instance, if the desired architecture must -sustain the loss of two racks with a storage overhead of 67%, -the following profile can be defined:: - - $ ceph osd erasure-code-profile set myprofile \ - k=3 \ - m=2 \ - crush-failure-domain=rack - $ ceph osd pool create ecpool 12 12 erasure myprofile - $ echo ABCDEFGHI | rados --pool ecpool put NYAN - - $ rados --pool ecpool get NYAN - - ABCDEFGHI - -The *NYAN* object will be divided in three (*K=3*) and two additional -*chunks* will be created (*M=2*). The value of *M* defines how many -OSDs can be lost simultaneously without losing any data. The -*crush-failure-domain=rack* will create a CRUSH ruleset that ensures -no two *chunks* are stored in the same rack. - -.. 
ditaa:: - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +--------+----------+ - | - | - v - +------+------+ - +---------------+ encode(3,2) +-----------+ - | +--+--+---+---+ | - | | | | | - | +-------+ | +-----+ | - | | | | | - +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ - name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | | 5 | - +------+ +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | | QGC | - +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ - | | | | | - | | v | | - | | +--+---+ | | - | | | OSD1 | | | - | | +------+ | | - | | | | - | | +------+ | | - | +------>| OSD2 | | | - | +------+ | | - | | | - | +------+ | | - | | OSD3 |<----+ | - | +------+ | - | | - | +------+ | - | | OSD4 |<--------------+ - | +------+ - | - | +------+ - +----------------->| OSD5 | - +------+ - - -More information can be found in the `erasure code profiles -<../erasure-code-profile>`_ documentation. - - -Erasure Coding with Overwrites ------------------------------- - -By default, erasure coded pools only work with uses like RGW that -perform full object writes and appends. - -Since Luminous, partial writes for an erasure coded pool may be -enabled with a per-pool setting. This lets RBD and Cephfs store their -data in an erasure coded pool:: - - ceph osd pool set ec_pool allow_ec_overwrites true - -This can only be enabled on a pool residing on bluestore OSDs, since -bluestore's checksumming is used to detect bitrot or other corruption -during deep-scrub. In addition to being unsafe, using filestore with -ec overwrites yields low performance compared to bluestore. - -Erasure coded pools do not support omap, so to use them with RBD and -Cephfs you must instruct them to store their data in an ec pool, and -their metadata in a replicated pool. 
For RBD, this means using the -erasure coded pool as the ``--data-pool`` during image creation:: - - rbd create --size 1G --data-pool ec_pool replicated_pool/image_name - -For Cephfs, using an erasure coded pool means setting that pool in -a `file layout <../../../cephfs/file-layouts>`_. - - -Erasure coded pool and cache tiering ------------------------------------- - -Erasure coded pools require more resources than replicated pools and -lack some functionalities such as omap. To overcome these -limitations, one can set up a `cache tier <../cache-tiering>`_ -before the erasure coded pool. - -For instance, if the pool *hot-storage* is made of fast storage:: - - $ ceph osd tier add ecpool hot-storage - $ ceph osd tier cache-mode hot-storage writeback - $ ceph osd tier set-overlay ecpool hot-storage - -will place the *hot-storage* pool as tier of *ecpool* in *writeback* -mode so that every write and read to the *ecpool* are actually using -the *hot-storage* and benefit from its flexibility and speed. - -More information can be found in the `cache tiering -<../cache-tiering>`_ documentation. - -Glossary --------- - -*chunk* - when the encoding function is called, it returns chunks of the same - size. Data chunks which can be concatenated to reconstruct the original - object and coding chunks which can be used to rebuild a lost chunk. - -*K* - the number of data *chunks*, i.e. the number of *chunks* in which the - original object is divided. For instance if *K* = 2 a 10KB object - will be divided into *K* objects of 5KB each. - -*M* - the number of coding *chunks*, i.e. the number of additional *chunks* - computed by the encoding functions. If there are 2 coding *chunks*, - it means 2 OSDs can be out without losing data. - - -Table of content ----------------- - -.. 
toctree:: - :maxdepth: 1 - - erasure-code-profile - erasure-code-jerasure - erasure-code-isa - erasure-code-lrc - erasure-code-shec diff --git a/src/ceph/doc/rados/operations/health-checks.rst b/src/ceph/doc/rados/operations/health-checks.rst deleted file mode 100644 index c1e2200..0000000 --- a/src/ceph/doc/rados/operations/health-checks.rst +++ /dev/null @@ -1,527 +0,0 @@ - -============= -Health checks -============= - -Overview -======== - -There is a finite set of possible health messages that a Ceph cluster can -raise -- these are defined as *health checks* which have unique identifiers. - -The identifier is a terse pseudo-human-readable (i.e. like a variable name) -string. It is intended to enable tools (such as UIs) to make sense of -health checks, and present them in a way that reflects their meaning. - -This page lists the health checks that are raised by the monitor and manager -daemons. In addition to these, you may also see health checks that originate -from MDS daemons (see :doc:`/cephfs/health-messages`), and health checks -that are defined by ceph-mgr python modules. - -Definitions -=========== - - -OSDs ----- - -OSD_DOWN -________ - -One or more OSDs are marked down. The ceph-osd daemon may have been -stopped, or peer OSDs may be unable to reach the OSD over the network. -Common causes include a stopped or crashed daemon, a down host, or a -network outage. - -Verify the host is healthy, the daemon is started, and network is -functioning. If the daemon has crashed, the daemon log file -(``/var/log/ceph/ceph-osd.*``) may contain debugging information. - -OSD_<crush type>_DOWN -_____________________ - -(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN) - -All the OSDs within a particular CRUSH subtree are marked down, for example -all OSDs on a host. - -OSD_ORPHAN -__________ - -An OSD is referenced in the CRUSH map hierarchy but does not exist. 
- -The OSD can be removed from the CRUSH hierarchy with:: - - ceph osd crush rm osd.<id> - -OSD_OUT_OF_ORDER_FULL -_____________________ - -The utilization thresholds for `nearfull`, `backfillfull`, `full`, -and/or `failsafe_full` are not ascending. In particular, we expect -`nearfull < backfillfull`, `backfillfull < full`, and `full < -failsafe_full`. - -The thresholds can be adjusted with:: - - ceph osd set-nearfull-ratio <ratio> - ceph osd set-backfillfull-ratio <ratio> - ceph osd set-full-ratio <ratio> - - -OSD_FULL -________ - -One or more OSDs has exceeded the `full` threshold and is preventing -the cluster from servicing writes. - -Utilization by pool can be checked with:: - - ceph df - -The currently defined `full` ratio can be seen with:: - - ceph osd dump | grep full_ratio - -A short-term workaround to restore write availability is to raise the full -threshold by a small amount:: - - ceph osd set-full-ratio <ratio> - -New storage should be added to the cluster by deploying more OSDs or -existing data should be deleted in order to free up space. - -OSD_BACKFILLFULL -________________ - -One or more OSDs has exceeded the `backfillfull` threshold, which will -prevent data from being allowed to rebalance to this device. This is -an early warning that rebalancing may not be able to complete and that -the cluster is approaching full. - -Utilization by pool can be checked with:: - - ceph df - -OSD_NEARFULL -____________ - -One or more OSDs has exceeded the `nearfull` threshold. This is an early -warning that the cluster is approaching full. - -Utilization by pool can be checked with:: - - ceph df - -OSDMAP_FLAGS -____________ - -One or more cluster flags of interest has been set. 
These flags include: - -* *full* - the cluster is flagged as full and cannot service writes -* *pauserd*, *pausewr* - paused reads or writes -* *noup* - OSDs are not allowed to start -* *nodown* - OSD failure reports are being ignored, such that the - monitors will not mark OSDs `down` -* *noin* - OSDs that were previously marked `out` will not be marked - back `in` when they start -* *noout* - down OSDs will not automatically be marked out after the - configured interval -* *nobackfill*, *norecover*, *norebalance* - recovery or data - rebalancing is suspended -* *noscrub*, *nodeep_scrub* - scrubbing is disabled -* *notieragent* - cache tiering activity is suspended - -With the exception of *full*, these flags can be set or cleared with:: - - ceph osd set <flag> - ceph osd unset <flag> - -OSD_FLAGS -_________ - -One or more OSDs has a per-OSD flag of interest set. These flags include: - -* *noup*: OSD is not allowed to start -* *nodown*: failure reports for this OSD will be ignored -* *noin*: if this OSD was previously marked `out` automatically - after a failure, it will not be marked in when it starts -* *noout*: if this OSD is down it will not automatically be marked - `out` after the configured interval - -Per-OSD flags can be set and cleared with:: - - ceph osd add-<flag> <osd-id> - ceph osd rm-<flag> <osd-id> - -For example, :: - - ceph osd rm-nodown osd.123 - -OLD_CRUSH_TUNABLES -__________________ - -The CRUSH map is using very old settings and should be updated. The -oldest tunables that can be used (i.e., the oldest client version that -can connect to the cluster) without triggering this health warning is -determined by the ``mon_crush_min_required_version`` config option. -See :doc:`/rados/operations/crush-map/#tunables` for more information. - -OLD_CRUSH_STRAW_CALC_VERSION -____________________________ - -The CRUSH map is using an older, non-optimal method for calculating -intermediate weight values for ``straw`` buckets. 
- -The CRUSH map should be updated to use the newer method -(``straw_calc_version=1``). See -:doc:`/rados/operations/crush-map/#tunables` for more information. - -CACHE_POOL_NO_HIT_SET -_____________________ - -One or more cache pools is not configured with a *hit set* to track -utilization, which will prevent the tiering agent from identifying -cold objects to flush and evict from the cache. - -Hit sets can be configured on the cache pool with:: - - ceph osd pool set <poolname> hit_set_type <type> - ceph osd pool set <poolname> hit_set_period <period-in-seconds> - ceph osd pool set <poolname> hit_set_count <number-of-hitsets> - ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> - -OSD_NO_SORTBITWISE -__________________ - -No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not -been set. - -The ``sortbitwise`` flag must be set before luminous v12.y.z or newer -OSDs can start. You can safely set the flag with:: - - ceph osd set sortbitwise - -POOL_FULL -_________ - -One or more pools has reached its quota and is no longer allowing writes. - -Pool quotas and utilization can be seen with:: - - ceph df detail - -You can either raise the pool quota with:: - - ceph osd pool set-quota <poolname> max_objects <num-objects> - ceph osd pool set-quota <poolname> max_bytes <num-bytes> - -or delete some existing data to reduce utilization. - - -Data health (pools & placement groups) --------------------------------------- - -PG_AVAILABILITY -_______________ - -Data availability is reduced, meaning that the cluster is unable to -service potential read or write requests for some data in the cluster. -Specifically, one or more PGs is in a state that does not allow IO -requests to be serviced. Problematic PG states include *peering*, -*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear -quickly). 
- -Detailed information about which PGs are affected is available from:: - - ceph health detail - -In most cases the root cause is that one or more OSDs is currently -down; see the discussion for ``OSD_DOWN`` above. - -The state of specific problematic PGs can be queried with:: - - ceph tell <pgid> query - -PG_DEGRADED -___________ - -Data redundancy is reduced for some data, meaning the cluster does not -have the desired number of replicas for all data (for replicated -pools) or erasure code fragments (for erasure coded pools). -Specifically, one or more PGs: - -* has the *degraded* or *undersized* flag set, meaning there are not - enough instances of that placement group in the cluster; -* has not had the *clean* flag set for some time. - -Detailed information about which PGs are affected is available from:: - - ceph health detail - -In most cases the root cause is that one or more OSDs is currently -down; see the discussion for ``OSD_DOWN`` above. - -The state of specific problematic PGs can be queried with:: - - ceph tell <pgid> query - - -PG_DEGRADED_FULL -________________ - -Data redundancy may be reduced or at risk for some data due to a lack -of free space in the cluster. Specifically, one or more PGs has the -*backfill_toofull* or *recovery_toofull* flag set, meaning that the -cluster is unable to migrate or recover data because one or more OSDs -is above the *backfillfull* threshold. - -See the discussion for *OSD_BACKFILLFULL* or *OSD_FULL* above for -steps to resolve this condition. - -PG_DAMAGED -__________ - -Data scrubbing has discovered some problems with data consistency in -the cluster. Specifically, one or more PGs has the *inconsistent* or -*snaptrim_error* flag set, indicating an earlier scrub operation -found a problem, or has the *repair* flag set, meaning a repair -for such an inconsistency is currently in progress. - -See :doc:`pg-repair` for more information. 
- -OSD_SCRUB_ERRORS -________________ - -Recent OSD scrubs have uncovered inconsistencies. This error is generally -paired with *PG_DAMAGED* (see above). - -See :doc:`pg-repair` for more information. - -CACHE_POOL_NEAR_FULL -____________________ - -A cache tier pool is nearly full. Full in this context is determined -by the ``target_max_bytes`` and ``target_max_objects`` properties on -the cache pool. Once the pool reaches the target threshold, write -requests to the pool may block while data is flushed and evicted -from the cache, a state that normally leads to very high latencies and -poor performance. - -The cache pool target size can be adjusted with:: - - ceph osd pool set <cache-pool-name> target_max_bytes <bytes> - ceph osd pool set <cache-pool-name> target_max_objects <objects> - -Normal cache flush and evict activity may also be throttled due to reduced -availability or performance of the base tier, or overall cluster load. - -TOO_FEW_PGS -___________ - -The number of PGs in use in the cluster is below the configurable -threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead -to suboptimal distribution and balance of data across the OSDs in -the cluster, and similarly reduce overall performance. - -This may be an expected condition if data pools have not yet been -created. - -The PG count for existing pools can be increased or new pools can be -created. Please refer to -:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for -more information. - -TOO_MANY_PGS -____________ - -The number of PGs in use in the cluster is above the configurable -threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is -exceeded, the cluster will not allow new pools to be created, pool `pg_num` to -be increased, or pool replication to be increased (any of which would lead to -more PGs in the cluster). 
A large number of PGs can lead -to higher memory utilization for OSD daemons, slower peering after -cluster state changes (like OSD restarts, additions, or removals), and -higher load on the Manager and Monitor daemons. - -The simplest way to mitigate the problem is to increase the number of -OSDs in the cluster by adding more hardware. Note that the OSD count -used for the purposes of this health check is the number of "in" OSDs, -so marking "out" OSDs "in" (if there are any) can also help:: - - ceph osd in <osd id(s)> - -Please refer to -:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for -more information. - -SMALLER_PGP_NUM -_______________ - -One or more pools has a ``pgp_num`` value less than ``pg_num``. This -is normally an indication that the PG count was increased without -also increasing the placement behavior. - -This is sometimes done deliberately to separate out the `split` step -when the PG count is adjusted from the data migration that is needed -when ``pgp_num`` is changed. - -This is normally resolved by setting ``pgp_num`` to match ``pg_num``, -triggering the data migration, with:: - - ceph osd pool set <pool> pgp_num <pg-num-value> - -MANY_OBJECTS_PER_PG -___________________ - -One or more pools has an average number of objects per PG that is -significantly higher than the overall cluster average. The specific -threshold is controlled by the ``mon_pg_warn_max_object_skew`` -configuration value. - -This is usually an indication that the pool(s) containing most of the -data in the cluster have too few PGs, and/or that other pools that do -not contain as much data have too many PGs. See the discussion of -*TOO_MANY_PGS* above. - -The threshold can be raised to silence the health warning by adjusting -the ``mon_pg_warn_max_object_skew`` config option on the monitors. - -POOL_APP_NOT_ENABLED -____________________ - -A pool exists that contains one or more objects but has not been -tagged for use by a particular application. 
- -Resolve this warning by labeling the pool for use by an application. For -example, if the pool is used by RBD,:: - - rbd pool init <poolname> - -If the pool is being used by a custom application 'foo', you can also label -via the low-level command:: - - ceph osd pool application enable <poolname> foo - -For more information, see :doc:`pools.rst#associate-pool-to-application`. - -POOL_FULL -_________ - -One or more pools has reached (or is very close to reaching) its -quota. The threshold to trigger this error condition is controlled by -the ``mon_pool_quota_crit_threshold`` configuration option. - -Pool quotas can be adjusted up or down (or removed) with:: - - ceph osd pool set-quota <pool> max_bytes <bytes> - ceph osd pool set-quota <pool> max_objects <objects> - -Setting the quota value to 0 will disable the quota. - -POOL_NEAR_FULL -______________ - -One or more pools is approaching its quota. The threshold to trigger -this warning condition is controlled by the -``mon_pool_quota_warn_threshold`` configuration option. - -Pool quotas can be adjusted up or down (or removed) with:: - - ceph osd pool set-quota <pool> max_bytes <bytes> - ceph osd pool set-quota <pool> max_objects <objects> - -Setting the quota value to 0 will disable the quota. - -OBJECT_MISPLACED -________________ - -One or more objects in the cluster is not stored on the node the -cluster would like it to be stored on. This is an indication that -data migration due to some recent cluster change has not yet completed. - -Misplaced data is not a dangerous condition in and of itself; data -consistency is never at risk, and old copies of objects are never -removed until the desired number of new copies (in the desired -locations) are present. - -OBJECT_UNFOUND -______________ - -One or more objects in the cluster cannot be found. Specifically, the -OSDs know that a new or updated copy of an object should exist, but a -copy of that version of the object has not been found on OSDs that are -currently online. 
- -Read or write requests to unfound objects will block. - -Ideally, a down OSD can be brought back online that has the more -recent copy of the unfound object. Candidate OSDs can be identified from the -peering state for the PG(s) responsible for the unfound object:: - - ceph tell <pgid> query - -If the latest copy of the object is not available, the cluster can be -told to roll back to a previous version of the object. See -:doc:`troubleshooting-pg#Unfound-objects` for more information. - -REQUEST_SLOW -____________ - -One or more OSD requests is taking a long time to process. This can -be an indication of extreme load, a slow storage device, or a software -bug. - -The request queue on the OSD(s) in question can be queried with the -following command, executed from the OSD host:: - - ceph daemon osd.<id> ops - -A summary of the slowest recent requests can be seen with:: - - ceph daemon osd.<id> dump_historic_ops - -The location of an OSD can be found with:: - - ceph osd find osd.<id> - -REQUEST_STUCK -_____________ - -One or more OSD requests has been blocked for an extremely long time. -This is an indication that either the cluster has been unhealthy for -an extended period of time (e.g., not enough running OSDs) or there is -some internal problem with the OSD. See the discussion of -*REQUEST_SLOW* above. - -PG_NOT_SCRUBBED -_______________ - -One or more PGs has not been scrubbed recently. PGs are normally -scrubbed every ``mon_scrub_interval`` seconds, and this warning -triggers when ``mon_warn_not_scrubbed`` such intervals have elapsed -without a scrub. - -PGs will not scrub if they are not flagged as *clean*, which may -happen if they are misplaced or degraded (see *PG_AVAILABILITY* and -*PG_DEGRADED* above). - -You can manually initiate a scrub of a clean PG with:: - - ceph pg scrub <pgid> - -PG_NOT_DEEP_SCRUBBED -____________________ - -One or more PGs has not been deep scrubbed recently. 
PGs are normally -deep scrubbed every ``osd_deep_scrub_interval`` seconds, and this warning -triggers when ``mon_warn_not_deep_scrubbed`` such intervals have elapsed -without a deep scrub. - -PGs will not (deep) scrub if they are not flagged as *clean*, which may -happen if they are misplaced or degraded (see *PG_AVAILABILITY* and -*PG_DEGRADED* above). - -You can manually initiate a deep scrub of a clean PG with:: - - ceph pg deep-scrub <pgid> diff --git a/src/ceph/doc/rados/operations/index.rst b/src/ceph/doc/rados/operations/index.rst deleted file mode 100644 index aacf764..0000000 --- a/src/ceph/doc/rados/operations/index.rst +++ /dev/null @@ -1,90 +0,0 @@ -==================== - Cluster Operations -==================== - -.. raw:: html - - <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3> - -High-level cluster operations consist primarily of starting, stopping, and -restarting a cluster with the ``ceph`` service; checking the cluster's health; -and, monitoring an operating cluster. - -.. toctree:: - :maxdepth: 1 - - operating - health-checks - monitoring - monitoring-osd-pg - user-management - -.. raw:: html - - </td><td><h3>Data Placement</h3> - -Once you have your cluster up and running, you may begin working with data -placement. Ceph supports petabyte-scale data storage clusters, with storage -pools and placement groups that distribute data across the cluster using Ceph's -CRUSH algorithm. - -.. toctree:: - :maxdepth: 1 - - data-placement - pools - erasure-code - cache-tiering - placement-groups - upmap - crush-map - crush-map-edits - - - -.. raw:: html - - </td></tr><tr><td><h3>Low-level Operations</h3> - -Low-level cluster operations consist of starting, stopping, and restarting a -particular daemon within a cluster; changing the settings of a particular -daemon or subsystem; and, adding a daemon to the cluster or removing a daemon -from the cluster.
The most common use cases for low-level operations include -growing or shrinking the Ceph cluster and replacing legacy or failed hardware -with new hardware. - -.. toctree:: - :maxdepth: 1 - - add-or-rm-osds - add-or-rm-mons - Command Reference <control> - - - -.. raw:: html - - </td><td><h3>Troubleshooting</h3> - -Ceph is still on the leading edge, so you may encounter situations that require -you to evaluate your Ceph configuration and modify your logging and debugging -settings to identify and remedy issues you are encountering with your cluster. - -.. toctree:: - :maxdepth: 1 - - ../troubleshooting/community - ../troubleshooting/troubleshooting-mon - ../troubleshooting/troubleshooting-osd - ../troubleshooting/troubleshooting-pg - ../troubleshooting/log-and-debug - ../troubleshooting/cpu-profiling - ../troubleshooting/memory-profiling - - - - -.. raw:: html - - </td></tr></tbody></table> - diff --git a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst deleted file mode 100644 index 0107e34..0000000 --- a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst +++ /dev/null @@ -1,617 +0,0 @@ -========================= - Monitoring OSDs and PGs -========================= - -High availability and high reliability require a fault-tolerant approach to -managing hardware and software issues. Ceph has no single point-of-failure, and -can service requests for data in a "degraded" mode. Ceph's `data placement`_ -introduces a layer of indirection to ensure that data doesn't bind directly to -particular OSD addresses. This means that tracking down system faults requires -finding the `placement group`_ and the underlying OSDs at root of the problem. - -.. tip:: A fault in one part of the cluster may prevent you from accessing a - particular object, but that doesn't mean that you cannot access other objects. - When you run into a fault, don't panic. Just follow the steps for monitoring - your OSDs and placement groups. 
Then, begin troubleshooting. - -Ceph is generally self-repairing. However, when problems persist, monitoring -OSDs and placement groups will help you identify the problem. - - -Monitoring OSDs -=============== - -An OSD's status is either in the cluster (``in``) or out of the cluster -(``out``); and, it is either up and running (``up``), or it is down and not -running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster -(you can read and write data) or it is ``out`` of the cluster. If it was -``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate -placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will -not assign placement groups to the OSD. If an OSD is ``down``, it should also be -``out``. - -.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster - will not be in a healthy state. - -.. ditaa:: +----------------+ +----------------+ - | | | | - | OSD #n In | | OSD #n Up | - | | | | - +----------------+ +----------------+ - ^ ^ - | | - | | - v v - +----------------+ +----------------+ - | | | | - | OSD #n Out | | OSD #n Down | - | | | | - +----------------+ +----------------+ - -If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, -you may notice that the cluster does not always echo back ``HEALTH OK``. Don't -panic. With respect to OSDs, you should expect that the cluster will **NOT** -echo ``HEALTH OK`` in a few expected circumstances: - -#. You haven't started the cluster yet (it won't respond). -#. You have just started or restarted the cluster and it's not ready yet, - because the placement groups are getting created and the OSDs are in - the process of peering. -#. You just added or removed an OSD. -#. You have just modified your cluster map. - -An important aspect of monitoring OSDs is to ensure that when the cluster -is up and running, all OSDs that are ``in`` the cluster are ``up`` and -running, too.
To see if all OSDs are running, execute:: - - ceph osd stat - -The result should tell you the map epoch (eNNNN), the total number of OSDs (x), -how many are ``up`` (y) and how many are ``in`` (z). :: - - eNNNN: x osds: y up, z in - -If the number of OSDs that are ``in`` the cluster is more than the number of -OSDs that are ``up``, execute the following command to identify the ``ceph-osd`` -daemons that are not running:: - - ceph osd tree - -:: - - dumped osdmap tree epoch 1 - # id weight type name up/down reweight - -1 2 pool openstack - -3 2 rack dell-2950-rack-A - -2 2 host dell-2950-A1 - 0 1 osd.0 up 1 - 1 1 osd.1 down 1 - - -.. tip:: The ability to search through a well-designed CRUSH hierarchy may help - you troubleshoot your cluster by identifying the physical locations faster. - -If an OSD is ``down``, start it:: - - sudo systemctl start ceph-osd@1 - -See `OSD Not Running`_ for problems associated with OSDs that stopped, or won't -restart. - - -PG Sets -======= - -When CRUSH assigns placement groups to OSDs, it looks at the number of replicas -for the pool and assigns the placement group to OSDs such that each replica of -the placement group gets assigned to a different OSD. For example, if the pool -requires three replicas of a placement group, CRUSH may assign them to -``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a -pseudo-random placement that will take into account failure domains you set in -your `CRUSH map`_, so you will rarely see placement groups assigned to nearest -neighbor OSDs in a large cluster. We refer to the set of OSDs that should -contain the replicas of a particular placement group as the **Acting Set**. In -some cases, an OSD in the Acting Set is ``down`` or otherwise not able to -service requests for objects in the placement group. When these situations -arise, don't panic. Common examples include: - -- You added or removed an OSD.
Then, CRUSH reassigned the placement group to - other OSDs--thereby changing the composition of the Acting Set and spawning - the migration of data with a "backfill" process. -- An OSD was ``down``, was restarted, and is now ``recovering``. -- An OSD in the Acting Set is ``down`` or unable to service requests, - and another OSD has temporarily assumed its duties. - -Ceph processes a client request using the **Up Set**, which is the set of OSDs -that will actually handle the requests. In most cases, the Up Set and the Acting -Set are virtually identical. When they are not, it may indicate that Ceph is -migrating data, an OSD is recovering, or that there is a problem (i.e., Ceph -usually echoes a "HEALTH WARN" state with a "stuck stale" message in such -scenarios). - -To retrieve a list of placement groups, execute:: - - ceph pg dump - -To view which OSDs are within the Acting Set or the Up Set for a given placement -group, execute:: - - ceph pg map {pg-num} - -The result should tell you the osdmap epoch (eNNN), the placement group number -({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set -(acting[]). :: - - osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2] - -.. note:: If the Up Set and Acting Set do not match, this may be an indicator - that the cluster is rebalancing itself or of a potential problem with - the cluster. - - -Peering -======= - -Before you can write data to a placement group, it must be in an ``active`` -state, and it **should** be in a ``clean`` state. For Ceph to determine the -current state of a placement group, the primary OSD of the placement group -(i.e., the first OSD in the acting set) peers with the secondary and tertiary -OSDs to establish agreement on the current state of the placement group -(assuming a pool with 3 replicas of the PG). - - -.. 
ditaa:: +---------+ +---------+ +-------+ - | OSD 1 | | OSD 2 | | OSD 3 | - +---------+ +---------+ +-------+ - | | | - | Request To | | - | Peer | | - |-------------->| | - |<--------------| | - | Peering | - | | - | Request To | - | Peer | - |----------------------------->| - |<-----------------------------| - | Peering | - -The OSDs also report their status to the monitor. See `Configuring Monitor/OSD -Interaction`_ for details. To troubleshoot peering issues, see `Peering -Failure`_. - - -Monitoring Placement Group States -================================= - -If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, -you may notice that the cluster does not always echo back ``HEALTH OK``. After -you check to see if the OSDs are running, you should also check placement group -states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a -number of placement group peering-related circumstances: - -#. You have just created a pool and placement groups haven't peered yet. -#. The placement groups are recovering. -#. You have just added an OSD to or removed an OSD from the cluster. -#. You have just modified your CRUSH map and your placement groups are migrating. -#. There is inconsistent data in different replicas of a placement group. -#. Ceph is scrubbing a placement group's replicas. -#. Ceph doesn't have enough storage capacity to complete backfilling operations. - -If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't -panic. In many cases, the cluster will recover on its own. In some cases, you -may need to take action. An important aspect of monitoring placement groups is -to ensure that when the cluster is up and running that all placement groups are -``active``, and preferably in the ``clean`` state. 
To see the status of all -placement groups, execute:: - - ceph pg stat - -The result should tell you the placement group map version (vNNNNNN), the total -number of placement groups (x), and how many placement groups are in a -particular state such as ``active+clean`` (y). :: - - vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail - -.. note:: It is common for Ceph to report multiple states for placement groups. - -In addition to the placement group states, Ceph will also echo back the amount -of data used (aa), the amount of storage capacity remaining (bb), and the total -storage capacity for the placement group. These numbers can be important in a -few cases: - -- You are reaching your ``near full ratio`` or ``full ratio``. -- Your data is not getting distributed across the cluster due to an - error in your CRUSH configuration. - - -.. topic:: Placement Group IDs - - Placement group IDs consist of the pool number (not pool name) followed - by a period (.) and the placement group ID--a hexadecimal number. You - can view pool numbers and their names from the output of ``ceph osd - lspools``. For example, the default pool ``rbd`` corresponds to - pool number ``0``. A fully qualified placement group ID has the - following form:: - - {pool-num}.{pg-id} - - And it typically looks like this:: - - 0.1f - - -To retrieve a list of placement groups, execute the following:: - - ceph pg dump - -You can also format the output in JSON format and save it to a file:: - - ceph pg dump -o {filename} --format=json - -To query a particular placement group, execute the following:: - - ceph pg {poolnum}.{pg-id} query - -Ceph will output the query in JSON format. - -.. 
code-block:: javascript - - { - "state": "active+clean", - "up": [ - 1, - 0 - ], - "acting": [ - 1, - 0 - ], - "info": { - "pgid": "1.e", - "last_update": "4'1", - "last_complete": "4'1", - "log_tail": "0'0", - "last_backfill": "MAX", - "purged_snaps": "[]", - "history": { - "epoch_created": 1, - "last_epoch_started": 537, - "last_epoch_clean": 537, - "last_epoch_split": 534, - "same_up_since": 536, - "same_interval_since": 536, - "same_primary_since": 536, - "last_scrub": "4'1", - "last_scrub_stamp": "2013-01-25 10:12:23.828174" - }, - "stats": { - "version": "4'1", - "reported": "536'782", - "state": "active+clean", - "last_fresh": "2013-01-25 10:12:23.828271", - "last_change": "2013-01-25 10:12:23.828271", - "last_active": "2013-01-25 10:12:23.828271", - "last_clean": "2013-01-25 10:12:23.828271", - "last_unstale": "2013-01-25 10:12:23.828271", - "mapping_epoch": 535, - "log_start": "0'0", - "ondisk_log_start": "0'0", - "created": 1, - "last_epoch_clean": 1, - "parent": "0.0", - "parent_split_bits": 0, - "last_scrub": "4'1", - "last_scrub_stamp": "2013-01-25 10:12:23.828174", - "log_size": 128, - "ondisk_log_size": 128, - "stat_sum": { - "num_bytes": 205, - "num_objects": 1, - "num_object_clones": 0, - "num_object_copies": 0, - "num_objects_missing_on_primary": 0, - "num_objects_degraded": 0, - "num_objects_unfound": 0, - "num_read": 1, - "num_read_kb": 0, - "num_write": 3, - "num_write_kb": 1 - }, - "stat_cat_sum": { - - }, - "up": [ - 1, - 0 - ], - "acting": [ - 1, - 0 - ] - }, - "empty": 0, - "dne": 0, - "incomplete": 0 - }, - "recovery_state": [ - { - "name": "Started\/Primary\/Active", - "enter_time": "2013-01-23 09:35:37.594691", - "might_have_unfound": [ - - ], - "scrub": { - "scrub_epoch_start": "536", - "scrub_active": 0, - "scrub_block_writes": 0, - "finalizing_scrub": 0, - "scrub_waiting_on": 0, - "scrub_waiting_on_whom": [ - - ] - } - }, - { - "name": "Started", - "enter_time": "2013-01-23 09:35:31.581160" - } - ] - } - - - -The following subsections 
describe common states in greater detail. - -Creating --------- - -When you create a pool, it will create the number of placement groups you -specified. Ceph will echo ``creating`` when it is creating one or more -placement groups. Once they are created, the OSDs that are part of a placement -group's Acting Set will peer. Once peering is complete, the placement group -status should be ``active+clean``, which means a Ceph client can begin writing -to the placement group. - -.. ditaa:: - - /-----------\ /-----------\ /-----------\ - | Creating |------>| Peering |------>| Active | - \-----------/ \-----------/ \-----------/ - -Peering -------- - -When Ceph is Peering a placement group, Ceph is bringing the OSDs that -store the replicas of the placement group into **agreement about the state** -of the objects and metadata in the placement group. When Ceph completes peering, -this means that the OSDs that store the placement group agree about the current -state of the placement group. However, completion of the peering process does -**NOT** mean that each replica has the latest contents. - -.. topic:: Authoritative History - - Ceph will **NOT** acknowledge a write operation to a client, until - all OSDs of the acting set persist the write operation. This practice - ensures that at least one member of the acting set will have a record - of every acknowledged write operation since the last successful - peering operation. - - With an accurate record of each acknowledged write operation, Ceph can - construct and disseminate a new authoritative history of the placement - group--a complete, and fully ordered set of operations that, if performed, - would bring an OSD’s copy of a placement group up to date. - - -Active ------- - -Once Ceph completes the peering process, a placement group may become -``active``. The ``active`` state means that the data in the placement group is -generally available in the primary placement group and the replicas for read -and write operations.
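As the examples in this document show (``active+clean``, ``active+degraded``), Ceph reports a placement group's state as several state names joined by ``+``. The following is a minimal sketch, not part of Ceph, of how a monitoring script might split such a string and check for the ``active`` and ``clean`` components. The function names ``parse_pg_state`` and ``is_active_clean`` are hypothetical and chosen only for illustration.

```python
def parse_pg_state(state):
    """Split a combined PG state such as 'active+clean' into a set of names.

    Assumption: individual states are joined by '+', as in the
    'active+clean' and 'active+degraded' examples in this document.
    """
    return {part for part in state.strip().split("+") if part}


def is_active_clean(state):
    """Return True when the state contains both 'active' and 'clean'."""
    return {"active", "clean"} <= parse_pg_state(state)


if __name__ == "__main__":
    for example in ("active+clean", "active+degraded", "creating"):
        print(example, sorted(parse_pg_state(example)), is_active_clean(example))
```

A script could apply ``is_active_clean`` to the ``state`` field of the JSON returned by a placement group query to flag groups that need attention.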
- - -Clean ------ - -When a placement group is in the ``clean`` state, the primary OSD and the -replica OSDs have successfully peered and there are no stray replicas for the -placement group. Ceph replicated all objects in the placement group the correct -number of times. - - -Degraded --------- - -When a client writes an object to the primary OSD, the primary OSD is -responsible for writing the replicas to the replica OSDs. After the primary OSD -writes the object to storage, the placement group will remain in a ``degraded`` -state until the primary OSD has received an acknowledgement from the replica -OSDs that Ceph created the replica objects successfully. - -The reason a placement group can be ``active+degraded`` is that an OSD may be -``active`` even though it doesn't hold all of the objects yet. If an OSD goes -``down``, Ceph marks each placement group assigned to the OSD as ``degraded``. -The OSDs must peer again when the OSD comes back online. However, a client can -still write a new object to a ``degraded`` placement group if it is ``active``. - -If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the -``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD -to another OSD. The time between being marked ``down`` and being marked ``out`` -is controlled by ``mon osd down out interval``, which is set to ``600`` seconds -by default. - -A placement group can also be ``degraded``, because Ceph cannot find one or more -objects that Ceph thinks should be in the placement group. While you cannot -read or write to unfound objects, you can still access all of the other objects -in the ``degraded`` placement group. - - -Recovering ----------- - -Ceph was designed for fault-tolerance at a scale where hardware and software -problems are ongoing. When an OSD goes ``down``, its contents may fall behind -the current state of other replicas in the placement groups. 
When the OSD is -back ``up``, the contents of the placement groups must be updated to reflect the -current state. During that time period, the OSD may reflect a ``recovering`` -state. - -Recovery is not always trivial, because a hardware failure might cause a -cascading failure of multiple OSDs. For example, a network switch for a rack or -cabinet may fail, which can cause the OSDs of a number of host machines to fall -behind the current state of the cluster. Each one of the OSDs must recover once -the fault is resolved. - -Ceph provides a number of settings to balance the resource contention between -new service requests and the need to recover data objects and restore the -placement groups to the current state. The ``osd recovery delay start`` setting -allows an OSD to restart, re-peer and even process some replay requests before -starting the recovery process. The ``osd -recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail, -restart and re-peer at staggered rates. The ``osd recovery max active`` setting -limits the number of recovery requests an OSD will entertain simultaneously to -prevent the OSD from failing to serve requests. The ``osd recovery max chunk`` setting -limits the size of the recovered data chunks to prevent network congestion. - - -Back Filling ------------- - -When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs -in the cluster to the newly added OSD. Forcing the new OSD to accept the -reassigned placement groups immediately can put excessive load on the new OSD. -Back filling the OSD with the placement groups allows this process to begin in -the background. Once backfilling is complete, the new OSD will begin serving -requests when it is ready.
- -During the backfill operations, you may see one of several states: -``backfill_wait`` indicates that a backfill operation is pending, but is not -underway yet; ``backfill`` indicates that a backfill operation is underway; -and, ``backfill_too_full`` indicates that a backfill operation was requested, -but couldn't be completed due to insufficient storage capacity. When a -placement group cannot be backfilled, it may be considered ``incomplete``. - -Ceph provides a number of settings to manage the load spike associated with -reassigning placement groups to an OSD (especially a new OSD). By default, -``osd_max_backfills`` sets the maximum number of concurrent backfills to or from -an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a -backfill request if the OSD is approaching its full ratio (90%, by default); -this ratio can be changed with the ``ceph osd set-backfillfull-ratio`` command. -If an OSD refuses a backfill request, the ``osd backfill retry interval`` -enables an OSD to retry the request (after 10 seconds, by default). OSDs can -also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan -intervals (64 and 512, by default). - - -Remapped --------- - -When the Acting Set that services a placement group changes, the data migrates -from the old acting set to the new acting set. It may take some time for a new -primary OSD to service requests. So it may ask the old primary to continue to -service requests until the placement group migration is complete. Once data -migration completes, the mapping uses the primary OSD of the new acting set. - - -Stale ------ - -While Ceph uses heartbeats to ensure that hosts and daemons are running, the -``ceph-osd`` daemons may also get into a ``stuck`` state where they are not -reporting statistics in a timely manner (e.g., a temporary network fault).
By -default, OSD daemons report their placement group, up thru, boot and failure -statistics every half second (i.e., ``0.5``), which is more frequent than the -heartbeat thresholds. If the **Primary OSD** of a placement group's acting set -fails to report to the monitor or if other OSDs have reported the primary OSD -``down``, the monitors will mark the placement group ``stale``. - -When you start your cluster, it is common to see the ``stale`` state until -the peering process completes. After your cluster has been running for awhile, -seeing placement groups in the ``stale`` state indicates that the primary OSD -for those placement groups is ``down`` or not reporting placement group statistics -to the monitor. - - -Identifying Troubled PGs -======================== - -As previously noted, a placement group is not necessarily problematic just -because its state is not ``active+clean``. Generally, Ceph's ability to self -repair may not be working when placement groups get stuck. The stuck states -include: - -- **Unclean**: Placement groups contain objects that are not replicated the - desired number of times. They should be recovering. -- **Inactive**: Placement groups cannot process reads or writes because they - are waiting for an OSD with the most up-to-date data to come back ``up``. -- **Stale**: Placement groups are in an unknown state, because the OSDs that - host them have not reported to the monitor cluster in a while (configured - by ``mon osd report timeout``). - -To identify stuck placement groups, execute the following:: - - ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] - -See `Placement Group Subsystem`_ for additional details. To troubleshoot -stuck placement groups, see `Troubleshooting PG Errors`_. - - -Finding an Object Location -========================== - -To store object data in the Ceph Object Store, a Ceph client must: - -#. Set an object name -#. 
Specify a `pool`_ - -The Ceph client retrieves the latest cluster map and the CRUSH algorithm -calculates how to map the object to a `placement group`_, and then calculates -how to assign the placement group to an OSD dynamically. To find the object -location, all you need is the object name and the pool name. For example:: - - ceph osd map {poolname} {object-name} - -.. topic:: Exercise: Locate an Object - - As an exercise, let's create an object. Specify an object name, a path to a - test file containing some object data and a pool name using the - ``rados put`` command on the command line. For example:: - - rados put {object-name} {file-path} --pool=data - rados put test-object-1 testfile.txt --pool=data - - To verify that the Ceph Object Store stored the object, execute the following:: - - rados -p data ls - - Now, identify the object location:: - - ceph osd map {pool-name} {object-name} - ceph osd map data test-object-1 - - Ceph should output the object's location. For example:: - - osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0] - - To remove the test object, simply delete it using the ``rados rm`` command. - For example:: - - rados rm test-object-1 --pool=data - - -As the cluster evolves, the object location may change dynamically. One benefit -of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform -the migration manually. See the `Architecture`_ section for details. - -.. _data placement: ../data-placement -.. _pool: ../pools -.. _placement group: ../placement-groups -.. _Architecture: ../../../architecture -.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running -.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors -.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering -.. _CRUSH map: ../crush-map -.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ -.. 
_Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/src/ceph/doc/rados/operations/monitoring.rst b/src/ceph/doc/rados/operations/monitoring.rst deleted file mode 100644 index c291440..0000000 --- a/src/ceph/doc/rados/operations/monitoring.rst +++ /dev/null @@ -1,351 +0,0 @@ -====================== - Monitoring a Cluster -====================== - -Once you have a running cluster, you may use the ``ceph`` tool to monitor your -cluster. Monitoring a cluster typically involves checking OSD status, monitor -status, placement group status and metadata server status. - -Using the command line -====================== - -Interactive mode ----------------- - -To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line -with no arguments. For example:: - - ceph - ceph> health - ceph> status - ceph> quorum_status - ceph> mon_status - -Non-default paths ------------------ - -If you specified non-default locations for your configuration or keyring, -you may specify their locations:: - - ceph -c /path/to/conf -k /path/to/keyring health - -Checking a Cluster's Status -=========================== - -After you start your cluster, and before you start reading and/or -writing data, check your cluster's status first. - -To check a cluster's status, execute the following:: - - ceph status - -Or:: - - ceph -s - -In interactive mode, type ``status`` and press **Enter**. :: - - ceph> status - -Ceph will print the cluster status. For example, a tiny Ceph demonstration -cluster with one of each service may print the following: - -:: - - cluster: - id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 - health: HEALTH_OK - - services: - mon: 1 daemons, quorum a - mgr: x(active) - mds: 1/1/1 up {0=a=up:active} - osd: 1 osds: 1 up, 1 in - - data: - pools: 2 pools, 16 pgs - objects: 21 objects, 2246 bytes - usage: 546 GB used, 384 GB / 931 GB avail - pgs: 16 active+clean - - -.. 
topic:: How Ceph Calculates Data Usage - - The ``usage`` value reflects the *actual* amount of raw storage used. The - ``xxx GB / xxx GB`` value means the amount available (the lesser number) - of the overall storage capacity of the cluster. The notional number reflects - the size of the stored data before it is replicated, cloned or snapshotted. - Therefore, the amount of data actually stored typically exceeds the notional - amount stored, because Ceph creates replicas of the data and may also use - storage capacity for cloning and snapshotting. - - -Watching a Cluster -================== - -In addition to local logging by each daemon, Ceph clusters maintain -a *cluster log* that records high level events about the whole system. -This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by -default), but can also be monitored via the command line. - -To follow the cluster log, use the following command - -:: - - ceph -w - -Ceph will print the status of the system, followed by each log message as it -is emitted. For example: - -:: - - cluster: - id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 - health: HEALTH_OK - - services: - mon: 1 daemons, quorum a - mgr: x(active) - mds: 1/1/1 up {0=a=up:active} - osd: 1 osds: 1 up, 1 in - - data: - pools: 2 pools, 16 pgs - objects: 21 objects, 2246 bytes - usage: 546 GB used, 384 GB / 931 GB avail - pgs: 16 active+clean - - - 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot - 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x - 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available - - -In addition to using ``ceph -w`` to print log lines as they are emitted, -use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster -log. 
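The cluster log lines shown above follow a regular layout: a timestamp, a block of source information, a ``:`` separator, a channel name, a bracketed severity such as ``[INF]``, and the message text. The following is a rough sketch of how such a line could be split for further processing. The field names (``timestamp``, ``source``, ``channel``, ``level``, ``message``) are assumptions inferred from the sample output here, not an official Ceph log schema, and ``parse_log_line`` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines in this document; the group names
# are illustrative assumptions rather than an official Ceph format.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<source>.*?) : "
    r"(?P<channel>\S+) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields for one cluster log line, or None if no match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the textual timestamp into a datetime for easier comparison.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S.%f"
    )
    return fields


SAMPLE = (
    "2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : "
    "cluster [INF] osd.0 172.21.9.34:6806/20527 boot"
)

if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```

Filtering on the ``level`` field would be one way to pick warnings out of the output of ``ceph log last [n]``. Lines that do not match the assumed layout, such as the status summary printed first by ``ceph -w``, simply yield ``None``.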
- -Monitoring Health Checks -======================== - -Ceph continuously runs various *health checks* against its own status. When -a health check fails, this is reflected in the output of ``ceph status`` (or -``ceph health``). In addition, messages are sent to the cluster log to -indicate when a check fails, and when the cluster recovers. - -For example, when an OSD goes down, the ``health`` section of the status -output may be updated as follows: - -:: - - health: HEALTH_WARN - 1 osds down - Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded - -At this time, cluster log messages are also emitted to record the failure of the -health checks: - -:: - - 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) - 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) - -When the OSD comes back online, the cluster log records the cluster's return -to a healthy state: - -:: - - 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED) - 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized) - 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy - - -Detecting configuration issues -============================== - -In addition to the health checks that Ceph continuously runs on its -own status, there are some configuration issues that may only be detected -by an external tool. - -Use the `ceph-medic`_ tool to run these additional checks on your Ceph -cluster's configuration.
- -Checking a Cluster's Usage Stats -================================ - -To check a cluster's data usage and data distribution among pools, you can -use the ``df`` option. It is similar to Linux ``df``. Execute -the following:: - - ceph df - -The **GLOBAL** section of the output provides an overview of the amount of -storage your cluster uses for your data. - -- **SIZE:** The overall storage capacity of the cluster. -- **AVAIL:** The amount of free space available in the cluster. -- **RAW USED:** The amount of raw storage used. -- **% RAW USED:** The percentage of raw storage used. Use this number in - conjunction with the ``full ratio`` and ``near full ratio`` to ensure that - you are not reaching your cluster's capacity. See `Storage Capacity`_ for - additional details. - -The **POOLS** section of the output provides a list of pools and the notional -usage of each pool. The output from this section **DOES NOT** reflect replicas, -clones or snapshots. For example, if you store an object with 1MB of data, the -notional usage will be 1MB, but the actual usage may be 2MB or more depending -on the number of replicas, clones and snapshots. - -- **NAME:** The name of the pool. -- **ID:** The pool ID. -- **USED:** The notional amount of data stored in kilobytes, unless the number - is suffixed with **M** for megabytes or **G** for gigabytes. -- **%USED:** The notional percentage of storage used per pool. -- **MAX AVAIL:** An estimate of the notional amount of data that can be written - to this pool. -- **Objects:** The notional number of objects stored per pool. - -.. note:: The numbers in the **POOLS** section are notional. They are not - inclusive of the number of replicas, snapshots or clones. As a result, - the sum of the **USED** and **%USED** amounts will not add up to the - **RAW USED** and **%RAW USED** amounts in the **GLOBAL** section of the - output. - -.. 
note:: The **MAX AVAIL** value is a complicated function of the - replication or erasure code used, the CRUSH rule that maps storage - to devices, the utilization of those devices, and the configured - mon_osd_full_ratio. - - - -Checking OSD Status -=================== - -You can check OSDs to ensure they are ``up`` and ``in`` by executing:: - - ceph osd stat - -Or:: - - ceph osd dump - -You can also view OSDs according to their position in the CRUSH map. :: - - ceph osd tree - -Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up -and their weight. :: - - # id weight type name up/down reweight - -1 3 pool default - -3 3 rack mainrack - -2 3 host osd-host - 0 1 osd.0 up 1 - 1 1 osd.1 up 1 - 2 1 osd.2 up 1 - -For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. - -Checking Monitor Status -======================= - -If your cluster has multiple monitors (likely), you should check the monitor -quorum status after you start the cluster before reading and/or writing data. A -quorum must be present when multiple monitors are running. You should also check -monitor status periodically to ensure that they are running. - -To display the monitor map, execute the following:: - - ceph mon stat - -Or:: - - ceph mon dump - -To check the quorum status for the monitor cluster, execute the following:: - - ceph quorum_status - -Ceph will return the quorum status. For example, a Ceph cluster consisting of -three monitors may return the following: - -.. 
code-block:: javascript - - { "election_epoch": 10, - "quorum": [ - 0, - 1, - 2], - "monmap": { "epoch": 1, - "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", - "modified": "2011-12-12 13:28:27.505520", - "created": "2011-12-12 13:28:27.505520", - "mons": [ - { "rank": 0, - "name": "a", - "addr": "127.0.0.1:6789\/0"}, - { "rank": 1, - "name": "b", - "addr": "127.0.0.1:6790\/0"}, - { "rank": 2, - "name": "c", - "addr": "127.0.0.1:6791\/0"} - ] - } - } - -Checking MDS Status -=================== - -Metadata servers provide metadata services for Ceph FS. Metadata servers have -two sets of states: ``up | down`` and ``active | inactive``. To ensure your -metadata servers are ``up`` and ``active``, execute the following:: - - ceph mds stat - -To display details of the metadata cluster, execute the following:: - - ceph fs dump - - -Checking Placement Group States -=============================== - -Placement groups map objects to OSDs. When you monitor your -placement groups, you will want them to be ``active`` and ``clean``. -For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. - -.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg - - -Using the Admin Socket -====================== - -The Ceph admin socket allows you to query a daemon via a socket interface. -By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon -via the admin socket, login to the host running the daemon and use the -following command:: - - ceph daemon {daemon-name} - ceph daemon {path-to-socket-file} - -For example, the following are equivalent:: - - ceph daemon osd.0 foo - ceph daemon /var/run/ceph/ceph-osd.0.asok foo - -To view the available admin socket commands, execute the following command:: - - ceph daemon {daemon-name} help - -The admin socket command enables you to show and set your configuration at -runtime. See `Viewing a Configuration at Runtime`_ for details. 
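The two equivalent ``ceph daemon`` invocations shown under "Using the Admin Socket" differ only in whether the daemon is addressed by name or by socket path. The sketch below illustrates that relationship. It assumes the default socket directory ``/var/run/ceph`` and a cluster named ``ceph``, as in the example above; the helper names ``admin_socket_path`` and ``daemon_command`` are hypothetical and only build strings, they do not contact a daemon.

```python
import posixpath


def admin_socket_path(daemon_name, cluster="ceph", run_dir="/var/run/ceph"):
    """Build the socket path for a daemon such as 'osd.0'.

    Assumes the '<run_dir>/<cluster>-<daemon>.asok' naming seen in the
    example above; a cluster may be configured differently.
    """
    return posixpath.join(run_dir, f"{cluster}-{daemon_name}.asok")


def daemon_command(target, *args):
    """Return the argv for 'ceph daemon <target> <args...>'.

    'target' may be a daemon name or a socket path.
    """
    return ["ceph", "daemon", target, *args]


if __name__ == "__main__":
    by_name = daemon_command("osd.0", "help")
    by_path = daemon_command(admin_socket_path("osd.0"), "help")
    print(" ".join(by_name))
    print(" ".join(by_path))
```

The resulting argument lists can be passed to ``subprocess.run`` on the host that runs the daemon.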
- -Additionally, you can set configuration values at runtime directly (i.e., the -admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} -injectargs``, which relies on the monitor but doesn't require you to login -directly to the host in question ). - -.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config -.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity -.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/ diff --git a/src/ceph/doc/rados/operations/operating.rst b/src/ceph/doc/rados/operations/operating.rst deleted file mode 100644 index 791941a..0000000 --- a/src/ceph/doc/rados/operations/operating.rst +++ /dev/null @@ -1,251 +0,0 @@ -===================== - Operating a Cluster -===================== - -.. index:: systemd; operating a cluster - - -Running Ceph with systemd -========================== - -For all distributions that support systemd (CentOS 7, Fedora, Debian -Jessie 8 and later, SUSE), ceph daemons are now managed using native -systemd files instead of the legacy sysvinit scripts. 
For example:: - - sudo systemctl start ceph.target # start all daemons - sudo systemctl status ceph-osd@12 # check status of osd.12 - -To list the Ceph systemd units on a node, execute:: - - sudo systemctl status ceph\*.service ceph\*.target - -Starting all Daemons --------------------- - -To start all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo systemctl start ceph.target - - -Stopping all Daemons --------------------- - -To stop all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo systemctl stop ceph\*.service ceph\*.target - - -Starting all Daemons by Type ----------------------------- - -To start all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo systemctl start ceph-osd.target - sudo systemctl start ceph-mon.target - sudo systemctl start ceph-mds.target - - -Stopping all Daemons by Type ----------------------------- - -To stop all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo systemctl stop ceph-mon\*.service ceph-mon.target - sudo systemctl stop ceph-osd\*.service ceph-osd.target - sudo systemctl stop ceph-mds\*.service ceph-mds.target - - -Starting a Daemon ------------------ - -To start a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo systemctl start ceph-osd@{id} - sudo systemctl start ceph-mon@{hostname} - sudo systemctl start ceph-mds@{hostname} - -For example:: - - sudo systemctl start ceph-osd@1 - sudo systemctl start ceph-mon@ceph-server - sudo systemctl start ceph-mds@ceph-server - - -Stopping a Daemon ------------------ - -To stop a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo systemctl stop ceph-osd@{id} - sudo systemctl stop ceph-mon@{hostname} - sudo systemctl stop ceph-mds@{hostname} - -For example:: - - sudo systemctl stop ceph-osd@1 - sudo systemctl stop ceph-mon@ceph-server - sudo systemctl stop ceph-mds@ceph-server - - -.. 
index:: Ceph service; Upstart; operating a cluster - - - -Running Ceph with Upstart -========================= - -When deploying Ceph with ``ceph-deploy`` on Ubuntu Trusty, you may start and -stop Ceph daemons on a :term:`Ceph Node` using the event-based `Upstart`_. -Upstart does not require you to define daemon instances in the Ceph -configuration file. - -To list the Ceph Upstart jobs and instances on a node, execute:: - - sudo initctl list | grep ceph - -See `initctl`_ for additional details. - - -Starting all Daemons --------------------- - -To start all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo start ceph-all - - -Stopping all Daemons --------------------- - -To stop all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo stop ceph-all - - -Starting all Daemons by Type ----------------------------- - -To start all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo start ceph-osd-all - sudo start ceph-mon-all - sudo start ceph-mds-all - - -Stopping all Daemons by Type ----------------------------- - -To stop all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo stop ceph-osd-all - sudo stop ceph-mon-all - sudo stop ceph-mds-all - - -Starting a Daemon ------------------ - -To start a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo start ceph-osd id={id} - sudo start ceph-mon id={hostname} - sudo start ceph-mds id={hostname} - -For example:: - - sudo start ceph-osd id=1 - sudo start ceph-mon id=ceph-server - sudo start ceph-mds id=ceph-server - - -Stopping a Daemon ------------------ - -To stop a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo stop ceph-osd id={id} - sudo stop ceph-mon id={hostname} - sudo stop ceph-mds id={hostname} - -For example:: - - sudo stop ceph-osd id=1 - sudo stop ceph-mon id=ceph-server - sudo stop ceph-mds id=ceph-server - - -.. 
index:: Ceph service; sysvinit; operating a cluster - - -Running Ceph -============ - -Each time you to **start**, **restart**, and **stop** Ceph daemons (or your -entire cluster) you must specify at least one option and one command. You may -also specify a daemon type or a daemon instance. :: - - {commandline} [options] [commands] [daemons] - - -The ``ceph`` options include: - -+-----------------+----------+-------------------------------------------------+ -| Option | Shortcut | Description | -+=================+==========+=================================================+ -| ``--verbose`` | ``-v`` | Use verbose logging. | -+-----------------+----------+-------------------------------------------------+ -| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | -+-----------------+----------+-------------------------------------------------+ -| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` | -| | | Otherwise, it only executes on ``localhost``. | -+-----------------+----------+-------------------------------------------------+ -| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | -+-----------------+----------+-------------------------------------------------+ -| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. | -+-----------------+----------+-------------------------------------------------+ -| ``--conf`` | ``-c`` | Use an alternate configuration file. | -+-----------------+----------+-------------------------------------------------+ - -The ``ceph`` commands include: - -+------------------+------------------------------------------------------------+ -| Command | Description | -+==================+============================================================+ -| ``start`` | Start the daemon(s). | -+------------------+------------------------------------------------------------+ -| ``stop`` | Stop the daemon(s). 
| -+------------------+------------------------------------------------------------+ -| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` | -+------------------+------------------------------------------------------------+ -| ``killall`` | Kill all daemons of a particular type. | -+------------------+------------------------------------------------------------+ -| ``cleanlogs`` | Cleans out the log directory. | -+------------------+------------------------------------------------------------+ -| ``cleanalllogs`` | Cleans out **everything** in the log directory. | -+------------------+------------------------------------------------------------+ - -For subsystem operations, the ``ceph`` service can target specific daemon types -by adding a particular daemon type for the ``[daemons]`` option. Daemon types -include: - -- ``mon`` -- ``osd`` -- ``mds`` - - - -.. _Valgrind: http://www.valgrind.org/ -.. _Upstart: http://upstart.ubuntu.com/index.html -.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/src/ceph/doc/rados/operations/pg-concepts.rst b/src/ceph/doc/rados/operations/pg-concepts.rst deleted file mode 100644 index 636d6bf..0000000 --- a/src/ceph/doc/rados/operations/pg-concepts.rst +++ /dev/null @@ -1,102 +0,0 @@ -========================== - Placement Group Concepts -========================== - -When you execute commands like ``ceph -w``, ``ceph osd dump``, and other -commands related to placement groups, Ceph may return values using some -of the following terms: - -*Peering* - The process of bringing all of the OSDs that store - a Placement Group (PG) into agreement about the state - of all of the objects (and their metadata) in that PG. - Note that agreeing on the state does not mean that - they all have the latest contents. - -*Acting Set* - The ordered list of OSDs who are (or were as of some epoch) - responsible for a particular placement group. 
- -*Up Set* - The ordered list of OSDs responsible for a particular placement - group for a particular epoch according to CRUSH. Normally this - is the same as the *Acting Set*, except when the *Acting Set* has - been explicitly overridden via ``pg_temp`` in the OSD Map. - -*Current Interval* or *Past Interval* - A sequence of OSD map epochs during which the *Acting Set* and *Up - Set* for particular placement group do not change. - -*Primary* - The member (and by convention first) of the *Acting Set*, - that is responsible for coordination peering, and is - the only OSD that will accept client-initiated - writes to objects in a placement group. - -*Replica* - A non-primary OSD in the *Acting Set* for a placement group - (and who has been recognized as such and *activated* by the primary). - -*Stray* - An OSD that is not a member of the current *Acting Set*, but - has not yet been told that it can delete its copies of a - particular placement group. - -*Recovery* - Ensuring that copies of all of the objects in a placement group - are on all of the OSDs in the *Acting Set*. Once *Peering* has - been performed, the *Primary* can start accepting write operations, - and *Recovery* can proceed in the background. - -*PG Info* - Basic metadata about the placement group's creation epoch, the version - for the most recent write to the placement group, *last epoch started*, - *last epoch clean*, and the beginning of the *current interval*. Any - inter-OSD communication about placement groups includes the *PG Info*, - such that any OSD that knows a placement group exists (or once existed) - also has a lower bound on *last epoch clean* or *last epoch started*. - -*PG Log* - A list of recent updates made to objects in a placement group. - Note that these logs can be truncated after all OSDs - in the *Acting Set* have acknowledged up to a certain - point. 
- -*Missing Set* - Each OSD notes update log entries and if they imply updates to - the contents of an object, adds that object to a list of needed - updates. This list is called the *Missing Set* for that ``<OSD,PG>``. - -*Authoritative History* - A complete, and fully ordered set of operations that, if - performed, would bring an OSD's copy of a placement group - up to date. - -*Epoch* - A (monotonically increasing) OSD map version number - -*Last Epoch Start* - The last epoch at which all nodes in the *Acting Set* - for a particular placement group agreed on an - *Authoritative History*. At this point, *Peering* is - deemed to have been successful. - -*up_thru* - Before a *Primary* can successfully complete the *Peering* process, - it must inform a monitor that is alive through the current - OSD map *Epoch* by having the monitor set its *up_thru* in the osd - map. This helps *Peering* ignore previous *Acting Sets* for which - *Peering* never completed after certain sequences of failures, such as - the second interval below: - - - *acting set* = [A,B] - - *acting set* = [A] - - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection) - - *acting set* = [B] (B restarts, A does not) - -*Last Epoch Clean* - The last *Epoch* at which all nodes in the *Acting set* - for a particular placement group were completely - up to date (both placement group logs and object contents). - At this point, *recovery* is deemed to have been - completed. 
diff --git a/src/ceph/doc/rados/operations/pg-repair.rst b/src/ceph/doc/rados/operations/pg-repair.rst deleted file mode 100644 index 0d6692a..0000000 --- a/src/ceph/doc/rados/operations/pg-repair.rst +++ /dev/null @@ -1,4 +0,0 @@ -Repairing PG inconsistencies -============================ - - diff --git a/src/ceph/doc/rados/operations/pg-states.rst b/src/ceph/doc/rados/operations/pg-states.rst deleted file mode 100644 index 0fbd3dc..0000000 --- a/src/ceph/doc/rados/operations/pg-states.rst +++ /dev/null @@ -1,80 +0,0 @@ -======================== - Placement Group States -======================== - -When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``), -Ceph will report on the status of the placement groups. A placement group has -one or more states. The optimum state for placement groups in the placement group -map is ``active + clean``. - -*Creating* - Ceph is still creating the placement group. - -*Active* - Ceph will process requests to the placement group. - -*Clean* - Ceph replicated all objects in the placement group the correct number of times. - -*Down* - A replica with necessary data is down, so the placement group is offline. - -*Scrubbing* - Ceph is checking the placement group for inconsistencies. - -*Degraded* - Ceph has not replicated some objects in the placement group the correct number of times yet. - -*Inconsistent* - Ceph detects inconsistencies in the one or more replicas of an object in the placement group - (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.). - -*Peering* - The placement group is undergoing the peering process - -*Repair* - Ceph is checking the placement group and repairing any inconsistencies it finds (if possible). - -*Recovering* - Ceph is migrating/synchronizing objects and their replicas. - -*Forced-Recovery* - High recovery priority of that PG is enforced by user. 
- -*Backfill* - Ceph is scanning and synchronizing the entire contents of a placement group - instead of inferring what contents need to be synchronized from the logs of - recent operations. *Backfill* is a special case of recovery. - -*Forced-Backfill* - High backfill priority of that PG is enforced by user. - -*Wait-backfill* - The placement group is waiting in line to start backfill. - -*Backfill-toofull* - A backfill operation is waiting because the destination OSD is over its - full ratio. - -*Incomplete* - Ceph detects that a placement group is missing information about - writes that may have occurred, or does not have any healthy - copies. If you see this state, try to start any failed OSDs that may - contain the needed information. In the case of an erasure coded pool, - temporarily reducing min_size may allow recovery. - -*Stale* - The placement group is in an unknown state - the monitors have not received - an update for it since the placement group mapping changed. - -*Remapped* - The placement group is temporarily mapped to a different set of OSDs from what - CRUSH specified. - -*Undersized* - The placement group has fewer copies than the configured pool replication level. - -*Peered* - The placement group has peered, but cannot serve client IO due to not having - enough copies to reach the pool's configured min_size parameter. Recovery - may occur in this state, so the pg may heal up to min_size eventually. diff --git a/src/ceph/doc/rados/operations/placement-groups.rst b/src/ceph/doc/rados/operations/placement-groups.rst deleted file mode 100644 index fee833a..0000000 --- a/src/ceph/doc/rados/operations/placement-groups.rst +++ /dev/null @@ -1,469 +0,0 @@ -================== - Placement Groups -================== - -.. 
_preselection: - -A preselection of pg_num -======================== - -When creating a new pool with:: - - ceph osd pool create {pool-name} pg_num - -it is mandatory to choose the value of ``pg_num`` because it cannot be -calculated automatically. Here are a few values commonly used: - -- Less than 5 OSDs set ``pg_num`` to 128 - -- Between 5 and 10 OSDs set ``pg_num`` to 512 - -- Between 10 and 50 OSDs set ``pg_num`` to 1024 - -- If you have more than 50 OSDs, you need to understand the tradeoffs - and how to calculate the ``pg_num`` value by yourself - -- For calculating ``pg_num`` value by yourself please take help of `pgcalc`_ tool - -As the number of OSDs increases, chosing the right value for pg_num -becomes more important because it has a significant influence on the -behavior of the cluster as well as the durability of the data when -something goes wrong (i.e. the probability that a catastrophic event -leads to data loss). - -How are Placement Groups used ? -=============================== - -A placement group (PG) aggregates objects within a pool because -tracking object placement and object metadata on a per-object basis is -computationally expensive--i.e., a system with millions of objects -cannot realistically track placement on a per-object basis. - -.. ditaa:: - /-----\ /-----\ /-----\ /-----\ /-----\ - | obj | | obj | | obj | | obj | | obj | - \-----/ \-----/ \-----/ \-----/ \-----/ - | | | | | - +--------+--------+ +---+----+ - | | - v v - +-----------------------+ +-----------------------+ - | Placement Group #1 | | Placement Group #2 | - | | | | - +-----------------------+ +-----------------------+ - | | - +------------------------------+ - | - v - +-----------------------+ - | Pool | - | | - +-----------------------+ - -The Ceph client will calculate which placement group an object should -be in. It does this by hashing the object ID and applying an operation -based on the number of PGs in the defined pool and the ID of the pool. 
-See `Mapping PGs to OSDs`_ for details. - -The object's contents within a placement group are stored in a set of -OSDs. For instance, in a replicated pool of size two, each placement -group will store objects on two OSDs, as shown below. - -.. ditaa:: - - +-----------------------+ +-----------------------+ - | Placement Group #1 | | Placement Group #2 | - | | | | - +-----------------------+ +-----------------------+ - | | | | - v v v v - /----------\ /----------\ /----------\ /----------\ - | | | | | | | | - | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | - | | | | | | | | - \----------/ \----------/ \----------/ \----------/ - - -Should OSD #2 fail, another will be assigned to Placement Group #1 and -will be filled with copies of all objects in OSD #1. If the pool size -is changed from two to three, an additional OSD will be assigned to -the placement group and will receive copies of all objects in the -placement group. - -Placement groups do not own the OSD, they share it with other -placement groups from the same pool or even other pools. If OSD #2 -fails, the Placement Group #2 will also have to restore copies of -objects, using OSD #3. - -When the number of placement groups increases, the new placement -groups will be assigned OSDs. The result of the CRUSH function will -also change and some objects from the former placement groups will be -copied over to the new Placement Groups and removed from the old ones. - -Placement Groups Tradeoffs -========================== - -Data durability and even distribution among all OSDs call for more -placement groups but their number should be reduced to the minimum to -save CPU and memory. - -.. _data durability: - -Data durability ---------------- - -After an OSD fails, the risk of data loss increases until the data it -contained is fully recovered. Let's imagine a scenario that causes -permanent data loss in a single placement group: - -- The OSD fails and all copies of the object it contains are lost. 
- For all objects within the placement group the number of replicas - suddenly drops from three to two. - -- Ceph starts recovery for this placement group by choosing a new OSD - to re-create the third copy of all objects. - -- Another OSD, within the same placement group, fails before the new - OSD is fully populated with the third copy. Some objects will then - only have one surviving copy. - -- Ceph picks yet another OSD and keeps copying objects to restore the - desired number of copies. - -- A third OSD, within the same placement group, fails before recovery - is complete. If this OSD contained the only remaining copy of an - object, it is permanently lost. - -In a cluster containing 10 OSDs with 512 placement groups in a three -replica pool, CRUSH will give each placement group three OSDs. In the -end, each OSD will end up hosting (512 * 3) / 10 = ~150 Placement -Groups. When the first OSD fails, the above scenario will therefore -start recovery for all 150 placement groups at the same time. - -The 150 placement groups being recovered are likely to be -homogeneously spread over the 9 remaining OSDs. Each remaining OSD is -therefore likely to send copies of objects to all others and also -receive some new objects to be stored because they became part of a -new placement group. - -The amount of time it takes for this recovery to complete entirely -depends on the architecture of the Ceph cluster. Let's say each OSD is -hosted by a 1TB SSD on a single machine and all of them are connected -to a 10Gb/s switch and the recovery for a single OSD completes within -M minutes. If there are two OSDs per machine using spinners with no -SSD journal and a 1Gb/s switch, it will at least be an order of -magnitude slower. - -In a cluster of this size, the number of placement groups has almost -no influence on data durability. It could be 128 or 8192 and the -recovery would not be slower or faster. 
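The "(512 * 3) / 10" estimate above generalizes to any pool. The following is a small sketch of that arithmetic, assuming placement groups are spread evenly over the OSDs, which is the same simplification the text makes; the function name ``pgs_per_osd`` is illustrative.

```python
def pgs_per_osd(pg_num, pool_size, osd_count):
    """Average number of placement group copies hosted by each OSD.

    Assumes an even spread, as in the worked example above.
    """
    if osd_count <= 0:
        raise ValueError("osd_count must be positive")
    return pg_num * pool_size / osd_count


if __name__ == "__main__":
    # 512 placement groups, three replicas, 10 OSDs.
    print(pgs_per_osd(512, 3, 10))
```

For the 10 OSD cluster this evaluates to 153.6, which the text rounds to roughly 150 placement groups per OSD.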
- -However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs -is likely to speed up recovery and therefore improve data durability -significantly. Each OSD now participates in only ~75 placement groups -instead of ~150 when there were only 10 OSDs and it will still require -all 19 remaining OSDs to perform the same amount of object copies in -order to recover. But where 10 OSDs had to copy approximately 100GB -each, they now have to copy 50GB each instead. If the network was the -bottleneck, recovery will happen twice as fast. In other words, -recovery goes faster when the number of OSDs increases. - -If this cluster grows to 40 OSDs, each of them will only host ~35 -placement groups. If an OSD dies, recovery will keep going faster -unless it is blocked by another bottleneck. However, if this cluster -grows to 200 OSDs, each of them will only host ~7 placement groups. If -an OSD dies, recovery will happen between at most of ~21 (7 * 3) OSDs -in these placement groups: recovery will take longer than when there -were 40 OSDs, meaning the number of placement groups should be -increased. - -No matter how short the recovery time is, there is a chance for a -second OSD to fail while it is in progress. In the 10 OSDs cluster -described above, if any of them fail, then ~17 placement groups -(i.e. ~150 / 9 placement groups being recovered) will only have one -surviving copy. And if any of the 8 remaining OSD fail, the last -objects of two placement groups are likely to be lost (i.e. ~17 / 8 -placement groups with only one remaining copy being recovered). - -When the size of the cluster grows to 20 OSDs, the number of Placement -Groups damaged by the loss of three OSDs drops. The second OSD lost -will degrade ~4 (i.e. ~75 / 19 placement groups being recovered) -instead of ~17 and the third OSD lost will only lose data if it is one -of the four OSDs containing the surviving copy. 
In other words, if the -probability of losing one OSD is 0.0001% during the recovery time -frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 * -0.0001% in the cluster with 20 OSDs. - -In a nutshell, more OSDs mean faster recovery and a lower risk of -cascading failures leading to the permanent loss of a Placement -Group. Having 512 or 4096 Placement Groups is roughly equivalent in a -cluster with less than 50 OSDs as far as data durability is concerned. - -Note: It may take a long time for a new OSD added to the cluster to be -populated with placement groups that were assigned to it. However -there is no degradation of any object and it has no impact on the -durability of the data contained in the Cluster. - -.. _object distribution: - -Object distribution within a pool ---------------------------------- - -Ideally objects are evenly distributed in each placement group. Since -CRUSH computes the placement group for each object, but does not -actually know how much data is stored in each OSD within this -placement group, the ratio between the number of placement groups and -the number of OSDs may influence the distribution of the data -significantly. - -For instance, if there was single a placement group for ten OSDs in a -three replica pool, only three OSD would be used because CRUSH would -have no other choice. When more placement groups are available, -objects are more likely to be evenly spread among them. CRUSH also -makes every effort to evenly spread OSDs among all existing Placement -Groups. - -As long as there are one or two orders of magnitude more Placement -Groups than OSDs, the distribution should be even. For instance, 300 -placement groups for 3 OSDs, 1000 placement groups for 10 OSDs etc. - -Uneven data distribution can be caused by factors other than the ratio -between OSDs and placement groups. Since CRUSH does not take into -account the size of the objects, a few very large objects may create -an imbalance. 
Let say one million 4K objects totaling 4GB are evenly -spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10 -= 400MB on each OSD. If one 400MB object is added to the pool, the -three OSDs supporting the placement group in which the object has been -placed will be filled with 400MB + 400MB = 800MB while the seven -others will remain occupied with only 400MB. - -.. _resource usage: - -Memory, CPU and network usage ------------------------------ - -For each placement group, OSDs and MONs need memory, network and CPU -at all times and even more during recovery. Sharing this overhead by -clustering objects within a placement group is one of the main reasons -they exist. - -Minimizing the number of placement groups saves significant amounts of -resources. - -Choosing the number of Placement Groups -======================================= - -If you have more than 50 OSDs, we recommend approximately 50-100 -placement groups per OSD to balance out resource usage, data -durability and distribution. If you have less than 50 OSDs, chosing -among the `preselection`_ above is best. For a single pool of objects, -you can use the following formula to get a baseline:: - - (OSDs * 100) - Total PGs = ------------ - pool size - -Where **pool size** is either the number of replicas for replicated -pools or the K+M sum for erasure coded pools (as returned by **ceph -osd erasure-code-profile get**). - -You should then check if the result makes sense with the way you -designed your Ceph cluster to maximize `data durability`_, -`object distribution`_ and minimize `resource usage`_. - -The result should be **rounded up to the nearest power of two.** -Rounding up is optional, but recommended for CRUSH to evenly balance -the number of objects among placement groups. - -As an example, for a cluster with 200 OSDs and a pool size of 3 -replicas, you would estimate your number of PGs as follows:: - - (200 * 100) - ----------- = 6667. 
Nearest power of 2: 8192 - 3 - -When using multiple data pools for storing objects, you need to ensure -that you balance the number of placement groups per pool with the -number of placement groups per OSD so that you arrive at a reasonable -total number of placement groups that provides reasonably low variance -per OSD without taxing system resources or making the peering process -too slow. - -For instance, a cluster of 10 pools each with 512 placement groups on -ten OSDs is a total of 5,120 placement groups spread over ten OSDs, -that is 512 placement groups per OSD. That does not use too many -resources. However, if 1,000 pools were created with 512 placement -groups each, the OSDs would handle ~50,000 placement groups each and it -would require significantly more resources and time for peering. - -You may find the `PGCalc`_ tool helpful. - - -.. _setting the number of placement groups: - -Set the Number of Placement Groups -================================== - -To set the number of placement groups in a pool, you must specify the -number of placement groups at the time you create the pool. -See `Create a Pool`_ for details. Once you have set placement groups for a -pool, you may increase the number of placement groups (but you cannot -decrease the number of placement groups). To increase the number of -placement groups, execute the following:: - - ceph osd pool set {pool-name} pg_num {pg_num} - -Once you increase the number of placement groups, you must also -increase the number of placement groups for placement (``pgp_num``) -before your cluster will rebalance. The ``pgp_num`` will be the number of -placement groups that will be considered for placement by the CRUSH -algorithm. Increasing ``pg_num`` splits the placement groups but data -will not be migrated to the newer placement groups until placement -groups for placement, i.e., ``pgp_num``, is increased. The ``pgp_num`` -should be equal to the ``pg_num``.
To increase the number of -placement groups for placement, execute the following:: - - ceph osd pool set {pool-name} pgp_num {pgp_num} - - -Get the Number of Placement Groups -================================== - -To get the number of placement groups in a pool, execute the following:: - - ceph osd pool get {pool-name} pg_num - - -Get a Cluster's PG Statistics -============================= - -To get the statistics for the placement groups in your cluster, execute the following:: - - ceph pg dump [--format {format}] - -Valid formats are ``plain`` (default) and ``json``. - - -Get Statistics for Stuck PGs -============================ - -To get the statistics for all placement groups stuck in a specified state, -execute the following:: - - ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] - -**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD -with the most up-to-date data to come up and in. - -**Unclean** Placement groups contain objects that are not replicated the desired number -of times. They should be recovering. - -**Stale** Placement groups are in an unknown state - the OSDs that host them have not -reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``). - -Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number -of seconds the placement group is stuck before including it in the returned statistics -(default 300 seconds). 
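As a rough sketch of how the ``dump_stuck`` command above might be scripted, the following shell function prints one invocation per stuck state, using only the options shown in the synopsis. The function name ``stuck_pg_commands`` and the choice of three states are illustrative assumptions, not part of Ceph. It prints the commands instead of running them, so they can be reviewed before use.

```shell
# Hypothetical helper (not part of Ceph): print one `ceph pg dump_stuck`
# command per state, using the documented --format and --threshold options.
stuck_pg_commands() {
    threshold="${1:-300}"   # seconds; 300 is the documented default
    for state in inactive unclean stale; do
        echo "ceph pg dump_stuck ${state} --format json --threshold ${threshold}"
    done
}

# Example: list the commands for PGs stuck longer than 10 minutes.
stuck_pg_commands 600
```

Piping the printed lines to a shell would run them against the cluster; keeping that step separate is a deliberate choice here.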
- - -Get a PG Map -============ - -To get the placement group map for a particular placement group, execute the following:: - - ceph pg map {pg-id} - -For example:: - - ceph pg map 1.6c - -Ceph will return the placement group map, the placement group, and the OSD status:: - - osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] - - -Get a PGs Statistics -==================== - -To retrieve statistics for a particular placement group, execute the following:: - - ceph pg {pg-id} query - - -Scrub a Placement Group -======================= - -To scrub a placement group, execute the following:: - - ceph pg scrub {pg-id} - -Ceph checks the primary and any replica nodes, generates a catalog of all objects -in the placement group and compares them to ensure that no objects are missing -or mismatched, and their contents are consistent. Assuming the replicas all -match, a final semantic sweep ensures that all of the snapshot-related object -metadata is consistent. Errors are reported via logs. - -Prioritize backfill/recovery of a Placement Group(s) -==================================================== - -You may run into a situation where a bunch of placement groups will require -recovery and/or backfill, and some particular groups hold data more important -than others (for example, those PGs may hold data for images used by running -machines and other PGs may be used by inactive machines/less relevant data). -In that case, you may want to prioritize recovery of those groups so -performance and/or availability of data stored on those groups is restored -earlier. To do this (mark particular placement group(s) as prioritized during -backfill or recovery), execute the following:: - - ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] - ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] - -This will cause Ceph to perform recovery or backfill on specified placement -groups first, before other placement groups. 
This does not interrupt currently -ongoing backfills or recovery, but causes specified PGs to be processed -as soon as possible. If you change your mind or prioritize the wrong groups, -use:: - - ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] - ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] - -This will remove the "force" flag from those PGs and they will be processed -in the default order. Again, this doesn't affect placement groups that are -currently being processed, only those that are still queued. - -The "force" flag is cleared automatically after recovery or backfill of the group -is done. - -Revert Lost -=========== - -If the cluster has lost one or more objects, and you have decided to -abandon the search for the lost data, you must mark the unfound objects -as ``lost``. - -If all possible locations have been queried and objects are still -lost, you may have to give up on the lost objects. This is -possible given unusual combinations of failures that allow the cluster -to learn about writes that were performed before the writes themselves -are recovered. - -Currently the only supported option is "revert", which will either roll back to -a previous version of the object or (if it was a new object) forget about it -entirely. To mark the "unfound" objects as "lost", execute the following:: - - ceph pg {pg-id} mark_unfound_lost revert|delete - -.. important:: Use this feature with caution, because it may confuse - applications that expect the object(s) to exist. - - -.. toctree:: - :hidden: - - pg-states - pg-concepts - - -.. _Create a Pool: ../pools#createpool -.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds -.. 
_pgcalc: http://ceph.com/pgcalc/ diff --git a/src/ceph/doc/rados/operations/pools.rst b/src/ceph/doc/rados/operations/pools.rst deleted file mode 100644 index 7015593..0000000 --- a/src/ceph/doc/rados/operations/pools.rst +++ /dev/null @@ -1,798 +0,0 @@ -======= - Pools -======= - -When you first deploy a cluster without creating a pool, Ceph uses the default -pools for storing data. A pool provides you with: - -- **Resilience**: You can set how many OSD are allowed to fail without losing data. - For replicated pools, it is the desired number of copies/replicas of an object. - A typical configuration stores an object and one additional copy - (i.e., ``size = 2``), but you can determine the number of copies/replicas. - For `erasure coded pools <../erasure-code>`_, it is the number of coding chunks - (i.e. ``m=2`` in the **erasure code profile**) - -- **Placement Groups**: You can set the number of placement groups for the pool. - A typical configuration uses approximately 100 placement groups per OSD to - provide optimal balancing without using up too many computing resources. When - setting up multiple pools, be careful to ensure you set a reasonable number of - placement groups for both the pool and the cluster as a whole. - -- **CRUSH Rules**: When you store data in a pool, a CRUSH ruleset mapped to the - pool enables CRUSH to identify a rule for the placement of the object - and its replicas (or chunks for erasure coded pools) in your cluster. - You can create a custom CRUSH rule for your pool. - -- **Snapshots**: When you create snapshots with ``ceph osd pool mksnap``, - you effectively take a snapshot of a particular pool. - -To organize data into pools, you can list, create, and remove pools. -You can also view the utilization statistics for each pool. - -List Pools -========== - -To list your cluster's pools, execute:: - - ceph osd lspools - -On a freshly installed cluster, only the ``rbd`` pool exists. - - -.. 
_createpool: - -Create a Pool -============= - -Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_. -Ideally, you should override the default value for the number of placement -groups in your Ceph configuration file, as the default is NOT ideal. -For details on placement group numbers refer to `setting the number of placement groups`_ - -.. note:: Starting with Luminous, all pools need to be associated to the - application using the pool. See `Associate Pool to Application`_ below for - more information. - -For example:: - - osd pool default pg num = 100 - osd pool default pgp num = 100 - -To create a pool, execute:: - - ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] \ - [crush-rule-name] [expected-num-objects] - ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure \ - [erasure-code-profile] [crush-rule-name] [expected_num_objects] - -Where: - -``{pool-name}`` - -:Description: The name of the pool. It must be unique. -:Type: String -:Required: Yes. - -``{pg-num}`` - -:Description: The total number of placement groups for the pool. See `Placement - Groups`_ for details on calculating a suitable number. The - default value ``8`` is NOT suitable for most systems. - -:Type: Integer -:Required: Yes. -:Default: 8 - -``{pgp-num}`` - -:Description: The total number of placement groups for placement purposes. This - **should be equal to the total number of placement groups**, except - for placement group splitting scenarios. - -:Type: Integer -:Required: Yes. Picks up default or Ceph configuration value if not specified. -:Default: 8 - -``{replicated|erasure}`` - -:Description: The pool type which may either be **replicated** to - recover from lost OSDs by keeping multiple copies of the - objects or **erasure** to get a kind of - `generalized RAID5 <../erasure-code>`_ capability. - The **replicated** pools require more - raw storage but implement all Ceph operations. 
The - **erasure** pools require less raw storage but only - implement a subset of the available operations. - -:Type: String -:Required: No. -:Default: replicated - -``[crush-rule-name]`` - -:Description: The name of a CRUSH rule to use for this pool. The specified - rule must exist. - -:Type: String -:Required: No. -:Default: For **replicated** pools it is the ruleset specified by the ``osd - pool default crush replicated ruleset`` config variable. This - ruleset must exist. - For **erasure** pools it is ``erasure-code`` if the ``default`` - `erasure code profile`_ is used or ``{pool-name}`` otherwise. This - ruleset will be created implicitly if it doesn't exist already. - - -``[erasure-code-profile=profile]`` - -.. _erasure code profile: ../erasure-code-profile - -:Description: For **erasure** pools only. Use the `erasure code profile`_. It - must be an existing profile as defined by - **osd erasure-code-profile set**. - -:Type: String -:Required: No. - -When you create a pool, set the number of placement groups to a reasonable value -(e.g., ``100``). Consider the total number of placement groups per OSD too. -Placement groups are computationally expensive, so performance will degrade when -you have many pools with many placement groups (e.g., 50 pools with 100 -placement groups each). The point of diminishing returns depends upon the power -of the OSD host. - -See `Placement Groups`_ for details on calculating an appropriate number of -placement groups for your pool. - -.. _Placement Groups: ../placement-groups - -``[expected-num-objects]`` - -:Description: The expected number of objects for this pool. By setting this value ( - together with a negative **filestore merge threshold**), the PG folder - splitting would happen at the pool creation time, to avoid the latency - impact to do a runtime folder splitting. - -:Type: Integer -:Required: No. -:Default: 0, no splitting at the pool creation time. 
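When picking ``{pg-num}`` for the create command above, the baseline formula from the placement groups documentation, ``(OSDs * 100) / pool size`` rounded up to a power of two, can be sketched in shell. The function name ``pg_baseline`` is a hypothetical helper, not a Ceph command, and the result is only a starting point to check against durability, distribution and resource usage.

```shell
# Hypothetical helper: baseline PG count = (OSDs * 100) / pool size,
# rounded up to the next power of two.
pg_baseline() {
    osds="$1"
    pool_size="$2"   # replica count, or K+M for erasure coded pools
    # Ceiling division, so 20000 / 3 becomes 6667 rather than 6666.
    raw=$(( (osds * 100 + pool_size - 1) / pool_size ))
    pgs=1
    while [ "$pgs" -lt "$raw" ]; do
        pgs=$(( pgs * 2 ))
    done
    echo "$pgs"
}

# 200 OSDs with 3 replicas, the example from the placement groups
# documentation; prints 8192.
pg_baseline 200 3
```

The printed value could then be passed as both ``{pg-num}`` and ``{pgp-num}`` to ``ceph osd pool create``.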
- -Associate Pool to Application -============================= - -Pools need to be associated with an application before use. Pools that will be -used with CephFS or pools that are automatically created by RGW are -automatically associated. Pools that are intended for use with RBD should be -initialized using the ``rbd`` tool (see `Block Device Commands`_ for more -information). - -For other cases, you can manually associate a free-form application name to -a pool.:: - - ceph osd pool application enable {pool-name} {application-name} - -.. note:: CephFS uses the application name ``cephfs``, RBD uses the - application name ``rbd``, and RGW uses the application name ``rgw``. - -Set Pool Quotas -=============== - -You can set pool quotas for the maximum number of bytes and/or the maximum -number of objects per pool. :: - - ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] - -For example:: - - ceph osd pool set-quota data max_objects 10000 - -To remove a quota, set its value to ``0``. - - -Delete a Pool -============= - -To delete a pool, execute:: - - ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] - - -To remove a pool the mon_allow_pool_delete flag must be set to true in the Monitor's -configuration. Otherwise they will refuse to remove a pool. - -See `Monitor Configuration`_ for more information. - -.. _Monitor Configuration: ../../configuration/mon-config-ref - -If you created your own rulesets and rules for a pool you created, you should -consider removing them when you no longer need your pool:: - - ceph osd pool get {pool-name} crush_ruleset - -If the ruleset was "123", for example, you can check the other pools like so:: - - ceph osd dump | grep "^pool" | grep "crush_ruleset 123" - -If no other pools use that custom ruleset, then it's safe to delete that -ruleset from the cluster. 
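The ruleset check above can be wrapped in a small function, sketched here under the assumption that each pool line in the ``ceph osd dump`` output contains the text ``crush_ruleset <id>``, as the ``grep`` example implies. The name ``pools_using_ruleset`` is hypothetical. Unlike the plain ``grep``, it matches whole words so that ruleset ``12`` is not confused with ``123``.

```shell
# Hypothetical helper: count pools that reference a given CRUSH ruleset.
# Reads `ceph osd dump` output on stdin; $1 is the ruleset id.
pools_using_ruleset() {
    # -w matches whole words only; "|| true" keeps the exit status zero
    # when grep counts no matching lines.
    grep "^pool" | grep -cw "crush_ruleset $1" || true
}

# Usage against a live cluster:
#   ceph osd dump | pools_using_ruleset 123
```

A count of ``0`` after the pool has been deleted suggests that no remaining pool uses the ruleset.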
- -If you created users with permissions strictly for a pool that no longer -exists, you should consider deleting those users too:: - - ceph auth ls | grep -C 5 {pool-name} - ceph auth del {user} - - -Rename a Pool -============= - -To rename a pool, execute:: - - ceph osd pool rename {current-pool-name} {new-pool-name} - -If you rename a pool and you have per-pool capabilities for an authenticated -user, you must update the user's capabilities (i.e., caps) with the new pool -name. - -.. note:: Version ``0.48`` Argonaut and above. - -Show Pool Statistics -==================== - -To show a pool's utilization statistics, execute:: - - rados df - - -Make a Snapshot of a Pool -========================= - -To make a snapshot of a pool, execute:: - - ceph osd pool mksnap {pool-name} {snap-name} - -.. note:: Version ``0.48`` Argonaut and above. - - -Remove a Snapshot of a Pool -=========================== - -To remove a snapshot of a pool, execute:: - - ceph osd pool rmsnap {pool-name} {snap-name} - -.. note:: Version ``0.48`` Argonaut and above. - -.. _setpoolvalues: - - -Set Pool Values -=============== - -To set a value to a pool, execute the following:: - - ceph osd pool set {pool-name} {key} {value} - -You may set values for the following keys: - -.. _compression_algorithm: - -``compression_algorithm`` -:Description: Sets inline compression algorithm to use for underlying BlueStore. - This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression algorithm``. - -:Type: String -:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` - -``compression_mode`` - -:Description: Sets the policy for the inline compression algorithm for underlying BlueStore. - This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression mode``. 
- -:Type: String -:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` - -``compression_min_blob_size`` - -:Description: Chunks smaller than this are never compressed. - This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression min blob *``. - -:Type: Unsigned Integer - -``compression_max_blob_size`` - -:Description: Chunks larger than this are broken into smaller blobs sizing - ``compression_max_blob_size`` before being compressed. - -:Type: Unsigned Integer - -.. _size: - -``size`` - -:Description: Sets the number of replicas for objects in the pool. - See `Set the Number of Object Replicas`_ for further details. - Replicated pools only. - -:Type: Integer - -.. _min_size: - -``min_size`` - -:Description: Sets the minimum number of replicas required for I/O. - See `Set the Number of Object Replicas`_ for further details. - Replicated pools only. - -:Type: Integer -:Version: ``0.54`` and above - -.. _pg_num: - -``pg_num`` - -:Description: The effective number of placement groups to use when calculating - data placement. -:Type: Integer -:Valid Range: Superior to ``pg_num`` current value. - -.. _pgp_num: - -``pgp_num`` - -:Description: The effective number of placement groups for placement to use - when calculating data placement. - -:Type: Integer -:Valid Range: Equal to or less than ``pg_num``. - -.. _crush_ruleset: - -``crush_ruleset`` - -:Description: The ruleset to use for mapping object placement in the cluster. -:Type: Integer - -.. _allow_ec_overwrites: - -``allow_ec_overwrites`` - -:Description: Whether writes to an erasure coded pool can update part - of an object, so cephfs and rbd can use it. See - `Erasure Coding with Overwrites`_ for more details. -:Type: Boolean -:Version: ``12.2.0`` and above - -.. _hashpspool: - -``hashpspool`` - -:Description: Set/Unset HASHPSPOOL flag on a given pool. 
-:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``0.48`` Argonaut and above. - -.. _nodelete: - -``nodelete`` - -:Description: Set/Unset NODELETE flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``FIXME`` - -.. _nopgchange: - -``nopgchange`` - -:Description: Set/Unset NOPGCHANGE flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``FIXME`` - -.. _nosizechange: - -``nosizechange`` - -:Description: Set/Unset NOSIZECHANGE flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``FIXME`` - -.. _write_fadvise_dontneed: - -``write_fadvise_dontneed`` - -:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag - -.. _noscrub: - -``noscrub`` - -:Description: Set/Unset NOSCRUB flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag - -.. _nodeep-scrub: - -``nodeep-scrub`` - -:Description: Set/Unset NODEEP_SCRUB flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag - -.. _hit_set_type: - -``hit_set_type`` - -:Description: Enables hit set tracking for cache pools. - See `Bloom Filter`_ for additional information. - -:Type: String -:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` -:Default: ``bloom``. Other values are for testing. - -.. _hit_set_count: - -``hit_set_count`` - -:Description: The number of hit sets to store for cache pools. The higher - the number, the more RAM consumed by the ``ceph-osd`` daemon. - -:Type: Integer -:Valid Range: ``1``. Agent doesn't handle > 1 yet. - -.. _hit_set_period: - -``hit_set_period`` - -:Description: The duration of a hit set period in seconds for cache pools. - The higher the number, the more RAM consumed by the - ``ceph-osd`` daemon. - -:Type: Integer -:Example: ``3600`` 1hr - -.. 
_hit_set_fpp: - -``hit_set_fpp`` - -:Description: The false positive probability for the ``bloom`` hit set type. - See `Bloom Filter`_ for additional information. - -:Type: Double -:Valid Range: 0.0 - 1.0 -:Default: ``0.05`` - -.. _cache_target_dirty_ratio: - -``cache_target_dirty_ratio`` - -:Description: The percentage of the cache pool containing modified (dirty) - objects before the cache tiering agent will flush them to the - backing storage pool. - -:Type: Double -:Default: ``.4`` - -.. _cache_target_dirty_high_ratio: - -``cache_target_dirty_high_ratio`` - -:Description: The percentage of the cache pool containing modified (dirty) - objects before the cache tiering agent will flush them to the - backing storage pool with a higher speed. - -:Type: Double -:Default: ``.6`` - -.. _cache_target_full_ratio: - -``cache_target_full_ratio`` - -:Description: The percentage of the cache pool containing unmodified (clean) - objects before the cache tiering agent will evict them from the - cache pool. - -:Type: Double -:Default: ``.8`` - -.. _target_max_bytes: - -``target_max_bytes`` - -:Description: Ceph will begin flushing or evicting objects when the - ``max_bytes`` threshold is triggered. - -:Type: Integer -:Example: ``1000000000000`` #1-TB - -.. _target_max_objects: - -``target_max_objects`` - -:Description: Ceph will begin flushing or evicting objects when the - ``max_objects`` threshold is triggered. - -:Type: Integer -:Example: ``1000000`` #1M objects - - -``hit_set_grade_decay_rate`` - -:Description: Temperature decay rate between two successive hit_sets -:Type: Integer -:Valid Range: 0 - 100 -:Default: ``20`` - - -``hit_set_search_last_n`` - -:Description: Count at most N appearance in hit_sets for temperature calculation -:Type: Integer -:Valid Range: 0 - hit_set_count -:Default: ``1`` - - -.. 
_cache_min_flush_age: - -``cache_min_flush_age`` - -:Description: The time (in seconds) before the cache tiering agent will flush - an object from the cache pool to the storage pool. - -:Type: Integer -:Example: ``600`` 10min - -.. _cache_min_evict_age: - -``cache_min_evict_age`` - -:Description: The time (in seconds) before the cache tiering agent will evict - an object from the cache pool. - -:Type: Integer -:Example: ``1800`` 30min - -.. _fast_read: - -``fast_read`` - -:Description: On an erasure coded pool, if this flag is turned on, the read request - issues sub reads to all shards, and waits until it receives enough - shards to decode to serve the client. In the case of jerasure and isa - erasure plugins, once the first K replies return, the client's request is - served immediately using the data decoded from these replies. This - trades some resources for better performance. Currently this - flag is only supported for erasure coded pools. - -:Type: Boolean -:Default: ``0`` - -.. _scrub_min_interval: - -``scrub_min_interval`` - -:Description: The minimum interval in seconds for pool scrubbing when - load is low. If it is 0, the value osd_scrub_min_interval - from config is used. - -:Type: Double -:Default: ``0`` - -.. _scrub_max_interval: - -``scrub_max_interval`` - -:Description: The maximum interval in seconds for pool scrubbing - irrespective of cluster load. If it is 0, the value - osd_scrub_max_interval from config is used. - -:Type: Double -:Default: ``0`` - -.. _deep_scrub_interval: - -``deep_scrub_interval`` - -:Description: The interval in seconds for pool “deep” scrubbing. If it - is 0, the value osd_deep_scrub_interval from config is used.
- -:Type: Double -:Default: ``0`` - - -Get Pool Values -=============== - -To get a value from a pool, execute the following:: - - ceph osd pool get {pool-name} {key} - -You may get values for the following keys: - -``size`` - -:Description: see size_ - -:Type: Integer - -``min_size`` - -:Description: see min_size_ - -:Type: Integer -:Version: ``0.54`` and above - -``pg_num`` - -:Description: see pg_num_ - -:Type: Integer - - -``pgp_num`` - -:Description: see pgp_num_ - -:Type: Integer -:Valid Range: Equal to or less than ``pg_num``. - - -``crush_ruleset`` - -:Description: see crush_ruleset_ - - -``hit_set_type`` - -:Description: see hit_set_type_ - -:Type: String -:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` - -``hit_set_count`` - -:Description: see hit_set_count_ - -:Type: Integer - - -``hit_set_period`` - -:Description: see hit_set_period_ - -:Type: Integer - - -``hit_set_fpp`` - -:Description: see hit_set_fpp_ - -:Type: Double - - -``cache_target_dirty_ratio`` - -:Description: see cache_target_dirty_ratio_ - -:Type: Double - - -``cache_target_dirty_high_ratio`` - -:Description: see cache_target_dirty_high_ratio_ - -:Type: Double - - -``cache_target_full_ratio`` - -:Description: see cache_target_full_ratio_ - -:Type: Double - - -``target_max_bytes`` - -:Description: see target_max_bytes_ - -:Type: Integer - - -``target_max_objects`` - -:Description: see target_max_objects_ - -:Type: Integer - - -``cache_min_flush_age`` - -:Description: see cache_min_flush_age_ - -:Type: Integer - - -``cache_min_evict_age`` - -:Description: see cache_min_evict_age_ - -:Type: Integer - - -``fast_read`` - -:Description: see fast_read_ - -:Type: Boolean - - -``scrub_min_interval`` - -:Description: see scrub_min_interval_ - -:Type: Double - - -``scrub_max_interval`` - -:Description: see scrub_max_interval_ - -:Type: Double - - -``deep_scrub_interval`` - -:Description: see deep_scrub_interval_ - -:Type: Double - - -Set the Number of Object Replicas 
-================================= - -To set the number of object replicas on a replicated pool, execute the following:: - - ceph osd pool set {poolname} size {num-replicas} - -.. important:: The ``{num-replicas}`` includes the object itself. - If you want the object and two copies of the object for a total of - three instances of the object, specify ``3``. - -For example:: - - ceph osd pool set data size 3 - -You may execute this command for each pool. **Note:** An object might accept -I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum -number of required replicas for I/O, you should use the ``min_size`` setting. -For example:: - - ceph osd pool set data min_size 2 - -This ensures that no object in the data pool will receive I/O with fewer than -``min_size`` replicas. - - -Get the Number of Object Replicas -================================= - -To get the number of object replicas, execute the following:: - - ceph osd dump | grep 'replicated size' - -Ceph will list the pools, with the ``replicated size`` attribute highlighted. -By default, ceph creates two replicas of an object (a total of three copies, or -a size of 3). - - - -.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref -.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter -.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups -.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites -.. 
_Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool - diff --git a/src/ceph/doc/rados/operations/upmap.rst b/src/ceph/doc/rados/operations/upmap.rst deleted file mode 100644 index 58f6322..0000000 --- a/src/ceph/doc/rados/operations/upmap.rst +++ /dev/null @@ -1,75 +0,0 @@ -Using the pg-upmap -================== - -Starting in Luminous v12.2.z there is a new *pg-upmap* exception table -in the OSDMap that allows the cluster to explicitly map specific PGs to -specific OSDs. This allows the cluster to fine-tune the data -distribution to, in most cases, perfectly distributed PGs across OSDs. - -The key caveat to this new mechanism is that it requires that all -clients understand the new *pg-upmap* structure in the OSDMap. - -Enabling --------- - -To allow use of the feature, you must tell the cluster that it only -needs to support luminous (and newer) clients with:: - - ceph osd set-require-min-compat-client luminous - -This command will fail if any pre-luminous clients or daemons are -connected to the monitors. You can see what client versions are in -use with:: - - ceph features - -A word of caution ------------------ - -This is a new feature and not very user friendly. At the time of this -writing we are working on a new `balancer` module for ceph-mgr that -will eventually do all of this automatically. - -Until then, - -Offline optimization --------------------- - -Upmap entries are updated with an offline optimizer built into ``osdmaptool``. - -#. Grab the latest copy of your osdmap:: - - ceph osd getmap -o om - -#. Run the optimizer:: - - osdmaptool om --upmap out.txt [--upmap-pool <pool>] [--upmap-max <max-count>] [--upmap-deviation <max-deviation>] - - It is highly recommended that optimization be done for each pool - individually, or for sets of similarly-utilized pools. You can - specify the ``--upmap-pool`` option multiple times. 
"Similar pools" - means pools that are mapped to the same devices and store the same - kind of data (e.g., RBD image pools, yes; RGW index pool and RGW - data pool, no). - - The ``max-count`` value is the maximum number of upmap entries to - identify in the run. The default is 100, but you may want to make - this a smaller number so that the tool completes more quickly (but - does less work). If it cannot find any additional changes to make - it will stop early (i.e., when the pool distribution is perfect). - - The ``max-deviation`` value defaults to `.01` (i.e., 1%). If an OSD - utilization varies from the average by less than this amount it - will be considered perfect. - -#. The proposed changes are written to the output file ``out.txt`` in - the example above. These are normal ceph CLI commands that can be - run to apply the changes to the cluster. This can be done with:: - - source out.txt - -The above steps can be repeated as many times as necessary to achieve -a perfect distribution of PGs for each set of pools. - -You can see some (gory) details about what the tool is doing by -passing ``--debug-osd 10`` to ``osdmaptool``. diff --git a/src/ceph/doc/rados/operations/user-management.rst b/src/ceph/doc/rados/operations/user-management.rst deleted file mode 100644 index 8a35a50..0000000 --- a/src/ceph/doc/rados/operations/user-management.rst +++ /dev/null @@ -1,665 +0,0 @@ -================= - User Management -================= - -This document describes :term:`Ceph Client` users, and their authentication and -authorization with the :term:`Ceph Storage Cluster`. Users are either -individuals or system actors such as applications, which use Ceph clients to -interact with the Ceph Storage Cluster daemons. - -.. 
ditaa:: +-----+ - | {o} | - | | - +--+--+ /---------\ /---------\ - | | Ceph | | Ceph | - ---+---*----->| |<------------->| | - | uses | Clients | | Servers | - | \---------/ \---------/ - /--+--\ - | | - | | - actor - - -When Ceph runs with authentication and authorization enabled (enabled by -default), you must specify a user name and a keyring containing the secret key -of the specified user (usually via the command line). If you do not specify a -user name, Ceph will use ``client.admin`` as the default user name. If you do -not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting -in the Ceph configuration. For example, if you execute the ``ceph health`` -command without specifying a user or keyring:: - - ceph health - -Ceph interprets the command like this:: - - ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health - -Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid -re-entry of the user name and secret. - -For details on configuring the Ceph Storage Cluster to use authentication, -see `Cephx Config Reference`_. For details on the architecture of Cephx, see -`Architecture - High Availability Authentication`_. - - -Background -========== - -Irrespective of the type of Ceph client (e.g., Block Device, Object Storage, -Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_. -Ceph users must have access to pools in order to read and write data. -Additionally, Ceph users must have execute permissions to use Ceph's -administrative commands. The following concepts will help you understand Ceph -user management. - - -User ----- - -A user is either an individual or a system actor such as an application. -Creating users allows you to control who (or what) can access your Ceph Storage -Cluster, its pools, and the data within pools. - -Ceph has the notion of a ``type`` of user. For the purposes of user management, -the type will always be ``client``. 
Ceph identifies users in period (.) -delimited form consisting of the user type and the user ID: for example, -``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing -is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol, -but they are not clients. Distinguishing the user type helps to distinguish -between client users and other users--streamlining access control, user -monitoring and traceability. - -Sometimes Ceph's user type may seem confusing, because the Ceph command line -allows you to specify a user with or without the type, depending upon your -command line usage. If you specify ``--user`` or ``--id``, you can omit the -type. So ``client.user1`` can be entered simply as ``user1``. If you specify -``--name`` or ``-n``, you must specify the type and name, such as -``client.user1``. We recommend using the type and name as a best practice -wherever possible. - -.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage - user or a Ceph Filesystem user. The Ceph Object Gateway uses a Ceph Storage - Cluster user to communicate between the gateway daemon and the storage - cluster, but the gateway has its own user management functionality for end - users. The Ceph Filesystem uses POSIX semantics. The user space associated - with the Ceph Filesystem is not the same as a Ceph Storage Cluster user. - - - -Authorization (Capabilities) ----------------------------- - -Ceph uses the term "capabilities" (caps) to describe authorizing an -authenticated user to exercise the functionality of the monitors, OSDs and -metadata servers. Capabilities can also restrict access to data within a pool or -a namespace within a pool. A Ceph administrative user sets a user's -capabilities when creating or updating a user. - -Capability syntax follows the form:: - - {daemon-type} '{capspec}[, {capspec} ...]' - -- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access - settings or ``profile {name}``. 
For example:: - - mon 'allow rwx' - mon 'profile osd' - -- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``, - ``class-write`` access settings or ``profile {name}``. Additionally, OSD - capabilities allow for pool and namespace settings. :: - - osd 'allow {access} [pool={pool-name} [namespace={namespace-name}]]' - osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]]' - -- **Metadata Server Caps:** For administrators, use ``allow *``. For all - other users, such as CephFS clients, consult :doc:`/cephfs/client-auth`. - - -.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the - Ceph Storage Cluster, so it is not represented as a Ceph Storage - Cluster daemon type. - -The following entries describe each capability. - -``allow`` - -:Description: Precedes access settings for a daemon. Implies ``rw`` - for MDS only. - - -``r`` - -:Description: Gives the user read access. Required with monitors to retrieve - the CRUSH map. - - -``w`` - -:Description: Gives the user write access to objects. - - -``x`` - -:Description: Gives the user the capability to call class methods - (i.e., both read and write) and to conduct ``auth`` - operations on monitors. - - -``class-read`` - -:Description: Gives the user the capability to call class read methods. - Subset of ``x``. - - -``class-write`` - -:Description: Gives the user the capability to call class write methods. - Subset of ``x``. - - -``*`` - -:Description: Gives the user read, write and execute permissions for a - particular daemon/pool, and the ability to execute - admin commands. - - -``profile osd`` (Monitor only) - -:Description: Gives a user permissions to connect as an OSD to other OSDs or - monitors. Conferred on OSDs to enable OSDs to handle replication - heartbeat traffic and status reporting. - - -``profile mds`` (Monitor only) - -:Description: Gives a user permissions to connect as an MDS to other MDSs or - monitors. 
- - -``profile bootstrap-osd`` (Monitor only) - -:Description: Gives a user permissions to bootstrap an OSD. Conferred on - deployment tools such as ``ceph-disk``, ``ceph-deploy``, etc. - so that they have permissions to add keys, etc. when - bootstrapping an OSD. - - -``profile bootstrap-mds`` (Monitor only) - -:Description: Gives a user permissions to bootstrap a metadata server. - Conferred on deployment tools such as ``ceph-deploy``, etc. - so they have permissions to add keys, etc. when bootstrapping - a metadata server. - -``profile rbd`` (Monitor and OSD) - -:Description: Gives a user permissions to manipulate RBD images. When used - as a Monitor cap, it provides the minimal privileges required - by an RBD client application. When used as an OSD cap, it - provides read-write access to an RBD client application. - -``profile rbd-read-only`` (OSD only) - -:Description: Gives a user read-only permissions to an RBD image. - - -Pool ----- - -A pool is a logical partition where users store data. -In Ceph deployments, it is common to create a pool as a logical partition for -similar types of data. For example, when deploying Ceph as a backend for -OpenStack, a typical deployment would have pools for volumes, images, backups -and virtual machines, and users such as ``client.glance``, ``client.cinder``, -etc. - - -Namespace ---------- - -Objects within a pool can be associated to a namespace--a logical group of -objects within the pool. A user's access to a pool can be associated with a -namespace such that reads and writes by the user take place only within the -namespace. Objects written to a namespace within the pool can only be accessed -by users who have access to the namespace. - -.. note:: Namespaces are primarily useful for applications written on top of - ``librados`` where the logical grouping can alleviate the need to create - different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various - metadata objects. 
- -The rationale for namespaces is that pools can be a computationally expensive -method of segregating data sets for the purposes of authorizing separate sets -of users. For example, a pool should have ~100 placement groups per OSD. So an -exemplary cluster with 1000 OSDs would have 100,000 placement groups for one -pool. Each additional pool would create another 100,000 placement groups in the -exemplary cluster. By contrast, writing an object to a namespace simply -associates the namespace with the object name without the computational -overhead of a separate pool. Rather than creating a separate pool for a user or -set of users, you may use a namespace. **Note:** Only available using -``librados`` at this time. - - -Managing Users -============== - -User management functionality provides Ceph Storage Cluster administrators with -the ability to create, update and delete users directly in the Ceph Storage -Cluster. - -When you create or delete users in the Ceph Storage Cluster, you may need to -distribute keys to clients so that they can be added to keyrings. See `Keyring -Management`_ for details. - - -List Users ----------- - -To list the users in your cluster, execute the following:: - - ceph auth ls - -Ceph will list out all users in your cluster. 
For example, in a two-node -exemplary cluster, ``ceph auth ls`` will output something that looks like -this:: - - installed auth entries: - - osd.0 - key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w== - caps: [mon] allow profile osd - caps: [osd] allow * - osd.1 - key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA== - caps: [mon] allow profile osd - caps: [osd] allow * - client.admin - key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw== - caps: [mds] allow - caps: [mon] allow * - caps: [osd] allow * - client.bootstrap-mds - key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww== - caps: [mon] allow profile bootstrap-mds - client.bootstrap-osd - key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw== - caps: [mon] allow profile bootstrap-osd - - -Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a -user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type -``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user). -Note also that each entry has a ``key: <value>`` entry, and one or more -``caps:`` entries. - -You may use the ``-o {filename}`` option with ``ceph auth ls`` to -save the output to a file. - - -Get a User ----------- - -To retrieve a specific user, key and capabilities, execute the -following:: - - ceph auth get {TYPE.ID} - -For example:: - - ceph auth get client.admin - -You may also use the ``-o {filename}`` option with ``ceph auth get`` to -save the output to a file. Developers may also execute the following:: - - ceph auth export {TYPE.ID} - -The ``auth export`` command is identical to ``auth get``, but also prints -out the internal ``auid``, which is not relevant to end users. - - - -Add a User ----------- - -Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and -any capabilities included in the command you use to create the user. - -A user's key enables the user to authenticate with the Ceph Storage Cluster. 
-The user's capabilities authorize the user to read, write, or execute on Ceph -monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``). - -There are a few ways to add a user: - -- ``ceph auth add``: This command is the canonical way to add a user. It - will create the user, generate a key and add any specified capabilities. - -- ``ceph auth get-or-create``: This command is often the most convenient way - to create a user, because it returns the user name (in brackets) and the - key in keyfile format. If the user already exists, this command - simply returns the user name and key in the keyfile format. You may use the - ``-o {filename}`` option to save the output to a file. - -- ``ceph auth get-or-create-key``: This command is a convenient way to create - a user and return the user's key (only). This is useful for clients that - need the key only (e.g., libvirt). If the user already exists, this command - simply returns the key. You may use the ``-o {filename}`` option to save the - output to a file. - -When creating client users, you may create a user with no capabilities. A user -with no capabilities is useless beyond mere authentication, because the client -cannot retrieve the cluster map from the monitor. However, you can create a -user with no capabilities if you wish to defer adding capabilities until later -using the ``ceph auth caps`` command. - -A typical user has at least read capabilities on the Ceph monitor and -read and write capabilities on Ceph OSDs. Additionally, a user's OSD permissions -are often restricted to accessing a particular pool. :: - - ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool' - ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool' - ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring - ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key - - -.. 
important:: If you provide a user with capabilities to OSDs, but you DO NOT - restrict access to particular pools, the user will have access to ALL - pools in the cluster! - - -.. _modify-user-capabilities: - -Modify User Capabilities ------------------------- - -The ``ceph auth caps`` command allows you to specify a user and change the -user's capabilities. Setting new capabilities will overwrite current capabilities. -To view current capabilities run ``ceph auth get USERTYPE.USERID``. To add -capabilities, you should also specify the existing capabilities when using the form:: - - ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]'] - -For example:: - - ceph auth get client.john - ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool' - ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool' - ceph auth caps client.brian-manager mon 'allow *' osd 'allow *' - -To remove a capability, you may reset the capability. If you want the user -to have no access to a particular daemon that was previously set, specify -an empty string. For example:: - - ceph auth caps client.ringo mon ' ' osd ' ' - -See `Authorization (Capabilities)`_ for additional details on capabilities. - - -Delete a User -------------- - -To delete a user, use ``ceph auth del``:: - - ceph auth del {TYPE}.{ID} - -Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, -and ``{ID}`` is the user name or ID of the daemon. - - -Print a User's Key ------------------- - -To print a user's authentication key to standard output, execute the following:: - - ceph auth print-key {TYPE}.{ID} - -Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, -and ``{ID}`` is the user name or ID of the daemon. - -Printing a user's key is useful when you need to populate client -software with a user's key (e.g., libvirt). 
:: - - mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` - - -Import a User(s) ----------------- - -To import one or more users, use ``ceph auth import`` and -specify a keyring:: - - ceph auth import -i /path/to/keyring - -For example:: - - sudo ceph auth import -i /etc/ceph/ceph.keyring - - -.. note:: The ceph storage cluster will add new users, their keys and their - capabilities and will update existing users, their keys and their - capabilities. - - -Keyring Management -================== - -When you access Ceph via a Ceph client, the Ceph client will look for a local -keyring. Ceph presets the ``keyring`` setting with the following four keyring -names by default so you don't have to set them in your Ceph configuration file -unless you want to override the defaults (not recommended): - -- ``/etc/ceph/$cluster.$name.keyring`` -- ``/etc/ceph/$cluster.keyring`` -- ``/etc/ceph/keyring`` -- ``/etc/ceph/keyring.bin`` - -The ``$cluster`` metavariable is your Ceph cluster name as defined by the -name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name -is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user -type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``). - -.. note:: When executing commands that read or write to ``/etc/ceph``, you may - need to use ``sudo`` to execute the command as ``root``. - -After you create a user (e.g., ``client.ringo``), you must get the key and add -it to a keyring on a Ceph client so that the user can access the Ceph Storage -Cluster. - -The `User Management`_ section details how to list, get, add, modify and delete -users directly in the Ceph Storage Cluster. However, Ceph also provides the -``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client. 
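The default keyring names described above can be sketched as a small substitution over the ``$cluster`` and ``$name`` metavariables. This is only an illustration: the helper ``default_keyring_paths`` is hypothetical and not part of Ceph, and it returns the names in the order the list above gives them without claiming anything about how Ceph itself searches them.

```python
def default_keyring_paths(cluster, name):
    """Expand the four default keyring names for a cluster and user.

    `cluster` stands in for $cluster (e.g. "ceph") and `name` for $name,
    the user type and ID (e.g. "client.admin"). Hypothetical helper for
    illustration only; it does not read any Ceph configuration.
    """
    templates = [
        "/etc/ceph/$cluster.$name.keyring",
        "/etc/ceph/$cluster.keyring",
        "/etc/ceph/keyring",
        "/etc/ceph/keyring.bin",
    ]
    # Plain textual substitution of the two metavariables.
    return [
        t.replace("$cluster", cluster).replace("$name", name)
        for t in templates
    ]


paths = default_keyring_paths("ceph", "client.admin")
for path in paths:
    print(path)
```

For the ``client.admin`` user of a cluster named ``ceph``, the first entry expands to ``/etc/ceph/ceph.client.admin.keyring``, which matches the keyring path used in the ``ceph health`` example earlier in this document.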
- - -Create a Keyring ----------------- - -When you use the procedures in the `Managing Users`_ section to create users, -you need to provide user keys to the Ceph client(s) so that the Ceph client -can retrieve the key for the specified user and authenticate with the Ceph -Storage Cluster. Ceph clients access keyrings to look up a user name and -retrieve the user's key. - -The ``ceph-authtool`` utility allows you to create a keyring. To create an -empty keyring, use ``--create-keyring`` or ``-C``. For example:: - - ceph-authtool --create-keyring /path/to/keyring - -When creating a keyring with multiple users, we recommend using the cluster name -(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the -``/etc/ceph`` directory so that the ``keyring`` configuration default setting -will pick up the filename without requiring you to specify it in the local copy -of your Ceph configuration file. For example, create ``ceph.keyring`` by -executing the following:: - - sudo ceph-authtool -C /etc/ceph/ceph.keyring - -When creating a keyring with a single user, we recommend using the cluster name, -the user type and the user name and saving it in the ``/etc/ceph`` directory. -For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user. - -To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means -the file will have ``rw`` permissions for the ``root`` user only, which is -appropriate when the keyring contains administrator keys. However, if you -intend to use the keyring for a particular user or group of users, ensure -that you execute ``chown`` or ``chmod`` to establish appropriate keyring -ownership and access. - - -Add a User to a Keyring ------------------------ - -When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a -User`_ procedure to retrieve a user, key and capabilities and save the user to a -keyring. 
- -When you only want to use one user per keyring, the `Get a User`_ procedure with -the ``-o`` option will save the output in the keyring file format. For example, -to create a keyring for the ``client.admin`` user, execute the following:: - - sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring - -Notice that we use the recommended file name format for an individual user. - -When you want to import users to a keyring, you can use ``ceph-authtool`` -to specify the destination keyring and the source keyring. -For example:: - - sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring - - -Create a User -------------- - -Ceph provides the `Add a User`_ function to create a user directly in the Ceph -Storage Cluster. However, you can also create a user, keys and capabilities -directly on a Ceph client keyring. Then, you can import the user to the Ceph -Storage Cluster. For example:: - - sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring - -See `Authorization (Capabilities)`_ for additional details on capabilities. - -You can also create a keyring and add a new user to the keyring simultaneously. -For example:: - - sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key - -In the foregoing scenarios, the new user ``client.ringo`` is only in the -keyring. To make the user known to the Ceph Storage Cluster, you must still -add the user from the keyring. :: - - sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring - - -Modify a User -------------- - -To modify the capabilities of a user record in a keyring, specify the keyring, -and the user followed by the capabilities. 
For example:: - - sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' - -To apply the change to the Ceph Storage Cluster, you must import the updated -user entry from the keyring into the Ceph Storage Cluster. :: - - sudo ceph auth import -i /etc/ceph/ceph.keyring - -See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user -from a keyring. - -You may also `Modify User Capabilities`_ directly in the cluster, store the -results in a keyring file, and then import the keyring into your main -``ceph.keyring`` file. - - -Command Line Usage -================== - -Ceph supports the following usage for user name and secret: - -``--id`` | ``--user`` - -:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or - ``client.admin``, ``client.user1``). The ``--id`` and - ``--user`` options enable you to specify the ID portion of the user - name (e.g., ``admin``, ``user1``, ``foo``, etc.). You can specify - the user with ``--id`` and omit the type. For example, - to specify user ``client.foo`` enter the following:: - - ceph --id foo --keyring /path/to/keyring health - ceph --user foo --keyring /path/to/keyring health - - -``--name`` | ``-n`` - -:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or - ``client.admin``, ``client.user1``). The ``--name`` and ``-n`` - options enable you to specify the fully qualified user name. - You must specify the user type (typically ``client``) with the - user ID. For example:: - - ceph --name client.foo --keyring /path/to/keyring health - ceph -n client.foo --keyring /path/to/keyring health - - -``--keyring`` - -:Description: The path to the keyring containing one or more user names and - secrets. The ``--secret`` option provides the same functionality, - but it does not work with Ceph RADOS Gateway, which uses - ``--secret`` for another purpose. 
You may retrieve a keyring with - ``ceph auth get-or-create`` and store it locally. This is a - preferred approach, because you can switch user names without - switching the keyring path. For example:: - - sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage - - -.. _pools: ../pools - - -Limitations -=========== - -The ``cephx`` protocol authenticates Ceph clients and servers to each other. It -is not intended to handle authentication of human users or application programs -run on their behalf. If that effect is required to handle your access control -needs, you must have another mechanism, which is likely to be specific to the -front end used to access the Ceph object store. This other mechanism has the -role of ensuring that only acceptable users and programs are able to run on the -machine that Ceph will permit to access its object store. - -The keys used to authenticate Ceph clients and servers are typically stored in -a plain text file with appropriate permissions in a trusted host. - -.. important:: Storing keys in plaintext files has security shortcomings, but - they are difficult to avoid, given the basic authentication methods Ceph - uses in the background. Those setting up Ceph systems should be aware of - these shortcomings. - -In particular, arbitrary user machines, especially portable machines, should not -be configured to interact directly with Ceph, since that mode of use would -require the storage of a plaintext authentication key on an insecure machine. -Anyone who stole that machine or obtained surreptitious access to it could -obtain the key that will allow them to authenticate their own machines to Ceph. - -Rather than permitting potentially insecure machines to access a Ceph object -store directly, users should be required to sign in to a trusted machine in -your environment using a method that provides sufficient security for your -purposes. That trusted machine will store the plaintext Ceph keys for the -human users. 
A future version of Ceph may address these particular -authentication issues more fully. - -At the moment, none of the Ceph authentication protocols provide secrecy for -messages in transit. Thus, an eavesdropper on the wire can hear and understand -all data sent between clients and servers in Ceph, even if it cannot create or -alter them. Further, Ceph does not include options to encrypt user data in the -object store. Users can hand-encrypt and store their own data in the Ceph -object store, of course, but Ceph provides no features to perform object -encryption itself. Those storing sensitive data in Ceph should consider -encrypting their data before providing it to the Ceph system. - - -.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication -.. _Cephx Config Reference: ../../configuration/auth-config-ref |