Diffstat (limited to 'src/ceph/doc/rados')
70 files changed, 0 insertions, 19407 deletions
diff --git a/src/ceph/doc/rados/api/index.rst b/src/ceph/doc/rados/api/index.rst deleted file mode 100644 index cccc153..0000000 --- a/src/ceph/doc/rados/api/index.rst +++ /dev/null @@ -1,22 +0,0 @@ -=========================== - Ceph Storage Cluster APIs -=========================== - -The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables -clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`. -``librados`` provides this functionality to :term:`Ceph Clients` in the form of -a library. All Ceph Clients either use ``librados`` or the same functionality -encapsulated in ``librados`` to interact with the object store. For example, -``librbd`` and ``libcephfs`` leverage this functionality. You may use -``librados`` to interact with Ceph directly (e.g., an application that talks to -Ceph, your own interface to Ceph, etc.). - - -.. toctree:: - :maxdepth: 2 - - Introduction to librados <librados-intro> - librados (C) <librados> - librados (C++) <libradospp> - librados (Python) <python> - object class <objclass-sdk> diff --git a/src/ceph/doc/rados/api/librados-intro.rst b/src/ceph/doc/rados/api/librados-intro.rst deleted file mode 100644 index 8405f6e..0000000 --- a/src/ceph/doc/rados/api/librados-intro.rst +++ /dev/null @@ -1,1003 +0,0 @@ -========================== - Introduction to librados -========================== - -The :term:`Ceph Storage Cluster` provides the basic storage service that allows -:term:`Ceph` to uniquely deliver **object, block, and file storage** in one -unified system. However, you are not limited to using the RESTful, block, or -POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object -Store)`, the ``librados`` API enables you to create your own interface to the -Ceph Storage Cluster. - -The ``librados`` API enables you to interact with the two types of daemons in -the Ceph Storage Cluster: - -- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map. -- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node. - -.. ditaa:: - +---------------------------------+ - | Ceph Storage Cluster Protocol | - | (librados) | - +---------------------------------+ - +---------------+ +---------------+ - | OSDs | | Monitors | - +---------------+ +---------------+ - -This guide provides a high-level introduction to using ``librados``. -Refer to :doc:`../../architecture` for additional details of the Ceph -Storage Cluster. To use the API, you need a running Ceph Storage Cluster. -See `Installation (Quick)`_ for details. - - -Step 1: Getting librados -======================== - -Your client application must bind with ``librados`` to connect to the Ceph -Storage Cluster. You must install ``librados`` and any required packages to -write applications that use ``librados``. The ``librados`` API is written in -C++, with additional bindings for C, Python, Java and PHP. - - -Getting librados for C/C++ --------------------------- - -To install ``librados`` development support files for C/C++ on Debian/Ubuntu -distributions, execute the following:: - - sudo apt-get install librados-dev - -To install ``librados`` development support files for C/C++ on RHEL/CentOS -distributions, execute the following:: - - sudo yum install librados2-devel - -Once you install ``librados`` for developers, you can find the required -headers for C/C++ under ``/usr/include/rados``. 
:: - - ls /usr/include/rados - - -Getting librados for Python ---------------------------- - -The ``rados`` module provides ``librados`` support to Python -applications. The ``librados-dev`` package for Debian/Ubuntu -and the ``librados2-devel`` package for RHEL/CentOS will install the -``python-rados`` package for you. You may install ``python-rados`` -directly too. - -To install ``librados`` development support files for Python on Debian/Ubuntu -distributions, execute the following:: - - sudo apt-get install python-rados - -To install ``librados`` development support files for Python on RHEL/CentOS -distributions, execute the following:: - - sudo yum install python-rados - -You can find the module under ``/usr/share/pyshared`` on Debian systems, -or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems. - - -Getting librados for Java -------------------------- - -To install ``librados`` for Java, you need to execute the following procedure: - -#. Install ``jna.jar``. For Debian/Ubuntu, execute:: - - sudo apt-get install libjna-java - - For CentOS/RHEL, execute:: - - sudo yum install jna - - The JAR files are located in ``/usr/share/java``. - -#. Clone the ``rados-java`` repository:: - - git clone --recursive https://github.com/ceph/rados-java.git - -#. Build the ``rados-java`` repository:: - - cd rados-java - ant - - The JAR file is located under ``rados-java/target``. - -#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and - ensure that it and the JNA JAR are in your JVM's classpath. For example:: - - sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar - sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar - sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar - -To build the documentation, execute the following:: - - ant docs - - -Getting librados for PHP -------------------------- - -To install the ``librados`` extension for PHP, you need to execute the following procedure: - -#. Install php-dev. For Debian/Ubuntu, execute:: - - sudo apt-get install php5-dev build-essential - - For CentOS/RHEL, execute:: - - sudo yum install php-devel - -#. Clone the ``phprados`` repository:: - - git clone https://github.com/ceph/phprados.git - -#. Build ``phprados``:: - - cd phprados - phpize - ./configure - make - sudo make install - -#. Enable ``phprados`` in php.ini by adding:: - - extension=rados.so - - -Step 2: Configuring a Cluster Handle -==================================== - -A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store -and retrieve data. To interact with OSDs, the client app must invoke -``librados`` and connect to a Ceph Monitor. Once connected, ``librados`` -retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app -wants to read or write data, it creates an I/O context and binds to a -:term:`pool`. The pool has an associated :term:`ruleset` that defines how it -will place data in the storage cluster. Via the I/O context, the client -provides the object name to ``librados``, which takes the object name -and the cluster map (i.e., the topology of the cluster) and `computes`_ the -placement group and `OSD`_ for locating the data. Then the client application -can read or write data. The client app doesn't need to learn about the topology -of the cluster directly. - -.. 
ditaa:: - +--------+ Retrieves +---------------+ - | Client |------------>| Cluster Map | - +--------+ +---------------+ - | - v Writes - /-----\ - | obj | - \-----/ - | To - v - +--------+ +---------------+ - | Pool |---------->| CRUSH Ruleset | - +--------+ Selects +---------------+ - - -The Ceph Storage Cluster handle encapsulates the client configuration, including: - -- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()`` - (preferred). -- The :term:`cephx` authentication key -- The monitor ID and IP address -- Logging levels -- Debugging levels - -Thus, the first steps in using the cluster from your app are to 1) create -a cluster handle that your app will use to connect to the storage cluster, -and then 2) use that handle to connect. To connect to the cluster, the -app must supply a monitor address, a username and an authentication key -(cephx is enabled by default). - -.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster - with different users – requires different cluster handles. - -RADOS provides a number of ways for you to set the required values. For -the monitor and encryption key settings, an easy way to handle them is to ensure -that your Ceph configuration file contains a ``keyring`` path to a keyring file -and at least one monitor address (e.g,. ``mon host``). For example:: - - [global] - mon host = 192.168.1.1 - keyring = /etc/ceph/ceph.client.admin.keyring - -Once you create the handle, you can read a Ceph configuration file to configure -the handle. You can also pass arguments to your app and parse them with the -function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``), -or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some -wrappers may not implement convenience methods, so you may need to implement -these capabilities. The following diagram provides a high-level flow for the -initial connection. - - -.. ditaa:: +---------+ +---------+ - | Client | | Monitor | - +---------+ +---------+ - | | - |-----+ create | - | | cluster | - |<----+ handle | - | | - |-----+ read | - | | config | - |<----+ file | - | | - | connect | - |-------------->| - | | - |<--------------| - | connected | - | | - - -Once connected, your app can invoke functions that affect the whole cluster -with only the cluster handle. For example, once you have a cluster -handle, you can: - -- Get cluster statistics -- Use Pool Operation (exists, create, list, delete) -- Get and set the configuration - - -One of the powerful features of Ceph is the ability to bind to different pools. -Each pool may have a different number of placement groups, object replicas and -replication strategies. For example, a pool could be set up as a "hot" pool that -uses SSDs for frequently used objects or a "cold" pool that uses erasure coding. - -The main difference in the various ``librados`` bindings is between C and -the object-oriented bindings for C++, Java and Python. The object-oriented -bindings use objects to represent cluster handles, IO Contexts, iterators, -exceptions, etc. - - -C Example ---------- - -For C, creating a simple cluster handle using the ``admin`` user, configuring -it and connecting to the cluster might look something like this: - -.. code-block:: c - - #include <stdio.h> - #include <stdlib.h> - #include <string.h> - #include <rados/librados.h> - - int main (int argc, const char **argv) - { - - /* Declare the cluster handle and required arguments. 
*/ - rados_t cluster; - char cluster_name[] = "ceph"; - char user_name[] = "client.admin"; - uint64_t flags; - - /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */ - int err; - err = rados_create2(&cluster, cluster_name, user_name, flags); - - if (err < 0) { - fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err)); - exit(EXIT_FAILURE); - } else { - printf("\nCreated a cluster handle.\n"); - } - - - /* Read a Ceph configuration file to configure the cluster handle. */ - err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); - if (err < 0) { - fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); - exit(EXIT_FAILURE); - } else { - printf("\nRead the config file.\n"); - } - - /* Read command line arguments */ - err = rados_conf_parse_argv(cluster, argc, argv); - if (err < 0) { - fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err)); - exit(EXIT_FAILURE); - } else { - printf("\nRead the command line arguments.\n"); - } - - /* Connect to the cluster */ - err = rados_connect(cluster); - if (err < 0) { - fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); - exit(EXIT_FAILURE); - } else { - printf("\nConnected to the cluster.\n"); - } - - } - -Compile your client and link to ``librados`` using ``-lrados``. For example:: - - gcc ceph-client.c -lrados -o ceph-client - - -C++ Example ------------ - -The Ceph project provides a C++ example in the ``ceph/examples/librados`` -directory. For C++, a simple cluster handle using the ``admin`` user requires -you to initialize a ``librados::Rados`` cluster handle object: - -.. code-block:: c++ - - #include <iostream> - #include <string> - #include <rados/librados.hpp> - - int main(int argc, const char **argv) - { - - int ret = 0; - - /* Declare the cluster handle and required variables. */ - librados::Rados cluster; - char cluster_name[] = "ceph"; - char user_name[] = "client.admin"; - uint64_t flags = 0; - - /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */ - { - ret = cluster.init2(user_name, cluster_name, flags); - if (ret < 0) { - std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl; - return EXIT_FAILURE; - } else { - std::cout << "Created a cluster handle." << std::endl; - } - } - - /* Read a Ceph configuration file to configure the cluster handle. */ - { - ret = cluster.conf_read_file("/etc/ceph/ceph.conf"); - if (ret < 0) { - std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl; - return EXIT_FAILURE; - } else { - std::cout << "Read the Ceph configuration file." << std::endl; - } - } - - /* Read command line arguments */ - { - ret = cluster.conf_parse_argv(argc, argv); - if (ret < 0) { - std::cerr << "Couldn't parse command line options! error " << ret << std::endl; - return EXIT_FAILURE; - } else { - std::cout << "Parsed command line options." << std::endl; - } - } - - /* Connect to the cluster */ - { - ret = cluster.connect(); - if (ret < 0) { - std::cerr << "Couldn't connect to cluster! error " << ret << std::endl; - return EXIT_FAILURE; - } else { - std::cout << "Connected to the cluster." << std::endl; - } - } - - return 0; - } - - -Compile the source; then, link ``librados`` using ``-lrados``. 
-For example:: - - g++ -g -c ceph-client.cc -o ceph-client.o - g++ -g ceph-client.o -lrados -o ceph-client - - - -Python Example --------------- - -Python uses the ``admin`` id and the ``ceph`` cluster name by default, and -will read the standard ``ceph.conf`` file if the conffile parameter is -set to the empty string. The Python binding converts C++ errors -into exceptions. - - -.. code-block:: python - - import rados - - try: - cluster = rados.Rados(conffile='') - except TypeError as e: - print 'Argument validation error: ', e - raise e - - print "Created cluster handle." - - try: - cluster.connect() - except Exception as e: - print "connection error: ", e - raise e - finally: - print "Connected to the cluster." - - -Execute the example to verify that it connects to your cluster. :: - - python ceph-client.py - - -Java Example ------------- - -Java requires you to specify the user ID (``admin``) or user name -(``client.admin``), and uses the ``ceph`` cluster name by default . The Java -binding converts C++-based errors into exceptions. - -.. code-block:: java - - import com.ceph.rados.Rados; - import com.ceph.rados.RadosException; - - import java.io.File; - - public class CephClient { - public static void main (String args[]){ - - try { - Rados cluster = new Rados("admin"); - System.out.println("Created cluster handle."); - - File f = new File("/etc/ceph/ceph.conf"); - cluster.confReadFile(f); - System.out.println("Read the configuration file."); - - cluster.connect(); - System.out.println("Connected to the cluster."); - - } catch (RadosException e) { - System.out.println(e.getMessage() + ": " + e.getReturnValue()); - } - } - } - - -Compile the source; then, run it. If you have copied the JAR to -``/usr/share/java`` and sym linked from your ``ext`` directory, you won't need -to specify the classpath. For example:: - - javac CephClient.java - java CephClient - - -PHP Example ------------- - -With the RADOS extension enabled in PHP you can start creating a new cluster handle very easily: - -.. code-block:: php - - <?php - - $r = rados_create(); - rados_conf_read_file($r, '/etc/ceph/ceph.conf'); - if (!rados_connect($r)) { - echo "Failed to connect to Ceph cluster"; - } else { - echo "Successfully connected to Ceph cluster"; - } - - -Save this as rados.php and run the code:: - - php rados.php - - -Step 3: Creating an I/O Context -=============================== - -Once your app has a cluster handle and a connection to a Ceph Storage Cluster, -you may create an I/O Context and begin reading and writing data. An I/O Context -binds the connection to a specific pool. The user must have appropriate -`CAPS`_ permissions to access the specified pool. For example, a user with read -access but not write access will only be able to read data. I/O Context -functionality includes: - -- Write/read data and extended attributes -- List and iterate over objects and extended attributes -- Snapshot pools, list snapshots, etc. - - -.. 
ditaa:: +---------+ +---------+ +---------+ - | Client | | Monitor | | OSD | - +---------+ +---------+ +---------+ - | | | - |-----+ create | | - | | I/O | | - |<----+ context | | - | | | - | write data | | - |---------------+-------------->| - | | | - | write ack | | - |<--------------+---------------| - | | | - | write xattr | | - |---------------+-------------->| - | | | - | xattr ack | | - |<--------------+---------------| - | | | - | read data | | - |---------------+-------------->| - | | | - | read ack | | - |<--------------+---------------| - | | | - | remove data | | - |---------------+-------------->| - | | | - | remove ack | | - |<--------------+---------------| - - - -RADOS enables you to interact both synchronously and asynchronously. Once your -app has an I/O Context, read/write operations only require you to know the -object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the -cluster map to identify the appropriate OSD. OSD daemons handle the replication, -as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also -maps objects to placement groups, as described in `Calculating PG IDs`_. - -The following examples use the default ``data`` pool. However, you may also -use the API to list pools, ensure they exist, or create and delete pools. For -the write operations, the examples illustrate how to use synchronous mode. For -the read operations, the examples illustrate how to use asynchronous mode. - -.. important:: Use caution when deleting pools with this API. If you delete - a pool, the pool and ALL DATA in the pool will be lost. - - -C Example ---------- - - -.. code-block:: c - - #include <stdio.h> - #include <stdlib.h> - #include <string.h> - #include <rados/librados.h> - - int main (int argc, const char **argv) - { - /* - * Continued from previous C example, where cluster handle and - * connection are established. First declare an I/O Context. - */ - - rados_ioctx_t io; - char *poolname = "data"; - - err = rados_ioctx_create(cluster, poolname, &io); - if (err < 0) { - fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); - rados_shutdown(cluster); - exit(EXIT_FAILURE); - } else { - printf("\nCreated I/O context.\n"); - } - - /* Write data to the cluster synchronously. */ - err = rados_write(io, "hw", "Hello World!", 12, 0); - if (err < 0) { - fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } else { - printf("\nWrote \"Hello World\" to object \"hw\".\n"); - } - - char xattr[] = "en_US"; - err = rados_setxattr(io, "hw", "lang", xattr, 5); - if (err < 0) { - fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } else { - printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n"); - } - - /* - * Read data from the cluster asynchronously. - * First, set up asynchronous I/O completion. - */ - rados_completion_t comp; - err = rados_aio_create_completion(NULL, NULL, NULL, &comp); - if (err < 0) { - fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } else { - printf("\nCreated AIO completion.\n"); - } - - /* Next, read data using rados_aio_read. 
*/ - char read_res[100]; - err = rados_aio_read(io, "hw", comp, read_res, 12, 0); - if (err < 0) { - fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } else { - printf("\nRead object \"hw\". The contents are:\n %s \n", read_res); - } - - /* Wait for the operation to complete */ - rados_aio_wait_for_complete(comp); - - /* Release the asynchronous I/O complete handle to avoid memory leaks. */ - rados_aio_release(comp); - - - char xattr_res[100]; - err = rados_getxattr(io, "hw", "lang", xattr_res, 5); - if (err < 0) { - fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } else { - printf("\nRead xattr \"lang\" for object \"hw\". The contents are:\n %s \n", xattr_res); - } - - err = rados_rmxattr(io, "hw", "lang"); - if (err < 0) { - fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } else { - printf("\nRemoved xattr \"lang\" for object \"hw\".\n"); - } - - err = rados_remove(io, "hw"); - if (err < 0) { - fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } else { - printf("\nRemoved object \"hw\".\n"); - } - - } - - - -C++ Example ------------ - - -.. code-block:: c++ - - #include <iostream> - #include <string> - #include <rados/librados.hpp> - - int main(int argc, const char **argv) - { - - /* Continued from previous C++ example, where cluster handle and - * connection are established. First declare an I/O Context. - */ - - librados::IoCtx io_ctx; - const char *pool_name = "data"; - - { - ret = cluster.ioctx_create(pool_name, io_ctx); - if (ret < 0) { - std::cerr << "Couldn't set up ioctx! error " << ret << std::endl; - exit(EXIT_FAILURE); - } else { - std::cout << "Created an ioctx for the pool." << std::endl; - } - } - - - /* Write an object synchronously. */ - { - librados::bufferlist bl; - bl.append("Hello World!"); - ret = io_ctx.write_full("hw", bl); - if (ret < 0) { - std::cerr << "Couldn't write object! error " << ret << std::endl; - exit(EXIT_FAILURE); - } else { - std::cout << "Wrote new object 'hw' " << std::endl; - } - } - - - /* - * Add an xattr to the object. - */ - { - librados::bufferlist lang_bl; - lang_bl.append("en_US"); - ret = io_ctx.setxattr("hw", "lang", lang_bl); - if (ret < 0) { - std::cerr << "failed to set xattr version entry! error " - << ret << std::endl; - exit(EXIT_FAILURE); - } else { - std::cout << "Set the xattr 'lang' on our object!" << std::endl; - } - } - - - /* - * Read the object back asynchronously. - */ - { - librados::bufferlist read_buf; - int read_len = 4194304; - - //Create I/O Completion. - librados::AioCompletion *read_completion = librados::Rados::aio_create_completion(); - - //Send read request. - ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0); - if (ret < 0) { - std::cerr << "Couldn't start read object! error " << ret << std::endl; - exit(EXIT_FAILURE); - } - - // Wait for the request to complete, and check that it succeeded. - read_completion->wait_for_complete(); - ret = read_completion->get_return_value(); - if (ret < 0) { - std::cerr << "Couldn't read object! 
error " << ret << std::endl; - exit(EXIT_FAILURE); - } else { - std::cout << "Read object hw asynchronously with contents.\n" - << read_buf.c_str() << std::endl; - } - } - - - /* - * Read the xattr. - */ - { - librados::bufferlist lang_res; - ret = io_ctx.getxattr("hw", "lang", lang_res); - if (ret < 0) { - std::cerr << "failed to get xattr version entry! error " - << ret << std::endl; - exit(EXIT_FAILURE); - } else { - std::cout << "Got the xattr 'lang' from object hw!" - << lang_res.c_str() << std::endl; - } - } - - - /* - * Remove the xattr. - */ - { - ret = io_ctx.rmxattr("hw", "lang"); - if (ret < 0) { - std::cerr << "Failed to remove xattr! error " - << ret << std::endl; - exit(EXIT_FAILURE); - } else { - std::cout << "Removed the xattr 'lang' from our object!" << std::endl; - } - } - - /* - * Remove the object. - */ - { - ret = io_ctx.remove("hw"); - if (ret < 0) { - std::cerr << "Couldn't remove object! error " << ret << std::endl; - exit(EXIT_FAILURE); - } else { - std::cout << "Removed object 'hw'." << std::endl; - } - } - } - - - -Python Example --------------- - -.. code-block:: python - - print "\n\nI/O Context and Object Operations" - print "=================================" - - print "\nCreating a context for the 'data' pool" - if not cluster.pool_exists('data'): - raise RuntimeError('No data pool exists') - ioctx = cluster.open_ioctx('data') - - print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." - ioctx.write("hw", "Hello World!") - print "Writing XATTR 'lang' with value 'en_US' to object 'hw'" - ioctx.set_xattr("hw", "lang", "en_US") - - - print "\nWriting object 'bm' with contents 'Bonjour tout le monde!' to pool 'data'." - ioctx.write("bm", "Bonjour tout le monde!") - print "Writing XATTR 'lang' with value 'fr_FR' to object 'bm'" - ioctx.set_xattr("bm", "lang", "fr_FR") - - print "\nContents of object 'hw'\n------------------------" - print ioctx.read("hw") - - print "\n\nGetting XATTR 'lang' from object 'hw'" - print ioctx.get_xattr("hw", "lang") - - print "\nContents of object 'bm'\n------------------------" - print ioctx.read("bm") - - print "Getting XATTR 'lang' from object 'bm'" - print ioctx.get_xattr("bm", "lang") - - - print "\nRemoving object 'hw'" - ioctx.remove_object("hw") - - print "Removing object 'bm'" - ioctx.remove_object("bm") - - -Java-Example ------------- - -.. code-block:: java - - import com.ceph.rados.Rados; - import com.ceph.rados.RadosException; - - import java.io.File; - import com.ceph.rados.IoCTX; - - public class CephClient { - public static void main (String args[]){ - - try { - Rados cluster = new Rados("admin"); - System.out.println("Created cluster handle."); - - File f = new File("/etc/ceph/ceph.conf"); - cluster.confReadFile(f); - System.out.println("Read the configuration file."); - - cluster.connect(); - System.out.println("Connected to the cluster."); - - IoCTX io = cluster.ioCtxCreate("data"); - - String oidone = "hw"; - String contentone = "Hello World!"; - io.write(oidone, contentone); - - String oidtwo = "bm"; - String contenttwo = "Bonjour tout le monde!"; - io.write(oidtwo, contenttwo); - - String[] objects = io.listObjects(); - for (String object: objects) - System.out.println(object); - - io.remove(oidone); - io.remove(oidtwo); - - cluster.ioCtxDestroy(io); - - } catch (RadosException e) { - System.out.println(e.getMessage() + ": " + e.getReturnValue()); - } - } - } - - -PHP Example ------------ - -.. 
code-block:: php - - <?php - - $io = rados_ioctx_create($r, "mypool"); - rados_write_full($io, "oidOne", "mycontents"); - rados_remove("oidOne"); - rados_ioctx_destroy($io); - - -Step 4: Closing Sessions -======================== - -Once your app finishes with the I/O Context and cluster handle, the app should -close the connection and shutdown the handle. For asynchronous I/O, the app -should also ensure that pending asynchronous operations have completed. - - -C Example ---------- - -.. code-block:: c - - rados_ioctx_destroy(io); - rados_shutdown(cluster); - - -C++ Example ------------ - -.. code-block:: c++ - - io_ctx.close(); - cluster.shutdown(); - - -Java Example --------------- - -.. code-block:: java - - cluster.ioCtxDestroy(io); - cluster.shutDown(); - - -Python Example --------------- - -.. code-block:: python - - print "\nClosing the connection." - ioctx.close() - - print "Shutting down the handle." - cluster.shutdown() - -PHP Example ------------ - -.. code-block:: php - - rados_shutdown($r); - - - -.. _user ID: ../../operations/user-management#command-line-usage -.. _CAPS: ../../operations/user-management#authorization-capabilities -.. _Installation (Quick): ../../../start -.. _Smart Daemons Enable Hyperscale: ../../../architecture#smart-daemons-enable-hyperscale -.. _Calculating PG IDs: ../../../architecture#calculating-pg-ids -.. _computes: ../../../architecture#calculating-pg-ids -.. _OSD: ../../../architecture#mapping-pgs-to-osds diff --git a/src/ceph/doc/rados/api/librados.rst b/src/ceph/doc/rados/api/librados.rst deleted file mode 100644 index 73d0e42..0000000 --- a/src/ceph/doc/rados/api/librados.rst +++ /dev/null @@ -1,187 +0,0 @@ -============== - Librados (C) -============== - -.. highlight:: c - -`librados` provides low-level access to the RADOS service. For an -overview of RADOS, see :doc:`../../architecture`. - - -Example: connecting and writing an object -========================================= - -To use `Librados`, you instantiate a :c:type:`rados_t` variable (a cluster handle) and -call :c:func:`rados_create()` with a pointer to it:: - - int err; - rados_t cluster; - - err = rados_create(&cluster, NULL); - if (err < 0) { - fprintf(stderr, "%s: cannot create a cluster handle: %s\n", argv[0], strerror(-err)); - exit(1); - } - -Then you configure your :c:type:`rados_t` to connect to your cluster, -either by setting individual values (:c:func:`rados_conf_set()`), -using a configuration file (:c:func:`rados_conf_read_file()`), using -command line options (:c:func:`rados_conf_parse_argv`), or an -environment variable (:c:func:`rados_conf_parse_env()`):: - - err = rados_conf_read_file(cluster, "/path/to/myceph.conf"); - if (err < 0) { - fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); - exit(1); - } - -Once the cluster handle is configured, you can connect to the cluster with :c:func:`rados_connect()`:: - - err = rados_connect(cluster); - if (err < 0) { - fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); - exit(1); - } - -Then you open an "IO context", a :c:type:`rados_ioctx_t`, with :c:func:`rados_ioctx_create()`:: - - rados_ioctx_t io; - char *poolname = "mypool"; - - err = rados_ioctx_create(cluster, poolname, &io); - if (err < 0) { - fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); - rados_shutdown(cluster); - exit(1); - } - -Note that the pool you try to access must exist. 
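Since ``rados_ioctx_create()`` fails if the pool does not exist, a client can optionally look the pool up first and create it on demand. The following is a minimal sketch (not part of the original example) using ``rados_pool_lookup()`` and ``rados_pool_create()``; it reuses the ``cluster``, ``poolname`` and ``err`` variables from the snippet above::

    /* Optional pre-check: look the pool up and create it if it is missing. */
    int64_t pool_id = rados_pool_lookup(cluster, poolname);
    if (pool_id < 0) {
        err = rados_pool_create(cluster, poolname);
        if (err < 0) {
            fprintf(stderr, "%s: cannot create pool %s: %s\n", argv[0], poolname, strerror(-err));
            rados_shutdown(cluster);
            exit(1);
        }
    }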
- -Then you can use the RADOS data manipulation functions, for example -write into an object called ``greeting`` with -:c:func:`rados_write_full()`:: - - err = rados_write_full(io, "greeting", "hello", 5); - if (err < 0) { - fprintf(stderr, "%s: cannot write pool %s: %s\n", argv[0], poolname, strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } - -In the end, you will want to close your IO context and connection to RADOS with :c:func:`rados_ioctx_destroy()` and :c:func:`rados_shutdown()`:: - - rados_ioctx_destroy(io); - rados_shutdown(cluster); - - -Asychronous IO -============== - -When doing lots of IO, you often don't need to wait for one operation -to complete before starting the next one. `Librados` provides -asynchronous versions of several operations: - -* :c:func:`rados_aio_write` -* :c:func:`rados_aio_append` -* :c:func:`rados_aio_write_full` -* :c:func:`rados_aio_read` - -For each operation, you must first create a -:c:type:`rados_completion_t` that represents what to do when the -operation is safe or complete by calling -:c:func:`rados_aio_create_completion`. If you don't need anything -special to happen, you can pass NULL:: - - rados_completion_t comp; - err = rados_aio_create_completion(NULL, NULL, NULL, &comp); - if (err < 0) { - fprintf(stderr, "%s: could not create aio completion: %s\n", argv[0], strerror(-err)); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } - -Now you can call any of the aio operations, and wait for it to -be in memory or on disk on all replicas:: - - err = rados_aio_write(io, "foo", comp, "bar", 3, 0); - if (err < 0) { - fprintf(stderr, "%s: could not schedule aio write: %s\n", argv[0], strerror(-err)); - rados_aio_release(comp); - rados_ioctx_destroy(io); - rados_shutdown(cluster); - exit(1); - } - rados_aio_wait_for_complete(comp); // in memory - rados_aio_wait_for_safe(comp); // on disk - -Finally, we need to free the memory used by the completion with :c:func:`rados_aio_release`:: - - rados_aio_release(comp); - -You can use the callbacks to tell your application when writes are -durable, or when read buffers are full. 
For example, if you wanted to -measure the latency of each operation when appending to several -objects, you could schedule several writes and store the ack and -commit time in the corresponding callback, then wait for all of them -to complete using :c:func:`rados_aio_flush` before analyzing the -latencies:: - - typedef struct { - struct timeval start; - struct timeval ack_end; - struct timeval commit_end; - } req_duration; - - void ack_callback(rados_completion_t comp, void *arg) { - req_duration *dur = (req_duration *) arg; - gettimeofday(&dur->ack_end, NULL); - } - - void commit_callback(rados_completion_t comp, void *arg) { - req_duration *dur = (req_duration *) arg; - gettimeofday(&dur->commit_end, NULL); - } - - int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) { - req_duration times[num_writes]; - rados_completion_t comps[num_writes]; - for (size_t i = 0; i < num_writes; ++i) { - gettimeofday(×[i].start, NULL); - int err = rados_aio_create_completion((void*) ×[i], ack_callback, commit_callback, &comps[i]); - if (err < 0) { - fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err)); - return err; - } - char obj_name[100]; - snprintf(obj_name, sizeof(obj_name), "foo%ld", (unsigned long)i); - err = rados_aio_append(io, obj_name, comps[i], data, len); - if (err < 0) { - fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err)); - return err; - } - } - // wait until all requests finish *and* the callbacks complete - rados_aio_flush(io); - // the latencies can now be analyzed - printf("Request # | Ack latency (s) | Commit latency (s)\n"); - for (size_t i = 0; i < num_writes; ++i) { - // don't forget to free the completions - rados_aio_release(comps[i]); - struct timeval ack_lat, commit_lat; - timersub(×[i].ack_end, ×[i].start, &ack_lat); - timersub(×[i].commit_end, ×[i].start, &commit_lat); - printf("%9ld | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, commit_lat.tv_sec, commit_lat.tv_usec); - } - return 0; - } - -Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory. - - -API calls -========= - - .. autodoxygenfile:: rados_types.h - .. autodoxygenfile:: librados.h diff --git a/src/ceph/doc/rados/api/libradospp.rst b/src/ceph/doc/rados/api/libradospp.rst deleted file mode 100644 index 27d3fa7..0000000 --- a/src/ceph/doc/rados/api/libradospp.rst +++ /dev/null @@ -1,5 +0,0 @@ -================== - LibradosPP (C++) -================== - -.. todo:: write me! diff --git a/src/ceph/doc/rados/api/objclass-sdk.rst b/src/ceph/doc/rados/api/objclass-sdk.rst deleted file mode 100644 index 6b1162f..0000000 --- a/src/ceph/doc/rados/api/objclass-sdk.rst +++ /dev/null @@ -1,37 +0,0 @@ -=========================== -SDK for Ceph Object Classes -=========================== - -`Ceph` can be extended by creating shared object classes called `Ceph Object -Classes`. The existing framework to build these object classes has dependencies -on the internal functionality of `Ceph`, which restricts users to build object -classes within the tree. The aim of this project is to create an independent -object class interface, which can be used to build object classes outside the -`Ceph` tree. 
This allows us to have two types of object classes, 1) those that -have in-tree dependencies and reside in the tree and 2) those that can make use -of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph` -tree because they do not depend on any internal implementation of `Ceph`. This -project decouples object class development from Ceph and encourages creation -and distribution of object classes as packages. - -In order to demonstrate the use of this framework, we have provided an example -called ``cls_sdk``, which is a very simple object class that makes use of the -SDK framework. This object class resides in the ``src/cls`` directory. - -Installing objclass.h ---------------------- - -The object class interface that enables out-of-tree development of object -classes resides in ``src/include/rados/`` and gets installed with `Ceph` -installation. After running ``make install``, you should be able to see it -in ``<prefix>/include/rados``. :: - - ls /usr/local/include/rados - -Using the SDK example ---------------------- - -The ``cls_sdk`` object class resides in ``src/cls/sdk/``. This gets built and -loaded into Ceph, with the Ceph build process. You can run the -``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``, -to test this class. diff --git a/src/ceph/doc/rados/api/python.rst b/src/ceph/doc/rados/api/python.rst deleted file mode 100644 index b4fd7e0..0000000 --- a/src/ceph/doc/rados/api/python.rst +++ /dev/null @@ -1,397 +0,0 @@ -=================== - Librados (Python) -=================== - -The ``rados`` module is a thin Python wrapper for ``librados``. - -Installation -============ - -To install Python libraries for Ceph, see `Getting librados for Python`_. - - -Getting Started -=============== - -You can create your own Ceph client using Python. The following tutorial will -show you how to import the Ceph Python module, connect to a Ceph cluster, and -perform object operations as a ``client.admin`` user. - -.. note:: To use the Ceph Python bindings, you must have access to a - running Ceph cluster. To set one up quickly, see `Getting Started`_. - -First, create a Python source file for your Ceph client. :: - :linenos: - - sudo vim client.py - - -Import the Module ------------------ - -To use the ``rados`` module, import it into your source file. - -.. code-block:: python - :linenos: - - import rados - - -Configure a Cluster Handle --------------------------- - -Before connecting to the Ceph Storage Cluster, create a cluster handle. By -default, the cluster handle assumes a cluster named ``ceph`` (i.e., the default -for deployment tools, and our Getting Started guides too), and a -``client.admin`` user name. You may change these defaults to suit your needs. - -To connect to the Ceph Storage Cluster, your application needs to know where to -find the Ceph Monitor. Provide this information to your application by -specifying the path to your Ceph configuration file, which contains the location -of the initial Ceph monitors. - -.. code-block:: python - :linenos: - - import rados, sys - - #Create Handle Examples. - cluster = rados.Rados(conffile='ceph.conf') - cluster = rados.Rados(conffile=sys.argv[1]) - cluster = rados.Rados(conffile = 'ceph.conf', conf = dict (keyring = '/path/to/keyring')) - -Ensure that the ``conffile`` argument provides the path and file name of your -Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the -Ceph configuration path and file name. - -Your Python client also requires a client keyring. 
For this example, we use the -``client.admin`` key by default. If you would like to specify the keyring when -creating the cluster handle, you may use the ``conf`` argument. Alternatively, -you may specify the keyring path in your Ceph configuration file. For example, -you may add something like the following line to you Ceph configuration file:: - - keyring = /path/to/ceph.client.admin.keyring - -For additional details on modifying your configuration via Python, see `Configuration`_. - - -Connect to the Cluster ----------------------- - -Once you have a cluster handle configured, you may connect to the cluster. -With a connection to the cluster, you may execute methods that return -information about the cluster. - -.. code-block:: python - :linenos: - :emphasize-lines: 7 - - import rados, sys - - cluster = rados.Rados(conffile='ceph.conf') - print "\nlibrados version: " + str(cluster.version()) - print "Will attempt to connect to: " + str(cluster.conf_get('mon initial members')) - - cluster.connect() - print "\nCluster ID: " + cluster.get_fsid() - - print "\n\nCluster Statistics" - print "==================" - cluster_stats = cluster.get_cluster_stats() - - for key, value in cluster_stats.iteritems(): - print key, value - - -By default, Ceph authentication is ``on``. Your application will need to know -the location of the keyring. The ``python-ceph`` module doesn't have the default -location, so you need to specify the keyring path. The easiest way to specify -the keyring is to add it to the Ceph configuration file. The following Ceph -configuration file example uses the ``client.admin`` keyring you generated with -``ceph-deploy``. - -.. code-block:: ini - :linenos: - - [global] - ... - keyring=/path/to/keyring/ceph.client.admin.keyring - - -Manage Pools ------------- - -When connected to the cluster, the ``Rados`` API allows you to manage pools. You -can list pools, check for the existence of a pool, create a pool and delete a -pool. - -.. code-block:: python - :linenos: - :emphasize-lines: 6, 13, 18, 25 - - print "\n\nPool Operations" - print "===============" - - print "\nAvailable Pools" - print "----------------" - pools = cluster.list_pools() - - for pool in pools: - print pool - - print "\nCreate 'test' Pool" - print "------------------" - cluster.create_pool('test') - - print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) - print "\nVerify 'test' Pool Exists" - print "-------------------------" - pools = cluster.list_pools() - - for pool in pools: - print pool - - print "\nDelete 'test' Pool" - print "------------------" - cluster.delete_pool('test') - print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) - - - -Input/Output Context --------------------- - -Reading from and writing to the Ceph Storage Cluster requires an input/output -context (ioctx). You can create an ioctx with the ``open_ioctx()`` method of the -``Rados`` class. The ``ioctx_name`` parameter is the name of the pool you wish -to use. - -.. code-block:: python - :linenos: - - ioctx = cluster.open_ioctx('data') - - -Once you have an I/O context, you can read/write objects, extended attributes, -and perform a number of other operations. After you complete operations, ensure -that you close the connection. For example: - -.. code-block:: python - :linenos: - - print "\nClosing the connection." - ioctx.close() - - -Writing, Reading and Removing Objects -------------------------------------- - -Once you create an I/O context, you can write objects to the cluster. 
If you -write to an object that doesn't exist, Ceph creates it. If you write to an -object that exists, Ceph overwrites it (except when you specify a range, and -then it only overwrites the range). You may read objects (and object ranges) -from the cluster. You may also remove objects from the cluster. For example: - -.. code-block:: python - :linenos: - :emphasize-lines: 2, 5, 8 - - print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." - ioctx.write_full("hw", "Hello World!") - - print "\n\nContents of object 'hw'\n------------------------\n" - print ioctx.read("hw") - - print "\nRemoving object 'hw'" - ioctx.remove_object("hw") - - -Writing and Reading XATTRS --------------------------- - -Once you create an object, you can write extended attributes (XATTRs) to -the object and read XATTRs from the object. For example: - -.. code-block:: python - :linenos: - :emphasize-lines: 2, 5 - - print "\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'" - ioctx.set_xattr("hw", "lang", "en_US") - - print "\n\nGetting XATTR 'lang' from object 'hw'\n" - print ioctx.get_xattr("hw", "lang") - - -Listing Objects ---------------- - -If you want to examine the list of objects in a pool, you may -retrieve the list of objects and iterate over them with the object iterator. -For example: - -.. code-block:: python - :linenos: - :emphasize-lines: 1, 6, 7 - - object_iterator = ioctx.list_objects() - - while True : - - try : - rados_object = object_iterator.next() - print "Object contents = " + rados_object.read() - - except StopIteration : - break - -The ``Object`` class provides a file-like interface to an object, allowing -you to read and write content and extended attributes. Object operations using -the I/O context provide additional functionality and asynchronous capabilities. - - -Cluster Handle API -================== - -The ``Rados`` class provides an interface into the Ceph Storage Daemon. - - -Configuration -------------- - -The ``Rados`` class provides methods for getting and setting configuration -values, reading the Ceph configuration file, and parsing arguments. You -do not need to be connected to the Ceph Storage Cluster to invoke the following -methods. See `Storage Cluster Configuration`_ for details on settings. - -.. currentmodule:: rados -.. automethod:: Rados.conf_get(option) -.. automethod:: Rados.conf_set(option, val) -.. automethod:: Rados.conf_read_file(path=None) -.. automethod:: Rados.conf_parse_argv(args) -.. automethod:: Rados.version() - - -Connection Management ---------------------- - -Once you configure your cluster handle, you may connect to the cluster, check -the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown) -from the cluster. You may also assert that the cluster handle is in a particular -state (e.g., "configuring", "connecting", etc.). - - -.. automethod:: Rados.connect(timeout=0) -.. automethod:: Rados.shutdown() -.. automethod:: Rados.get_fsid() -.. automethod:: Rados.get_cluster_stats() -.. automethod:: Rados.require_state(*args) - - -Pool Operations ---------------- - -To use pool operation methods, you must connect to the Ceph Storage Cluster -first. You may list the available pools, create a pool, check to see if a pool -exists, and delete a pool. - -.. automethod:: Rados.list_pools() -.. automethod:: Rados.create_pool(pool_name, auid=None, crush_rule=None) -.. automethod:: Rados.pool_exists() -.. 
automethod:: Rados.delete_pool(pool_name) - - - -Input/Output Context API -======================== - -To write data to and read data from the Ceph Object Store, you must create -an Input/Output context (ioctx). The `Rados` class provides a `open_ioctx()` -method. The remaining ``ioctx`` operations involve invoking methods of the -`Ioctx` and other classes. - -.. automethod:: Rados.open_ioctx(ioctx_name) -.. automethod:: Ioctx.require_ioctx_open() -.. automethod:: Ioctx.get_stats() -.. automethod:: Ioctx.change_auid(auid) -.. automethod:: Ioctx.get_last_version() -.. automethod:: Ioctx.close() - - -.. Pool Snapshots -.. -------------- - -.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state. -.. Whereas, basic pool operations only require a connection to the cluster, -.. snapshots require an I/O context. - -.. Ioctx.create_snap(self, snap_name) -.. Ioctx.list_snaps(self) -.. SnapIterator.next(self) -.. Snap.get_timestamp(self) -.. Ioctx.lookup_snap(self, snap_name) -.. Ioctx.remove_snap(self, snap_name) - -.. not published. This doesn't seem ready yet. - -Object Operations ------------------ - -The Ceph Storage Cluster stores data as objects. You can read and write objects -synchronously or asynchronously. You can read and write from offsets. An object -has a name (or key) and data. - - -.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None) -.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None) -.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None) -.. automethod:: Ioctx.write(key, data, offset=0) -.. automethod:: Ioctx.write_full(key, data) -.. automethod:: Ioctx.aio_flush() -.. automethod:: Ioctx.set_locator_key(loc_key) -.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete) -.. automethod:: Ioctx.read(key, length=8192, offset=0) -.. automethod:: Ioctx.stat(key) -.. automethod:: Ioctx.trunc(key, size) -.. automethod:: Ioctx.remove_object(key) - - -Object Extended Attributes --------------------------- - -You may set extended attributes (XATTRs) on an object. You can retrieve a list -of objects or XATTRs and iterate over them. - -.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value) -.. automethod:: Ioctx.get_xattrs(oid) -.. automethod:: XattrIterator.next() -.. automethod:: Ioctx.get_xattr(key, xattr_name) -.. automethod:: Ioctx.rm_xattr(key, xattr_name) - - - -Object Interface -================ - -From an I/O context, you can retrieve a list of objects from a pool and iterate -over them. The object interface provide makes each object look like a file, and -you may perform synchronous operations on the objects. For asynchronous -operations, you should use the I/O context methods. - -.. automethod:: Ioctx.list_objects() -.. automethod:: ObjectIterator.next() -.. automethod:: Object.read(length = 1024*1024) -.. automethod:: Object.write(string_to_write) -.. automethod:: Object.get_xattrs() -.. automethod:: Object.get_xattr(xattr_name) -.. automethod:: Object.set_xattr(xattr_name, xattr_value) -.. automethod:: Object.rm_xattr(xattr_name) -.. automethod:: Object.stat() -.. automethod:: Object.remove() - - - - -.. _Getting Started: ../../../start -.. _Storage Cluster Configuration: ../../configuration -.. 
_Getting librados for Python: ../librados-intro#getting-librados-for-python diff --git a/src/ceph/doc/rados/command/list-inconsistent-obj.json b/src/ceph/doc/rados/command/list-inconsistent-obj.json deleted file mode 100644 index 76ca43e..0000000 --- a/src/ceph/doc/rados/command/list-inconsistent-obj.json +++ /dev/null @@ -1,195 +0,0 @@ -{ - "$schema": "http://json-schema.org/draft-04/schema#", - "type": "object", - "properties": { - "epoch": { - "description": "Scrub epoch", - "type": "integer" - }, - "inconsistents": { - "type": "array", - "items": { - "type": "object", - "properties": { - "object": { - "description": "Identify a Ceph object", - "type": "object", - "properties": { - "name": { - "type": "string" - }, - "nspace": { - "type": "string" - }, - "locator": { - "type": "string" - }, - "version": { - "type": "integer", - "minimum": 0 - }, - "snap": { - "oneOf": [ - { - "type": "string", - "enum": [ "head", "snapdir" ] - }, - { - "type": "integer", - "minimum": 0 - } - ] - } - }, - "required": [ - "name", - "nspace", - "locator", - "version", - "snap" - ] - }, - "selected_object_info": { - "type": "string" - }, - "union_shard_errors": { - "description": "Union of all shard errors", - "type": "array", - "items": { - "enum": [ - "missing", - "stat_error", - "read_error", - "data_digest_mismatch_oi", - "omap_digest_mismatch_oi", - "size_mismatch_oi", - "ec_hash_error", - "ec_size_error", - "oi_attr_missing", - "oi_attr_corrupted", - "obj_size_oi_mismatch", - "ss_attr_missing", - "ss_attr_corrupted" - ] - }, - "minItems": 0, - "uniqueItems": true - }, - "errors": { - "description": "Errors related to the analysis of this object", - "type": "array", - "items": { - "enum": [ - "object_info_inconsistency", - "data_digest_mismatch", - "omap_digest_mismatch", - "size_mismatch", - "attr_value_mismatch", - "attr_name_mismatch" - ] - }, - "minItems": 0, - "uniqueItems": true - }, - "shards": { - "description": "All found or expected shards", - "type": "array", - "items": { - "description": "Information about a particular shard of object", - "type": "object", - "properties": { - "object_info": { - "type": "string" - }, - "shard": { - "type": "integer" - }, - "osd": { - "type": "integer" - }, - "primary": { - "type": "boolean" - }, - "size": { - "type": "integer" - }, - "omap_digest": { - "description": "Hex representation (e.g. 0x1abd1234)", - "type": "string" - }, - "data_digest": { - "description": "Hex representation (e.g. 
0x1abd1234)", - "type": "string" - }, - "errors": { - "description": "Errors with this shard", - "type": "array", - "items": { - "enum": [ - "missing", - "stat_error", - "read_error", - "data_digest_mismatch_oi", - "omap_digest_mismatch_oi", - "size_mismatch_oi", - "ec_hash_error", - "ec_size_error", - "oi_attr_missing", - "oi_attr_corrupted", - "obj_size_oi_mismatch", - "ss_attr_missing", - "ss_attr_corrupted" - ] - }, - "minItems": 0, - "uniqueItems": true - }, - "attrs": { - "description": "If any shard's attr error is set then all attrs are here", - "type": "array", - "items": { - "description": "Information about a particular shard of object", - "type": "object", - "properties": { - "name": { - "type": "string" - }, - "value": { - "type": "string" - }, - "Base64": { - "type": "boolean" - } - }, - "required": [ - "name", - "value", - "Base64" - ], - "additionalProperties": false, - "minItems": 1 - } - } - }, - "required": [ - "osd", - "primary", - "errors" - ] - } - } - }, - "required": [ - "object", - "union_shard_errors", - "errors", - "shards" - ] - } - } - }, - "required": [ - "epoch", - "inconsistents" - ] -} diff --git a/src/ceph/doc/rados/command/list-inconsistent-snap.json b/src/ceph/doc/rados/command/list-inconsistent-snap.json deleted file mode 100644 index 0da6b0f..0000000 --- a/src/ceph/doc/rados/command/list-inconsistent-snap.json +++ /dev/null @@ -1,87 +0,0 @@ -{ - "$schema": "http://json-schema.org/draft-04/schema#", - "type": "object", - "properties": { - "epoch": { - "description": "Scrub epoch", - "type": "integer" - }, - "inconsistents": { - "type": "array", - "items": { - "type": "object", - "properties": { - "name": { - "type": "string" - }, - "nspace": { - "type": "string" - }, - "locator": { - "type": "string" - }, - "snap": { - "oneOf": [ - { - "type": "string", - "enum": [ - "head", - "snapdir" - ] - }, - { - "type": "integer", - "minimum": 0 - } - ] - }, - "errors": { - "description": "Errors for this object's snap", - "type": "array", - "items": { - "enum": [ - "ss_attr_missing", - "ss_attr_corrupted", - "oi_attr_missing", - "oi_attr_corrupted", - "snapset_mismatch", - "head_mismatch", - "headless", - "size_mismatch", - "extra_clones", - "clone_missing" - ] - }, - "minItems": 1, - "uniqueItems": true - }, - "missing": { - "description": "List of missing clones if clone_missing error set", - "type": "array", - "items": { - "type": "integer" - } - }, - "extra_clones": { - "description": "List of extra clones if extra_clones error set", - "type": "array", - "items": { - "type": "integer" - } - } - }, - "required": [ - "name", - "nspace", - "locator", - "snap", - "errors" - ] - } - } - }, - "required": [ - "epoch", - "inconsistents" - ] -} diff --git a/src/ceph/doc/rados/configuration/auth-config-ref.rst b/src/ceph/doc/rados/configuration/auth-config-ref.rst deleted file mode 100644 index eb14fa4..0000000 --- a/src/ceph/doc/rados/configuration/auth-config-ref.rst +++ /dev/null @@ -1,432 +0,0 @@ -======================== - Cephx Config Reference -======================== - -The ``cephx`` protocol is enabled by default. Cryptographic authentication has -some computational costs, though they should generally be quite low. If the -network environment connecting your client and server hosts is very safe and -you cannot afford authentication, you can turn it off. **This is not generally -recommended**. - -.. 
note:: If you disable authentication, you are at risk of a man-in-the-middle - attack altering your client/server messages, which could lead to disastrous - security effects. - -For creating users, see `User Management`_. For details on the architecture -of Cephx, see `Architecture - High Availability Authentication`_. - - -Deployment Scenarios -==================== - -There are two main scenarios for deploying a Ceph cluster, which impact -how you initially configure Cephx. Most first time Ceph users use -``ceph-deploy`` to create a cluster (easiest). For clusters using -other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need -to use the manual procedures or configure your deployment tool to -bootstrap your monitor(s). - -ceph-deploy ------------ - -When you deploy a cluster with ``ceph-deploy``, you do not have to bootstrap the -monitor manually or create the ``client.admin`` user or keyring. The steps you -execute in the `Storage Cluster Quick Start`_ will invoke ``ceph-deploy`` to do -that for you. - -When you execute ``ceph-deploy new {initial-monitor(s)}``, Ceph will create a -monitor keyring for you (only used to bootstrap monitors), and it will generate -an initial Ceph configuration file for you, which contains the following -authentication settings, indicating that Ceph enables authentication by -default:: - - auth_cluster_required = cephx - auth_service_required = cephx - auth_client_required = cephx - -When you execute ``ceph-deploy mon create-initial``, Ceph will bootstrap the -initial monitor(s), retrieve a ``ceph.client.admin.keyring`` file containing the -key for the ``client.admin`` user. Additionally, it will also retrieve keyrings -that give ``ceph-deploy`` and ``ceph-disk`` utilities the ability to prepare and -activate OSDs and metadata servers. - -When you execute ``ceph-deploy admin {node-name}`` (**note:** Ceph must be -installed first), you are pushing a Ceph configuration file and the -``ceph.client.admin.keyring`` to the ``/etc/ceph`` directory of the node. You -will be able to execute Ceph administrative functions as ``root`` on the command -line of that node. - - -Manual Deployment ------------------ - -When you deploy a cluster manually, you have to bootstrap the monitor manually -and create the ``client.admin`` user and keyring. To bootstrap monitors, follow -the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are -the logical steps you must perform when using third party deployment tools like -Chef, Puppet, Juju, etc. - - -Enabling/Disabling Cephx -======================== - -Enabling Cephx requires that you have deployed keys for your monitors, -OSDs and metadata servers. If you are simply toggling Cephx on / off, -you do not have to repeat the bootstrapping procedures. - - -Enabling Cephx --------------- - -When ``cephx`` is enabled, Ceph will look for the keyring in the default search -path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override -this location by adding a ``keyring`` option in the ``[global]`` section of -your `Ceph configuration`_ file, but this is not recommended. - -Execute the following procedures to enable ``cephx`` on a cluster with -authentication disabled. If you (or your deployment utility) have already -generated the keys, you may skip the steps related to generating keys. - -#. 
Create a ``client.admin`` key, and save a copy of the key for your client - host:: - - ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring - - **Warning:** This will clobber any existing - ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a - deployment tool has already done it for you. Be careful! - -#. Create a keyring for your monitor cluster and generate a monitor - secret key. :: - - ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' - -#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's - ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, - use the following:: - - cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring - -#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:: - - ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring - -#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:: - - ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mds/ceph-{$id}/keyring - -#. Enable ``cephx`` authentication by setting the following options in the - ``[global]`` section of your `Ceph configuration`_ file:: - - auth cluster required = cephx - auth service required = cephx - auth client required = cephx - - -#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. - -For details on bootstrapping a monitor manually, see `Manual Deployment`_. - - - -Disabling Cephx ---------------- - -The following procedure describes how to disable Cephx. If your cluster -environment is relatively safe, you can offset the computation expense of -running authentication. **We do not recommend it.** However, it may be easier -during setup and/or troubleshooting to temporarily disable authentication. - -#. Disable ``cephx`` authentication by setting the following options in the - ``[global]`` section of your `Ceph configuration`_ file:: - - auth cluster required = none - auth service required = none - auth client required = none - - -#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. - - -Configuration Settings -====================== - -Enablement ----------- - - -``auth cluster required`` - -:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, - ``ceph-osd``, and ``ceph-mds``) must authenticate with - each other. Valid settings are ``cephx`` or ``none``. - -:Type: String -:Required: No -:Default: ``cephx``. - - -``auth service required`` - -:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients - to authenticate with the Ceph Storage Cluster in order to access - Ceph services. Valid settings are ``cephx`` or ``none``. - -:Type: String -:Required: No -:Default: ``cephx``. - - -``auth client required`` - -:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to - authenticate with the Ceph Client. Valid settings are ``cephx`` - or ``none``. - -:Type: String -:Required: No -:Default: ``cephx``. - - -.. index:: keys; keyring - -Keys ----- - -When you run Ceph with authentication enabled, ``ceph`` administrative commands -and Ceph Clients require authentication keys to access the Ceph Storage Cluster. - -The most common way to provide these keys to the ``ceph`` administrative -commands and clients is to include a Ceph keyring under the ``/etc/ceph`` -directory. 
For Cuttlefish and later releases using ``ceph-deploy``, the filename -is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). -If you include the keyring under the ``/etc/ceph`` directory, you don't need to -specify a ``keyring`` entry in your Ceph configuration file. - -We recommend copying the Ceph Storage Cluster's keyring file to nodes where you -will run administrative commands, because it contains the ``client.admin`` key. - -You may use ``ceph-deploy admin`` to perform this task. See `Create an Admin -Host`_ for details. To perform this step manually, execute the following:: - - sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring - -.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set - (e.g., ``chmod 644``) on your client machine. - -You may specify the key itself in the Ceph configuration file using the ``key`` -setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. - - -``keyring`` - -:Description: The path to the keyring file. -:Type: String -:Required: No -:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` - - -``keyfile`` - -:Description: The path to a key file (i.e,. a file containing only the key). -:Type: String -:Required: No -:Default: None - - -``key`` - -:Description: The key (i.e., the text string of the key itself). Not recommended. -:Type: String -:Required: No -:Default: None - - -Daemon Keyrings ---------------- - -Administrative users or deployment tools (e.g., ``ceph-deploy``) may generate -daemon keyrings in the same way as generating user keyrings. By default, Ceph -stores daemons keyrings inside their data directory. The default keyring -locations, and the capabilities necessary for the daemon to function, are shown -below. - -``ceph-mon`` - -:Location: ``$mon_data/keyring`` -:Capabilities: ``mon 'allow *'`` - -``ceph-osd`` - -:Location: ``$osd_data/keyring`` -:Capabilities: ``mon 'allow profile osd' osd 'allow *'`` - -``ceph-mds`` - -:Location: ``$mds_data/keyring`` -:Capabilities: ``mds 'allow' mon 'allow profile mds' osd 'allow rwx'`` - -``radosgw`` - -:Location: ``$rgw_data/keyring`` -:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` - - -.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no - capabilities, and is not part of the cluster ``auth`` database. - -The daemon data directory locations default to directories of the form:: - - /var/lib/ceph/$type/$cluster-$id - -For example, ``osd.12`` would be:: - - /var/lib/ceph/osd/ceph-12 - -You can override these locations, but it is not recommended. - - -.. index:: signatures - -Signatures ----------- - -In Ceph Bobtail and subsequent versions, we prefer that Ceph authenticate all -ongoing messages between the entities using the session key set up for that -initial authentication. However, Argonaut and earlier Ceph daemons do not know -how to perform ongoing message authentication. To maintain backward -compatibility (e.g., running both Botbail and Argonaut daemons in the same -cluster), message signing is **off** by default. If you are running Bobtail or -later daemons exclusively, configure Ceph to require signatures. - -Like other parts of Ceph authentication, Ceph provides fine-grained control so -you can enable/disable signatures for service messages between the client and -Ceph, and you can enable/disable signatures for messages between Ceph daemons. 
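-
-For example, to check what a running daemon is actually using, you can query
-its admin socket (a sketch only; ``mon.a`` is a placeholder for a daemon in
-your own cluster)::
-
-    ceph daemon mon.a config get cephx_require_signatures
-    ceph daemon mon.a config get cephx_sign_messages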
- - -``cephx require signatures`` - -:Description: If set to ``true``, Ceph requires signatures on all message - traffic between the Ceph Client and the Ceph Storage Cluster, and - between daemons comprising the Ceph Storage Cluster. - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``cephx cluster require signatures`` - -:Description: If set to ``true``, Ceph requires signatures on all message - traffic between Ceph daemons comprising the Ceph Storage Cluster. - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``cephx service require signatures`` - -:Description: If set to ``true``, Ceph requires signatures on all message - traffic between Ceph Clients and the Ceph Storage Cluster. - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``cephx sign messages`` - -:Description: If the Ceph version supports message signing, Ceph will sign - all messages so they cannot be spoofed. - -:Type: Boolean -:Default: ``true`` - - -Time to Live ------------- - -``auth service ticket ttl`` - -:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for - authentication, the Ceph Storage Cluster assigns the ticket a - time to live. - -:Type: Double -:Default: ``60*60`` - - -Backward Compatibility -====================== - -For Cuttlefish and earlier releases, see `Cephx`_. - -In Ceph Argonaut v0.48 and earlier versions, if you enable ``cephx`` -authentication, Ceph only authenticates the initial communication between the -client and daemon; Ceph does not authenticate the subsequent messages they send -to each other, which has security implications. In Ceph Bobtail and subsequent -versions, Ceph authenticates all ongoing messages between the entities using the -session key set up for that initial authentication. - -We identified a backward compatibility issue between Argonaut v0.48 (and prior -versions) and Bobtail (and subsequent versions). During testing, if you -attempted to use Argonaut (and earlier) daemons with Bobtail (and later) -daemons, the Argonaut daemons did not know how to perform ongoing message -authentication, while the Bobtail versions of the daemons insist on -authenticating message traffic subsequent to the initial -request/response--making it impossible for Argonaut (and prior) daemons to -interoperate with Bobtail (and subsequent) daemons. - -We have addressed this potential problem by providing a means for Argonaut (and -prior) systems to interact with Bobtail (and subsequent) systems. Here's how it -works: by default, the newer systems will not insist on seeing signatures from -older systems that do not know how to perform them, but will simply accept such -messages without authenticating them. This new default behavior provides the -advantage of allowing two different releases to interact. **We do not recommend -this as a long term solution**. Allowing newer daemons to forgo ongoing -authentication has the unfortunate security effect that an attacker with control -of some of your machines or some access to your network can disable session -security simply by claiming to be unable to sign messages. - -.. note:: Even if you don't actually run any old versions of Ceph, - the attacker may be able to force some messages to be accepted unsigned in the - default scenario. While running Cephx with the default scenario, Ceph still - authenticates the initial communication, but you lose desirable session security. 
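-
-One way to check whether any pre-Bobtail daemons remain is to ask the cluster
-for the versions its daemons report. The commands below are only a sketch and
-assume a reasonably recent release (``ceph versions`` appeared in Luminous)::
-
-    ceph versions
-    ceph tell osd.* version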
-
-If you know that you are not running older versions of Ceph, or you are willing
-to accept that old servers and new servers will not be able to interoperate, you
-can eliminate this security risk. If you do so, any Ceph system that is new
-enough to support session authentication and that has Cephx enabled will reject
-unsigned messages. To preclude new servers from interacting with old servers,
-include the following in the ``[global]`` section of your `Ceph
-configuration`_ file directly below the line that specifies the use of Cephx
-for authentication::
-
-    cephx require signatures = true ; everywhere possible
-
-You can also selectively require signatures for cluster internal
-communications only, separate from client-facing service::
-
-    cephx cluster require signatures = true ; for cluster-internal communication
-    cephx service require signatures = true ; for client-facing service
-
-An option to make a client require signatures from the cluster is not
-yet implemented.
-
-**We recommend migrating all daemons to the newer versions and enabling the
-foregoing flag** at the nearest practical time so that you may avail yourself
-of the enhanced authentication.
-
-.. note:: Ceph kernel modules do not support signatures yet.
-
-
-.. _Storage Cluster Quick Start: ../../../start/quick-ceph-deploy/
-.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping
-.. _Operating a Cluster: ../../operations/operating
-.. _Manual Deployment: ../../../install/manual-deployment
-.. _Cephx: http://docs.ceph.com/docs/cuttlefish/rados/configuration/auth-config-ref/
-.. _Ceph configuration: ../ceph-conf
-.. _Create an Admin Host: ../../deployment/ceph-deploy-admin
-.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication
-.. _User Management: ../../operations/user-management
diff --git a/src/ceph/doc/rados/configuration/bluestore-config-ref.rst b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst
deleted file mode 100644
index 8d8ace6..0000000
--- a/src/ceph/doc/rados/configuration/bluestore-config-ref.rst
+++ /dev/null
@@ -1,297 +0,0 @@
-==========================
-BlueStore Config Reference
-==========================
-
-Devices
-=======
-
-BlueStore manages either one, two, or (in certain cases) three storage
-devices.
-
-In the simplest case, BlueStore consumes a single (primary) storage
-device. The storage device is normally partitioned into two parts:
-
-#. A small partition is formatted with XFS and contains basic metadata
-   for the OSD. This *data directory* includes information about the
-   OSD (its identifier, which cluster it belongs to, and its private
-   keyring).
-
-#. The rest of the device is normally a large partition that is managed
-   directly by BlueStore and contains all of the actual data. This
-   *primary device* is normally identified by a ``block`` symlink in the
-   data directory.
-
-It is also possible to deploy BlueStore across two additional devices:
-
-* A *WAL device* can be used for BlueStore's internal journal or
-  write-ahead log. It is identified by the ``block.wal`` symlink in
-  the data directory. It is only useful to use a WAL device if the
-  device is faster than the primary device (e.g., when it is on an SSD
-  and the primary device is an HDD).
-* A *DB device* can be used for storing BlueStore's internal metadata.
-  BlueStore (or rather, the embedded RocksDB) will put as much
-  metadata as it can on the DB device to improve performance.
If the - DB device fills up, metadata will spill back onto the primary device - (where it would have been otherwise). Again, it is only helpful to - provision a DB device if it is faster than the primary device. - -If there is only a small amount of fast storage available (e.g., less -than a gigabyte), we recommend using it as a WAL device. If there is -more, provisioning a DB device makes more sense. The BlueStore -journal will always be placed on the fastest device available, so -using a DB device will provide the same benefit that the WAL device -would while *also* allowing additional metadata to be stored there (if -it will fix). - -A single-device BlueStore OSD can be provisioned with:: - - ceph-disk prepare --bluestore <device> - -To specify a WAL device and/or DB device, :: - - ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block-db <db-device> - -Cache size -========== - -The amount of memory consumed by each OSD for BlueStore's cache is -determined by the ``bluestore_cache_size`` configuration option. If -that config option is not set (i.e., remains at 0), there is a -different default value that is used depending on whether an HDD or -SSD is used for the primary device (set by the -``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config -options). - -BlueStore and the rest of the Ceph OSD does the best it can currently -to stick to the budgeted memory. Note that on top of the configured -cache size, there is also memory consumed by the OSD itself, and -generally some overhead due to memory fragmentation and other -allocator overhead. - -The configured cache memory budget can be used in a few different ways: - -* Key/Value metadata (i.e., RocksDB's internal cache) -* BlueStore metadata -* BlueStore data (i.e., recently read or written object data) - -Cache memory usage is governed by the following options: -``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and -``bluestore_cache_kv_max``. The fraction of the cache devoted to data -is 1.0 minus the meta and kv ratios. The memory devoted to kv -metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max`` -since our testing indicates there are diminishing returns beyond a -certain point. - -``bluestore_cache_size`` - -:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead. -:Type: Integer -:Required: Yes -:Default: ``0`` - -``bluestore_cache_size_hdd`` - -:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD. -:Type: Integer -:Required: Yes -:Default: ``1 * 1024 * 1024 * 1024`` (1 GB) - -``bluestore_cache_size_ssd`` - -:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD. -:Type: Integer -:Required: Yes -:Default: ``3 * 1024 * 1024 * 1024`` (3 GB) - -``bluestore_cache_meta_ratio`` - -:Description: The ratio of cache devoted to metadata. -:Type: Floating point -:Required: Yes -:Default: ``.01`` - -``bluestore_cache_kv_ratio`` - -:Description: The ratio of cache devoted to key/value data (rocksdb). -:Type: Floating point -:Required: Yes -:Default: ``.99`` - -``bluestore_cache_kv_max`` - -:Description: The maximum amount of cache devoted to key/value data (rocksdb). -:Type: Floating point -:Required: Yes -:Default: ``512 * 1024*1024`` (512 MB) - - -Checksums -========= - -BlueStore checksums all metadata and data written to disk. Metadata -checksumming is handled by RocksDB and uses `crc32c`. 
Data
-checksumming is done by BlueStore and can make use of `crc32c`,
-`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
-suitable for most purposes.
-
-Full data checksumming does increase the amount of metadata that
-BlueStore must store and manage. When possible, e.g., when clients
-hint that data is written and read sequentially, BlueStore will
-checksum larger blocks, but in many cases it must store a checksum
-value (usually 4 bytes) for every 4 kilobyte block of data.
-
-It is possible to use a smaller checksum value by truncating the
-checksum to two or one byte, reducing the metadata overhead. The
-trade-off is that the probability that a random error will not be
-detected is higher with a smaller checksum, going from about one in
-four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
-16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
-The smaller checksum values can be used by selecting `crc32c_16` or
-`crc32c_8` as the checksum algorithm.
-
-The *checksum algorithm* can be set either via a per-pool
-``csum_type`` property or the global config option. For example, ::
-
-    ceph osd pool set <pool-name> csum_type <algorithm>
-
-``bluestore_csum_type``
-
-:Description: The default checksum algorithm to use.
-:Type: String
-:Required: Yes
-:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
-:Default: ``crc32c``
-
-
-Inline Compression
-==================
-
-BlueStore supports inline compression using `snappy`, `zlib`, or
-`lz4`. Please note that the `lz4` compression plugin is not
-distributed in the official release.
-
-Whether data in BlueStore is compressed is determined by a combination
-of the *compression mode* and any hints associated with a write
-operation. The modes are:
-
-* **none**: Never compress data.
-* **passive**: Do not compress data unless the write operation has a
-  *compressible* hint set.
-* **aggressive**: Compress data unless the write operation has an
-  *incompressible* hint set.
-* **force**: Try to compress data no matter what.
-
-For more information about the *compressible* and *incompressible* IO
-hints, see :doc:`/api/librados/#rados_set_alloc_hint`.
-
-Note that regardless of the mode, if the size of the data chunk is not
-reduced sufficiently the compressed version will not be used and the
-original (uncompressed) data will be stored. For example, if the
-``bluestore compression required ratio`` is set to ``.7`` then the
-compressed data must be 70% of the size of the original (or smaller).
-
-The *compression mode*, *compression algorithm*, *compression required
-ratio*, *min blob size*, and *max blob size* can be set either via a
-per-pool property or a global config option. Pool properties can be
-set with::
-
-    ceph osd pool set <pool-name> compression_algorithm <algorithm>
-    ceph osd pool set <pool-name> compression_mode <mode>
-    ceph osd pool set <pool-name> compression_required_ratio <ratio>
-    ceph osd pool set <pool-name> compression_min_blob_size <size>
-    ceph osd pool set <pool-name> compression_max_blob_size <size>
-
-``bluestore compression algorithm``
-
-:Description: The default compressor to use (if any) if the per-pool property
-              ``compression_algorithm`` is not set. Note that zstd is *not*
-              recommended for bluestore due to high CPU overhead when
-              compressing small amounts of data.
-:Type: String -:Required: No -:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` -:Default: ``snappy`` - -``bluestore compression mode`` - -:Description: The default policy for using compression if the per-pool property - ``compression_mode`` is not set. ``none`` means never use - compression. ``passive`` means use compression when - `clients hint`_ that data is compressible. ``aggressive`` means - use compression unless clients hint that data is not compressible. - ``force`` means use compression under all circumstances even if - the clients hint that the data is not compressible. -:Type: String -:Required: No -:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` -:Default: ``none`` - -``bluestore compression required ratio`` - -:Description: The ratio of the size of the data chunk after - compression relative to the original size must be at - least this small in order to store the compressed - version. - -:Type: Floating point -:Required: No -:Default: .875 - -``bluestore compression min blob size`` - -:Description: Chunks smaller than this are never compressed. - The per-pool property ``compression_min_blob_size`` overrides - this setting. - -:Type: Unsigned Integer -:Required: No -:Default: 0 - -``bluestore compression min blob size hdd`` - -:Description: Default value of ``bluestore compression min blob size`` - for rotational media. - -:Type: Unsigned Integer -:Required: No -:Default: 128K - -``bluestore compression min blob size ssd`` - -:Description: Default value of ``bluestore compression min blob size`` - for non-rotational (solid state) media. - -:Type: Unsigned Integer -:Required: No -:Default: 8K - -``bluestore compression max blob size`` - -:Description: Chunks larger than this are broken into smaller blobs sizing - ``bluestore compression max blob size`` before being compressed. - The per-pool property ``compression_max_blob_size`` overrides - this setting. - -:Type: Unsigned Integer -:Required: No -:Default: 0 - -``bluestore compression max blob size hdd`` - -:Description: Default value of ``bluestore compression max blob size`` - for rotational media. - -:Type: Unsigned Integer -:Required: No -:Default: 512K - -``bluestore compression max blob size ssd`` - -:Description: Default value of ``bluestore compression max blob size`` - for non-rotational (solid state) media. - -:Type: Unsigned Integer -:Required: No -:Default: 64K - -.. _clients hint: ../../api/librados/#rados_set_alloc_hint diff --git a/src/ceph/doc/rados/configuration/ceph-conf.rst b/src/ceph/doc/rados/configuration/ceph-conf.rst deleted file mode 100644 index df88452..0000000 --- a/src/ceph/doc/rados/configuration/ceph-conf.rst +++ /dev/null @@ -1,629 +0,0 @@ -================== - Configuring Ceph -================== - -When you start the Ceph service, the initialization process activates a series -of daemons that run in the background. A :term:`Ceph Storage Cluster` runs -two types of daemons: - -- :term:`Ceph Monitor` (``ceph-mon``) -- :term:`Ceph OSD Daemon` (``ceph-osd``) - -Ceph Storage Clusters that support the :term:`Ceph Filesystem` run at least one -:term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support :term:`Ceph -Object Storage` run Ceph Gateway daemons (``radosgw``). For your convenience, -each daemon has a series of default values (*i.e.*, many are set by -``ceph/src/common/config_opts.h``). You may override these settings with a Ceph -configuration file. - - -.. 
_ceph-conf-file: - -The Configuration File -====================== - -When you start a Ceph Storage Cluster, each daemon looks for a Ceph -configuration file (i.e., ``ceph.conf`` by default) that provides the cluster's -configuration settings. For manual deployments, you need to create a Ceph -configuration file. For tools that create configuration files for you (*e.g.*, -``ceph-deploy``, Chef, etc.), you may use the information contained herein as a -reference. The Ceph configuration file defines: - -- Cluster Identity -- Authentication settings -- Cluster membership -- Host names -- Host addresses -- Paths to keyrings -- Paths to journals -- Paths to data -- Other runtime options - -The default Ceph configuration file locations in sequential order include: - -#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` - environment variable) -#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) -#. ``/etc/ceph/ceph.conf`` -#. ``~/.ceph/config`` -#. ``./ceph.conf`` (*i.e.,* in the current working directory) - - -The Ceph configuration file uses an *ini* style syntax. You can add comments -by preceding comments with a pound sign (#) or a semi-colon (;). For example: - -.. code-block:: ini - - # <--A number (#) sign precedes a comment. - ; A comment may be anything. - # Comments always follow a semi-colon (;) or a pound (#) on each line. - # The end of the line terminates a comment. - # We recommend that you provide comments in your configuration file(s). - - -.. _ceph-conf-settings: - -Config Sections -=============== - -The configuration file can configure all Ceph daemons in a Ceph Storage Cluster, -or all Ceph daemons of a particular type. To configure a series of daemons, the -settings must be included under the processes that will receive the -configuration as follows: - -``[global]`` - -:Description: Settings under ``[global]`` affect all daemons in a Ceph Storage - Cluster. - -:Example: ``auth supported = cephx`` - -``[osd]`` - -:Description: Settings under ``[osd]`` affect all ``ceph-osd`` daemons in - the Ceph Storage Cluster, and override the same setting in - ``[global]``. - -:Example: ``osd journal size = 1000`` - -``[mon]`` - -:Description: Settings under ``[mon]`` affect all ``ceph-mon`` daemons in - the Ceph Storage Cluster, and override the same setting in - ``[global]``. - -:Example: ``mon addr = 10.0.0.101:6789`` - - -``[mds]`` - -:Description: Settings under ``[mds]`` affect all ``ceph-mds`` daemons in - the Ceph Storage Cluster, and override the same setting in - ``[global]``. - -:Example: ``host = myserver01`` - -``[client]`` - -:Description: Settings under ``[client]`` affect all Ceph Clients - (e.g., mounted Ceph Filesystems, mounted Ceph Block Devices, - etc.). - -:Example: ``log file = /var/log/ceph/radosgw.log`` - - -Global settings affect all instances of all daemon in the Ceph Storage Cluster. -Use the ``[global]`` setting for values that are common for all daemons in the -Ceph Storage Cluster. You can override each ``[global]`` setting by: - -#. Changing the setting in a particular process type - (*e.g.,* ``[osd]``, ``[mon]``, ``[mds]`` ). - -#. Changing the setting in a particular process (*e.g.,* ``[osd.1]`` ). - -Overriding a global setting affects all child processes, except those that -you specifically override in a particular daemon. - -A typical global setting involves activating authentication. For example: - -.. code-block:: ini - - [global] - #Enable authentication between hosts within the cluster. 
- #v 0.54 and earlier - auth supported = cephx - - #v 0.55 and after - auth cluster required = cephx - auth service required = cephx - auth client required = cephx - - -You can specify settings that apply to a particular type of daemon. When you -specify settings under ``[osd]``, ``[mon]`` or ``[mds]`` without specifying a -particular instance, the setting will apply to all OSDs, monitors or metadata -daemons respectively. - -A typical daemon-wide setting involves setting journal sizes, filestore -settings, etc. For example: - -.. code-block:: ini - - [osd] - osd journal size = 1000 - - -You may specify settings for particular instances of a daemon. You may specify -an instance by entering its type, delimited by a period (.) and by the instance -ID. The instance ID for a Ceph OSD Daemon is always numeric, but it may be -alphanumeric for Ceph Monitors and Ceph Metadata Servers. - -.. code-block:: ini - - [osd.1] - # settings affect osd.1 only. - - [mon.a] - # settings affect mon.a only. - - [mds.b] - # settings affect mds.b only. - - -If the daemon you specify is a Ceph Gateway client, specify the daemon and the -instance, delimited by a period (.). For example:: - - [client.radosgw.instance-name] - # settings affect client.radosgw.instance-name only. - - - -.. _ceph-metavariables: - -Metavariables -============= - -Metavariables simplify Ceph Storage Cluster configuration dramatically. When a -metavariable is set in a configuration value, Ceph expands the metavariable into -a concrete value. Metavariables are very powerful when used within the -``[global]``, ``[osd]``, ``[mon]``, ``[mds]`` or ``[client]`` sections of your -configuration file. Ceph metavariables are similar to Bash shell expansion. - -Ceph supports the following metavariables: - - -``$cluster`` - -:Description: Expands to the Ceph Storage Cluster name. Useful when running - multiple Ceph Storage Clusters on the same hardware. - -:Example: ``/etc/ceph/$cluster.keyring`` -:Default: ``ceph`` - - -``$type`` - -:Description: Expands to one of ``mds``, ``osd``, or ``mon``, depending on the - type of the instant daemon. - -:Example: ``/var/lib/ceph/$type`` - - -``$id`` - -:Description: Expands to the daemon identifier. For ``osd.0``, this would be - ``0``; for ``mds.a``, it would be ``a``. - -:Example: ``/var/lib/ceph/$type/$cluster-$id`` - - -``$host`` - -:Description: Expands to the host name of the instant daemon. - - -``$name`` - -:Description: Expands to ``$type.$id``. -:Example: ``/var/run/ceph/$cluster-$name.asok`` - -``$pid`` - -:Description: Expands to daemon pid. -:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` - - -.. _ceph-conf-common-settings: - -Common Settings -=============== - -The `Hardware Recommendations`_ section provides some hardware guidelines for -configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph -Node` to run multiple daemons. For example, a single node with multiple drives -may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a -particular type of process. For example, some nodes may run ``ceph-osd`` -daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may -run ``ceph-mon`` daemons. - -Each node has a name identified by the ``host`` setting. Monitors also specify -a network address and port (i.e., domain name or IP address) identified by the -``addr`` setting. A basic configuration file will typically specify only -minimal settings for each instance of monitor daemons. For example: - -.. 
code-block:: ini - - [global] - mon_initial_members = ceph1 - mon_host = 10.0.0.1 - - -.. important:: The ``host`` setting is the short name of the node (i.e., not - an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on - the command line to retrieve the name of the node. Do not use ``host`` - settings for anything other than initial monitors unless you are deploying - Ceph manually. You **MUST NOT** specify ``host`` under individual daemons - when using deployment tools like ``chef`` or ``ceph-deploy``, as those tools - will enter the appropriate values for you in the cluster map. - - -.. _ceph-network-config: - -Networks -======== - -See the `Network Configuration Reference`_ for a detailed discussion about -configuring a network for use with Ceph. - - -Monitors -======== - -Ceph production clusters typically deploy with a minimum 3 :term:`Ceph Monitor` -daemons to ensure high availability should a monitor instance crash. At least -three (3) monitors ensures that the Paxos algorithm can determine which version -of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph -Monitors in the quorum. - -.. note:: You may deploy Ceph with a single monitor, but if the instance fails, - the lack of other monitors may interrupt data service availability. - -Ceph Monitors typically listen on port ``6789``. For example: - -.. code-block:: ini - - [mon.a] - host = hostName - mon addr = 150.140.130.120:6789 - -By default, Ceph expects that you will store a monitor's data under the -following path:: - - /var/lib/ceph/mon/$cluster-$id - -You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding -directory. With metavariables fully expressed and a cluster named "ceph", the -foregoing directory would evaluate to:: - - /var/lib/ceph/mon/ceph-a - -For additional details, see the `Monitor Config Reference`_. - -.. _Monitor Config Reference: ../mon-config-ref - - -.. _ceph-osd-config: - - -Authentication -============== - -.. versionadded:: Bobtail 0.56 - -For Bobtail (v 0.56) and beyond, you should expressly enable or disable -authentication in the ``[global]`` section of your Ceph configuration file. :: - - auth cluster required = cephx - auth service required = cephx - auth client required = cephx - -Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. - -.. important:: When upgrading, we recommend expressly disabling authentication - first, then perform the upgrade. Once the upgrade is complete, re-enable - authentication. - -.. _Cephx Config Reference: ../auth-config-ref - - -.. _ceph-monitor-config: - - -OSDs -==== - -Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node -has one OSD daemon running a filestore on one storage drive. A typical -deployment specifies a journal size. For example: - -.. code-block:: ini - - [osd] - osd journal size = 10000 - - [osd.0] - host = {hostname} #manual deployments only. - - -By default, Ceph expects that you will store a Ceph OSD Daemon's data with the -following path:: - - /var/lib/ceph/osd/$cluster-$id - -You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding -directory. With metavariables fully expressed and a cluster named "ceph", the -foregoing directory would evaluate to:: - - /var/lib/ceph/osd/ceph-0 - -You may override this path using the ``osd data`` setting. We don't recommend -changing the default location. Create the default directory on your OSD host. 
- -:: - - ssh {osd-host} - sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} - -The ``osd data`` path ideally leads to a mount point with a hard disk that is -separate from the hard disk storing and running the operating system and -daemons. If the OSD is for a disk other than the OS disk, prepare it for -use with Ceph, and mount it to the directory you just created:: - - ssh {new-osd-host} - sudo mkfs -t {fstype} /dev/{disk} - sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} - -We recommend using the ``xfs`` file system when running -:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and no -longer tested.) - -See the `OSD Config Reference`_ for additional configuration details. - - -Heartbeats -========== - -During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons -and report their findings to the Ceph Monitor. You do not have to provide any -settings. However, if you have network latency issues, you may wish to modify -the settings. - -See `Configuring Monitor/OSD Interaction`_ for additional details. - - -.. _ceph-logging-and-debugging: - -Logs / Debugging -================ - -Sometimes you may encounter issues with Ceph that require -modifying logging output and using Ceph's debugging. See `Debugging and -Logging`_ for details on log rotation. - -.. _Debugging and Logging: ../../troubleshooting/log-and-debug - - -Example ceph.conf -================= - -.. literalinclude:: demo-ceph.conf - :language: ini - -.. _ceph-runtime-config: - -Runtime Changes -=============== - -Ceph allows you to make changes to the configuration of a ``ceph-osd``, -``ceph-mon``, or ``ceph-mds`` daemon at runtime. This capability is quite -useful for increasing/decreasing logging output, enabling/disabling debug -settings, and even for runtime optimization. The following reflects runtime -configuration usage:: - - ceph tell {daemon-type}.{id or *} injectargs --{name} {value} [--{name} {value}] - -Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply -the runtime setting to all daemons of a particular type with ``*``, or specify -a specific daemon's ID (i.e., its number or letter). For example, to increase -debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: - - ceph tell osd.0 injectargs --debug-osd 20 --debug-ms 1 - -In your ``ceph.conf`` file, you may use spaces when specifying a -setting name. When specifying a setting name on the command line, -ensure that you use an underscore or hyphen (``_`` or ``-``) between -terms (e.g., ``debug osd`` becomes ``--debug-osd``). - - -Viewing a Configuration at Runtime -================================== - -If your Ceph Storage Cluster is running, and you would like to see the -configuration settings from a running daemon, execute the following:: - - ceph daemon {daemon-type}.{id} config show | less - -If you are on a machine where osd.0 is running, the command would be:: - - ceph daemon osd.0 config show | less - -Reading Configuration Metadata at Runtime -========================================= - -Information about the available configuration options is available via -the ``config help`` command: - -:: - - ceph daemon {daemon-type}.{id} config help | less - - -This metadata is primarily intended to be used when integrating other -software with Ceph, such as graphical user interfaces. 
The output is -a list of JSON objects, for example: - -:: - - { - "name": "mon_host", - "type": "std::string", - "level": "basic", - "desc": "list of hosts or addresses to search for a monitor", - "long_desc": "This is a comma, whitespace, or semicolon separated list of IP addresses or hostnames. Hostnames are resolved via DNS and all A or AAAA records are included in the search list.", - "default": "", - "daemon_default": "", - "tags": [], - "services": [ - "common" - ], - "see_also": [], - "enum_values": [], - "min": "", - "max": "" - } - -type -____ - -The type of the setting, given as a C++ type name. - -level -_____ - -One of `basic`, `advanced`, `dev`. The `dev` options are not intended -for use outside of development and testing. - -desc -____ - -A short description -- this is a sentence fragment suitable for display -in small spaces like a single line in a list. - -long_desc -_________ - -A full description of what the setting does, this may be as long as needed. - -default -_______ - -The default value, if any. - -daemon_default -______________ - -An alternative default used for daemons (services) as opposed to clients. - -tags -____ - -A list of strings indicating topics to which this setting relates. Examples -of tags are `performance` and `networking`. - -services -________ - -A list of strings indicating which Ceph services the setting relates to, such -as `osd`, `mds`, `mon`. For settings that are relevant to any Ceph client -or server, `common` is used. - -see_also -________ - -A list of strings indicating other configuration options that may also -be of interest to a user setting this option. - -enum_values -___________ - -Optional: a list of strings indicating the valid settings. - -min, max -________ - -Optional: upper and lower (inclusive) bounds on valid settings. - - - - -Running Multiple Clusters -========================= - -With Ceph, you can run multiple Ceph Storage Clusters on the same hardware. -Running multiple clusters provides a higher level of isolation compared to -using different pools on the same cluster with different CRUSH rulesets. A -separate cluster will have separate monitor, OSD and metadata server processes. -When running Ceph with default settings, the default cluster name is ``ceph``, -which means you would save your Ceph configuration file with the file name -``ceph.conf`` in the ``/etc/ceph`` default directory. - -See `ceph-deploy new`_ for details. -.. _ceph-deploy new:../ceph-deploy-new - -When you run multiple clusters, you must name your cluster and save the Ceph -configuration file with the name of the cluster. For example, a cluster named -``openstack`` will have a Ceph configuration file with the file name -``openstack.conf`` in the ``/etc/ceph`` default directory. - -.. important:: Cluster names must consist of letters a-z and digits 0-9 only. - -Separate clusters imply separate data disks and journals, which are not shared -between clusters. Referring to `Metavariables`_, the ``$cluster`` metavariable -evaluates to the cluster name (i.e., ``openstack`` in the foregoing example). -Various settings use the ``$cluster`` metavariable, including: - -- ``keyring`` -- ``admin socket`` -- ``log file`` -- ``pid file`` -- ``mon data`` -- ``mon cluster log file`` -- ``osd data`` -- ``osd journal`` -- ``mds data`` -- ``rgw data`` - -See `General Settings`_, `OSD Settings`_, `Monitor Settings`_, `MDS Settings`_, -`RGW Settings`_ and `Log Settings`_ for relevant path defaults that use the -``$cluster`` metavariable. - -.. 
_General Settings: ../general-config-ref -.. _OSD Settings: ../osd-config-ref -.. _Monitor Settings: ../mon-config-ref -.. _MDS Settings: ../../../cephfs/mds-config-ref -.. _RGW Settings: ../../../radosgw/config-ref/ -.. _Log Settings: ../../troubleshooting/log-and-debug - - -When creating default directories or files, you should use the cluster -name at the appropriate places in the path. For example:: - - sudo mkdir /var/lib/ceph/osd/openstack-0 - sudo mkdir /var/lib/ceph/mon/openstack-a - -.. important:: When running monitors on the same host, you should use - different ports. By default, monitors use port 6789. If you already - have monitors using port 6789, use a different port for your other cluster(s). - -To invoke a cluster other than the default ``ceph`` cluster, use the -``-c {filename}.conf`` option with the ``ceph`` command. For example:: - - ceph -c {cluster-name}.conf health - ceph -c openstack.conf health - - -.. _Hardware Recommendations: ../../../start/hardware-recommendations -.. _Network Configuration Reference: ../network-config-ref -.. _OSD Config Reference: ../osd-config-ref -.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction -.. _ceph-deploy new: ../../deployment/ceph-deploy-new#naming-a-cluster diff --git a/src/ceph/doc/rados/configuration/demo-ceph.conf b/src/ceph/doc/rados/configuration/demo-ceph.conf deleted file mode 100644 index ba86d53..0000000 --- a/src/ceph/doc/rados/configuration/demo-ceph.conf +++ /dev/null @@ -1,31 +0,0 @@ -[global] -fsid = {cluster-id} -mon initial members = {hostname}[, {hostname}] -mon host = {ip-address}[, {ip-address}] - -#All clusters have a front-side public network. -#If you have two NICs, you can configure a back side cluster -#network for OSD object replication, heart beats, backfilling, -#recovery, etc. -public network = {network}[, {network}] -#cluster network = {network}[, {network}] - -#Clusters require authentication by default. -auth cluster required = cephx -auth service required = cephx -auth client required = cephx - -#Choose reasonable numbers for your journals, number of replicas -#and placement groups. -osd journal size = {n} -osd pool default size = {n} # Write an object n times. -osd pool default min size = {n} # Allow writing n copy in a degraded state. -osd pool default pg num = {n} -osd pool default pgp num = {n} - -#Choose a reasonable crush leaf type. -#0 for a 1-node cluster. -#1 for a multi node cluster in a single rack -#2 for a multi node, multi chassis cluster with multiple hosts in a chassis -#3 for a multi node cluster with hosts across racks, etc. -osd crush chooseleaf type = {n}
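-
-#A hypothetical filled-in example for a small three-node cluster; every
-#value below is illustrative only (generate your own fsid with uuidgen and
-#substitute your own hosts, networks, and pool defaults):
-#
-# fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
-# mon initial members = node1, node2, node3
-# mon host = 192.168.0.11, 192.168.0.12, 192.168.0.13
-# public network = 192.168.0.0/24
-# osd journal size = 1024
-# osd pool default size = 3
-# osd pool default min size = 2
-# osd pool default pg num = 128
-# osd pool default pgp num = 128
-# osd crush chooseleaf type = 1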
\ No newline at end of file
diff --git a/src/ceph/doc/rados/configuration/filestore-config-ref.rst b/src/ceph/doc/rados/configuration/filestore-config-ref.rst
deleted file mode 100644
index 4dff60c..0000000
--- a/src/ceph/doc/rados/configuration/filestore-config-ref.rst
+++ /dev/null
@@ -1,365 +0,0 @@
-============================
- Filestore Config Reference
-============================
-
-
-``filestore debug omap check``
-
-:Description: Debugging check on synchronization. Expensive. For debugging only.
-:Type: Boolean
-:Required: No
-:Default: ``0``
-
-
-.. index:: filestore; extended attributes
-
-Extended Attributes
-===================
-
-Extended Attributes (XATTRs) are an important aspect of your configuration.
-Some file systems have limits on the number of bytes stored in XATTRs.
-Additionally, in some cases, the filesystem may not be as fast as an alternative
-method of storing XATTRs. The following settings may help improve performance
-by using a method of storing XATTRs that is extrinsic to the underlying filesystem.
-
-Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided
-by the underlying file system, if it does not impose a size limit. If
-there is a size limit (4KB total on ext4, for instance), some Ceph
-XATTRs will be stored in a key/value database when either the
-``filestore max inline xattr size`` or ``filestore max inline
-xattrs`` threshold is reached.
-
-
-``filestore max inline xattr size``
-
-:Description: The maximum size of an XATTR stored in the filesystem (e.g., XFS,
-              btrfs, ext4, etc.) per object. Should not be larger than the
-              filesystem can handle. Default value of 0 means to use the value
-              specific to the underlying filesystem.
-:Type: Unsigned 32-bit Integer
-:Required: No
-:Default: ``0``
-
-
-``filestore max inline xattr size xfs``
-
-:Description: The maximum size of an XATTR stored in the XFS filesystem.
-              Only used if ``filestore max inline xattr size`` == 0.
-:Type: Unsigned 32-bit Integer
-:Required: No
-:Default: ``65536``
-
-
-``filestore max inline xattr size btrfs``
-
-:Description: The maximum size of an XATTR stored in the btrfs filesystem.
-              Only used if ``filestore max inline xattr size`` == 0.
-:Type: Unsigned 32-bit Integer
-:Required: No
-:Default: ``2048``
-
-
-``filestore max inline xattr size other``
-
-:Description: The maximum size of an XATTR stored in other filesystems.
-              Only used if ``filestore max inline xattr size`` == 0.
-:Type: Unsigned 32-bit Integer
-:Required: No
-:Default: ``512``
-
-
-``filestore max inline xattrs``
-
-:Description: The maximum number of XATTRs stored in the filesystem per object.
-              Default value of 0 means to use the value specific to the
-              underlying filesystem.
-:Type: 32-bit Integer
-:Required: No
-:Default: ``0``
-
-
-``filestore max inline xattrs xfs``
-
-:Description: The maximum number of XATTRs stored in the XFS filesystem per object.
-              Only used if ``filestore max inline xattrs`` == 0.
-:Type: 32-bit Integer
-:Required: No
-:Default: ``10``
-
-
-``filestore max inline xattrs btrfs``
-
-:Description: The maximum number of XATTRs stored in the btrfs filesystem per object.
-              Only used if ``filestore max inline xattrs`` == 0.
-:Type: 32-bit Integer
-:Required: No
-:Default: ``10``
-
-
-``filestore max inline xattrs other``
-
-:Description: The maximum number of XATTRs stored in other filesystems per object.
-              Only used if ``filestore max inline xattrs`` == 0.
-:Type: 32-bit Integer
-:Required: No
-:Default: ``2``
-
-.. 
index:: filestore; synchronization - -Synchronization Intervals -========================= - -Periodically, the filestore needs to quiesce writes and synchronize the -filesystem, which creates a consistent commit point. It can then free journal -entries up to the commit point. Synchronizing more frequently tends to reduce -the time required to perform synchronization, and reduces the amount of data -that needs to remain in the journal. Less frequent synchronization allows the -backing filesystem to coalesce small writes and metadata updates more -optimally--potentially resulting in more efficient synchronization. - - -``filestore max sync interval`` - -:Description: The maximum interval in seconds for synchronizing the filestore. -:Type: Double -:Required: No -:Default: ``5`` - - -``filestore min sync interval`` - -:Description: The minimum interval in seconds for synchronizing the filestore. -:Type: Double -:Required: No -:Default: ``.01`` - - -.. index:: filestore; flusher - -Flusher -======= - -The filestore flusher forces data from large writes to be written out using -``sync file range`` before the sync in order to (hopefully) reduce the cost of -the eventual sync. In practice, disabling 'filestore flusher' seems to improve -performance in some cases. - - -``filestore flusher`` - -:Description: Enables the filestore flusher. -:Type: Boolean -:Required: No -:Default: ``false`` - -.. deprecated:: v.65 - -``filestore flusher max fds`` - -:Description: Sets the maximum number of file descriptors for the flusher. -:Type: Integer -:Required: No -:Default: ``512`` - -.. deprecated:: v.65 - -``filestore sync flush`` - -:Description: Enables the synchronization flusher. -:Type: Boolean -:Required: No -:Default: ``false`` - -.. deprecated:: v.65 - -``filestore fsync flushes journal data`` - -:Description: Flush journal data during filesystem synchronization. -:Type: Boolean -:Required: No -:Default: ``false`` - - -.. index:: filestore; queue - -Queue -===== - -The following settings provide limits on the size of filestore queue. - -``filestore queue max ops`` - -:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. -:Type: Integer -:Required: No. Minimal impact on performance. -:Default: ``50`` - - -``filestore queue max bytes`` - -:Description: The maximum number of bytes for an operation. -:Type: Integer -:Required: No -:Default: ``100 << 20`` - - - - -.. index:: filestore; timeouts - -Timeouts -======== - - -``filestore op threads`` - -:Description: The number of filesystem operation threads that execute in parallel. -:Type: Integer -:Required: No -:Default: ``2`` - - -``filestore op thread timeout`` - -:Description: The timeout for a filesystem operation thread (in seconds). -:Type: Integer -:Required: No -:Default: ``60`` - - -``filestore op thread suicide timeout`` - -:Description: The timeout for a commit operation before cancelling the commit (in seconds). -:Type: Integer -:Required: No -:Default: ``180`` - - -.. index:: filestore; btrfs - -B-Tree Filesystem -================= - - -``filestore btrfs snap`` - -:Description: Enable snapshots for a ``btrfs`` filestore. -:Type: Boolean -:Required: No. Only used for ``btrfs``. -:Default: ``true`` - - -``filestore btrfs clone range`` - -:Description: Enable cloning ranges for a ``btrfs`` filestore. -:Type: Boolean -:Required: No. Only used for ``btrfs``. -:Default: ``true`` - - -.. 
index:: filestore; journal - -Journal -======= - - -``filestore journal parallel`` - -:Description: Enables parallel journaling, default for btrfs. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore journal writeahead`` - -:Description: Enables writeahead journaling, default for xfs. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore journal trailing`` - -:Description: Deprecated, never use. -:Type: Boolean -:Required: No -:Default: ``false`` - - -Misc -==== - - -``filestore merge threshold`` - -:Description: Min number of files in a subdir before merging into parent - NOTE: A negative value means to disable subdir merging -:Type: Integer -:Required: No -:Default: ``10`` - - -``filestore split multiple`` - -:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` - is the maximum number of files in a subdirectory before - splitting into child directories. - -:Type: Integer -:Required: No -:Default: ``2`` - - -``filestore split rand factor`` - -:Description: A random factor added to the split threshold to avoid - too many filestore splits occurring at once. See - ``filestore split multiple`` for details. - This can only be changed for an existing osd offline, - via ceph-objectstore-tool's apply-layout-settings command. - -:Type: Unsigned 32-bit Integer -:Required: No -:Default: ``20`` - - -``filestore update to`` - -:Description: Limits filestore auto upgrade to specified version. -:Type: Integer -:Required: No -:Default: ``1000`` - - -``filestore blackhole`` - -:Description: Drop any new transactions on the floor. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore dump file`` - -:Description: File onto which store transaction dumps. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``filestore kill at`` - -:Description: inject a failure at the n'th opportunity -:Type: String -:Required: No -:Default: ``false`` - - -``filestore fail eio`` - -:Description: Fail/Crash on eio. -:Type: Boolean -:Required: No -:Default: ``true`` - diff --git a/src/ceph/doc/rados/configuration/general-config-ref.rst b/src/ceph/doc/rados/configuration/general-config-ref.rst deleted file mode 100644 index ca09ee5..0000000 --- a/src/ceph/doc/rados/configuration/general-config-ref.rst +++ /dev/null @@ -1,66 +0,0 @@ -========================== - General Config Reference -========================== - - -``fsid`` - -:Description: The filesystem ID. One per cluster. -:Type: UUID -:Required: No. -:Default: N/A. Usually generated by deployment tools. - - -``admin socket`` - -:Description: The socket for executing administrative commands on a daemon, - irrespective of whether Ceph Monitors have established a quorum. - -:Type: String -:Required: No -:Default: ``/var/run/ceph/$cluster-$name.asok`` - - -``pid file`` - -:Description: The file in which the mon, osd or mds will write its - PID. For instance, ``/var/run/$cluster/$type.$id.pid`` - will create /var/run/ceph/mon.a.pid for the ``mon`` with - id ``a`` running in the ``ceph`` cluster. The ``pid - file`` is removed when the daemon stops gracefully. If - the process is not daemonized (i.e. runs with the ``-f`` - or ``-d`` option), the ``pid file`` is not created. -:Type: String -:Required: No -:Default: No - - -``chdir`` - -:Description: The directory Ceph daemons change to once they are - up and running. Default ``/`` directory recommended. 
- -:Type: String -:Required: No -:Default: ``/`` - - -``max open files`` - -:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets - the ``max open fds`` at the OS level (i.e., the max # of file - descriptors). It helps prevents Ceph OSD Daemons from running out - of file descriptors. - -:Type: 64-bit Integer -:Required: No -:Default: ``0`` - - -``fatal signal handlers`` - -:Description: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, - FPE, XCPU, XFSZ, SYS signals to generate a useful log message - -:Type: Boolean -:Default: ``true`` diff --git a/src/ceph/doc/rados/configuration/index.rst b/src/ceph/doc/rados/configuration/index.rst deleted file mode 100644 index 48b58ef..0000000 --- a/src/ceph/doc/rados/configuration/index.rst +++ /dev/null @@ -1,64 +0,0 @@ -=============== - Configuration -=============== - -Ceph can run with a cluster containing thousands of Object Storage Devices -(OSDs). A minimal system will have at least two OSDs for data replication. To -configure OSD clusters, you must provide settings in the configuration file. -Ceph provides default values for many settings, which you can override in the -configuration file. Additionally, you can make runtime modification to the -configuration using command-line utilities. - -When Ceph starts, it activates three daemons: - -- ``ceph-mon`` (mandatory) -- ``ceph-osd`` (mandatory) -- ``ceph-mds`` (mandatory for cephfs only) - -Each process, daemon or utility loads the host's configuration file. A process -may have information about more than one daemon instance (*i.e.,* multiple -contexts). A daemon or utility only has information about a single daemon -instance (a single context). - -.. note:: Ceph can run on a single host for evaluation purposes. - - -.. raw:: html - - <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> - -For general object store configuration, refer to the following: - -.. toctree:: - :maxdepth: 1 - - Storage devices <storage-devices> - ceph-conf - - -.. raw:: html - - </td><td><h3>Reference</h3> - -To optimize the performance of your cluster, refer to the following: - -.. toctree:: - :maxdepth: 1 - - Network Settings <network-config-ref> - Auth Settings <auth-config-ref> - Monitor Settings <mon-config-ref> - mon-lookup-dns - Heartbeat Settings <mon-osd-interaction> - OSD Settings <osd-config-ref> - BlueStore Settings <bluestore-config-ref> - FileStore Settings <filestore-config-ref> - Journal Settings <journal-ref> - Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> - Messaging Settings <ms-ref> - General Settings <general-config-ref> - - -.. raw:: html - - </td></tr></tbody></table> diff --git a/src/ceph/doc/rados/configuration/journal-ref.rst b/src/ceph/doc/rados/configuration/journal-ref.rst deleted file mode 100644 index 97300f4..0000000 --- a/src/ceph/doc/rados/configuration/journal-ref.rst +++ /dev/null @@ -1,116 +0,0 @@ -========================== - Journal Config Reference -========================== - -.. index:: journal; journal configuration - -Ceph OSDs use a journal for two reasons: speed and consistency. - -- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes - quickly. Ceph writes small, random i/o to the journal sequentially, which - tends to speed up bursty workloads by allowing the backing filesystem more - time to coalesce writes. 
The Ceph OSD Daemon's journal, however, can lead - to spiky performance with short spurts of high-speed writes followed by - periods without any write progress as the filesystem catches up to the - journal. - -- **Consistency:** Ceph OSD Daemons require a filesystem interface that - guarantees atomic compound operations. Ceph OSD Daemons write a description - of the operation to the journal and apply the operation to the filesystem. - This enables atomic updates to an object (for example, placement group - metadata). Every few seconds--between ``filestore max sync interval`` and - ``filestore min sync interval``--the Ceph OSD Daemon stops writes and - synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to - trim operations from the journal and reuse the space. On failure, Ceph - OSD Daemons replay the journal starting after the last synchronization - operation. - -Ceph OSD Daemons support the following journal settings: - - -``journal dio`` - -:Description: Enables direct i/o to the journal. Requires ``journal block - align`` set to ``true``. - -:Type: Boolean -:Required: Yes when using ``aio``. -:Default: ``true`` - - - -``journal aio`` - -.. versionchanged:: 0.61 Cuttlefish - -:Description: Enables using ``libaio`` for asynchronous writes to the journal. - Requires ``journal dio`` set to ``true``. - -:Type: Boolean -:Required: No. -:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``. - - -``journal block align`` - -:Description: Block aligns write operations. Required for ``dio`` and ``aio``. -:Type: Boolean -:Required: Yes when using ``dio`` and ``aio``. -:Default: ``true`` - - -``journal max write bytes`` - -:Description: The maximum number of bytes the journal will write at - any one time. - -:Type: Integer -:Required: No -:Default: ``10 << 20`` - - -``journal max write entries`` - -:Description: The maximum number of entries the journal will write at - any one time. - -:Type: Integer -:Required: No -:Default: ``100`` - - -``journal queue max ops`` - -:Description: The maximum number of operations allowed in the queue at - any one time. - -:Type: Integer -:Required: No -:Default: ``500`` - - -``journal queue max bytes`` - -:Description: The maximum number of bytes allowed in the queue at - any one time. - -:Type: Integer -:Required: No -:Default: ``10 << 20`` - - -``journal align min size`` - -:Description: Align data payloads greater than the specified minimum. -:Type: Integer -:Required: No -:Default: ``64 << 10`` - - -``journal zero on create`` - -:Description: Causes the file store to overwrite the entire journal with - ``0``'s during ``mkfs``. -:Type: Boolean -:Required: No -:Default: ``false`` diff --git a/src/ceph/doc/rados/configuration/mon-config-ref.rst b/src/ceph/doc/rados/configuration/mon-config-ref.rst deleted file mode 100644 index 6c8e92b..0000000 --- a/src/ceph/doc/rados/configuration/mon-config-ref.rst +++ /dev/null @@ -1,1222 +0,0 @@ -========================== - Monitor Config Reference -========================== - -Understanding how to configure a :term:`Ceph Monitor` is an important part of -building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters -have at least one monitor**. A monitor configuration usually remains fairly -consistent, but you can add, remove or replace a monitor in a cluster. See -`Adding/Removing a Monitor`_ and `Add/Remove a Monitor (ceph-deploy)`_ for -details. - - -.. 
index:: Ceph Monitor; Paxos - -Background -========== - -Ceph Monitors maintain a "master copy" of the :term:`cluster map`, which means a -:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD -Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and -retrieving a current cluster map. Before Ceph Clients can read from or write to -Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor -first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph -Client can compute the location for any object. The ability to compute object -locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a -very important aspect of Ceph's high scalability and performance. See -`Scalability and High Availability`_ for additional details. - -The primary role of the Ceph Monitor is to maintain a master copy of the cluster -map. Ceph Monitors also provide authentication and logging services. Ceph -Monitors write all changes in the monitor services to a single Paxos instance, -and Paxos writes the changes to a key/value store for strong consistency. Ceph -Monitors can query the most recent version of the cluster map during sync -operations. Ceph Monitors leverage the key/value store's snapshots and iterators -(using leveldb) to perform store-wide synchronization. - -.. ditaa:: - - /-------------\ /-------------\ - | Monitor | Write Changes | Paxos | - | cCCC +-------------->+ cCCC | - | | | | - +-------------+ \------+------/ - | Auth | | - +-------------+ | Write Changes - | Log | | - +-------------+ v - | Monitor Map | /------+------\ - +-------------+ | Key / Value | - | OSD Map | | Store | - +-------------+ | cCCC | - | PG Map | \------+------/ - +-------------+ ^ - | MDS Map | | Read Changes - +-------------+ | - | cCCC |*---------------------+ - \-------------/ - - -.. deprecated:: version 0.58 - -In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for -each service and store the map as a file. - -.. index:: Ceph Monitor; cluster map - -Cluster Maps ------------- - -The cluster map is a composite of maps, including the monitor map, the OSD map, -the placement group map and the metadata server map. The cluster map tracks a -number of important things: which processes are ``in`` the Ceph Storage Cluster; -which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running -or ``down``; whether, the placement groups are ``active`` or ``inactive``, and -``clean`` or in some other state; and, other details that reflect the current -state of the cluster such as the total amount of storage space, and the amount -of storage used. - -When there is a significant change in the state of the cluster--e.g., a Ceph OSD -Daemon goes down, a placement group falls into a degraded state, etc.--the -cluster map gets updated to reflect the current state of the cluster. -Additionally, the Ceph Monitor also maintains a history of the prior states of -the cluster. The monitor map, OSD map, placement group map and metadata server -map each maintain a history of their map versions. We call each version an -"epoch." - -When operating your Ceph Storage Cluster, keeping track of these states is an -important part of your system administration duties. See `Monitoring a Cluster`_ -and `Monitoring OSDs and PGs`_ for additional details. - -.. 
index:: high availability; quorum - -Monitor Quorum --------------- - -Our Configuring ceph section provides a trivial `Ceph configuration file`_ that -provides for one monitor in the test cluster. A cluster will run fine with a -single monitor; however, **a single monitor is a single-point-of-failure**. To -ensure high availability in a production Ceph Storage Cluster, you should run -Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** -bring down your entire cluster. - -When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, -Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. -A consensus requires a majority of monitors running to establish a quorum for -consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; -etc.). - -``mon force quorum join`` - -:Description: Force monitor to join quorum even if it has been previously removed from the map -:Type: Boolean -:Default: ``False`` - -.. index:: Ceph Monitor; consistency - -Consistency ------------ - -When you add monitor settings to your Ceph configuration file, you need to be -aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes -strict consistency requirements** for a Ceph monitor when discovering another -Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons -use the Ceph configuration file to discover monitors, monitors discover each -other using the monitor map (monmap), not the Ceph configuration file. - -A Ceph Monitor always refers to the local copy of the monmap when discovering -other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the -Ceph configuration file avoids errors that could break the cluster (e.g., typos -in ``ceph.conf`` when specifying a monitor address or port). Since monitors use -monmaps for discovery and they share monmaps with clients and other Ceph -daemons, **the monmap provides monitors with a strict guarantee that their -consensus is valid.** - -Strict consistency also applies to updates to the monmap. As with any other -updates on the Ceph Monitor, changes to the monmap always run through a -distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on -each update to the monmap, such as adding or removing a Ceph Monitor, to ensure -that each monitor in the quorum has the same version of the monmap. Updates to -the monmap are incremental so that Ceph Monitors have the latest agreed upon -version, and a set of previous versions. Maintaining a history enables a Ceph -Monitor that has an older version of the monmap to catch up with the current -state of the Ceph Storage Cluster. - -If Ceph Monitors discovered each other through the Ceph configuration file -instead of through the monmap, it would introduce additional risks because the -Ceph configuration files are not updated and distributed automatically. Ceph -Monitors might inadvertently use an older Ceph configuration file, fail to -recognize a Ceph Monitor, fall out of a quorum, or develop a situation where -`Paxos`_ is not able to determine the current state of the system accurately. - - -.. index:: Ceph Monitor; bootstrapping monitors - -Bootstrapping Monitors ----------------------- - -In most configuration and deployment cases, tools that deploy Ceph may help -bootstrap the Ceph Monitors by generating a monitor map for you (e.g., -``ceph-deploy``, etc). 
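-
-If you are not using such a tool, you can also generate a monitor map by hand
-with ``monmaptool``. The following is only an illustrative sketch; the monitor
-name, address and ``fsid`` are placeholders::
-
-    # Generate an fsid, then create and inspect an initial monmap for mon.a.
-    uuidgen
-    monmaptool --create --add a 10.0.0.10:6789 --fsid {uuid-from-above} /tmp/monmap
-    monmaptool --print /tmp/monmap
-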
A Ceph Monitor requires a few explicit -settings: - -- **Filesystem ID**: The ``fsid`` is the unique identifier for your - object store. Since you can run multiple clusters on the same - hardware, you must specify the unique ID of the object store when - bootstrapping a monitor. Deployment tools usually do this for you - (e.g., ``ceph-deploy`` can call a tool like ``uuidgen``), but you - may specify the ``fsid`` manually too. - -- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within - the cluster. It is an alphanumeric value, and by convention the identifier - usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This - can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), - by a deployment tool, or using the ``ceph`` commandline. - -- **Keys**: The monitor must have secret keys. A deployment tool such as - ``ceph-deploy`` usually does this for you, but you may - perform this step manually too. See `Monitor Keyrings`_ for details. - -For additional details on bootstrapping, see `Bootstrapping a Monitor`_. - -.. index:: Ceph Monitor; configuring monitors - -Configuring Monitors -==================== - -To apply configuration settings to the entire cluster, enter the configuration -settings under ``[global]``. To apply configuration settings to all monitors in -your cluster, enter the configuration settings under ``[mon]``. To apply -configuration settings to specific monitors, specify the monitor instance -(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. - -.. code-block:: ini - - [global] - - [mon] - - [mon.a] - - [mon.b] - - [mon.c] - - -Minimum Configuration ---------------------- - -The bare minimum monitor settings for a Ceph monitor via the Ceph configuration -file include a hostname and a monitor address for each monitor. You can configure -these under ``[mon]`` or under the entry for a specific monitor. - -.. code-block:: ini - - [mon] - mon host = hostname1,hostname2,hostname3 - mon addr = 10.0.0.10:6789,10.0.0.11:6789,10.0.0.12:6789 - - -.. code-block:: ini - - [mon.a] - host = hostname1 - mon addr = 10.0.0.10:6789 - -See the `Network Configuration Reference`_ for details. - -.. note:: This minimum configuration for monitors assumes that a deployment - tool generates the ``fsid`` and the ``mon.`` key for you. - -Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP address of -the monitors. However, if you decide to change the monitor's IP address, you -must follow a specific procedure. See `Changing a Monitor's IP Address`_ for -details. - -Monitors can also be found by clients using DNS SRV records. See `Monitor lookup through DNS`_ for details. - -Cluster ID ----------- - -Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it -usually appears under the ``[global]`` section of the configuration file. -Deployment tools usually generate the ``fsid`` and store it in the monitor map, -so the value may not appear in a configuration file. The ``fsid`` makes it -possible to run daemons for multiple clusters on the same hardware. - -``fsid`` - -:Description: The cluster ID. One per cluster. -:Type: UUID -:Required: Yes. -:Default: N/A. May be generated by a deployment tool if not specified. - -.. note:: Do not set this value if you use a deployment tool that does - it for you. - - -.. 
index:: Ceph Monitor; initial members - -Initial Members ---------------- - -We recommend running a production Ceph Storage Cluster with at least three Ceph -Monitors to ensure high availability. When you run multiple monitors, you may -specify the initial monitors that must be members of the cluster in order to -establish a quorum. This may reduce the time it takes for your cluster to come -online. - -.. code-block:: ini - - [mon] - mon initial members = a,b,c - - -``mon initial members`` - -:Description: The IDs of initial monitors in a cluster during startup. If - specified, Ceph requires an odd number of monitors to form an - initial quorum (e.g., 3). - -:Type: String -:Default: None - -.. note:: A *majority* of monitors in your cluster must be able to reach - each other in order to establish a quorum. You can decrease the initial - number of monitors to establish a quorum with this setting. - -.. index:: Ceph Monitor; data path - -Data ----- - -Ceph provides a default path where Ceph Monitors store data. For optimal -performance in a production Ceph Storage Cluster, we recommend running Ceph -Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb is using -``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk -very often, which can interfere with Ceph OSD Daemon workloads if the data -store is co-located with the OSD Daemons. - -In Ceph versions 0.58 and earlier, Ceph Monitors store their data in files. This -approach allows users to inspect monitor data with common tools like ``ls`` -and ``cat``. However, it doesn't provide strong consistency. - -In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value -pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents -recovering Ceph Monitors from running corrupted versions through Paxos, and it -enables multiple modification operations in one single atomic batch, among other -advantages. - -Generally, we do not recommend changing the default data location. If you modify -the default location, we recommend that you make it uniform across Ceph Monitors -by setting it in the ``[mon]`` section of the configuration file. - - -``mon data`` - -:Description: The monitor's data location. -:Type: String -:Default: ``/var/lib/ceph/mon/$cluster-$id`` - - -``mon data size warn`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log when the monitor's data - store goes over 15GB. -:Type: Integer -:Default: 15*1024*1024*1024* - - -``mon data avail warn`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log when the available disk - space of monitor's data store is lower or equal to this - percentage. -:Type: Integer -:Default: 30 - - -``mon data avail crit`` - -:Description: Issue a ``HEALTH_ERR`` in cluster log when the available disk - space of monitor's data store is lower or equal to this - percentage. -:Type: Integer -:Default: 5 - - -``mon warn on cache pools without hit sets`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if a cache pool does not - have the hitset type set set. - See `hit set type <../operations/pools#hit-set-type>`_ for more - details. -:Type: Boolean -:Default: True - - -``mon warn on crush straw calc version zero`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH's - ``straw_calc_version`` is zero. See - `CRUSH map tunables <../operations/crush-map#tunables>`_ for - details. 
-:Type: Boolean
-:Default: True
-
-
-``mon warn on legacy crush tunables``
-
-:Description: Issue a ``HEALTH_WARN`` in cluster log if
-              CRUSH tunables are too old (older than ``mon_min_crush_required_version``).
-:Type: Boolean
-:Default: True
-
-
-``mon crush min required version``
-
-:Description: The minimum tunable profile version required by the cluster.
-              See
-              `CRUSH map tunables <../operations/crush-map#tunables>`_ for
-              details.
-:Type: String
-:Default: ``firefly``
-
-
-``mon warn on osd down out interval zero``
-
-:Description: Issue a ``HEALTH_WARN`` in cluster log if
-              ``mon osd down out interval`` is zero. Having this option set to
-              zero on the leader acts much like the ``noout`` flag. It is hard
-              to troubleshoot a cluster that behaves as if the ``noout`` flag
-              were set even though it is not, so we report a warning in this
-              case.
-:Type: Boolean
-:Default: True
-
-
-``mon cache target full warn ratio``
-
-:Description: The position between a pool's ``cache_target_full`` and
-              ``target_max_object`` values at which Ceph begins to warn.
-:Type: Float
-:Default: ``0.66``
-
-
-``mon health data update interval``
-
-:Description: How often (in seconds) a monitor in quorum shares its health
-              status with its peers. (A negative number disables it.)
-:Type: Float
-:Default: ``60``
-
-
-``mon health to clog``
-
-:Description: Enable sending a health summary to the cluster log periodically.
-:Type: Boolean
-:Default: True
-
-
-``mon health to clog tick interval``
-
-:Description: How often (in seconds) the monitor sends a health summary to the
-              cluster log (a non-positive number disables it). If the current
-              health summary is empty or identical to the previous one, the
-              monitor will not send it to the cluster log.
-:Type: Integer
-:Default: 3600
-
-
-``mon health to clog interval``
-
-:Description: How often (in seconds) the monitor sends a health summary to the
-              cluster log (a non-positive number disables it). The monitor
-              always sends the summary, whether or not it has changed.
-:Type: Integer
-:Default: 60
-
-
-
-.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning
-
-Storage Capacity
-----------------
-
-When a Ceph Storage Cluster gets close to its maximum capacity (i.e., ``mon osd
-full ratio``), Ceph prevents you from writing to or reading from Ceph OSD
-Daemons as a safety measure to prevent data loss. Therefore, letting a
-production Ceph Storage Cluster approach its full ratio is not a good practice,
-because it sacrifices high availability. The default full ratio is ``.95``, or
-95% of capacity. This is a very aggressive setting for a test cluster with a
-small number of OSDs.
-
-.. tip:: When monitoring your cluster, be alert to warnings related to the
-   ``nearfull`` ratio. Such a warning means that a failure of one or more OSDs
-   could result in a temporary service disruption. Consider adding more OSDs
-   to increase storage capacity.
-
-A common scenario for test clusters involves a system administrator removing a
-Ceph OSD Daemon from the Ceph Storage Cluster to watch the cluster rebalance;
-then removing another Ceph OSD Daemon, and so on, until the Ceph Storage Cluster
-eventually reaches the full ratio and locks up. We recommend a bit of capacity
-planning even with a test cluster. Planning enables you to gauge how much spare
-capacity you will need in order to maintain high availability. 
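-
-As a rough illustration of the arithmetic involved (the numbers below are
-hypothetical, not recommendations)::
-
-    10 OSDs x 4 TB drives                 = 40 TB raw capacity
-    mon osd full ratio = .95              -> writes stop at ~38 TB used
-    tolerate the loss of 2 OSDs (8 TB)    -> plan to stay below ~30 TB used
-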
Ideally, you want -to plan for a series of Ceph OSD Daemon failures where the cluster can recover -to an ``active + clean`` state without replacing those Ceph OSD Daemons -immediately. You can run a cluster in an ``active + degraded`` state, but this -is not ideal for normal operating conditions. - -The following diagram depicts a simplistic Ceph Storage Cluster containing 33 -Ceph Nodes with one Ceph OSD Daemon per host, each Ceph OSD Daemon reading from -and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum -actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph -Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow -Ceph Clients to read and write data. So the Ceph Storage Cluster's operating -capacity is 95TB, not 99TB. - -.. ditaa:: - - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | - | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | - +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ - -It is normal in such a cluster for one or two OSDs to fail. A less frequent but -reasonable scenario involves a rack's router or power supply failing, which -brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, -you should still strive for a cluster that can remain operational and achieve an -``active + clean`` state--even if that means adding a few hosts with additional -OSDs in short order. If your capacity utilization is too high, you may not lose -data, but you could still sacrifice data availability while resolving an outage -within a failure domain if capacity utilization of the cluster exceeds the full -ratio. For this reason, we recommend at least some rough capacity planning. - -Identify two numbers for your cluster: - -#. The number of OSDs. -#. The total capacity of the cluster - -If you divide the total capacity of your cluster by the number of OSDs in your -cluster, you will find the mean average capacity of an OSD within your cluster. -Consider multiplying that number by the number of OSDs you expect will fail -simultaneously during normal operations (a relatively small number). Finally -multiply the capacity of the cluster by the full ratio to arrive at a maximum -operating capacity; then, subtract the number of amount of data from the OSDs -you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing -process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at -a reasonable number for a near full ratio. - -.. 
code-block:: ini - - [global] - - mon osd full ratio = .80 - mon osd backfillfull ratio = .75 - mon osd nearfull ratio = .70 - - -``mon osd full ratio`` - -:Description: The percentage of disk space used before an OSD is - considered ``full``. - -:Type: Float -:Default: ``.95`` - - -``mon osd backfillfull ratio`` - -:Description: The percentage of disk space used before an OSD is - considered too ``full`` to backfill. - -:Type: Float -:Default: ``.90`` - - -``mon osd nearfull ratio`` - -:Description: The percentage of disk space used before an OSD is - considered ``nearfull``. - -:Type: Float -:Default: ``.85`` - - -.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you - may have a problem with the CRUSH weight for the nearfull OSDs. - -.. index:: heartbeat - -Heartbeat ---------- - -Ceph monitors know about the cluster by requiring reports from each OSD, and by -receiving reports from OSDs about the status of their neighboring OSDs. Ceph -provides reasonable default settings for monitor/OSD interaction; however, you -may modify them as needed. See `Monitor/OSD Interaction`_ for details. - - -.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization - -Monitor Store Synchronization ------------------------------ - -When you run a production cluster with multiple monitors (recommended), each -monitor checks to see if a neighboring monitor has a more recent version of the -cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers -higher than the most current epoch in the map of the instant monitor). -Periodically, one monitor in the cluster may fall behind the other monitors to -the point where it must leave the quorum, synchronize to retrieve the most -current information about the cluster, and then rejoin the quorum. For the -purposes of synchronization, monitors may assume one of three roles: - -#. **Leader**: The `Leader` is the first monitor to achieve the most recent - Paxos version of the cluster map. - -#. **Provider**: The `Provider` is a monitor that has the most recent version - of the cluster map, but wasn't the first to achieve the most recent version. - -#. **Requester:** A `Requester` is a monitor that has fallen behind the leader - and must synchronize in order to retrieve the most recent information about - the cluster before it can rejoin the quorum. - -These roles enable a leader to delegate synchronization duties to a provider, -which prevents synchronization requests from overloading the leader--improving -performance. In the following diagram, the requester has learned that it has -fallen behind the other monitors. The requester asks the leader to synchronize, -and the leader tells the requester to synchronize with a provider. - - -.. ditaa:: +-----------+ +---------+ +----------+ - | Requester | | Leader | | Provider | - +-----------+ +---------+ +----------+ - | | | - | | | - | Ask to Synchronize | | - |------------------->| | - | | | - |<-------------------| | - | Tell Requester to | | - | Sync with Provider | | - | | | - | Synchronize | - |--------------------+-------------------->| - | | | - |<-------------------+---------------------| - | Send Chunk to Requester | - | (repeat as necessary) | - | Requester Acks Chuck to Provider | - |--------------------+-------------------->| - | | - | Sync Complete | - | Notification | - |------------------->| - | | - |<-------------------| - | Ack | - | | - - -Synchronization always occurs when a new monitor joins the cluster. 
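-
-One convenient way to see which monitors are currently in quorum (and therefore
-synchronized) is, for example::
-
-    ceph quorum_status --format json-pretty
-    ceph mon stat
-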
During -runtime operations, monitors may receive updates to the cluster map at different -times. This means the leader and provider roles may migrate from one monitor to -another. If this happens while synchronizing (e.g., a provider falls behind the -leader), the provider can terminate synchronization with a requester. - -Once synchronization is complete, Ceph requires trimming across the cluster. -Trimming requires that the placement groups are ``active + clean``. - - -``mon sync trim timeout`` - -:Description: -:Type: Double -:Default: ``30.0`` - - -``mon sync heartbeat timeout`` - -:Description: -:Type: Double -:Default: ``30.0`` - - -``mon sync heartbeat interval`` - -:Description: -:Type: Double -:Default: ``5.0`` - - -``mon sync backoff timeout`` - -:Description: -:Type: Double -:Default: ``30.0`` - - -``mon sync timeout`` - -:Description: Number of seconds the monitor will wait for the next update - message from its sync provider before it gives up and bootstrap - again. -:Type: Double -:Default: ``30.0`` - - -``mon sync max retries`` - -:Description: -:Type: Integer -:Default: ``5`` - - -``mon sync max payload size`` - -:Description: The maximum size for a sync payload (in bytes). -:Type: 32-bit Integer -:Default: ``1045676`` - - -``paxos max join drift`` - -:Description: The maximum Paxos iterations before we must first sync the - monitor data stores. When a monitor finds that its peer is too - far ahead of it, it will first sync with data stores before moving - on. -:Type: Integer -:Default: ``10`` - -``paxos stash full interval`` - -:Description: How often (in commits) to stash a full copy of the PaxosService state. - Current this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` - PaxosServices. -:Type: Integer -:Default: 25 - -``paxos propose interval`` - -:Description: Gather updates for this time interval before proposing - a map update. -:Type: Double -:Default: ``1.0`` - - -``paxos min`` - -:Description: The minimum number of paxos states to keep around -:Type: Integer -:Default: 500 - - -``paxos min wait`` - -:Description: The minimum amount of time to gather updates after a period of - inactivity. -:Type: Double -:Default: ``0.05`` - - -``paxos trim min`` - -:Description: Number of extra proposals tolerated before trimming -:Type: Integer -:Default: 250 - - -``paxos trim max`` - -:Description: The maximum number of extra proposals to trim at a time -:Type: Integer -:Default: 500 - - -``paxos service trim min`` - -:Description: The minimum amount of versions to trigger a trim (0 disables it) -:Type: Integer -:Default: 250 - - -``paxos service trim max`` - -:Description: The maximum amount of versions to trim during a single proposal (0 disables it) -:Type: Integer -:Default: 500 - - -``mon max log epochs`` - -:Description: The maximum amount of log epochs to trim during a single proposal -:Type: Integer -:Default: 500 - - -``mon max pgmap epochs`` - -:Description: The maximum amount of pgmap epochs to trim during a single proposal -:Type: Integer -:Default: 500 - - -``mon mds force trim to`` - -:Description: Force monitor to trim mdsmaps to this point (0 disables it. - dangerous, use with care) -:Type: Integer -:Default: 0 - - -``mon osd force trim to`` - -:Description: Force monitor to trim osdmaps to this point, even if there is - PGs not clean at the specified epoch (0 disables it. 
dangerous, - use with care) -:Type: Integer -:Default: 0 - -``mon osd cache size`` - -:Description: The size of osdmaps cache, not to rely on underlying store's cache -:Type: Integer -:Default: 10 - - -``mon election timeout`` - -:Description: On election proposer, maximum waiting time for all ACKs in seconds. -:Type: Float -:Default: ``5`` - - -``mon lease`` - -:Description: The length (in seconds) of the lease on the monitor's versions. -:Type: Float -:Default: ``5`` - - -``mon lease renew interval factor`` - -:Description: ``mon lease`` \* ``mon lease renew interval factor`` will be the - interval for the Leader to renew the other monitor's leases. The - factor should be less than ``1.0``. -:Type: Float -:Default: ``0.6`` - - -``mon lease ack timeout factor`` - -:Description: The Leader will wait ``mon lease`` \* ``mon lease ack timeout factor`` - for the Providers to acknowledge the lease extension. -:Type: Float -:Default: ``2.0`` - - -``mon accept timeout factor`` - -:Description: The Leader will wait ``mon lease`` \* ``mon accept timeout factor`` - for the Requester(s) to accept a Paxos update. It is also used - during the Paxos recovery phase for similar purposes. -:Type: Float -:Default: ``2.0`` - - -``mon min osdmap epochs`` - -:Description: Minimum number of OSD map epochs to keep at all times. -:Type: 32-bit Integer -:Default: ``500`` - - -``mon max pgmap epochs`` - -:Description: Maximum number of PG map epochs the monitor should keep. -:Type: 32-bit Integer -:Default: ``500`` - - -``mon max log epochs`` - -:Description: Maximum number of Log epochs the monitor should keep. -:Type: 32-bit Integer -:Default: ``500`` - - - -.. index:: Ceph Monitor; clock - -Clock ------ - -Ceph daemons pass critical messages to each other, which must be processed -before daemons reach a timeout threshold. If the clocks in Ceph monitors -are not synchronized, it can lead to a number of anomalies. For example: - -- Daemons ignoring received messages (e.g., timestamps outdated) -- Timeouts triggered too soon/late when a message wasn't received in time. - -See `Monitor Store Synchronization`_ for details. - - -.. tip:: You SHOULD install NTP on your Ceph monitor hosts to - ensure that the monitor cluster operates with synchronized clocks. - -Clock drift may still be noticeable with NTP even though the discrepancy is not -yet harmful. Ceph's clock drift / clock skew warnings may get triggered even -though NTP maintains a reasonable level of synchronization. Increasing your -clock drift may be tolerable under such circumstances; however, a number of -factors such as workload, network latency, configuring overrides to default -timeouts and the `Monitor Store Synchronization`_ settings may influence -the level of acceptable clock drift without compromising Paxos guarantees. - -Ceph provides the following tunable options to allow you to find -acceptable values. - - -``clock offset`` - -:Description: How much to offset the system clock. See ``Clock.cc`` for details. -:Type: Double -:Default: ``0`` - - -.. deprecated:: 0.58 - -``mon tick interval`` - -:Description: A monitor's tick interval in seconds. -:Type: 32-bit Integer -:Default: ``5`` - - -``mon clock drift allowed`` - -:Description: The clock drift in seconds allowed between monitors. 
-:Type: Float -:Default: ``.050`` - - -``mon clock drift warn backoff`` - -:Description: Exponential backoff for clock drift warnings -:Type: Float -:Default: ``5`` - - -``mon timecheck interval`` - -:Description: The time check interval (clock drift check) in seconds - for the Leader. - -:Type: Float -:Default: ``300.0`` - - -``mon timecheck skew interval`` - -:Description: The time check interval (clock drift check) in seconds when in - presence of a skew in seconds for the Leader. -:Type: Float -:Default: ``30.0`` - - -Client ------- - -``mon client hunt interval`` - -:Description: The client will try a new monitor every ``N`` seconds until it - establishes a connection. - -:Type: Double -:Default: ``3.0`` - - -``mon client ping interval`` - -:Description: The client will ping the monitor every ``N`` seconds. -:Type: Double -:Default: ``10.0`` - - -``mon client max log entries per message`` - -:Description: The maximum number of log entries a monitor will generate - per client message. - -:Type: Integer -:Default: ``1000`` - - -``mon client bytes`` - -:Description: The amount of client message data allowed in memory (in bytes). -:Type: 64-bit Integer Unsigned -:Default: ``100ul << 20`` - - -Pool settings -============= -Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. - -Monitors can also disallow removal of pools if configured that way. - -``mon allow pool delete`` - -:Description: If the monitors should allow pools to be removed. Regardless of what the pool flags say. -:Type: Boolean -:Default: ``false`` - -``osd pool default flag hashpspool`` - -:Description: Set the hashpspool flag on new pools -:Type: Boolean -:Default: ``true`` - -``osd pool default flag nodelete`` - -:Description: Set the nodelete flag on new pools. Prevents allow pool removal with this flag in any way. -:Type: Boolean -:Default: ``false`` - -``osd pool default flag nopgchange`` - -:Description: Set the nopgchange flag on new pools. Does not allow the number of PGs to be changed for a pool. -:Type: Boolean -:Default: ``false`` - -``osd pool default flag nosizechange`` - -:Description: Set the nosizechange flag on new pools. Does not allow the size to be changed of pool. -:Type: Boolean -:Default: ``false`` - -For more information about the pool flags see `Pool values`_. - -Miscellaneous -============= - - -``mon max osd`` - -:Description: The maximum number of OSDs allowed in the cluster. -:Type: 32-bit Integer -:Default: ``10000`` - -``mon globalid prealloc`` - -:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. -:Type: 32-bit Integer -:Default: ``100`` - -``mon subscribe interval`` - -:Description: The refresh interval (in seconds) for subscriptions. The - subscription mechanism enables obtaining the cluster maps - and log information. - -:Type: Double -:Default: ``300`` - - -``mon stat smooth intervals`` - -:Description: Ceph will smooth statistics over the last ``N`` PG maps. -:Type: Integer -:Default: ``2`` - - -``mon probe timeout`` - -:Description: Number of seconds the monitor will wait to find peers before bootstrapping. -:Type: Double -:Default: ``2.0`` - - -``mon daemon bytes`` - -:Description: The message memory cap for metadata server and OSD messages (in bytes). -:Type: 64-bit Integer Unsigned -:Default: ``400ul << 20`` - - -``mon max log entries per event`` - -:Description: The maximum number of log entries per event. 
-:Type: Integer -:Default: ``4096`` - - -``mon osd prime pg temp`` - -:Description: Enables or disable priming the PGMap with the previous OSDs when an out - OSD comes back into the cluster. With the ``true`` setting the clients - will continue to use the previous OSDs until the newly in OSDs as that - PG peered. -:Type: Boolean -:Default: ``true`` - - -``mon osd prime pg temp max time`` - -:Description: How much time in seconds the monitor should spend trying to prime the - PGMap when an out OSD comes back into the cluster. -:Type: Float -:Default: ``0.5`` - - -``mon osd prime pg temp max time estimate`` - -:Description: Maximum estimate of time spent on each PG before we prime all PGs - in parallel. -:Type: Float -:Default: ``0.25`` - - -``mon osd allow primary affinity`` - -:Description: allow ``primary_affinity`` to be set in the osdmap. -:Type: Boolean -:Default: False - - -``mon osd pool ec fast read`` - -:Description: Whether turn on fast read on the pool or not. It will be used as - the default setting of newly created erasure pools if ``fast_read`` - is not specified at create time. -:Type: Boolean -:Default: False - - -``mon mds skip sanity`` - -:Description: Skip safety assertions on FSMap (in case of bugs where we want to - continue anyway). Monitor terminates if the FSMap sanity check - fails, but we can disable it by enabling this option. -:Type: Boolean -:Default: False - - -``mon max mdsmap epochs`` - -:Description: The maximum amount of mdsmap epochs to trim during a single proposal. -:Type: Integer -:Default: 500 - - -``mon config key max entry size`` - -:Description: The maximum size of config-key entry (in bytes) -:Type: Integer -:Default: 4096 - - -``mon scrub interval`` - -:Description: How often (in seconds) the monitor scrub its store by comparing - the stored checksums with the computed ones of all the stored - keys. -:Type: Integer -:Default: 3600*24 - - -``mon scrub max keys`` - -:Description: The maximum number of keys to scrub each time. -:Type: Integer -:Default: 100 - - -``mon compact on start`` - -:Description: Compact the database used as Ceph Monitor store on - ``ceph-mon`` start. A manual compaction helps to shrink the - monitor database and improve the performance of it if the regular - compaction fails to work. -:Type: Boolean -:Default: False - - -``mon compact on bootstrap`` - -:Description: Compact the database used as Ceph Monitor store on - on bootstrap. Monitor starts probing each other for creating - a quorum after bootstrap. If it times out before joining the - quorum, it will start over and bootstrap itself again. -:Type: Boolean -:Default: False - - -``mon compact on trim`` - -:Description: Compact a certain prefix (including paxos) when we trim its old states. -:Type: Boolean -:Default: True - - -``mon cpu threads`` - -:Description: Number of threads for performing CPU intensive work on monitor. -:Type: Boolean -:Default: True - - -``mon osd mapping pgs per chunk`` - -:Description: We calculate the mapping from placement group to OSDs in chunks. - This option specifies the number of placement groups per chunk. -:Type: Integer -:Default: 4096 - - -``mon osd max split count`` - -:Description: Largest number of PGs per "involved" OSD to let split create. - When we increase the ``pg_num`` of a pool, the placement groups - will be splitted on all OSDs serving that pool. We want to avoid - extreme multipliers on PG splits. 
-:Type: Integer -:Default: 300 - - -``mon session timeout`` - -:Description: Monitor will terminate inactive sessions stay idle over this - time limit. -:Type: Integer -:Default: 300 - - - -.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science) -.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys -.. _Ceph configuration file: ../ceph-conf/#monitors -.. _Network Configuration Reference: ../network-config-ref -.. _Monitor lookup through DNS: ../mon-lookup-dns -.. _ACID: http://en.wikipedia.org/wiki/ACID -.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons -.. _Add/Remove a Monitor (ceph-deploy): ../../deployment/ceph-deploy-mon -.. _Monitoring a Cluster: ../../operations/monitoring -.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg -.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap -.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address -.. _Monitor/OSD Interaction: ../mon-osd-interaction -.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability -.. _Pool values: ../../operations/pools/#set-pool-values diff --git a/src/ceph/doc/rados/configuration/mon-lookup-dns.rst b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst deleted file mode 100644 index e32b320..0000000 --- a/src/ceph/doc/rados/configuration/mon-lookup-dns.rst +++ /dev/null @@ -1,51 +0,0 @@ -=============================== -Looking op Monitors through DNS -=============================== - -Since version 11.0.0 RADOS supports looking up Monitors through DNS. - -This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file. - -Using DNS SRV TCP records clients are able to look up the monitors. - -This allows for less configuration on clients and monitors. Using a DNS update clients and daemons can be made aware of changes in the monitor topology. - -By default clients and daemons will look for the TCP service called *ceph-mon* which is configured by the *mon_dns_srv_name* configuration directive. - - -``mon dns srv name`` - -:Description: the service name used querying the DNS for the monitor hosts/addresses -:Type: String -:Default: ``ceph-mon`` - -Example -------- -When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements. - -First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA). - -:: - - mon1.example.com. AAAA 2001:db8::100 - mon2.example.com. AAAA 2001:db8::200 - mon3.example.com. AAAA 2001:db8::300 - -:: - - mon1.example.com. A 192.168.0.1 - mon2.example.com. A 192.168.0.2 - mon3.example.com. A 192.168.0.3 - - -With those records now existing we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors. - -:: - - _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon1.example.com. - _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon2.example.com. - _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon3.example.com. - -In this case the Monitors are running on port *6789*, and their priority and weight are all *10* and *60* respectively. - -The current implementation in clients and daemons will *only* respect the priority set in SRV records, and they will only connect to the monitors with lowest-numbered priority. The targets with the same priority will be selected at random. 
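-
-To verify that the records resolve as expected (assuming the *example.com*
-zone above), you can query them directly, for example::
-
-    dig +short -t SRV _ceph-mon._tcp.example.com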
diff --git a/src/ceph/doc/rados/configuration/mon-osd-interaction.rst b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst deleted file mode 100644 index e335ff0..0000000 --- a/src/ceph/doc/rados/configuration/mon-osd-interaction.rst +++ /dev/null @@ -1,408 +0,0 @@ -===================================== - Configuring Monitor/OSD Interaction -===================================== - -.. index:: heartbeat - -After you have completed your initial Ceph configuration, you may deploy and run -Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the -:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage -Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring -reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph -OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph -Monitor doesn't receive reports, or if it receives reports of changes in the -Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph -Cluster Map`. - -Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon -interaction. However, you may override the defaults. The following sections -describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of -monitoring the Ceph Storage Cluster. - -.. index:: heartbeat interval - -OSDs Check Heartbeats -===================== - -Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 -seconds. You can change the heartbeat interval by adding an ``osd heartbeat -interval`` setting under the ``[osd]`` section of your Ceph configuration file, -or by setting the value at runtime. If a neighboring Ceph OSD Daemon doesn't -show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may -consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph -Monitor, which will update the Ceph Cluster Map. You may change this grace -period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` -and ``[osd]`` or ``[global]`` section of your Ceph configuration file, -or by setting the value at runtime. - - -.. ditaa:: +---------+ +---------+ - | OSD 1 | | OSD 2 | - +---------+ +---------+ - | | - |----+ Heartbeat | - | | Interval | - |<---+ Exceeded | - | | - | Check | - | Heartbeat | - |------------------->| - | | - |<-------------------| - | Heart Beating | - | | - |----+ Heartbeat | - | | Interval | - |<---+ Exceeded | - | | - | Check | - | Heartbeat | - |------------------->| - | | - |----+ Grace | - | | Period | - |<---+ Exceeded | - | | - |----+ Mark | - | | OSD 2 | - |<---+ Down | - - -.. index:: OSD down report - -OSDs Report Down OSDs -===================== - -By default, two Ceph OSD Daemons from different hosts must report to the Ceph -Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors -acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance -that all the OSDs reporting the failure are hosted in a rack with a bad switch -which has trouble connecting to another OSD. To avoid this sort of false alarm, -we consider the peers reporting a failure a proxy for a potential "subcluster" -over the overall cluster that is similarly laggy. This is clearly not true in -all cases, but will sometimes help us localize the grace correction to a subset -of the system that is unhappy. ``mon osd reporter subtree level`` is used to -group the peers into the "subcluster" by their common ancestor type in CRUSH -map. 
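-
-For example, a hypothetical configuration that only accepts failure reports
-from OSDs in at least three different racks might look like this:
-
-.. code-block:: ini
-
-    [mon]
-        mon osd min down reporters = 3
-        mon osd reporter subtree level = rack
-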
By default, only two reports from different subtree are required to report -another Ceph OSD Daemon ``down``. You can change the number of reporters from -unique subtrees and the common ancestor type required to report a Ceph OSD -Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` -and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of -your Ceph configuration file, or by setting the value at runtime. - - -.. ditaa:: +---------+ +---------+ +---------+ - | OSD 1 | | OSD 2 | | Monitor | - +---------+ +---------+ +---------+ - | | | - | OSD 3 Is Down | | - |---------------+--------------->| - | | | - | | | - | | OSD 3 Is Down | - | |--------------->| - | | | - | | | - | | |---------+ Mark - | | | | OSD 3 - | | |<--------+ Down - - -.. index:: peering failure - -OSDs Report Peering Failure -=========================== - -If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its -Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for -the most recent copy of the cluster map every 30 seconds. You can change the -Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` -setting under the ``[osd]`` section of your Ceph configuration file, or by -setting the value at runtime. - -.. ditaa:: +---------+ +---------+ +-------+ +---------+ - | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | - +---------+ +---------+ +-------+ +---------+ - | | | | - | Request To | | | - | Peer | | | - |-------------->| | | - |<--------------| | | - | Peering | | - | | | - | Request To | | - | Peer | | - |----------------------------->| | - | | - |----+ OSD Monitor | - | | Heartbeat | - |<---+ Interval Exceeded | - | | - | Failed to Peer with OSD 3 | - |-------------------------------------------->| - |<--------------------------------------------| - | Receive New Cluster Map | - - -.. index:: OSD status - -OSDs Report Their Status -======================== - -If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will -consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` -elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable -event such as a failure, a change in placement group stats, a change in -``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD -Daemon minimum report interval by adding an ``osd mon report interval min`` -setting under the ``[osd]`` section of your Ceph configuration file, or by -setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph -Monitor every 120 seconds irrespective of whether any notable changes occur. -You can change the Ceph Monitor report interval by adding an ``osd mon report -interval max`` setting under the ``[osd]`` section of your Ceph configuration -file, or by setting the value at runtime. - - -.. 
ditaa:: +---------+ +---------+ - | OSD 1 | | Monitor | - +---------+ +---------+ - | | - |----+ Report Min | - | | Interval | - |<---+ Exceeded | - | | - |----+ Reportable | - | | Event | - |<---+ Occurs | - | | - | Report To | - | Monitor | - |------------------->| - | | - |----+ Report Max | - | | Interval | - |<---+ Exceeded | - | | - | Report To | - | Monitor | - |------------------->| - | | - |----+ Monitor | - | | Fails | - |<---+ | - +----+ Monitor OSD - | | Report Timeout - |<---+ Exceeded - | - +----+ Mark - | | OSD 1 - |<---+ Down - - - - -Configuration Settings -====================== - -When modifying heartbeat settings, you should include them in the ``[global]`` -section of your configuration file. - -.. index:: monitor heartbeat - -Monitor Settings ----------------- - -``mon osd min up ratio`` - -:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will - mark Ceph OSD Daemons ``down``. - -:Type: Double -:Default: ``.3`` - - -``mon osd min in ratio`` - -:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will - mark Ceph OSD Daemons ``out``. - -:Type: Double -:Default: ``.75`` - - -``mon osd laggy halflife`` - -:Description: The number of seconds laggy estimates will decay. -:Type: Integer -:Default: ``60*60`` - - -``mon osd laggy weight`` - -:Description: The weight for new samples in laggy estimation decay. -:Type: Double -:Default: ``0.3`` - - - -``mon osd laggy max interval`` - -:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds). - Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of - a certain OSD. This value will be used to calculate the grace time for - that OSD. -:Type: Integer -:Default: 300 - -``mon osd adjust heartbeat grace`` - -:Description: If set to ``true``, Ceph will scale based on laggy estimations. -:Type: Boolean -:Default: ``true`` - - -``mon osd adjust down out interval`` - -:Description: If set to ``true``, Ceph will scaled based on laggy estimations. -:Type: Boolean -:Default: ``true`` - - -``mon osd auto mark in`` - -:Description: Ceph will mark any booting Ceph OSD Daemons as ``in`` - the Ceph Storage Cluster. - -:Type: Boolean -:Default: ``false`` - - -``mon osd auto mark auto out in`` - -:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out`` - of the Ceph Storage Cluster as ``in`` the cluster. - -:Type: Boolean -:Default: ``true`` - - -``mon osd auto mark new in`` - -:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the - Ceph Storage Cluster. - -:Type: Boolean -:Default: ``true`` - - -``mon osd down out interval`` - -:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon - ``down`` and ``out`` if it doesn't respond. - -:Type: 32-bit Integer -:Default: ``600`` - - -``mon osd down out subtree limit`` - -:Description: The smallest :term:`CRUSH` unit type that Ceph will **not** - automatically mark out. For instance, if set to ``host`` and if - all OSDs of a host are down, Ceph will not automatically mark out - these OSDs. - -:Type: String -:Default: ``rack`` - - -``mon osd report timeout`` - -:Description: The grace period in seconds before declaring - unresponsive Ceph OSD Daemons ``down``. - -:Type: 32-bit Integer -:Default: ``900`` - -``mon osd min down reporters`` - -:Description: The minimum number of Ceph OSD Daemons required to report a - ``down`` Ceph OSD Daemon. 
- -:Type: 32-bit Integer -:Default: ``2`` - - -``mon osd reporter subtree level`` - -:Description: In which level of parent bucket the reporters are counted. The OSDs - send failure reports to monitor if they find its peer is not responsive. - And monitor mark the reported OSD out and then down after a grace period. -:Type: String -:Default: ``host`` - - -.. index:: OSD hearbeat - -OSD Settings ------------- - -``osd heartbeat address`` - -:Description: An Ceph OSD Daemon's network address for heartbeats. -:Type: Address -:Default: The host address. - - -``osd heartbeat interval`` - -:Description: How often an Ceph OSD Daemon pings its peers (in seconds). -:Type: 32-bit Integer -:Default: ``6`` - - -``osd heartbeat grace`` - -:Description: The elapsed time when a Ceph OSD Daemon hasn't shown a heartbeat - that the Ceph Storage Cluster considers it ``down``. - This setting has to be set in both the [mon] and [osd] or [global] - section so that it is read by both the MON and OSD daemons. -:Type: 32-bit Integer -:Default: ``20`` - - -``osd mon heartbeat interval`` - -:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no - Ceph OSD Daemon peers. - -:Type: 32-bit Integer -:Default: ``30`` - - -``osd mon report interval max`` - -:Description: The maximum time in seconds that a Ceph OSD Daemon can wait before - it must report to a Ceph Monitor. - -:Type: 32-bit Integer -:Default: ``120`` - - -``osd mon report interval min`` - -:Description: The minimum number of seconds a Ceph OSD Daemon may wait - from startup or another reportable event before reporting - to a Ceph Monitor. - -:Type: 32-bit Integer -:Default: ``5`` -:Valid Range: Should be less than ``osd mon report interval max`` - - -``osd mon ack timeout`` - -:Description: The number of seconds to wait for a Ceph Monitor to acknowledge a - request for statistics. - -:Type: 32-bit Integer -:Default: ``30`` diff --git a/src/ceph/doc/rados/configuration/ms-ref.rst b/src/ceph/doc/rados/configuration/ms-ref.rst deleted file mode 100644 index 55d009e..0000000 --- a/src/ceph/doc/rados/configuration/ms-ref.rst +++ /dev/null @@ -1,154 +0,0 @@ -=========== - Messaging -=========== - -General Settings -================ - -``ms tcp nodelay`` - -:Description: Disables nagle's algorithm on messenger tcp sessions. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``ms initial backoff`` - -:Description: The initial time to wait before reconnecting on a fault. -:Type: Double -:Required: No -:Default: ``.2`` - - -``ms max backoff`` - -:Description: The maximum time to wait before reconnecting on a fault. -:Type: Double -:Required: No -:Default: ``15.0`` - - -``ms nocrc`` - -:Description: Disables crc on network messages. May increase performance if cpu limited. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``ms die on bad msg`` - -:Description: Debug option; do not configure. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``ms dispatch throttle bytes`` - -:Description: Throttles total size of messages waiting to be dispatched. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``100 << 20`` - - -``ms bind ipv6`` - -:Description: Enable if you want your daemons to bind to IPv6 address instead of IPv4 ones. (Not required if you specify a daemon or cluster IP.) -:Type: Boolean -:Required: No -:Default: ``false`` - - -``ms rwthread stack bytes`` - -:Description: Debug option for stack size; do not configure. 
-:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``1024 << 10`` - - -``ms tcp read timeout`` - -:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``900`` - - -``ms inject socket failures`` - -:Description: Debug option; do not configure. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``0`` - -Async messenger options -======================= - - -``ms async transport type`` - -:Description: Transport type used by Async Messenger. Can be ``posix``, ``dpdk`` - or ``rdma``. Posix uses standard TCP/IP networking and is default. - Other transports may be experimental and support may be limited. -:Type: String -:Required: No -:Default: ``posix`` - - -``ms async op threads`` - -:Description: Initial number of worker threads used by each Async Messenger instance. - Should be at least equal to highest number of replicas, but you can - decrease it if you are low on CPU core count and/or you host a lot of - OSDs on single server. -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``3`` - - -``ms async max op threads`` - -:Description: Maximum number of worker threads used by each Async Messenger instance. - Set to lower values when your machine has limited CPU count, and increase - when your CPUs are underutilized (i. e. one or more of CPUs are - constantly on 100% load during I/O operations). -:Type: 64-bit Unsigned Integer -:Required: No -:Default: ``5`` - - -``ms async set affinity`` - -:Description: Set to true to bind Async Messenger workers to particular CPU cores. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``ms async affinity cores`` - -:Description: When ``ms async set affinity`` is true, this string specifies how Async - Messenger workers are bound to CPU cores. For example, "0,2" will bind - workers #1 and #2 to CPU cores #0 and #2, respectively. - NOTE: when manually setting affinity, make sure to not assign workers to - processors that are virtual CPUs created as an effect of Hyperthreading - or similar technology, because they are slower than regular CPU cores. -:Type: String -:Required: No -:Default: ``(empty)`` - - -``ms async send inline`` - -:Description: Send messages directly from the thread that generated them instead of - queuing and sending from Async Messenger thread. This option is known - to decrease performance on systems with a lot of CPU cores, so it's - disabled by default. -:Type: Boolean -:Required: No -:Default: ``false`` - - diff --git a/src/ceph/doc/rados/configuration/network-config-ref.rst b/src/ceph/doc/rados/configuration/network-config-ref.rst deleted file mode 100644 index 2d7f9d6..0000000 --- a/src/ceph/doc/rados/configuration/network-config-ref.rst +++ /dev/null @@ -1,494 +0,0 @@ -================================= - Network Configuration Reference -================================= - -Network configuration is critical for building a high performance :term:`Ceph -Storage Cluster`. The Ceph Storage Cluster does not perform request routing or -dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make -requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication -on behalf of Ceph Clients, which means replication and other factors impose -additional loads on Ceph Storage Cluster networks. - -Our Quick Start configurations provide a trivial `Ceph configuration file`_ that -sets monitor IP addresses and daemon host names only. 
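-As a rough sketch only (the host name and addresses below are placeholders,
-not values taken from the Quick Start guide), such a minimal file might look
-like this:
-
-.. code-block:: ini
-
-    [global]
-    fsid = {cluster-uuid}
-    mon initial members = node1
-    mon host = 192.168.0.10
-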
Unless you specify a -cluster network, Ceph assumes a single "public" network. Ceph functions just -fine with a public network only, but you may see significant performance -improvement with a second "cluster" network in a large cluster. - -We recommend running a Ceph Storage Cluster with two networks: a public -(front-side) network and a cluster (back-side) network. To support two networks, -each :term:`Ceph Node` will need to have more than one NIC. See `Hardware -Recommendations - Networks`_ for additional details. - -.. ditaa:: - +-------------+ - | Ceph Client | - +----*--*-----+ - | ^ - Request | : Response - v | - /----------------------------------*--*-------------------------------------\ - | Public Network | - \---*--*------------*--*-------------*--*------------*--*------------*--*---/ - ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ - | | | | | | | | | | - | : | : | : | : | : - v v v v v v v v v v - +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ - | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | - +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ - ^ ^ ^ ^ ^ ^ - The cluster network relieves | | | | | | - OSD replication and heartbeat | : | : | : - traffic from the public network. v v v v v v - /------------------------------------*--*------------*--*------------*--*---\ - | cCCC Cluster Network | - \---------------------------------------------------------------------------/ - - -There are several reasons to consider operating two separate networks: - -#. **Performance:** Ceph OSD Daemons handle data replication for the Ceph - Clients. When Ceph OSD Daemons replicate data more than once, the network - load between Ceph OSD Daemons easily dwarfs the network load between Ceph - Clients and the Ceph Storage Cluster. This can introduce latency and - create a performance problem. Recovery and rebalancing can - also introduce significant latency on the public network. See - `Scalability and High Availability`_ for additional details on how Ceph - replicates data. See `Monitor / OSD Interaction`_ for details on heartbeat - traffic. - -#. **Security**: While most people are generally civil, a very tiny segment of - the population likes to engage in what's known as a Denial of Service (DoS) - attack. When traffic between Ceph OSD Daemons gets disrupted, placement - groups may no longer reflect an ``active + clean`` state, which may prevent - users from reading and writing data. A great way to defeat this type of - attack is to maintain a completely separate cluster network that doesn't - connect directly to the internet. Also, consider using `Message Signatures`_ - to defeat spoofing attacks. - - -IP Tables -========= - -By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may -configure this range at your discretion. Before configuring your IP tables, -check the default ``iptables`` configuration. - - sudo iptables -L - -Some Linux distributions include rules that reject all inbound requests -except SSH from all network interfaces. For example:: - - REJECT all -- anywhere anywhere reject-with icmp-host-prohibited - -You will need to delete these rules on both your public and cluster networks -initially, and replace them with appropriate rules when you are ready to -harden the ports on your Ceph Nodes. - - -Monitor IP Tables ------------------ - -Ceph Monitors listen on port ``6789`` by default. Additionally, Ceph Monitors -always operate on the public network. 
When you add the rule using the example -below, make sure you replace ``{iface}`` with the public network interface -(e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP address of the -public network and ``{netmask}`` with the netmask for the public network. :: - - sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT - - -MDS IP Tables -------------- - -A :term:`Ceph Metadata Server` listens on the first available port on the public -network beginning at port 6800. Note that this behavior is not deterministic, so -if you are running more than one OSD or MDS on the same host, or if you restart -the daemons within a short window of time, the daemons will bind to higher -ports. You should open the entire 6800-7300 range by default. When you add the -rule using the example below, make sure you replace ``{iface}`` with the public -network interface (e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP -address of the public network and ``{netmask}`` with the netmask of the public -network. - -For example:: - - sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT - - -OSD IP Tables -------------- - -By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node -beginning at port 6800. Note that this behavior is not deterministic, so if you -are running more than one OSD or MDS on the same host, or if you restart the -daemons within a short window of time, the daemons will bind to higher ports. -Each Ceph OSD Daemon on a Ceph Node may use up to four ports: - -#. One for talking to clients and monitors. -#. One for sending data to other OSDs. -#. Two for heartbeating on each interface. - -.. ditaa:: - /---------------\ - | OSD | - | +---+----------------+-----------+ - | | Clients & Monitors | Heartbeat | - | +---+----------------+-----------+ - | | - | +---+----------------+-----------+ - | | Data Replication | Heartbeat | - | +---+----------------+-----------+ - | cCCC | - \---------------/ - -When a daemon fails and restarts without letting go of the port, the restarted -daemon will bind to a new port. You should open the entire 6800-7300 port range -to handle this possibility. - -If you set up separate public and cluster networks, you must add rules for both -the public network and the cluster network, because clients will connect using -the public network and other Ceph OSD Daemons will connect using the cluster -network. When you add the rule using the example below, make sure you replace -``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), -``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the -public or cluster network. For example:: - - sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT - -.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the - Ceph OSD Daemons, you can consolidate the public network configuration step. - - -Ceph Networks -============= - -To configure Ceph networks, you must add a network configuration to the -``[global]`` section of the configuration file. Our 5-minute Quick Start -provides a trivial `Ceph configuration file`_ that assumes one public network -with client and server on the same network and subnet. Ceph functions just fine -with a public network only. However, Ceph allows you to establish much more -specific criteria, including multiple IP network and subnet masks for your -public network. 
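-For example, the following sketch (with placeholder subnets) declares two
-comma-delimited subnets for the public network; see the tip below about
-keeping such subnets routable to each other:
-
-.. code-block:: ini
-
-    [global]
-    public network = 10.10.0.0/24, 10.11.0.0/24
-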
You can also establish a separate cluster network to handle OSD -heartbeat, object replication and recovery traffic. Don't confuse the IP -addresses you set in your configuration with the public-facing IP addresses -network clients may use to access your service. Typical internal IP networks are -often ``192.168.0.0`` or ``10.0.0.0``. - -.. tip:: If you specify more than one IP address and subnet mask for - either the public or the cluster network, the subnets within the network - must be capable of routing to each other. Additionally, make sure you - include each IP address/subnet in your IP tables and open ports for them - as necessary. - -.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). - -When you have configured your networks, you may restart your cluster or restart -each daemon. Ceph daemons bind dynamically, so you do not have to restart the -entire cluster at once if you change your network configuration. - - -Public Network --------------- - -To configure a public network, add the following option to the ``[global]`` -section of your Ceph configuration file. - -.. code-block:: ini - - [global] - ... - public network = {public-network/netmask} - - -Cluster Network ---------------- - -If you declare a cluster network, OSDs will route heartbeat, object replication -and recovery traffic over the cluster network. This may improve performance -compared to using a single network. To configure a cluster network, add the -following option to the ``[global]`` section of your Ceph configuration file. - -.. code-block:: ini - - [global] - ... - cluster network = {cluster-network/netmask} - -We prefer that the cluster network is **NOT** reachable from the public network -or the Internet for added security. - - -Ceph Daemons -============ - -Ceph has one network configuration requirement that applies to all daemons: the -Ceph configuration file **MUST** specify the ``host`` for each daemon. Ceph also -requires that a Ceph configuration file specify the monitor IP address and its -port. - -.. important:: Some deployment tools (e.g., ``ceph-deploy``, Chef) may create a - configuration file for you. **DO NOT** set these values if the deployment - tool does it for you. - -.. tip:: The ``host`` setting is the short name of the host (i.e., not - an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on - the command line to retrieve the name of the host. - - -.. code-block:: ini - - [mon.a] - - host = {hostname} - mon addr = {ip-address}:6789 - - [osd.0] - host = {hostname} - - -You do not have to set the host IP address for a daemon. If you have a static IP -configuration and both public and cluster networks running, the Ceph -configuration file may specify the IP address of the host for each daemon. To -set a static IP address for a daemon, the following option(s) should appear in -the daemon instance sections of your ``ceph.conf`` file. - -.. code-block:: ini - - [osd.0] - public addr = {host-public-ip-address} - cluster addr = {host-cluster-ip-address} - - -.. topic:: One NIC OSD in a Two Network Cluster - - Generally, we do not recommend deploying an OSD host with a single NIC in a - cluster with two networks. However, you may accomplish this by forcing the - OSD host to operate on the public network by adding a ``public addr`` entry - to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` - refers to the number of the OSD with one NIC. 
Additionally, the public - network and cluster network must be able to route traffic to each other, - which we don't recommend for security reasons. - - -Network Config Settings -======================= - -Network configuration settings are not required. Ceph assumes a public network -with all hosts operating on it unless you specifically configure a cluster -network. - - -Public Network --------------- - -The public network configuration allows you specifically define IP addresses -and subnets for the public network. You may specifically assign static IP -addresses or override ``public network`` settings using the ``public addr`` -setting for a specific daemon. - -``public network`` - -:Description: The IP address and netmask of the public (front-side) network - (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify - comma-delimited subnets. - -:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` -:Required: No -:Default: N/A - - -``public addr`` - -:Description: The IP address for the public (front-side) network. - Set for each daemon. - -:Type: IP Address -:Required: No -:Default: N/A - - - -Cluster Network ---------------- - -The cluster network configuration allows you to declare a cluster network, and -specifically define IP addresses and subnets for the cluster network. You may -specifically assign static IP addresses or override ``cluster network`` -settings using the ``cluster addr`` setting for specific OSD daemons. - - -``cluster network`` - -:Description: The IP address and netmask of the cluster (back-side) network - (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify - comma-delimited subnets. - -:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` -:Required: No -:Default: N/A - - -``cluster addr`` - -:Description: The IP address for the cluster (back-side) network. - Set for each daemon. - -:Type: Address -:Required: No -:Default: N/A - - -Bind ----- - -Bind settings set the default port ranges Ceph OSD and MDS daemons use. The -default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration -allows you to use the configured port range. - -You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 -addresses. - - -``ms bind port min`` - -:Description: The minimum port number to which an OSD or MDS daemon will bind. -:Type: 32-bit Integer -:Default: ``6800`` -:Required: No - - -``ms bind port max`` - -:Description: The maximum port number to which an OSD or MDS daemon will bind. -:Type: 32-bit Integer -:Default: ``7300`` -:Required: No. - - -``ms bind ipv6`` - -:Description: Enables Ceph daemons to bind to IPv6 addresses. Currently the - messenger *either* uses IPv4 or IPv6, but it cannot do both. -:Type: Boolean -:Default: ``false`` -:Required: No - -``public bind addr`` - -:Description: In some dynamic deployments the Ceph MON daemon might bind - to an IP address locally that is different from the ``public addr`` - advertised to other peers in the network. The environment must ensure - that routing rules are set correclty. If ``public bind addr`` is set - the Ceph MON daemon will bind to it locally and use ``public addr`` - in the monmaps to advertise its address to peers. This behavior is limited - to the MON daemon. - -:Type: IP Address -:Required: No -:Default: N/A - - - -Hosts ------ - -Ceph expects at least one monitor declared in the Ceph configuration file, with -a ``mon addr`` setting under each declared monitor. 
Ceph expects a ``host`` -setting under each declared monitor, metadata server and OSD in the Ceph -configuration file. Optionally, a monitor can be assigned with a priority, and -the clients will always connect to the monitor with lower value of priority if -specified. - - -``mon addr`` - -:Description: A list of ``{hostname}:{port}`` entries that clients can use to - connect to a Ceph monitor. If not set, Ceph searches ``[mon.*]`` - sections. - -:Type: String -:Required: No -:Default: N/A - -``mon priority`` - -:Description: The priority of the declared monitor, the lower value the more - prefered when a client selects a monitor when trying to connect - to the cluster. - -:Type: Unsigned 16-bit Integer -:Required: No -:Default: 0 - -``host`` - -:Description: The hostname. Use this setting for specific daemon instances - (e.g., ``[osd.0]``). - -:Type: String -:Required: Yes, for daemon instances. -:Default: ``localhost`` - -.. tip:: Do not use ``localhost``. To get your host name, execute - ``hostname -s`` on your command line and use the name of your host - (to the first period, not the fully-qualified domain name). - -.. important:: You should not specify any value for ``host`` when using a third - party deployment system that retrieves the host name for you. - - - -TCP ---- - -Ceph disables TCP buffering by default. - - -``ms tcp nodelay`` - -:Description: Ceph enables ``ms tcp nodelay`` so that each request is sent - immediately (no buffering). Disabling `Nagle's algorithm`_ - increases network traffic, which can introduce latency. If you - experience large numbers of small packets, you may try - disabling ``ms tcp nodelay``. - -:Type: Boolean -:Required: No -:Default: ``true`` - - - -``ms tcp rcvbuf`` - -:Description: The size of the socket buffer on the receiving end of a network - connection. Disable by default. - -:Type: 32-bit Integer -:Required: No -:Default: ``0`` - - - -``ms tcp read timeout`` - -:Description: If a client or daemon makes a request to another Ceph daemon and - does not drop an unused connection, the ``ms tcp read timeout`` - defines the connection as idle after the specified number - of seconds. - -:Type: Unsigned 64-bit Integer -:Required: No -:Default: ``900`` 15 minutes. - - - -.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability -.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks -.. _Ceph configuration file: ../../../start/quick-ceph-deploy/#create-a-cluster -.. _hardware recommendations: ../../../start/hardware-recommendations -.. _Monitor / OSD Interaction: ../mon-osd-interaction -.. _Message Signatures: ../auth-config-ref#signatures -.. _CIDR: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing -.. _Nagle's Algorithm: http://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/src/ceph/doc/rados/configuration/osd-config-ref.rst b/src/ceph/doc/rados/configuration/osd-config-ref.rst deleted file mode 100644 index fae7078..0000000 --- a/src/ceph/doc/rados/configuration/osd-config-ref.rst +++ /dev/null @@ -1,1105 +0,0 @@ -====================== - OSD Config Reference -====================== - -.. index:: OSD; configuration - -You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD -Daemons can use the default values and a very minimal configuration. A minimal -Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and -uses default values for nearly everything else. 
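-If you want to check which values a running OSD has actually resolved
-(defaults plus any overrides), you can query its admin socket on the OSD
-host. This is only an illustrative check; ``osd.0`` is a placeholder::
-
-    ceph daemon osd.0 config get osd_journal_size
-    ceph daemon osd.0 config show | grep journal
-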
- -Ceph OSD Daemons are numerically identified in incremental fashion, beginning -with ``0`` using the following convention. :: - - osd.0 - osd.1 - osd.2 - -In a configuration file, you may specify settings for all Ceph OSD Daemons in -the cluster by adding configuration settings to the ``[osd]`` section of your -configuration file. To add settings directly to a specific Ceph OSD Daemon -(e.g., ``host``), enter it in an OSD-specific section of your configuration -file. For example: - -.. code-block:: ini - - [osd] - osd journal size = 1024 - - [osd.0] - host = osd-host-a - - [osd.1] - host = osd-host-b - - -.. index:: OSD; config settings - -General Settings -================ - -The following settings provide an Ceph OSD Daemon's ID, and determine paths to -data and journals. Ceph deployment scripts typically generate the UUID -automatically. We **DO NOT** recommend changing the default paths for data or -journals, as it makes it more problematic to troubleshoot Ceph later. - -The journal size should be at least twice the product of the expected drive -speed multiplied by ``filestore max sync interval``. However, the most common -practice is to partition the journal drive (often an SSD), and mount it such -that Ceph uses the entire partition for the journal. - - -``osd uuid`` - -:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon. -:Type: UUID -:Default: The UUID. -:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid`` - applies to the entire cluster. - - -``osd data`` - -:Description: The path to the OSDs data. You must create the directory when - deploying Ceph. You should mount a drive for OSD data at this - mount point. We do not recommend changing the default. - -:Type: String -:Default: ``/var/lib/ceph/osd/$cluster-$id`` - - -``osd max write size`` - -:Description: The maximum size of a write in megabytes. -:Type: 32-bit Integer -:Default: ``90`` - - -``osd client message size cap`` - -:Description: The largest client data message allowed in memory. -:Type: 64-bit Unsigned Integer -:Default: 500MB default. ``500*1024L*1024L`` - - -``osd class dir`` - -:Description: The class path for RADOS class plug-ins. -:Type: String -:Default: ``$libdir/rados-classes`` - - -.. index:: OSD; file system - -File System Settings -==================== -Ceph builds and mounts file systems which are used for Ceph OSDs. - -``osd mkfs options {fs-type}`` - -:Description: Options used when creating a new Ceph OSD of type {fs-type}. - -:Type: String -:Default for xfs: ``-f -i 2048`` -:Default for other file systems: {empty string} - -For example:: - ``osd mkfs options xfs = -f -d agcount=24`` - -``osd mount options {fs-type}`` - -:Description: Options used when mounting a Ceph OSD of type {fs-type}. - -:Type: String -:Default for xfs: ``rw,noatime,inode64`` -:Default for other file systems: ``rw, noatime`` - -For example:: - ``osd mount options xfs = rw, noatime, inode64, logbufs=8`` - - -.. index:: OSD; journal settings - -Journal Settings -================ - -By default, Ceph expects that you will store an Ceph OSD Daemons journal with -the following path:: - - /var/lib/ceph/osd/$cluster-$id/journal - -Without performance optimization, Ceph stores the journal on the same disk as -the Ceph OSD Daemons data. An Ceph OSD Daemon optimized for performance may use -a separate disk to store journal data (e.g., a solid state drive delivers high -performance journaling). 
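-As an illustration only (the device path is a placeholder), you can point an
-OSD's journal at a partition on a separate SSD with the ``osd journal``
-setting described below:
-
-.. code-block:: ini
-
-    [osd.0]
-    host = osd-host-a
-    # Placeholder: a dedicated partition on a separate SSD
-    osd journal = /dev/sdb1
-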
- -Ceph's default ``osd journal size`` is 0, so you will need to set this in your -``ceph.conf`` file. A journal size should find the product of the ``filestore -max sync interval`` and the expected throughput, and multiply the product by -two (2):: - - osd journal size = {2 * (expected throughput * filestore max sync interval)} - -The expected throughput number should include the expected disk throughput -(i.e., sustained data transfer rate), and network throughput. For example, -a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()`` -of the disk and network throughput should provide a reasonable expected -throughput. Some users just start off with a 10GB journal size. For -example:: - - osd journal size = 10000 - - -``osd journal`` - -:Description: The path to the OSD's journal. This may be a path to a file or a - block device (such as a partition of an SSD). If it is a file, - you must create the directory to contain it. We recommend using a - drive separate from the ``osd data`` drive. - -:Type: String -:Default: ``/var/lib/ceph/osd/$cluster-$id/journal`` - - -``osd journal size`` - -:Description: The size of the journal in megabytes. If this is 0, and the - journal is a block device, the entire block device is used. - Since v0.54, this is ignored if the journal is a block device, - and the entire block device is used. - -:Type: 32-bit Integer -:Default: ``5120`` -:Recommended: Begin with 1GB. Should be at least twice the product of the - expected speed multiplied by ``filestore max sync interval``. - - -See `Journal Config Reference`_ for additional details. - - -Monitor OSD Interaction -======================= - -Ceph OSD Daemons check each other's heartbeats and report to monitors -periodically. Ceph can use default values in many cases. However, if your -network has latency issues, you may need to adopt longer intervals. See -`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. - - -Data Placement -============== - -See `Pool & PG Config Reference`_ for details. - - -.. index:: OSD; scrubbing - -Scrubbing -========= - -In addition to making multiple copies of objects, Ceph insures data integrity by -scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the -object storage layer. For each placement group, Ceph generates a catalog of all -objects and compares each primary object and its replicas to ensure that no -objects are missing or mismatched. Light scrubbing (daily) checks the object -size and attributes. Deep scrubbing (weekly) reads the data and uses checksums -to ensure data integrity. - -Scrubbing is important for maintaining data integrity, but it can reduce -performance. You can adjust the following settings to increase or decrease -scrubbing operations. - - -``osd max scrubs`` - -:Description: The maximum number of simultaneous scrub operations for - a Ceph OSD Daemon. - -:Type: 32-bit Int -:Default: ``1`` - -``osd scrub begin hour`` - -:Description: The time of day for the lower bound when a scheduled scrub can be - performed. -:Type: Integer in the range of 0 to 24 -:Default: ``0`` - - -``osd scrub end hour`` - -:Description: The time of day for the upper bound when a scheduled scrub can be - performed. Along with ``osd scrub begin hour``, they define a time - window, in which the scrubs can happen. But a scrub will be performed - no matter the time window allows or not, as long as the placement - group's scrub interval exceeds ``osd scrub max interval``. 
-:Type: Integer in the range of 0 to 24 -:Default: ``24`` - - -``osd scrub during recovery`` - -:Description: Allow scrub during recovery. Setting this to ``false`` will disable - scheduling new scrub (and deep--scrub) while there is active recovery. - Already running scrubs will be continued. This might be useful to reduce - load on busy clusters. -:Type: Boolean -:Default: ``true`` - - -``osd scrub thread timeout`` - -:Description: The maximum time in seconds before timing out a scrub thread. -:Type: 32-bit Integer -:Default: ``60`` - - -``osd scrub finalize thread timeout`` - -:Description: The maximum time in seconds before timing out a scrub finalize - thread. - -:Type: 32-bit Integer -:Default: ``60*10`` - - -``osd scrub load threshold`` - -:Description: The maximum load. Ceph will not scrub when the system load - (as defined by ``getloadavg()``) is higher than this number. - Default is ``0.5``. - -:Type: Float -:Default: ``0.5`` - - -``osd scrub min interval`` - -:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon - when the Ceph Storage Cluster load is low. - -:Type: Float -:Default: Once per day. ``60*60*24`` - - -``osd scrub max interval`` - -:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon - irrespective of cluster load. - -:Type: Float -:Default: Once per week. ``7*60*60*24`` - - -``osd scrub chunk min`` - -:Description: The minimal number of object store chunks to scrub during single operation. - Ceph blocks writes to single chunk during scrub. - -:Type: 32-bit Integer -:Default: 5 - - -``osd scrub chunk max`` - -:Description: The maximum number of object store chunks to scrub during single operation. - -:Type: 32-bit Integer -:Default: 25 - - -``osd scrub sleep`` - -:Description: Time to sleep before scrubbing next group of chunks. Increasing this value will slow - down whole scrub operation while client operations will be less impacted. - -:Type: Float -:Default: 0 - - -``osd deep scrub interval`` - -:Description: The interval for "deep" scrubbing (fully reading all data). The - ``osd scrub load threshold`` does not affect this setting. - -:Type: Float -:Default: Once per week. ``60*60*24*7`` - - -``osd scrub interval randomize ratio`` - -:Description: Add a random delay to ``osd scrub min interval`` when scheduling - the next scrub job for a placement group. The delay is a random - value less than ``osd scrub min interval`` \* - ``osd scrub interval randomized ratio``. So the default setting - practically randomly spreads the scrubs out in the allowed time - window of ``[1, 1.5]`` \* ``osd scrub min interval``. -:Type: Float -:Default: ``0.5`` - -``osd deep scrub stride`` - -:Description: Read size when doing a deep scrub. -:Type: 32-bit Integer -:Default: 512 KB. ``524288`` - - -.. index:: OSD; operations settings - -Operations -========== - -Operations settings allow you to configure the number of threads for servicing -requests. If you set ``osd op threads`` to ``0``, it disables multi-threading. -By default, Ceph uses two threads with a 30 second timeout and a 30 second -complaint time if an operation doesn't complete within those time parameters. -You can set operations priority weights between client operations and -recovery operations to ensure optimal performance during recovery. - - -``osd op threads`` - -:Description: The number of threads to service Ceph OSD Daemon operations. - Set to ``0`` to disable it. Increasing the number may increase - the request processing rate. 
- -:Type: 32-bit Integer -:Default: ``2`` - - -``osd op queue`` - -:Description: This sets the type of queue to be used for prioritizing ops - in the OSDs. Both queues feature a strict sub-queue which is - dequeued before the normal queue. The normal queue is different - between implementations. The original PrioritizedQueue (``prio``) uses a - token bucket system which when there are sufficient tokens will - dequeue high priority queues first. If there are not enough - tokens available, queues are dequeued low priority to high priority. - The WeightedPriorityQueue (``wpq``) dequeues all priorities in - relation to their priorities to prevent starvation of any queue. - WPQ should help in cases where a few OSDs are more overloaded - than others. The new mClock based OpClassQueue - (``mclock_opclass``) prioritizes operations based on which class - they belong to (recovery, scrub, snaptrim, client op, osd subop). - And, the mClock based ClientQueue (``mclock_client``) also - incorporates the client identifier in order to promote fairness - between clients. See `QoS Based on mClock`_. Requires a restart. - -:Type: String -:Valid Choices: prio, wpq, mclock_opclass, mclock_client -:Default: ``prio`` - - -``osd op queue cut off`` - -:Description: This selects which priority ops will be sent to the strict - queue verses the normal queue. The ``low`` setting sends all - replication ops and higher to the strict queue, while the ``high`` - option sends only replication acknowledgement ops and higher to - the strict queue. Setting this to ``high`` should help when a few - OSDs in the cluster are very busy especially when combined with - ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy - handling replication traffic could starve primary client traffic - on these OSDs without these settings. Requires a restart. - -:Type: String -:Valid Choices: low, high -:Default: ``low`` - - -``osd client op priority`` - -:Description: The priority set for client operations. It is relative to - ``osd recovery op priority``. - -:Type: 32-bit Integer -:Default: ``63`` -:Valid Range: 1-63 - - -``osd recovery op priority`` - -:Description: The priority set for recovery operations. It is relative to - ``osd client op priority``. - -:Type: 32-bit Integer -:Default: ``3`` -:Valid Range: 1-63 - - -``osd scrub priority`` - -:Description: The priority set for scrub operations. It is relative to - ``osd client op priority``. - -:Type: 32-bit Integer -:Default: ``5`` -:Valid Range: 1-63 - - -``osd snap trim priority`` - -:Description: The priority set for snap trim operations. It is relative to - ``osd client op priority``. - -:Type: 32-bit Integer -:Default: ``5`` -:Valid Range: 1-63 - - -``osd op thread timeout`` - -:Description: The Ceph OSD Daemon operation thread timeout in seconds. -:Type: 32-bit Integer -:Default: ``15`` - - -``osd op complaint time`` - -:Description: An operation becomes complaint worthy after the specified number - of seconds have elapsed. - -:Type: Float -:Default: ``30`` - - -``osd disk threads`` - -:Description: The number of disk threads, which are used to perform background - disk intensive OSD operations such as scrubbing and snap - trimming. - -:Type: 32-bit Integer -:Default: ``1`` - -``osd disk thread ioprio class`` - -:Description: Warning: it will only be used if both ``osd disk thread - ioprio class`` and ``osd disk thread ioprio priority`` are - set to a non default value. Sets the ioprio_set(2) I/O - scheduling ``class`` for the disk thread. 
Acceptable - values are ``idle``, ``be`` or ``rt``. The ``idle`` - class means the disk thread will have lower priority - than any other thread in the OSD. This is useful to slow - down scrubbing on an OSD that is busy handling client - operations. ``be`` is the default and is the same - priority as all other threads in the OSD. ``rt`` means - the disk thread will have precendence over all other - threads in the OSD. Note: Only works with the Linux Kernel - CFQ scheduler. Since Jewel scrubbing is no longer carried - out by the disk iothread, see osd priority options instead. -:Type: String -:Default: the empty string - -``osd disk thread ioprio priority`` - -:Description: Warning: it will only be used if both ``osd disk thread - ioprio class`` and ``osd disk thread ioprio priority`` are - set to a non default value. It sets the ioprio_set(2) - I/O scheduling ``priority`` of the disk thread ranging - from 0 (highest) to 7 (lowest). If all OSDs on a given - host were in class ``idle`` and compete for I/O - (i.e. due to controller congestion), it can be used to - lower the disk thread priority of one OSD to 7 so that - another OSD with priority 0 can have priority. - Note: Only works with the Linux Kernel CFQ scheduler. -:Type: Integer in the range of 0 to 7 or -1 if not to be used. -:Default: ``-1`` - -``osd op history size`` - -:Description: The maximum number of completed operations to track. -:Type: 32-bit Unsigned Integer -:Default: ``20`` - - -``osd op history duration`` - -:Description: The oldest completed operation to track. -:Type: 32-bit Unsigned Integer -:Default: ``600`` - - -``osd op log threshold`` - -:Description: How many operations logs to display at once. -:Type: 32-bit Integer -:Default: ``5`` - - -QoS Based on mClock -------------------- - -Ceph's use of mClock is currently in the experimental phase and should -be approached with an exploratory mindset. - -Core Concepts -````````````` - -The QoS support of Ceph is implemented using a queueing scheduler -based on `the dmClock algorithm`_. This algorithm allocates the I/O -resources of the Ceph cluster in proportion to weights, and enforces -the constraits of minimum reservation and maximum limitation, so that -the services can compete for the resources fairly. Currently the -*mclock_opclass* operation queue divides Ceph services involving I/O -resources into following buckets: - -- client op: the iops issued by client -- osd subop: the iops issued by primary OSD -- snap trim: the snap trimming related requests -- pg recovery: the recovery related requests -- pg scrub: the scrub related requests - -And the resources are partitioned using following three sets of tags. In other -words, the share of each type of service is controlled by three tags: - -#. reservation: the minimum IOPS allocated for the service. -#. limitation: the maximum IOPS allocated for the service. -#. weight: the proportional share of capacity if extra capacity or system - oversubscribed. - -In Ceph operations are graded with "cost". And the resources allocated -for serving various services are consumed by these "costs". So, for -example, the more reservation a services has, the more resource it is -guaranteed to possess, as long as it requires. 
Assuming there are 2 -services: recovery and client ops: - -- recovery: (r:1, l:5, w:1) -- client ops: (r:2, l:0, w:9) - -The settings above ensure that the recovery won't get more than 5 -requests per second serviced, even if it requires so (see CURRENT -IMPLEMENTATION NOTE below), and no other services are competing with -it. But if the clients start to issue large amount of I/O requests, -neither will they exhaust all the I/O resources. 1 request per second -is always allocated for recovery jobs as long as there are any such -requests. So the recovery jobs won't be starved even in a cluster with -high load. And in the meantime, the client ops can enjoy a larger -portion of the I/O resource, because its weight is "9", while its -competitor "1". In the case of client ops, it is not clamped by the -limit setting, so it can make use of all the resources if there is no -recovery ongoing. - -Along with *mclock_opclass* another mclock operation queue named -*mclock_client* is available. It divides operations based on category -but also divides them based on the client making the request. This -helps not only manage the distribution of resources spent on different -classes of operations but also tries to insure fairness among clients. - -CURRENT IMPLEMENTATION NOTE: the current experimental implementation -does not enforce the limit values. As a first approximation we decided -not to prevent operations that would otherwise enter the operation -sequencer from doing so. - -Subtleties of mClock -```````````````````` - -The reservation and limit values have a unit of requests per -second. The weight, however, does not technically have a unit and the -weights are relative to one another. So if one class of requests has a -weight of 1 and another a weight of 9, then the latter class of -requests should get 9 executed at a 9 to 1 ratio as the first class. -However that will only happen once the reservations are met and those -values include the operations executed under the reservation phase. - -Even though the weights do not have units, one must be careful in -choosing their values due how the algorithm assigns weight tags to -requests. If the weight is *W*, then for a given class of requests, -the next one that comes in will have a weight tag of *1/W* plus the -previous weight tag or the current time, whichever is larger. That -means if *W* is sufficiently large and therefore *1/W* is sufficiently -small, the calculated tag may never be assigned as it will get a value -of the current time. The ultimate lesson is that values for weight -should not be too large. They should be under the number of requests -one expects to ve serviced each second. - -Caveats -``````` - -There are some factors that can reduce the impact of the mClock op -queues within Ceph. First, requests to an OSD are sharded by their -placement group identifier. Each shard has its own mClock queue and -these queues neither interact nor share information among them. The -number of shards can be controlled with the configuration options -``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and -``osd_op_num_shards_ssd``. A lower number of shards will increase the -impact of the mClock queues, but may have other deliterious effects. - -Second, requests are transferred from the operation queue to the -operation sequencer, in which they go through the phases of -execution. The operation queue is where mClock resides and mClock -determines the next op to transfer to the operation sequencer. 
The -number of operations allowed in the operation sequencer is a complex -issue. In general we want to keep enough operations in the sequencer -so it's always getting work done on some operations while it's waiting -for disk and network access to complete on other operations. On the -other hand, once an operation is transferred to the operation -sequencer, mClock no longer has control over it. Therefore to maximize -the impact of mClock, we want to keep as few operations in the -operation sequencer as possible. So we have an inherent tension. - -The configuration options that influence the number of operations in -the operation sequencer are ``bluestore_throttle_bytes``, -``bluestore_throttle_deferred_bytes``, -``bluestore_throttle_cost_per_io``, -``bluestore_throttle_cost_per_io_hdd``, and -``bluestore_throttle_cost_per_io_ssd``. - -A third factor that affects the impact of the mClock algorithm is that -we're using a distributed system, where requests are made to multiple -OSDs and each OSD has (can have) multiple shards. Yet we're currently -using the mClock algorithm, which is not distributed (note: dmClock is -the distributed version of mClock). - -Various organizations and individuals are currently experimenting with -mClock as it exists in this code base along with their modifications -to the code base. We hope you'll share you're experiences with your -mClock and dmClock experiments in the ceph-devel mailing list. - - -``osd push per object cost`` - -:Description: the overhead for serving a push op - -:Type: Unsigned Integer -:Default: 1000 - -``osd recovery max chunk`` - -:Description: the maximum total size of data chunks a recovery op can carry. - -:Type: Unsigned Integer -:Default: 8 MiB - - -``osd op queue mclock client op res`` - -:Description: the reservation of client op. - -:Type: Float -:Default: 1000.0 - - -``osd op queue mclock client op wgt`` - -:Description: the weight of client op. - -:Type: Float -:Default: 500.0 - - -``osd op queue mclock client op lim`` - -:Description: the limit of client op. - -:Type: Float -:Default: 1000.0 - - -``osd op queue mclock osd subop res`` - -:Description: the reservation of osd subop. - -:Type: Float -:Default: 1000.0 - - -``osd op queue mclock osd subop wgt`` - -:Description: the weight of osd subop. - -:Type: Float -:Default: 500.0 - - -``osd op queue mclock osd subop lim`` - -:Description: the limit of osd subop. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock snap res`` - -:Description: the reservation of snap trimming. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock snap wgt`` - -:Description: the weight of snap trimming. - -:Type: Float -:Default: 1.0 - - -``osd op queue mclock snap lim`` - -:Description: the limit of snap trimming. - -:Type: Float -:Default: 0.001 - - -``osd op queue mclock recov res`` - -:Description: the reservation of recovery. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock recov wgt`` - -:Description: the weight of recovery. - -:Type: Float -:Default: 1.0 - - -``osd op queue mclock recov lim`` - -:Description: the limit of recovery. - -:Type: Float -:Default: 0.001 - - -``osd op queue mclock scrub res`` - -:Description: the reservation of scrub jobs. - -:Type: Float -:Default: 0.0 - - -``osd op queue mclock scrub wgt`` - -:Description: the weight of scrub jobs. - -:Type: Float -:Default: 1.0 - - -``osd op queue mclock scrub lim`` - -:Description: the limit of scrub jobs. - -:Type: Float -:Default: 0.001 - -.. 
_the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf - - -.. index:: OSD; backfilling - -Backfilling -=========== - -When you add or remove Ceph OSD Daemons to a cluster, the CRUSH algorithm will -want to rebalance the cluster by moving placement groups to or from Ceph OSD -Daemons to restore the balance. The process of migrating placement groups and -the objects they contain can reduce the cluster's operational performance -considerably. To maintain operational performance, Ceph performs this migration -with 'backfilling', which allows Ceph to set backfill operations to a lower -priority than requests to read or write data. - - -``osd max backfills`` - -:Description: The maximum number of backfills allowed to or from a single OSD. -:Type: 64-bit Unsigned Integer -:Default: ``1`` - - -``osd backfill scan min`` - -:Description: The minimum number of objects per backfill scan. - -:Type: 32-bit Integer -:Default: ``64`` - - -``osd backfill scan max`` - -:Description: The maximum number of objects per backfill scan. - -:Type: 32-bit Integer -:Default: ``512`` - - -``osd backfill retry interval`` - -:Description: The number of seconds to wait before retrying backfill requests. -:Type: Double -:Default: ``10.0`` - -.. index:: OSD; osdmap - -OSD Map -======= - -OSD maps reflect the OSD daemons operating in the cluster. Over time, the -number of map epochs increases. Ceph provides some settings to ensure that -Ceph performs well as the OSD map grows larger. - - -``osd map dedup`` - -:Description: Enable removing duplicates in the OSD map. -:Type: Boolean -:Default: ``true`` - - -``osd map cache size`` - -:Description: The number of OSD maps to keep cached. -:Type: 32-bit Integer -:Default: ``500`` - - -``osd map cache bl size`` - -:Description: The size of the in-memory OSD map cache in OSD daemons. -:Type: 32-bit Integer -:Default: ``50`` - - -``osd map cache bl inc size`` - -:Description: The size of the in-memory OSD map cache incrementals in - OSD daemons. - -:Type: 32-bit Integer -:Default: ``100`` - - -``osd map message max`` - -:Description: The maximum map entries allowed per MOSDMap message. -:Type: 32-bit Integer -:Default: ``100`` - - - -.. index:: OSD; recovery - -Recovery -======== - -When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD -begins peering with other Ceph OSD Daemons before writes can occur. See -`Monitoring OSDs and PGs`_ for details. - -If a Ceph OSD Daemon crashes and comes back online, usually it will be out of -sync with other Ceph OSD Daemons containing more recent versions of objects in -the placement groups. When this happens, the Ceph OSD Daemon goes into recovery -mode and seeks to get the latest copy of the data and bring its map back up to -date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects -and placement groups may be significantly out of date. Also, if a failure domain -went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at -the same time. This can make the recovery process time consuming and resource -intensive. - -To maintain operational performance, Ceph performs recovery with limitations on -the number recovery requests, threads and object chunk sizes which allows Ceph -perform well in a degraded state. - - -``osd recovery delay start`` - -:Description: After peering completes, Ceph will delay for the specified number - of seconds before starting to recover objects. 
- -:Type: Float -:Default: ``0`` - - -``osd recovery max active`` - -:Description: The number of active recovery requests per OSD at one time. More - requests will accelerate recovery, but the requests places an - increased load on the cluster. - -:Type: 32-bit Integer -:Default: ``3`` - - -``osd recovery max chunk`` - -:Description: The maximum size of a recovered chunk of data to push. -:Type: 64-bit Unsigned Integer -:Default: ``8 << 20`` - - -``osd recovery max single start`` - -:Description: The maximum number of recovery operations per OSD that will be - newly started when an OSD is recovering. -:Type: 64-bit Unsigned Integer -:Default: ``1`` - - -``osd recovery thread timeout`` - -:Description: The maximum time in seconds before timing out a recovery thread. -:Type: 32-bit Integer -:Default: ``30`` - - -``osd recover clone overlap`` - -:Description: Preserves clone overlap during recovery. Should always be set - to ``true``. - -:Type: Boolean -:Default: ``true`` - - -``osd recovery sleep`` - -:Description: Time in seconds to sleep before next recovery or backfill op. - Increasing this value will slow down recovery operation while - client operations will be less impacted. - -:Type: Float -:Default: ``0`` - - -``osd recovery sleep hdd`` - -:Description: Time in seconds to sleep before next recovery or backfill op - for HDDs. - -:Type: Float -:Default: ``0.1`` - - -``osd recovery sleep ssd`` - -:Description: Time in seconds to sleep before next recovery or backfill op - for SSDs. - -:Type: Float -:Default: ``0`` - - -``osd recovery sleep hybrid`` - -:Description: Time in seconds to sleep before next recovery or backfill op - when osd data is on HDD and osd journal is on SSD. - -:Type: Float -:Default: ``0.025`` - -Tiering -======= - -``osd agent max ops`` - -:Description: The maximum number of simultaneous flushing ops per tiering agent - in the high speed mode. -:Type: 32-bit Integer -:Default: ``4`` - - -``osd agent max low ops`` - -:Description: The maximum number of simultaneous flushing ops per tiering agent - in the low speed mode. -:Type: 32-bit Integer -:Default: ``2`` - -See `cache target dirty high ratio`_ for when the tiering agent flushes dirty -objects within the high speed mode. - -Miscellaneous -============= - - -``osd snap trim thread timeout`` - -:Description: The maximum time in seconds before timing out a snap trim thread. -:Type: 32-bit Integer -:Default: ``60*60*1`` - - -``osd backlog thread timeout`` - -:Description: The maximum time in seconds before timing out a backlog thread. -:Type: 32-bit Integer -:Default: ``60*60*1`` - - -``osd default notify timeout`` - -:Description: The OSD default notification timeout (in seconds). -:Type: 32-bit Unsigned Integer -:Default: ``30`` - - -``osd check for log corruption`` - -:Description: Check log files for corruption. Can be computationally expensive. -:Type: Boolean -:Default: ``false`` - - -``osd remove thread timeout`` - -:Description: The maximum time in seconds before timing out a remove OSD thread. -:Type: 32-bit Integer -:Default: ``60*60`` - - -``osd command thread timeout`` - -:Description: The maximum time in seconds before timing out a command thread. -:Type: 32-bit Integer -:Default: ``10*60`` - - -``osd command max records`` - -:Description: Limits the number of lost objects to return. -:Type: 32-bit Integer -:Default: ``256`` - - -``osd auto upgrade tmap`` - -:Description: Uses ``tmap`` for ``omap`` on old objects. 
-:Type: Boolean -:Default: ``true`` - - -``osd tmapput sets users tmap`` - -:Description: Uses ``tmap`` for debugging only. -:Type: Boolean -:Default: ``false`` - - -``osd fast fail on connection refused`` - -:Description: If this option is enabled, crashed OSDs are marked down - immediately by connected peers and MONs (assuming that the - crashed OSD host survives). Disable it to restore old - behavior, at the expense of possible long I/O stalls when - OSDs crash in the middle of I/O operations. -:Type: Boolean -:Default: ``true`` - - - -.. _pool: ../../operations/pools -.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction -.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering -.. _Pool & PG Config Reference: ../pool-pg-config-ref -.. _Journal Config Reference: ../journal-ref -.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio diff --git a/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst deleted file mode 100644 index 89a3707..0000000 --- a/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst +++ /dev/null @@ -1,270 +0,0 @@ -====================================== - Pool, PG and CRUSH Config Reference -====================================== - -.. index:: pools; configuration - -When you create pools and set the number of placement groups for the pool, Ceph -uses default values when you don't specifically override the defaults. **We -recommend** overridding some of the defaults. Specifically, we recommend setting -a pool's replica size and overriding the default number of placement groups. You -can specifically set these values when running `pool`_ commands. You can also -override the defaults by adding new ones in the ``[global]`` section of your -Ceph configuration file. - - -.. literalinclude:: pool-pg.conf - :language: ini - - - -``mon max pool pg num`` - -:Description: The maximum number of placement groups per pool. -:Type: Integer -:Default: ``65536`` - - -``mon pg create interval`` - -:Description: Number of seconds between PG creation in the same - Ceph OSD Daemon. - -:Type: Float -:Default: ``30.0`` - - -``mon pg stuck threshold`` - -:Description: Number of seconds after which PGs can be considered as - being stuck. - -:Type: 32-bit Integer -:Default: ``300`` - -``mon pg min inactive`` - -:Description: Issue a ``HEALTH_ERR`` in cluster log if the number of PGs stay - inactive longer than ``mon_pg_stuck_threshold`` exceeds this - setting. A non-positive number means disabled, never go into ERR. -:Type: Integer -:Default: ``1`` - - -``mon pg warn min per osd`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number - of PGs per (in) OSD is under this number. (a non-positive number - disables this) -:Type: Integer -:Default: ``30`` - - -``mon pg warn max per osd`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number - of PGs per (in) OSD is above this number. (a non-positive number - disables this) -:Type: Integer -:Default: ``300`` - - -``mon pg warn min objects`` - -:Description: Do not warn if the total number of objects in cluster is below - this number -:Type: Integer -:Default: ``1000`` - - -``mon pg warn min pool objects`` - -:Description: Do not warn on pools whose object number is below this number -:Type: Integer -:Default: ``1000`` - - -``mon pg check down all threshold`` - -:Description: Threshold of down OSDs percentage after which we check all PGs - for stale ones. 
-:Type: Float -:Default: ``0.5`` - - -``mon pg warn max object skew`` - -:Description: Issue a ``HEALTH_WARN`` in cluster log if the average object number - of a certain pool is greater than ``mon pg warn max object skew`` times - the average object number of the whole pool. (a non-positive number - disables this) -:Type: Float -:Default: ``10`` - - -``mon delta reset interval`` - -:Description: Seconds of inactivity before we reset the pg delta to 0. We keep - track of the delta of the used space of each pool, so, for - example, it would be easier for us to understand the progress of - recovery or the performance of cache tier. But if there's no - activity reported for a certain pool, we just reset the history of - deltas of that pool. -:Type: Integer -:Default: ``10`` - - -``mon osd max op age`` - -:Description: Maximum op age before we get concerned (make it a power of 2). - A ``HEALTH_WARN`` will be issued if a request has been blocked longer - than this limit. -:Type: Float -:Default: ``32.0`` - - -``osd pg bits`` - -:Description: Placement group bits per Ceph OSD Daemon. -:Type: 32-bit Integer -:Default: ``6`` - - -``osd pgp bits`` - -:Description: The number of bits per Ceph OSD Daemon for PGPs. -:Type: 32-bit Integer -:Default: ``6`` - - -``osd crush chooseleaf type`` - -:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses - ordinal rank rather than name. - -:Type: 32-bit Integer -:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons. - - -``osd crush initial weight`` - -:Description: The initial crush weight for newly added osds into crushmap. - -:Type: Double -:Default: ``the size of newly added osd in TB``. By default, the initial crush - weight for the newly added osd is set to its volume size in TB. - See `Weighting Bucket Items`_ for details. - - -``osd pool default crush replicated ruleset`` - -:Description: The default CRUSH ruleset to use when creating a replicated pool. -:Type: 8-bit Integer -:Default: ``CEPH_DEFAULT_CRUSH_REPLICATED_RULESET``, which means "pick - a ruleset with the lowest numerical ID and use that". This is to - make pool creation work in the absence of ruleset 0. - - -``osd pool erasure code stripe unit`` - -:Description: Sets the default size, in bytes, of a chunk of an object - stripe for erasure coded pools. Every object of size S - will be stored as N stripes, with each data chunk - receiving ``stripe unit`` bytes. Each stripe of ``N * - stripe unit`` bytes will be encoded/decoded - individually. This option can is overridden by the - ``stripe_unit`` setting in an erasure code profile. - -:Type: Unsigned 32-bit Integer -:Default: ``4096`` - - -``osd pool default size`` - -:Description: Sets the number of replicas for objects in the pool. The default - value is the same as - ``ceph osd pool set {pool-name} size {size}``. - -:Type: 32-bit Integer -:Default: ``3`` - - -``osd pool default min size`` - -:Description: Sets the minimum number of written replicas for objects in the - pool in order to acknowledge a write operation to the client. - If minimum is not met, Ceph will not acknowledge the write to the - client. This setting ensures a minimum number of replicas when - operating in ``degraded`` mode. - -:Type: 32-bit Integer -:Default: ``0``, which means no particular minimum. If ``0``, - minimum is ``size - (size / 2)``. - - -``osd pool default pg num`` - -:Description: The default number of placement groups for a pool. The default - value is the same as ``pg_num`` with ``mkpool``. 
- -:Type: 32-bit Integer -:Default: ``8`` - - -``osd pool default pgp num`` - -:Description: The default number of placement groups for placement for a pool. - The default value is the same as ``pgp_num`` with ``mkpool``. - PG and PGP should be equal (for now). - -:Type: 32-bit Integer -:Default: ``8`` - - -``osd pool default flags`` - -:Description: The default flags for new pools. -:Type: 32-bit Integer -:Default: ``0`` - - -``osd max pgls`` - -:Description: The maximum number of placement groups to list. A client - requesting a large number can tie up the Ceph OSD Daemon. - -:Type: Unsigned 64-bit Integer -:Default: ``1024`` -:Note: Default should be fine. - - -``osd min pg log entries`` - -:Description: The minimum number of placement group logs to maintain - when trimming log files. - -:Type: 32-bit Int Unsigned -:Default: ``1000`` - - -``osd default data pool replay window`` - -:Description: The time (in seconds) for an OSD to wait for a client to replay - a request. - -:Type: 32-bit Integer -:Default: ``45`` - -``osd max pg per osd hard ratio`` - -:Description: The ratio of number of PGs per OSD allowed by the cluster before - OSD refuses to create new PGs. OSD stops creating new PGs if the number - of PGs it serves exceeds - ``osd max pg per osd hard ratio`` \* ``mon max pg per osd``. - -:Type: Float -:Default: ``2`` - -.. _pool: ../../operations/pools -.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering -.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems diff --git a/src/ceph/doc/rados/configuration/pool-pg.conf b/src/ceph/doc/rados/configuration/pool-pg.conf deleted file mode 100644 index 5f1b3b7..0000000 --- a/src/ceph/doc/rados/configuration/pool-pg.conf +++ /dev/null @@ -1,20 +0,0 @@ -[global] - - # By default, Ceph makes 3 replicas of objects. If you want to make four - # copies of an object the default value--a primary copy and three replica - # copies--reset the default values as shown in 'osd pool default size'. - # If you want to allow Ceph to write a lesser number of copies in a degraded - # state, set 'osd pool default min size' to a number less than the - # 'osd pool default size' value. - - osd pool default size = 4 # Write an object 4 times. - osd pool default min size = 1 # Allow writing one copy in a degraded state. - - # Ensure you have a realistic number of placement groups. We recommend - # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100 - # divided by the number of replicas (i.e., osd pool default size). So for - # 10 OSDs and osd pool default size = 4, we'd recommend approximately - # (100 * 10) / 4 = 250. - - osd pool default pg num = 250 - osd pool default pgp num = 250 diff --git a/src/ceph/doc/rados/configuration/storage-devices.rst b/src/ceph/doc/rados/configuration/storage-devices.rst deleted file mode 100644 index 83c0c9b..0000000 --- a/src/ceph/doc/rados/configuration/storage-devices.rst +++ /dev/null @@ -1,83 +0,0 @@ -================= - Storage Devices -================= - -There are two Ceph daemons that store data on disk: - -* **Ceph OSDs** (or Object Storage Daemons) are where most of the - data is stored in Ceph. Generally speaking, each OSD is backed by - a single storage device, like a traditional hard disk (HDD) or - solid state disk (SSD). OSDs can also be backed by a combination - of devices, like a HDD for most data and an SSD (or partition of an - SSD) for some metadata. 
The number of OSDs in a cluster is generally a function of how much data will be stored, how big each storage device will be, and the level and type of redundancy (replication or erasure coding).
* **Ceph Monitor** daemons manage critical cluster state such as cluster membership and authentication information. For smaller clusters, a few gigabytes is all that is needed, although for larger clusters the monitor database can reach tens or possibly hundreds of gigabytes.


OSD Backends
============

There are two ways that OSDs can manage the data they store. Starting with the Luminous 12.2.z release, the new default (and recommended) backend is *BlueStore*. Prior to Luminous, the default (and only option) was *FileStore*.

BlueStore
---------

BlueStore is a special-purpose storage backend designed specifically for managing data on disk for Ceph OSD workloads. It is motivated by experience supporting and managing OSDs that use FileStore over the last ten years. Key BlueStore features include:

* Direct management of storage devices. BlueStore consumes raw block devices or partitions. This avoids any intervening layers of abstraction (such as local file systems like XFS) that may limit performance or add complexity.
* Metadata management with RocksDB. BlueStore embeds RocksDB's key/value database to manage internal metadata, such as the mapping from object names to block locations on disk.
* Full data and metadata checksumming. By default, all data and metadata written to BlueStore is protected by one or more checksums. No data or metadata is read from disk or returned to the user without being verified.
* Inline compression. Data may optionally be compressed before being written to disk.
* Multi-device metadata tiering. BlueStore allows its internal journal (write-ahead log) to be written to a separate, high-speed device (such as an SSD, NVMe, or NVDIMM) to increase performance. If a significant amount of faster storage is available, internal metadata can also be stored on the faster device.
* Efficient copy-on-write. RBD and CephFS snapshots rely on a copy-on-write *clone* mechanism that is implemented efficiently in BlueStore. This results in efficient I/O both for regular snapshots and for erasure-coded pools (which rely on cloning to implement efficient two-phase commits).

For more information, see :doc:`bluestore-config-ref`.

FileStore
---------

FileStore is the legacy approach to storing objects in Ceph. It relies on a standard file system (normally XFS) in combination with a key/value database (traditionally LevelDB, now RocksDB) for some metadata.

FileStore is well-tested and widely used in production, but it suffers from many performance deficiencies due to its overall design and its reliance on a traditional file system for storing object data.

Although FileStore is generally capable of functioning on most POSIX-compatible file systems (including btrfs and ext4), we recommend using only XFS. Both btrfs and ext4 have known bugs and deficiencies, and their use may lead to data loss. By default, all Ceph provisioning tools use XFS.

For more information, see :doc:`filestore-config-ref`.
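If you need to control which backend newly provisioned OSDs use, the ``osd objectstore`` option can be set in ``ceph.conf``. The snippet below is an illustrative sketch only (values other than the option name are examples); the provisioning tools can also select the backend explicitly, for example ``ceph-disk prepare --bluestore``, as shown in the OSD provisioning sections of this documentation. ::

    [osd]
    # Backend used for newly created OSDs. "bluestore" is the default
    # (and recommended) value from Luminous onward; "filestore" selects
    # the legacy backend described above.
    osd objectstore = bluestore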
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst b/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst deleted file mode 100644 index a91f69c..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst +++ /dev/null @@ -1,38 +0,0 @@ -============= - Admin Tasks -============= - -Once you have set up a cluster with ``ceph-deploy``, you may -provide the client admin key and the Ceph configuration file -to another host so that a user on the host may use the ``ceph`` -command line as an administrative user. - - -Create an Admin Host -==================== - -To enable a host to execute ceph commands with administrator -privileges, use the ``admin`` command. :: - - ceph-deploy admin {host-name [host-name]...} - - -Deploy Config File -================== - -To send an updated copy of the Ceph configuration file to hosts -in your cluster, use the ``config push`` command. :: - - ceph-deploy config push {host-name [host-name]...} - -.. tip:: With a base name and increment host-naming convention, - it is easy to deploy configuration files via simple scripts - (e.g., ``ceph-deploy config hostname{1,2,3,4,5}``). - -Retrieve Config File -==================== - -To retrieve a copy of the Ceph configuration file from a host -in your cluster, use the ``config pull`` command. :: - - ceph-deploy config pull {host-name [host-name]...} diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-install.rst b/src/ceph/doc/rados/deployment/ceph-deploy-install.rst deleted file mode 100644 index 849d68e..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-install.rst +++ /dev/null @@ -1,46 +0,0 @@ -==================== - Package Management -==================== - -Install -======= - -To install Ceph packages on your cluster hosts, open a command line on your -client machine and type the following:: - - ceph-deploy install {hostname [hostname] ...} - -Without additional arguments, ``ceph-deploy`` will install the most recent -major release of Ceph to the cluster host(s). To specify a particular package, -you may select from the following: - -- ``--release <code-name>`` -- ``--testing`` -- ``--dev <branch-or-tag>`` - -For example:: - - ceph-deploy install --release cuttlefish hostname1 - ceph-deploy install --testing hostname2 - ceph-deploy install --dev wip-some-branch hostname{1,2,3,4,5} - -For additional usage, execute:: - - ceph-deploy install -h - - -Uninstall -========= - -To uninstall Ceph packages from your cluster hosts, open a terminal on -your admin host and type the following:: - - ceph-deploy uninstall {hostname [hostname] ...} - -On a Debian or Ubuntu system, you may also:: - - ceph-deploy purge {hostname [hostname] ...} - -The tool will unininstall ``ceph`` packages from the specified hosts. Purge -additionally removes configuration files. - diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst b/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst deleted file mode 100644 index 3e106c9..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst +++ /dev/null @@ -1,32 +0,0 @@ -================= - Keys Management -================= - - -Gather Keys -=========== - -Before you can provision a host to run OSDs or metadata servers, you must gather -monitor keys and the OSD and MDS bootstrap keyrings. To gather keys, enter the -following:: - - ceph-deploy gatherkeys {monitor-host} - - -.. note:: To retrieve the keys, you specify a host that has a - Ceph monitor. - -.. 
note:: If you have specified multiple monitors in the setup of the cluster, - make sure, that all monitors are up and running. If the monitors haven't - formed quorum, ``ceph-create-keys`` will not finish and the keys are not - generated. - -Forget Keys -=========== - -When you are no longer using ``ceph-deploy`` (or if you are recreating a -cluster), you should delete the keys in the local directory of your admin host. -To delete keys, enter the following:: - - ceph-deploy forgetkeys - diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst b/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst deleted file mode 100644 index d2afaec..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst +++ /dev/null @@ -1,46 +0,0 @@ -============================ - Add/Remove Metadata Server -============================ - -With ``ceph-deploy``, adding and removing metadata servers is a simple task. You -just add or remove one or more metadata servers on the command line with one -command. - -.. important:: You must deploy at least one metadata server to use CephFS. - There is experimental support for running multiple metadata servers. - Do not run multiple active metadata servers in production. - -See `MDS Config Reference`_ for details on configuring metadata servers. - - -Add a Metadata Server -===================== - -Once you deploy monitors and OSDs you may deploy the metadata server(s). :: - - ceph-deploy mds create {host-name}[:{daemon-name}] [{host-name}[:{daemon-name}] ...] - -You may specify a daemon instance a name (optional) if you would like to run -multiple daemons on a single server. - - -Remove a Metadata Server -======================== - -Coming soon... - -.. If you have a metadata server in your cluster that you'd like to remove, you may use -.. the ``destroy`` option. :: - -.. ceph-deploy mds destroy {host-name}[:{daemon-name}] [{host-name}[:{daemon-name}] ...] - -.. You may specify a daemon instance a name (optional) if you would like to destroy -.. a particular daemon that runs on a single server with multiple MDS daemons. - -.. .. note:: Ensure that if you remove a metadata server, the remaining metadata - servers will be able to service requests from CephFS clients. If that is not - possible, consider adding a metadata server before destroying the metadata - server you would like to take offline. - - -.. _MDS Config Reference: ../../../cephfs/mds-config-ref diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst b/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst deleted file mode 100644 index bda34fe..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst +++ /dev/null @@ -1,56 +0,0 @@ -===================== - Add/Remove Monitors -===================== - -With ``ceph-deploy``, adding and removing monitors is a simple task. You just -add or remove one or more monitors on the command line with one command. Before -``ceph-deploy``, the process of `adding and removing monitors`_ involved -numerous manual steps. Using ``ceph-deploy`` imposes a restriction: **you may -only install one monitor per host.** - -.. note:: We do not recommend comingling monitors and OSDs on - the same host. - -For high availability, you should run a production Ceph cluster with **AT -LEAST** three monitors. Ceph uses the Paxos algorithm, which requires a -consensus among the majority of monitors in a quorum. With Paxos, the monitors -cannot determine a majority for establishing a quorum with only two monitors. 
A -majority of monitors must be counted as such: 1:1, 2:3, 3:4, 3:5, 4:6, etc. - -See `Monitor Config Reference`_ for details on configuring monitors. - - -Add a Monitor -============= - -Once you create a cluster and install Ceph packages to the monitor host(s), you -may deploy the monitor(s) to the monitor host(s). When using ``ceph-deploy``, -the tool enforces a single monitor per host. :: - - ceph-deploy mon create {host-name [host-name]...} - - -.. note:: Ensure that you add monitors such that they may arrive at a consensus - among a majority of monitors, otherwise other steps (like ``ceph-deploy gatherkeys``) - will fail. - -.. note:: When adding a monitor on a host that was not in hosts initially defined - with the ``ceph-deploy new`` command, a ``public network`` statement needs - to be added to the ceph.conf file. - -Remove a Monitor -================ - -If you have a monitor in your cluster that you'd like to remove, you may use -the ``destroy`` option. :: - - ceph-deploy mon destroy {host-name [host-name]...} - - -.. note:: Ensure that if you remove a monitor, the remaining monitors will be - able to establish a consensus. If that is not possible, consider adding a - monitor before removing the monitor you would like to take offline. - - -.. _adding and removing monitors: ../../operations/add-or-rm-mons -.. _Monitor Config Reference: ../../configuration/mon-config-ref diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-new.rst b/src/ceph/doc/rados/deployment/ceph-deploy-new.rst deleted file mode 100644 index 5eb37a9..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-new.rst +++ /dev/null @@ -1,66 +0,0 @@ -================== - Create a Cluster -================== - -The first step in using Ceph with ``ceph-deploy`` is to create a new Ceph -cluster. A new Ceph cluster has: - -- A Ceph configuration file, and -- A monitor keyring. - -The Ceph configuration file consists of at least: - -- Its own filesystem ID (``fsid``) -- The initial monitor(s) hostname(s), and -- The initial monitor(s) and IP address(es). - -For additional details, see the `Monitor Configuration Reference`_. - -The ``ceph-deploy`` tool also creates a monitor keyring and populates it with a -``[mon.]`` key. For additional details, see the `Cephx Guide`_. - - -Usage ------ - -To create a cluster with ``ceph-deploy``, use the ``new`` command and specify -the host(s) that will be initial members of the monitor quorum. :: - - ceph-deploy new {host [host], ...} - -For example:: - - ceph-deploy new mon1.foo.com - ceph-deploy new mon{1,2,3} - -The ``ceph-deploy`` utility will use DNS to resolve hostnames to IP -addresses. The monitors will be named using the first component of -the name (e.g., ``mon1`` above). It will add the specified host names -to the Ceph configuration file. For additional details, execute:: - - ceph-deploy new -h - - -Naming a Cluster ----------------- - -By default, Ceph clusters have a cluster name of ``ceph``. You can specify -a cluster name if you want to run multiple clusters on the same hardware. For -example, if you want to optimize a cluster for use with block devices, and -another for use with the gateway, you can run two different clusters on the same -hardware if they have a different ``fsid`` and cluster name. :: - - ceph-deploy --cluster {cluster-name} new {host [host], ...} - -For example:: - - ceph-deploy --cluster rbdcluster new ceph-mon1 - ceph-deploy --cluster rbdcluster new ceph-mon{1,2,3} - -.. 
note:: If you run multiple clusters, ensure you adjust the default - port settings and open ports for your additional cluster(s) so that - the networks of the two different clusters don't conflict with each other. - - -.. _Monitor Configuration Reference: ../../configuration/mon-config-ref -.. _Cephx Guide: ../../../dev/mon-bootstrap#secret-keys diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst b/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst deleted file mode 100644 index a4eb4d1..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst +++ /dev/null @@ -1,121 +0,0 @@ -================= - Add/Remove OSDs -================= - -Adding and removing Ceph OSD Daemons to your cluster may involve a few more -steps when compared to adding and removing other Ceph daemons. Ceph OSD Daemons -write data to the disk and to journals. So you need to provide a disk for the -OSD and a path to the journal partition (i.e., this is the most common -configuration, but you may configure your system to your own needs). - -In Ceph v0.60 and later releases, Ceph supports ``dm-crypt`` on disk encryption. -You may specify the ``--dmcrypt`` argument when preparing an OSD to tell -``ceph-deploy`` that you want to use encryption. You may also specify the -``--dmcrypt-key-dir`` argument to specify the location of ``dm-crypt`` -encryption keys. - -You should test various drive configurations to gauge their throughput before -before building out a large cluster. See `Data Storage`_ for additional details. - - -List Disks -========== - -To list the disks on a node, execute the following command:: - - ceph-deploy disk list {node-name [node-name]...} - - -Zap Disks -========= - -To zap a disk (delete its partition table) in preparation for use with Ceph, -execute the following:: - - ceph-deploy disk zap {osd-server-name}:{disk-name} - ceph-deploy disk zap osdserver1:sdb - -.. important:: This will delete all data. - - -Prepare OSDs -============ - -Once you create a cluster, install Ceph packages, and gather keys, you -may prepare the OSDs and deploy them to the OSD node(s). If you need to -identify a disk or zap it prior to preparing it for use as an OSD, -see `List Disks`_ and `Zap Disks`_. :: - - ceph-deploy osd prepare {node-name}:{data-disk}[:{journal-disk}] - ceph-deploy osd prepare osdserver1:sdb:/dev/ssd - ceph-deploy osd prepare osdserver1:sdc:/dev/ssd - -The ``prepare`` command only prepares the OSD. On most operating -systems, the ``activate`` phase will automatically run when the -partitions are created on the disk (using Ceph ``udev`` rules). If not -use the ``activate`` command. See `Activate OSDs`_ for -details. - -The foregoing example assumes a disk dedicated to one Ceph OSD Daemon, and -a path to an SSD journal partition. We recommend storing the journal on -a separate drive to maximize throughput. You may dedicate a single drive -for the journal too (which may be expensive) or place the journal on the -same disk as the OSD (not recommended as it impairs performance). In the -foregoing example we store the journal on a partitioned solid state drive. - -You can use the settings --fs-type or --bluestore to choose which file system -you want to install in the OSD drive. (More information by running -'ceph-deploy osd prepare --help'). - -.. 
note:: When running multiple Ceph OSD daemons on a single node, and - sharing a partioned journal with each OSD daemon, you should consider - the entire node the minimum failure domain for CRUSH purposes, because - if the SSD drive fails, all of the Ceph OSD daemons that journal to it - will fail too. - - -Activate OSDs -============= - -Once you prepare an OSD you may activate it with the following command. :: - - ceph-deploy osd activate {node-name}:{data-disk-partition}[:{journal-disk-partition}] - ceph-deploy osd activate osdserver1:/dev/sdb1:/dev/ssd1 - ceph-deploy osd activate osdserver1:/dev/sdc1:/dev/ssd2 - -The ``activate`` command will cause your OSD to come ``up`` and be placed -``in`` the cluster. The ``activate`` command uses the path to the partition -created when running the ``prepare`` command. - - -Create OSDs -=========== - -You may prepare OSDs, deploy them to the OSD node(s) and activate them in one -step with the ``create`` command. The ``create`` command is a convenience method -for executing the ``prepare`` and ``activate`` command sequentially. :: - - ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}] - ceph-deploy osd create osdserver1:sdb:/dev/ssd1 - -.. List OSDs -.. ========= - -.. To list the OSDs deployed on a node(s), execute the following command:: - -.. ceph-deploy osd list {node-name} - - -Destroy OSDs -============ - -.. note:: Coming soon. See `Remove OSDs`_ for manual procedures. - -.. To destroy an OSD, execute the following command:: - -.. ceph-deploy osd destroy {node-name}:{path-to-disk}[:{path/to/journal}] - -.. Destroying an OSD will take it ``down`` and ``out`` of the cluster. - -.. _Data Storage: ../../../start/hardware-recommendations#data-storage -.. _Remove OSDs: ../../operations/add-or-rm-osds#removing-osds-manual diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst b/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst deleted file mode 100644 index 685c3c4..0000000 --- a/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst +++ /dev/null @@ -1,25 +0,0 @@ -============== - Purge a Host -============== - -When you remove Ceph daemons and uninstall Ceph, there may still be extraneous -data from the cluster on your server. The ``purge`` and ``purgedata`` commands -provide a convenient means of cleaning up a host. - - -Purge Data -========== - -To remove all data from ``/var/lib/ceph`` (but leave Ceph packages intact), -execute the ``purgedata`` command. - - ceph-deploy purgedata {hostname} [{hostname} ...] - - -Purge -===== - -To remove all data from ``/var/lib/ceph`` and uninstall Ceph packages, execute -the ``purge`` command. - - ceph-deploy purge {hostname} [{hostname} ...]
\ No newline at end of file diff --git a/src/ceph/doc/rados/deployment/index.rst b/src/ceph/doc/rados/deployment/index.rst deleted file mode 100644 index 0853e4a..0000000 --- a/src/ceph/doc/rados/deployment/index.rst +++ /dev/null @@ -1,58 +0,0 @@ -================= - Ceph Deployment -================= - -The ``ceph-deploy`` tool is a way to deploy Ceph relying only upon SSH access to -the servers, ``sudo``, and some Python. It runs on your workstation, and does -not require servers, databases, or any other tools. If you set up and -tear down Ceph clusters a lot, and want minimal extra bureaucracy, -``ceph-deploy`` is an ideal tool. The ``ceph-deploy`` tool is not a generic -deployment system. It was designed exclusively for Ceph users who want to get -Ceph up and running quickly with sensible initial configuration settings without -the overhead of installing Chef, Puppet or Juju. Users who want fine-control -over security settings, partitions or directory locations should use a tool -such as Juju, Puppet, `Chef`_ or Crowbar. - - -With ``ceph-deploy``, you can develop scripts to install Ceph packages on remote -hosts, create a cluster, add monitors, gather (or forget) keys, add OSDs and -metadata servers, configure admin hosts, and tear down the clusters. - -.. raw:: html - - <table cellpadding="10"><tbody valign="top"><tr><td> - -.. toctree:: - - Preflight Checklist <preflight-checklist> - Install Ceph <ceph-deploy-install> - -.. raw:: html - - </td><td> - -.. toctree:: - - Create a Cluster <ceph-deploy-new> - Add/Remove Monitor(s) <ceph-deploy-mon> - Key Management <ceph-deploy-keys> - Add/Remove OSD(s) <ceph-deploy-osd> - Add/Remove MDS(s) <ceph-deploy-mds> - - -.. raw:: html - - </td><td> - -.. toctree:: - - Purge Hosts <ceph-deploy-purge> - Admin Tasks <ceph-deploy-admin> - - -.. raw:: html - - </td></tr></tbody></table> - - -.. _Chef: http://tracker.ceph.com/projects/ceph/wiki/Deploying_Ceph_with_Chef diff --git a/src/ceph/doc/rados/deployment/preflight-checklist.rst b/src/ceph/doc/rados/deployment/preflight-checklist.rst deleted file mode 100644 index 64a669f..0000000 --- a/src/ceph/doc/rados/deployment/preflight-checklist.rst +++ /dev/null @@ -1,109 +0,0 @@ -===================== - Preflight Checklist -===================== - -.. versionadded:: 0.60 - -This **Preflight Checklist** will help you prepare an admin node for use with -``ceph-deploy``, and server nodes for use with passwordless ``ssh`` and -``sudo``. - -Before you can deploy Ceph using ``ceph-deploy``, you need to ensure that you -have a few things set up first on your admin node and on nodes running Ceph -daemons. - - -Install an Operating System -=========================== - -Install a recent release of Debian or Ubuntu (e.g., 12.04 LTS, 14.04 LTS) on -your nodes. For additional details on operating systems or to use other -operating systems other than Debian or Ubuntu, see `OS Recommendations`_. - - -Install an SSH Server -===================== - -The ``ceph-deploy`` utility requires ``ssh``, so your server node(s) require an -SSH server. :: - - sudo apt-get install openssh-server - - -Create a User -============= - -Create a user on nodes running Ceph daemons. - -.. tip:: We recommend a username that brute force attackers won't - guess easily (e.g., something other than ``root``, ``ceph``, etc). - -:: - - ssh user@ceph-server - sudo useradd -d /home/ceph -m ceph - sudo passwd ceph - - -``ceph-deploy`` installs packages onto your nodes. This means that -the user you create requires passwordless ``sudo`` privileges. 
- -.. note:: We **DO NOT** recommend enabling the ``root`` password - for security reasons. - -To provide full privileges to the user, add the following to -``/etc/sudoers.d/ceph``. :: - - echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph - sudo chmod 0440 /etc/sudoers.d/ceph - - -Configure SSH -============= - -Configure your admin machine with password-less SSH access to each node -running Ceph daemons (leave the passphrase empty). :: - - ssh-keygen - Generating public/private key pair. - Enter file in which to save the key (/ceph-client/.ssh/id_rsa): - Enter passphrase (empty for no passphrase): - Enter same passphrase again: - Your identification has been saved in /ceph-client/.ssh/id_rsa. - Your public key has been saved in /ceph-client/.ssh/id_rsa.pub. - -Copy the key to each node running Ceph daemons:: - - ssh-copy-id ceph@ceph-server - -Modify your ~/.ssh/config file of your admin node so that it defaults -to logging in as the user you created when no username is specified. :: - - Host ceph-server - Hostname ceph-server.fqdn-or-ip-address.com - User ceph - - -Install ceph-deploy -=================== - -To install ``ceph-deploy``, execute the following:: - - wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add - - echo deb http://ceph.com/debian-dumpling/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list - sudo apt-get update - sudo apt-get install ceph-deploy - - -Ensure Connectivity -=================== - -Ensure that your Admin node has connectivity to the network and to your Server -node (e.g., ensure ``iptables``, ``ufw`` or other tools that may prevent -connections, traffic forwarding, etc. to allow what you need). - - -Once you have completed this pre-flight checklist, you are ready to begin using -``ceph-deploy``. - -.. _OS Recommendations: ../../../start/os-recommendations diff --git a/src/ceph/doc/rados/index.rst b/src/ceph/doc/rados/index.rst deleted file mode 100644 index 929bb7e..0000000 --- a/src/ceph/doc/rados/index.rst +++ /dev/null @@ -1,76 +0,0 @@ -====================== - Ceph Storage Cluster -====================== - -The :term:`Ceph Storage Cluster` is the foundation for all Ceph deployments. -Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, Ceph -Storage Clusters consist of two types of daemons: a :term:`Ceph OSD Daemon` -(OSD) stores data as objects on a storage node; and a :term:`Ceph Monitor` (MON) -maintains a master copy of the cluster map. A Ceph Storage Cluster may contain -thousands of storage nodes. A minimal system will have at least one -Ceph Monitor and two Ceph OSD Daemons for data replication. - -The Ceph Filesystem, Ceph Object Storage and Ceph Block Devices read data from -and write data to the Ceph Storage Cluster. - -.. raw:: html - - <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style> - <table cellpadding="10"><colgroup><col width="33%"><col width="33%"><col width="33%"></colgroup><tbody valign="top"><tr><td><h3>Config and Deploy</h3> - -Ceph Storage Clusters have a few required settings, but most configuration -settings have default values. A typical deployment uses a deployment tool -to define a cluster and bootstrap a monitor. See `Deployment`_ for details -on ``ceph-deploy.`` - -.. toctree:: - :maxdepth: 2 - - Configuration <configuration/index> - Deployment <deployment/index> - -.. raw:: html - - </td><td><h3>Operations</h3> - -Once you have a deployed a Ceph Storage Cluster, you may begin operating -your cluster. - -.. 
toctree:: - :maxdepth: 2 - - - Operations <operations/index> - -.. toctree:: - :maxdepth: 1 - - Man Pages <man/index> - - -.. toctree:: - :hidden: - - troubleshooting/index - -.. raw:: html - - </td><td><h3>APIs</h3> - -Most Ceph deployments use `Ceph Block Devices`_, `Ceph Object Storage`_ and/or the -`Ceph Filesystem`_. You may also develop applications that talk directly to -the Ceph Storage Cluster. - -.. toctree:: - :maxdepth: 2 - - APIs <api/index> - -.. raw:: html - - </td></tr></tbody></table> - -.. _Ceph Block Devices: ../rbd/ -.. _Ceph Filesystem: ../cephfs/ -.. _Ceph Object Storage: ../radosgw/ -.. _Deployment: ../rados/deployment/ diff --git a/src/ceph/doc/rados/man/index.rst b/src/ceph/doc/rados/man/index.rst deleted file mode 100644 index abeb88b..0000000 --- a/src/ceph/doc/rados/man/index.rst +++ /dev/null @@ -1,34 +0,0 @@ -======================= - Object Store Manpages -======================= - -.. toctree:: - :maxdepth: 1 - - ../../man/8/ceph-disk.rst - ../../man/8/ceph-volume.rst - ../../man/8/ceph-volume-systemd.rst - ../../man/8/ceph.rst - ../../man/8/ceph-deploy.rst - ../../man/8/ceph-rest-api.rst - ../../man/8/ceph-authtool.rst - ../../man/8/ceph-clsinfo.rst - ../../man/8/ceph-conf.rst - ../../man/8/ceph-debugpack.rst - ../../man/8/ceph-dencoder.rst - ../../man/8/ceph-mon.rst - ../../man/8/ceph-osd.rst - ../../man/8/ceph-kvstore-tool.rst - ../../man/8/ceph-run.rst - ../../man/8/ceph-syn.rst - ../../man/8/crushtool.rst - ../../man/8/librados-config.rst - ../../man/8/monmaptool.rst - ../../man/8/osdmaptool.rst - ../../man/8/rados.rst - - -.. toctree:: - :hidden: - - ../../man/8/ceph-post-file.rst diff --git a/src/ceph/doc/rados/operations/add-or-rm-mons.rst b/src/ceph/doc/rados/operations/add-or-rm-mons.rst deleted file mode 100644 index 0cdc431..0000000 --- a/src/ceph/doc/rados/operations/add-or-rm-mons.rst +++ /dev/null @@ -1,370 +0,0 @@ -========================== - Adding/Removing Monitors -========================== - -When you have a cluster up and running, you may add or remove monitors -from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_ -or `Monitor Bootstrap`_. - -Adding Monitors -=============== - -Ceph monitors are light-weight processes that maintain a master copy of the -cluster map. You can run a cluster with 1 monitor. We recommend at least 3 -monitors for a production cluster. Ceph monitors use a variation of the -`Paxos`_ protocol to establish consensus about maps and other critical -information across the cluster. Due to the nature of Paxos, Ceph requires -a majority of monitors running to establish a quorum (thus establishing -consensus). - -It is advisable to run an odd-number of monitors but not mandatory. An -odd-number of monitors has a higher resiliency to failures than an -even-number of monitors. For instance, on a 2 monitor deployment, no -failures can be tolerated in order to maintain a quorum; with 3 monitors, -one failure can be tolerated; in a 4 monitor deployment, one failure can -be tolerated; with 5 monitors, two failures can be tolerated. This is -why an odd-number is advisable. Summarizing, Ceph needs a majority of -monitors to be running (and able to communicate with each other), but that -majority can be achieved using a single monitor, or 2 out of 2 monitors, -2 out of 3, 3 out of 4, etc. - -For an initial deployment of a multi-node Ceph cluster, it is advisable to -deploy three monitors, increasing the number two at a time if a valid need -for more than three exists. 
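When changing the number of monitors, it is useful to confirm how many monitors are currently in the monitor map and whether they have formed a quorum. A quick check (commands only; the exact output format varies by release)::

    ceph mon stat
    ceph quorum_status --format json-pretty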
- -Since monitors are light-weight, it is possible to run them on the same -host as an OSD; however, we recommend running them on separate hosts, -because fsync issues with the kernel may impair performance. - -.. note:: A *majority* of monitors in your cluster must be able to - reach each other in order to establish a quorum. - -Deploy your Hardware --------------------- - -If you are adding a new host when adding a new monitor, see `Hardware -Recommendations`_ for details on minimum recommendations for monitor hardware. -To add a monitor host to your cluster, first make sure you have an up-to-date -version of Linux installed (typically Ubuntu 14.04 or RHEL 7). - -Add your monitor host to a rack in your cluster, connect it to the network -and ensure that it has network connectivity. - -.. _Hardware Recommendations: ../../../start/hardware-recommendations - -Install the Required Software ------------------------------ - -For manually deployed clusters, you must install Ceph packages -manually. See `Installing Packages`_ for details. -You should configure SSH to a user with password-less authentication -and root permissions. - -.. _Installing Packages: ../../../install/install-storage-cluster - - -.. _Adding a Monitor (Manual): - -Adding a Monitor (Manual) -------------------------- - -This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map -and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If -this results in only two monitor daemons, you may add more monitors by -repeating this procedure until you have a sufficient number of ``ceph-mon`` -daemons to achieve a quorum. - -At this point you should define your monitor's id. Traditionally, monitors -have been named with single letters (``a``, ``b``, ``c``, ...), but you are -free to define the id as you see fit. For the purpose of this document, -please take into account that ``{mon-id}`` should be the id you chose, -without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` -on ``mon.a``). - -#. Create the default directory on the machine that will host your - new monitor. :: - - ssh {new-mon-host} - sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} - -#. Create a temporary directory ``{tmp}`` to keep the files needed during - this process. This directory should be different from the monitor's default - directory created in the previous step, and can be removed after all the - steps are executed. :: - - mkdir {tmp} - -#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to - the retrieved keyring, and ``{key-filename}`` is the name of the file - containing the retrieved monitor key. :: - - ceph auth get mon. -o {tmp}/{key-filename} - -#. Retrieve the monitor map, where ``{tmp}`` is the path to - the retrieved monitor map, and ``{map-filename}`` is the name of the file - containing the retrieved monitor monitor map. :: - - ceph mon getmap -o {tmp}/{map-filename} - -#. Prepare the monitor's data directory created in the first step. You must - specify the path to the monitor map so that you can retrieve the - information about a quorum of monitors and their ``fsid``. You must also - specify a path to the monitor keyring:: - - sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} - - -#. Start the new monitor and it will automatically join the cluster. - The daemon needs to know which address to bind to, either via - ``--public-addr {ip:port}`` or by setting ``mon addr`` in the - appropriate section of ``ceph.conf``. 
For example:: - - ceph-mon -i {mon-id} --public-addr {ip:port} - - -Removing Monitors -================= - -When you remove monitors from a cluster, consider that Ceph monitors use -PAXOS to establish consensus about the master cluster map. You must have -a sufficient number of monitors to establish a quorum for consensus about -the cluster map. - -.. _Removing a Monitor (Manual): - -Removing a Monitor (Manual) ---------------------------- - -This procedure removes a ``ceph-mon`` daemon from your cluster. If this -procedure results in only two monitor daemons, you may add or remove another -monitor until you have a number of ``ceph-mon`` daemons that can achieve a -quorum. - -#. Stop the monitor. :: - - service ceph -a stop mon.{mon-id} - -#. Remove the monitor from the cluster. :: - - ceph mon remove {mon-id} - -#. Remove the monitor entry from ``ceph.conf``. - - -Removing Monitors from an Unhealthy Cluster -------------------------------------------- - -This procedure removes a ``ceph-mon`` daemon from an unhealthy -cluster, for example a cluster where the monitors cannot form a -quorum. - - -#. Stop all ``ceph-mon`` daemons on all monitor hosts. :: - - ssh {mon-host} - service ceph stop mon || stop ceph-mon-all - # and repeat for all mons - -#. Identify a surviving monitor and log in to that host. :: - - ssh {mon-host} - -#. Extract a copy of the monmap file. :: - - ceph-mon -i {mon-id} --extract-monmap {map-path} - # in most cases, that's - ceph-mon -i `hostname` --extract-monmap /tmp/monmap - -#. Remove the non-surviving or problematic monitors. For example, if - you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where - only ``mon.a`` will survive, follow the example below:: - - monmaptool {map-path} --rm {mon-id} - # for example, - monmaptool /tmp/monmap --rm b - monmaptool /tmp/monmap --rm c - -#. Inject the surviving map with the removed monitors into the - surviving monitor(s). For example, to inject a map into monitor - ``mon.a``, follow the example below:: - - ceph-mon -i {mon-id} --inject-monmap {map-path} - # for example, - ceph-mon -i a --inject-monmap /tmp/monmap - -#. Start only the surviving monitors. - -#. Verify the monitors form a quorum (``ceph -s``). - -#. You may wish to archive the removed monitors' data directory in - ``/var/lib/ceph/mon`` in a safe location, or delete it if you are - confident the remaining monitors are healthy and are sufficiently - redundant. - -.. _Changing a Monitor's IP address: - -Changing a Monitor's IP Address -=============================== - -.. important:: Existing monitors are not supposed to change their IP addresses. - -Monitors are critical components of a Ceph cluster, and they need to maintain a -quorum for the whole system to work properly. To establish a quorum, the -monitors need to discover each other. Ceph has strict requirements for -discovering monitors. - -Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. -However, monitors discover each other using the monitor map, not ``ceph.conf``. -For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you -need to obtain the current monmap for the cluster when creating a new monitor, -as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The -following sections explain the consistency requirements for Ceph monitors, and a -few safe ways to change a monitor's IP address. 
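A convenient way to see which addresses the monitors themselves have agreed upon (as opposed to whatever ``ceph.conf`` currently says) is to extract and print the monmap; both commands are used again in the procedures below. For example::

    ceph mon getmap -o /tmp/monmap
    monmaptool --print /tmp/monmap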
- - -Consistency Requirements ------------------------- - -A monitor always refers to the local copy of the monmap when discovering other -monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids -errors that could break the cluster (e.g., typos in ``ceph.conf`` when -specifying a monitor address or port). Since monitors use monmaps for discovery -and they share monmaps with clients and other Ceph daemons, the monmap provides -monitors with a strict guarantee that their consensus is valid. - -Strict consistency also applies to updates to the monmap. As with any other -updates on the monitor, changes to the monmap always run through a distributed -consensus algorithm called `Paxos`_. The monitors must agree on each update to -the monmap, such as adding or removing a monitor, to ensure that each monitor in -the quorum has the same version of the monmap. Updates to the monmap are -incremental so that monitors have the latest agreed upon version, and a set of -previous versions, allowing a monitor that has an older version of the monmap to -catch up with the current state of the cluster. - -If monitors discovered each other through the Ceph configuration file instead of -through the monmap, it would introduce additional risks because the Ceph -configuration files are not updated and distributed automatically. Monitors -might inadvertently use an older ``ceph.conf`` file, fail to recognize a -monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able -to determine the current state of the system accurately. Consequently, making -changes to an existing monitor's IP address must be done with great care. - - -Changing a Monitor's IP address (The Right Way) ------------------------------------------------ - -Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to -ensure that other monitors in the cluster will receive the update. To change a -monitor's IP address, you must add a new monitor with the IP address you want -to use (as described in `Adding a Monitor (Manual)`_), ensure that the new -monitor successfully joins the quorum; then, remove the monitor that uses the -old IP address. Then, update the ``ceph.conf`` file to ensure that clients and -other daemons know the IP address of the new monitor. - -For example, lets assume there are three monitors in place, such as :: - - [mon.a] - host = host01 - addr = 10.0.0.1:6789 - [mon.b] - host = host02 - addr = 10.0.0.2:6789 - [mon.c] - host = host03 - addr = 10.0.0.3:6789 - -To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the -steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure -that ``mon.d`` is running before removing ``mon.c``, or it will break the -quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving -all three monitors would thus require repeating this process as many times as -needed. - - -Changing a Monitor's IP address (The Messy Way) ------------------------------------------------ - -There may come a time when the monitors must be moved to a different network, a -different part of the datacenter or a different datacenter altogether. While it -is possible to do it, the process becomes a bit more hazardous. - -In such a case, the solution is to generate a new monmap with updated IP -addresses for all the monitors in the cluster, and inject the new map on each -individual monitor. 
This is not the most user-friendly approach, but we do not -expect this to be something that needs to be done every other week. As it is -clearly stated on the top of this section, monitors are not supposed to change -IP addresses. - -Using the previous monitor configuration as an example, assume you want to move -all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these -networks are unable to communicate. Use the following procedure: - -#. Retrieve the monitor map, where ``{tmp}`` is the path to - the retrieved monitor map, and ``{filename}`` is the name of the file - containing the retrieved monitor monitor map. :: - - ceph mon getmap -o {tmp}/{filename} - -#. The following example demonstrates the contents of the monmap. :: - - $ monmaptool --print {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - epoch 1 - fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 - last_changed 2012-12-17 02:46:41.591248 - created 2012-12-17 02:46:41.591248 - 0: 10.0.0.1:6789/0 mon.a - 1: 10.0.0.2:6789/0 mon.b - 2: 10.0.0.3:6789/0 mon.c - -#. Remove the existing monitors. :: - - $ monmaptool --rm a --rm b --rm c {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - monmaptool: removing a - monmaptool: removing b - monmaptool: removing c - monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) - -#. Add the new monitor locations. :: - - $ monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) - -#. Check new contents. :: - - $ monmaptool --print {tmp}/{filename} - - monmaptool: monmap file {tmp}/{filename} - epoch 1 - fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 - last_changed 2012-12-17 02:46:41.591248 - created 2012-12-17 02:46:41.591248 - 0: 10.1.0.1:6789/0 mon.a - 1: 10.1.0.2:6789/0 mon.b - 2: 10.1.0.3:6789/0 mon.c - -At this point, we assume the monitors (and stores) are installed at the new -location. The next step is to propagate the modified monmap to the new -monitors, and inject the modified monmap into each new monitor. - -#. First, make sure to stop all your monitors. Injection must be done while - the daemon is not running. - -#. Inject the monmap. :: - - ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} - -#. Restart the monitors. - -After this step, migration to the new location is complete and -the monitors should operate successfully. - - -.. _Manual Deployment: ../../../install/manual-deployment -.. _Monitor Bootstrap: ../../../dev/mon-bootstrap -.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science) diff --git a/src/ceph/doc/rados/operations/add-or-rm-osds.rst b/src/ceph/doc/rados/operations/add-or-rm-osds.rst deleted file mode 100644 index 59ce4c7..0000000 --- a/src/ceph/doc/rados/operations/add-or-rm-osds.rst +++ /dev/null @@ -1,366 +0,0 @@ -====================== - Adding/Removing OSDs -====================== - -When you have a cluster up and running, you may add OSDs or remove OSDs -from the cluster at runtime. - -Adding OSDs -=========== - -When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an -OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a -host machine. If your host has multiple storage drives, you may map one -``ceph-osd`` daemon for each drive. - -Generally, it's a good idea to check the capacity of your cluster to see if you -are reaching the upper end of its capacity. 
As your cluster reaches its ``near -full`` ratio, you should add one or more OSDs to expand your cluster's capacity. - -.. warning:: Do not let your cluster reach its ``full ratio`` before - adding an OSD. OSD failures that occur after the cluster reaches - its ``near full`` ratio may cause the cluster to exceed its - ``full ratio``. - -Deploy your Hardware --------------------- - -If you are adding a new host when adding a new OSD, see `Hardware -Recommendations`_ for details on minimum recommendations for OSD hardware. To -add an OSD host to your cluster, first make sure you have an up-to-date version -of Linux installed, and you have made some initial preparations for your -storage drives. See `Filesystem Recommendations`_ for details. - -Add your OSD host to a rack in your cluster, connect it to the network -and ensure that it has network connectivity. See the `Network Configuration -Reference`_ for details. - -.. _Hardware Recommendations: ../../../start/hardware-recommendations -.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations -.. _Network Configuration Reference: ../../configuration/network-config-ref - -Install the Required Software ------------------------------ - -For manually deployed clusters, you must install Ceph packages -manually. See `Installing Ceph (Manual)`_ for details. -You should configure SSH to a user with password-less authentication -and root permissions. - -.. _Installing Ceph (Manual): ../../../install - - -Adding an OSD (Manual) ----------------------- - -This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, -and configures the cluster to distribute data to the OSD. If your host has -multiple drives, you may add an OSD for each drive by repeating this procedure. - -To add an OSD, create a data directory for it, mount a drive to that directory, -add the OSD to the cluster, and then add it to the CRUSH map. - -When you add the OSD to the CRUSH map, consider the weight you give to the new -OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger -hard drives than older hosts in the cluster (i.e., they may have greater -weight). - -.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives - of dissimilar size, you can adjust their weights. However, for best - performance, consider a CRUSH hierarchy with drives of the same type/size. - -#. Create the OSD. If no UUID is given, it will be set automatically when the - OSD starts up. The following command will output the OSD number, which you - will need for subsequent steps. :: - - ceph osd create [{uuid} [{id}]] - - If the optional parameter {id} is given it will be used as the OSD id. - Note, in this case the command may fail if the number is already in use. - - .. warning:: In general, explicitly specifying {id} is not recommended. - IDs are allocated as an array, and skipping entries consumes some extra - memory. This can become significant if there are large gaps and/or - clusters are large. If {id} is not specified, the smallest available is - used. - -#. Create the default directory on your new OSD. :: - - ssh {new-osd-host} - sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} - - -#. If the OSD is for a drive other than the OS drive, prepare it - for use with Ceph, and mount it to the directory you just created:: - - ssh {new-osd-host} - sudo mkfs -t {fstype} /dev/{drive} - sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} - - -#. Initialize the OSD data directory. 
:: - - ssh {new-osd-host} - ceph-osd -i {osd-num} --mkfs --mkkey - - The directory must be empty before you can run ``ceph-osd``. - -#. Register the OSD authentication key. The value of ``ceph`` for - ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your - cluster name differs from ``ceph``, use your cluster name instead.:: - - ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring - - -#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The - ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy - wherever you wish. If you specify at least one bucket, the command - will place the OSD into the most specific bucket you specify, *and* it will - move that bucket underneath any other buckets you specify. **Important:** If - you specify only the root bucket, the command will attach the OSD directly - to the root, but CRUSH rules expect OSDs to be inside of hosts. - - For Argonaut (v 0.48), execute the following:: - - ceph osd crush add {id} {name} {weight} [{bucket-type}={bucket-name} ...] - - For Bobtail (v 0.56) and later releases, execute the following:: - - ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] - - You may also decompile the CRUSH map, add the OSD to the device list, add the - host as a bucket (if it's not already in the CRUSH map), add the device as an - item in the host, assign it a weight, recompile it and set it. See - `Add/Move an OSD`_ for details. - - -.. topic:: Argonaut (v0.48) Best Practices - - To limit impact on user I/O performance, add an OSD to the CRUSH map - with an initial weight of ``0``. Then, ramp up the CRUSH weight a - little bit at a time. For example, to ramp by increments of ``0.2``, - start with:: - - ceph osd crush reweight {osd-id} .2 - - and allow migration to complete before reweighting to ``0.4``, - ``0.6``, and so on until the desired CRUSH weight is reached. - - To limit the impact of OSD failures, you can set:: - - mon osd down out interval = 0 - - which prevents down OSDs from automatically being marked out, and then - ramp them down manually with:: - - ceph osd reweight {osd-num} .8 - - Again, wait for the cluster to finish migrating data, and then adjust - the weight further until you reach a weight of 0. Note that this - problem prevents the cluster to automatically re-replicate data after - a failure, so please ensure that sufficient monitoring is in place for - an administrator to intervene promptly. - - Note that this practice will no longer be necessary in Bobtail and - subsequent releases. - - -Replacing an OSD ----------------- - -When disks fail, or if an admnistrator wants to reprovision OSDs with a new -backend, for instance, for switching from FileStore to BlueStore, OSDs need to -be replaced. Unlike `Removing the OSD`_, replaced OSD's id and CRUSH map entry -need to be keep intact after the OSD is destroyed for replacement. - -#. Destroy the OSD first:: - - ceph osd destroy {id} --yes-i-really-mean-it - -#. Zap a disk for the new OSD, if the disk was used before for other purposes. - It's not necessary for a new disk:: - - ceph-disk zap /dev/sdX - -#. Prepare the disk for replacement by using the previously destroyed OSD id:: - - ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen` - -#. And activate the OSD:: - - ceph-disk activate /dev/sdX1 - - -Starting the OSD ----------------- - -After you add an OSD to Ceph, the OSD is in your configuration. However, -it is not yet running. 
The OSD is ``down`` and ``in``. You must start -your new OSD before it can begin receiving data. You may use -``service ceph`` from your admin host or start the OSD from its host -machine. - -For Ubuntu Trusty use Upstart. :: - - sudo start ceph-osd id={osd-num} - -For all other distros use systemd. :: - - sudo systemctl start ceph-osd@{osd-num} - - -Once you start your OSD, it is ``up`` and ``in``. - - -Observe the Data Migration --------------------------- - -Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing -the server by migrating placement groups to your new OSD. You can observe this -process with the `ceph`_ tool. :: - - ceph -w - -You should see the placement group states change from ``active+clean`` to -``active, some degraded objects``, and finally ``active+clean`` when migration -completes. (Control-c to exit.) - - -.. _Add/Move an OSD: ../crush-map#addosd -.. _ceph: ../monitoring - - - -Removing OSDs (Manual) -====================== - -When you want to reduce the size of a cluster or replace hardware, you may -remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd`` -daemon for one storage drive within a host machine. If your host has multiple -storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. -Generally, it's a good idea to check the capacity of your cluster to see if you -are reaching the upper end of its capacity. Ensure that when you remove an OSD -that your cluster is not at its ``near full`` ratio. - -.. warning:: Do not let your cluster reach its ``full ratio`` when - removing an OSD. Removing OSDs could cause the cluster to reach - or exceed its ``full ratio``. - - -Take the OSD out of the Cluster ------------------------------------ - -Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it -out of the cluster so that Ceph can begin rebalancing and copying its data to -other OSDs. :: - - ceph osd out {osd-num} - - -Observe the Data Migration --------------------------- - -Once you have taken your OSD ``out`` of the cluster, Ceph will begin -rebalancing the cluster by migrating placement groups out of the OSD you -removed. You can observe this process with the `ceph`_ tool. :: - - ceph -w - -You should see the placement group states change from ``active+clean`` to -``active, some degraded objects``, and finally ``active+clean`` when migration -completes. (Control-c to exit.) - -.. note:: Sometimes, typically in a "small" cluster with few hosts (for - instance with a small testing cluster), the fact to take ``out`` the - OSD can spawn a CRUSH corner case where some PGs remain stuck in the - ``active+remapped`` state. If you are in this case, you should mark - the OSD ``in`` with: - - ``ceph osd in {osd-num}`` - - to come back to the initial state and then, instead of marking ``out`` - the OSD, set its weight to 0 with: - - ``ceph osd crush reweight osd.{osd-num} 0`` - - After that, you can observe the data migration which should come to its - end. The difference between marking ``out`` the OSD and reweighting it - to 0 is that in the first case the weight of the bucket which contains - the OSD is not changed whereas in the second case the weight of the bucket - is updated (and decreased of the OSD weight). The reweight command could - be sometimes favoured in the case of a "small" cluster. - - - -Stopping the OSD ----------------- - -After you take an OSD out of the cluster, it may still be running. -That is, the OSD may be ``up`` and ``out``. 
You must stop
your OSD before you remove it from the configuration. ::

	ssh {osd-host}
	sudo systemctl stop ceph-osd@{osd-num}

Once you stop your OSD, it is ``down``.


Removing the OSD
----------------

This procedure removes an OSD from the cluster map, removes its authentication
key, removes the OSD from the OSD map, and removes the OSD from the
``ceph.conf`` file. If your host has multiple drives, you may need to remove an
OSD for each drive by repeating this procedure.

#. Let the cluster forget the OSD first. This step removes the OSD from the
   CRUSH map, removes its authentication key, and removes it from the OSD map
   as well. Please note that the `purge subcommand`_ was introduced in
   Luminous; for older versions, see below. ::

	ceph osd purge {id} --yes-i-really-mean-it

#. Navigate to the host where you keep the master copy of the cluster's
   ``ceph.conf`` file. ::

	ssh {admin-host}
	cd /etc/ceph
	vim ceph.conf

#. Remove the OSD entry from your ``ceph.conf`` file (if it exists). ::

	[osd.1]
	host = {hostname}

#. From the host where you keep the master copy of the cluster's ``ceph.conf``
   file, copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of
   other hosts in your cluster.

If your Ceph cluster is older than Luminous, instead of using ``ceph osd purge``,
you need to perform these steps manually:


#. Remove the OSD from the CRUSH map so that it no longer receives data. You may
   also decompile the CRUSH map, remove the OSD from the device list, remove the
   device as an item in the host bucket or remove the host bucket (if it's in the
   CRUSH map and you intend to remove the host), recompile the map and set it.
   See `Remove an OSD`_ for details. ::

	ceph osd crush remove {name}

#. Remove the OSD authentication key. ::

	ceph auth del osd.{osd-num}

   The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the ``$cluster-$id``.
   If your cluster name differs from ``ceph``, use your cluster name instead.

#. Remove the OSD. ::

	ceph osd rm {osd-num}
	# for example
	ceph osd rm 1


.. _Remove an OSD: ../crush-map#removeosd
.. _purge subcommand: /man/8/ceph#osd

diff --git a/src/ceph/doc/rados/operations/cache-tiering.rst b/src/ceph/doc/rados/operations/cache-tiering.rst
deleted file mode 100644
index 322c6ff..0000000
--- a/src/ceph/doc/rados/operations/cache-tiering.rst
+++ /dev/null
@@ -1,461 +0,0 @@
===============
 Cache Tiering
===============

A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
or relatively slower/cheaper devices configured to act as an economical storage
tier. The Ceph objecter handles where to place the objects and the tiering
agent determines when to flush objects from the cache to the backing storage
tier. So the cache tier and the backing storage tier are completely transparent
to Ceph clients.


..
ditaa:: - +-------------+ - | Ceph Client | - +------+------+ - ^ - Tiering is | - Transparent | Faster I/O - to Ceph | +---------------+ - Client Ops | | | - | +----->+ Cache Tier | - | | | | - | | +-----+---+-----+ - | | | ^ - v v | | Active Data in Cache Tier - +------+----+--+ | | - | Objecter | | | - +-----------+--+ | | - ^ | | Inactive Data in Storage Tier - | v | - | +-----+---+-----+ - | | | - +----->| Storage Tier | - | | - +---------------+ - Slower I/O - - -The cache tiering agent handles the migration of data between the cache tier -and the backing storage tier automatically. However, admins have the ability to -configure how this migration takes place. There are two main scenarios: - -- **Writeback Mode:** When admins configure tiers with ``writeback`` mode, Ceph - clients write data to the cache tier and receive an ACK from the cache tier. - In time, the data written to the cache tier migrates to the storage tier - and gets flushed from the cache tier. Conceptually, the cache tier is - overlaid "in front" of the backing storage tier. When a Ceph client needs - data that resides in the storage tier, the cache tiering agent migrates the - data to the cache tier on read, then it is sent to the Ceph client. - Thereafter, the Ceph client can perform I/O using the cache tier, until the - data becomes inactive. This is ideal for mutable data (e.g., photo/video - editing, transactional data, etc.). - -- **Read-proxy Mode:** This mode will use any objects that already - exist in the cache tier, but if an object is not present in the - cache the request will be proxied to the base tier. This is useful - for transitioning from ``writeback`` mode to a disabled cache as it - allows the workload to function properly while the cache is drained, - without adding any new objects to the cache. - -A word of caution -================= - -Cache tiering will *degrade* performance for most workloads. Users should use -extreme caution before using this feature. - -* *Workload dependent*: Whether a cache will improve performance is - highly dependent on the workload. Because there is a cost - associated with moving objects into or out of the cache, it can only - be effective when there is a *large skew* in the access pattern in - the data set, such that most of the requests touch a small number of - objects. The cache pool should be large enough to capture the - working set for your workload to avoid thrashing. - -* *Difficult to benchmark*: Most benchmarks that users run to measure - performance will show terrible performance with cache tiering, in - part because very few of them skew requests toward a small set of - objects, it can take a long time for the cache to "warm up," and - because the warm-up cost can be high. - -* *Usually slower*: For workloads that are not cache tiering-friendly, - performance is often slower than a normal RADOS pool without cache - tiering enabled. - -* *librados object enumeration*: The librados-level object enumeration - API is not meant to be coherent in the presence of the case. If - your applicatoin is using librados directly and relies on object - enumeration, cache tiering will probably not work as expected. - (This is not a problem for RGW, RBD, or CephFS.) - -* *Complexity*: Enabling cache tiering means that a lot of additional - machinery and complexity within the RADOS cluster is being used. 
- This increases the probability that you will encounter a bug in the system - that other users have not yet encountered and will put your deployment at a - higher level of risk. - -Known Good Workloads --------------------- - -* *RGW time-skewed*: If the RGW workload is such that almost all read - operations are directed at recently written objects, a simple cache - tiering configuration that destages recently written objects from - the cache to the base tier after a configurable period can work - well. - -Known Bad Workloads -------------------- - -The following configurations are *known to work poorly* with cache -tiering. - -* *RBD with replicated cache and erasure-coded base*: This is a common - request, but usually does not perform well. Even reasonably skewed - workloads still send some small writes to cold objects, and because - small writes are not yet supported by the erasure-coded pool, entire - (usually 4 MB) objects must be migrated into the cache in order to - satisfy a small (often 4 KB) write. Only a handful of users have - successfully deployed this configuration, and it only works for them - because their data is extremely cold (backups) and they are not in - any way sensitive to performance. - -* *RBD with replicated cache and base*: RBD with a replicated base - tier does better than when the base is erasure coded, but it is - still highly dependent on the amount of skew in the workload, and - very difficult to validate. The user will need to have a good - understanding of their workload and will need to tune the cache - tiering parameters carefully. - - -Setting Up Pools -================ - -To set up cache tiering, you must have two pools. One will act as the -backing storage and the other will act as the cache. - - -Setting Up a Backing Storage Pool ---------------------------------- - -Setting up a backing storage pool typically involves one of two scenarios: - -- **Standard Storage**: In this scenario, the pool stores multiple copies - of an object in the Ceph Storage Cluster. - -- **Erasure Coding:** In this scenario, the pool uses erasure coding to - store data much more efficiently with a small performance tradeoff. - -In the standard storage scenario, you can setup a CRUSH ruleset to establish -the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD -Daemons perform optimally when all storage drives in the ruleset are of the -same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ -for details on creating a ruleset. Once you have created a ruleset, create -a backing storage pool. - -In the erasure coding scenario, the pool creation arguments will generate the -appropriate ruleset automatically. See `Create a Pool`_ for details. - -In subsequent examples, we will refer to the backing storage pool -as ``cold-storage``. - - -Setting Up a Cache Pool ------------------------ - -Setting up a cache pool follows the same procedure as the standard storage -scenario, but with this difference: the drives for the cache tier are typically -high performance drives that reside in their own servers and have their own -ruleset. When setting up a ruleset, it should take account of the hosts that -have the high performance drives while omitting the hosts that don't. See -`Placing Different Pools on Different OSDs`_ for details. - - -In subsequent examples, we will refer to the cache pool as ``hot-storage`` and -the backing pool as ``cold-storage``. - -For cache tier configuration and default values, see -`Pools - Set Pool Values`_. 
- - -Creating a Cache Tier -===================== - -Setting up a cache tier involves associating a backing storage pool with -a cache pool :: - - ceph osd tier add {storagepool} {cachepool} - -For example :: - - ceph osd tier add cold-storage hot-storage - -To set the cache mode, execute the following:: - - ceph osd tier cache-mode {cachepool} {cache-mode} - -For example:: - - ceph osd tier cache-mode hot-storage writeback - -The cache tiers overlay the backing storage tier, so they require one -additional step: you must direct all client traffic from the storage pool to -the cache pool. To direct client traffic directly to the cache pool, execute -the following:: - - ceph osd tier set-overlay {storagepool} {cachepool} - -For example:: - - ceph osd tier set-overlay cold-storage hot-storage - - -Configuring a Cache Tier -======================== - -Cache tiers have several configuration options. You may set -cache tier configuration options with the following usage:: - - ceph osd pool set {cachepool} {key} {value} - -See `Pools - Set Pool Values`_ for details. - - -Target Size and Type --------------------- - -Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``:: - - ceph osd pool set {cachepool} hit_set_type bloom - -For example:: - - ceph osd pool set hot-storage hit_set_type bloom - -The ``hit_set_count`` and ``hit_set_period`` define how much time each HitSet -should cover, and how many such HitSets to store. :: - - ceph osd pool set {cachepool} hit_set_count 12 - ceph osd pool set {cachepool} hit_set_period 14400 - ceph osd pool set {cachepool} target_max_bytes 1000000000000 - -.. note:: A larger ``hit_set_count`` results in more RAM consumed by - the ``ceph-osd`` process. - -Binning accesses over time allows Ceph to determine whether a Ceph client -accessed an object at least once, or more than once over a time period -("age" vs "temperature"). - -The ``min_read_recency_for_promote`` defines how many HitSets to check for the -existence of an object when handling a read operation. The checking result is -used to decide whether to promote the object asynchronously. Its value should be -between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted. -If it's set to 1, the current HitSet is checked. And if this object is in the -current HitSet, it's promoted. Otherwise not. For the other values, the exact -number of archive HitSets are checked. The object is promoted if the object is -found in any of the most recent ``min_read_recency_for_promote`` HitSets. - -A similar parameter can be set for the write operation, which is -``min_write_recency_for_promote``. :: - - ceph osd pool set {cachepool} min_read_recency_for_promote 2 - ceph osd pool set {cachepool} min_write_recency_for_promote 2 - -.. note:: The longer the period and the higher the - ``min_read_recency_for_promote`` and - ``min_write_recency_for_promote``values, the more RAM the ``ceph-osd`` - daemon consumes. In particular, when the agent is active to flush - or evict cache objects, all ``hit_set_count`` HitSets are loaded - into RAM. - - -Cache Sizing ------------- - -The cache tiering agent performs two main functions: - -- **Flushing:** The agent identifies modified (or dirty) objects and forwards - them to the storage pool for long-term storage. - -- **Evicting:** The agent identifies objects that haven't been modified - (or clean) and evicts the least recently used among them from the cache. 
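Both behaviors are driven by the pool settings described in the two
subsections below. As a purely illustrative starting point (the pool name and
every value here are examples, not recommendations), a small cache pool might
combine them as follows::

	ceph osd pool set hot-storage target_max_bytes 107374182400       # ~100 GB cap
	ceph osd pool set hot-storage cache_target_dirty_ratio 0.4        # start flushing at 40% dirty
	ceph osd pool set hot-storage cache_target_full_ratio 0.8         # start evicting at 80% full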
Absolute Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects based upon the total number
of bytes or the total number of objects. To specify a maximum number of bytes,
execute the following::

	ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TB, execute the following::

	ceph osd pool set hot-storage target_max_bytes 1099511627776


To specify the maximum number of objects, execute the following::

	ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict at 1M objects, execute the following::

	ceph osd pool set hot-storage target_max_objects 1000000

.. note:: Ceph is not able to determine the size of a cache pool automatically,
   so an absolute size must be configured here; otherwise, flushing and
   evicting will not work. If you specify both limits, the cache tiering
   agent will begin flushing or evicting when either threshold is triggered.

.. note:: All client requests will be blocked only when ``target_max_bytes`` or
   ``target_max_objects`` is reached.

Relative Sizing
~~~~~~~~~~~~~~~

The cache tiering agent can flush or evict objects relative to the size of the
cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
`Absolute Sizing`_). When the cache pool consists of a certain percentage of
modified (or dirty) objects, the cache tiering agent will flush them to the
storage pool. To set the ``cache_target_dirty_ratio``, execute the following::

	ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to ``0.4`` will begin flushing modified
(dirty) objects when they reach 40% of the cache pool's capacity::

	ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When the dirty objects reach a certain percentage of the cache pool's capacity,
the agent flushes them at a higher speed. To set the
``cache_target_dirty_high_ratio``::

	ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to ``0.6`` will begin aggressively flushing
dirty objects when they reach 60% of the cache pool's capacity. The value
should be set between the dirty ratio and the full ratio::

	ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its capacity, the cache
tiering agent will evict objects to maintain free capacity.
To set the -``cache_target_full_ratio``, execute the following:: - - ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0} - -For example, setting the value to ``0.8`` will begin flushing unmodified -(clean) objects when they reach 80% of the cache pool's capacity:: - - ceph osd pool set hot-storage cache_target_full_ratio 0.8 - - -Cache Age ---------- - -You can specify the minimum age of an object before the cache tiering agent -flushes a recently modified (or dirty) object to the backing storage pool:: - - ceph osd pool set {cachepool} cache_min_flush_age {#seconds} - -For example, to flush modified (or dirty) objects after 10 minutes, execute -the following:: - - ceph osd pool set hot-storage cache_min_flush_age 600 - -You can specify the minimum age of an object before it will be evicted from -the cache tier:: - - ceph osd pool {cache-tier} cache_min_evict_age {#seconds} - -For example, to evict objects after 30 minutes, execute the following:: - - ceph osd pool set hot-storage cache_min_evict_age 1800 - - -Removing a Cache Tier -===================== - -Removing a cache tier differs depending on whether it is a writeback -cache or a read-only cache. - - -Removing a Read-Only Cache --------------------------- - -Since a read-only cache does not have modified data, you can disable -and remove it without losing any recent changes to objects in the cache. - -#. Change the cache-mode to ``none`` to disable it. :: - - ceph osd tier cache-mode {cachepool} none - - For example:: - - ceph osd tier cache-mode hot-storage none - -#. Remove the cache pool from the backing pool. :: - - ceph osd tier remove {storagepool} {cachepool} - - For example:: - - ceph osd tier remove cold-storage hot-storage - - - -Removing a Writeback Cache --------------------------- - -Since a writeback cache may have modified data, you must take steps to ensure -that you do not lose any recent changes to objects in the cache before you -disable and remove it. - - -#. Change the cache mode to ``forward`` so that new and modified objects will - flush to the backing storage pool. :: - - ceph osd tier cache-mode {cachepool} forward - - For example:: - - ceph osd tier cache-mode hot-storage forward - - -#. Ensure that the cache pool has been flushed. This may take a few minutes:: - - rados -p {cachepool} ls - - If the cache pool still has objects, you can flush them manually. - For example:: - - rados -p {cachepool} cache-flush-evict-all - - -#. Remove the overlay so that clients will not direct traffic to the cache. :: - - ceph osd tier remove-overlay {storagetier} - - For example:: - - ceph osd tier remove-overlay cold-storage - - -#. Finally, remove the cache tier pool from the backing storage pool. :: - - ceph osd tier remove {storagepool} {cachepool} - - For example:: - - ceph osd tier remove cold-storage hot-storage - - -.. _Create a Pool: ../pools#create-a-pool -.. _Pools - Set Pool Values: ../pools#set-pool-values -.. _Placing Different Pools on Different OSDs: ../crush-map/#placing-different-pools-on-different-osds -.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter -.. _CRUSH Maps: ../crush-map -.. _Absolute Sizing: #absolute-sizing diff --git a/src/ceph/doc/rados/operations/control.rst b/src/ceph/doc/rados/operations/control.rst deleted file mode 100644 index 1a58076..0000000 --- a/src/ceph/doc/rados/operations/control.rst +++ /dev/null @@ -1,453 +0,0 @@ -.. 
index:: control, commands - -================== - Control Commands -================== - - -Monitor Commands -================ - -Monitor commands are issued using the ceph utility:: - - ceph [-m monhost] {command} - -The command is usually (though not always) of the form:: - - ceph {subsystem} {command} - - -System Commands -=============== - -Execute the following to display the current status of the cluster. :: - - ceph -s - ceph status - -Execute the following to display a running summary of the status of the cluster, -and major events. :: - - ceph -w - -Execute the following to show the monitor quorum, including which monitors are -participating and which one is the leader. :: - - ceph quorum_status - -Execute the following to query the status of a single monitor, including whether -or not it is in the quorum. :: - - ceph [-m monhost] mon_status - - -Authentication Subsystem -======================== - -To add a keyring for an OSD, execute the following:: - - ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} - -To list the cluster's keys and their capabilities, execute the following:: - - ceph auth ls - - -Placement Group Subsystem -========================= - -To display the statistics for all placement groups, execute the following:: - - ceph pg dump [--format {format}] - -The valid formats are ``plain`` (default) and ``json``. - -To display the statistics for all placement groups stuck in a specified state, -execute the following:: - - ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] - - -``--format`` may be ``plain`` (default) or ``json`` - -``--threshold`` defines how many seconds "stuck" is (default: 300) - -**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD -with the most up-to-date data to come back. - -**Unclean** Placement groups contain objects that are not replicated the desired number -of times. They should be recovering. - -**Stale** Placement groups are in an unknown state - the OSDs that host them have not -reported to the monitor cluster in a while (configured by -``mon_osd_report_timeout``). - -Delete "lost" objects or revert them to their prior state, either a previous version -or delete them if they were just created. :: - - ceph pg {pgid} mark_unfound_lost revert|delete - - -OSD Subsystem -============= - -Query OSD subsystem status. :: - - ceph osd stat - -Write a copy of the most recent OSD map to a file. See -`osdmaptool`_. :: - - ceph osd getmap -o file - -.. _osdmaptool: ../../man/8/osdmaptool - -Write a copy of the crush map from the most recent OSD map to -file. :: - - ceph osd getcrushmap -o file - -The foregoing functionally equivalent to :: - - ceph osd getmap -o /tmp/osdmap - osdmaptool /tmp/osdmap --export-crush file - -Dump the OSD map. Valid formats for ``-f`` are ``plain`` and ``json``. If no -``--format`` option is given, the OSD map is dumped as plain text. :: - - ceph osd dump [--format {format}] - -Dump the OSD map as a tree with one line per OSD containing weight -and state. :: - - ceph osd tree [--format {format}] - -Find out where a specific object is or would be stored in the system:: - - ceph osd map <pool-name> <object-name> - -Add or move a new item (OSD) with the given id/name/weight at the specified -location. :: - - ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] - -Remove an existing item (OSD) from the CRUSH map. :: - - ceph osd crush remove {name} - -Remove an existing bucket from the CRUSH map. 
:: - - ceph osd crush remove {bucket-name} - -Move an existing bucket from one position in the hierarchy to another. :: - - ceph osd crush move {id} {loc1} [{loc2} ...] - -Set the weight of the item given by ``{name}`` to ``{weight}``. :: - - ceph osd crush reweight {name} {weight} - -Mark an OSD as lost. This may result in permanent data loss. Use with caution. :: - - ceph osd lost {id} [--yes-i-really-mean-it] - -Create a new OSD. If no UUID is given, it will be set automatically when the OSD -starts up. :: - - ceph osd create [{uuid}] - -Remove the given OSD(s). :: - - ceph osd rm [{id}...] - -Query the current max_osd parameter in the OSD map. :: - - ceph osd getmaxosd - -Import the given crush map. :: - - ceph osd setcrushmap -i file - -Set the ``max_osd`` parameter in the OSD map. This is necessary when -expanding the storage cluster. :: - - ceph osd setmaxosd - -Mark OSD ``{osd-num}`` down. :: - - ceph osd down {osd-num} - -Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). :: - - ceph osd out {osd-num} - -Mark ``{osd-num}`` in the distribution (i.e. allocated data). :: - - ceph osd in {osd-num} - -Set or clear the pause flags in the OSD map. If set, no IO requests -will be sent to any OSD. Clearing the flags via unpause results in -resending pending requests. :: - - ceph osd pause - ceph osd unpause - -Set the weight of ``{osd-num}`` to ``{weight}``. Two OSDs with the -same weight will receive roughly the same number of I/O requests and -store approximately the same amount of data. ``ceph osd reweight`` -sets an override weight on the OSD. This value is in the range 0 to 1, -and forces CRUSH to re-place (1-weight) of the data that would -otherwise live on this drive. It does not change the weights assigned -to the buckets above the OSD in the crush map, and is a corrective -measure in case the normal CRUSH distribution is not working out quite -right. For instance, if one of your OSDs is at 90% and the others are -at 50%, you could reduce this weight to try and compensate for it. :: - - ceph osd reweight {osd-num} {weight} - -Reweights all the OSDs by reducing the weight of OSDs which are -heavily overused. By default it will adjust the weights downward on -OSDs which have 120% of the average utilization, but if you include -threshold it will use that percentage instead. :: - - ceph osd reweight-by-utilization [threshold] - -Describes what reweight-by-utilization would do. :: - - ceph osd test-reweight-by-utilization - -Adds/removes the address to/from the blacklist. When adding an address, -you can specify how long it should be blacklisted in seconds; otherwise, -it will default to 1 hour. A blacklisted address is prevented from -connecting to any OSD. Blacklisting is most often used to prevent a -lagging metadata server from making bad changes to data on the OSDs. - -These commands are mostly only useful for failure testing, as -blacklists are normally maintained automatically and shouldn't need -manual intervention. :: - - ceph osd blacklist add ADDRESS[:source_port] [TIME] - ceph osd blacklist rm ADDRESS[:source_port] - -Creates/deletes a snapshot of a pool. :: - - ceph osd pool mksnap {pool-name} {snap-name} - ceph osd pool rmsnap {pool-name} {snap-name} - -Creates/deletes/renames a storage pool. :: - - ceph osd pool create {pool-name} pg_num [pgp_num] - ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] - ceph osd pool rename {old-name} {new-name} - -Changes a pool setting. 
:: - - ceph osd pool set {pool-name} {field} {value} - -Valid fields are: - - * ``size``: Sets the number of copies of data in the pool. - * ``pg_num``: The placement group number. - * ``pgp_num``: Effective number when calculating pg placement. - * ``crush_ruleset``: rule number for mapping placement. - -Get the value of a pool setting. :: - - ceph osd pool get {pool-name} {field} - -Valid fields are: - - * ``pg_num``: The placement group number. - * ``pgp_num``: Effective number of placement groups when calculating placement. - * ``lpg_num``: The number of local placement groups. - * ``lpgp_num``: The number used for placing the local placement groups. - - -Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. :: - - ceph osd scrub {osd-num} - -Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. :: - - ceph osd repair N - -Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES`` -in write requests of ``BYTES_PER_WRITE`` each. By default, the test -writes 1 GB in total in 4-MB increments. -The benchmark is non-destructive and will not overwrite existing live -OSD data, but might temporarily affect the performance of clients -concurrently accessing the OSD. :: - - ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] - - -MDS Subsystem -============= - -Change configuration parameters on a running mds. :: - - ceph tell mds.{mds-id} injectargs --{switch} {value} [--{switch} {value}] - -Example:: - - ceph tell mds.0 injectargs --debug_ms 1 --debug_mds 10 - -Enables debug messages. :: - - ceph mds stat - -Displays the status of all metadata servers. :: - - ceph mds fail 0 - -Marks the active MDS as failed, triggering failover to a standby if present. - -.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap - - -Mon Subsystem -============= - -Show monitor stats:: - - ceph mon stat - - e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c - - -The ``quorum`` list at the end lists monitor nodes that are part of the current quorum. - -This is also available more directly:: - - ceph quorum_status -f json-pretty - -.. code-block:: javascript - - { - "election_epoch": 6, - "quorum": [ - 0, - 1, - 2 - ], - "quorum_names": [ - "a", - "b", - "c" - ], - "quorum_leader_name": "a", - "monmap": { - "epoch": 2, - "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", - "modified": "2016-12-26 14:42:09.288066", - "created": "2016-12-26 14:42:03.573585", - "features": { - "persistent": [ - "kraken" - ], - "optional": [] - }, - "mons": [ - { - "rank": 0, - "name": "a", - "addr": "127.0.0.1:40000\/0", - "public_addr": "127.0.0.1:40000\/0" - }, - { - "rank": 1, - "name": "b", - "addr": "127.0.0.1:40001\/0", - "public_addr": "127.0.0.1:40001\/0" - }, - { - "rank": 2, - "name": "c", - "addr": "127.0.0.1:40002\/0", - "public_addr": "127.0.0.1:40002\/0" - } - ] - } - } - - -The above will block until a quorum is reached. - -For a status of just the monitor you connect to (use ``-m HOST:PORT`` -to select):: - - ceph mon_status -f json-pretty - - -.. 
code-block:: javascript - - { - "name": "b", - "rank": 1, - "state": "peon", - "election_epoch": 6, - "quorum": [ - 0, - 1, - 2 - ], - "features": { - "required_con": "9025616074522624", - "required_mon": [ - "kraken" - ], - "quorum_con": "1152921504336314367", - "quorum_mon": [ - "kraken" - ] - }, - "outside_quorum": [], - "extra_probe_peers": [], - "sync_provider": [], - "monmap": { - "epoch": 2, - "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", - "modified": "2016-12-26 14:42:09.288066", - "created": "2016-12-26 14:42:03.573585", - "features": { - "persistent": [ - "kraken" - ], - "optional": [] - }, - "mons": [ - { - "rank": 0, - "name": "a", - "addr": "127.0.0.1:40000\/0", - "public_addr": "127.0.0.1:40000\/0" - }, - { - "rank": 1, - "name": "b", - "addr": "127.0.0.1:40001\/0", - "public_addr": "127.0.0.1:40001\/0" - }, - { - "rank": 2, - "name": "c", - "addr": "127.0.0.1:40002\/0", - "public_addr": "127.0.0.1:40002\/0" - } - ] - } - } - -A dump of the monitor state:: - - ceph mon dump - - dumped monmap epoch 2 - epoch 2 - fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc - last_changed 2016-12-26 14:42:09.288066 - created 2016-12-26 14:42:03.573585 - 0: 127.0.0.1:40000/0 mon.a - 1: 127.0.0.1:40001/0 mon.b - 2: 127.0.0.1:40002/0 mon.c - diff --git a/src/ceph/doc/rados/operations/crush-map-edits.rst b/src/ceph/doc/rados/operations/crush-map-edits.rst deleted file mode 100644 index 5222270..0000000 --- a/src/ceph/doc/rados/operations/crush-map-edits.rst +++ /dev/null @@ -1,654 +0,0 @@ -Manually editing a CRUSH Map -============================ - -.. note:: Manually editing the CRUSH map is considered an advanced - administrator operation. All CRUSH changes that are - necessary for the overwhelming majority of installations are - possible via the standard ceph CLI and do not require manual - CRUSH map edits. If you have identified a use case where - manual edits *are* necessary, consider contacting the Ceph - developers so that future versions of Ceph can make this - unnecessary. - -To edit an existing CRUSH map: - -#. `Get the CRUSH map`_. -#. `Decompile`_ the CRUSH map. -#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. -#. `Recompile`_ the CRUSH map. -#. `Set the CRUSH map`_. - -To activate CRUSH map rules for a specific pool, identify the common ruleset -number for those rules and specify that ruleset number for the pool. See `Set -Pool Values`_ for details. - -.. _Get the CRUSH map: #getcrushmap -.. _Decompile: #decompilecrushmap -.. _Devices: #crushmapdevices -.. _Buckets: #crushmapbuckets -.. _Rules: #crushmaprules -.. _Recompile: #compilecrushmap -.. _Set the CRUSH map: #setcrushmap -.. _Set Pool Values: ../pools#setpoolvalues - -.. _getcrushmap: - -Get a CRUSH Map ---------------- - -To get the CRUSH map for your cluster, execute the following:: - - ceph osd getcrushmap -o {compiled-crushmap-filename} - -Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since -the CRUSH map is in a compiled form, you must decompile it first before you can -edit it. - -.. _decompilecrushmap: - -Decompile a CRUSH Map ---------------------- - -To decompile a CRUSH map, execute the following:: - - crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} - - -Sections --------- - -There are six main sections to a CRUSH Map. - -#. **tunables:** The preamble at the top of the map described any *tunables* - for CRUSH behavior that vary from the historical/legacy CRUSH behavior. 
These - correct for old bugs, optimizations, or other changes in behavior that have - been made over the years to improve CRUSH's behavior. - -#. **devices:** Devices are individual ``ceph-osd`` daemons that can - store data. - -#. **types**: Bucket ``types`` define the types of buckets used in - your CRUSH hierarchy. Buckets consist of a hierarchical aggregation - of storage locations (e.g., rows, racks, chassis, hosts, etc.) and - their assigned weights. - -#. **buckets:** Once you define bucket types, you must define each node - in the hierarchy, its type, and which devices or other nodes it - containes. - -#. **rules:** Rules define policy about how data is distributed across - devices in the hierarchy. - -#. **choose_args:** Choose_args are alternative weights associated with - the hierarchy that have been adjusted to optimize data placement. A single - choose_args map can be used for the entire cluster, or one can be - created for each individual pool. - - -.. _crushmapdevices: - -CRUSH Map Devices ------------------ - -Devices are individual ``ceph-osd`` daemons that can store data. You -will normally have one defined here for each OSD daemon in your -cluster. Devices are identified by an id (a non-negative integer) and -a name, normally ``osd.N`` where ``N`` is the device id. - -Devices may also have a *device class* associated with them (e.g., -``hdd`` or ``ssd``), allowing them to be conveniently targetted by a -crush rule. - -:: - - # devices - device {num} {osd.name} [class {class}] - -For example:: - - # devices - device 0 osd.0 class ssd - device 1 osd.1 class hdd - device 2 osd.2 - device 3 osd.3 - -In most cases, each device maps to a single ``ceph-osd`` daemon. This -is normally a single storage device, a pair of devices (for example, -one for data and one for a journal or metadata), or in some cases a -small RAID device. - - - - - -CRUSH Map Bucket Types ----------------------- - -The second list in the CRUSH map defines 'bucket' types. Buckets facilitate -a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent -physical locations in a hierarchy. Nodes aggregate other nodes or leaves. -Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage -media. - -.. tip:: The term "bucket" used in the context of CRUSH means a node in - the hierarchy, i.e. a location or a piece of physical hardware. It - is a different concept from the term "bucket" when used in the - context of RADOS Gateway APIs. - -To add a bucket type to the CRUSH map, create a new line under your list of -bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. -By convention, there is one leaf bucket and it is ``type 0``; however, you may -give it any name you like (e.g., osd, disk, drive, storage, etc.):: - - #types - type {num} {bucket-name} - -For example:: - - # types - type 0 osd - type 1 host - type 2 chassis - type 3 rack - type 4 row - type 5 pdu - type 6 pod - type 7 room - type 8 datacenter - type 9 region - type 10 root - - - -.. _crushmapbuckets: - -CRUSH Map Bucket Hierarchy --------------------------- - -The CRUSH algorithm distributes data objects among storage devices according -to a per-device weight value, approximating a uniform probability distribution. -CRUSH distributes objects and their replicas according to the hierarchical -cluster map you define. Your CRUSH map represents the available storage -devices and the logical elements that contain them. 
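Because placement is computed entirely from the map, you can check how a
candidate map would distribute data before installing it in the cluster by
exercising the compiled map offline with ``crushtool``. The rule number and
replica count below are illustrative only::

	crushtool -i {compiled-crushmap-filename} --test --rule 0 --num-rep 3 --show-mappings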
- -To map placement groups to OSDs across failure domains, a CRUSH map defines a -hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH -map). The purpose of creating a bucket hierarchy is to segregate the -leaf nodes by their failure domains, such as hosts, chassis, racks, power -distribution units, pods, rows, rooms, and data centers. With the exception of -the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and -you may define it according to your own needs. - -We recommend adapting your CRUSH map to your firms's hardware naming conventions -and using instances names that reflect the physical hardware. Your naming -practice can make it easier to administer the cluster and troubleshoot -problems when an OSD and/or other hardware malfunctions and the administrator -need access to physical hardware. - -In the following example, the bucket hierarchy has a leaf bucket named ``osd``, -and two node buckets named ``host`` and ``rack`` respectively. - -.. ditaa:: - +-----------+ - | {o}rack | - | Bucket | - +-----+-----+ - | - +---------------+---------------+ - | | - +-----+-----+ +-----+-----+ - | {o}host | | {o}host | - | Bucket | | Bucket | - +-----+-----+ +-----+-----+ - | | - +-------+-------+ +-------+-------+ - | | | | - +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ - | osd | | osd | | osd | | osd | - | Bucket | | Bucket | | Bucket | | Bucket | - +-----------+ +-----------+ +-----------+ +-----------+ - -.. note:: The higher numbered ``rack`` bucket type aggregates the lower - numbered ``host`` bucket type. - -Since leaf nodes reflect storage devices declared under the ``#devices`` list -at the beginning of the CRUSH map, you do not need to declare them as bucket -instances. The second lowest bucket type in your hierarchy usually aggregates -the devices (i.e., it's usually the computer containing the storage media, and -uses whatever term you prefer to describe it, such as "node", "computer", -"server," "host", "machine", etc.). In high density environments, it is -increasingly common to see multiple hosts/nodes per chassis. You should account -for chassis failure too--e.g., the need to pull a chassis if a node fails may -result in bringing down numerous hosts/nodes and their OSDs. - -When declaring a bucket instance, you must specify its type, give it a unique -name (string), assign it a unique ID expressed as a negative integer (optional), -specify a weight relative to the total capacity/capability of its item(s), -specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``, -reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. -The items may consist of node buckets or leaves. Items may have a weight that -reflects the relative weight of the item. - -You may declare a node bucket with the following syntax:: - - [bucket-type] [bucket-name] { - id [a unique negative numeric ID] - weight [the relative capacity/capability of the item(s)] - alg [the bucket type: uniform | list | tree | straw ] - hash [the hash type: 0 by default] - item [item-name] weight [weight] - } - -For example, using the diagram above, we would define two host buckets -and one rack bucket. 
The OSDs are declared as items within the host buckets:: - - host node1 { - id -1 - alg straw - hash 0 - item osd.0 weight 1.00 - item osd.1 weight 1.00 - } - - host node2 { - id -2 - alg straw - hash 0 - item osd.2 weight 1.00 - item osd.3 weight 1.00 - } - - rack rack1 { - id -3 - alg straw - hash 0 - item node1 weight 2.00 - item node2 weight 2.00 - } - -.. note:: In the foregoing example, note that the rack bucket does not contain - any OSDs. Rather it contains lower level host buckets, and includes the - sum total of their weight in the item entry. - -.. topic:: Bucket Types - - Ceph supports four bucket types, each representing a tradeoff between - performance and reorganization efficiency. If you are unsure of which bucket - type to use, we recommend using a ``straw`` bucket. For a detailed - discussion of bucket types, refer to - `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, - and more specifically to **Section 3.4**. The bucket types are: - - #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same - weight. For example, when firms commission or decommission hardware, they - typically do so with many machines that have exactly the same physical - configuration (e.g., bulk purchases). When storage devices have exactly - the same weight, you may use the ``uniform`` bucket type, which allows - CRUSH to map replicas into uniform buckets in constant time. With - non-uniform weights, you should use another bucket algorithm. - - #. **List**: List buckets aggregate their content as linked lists. Based on - the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, - a list is a natural and intuitive choice for an **expanding cluster**: - either an object is relocated to the newest device with some appropriate - probability, or it remains on the older devices as before. The result is - optimal data migration when items are added to the bucket. Items removed - from the middle or tail of the list, however, can result in a significant - amount of unnecessary movement, making list buckets most suitable for - circumstances in which they **never (or very rarely) shrink**. - - #. **Tree**: Tree buckets use a binary search tree. They are more efficient - than list buckets when a bucket contains a larger set of items. Based on - the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, - tree buckets reduce the placement time to O(log :sub:`n`), making them - suitable for managing much larger sets of devices or nested buckets. - - #. **Straw:** List and Tree buckets use a divide and conquer strategy - in a way that either gives certain items precedence (e.g., those - at the beginning of a list) or obviates the need to consider entire - subtrees of items at all. That improves the performance of the replica - placement process, but can also introduce suboptimal reorganization - behavior when the contents of a bucket change due an addition, removal, - or re-weighting of an item. The straw bucket type allows all items to - fairly “compete” against each other for replica placement through a - process analogous to a draw of straws. - -.. topic:: Hash - - Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. - Enter ``0`` as your hash setting to select ``rjenkins1``. - - -.. _weightingbucketitems: - -.. topic:: Weighting Bucket Items - - Ceph expresses bucket weights as doubles, which allows for fine - weighting. A weight is the relative difference between device capacities. 
We - recommend using ``1.00`` as the relative weight for a 1TB storage device. - In such a scenario, a weight of ``0.5`` would represent approximately 500GB, - and a weight of ``3.00`` would represent approximately 3TB. Higher level - buckets have a weight that is the sum total of the leaf items aggregated by - the bucket. - - A bucket item weight is one dimensional, but you may also calculate your - item weights to reflect the performance of the storage drive. For example, - if you have many 1TB drives where some have relatively low data transfer - rate and the others have a relatively high data transfer rate, you may - weight them differently, even though they have the same capacity (e.g., - a weight of 0.80 for the first set of drives with lower total throughput, - and 1.20 for the second set of drives with higher total throughput). - - -.. _crushmaprules: - -CRUSH Map Rules ---------------- - -CRUSH maps support the notion of 'CRUSH rules', which are the rules that -determine data placement for a pool. For large clusters, you will likely create -many pools where each pool may have its own CRUSH ruleset and rules. The default -CRUSH map has a rule for each pool, and one ruleset assigned to each of the -default pools. - -.. note:: In most cases, you will not need to modify the default rules. When - you create a new pool, its default ruleset is ``0``. - - -CRUSH rules define placement and replication strategies or distribution policies -that allow you to specify exactly how CRUSH places object replicas. For -example, you might create a rule selecting a pair of targets for 2-way -mirroring, another rule for selecting three targets in two different data -centers for 3-way mirroring, and yet another rule for erasure coding over six -storage devices. For a detailed discussion of CRUSH rules, refer to -`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, -and more specifically to **Section 3.2**. - -A rule takes the following form:: - - rule <rulename> { - - ruleset <ruleset> - type [ replicated | erasure ] - min_size <min-size> - max_size <max-size> - step take <bucket-name> [class <device-class>] - step [choose|chooseleaf] [firstn|indep] <N> <bucket-type> - step emit - } - - -``ruleset`` - -:Description: A means of classifying a rule as belonging to a set of rules. - Activated by `setting the ruleset in a pool`_. - -:Purpose: A component of the rule mask. -:Type: Integer -:Required: Yes -:Default: 0 - -.. _setting the ruleset in a pool: ../pools#setpoolvalues - - -``type`` - -:Description: Describes a rule for either a storage drive (replicated) - or a RAID. - -:Purpose: A component of the rule mask. -:Type: String -:Required: Yes -:Default: ``replicated`` -:Valid Values: Currently only ``replicated`` and ``erasure`` - -``min_size`` - -:Description: If a pool makes fewer replicas than this number, CRUSH will - **NOT** select this rule. - -:Type: Integer -:Purpose: A component of the rule mask. -:Required: Yes -:Default: ``1`` - -``max_size`` - -:Description: If a pool makes more replicas than this number, CRUSH will - **NOT** select this rule. - -:Type: Integer -:Purpose: A component of the rule mask. -:Required: Yes -:Default: 10 - - -``step take <bucket-name> [class <device-class>]`` - -:Description: Takes a bucket name, and begins iterating down the tree. - If the ``device-class`` is specified, it must match - a class previously used when defining a device. All - devices that do not belong to the class are excluded. -:Purpose: A component of the rule. 
-:Required: Yes -:Example: ``step take data`` - - -``step choose firstn {num} type {bucket-type}`` - -:Description: Selects the number of buckets of the given type. The number is - usually the number of replicas in the pool (i.e., pool size). - - - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). - - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. - - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. - -:Purpose: A component of the rule. -:Prerequisite: Follows ``step take`` or ``step choose``. -:Example: ``step choose firstn 1 type row`` - - -``step chooseleaf firstn {num} type {bucket-type}`` - -:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf - node from the subtree of each bucket in the set of buckets. The - number of buckets in the set is usually the number of replicas in - the pool (i.e., pool size). - - - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). - - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. - - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. - -:Purpose: A component of the rule. Usage removes the need to select a device using two steps. -:Prerequisite: Follows ``step take`` or ``step choose``. -:Example: ``step chooseleaf firstn 0 type row`` - - - -``step emit`` - -:Description: Outputs the current value and empties the stack. Typically used - at the end of a rule, but may also be used to pick from different - trees in the same rule. - -:Purpose: A component of the rule. -:Prerequisite: Follows ``step choose``. -:Example: ``step emit`` - -.. important:: To activate one or more rules with a common ruleset number to a - pool, set the ruleset number of the pool. - - -Placing Different Pools on Different OSDS: -========================================== - -Suppose you want to have most pools default to OSDs backed by large hard drives, -but have some pools mapped to OSDs backed by fast solid-state drives (SSDs). -It's possible to have multiple independent CRUSH hierarchies within the same -CRUSH map. 
Define two hierarchies with two different root nodes--one for hard -disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown -below:: - - device 0 osd.0 - device 1 osd.1 - device 2 osd.2 - device 3 osd.3 - device 4 osd.4 - device 5 osd.5 - device 6 osd.6 - device 7 osd.7 - - host ceph-osd-ssd-server-1 { - id -1 - alg straw - hash 0 - item osd.0 weight 1.00 - item osd.1 weight 1.00 - } - - host ceph-osd-ssd-server-2 { - id -2 - alg straw - hash 0 - item osd.2 weight 1.00 - item osd.3 weight 1.00 - } - - host ceph-osd-platter-server-1 { - id -3 - alg straw - hash 0 - item osd.4 weight 1.00 - item osd.5 weight 1.00 - } - - host ceph-osd-platter-server-2 { - id -4 - alg straw - hash 0 - item osd.6 weight 1.00 - item osd.7 weight 1.00 - } - - root platter { - id -5 - alg straw - hash 0 - item ceph-osd-platter-server-1 weight 2.00 - item ceph-osd-platter-server-2 weight 2.00 - } - - root ssd { - id -6 - alg straw - hash 0 - item ceph-osd-ssd-server-1 weight 2.00 - item ceph-osd-ssd-server-2 weight 2.00 - } - - rule data { - ruleset 0 - type replicated - min_size 2 - max_size 2 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule metadata { - ruleset 1 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule rbd { - ruleset 2 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule platter { - ruleset 3 - type replicated - min_size 0 - max_size 10 - step take platter - step chooseleaf firstn 0 type host - step emit - } - - rule ssd { - ruleset 4 - type replicated - min_size 0 - max_size 4 - step take ssd - step chooseleaf firstn 0 type host - step emit - } - - rule ssd-primary { - ruleset 5 - type replicated - min_size 5 - max_size 10 - step take ssd - step chooseleaf firstn 1 type host - step emit - step take platter - step chooseleaf firstn -1 type host - step emit - } - -You can then set a pool to use the SSD rule by:: - - ceph osd pool set <poolname> crush_ruleset 4 - -Similarly, using the ``ssd-primary`` rule will cause each placement group in the -pool to be placed with an SSD as the primary and platters as the replicas. - - -Tuning CRUSH, the hard way --------------------------- - -If you can ensure that all clients are running recent code, you can -adjust the tunables by extracting the CRUSH map, modifying the values, -and reinjecting it into the cluster. - -* Extract the latest CRUSH map:: - - ceph osd getcrushmap -o /tmp/crush - -* Adjust tunables. These values appear to offer the best behavior - for both large and small clusters we tested with. You will need to - additionally specify the ``--enable-unsafe-tunables`` argument to - ``crushtool`` for this to work. Please use this option with - extreme care.:: - - crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new - -* Reinject modified map:: - - ceph osd setcrushmap -i /tmp/crush.new - -Legacy values -------------- - -For reference, the legacy values for the CRUSH tunables can be set -with:: - - crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy - -Again, the special ``--enable-unsafe-tunables`` option is required. 
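To confirm which tunable values a running cluster is currently using (for
example, before and after reinjecting a modified map), you can dump them
with::

	ceph osd crush show-tunables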
-Further, as noted above, be careful running old versions of the -``ceph-osd`` daemon after reverting to legacy values as the feature -bit is not perfectly enforced. diff --git a/src/ceph/doc/rados/operations/crush-map.rst b/src/ceph/doc/rados/operations/crush-map.rst deleted file mode 100644 index 05fa4ff..0000000 --- a/src/ceph/doc/rados/operations/crush-map.rst +++ /dev/null @@ -1,956 +0,0 @@ -============ - CRUSH Maps -============ - -The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm -determines how to store and retrieve data by computing data storage locations. -CRUSH empowers Ceph clients to communicate with OSDs directly rather than -through a centralized server or broker. With an algorithmically determined -method of storing and retrieving data, Ceph avoids a single point of failure, a -performance bottleneck, and a physical limit to its scalability. - -CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly -store and retrieve data in OSDs with a uniform distribution of data across the -cluster. For a detailed discussion of CRUSH, see -`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ - -CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of -'buckets' for aggregating the devices into physical locations, and a list of -rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By -reflecting the underlying physical organization of the installation, CRUSH can -model—and thereby address—potential sources of correlated device failures. -Typical sources include physical proximity, a shared power source, and a shared -network. By encoding this information into the cluster map, CRUSH placement -policies can separate object replicas across different failure domains while -still maintaining the desired distribution. For example, to address the -possibility of concurrent failures, it may be desirable to ensure that data -replicas are on devices using different shelves, racks, power supplies, -controllers, and/or physical locations. - -When you deploy OSDs they are automatically placed within the CRUSH map under a -``host`` node named with the hostname for the host they are running on. This, -combined with the default CRUSH failure domain, ensures that replicas or erasure -code shards are separated across hosts and a single host failure will not -affect availability. For larger clusters, however, administrators should carefully consider their choice of failure domain. Separating replicas across racks, -for example, is common for mid- to large-sized clusters. - - -CRUSH Location -============== - -The location of an OSD in terms of the CRUSH map's hierarchy is -referred to as a ``crush location``. This location specifier takes the -form of a list of key and value pairs describing a position. For -example, if an OSD is in a particular row, rack, chassis and host, and -is part of the 'default' CRUSH tree (this is the case for the vast -majority of clusters), its crush location could be described as:: - - root=default row=a rack=a2 chassis=a2a host=a2a1 - -Note: - -#. Note that the order of the keys does not matter. -#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default - these include root, datacenter, room, row, pod, pdu, rack, chassis and host, - but those types can be customized to be anything appropriate by modifying - the CRUSH map. -#. Not all keys need to be specified. 
For example, by default, Ceph - automatically sets a ``ceph-osd`` daemon's location to be - ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). - -The crush location for an OSD is normally expressed via the ``crush location`` -config option being set in the ``ceph.conf`` file. Each time the OSD starts, -it verifies it is in the correct location in the CRUSH map and, if it is not, -it moved itself. To disable this automatic CRUSH map management, add the -following to your configuration file in the ``[osd]`` section:: - - osd crush update on start = false - - -Custom location hooks ---------------------- - -A customized location hook can be used to generate a more complete -crush location on startup. The sample ``ceph-crush-location`` utility -will generate a CRUSH location string for a given daemon. The -location is based on, in order of preference: - -#. A ``crush location`` option in ceph.conf. -#. A default of ``root=default host=HOSTNAME`` where the hostname is - generated with the ``hostname -s`` command. - -This is not useful by itself, as the OSD itself has the exact same -behavior. However, the script can be modified to provide additional -location fields (for example, the rack or datacenter), and then the -hook enabled via the config option:: - - crush location hook = /path/to/customized-ceph-crush-location - -This hook is passed several arguments (below) and should output a single line -to stdout with the CRUSH location description.:: - - $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE - -where the cluster name is typically 'ceph', the id is the daemon -identifier (the OSD number), and the daemon type is typically ``osd``. - - -CRUSH structure -=============== - -The CRUSH map consists of, loosely speaking, a hierarchy describing -the physical topology of the cluster, and a set of rules defining -policy about how we place data on those devices. The hierarchy has -devices (``ceph-osd`` daemons) at the leaves, and internal nodes -corresponding to other physical features or groupings: hosts, racks, -rows, datacenters, and so on. The rules describe how replicas are -placed in terms of that hierarchy (e.g., 'three replicas in different -racks'). - -Devices -------- - -Devices are individual ``ceph-osd`` daemons that can store data. You -will normally have one defined here for each OSD daemon in your -cluster. Devices are identified by an id (a non-negative integer) and -a name, normally ``osd.N`` where ``N`` is the device id. - -Devices may also have a *device class* associated with them (e.g., -``hdd`` or ``ssd``), allowing them to be conveniently targetted by a -crush rule. - -Types and Buckets ------------------ - -A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, -racks, rows, etc. The CRUSH map defines a series of *types* that are -used to describe these nodes. By default, these types include: - -- osd (or device) -- host -- chassis -- rack -- row -- pdu -- pod -- room -- datacenter -- region -- root - -Most clusters make use of only a handful of these types, and others -can be defined as needed. - -The hierarchy is built with devices (normally type ``osd``) at the -leaves, interior nodes with non-device types, and a root node of type -``root``. For example, - -.. 
ditaa:: - - +-----------------+ - | {o}root default | - +--------+--------+ - | - +---------------+---------------+ - | | - +-------+-------+ +-----+-------+ - | {o}host foo | | {o}host bar | - +-------+-------+ +-----+-------+ - | | - +-------+-------+ +-------+-------+ - | | | | - +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ - | osd.0 | | osd.1 | | osd.2 | | osd.3 | - +-----------+ +-----------+ +-----------+ +-----------+ - -Each node (device or bucket) in the hierarchy has a *weight* -associated with it, indicating the relative proportion of the total -data that device or hierarchy subtree should store. Weights are set -at the leaves, indicating the size of the device, and automatically -sum up the tree from there, such that the weight of the default node -will be the total of all devices contained beneath it. Normally -weights are in units of terabytes (TB). - -You can get a simple view the CRUSH hierarchy for your cluster, -including the weights, with:: - - ceph osd crush tree - -Rules ------ - -Rules define policy about how data is distributed across the devices -in the hierarchy. - -CRUSH rules define placement and replication strategies or -distribution policies that allow you to specify exactly how CRUSH -places object replicas. For example, you might create a rule selecting -a pair of targets for 2-way mirroring, another rule for selecting -three targets in two different data centers for 3-way mirroring, and -yet another rule for erasure coding over six storage devices. For a -detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, -Scalable, Decentralized Placement of Replicated Data`_, and more -specifically to **Section 3.2**. - -In almost all cases, CRUSH rules can be created via the CLI by -specifying the *pool type* they will be used for (replicated or -erasure coded), the *failure domain*, and optionally a *device class*. -In rare cases rules must be written by hand by manually editing the -CRUSH map. - -You can see what rules are defined for your cluster with:: - - ceph osd crush rule ls - -You can view the contents of the rules with:: - - ceph osd crush rule dump - -Device classes --------------- - -Each device can optionally have a *class* associated with it. By -default, OSDs automatically set their class on startup to either -`hdd`, `ssd`, or `nvme` based on the type of device they are backed -by. - -The device class for one or more OSDs can be explicitly set with:: - - ceph osd crush set-device-class <class> <osd-name> [...] - -Once a device class is set, it cannot be changed to another class -until the old class is unset with:: - - ceph osd crush rm-device-class <osd-name> [...] - -This allows administrators to set device classes without the class -being changed on OSD restart or by some other script. - -A placement rule that targets a specific device class can be created with:: - - ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> - -A pool can then be changed to use the new rule with:: - - ceph osd pool set <pool-name> crush_rule <rule-name> - -Device classes are implemented by creating a "shadow" CRUSH hierarchy -for each device class in use that contains only devices of that class. -Rules can then distribute data over the shadow hierarchy. One nice -thing about this approach is that it is fully backward compatible with -old Ceph clients. 
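-
-For example, the following sequence ties these commands together (a sketch
-only; the OSD ids, the rule name ``fast-rule`` and the pool name
-``fast-pool`` are illustrative)::
-
-    # clear any automatically assigned class, then explicitly mark
-    # two OSDs as ssd
-    ceph osd crush rm-device-class osd.0 osd.1
-    ceph osd crush set-device-class ssd osd.0 osd.1
-
-    # create a replicated rule that places data only on ssd devices,
-    # separating replicas across hosts under the default root
-    ceph osd crush rule create-replicated fast-rule default host ssd
-
-    # switch an existing pool over to the new rule
-    ceph osd pool set fast-pool crush_rule fast-rule
-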
You can view the CRUSH hierarchy with shadow items -with:: - - ceph osd crush tree --show-shadow - - -Weights sets ------------- - -A *weight set* is an alternative set of weights to use when -calculating data placement. The normal weights associated with each -device in the CRUSH map are set based on the device size and indicate -how much data we *should* be storing where. However, because CRUSH is -based on a pseudorandom placement process, there is always some -variation from this ideal distribution, the same way that rolling a -dice sixty times will not result in rolling exactly 10 ones and 10 -sixes. Weight sets allow the cluster to do a numerical optimization -based on the specifics of your cluster (hierarchy, pools, etc.) to achieve -a balanced distribution. - -There are two types of weight sets supported: - - #. A **compat** weight set is a single alternative set of weights for - each device and node in the cluster. This is not well-suited for - correcting for all anomalies (for example, placement groups for - different pools may be different sizes and have different load - levels, but will be mostly treated the same by the balancer). - However, compat weight sets have the huge advantage that they are - *backward compatible* with previous versions of Ceph, which means - that even though weight sets were first introduced in Luminous - v12.2.z, older clients (e.g., firefly) can still connect to the - cluster when a compat weight set is being used to balance data. - #. A **per-pool** weight set is more flexible in that it allows - placement to be optimized for each data pool. Additionally, - weights can be adjusted for each position of placement, allowing - the optimizer to correct for a suble skew of data toward devices - with small weights relative to their peers (and effect that is - usually only apparently in very large clusters but which can cause - balancing problems). - -When weight sets are in use, the weights associated with each node in -the hierarchy is visible as a separate column (labeled either -``(compat)`` or the pool name) from the command:: - - ceph osd crush tree - -When both *compat* and *per-pool* weight sets are in use, data -placement for a particular pool will use its own per-pool weight set -if present. If not, it will use the compat weight set if present. If -neither are present, it will use the normal CRUSH weights. - -Although weight sets can be set up and manipulated by hand, it is -recommended that the *balancer* module be enabled to do so -automatically. - - -Modifying the CRUSH map -======================= - -.. _addosd: - -Add/Move an OSD ---------------- - -.. note: OSDs are normally automatically added to the CRUSH map when - the OSD is created. This command is rarely needed. - -To add or move an OSD in the CRUSH map of a running cluster:: - - ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...] - -Where: - -``name`` - -:Description: The full name of the OSD. -:Type: String -:Required: Yes -:Example: ``osd.0`` - - -``weight`` - -:Description: The CRUSH weight for the OSD, normally its size measure in terabytes (TB). -:Type: Double -:Required: Yes -:Example: ``2.0`` - - -``root`` - -:Description: The root node of the tree in which the OSD resides (normally ``default``) -:Type: Key/value pair. -:Required: Yes -:Example: ``root=default`` - - -``bucket-type`` - -:Description: You may specify the OSD's location in the CRUSH hierarchy. -:Type: Key/value pairs. 
-:Required: No -:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` - - -The following example adds ``osd.0`` to the hierarchy, or moves the -OSD from a previous location. :: - - ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 - - -Adjust OSD weight ------------------ - -.. note: Normally OSDs automatically add themselves to the CRUSH map - with the correct weight when they are created. This command - is rarely needed. - -To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute -the following:: - - ceph osd crush reweight {name} {weight} - -Where: - -``name`` - -:Description: The full name of the OSD. -:Type: String -:Required: Yes -:Example: ``osd.0`` - - -``weight`` - -:Description: The CRUSH weight for the OSD. -:Type: Double -:Required: Yes -:Example: ``2.0`` - - -.. _removeosd: - -Remove an OSD -------------- - -.. note: OSDs are normally removed from the CRUSH as part of the - ``ceph osd purge`` command. This command is rarely needed. - -To remove an OSD from the CRUSH map of a running cluster, execute the -following:: - - ceph osd crush remove {name} - -Where: - -``name`` - -:Description: The full name of the OSD. -:Type: String -:Required: Yes -:Example: ``osd.0`` - - -Add a Bucket ------------- - -.. note: Buckets are normally implicitly created when an OSD is added - that specifies a ``{bucket-type}={bucket-name}`` as part of its - location and a bucket with that name does not already exist. This - command is typically used when manually adjusting the structure of the - hierarchy after OSDs have been created (for example, to move a - series of hosts underneath a new rack-level bucket). - -To add a bucket in the CRUSH map of a running cluster, execute the -``ceph osd crush add-bucket`` command:: - - ceph osd crush add-bucket {bucket-name} {bucket-type} - -Where: - -``bucket-name`` - -:Description: The full name of the bucket. -:Type: String -:Required: Yes -:Example: ``rack12`` - - -``bucket-type`` - -:Description: The type of the bucket. The type must already exist in the hierarchy. -:Type: String -:Required: Yes -:Example: ``rack`` - - -The following example adds the ``rack12`` bucket to the hierarchy:: - - ceph osd crush add-bucket rack12 rack - -Move a Bucket -------------- - -To move a bucket to a different location or position in the CRUSH map -hierarchy, execute the following:: - - ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] - -Where: - -``bucket-name`` - -:Description: The name of the bucket to move/reposition. -:Type: String -:Required: Yes -:Example: ``foo-bar-1`` - -``bucket-type`` - -:Description: You may specify the bucket's location in the CRUSH hierarchy. -:Type: Key/value pairs. -:Required: No -:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` - -Remove a Bucket ---------------- - -To remove a bucket from the CRUSH map hierarchy, execute the following:: - - ceph osd crush remove {bucket-name} - -.. note:: A bucket must be empty before removing it from the CRUSH hierarchy. - -Where: - -``bucket-name`` - -:Description: The name of the bucket that you'd like to remove. -:Type: String -:Required: Yes -:Example: ``rack12`` - -The following example removes the ``rack12`` bucket from the hierarchy:: - - ceph osd crush remove rack12 - -Creating a compat weight set ----------------------------- - -.. note: This step is normally done automatically by the ``balancer`` - module when enabled. 
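-
-If you would rather let the balancer maintain the compat weight set for
-you, a minimal sketch (assuming a Luminous or later cluster where the
-``balancer`` manager module is available) is::
-
-    ceph mgr module enable balancer
-    ceph balancer mode crush-compat
-    ceph balancer on
-
-The commands below are only needed when managing the weight set by hand.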
- -To create a *compat* weight set:: - - ceph osd crush weight-set create-compat - -Weights for the compat weight set can be adjusted with:: - - ceph osd crush weight-set reweight-compat {name} {weight} - -The compat weight set can be destroyed with:: - - ceph osd crush weight-set rm-compat - -Creating per-pool weight sets ------------------------------ - -To create a weight set for a specific pool,:: - - ceph osd crush weight-set create {pool-name} {mode} - -.. note:: Per-pool weight sets require that all servers and daemons - run Luminous v12.2.z or later. - -Where: - -``pool-name`` - -:Description: The name of a RADOS pool -:Type: String -:Required: Yes -:Example: ``rbd`` - -``mode`` - -:Description: Either ``flat`` or ``positional``. A *flat* weight set - has a single weight for each device or bucket. A - *positional* weight set has a potentially different - weight for each position in the resulting placement - mapping. For example, if a pool has a replica count of - 3, then a positional weight set will have three weights - for each device and bucket. -:Type: String -:Required: Yes -:Example: ``flat`` - -To adjust the weight of an item in a weight set:: - - ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} - -To list existing weight sets,:: - - ceph osd crush weight-set ls - -To remove a weight set,:: - - ceph osd crush weight-set rm {pool-name} - -Creating a rule for a replicated pool -------------------------------------- - -For a replicated pool, the primary decision when creating the CRUSH -rule is what the failure domain is going to be. For example, if a -failure domain of ``host`` is selected, then CRUSH will ensure that -each replica of the data is stored on a different host. If ``rack`` -is selected, then each replica will be stored in a different rack. -What failure domain you choose primarily depends on the size of your -cluster and how your hierarchy is structured. - -Normally, the entire cluster hierarchy is nested beneath a root node -named ``default``. If you have customized your hierarchy, you may -want to create a rule nested at some other node in the hierarchy. It -doesn't matter what type is associated with that node (it doesn't have -to be a ``root`` node). - -It is also possible to create a rule that restricts data placement to -a specific *class* of device. By default, Ceph OSDs automatically -classify themselves as either ``hdd`` or ``ssd``, depending on the -underlying type of device being used. These classes can also be -customized. - -To create a replicated rule,:: - - ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] - -Where: - -``name`` - -:Description: The name of the rule -:Type: String -:Required: Yes -:Example: ``rbd-rule`` - -``root`` - -:Description: The name of the node under which data should be placed. -:Type: String -:Required: Yes -:Example: ``default`` - -``failure-domain-type`` - -:Description: The type of CRUSH nodes across which we should separate replicas. -:Type: String -:Required: Yes -:Example: ``rack`` - -``class`` - -:Description: The device class data should be placed on. -:Type: String -:Required: No -:Example: ``ssd`` - -Creating a rule for an erasure coded pool ------------------------------------------ - -For an erasure-coded pool, the same basic decisions need to be made as -with a replicated pool: what is the failure domain, what node in the -hierarchy will data be placed under (usually ``default``), and will -placement be restricted to a specific device class. 
Erasure code -pools are created a bit differently, however, because they need to be -constructed carefully based on the erasure code being used. For this reason, -you must include this information in the *erasure code profile*. A CRUSH -rule will then be created from that either explicitly or automatically when -the profile is used to create a pool. - -The erasure code profiles can be listed with:: - - ceph osd erasure-code-profile ls - -An existing profile can be viewed with:: - - ceph osd erasure-code-profile get {profile-name} - -Normally profiles should never be modified; instead, a new profile -should be created and used when creating a new pool or creating a new -rule for an existing pool. - -An erasure code profile consists of a set of key=value pairs. Most of -these control the behavior of the erasure code that is encoding data -in the pool. Those that begin with ``crush-``, however, affect the -CRUSH rule that is created. - -The erasure code profile properties of interest are: - - * **crush-root**: the name of the CRUSH node to place data under [default: ``default``]. - * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``]. - * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used]. - * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. - -Once a profile is defined, you can create a CRUSH rule with:: - - ceph osd crush rule create-erasure {name} {profile-name} - -.. note: When creating a new pool, it is not actually necessary to - explicitly create the rule. If the erasure code profile alone is - specified and the rule argument is left off then Ceph will create - the CRUSH rule automatically. - -Deleting rules --------------- - -Rules that are not in use by pools can be deleted with:: - - ceph osd crush rule rm {rule-name} - - -Tunables -======== - -Over time, we have made (and continue to make) improvements to the -CRUSH algorithm used to calculate the placement of data. In order to -support the change in behavior, we have introduced a series of tunable -options that control whether the legacy or improved variation of the -algorithm is used. - -In order to use newer tunables, both clients and servers must support -the new version of CRUSH. For this reason, we have created -``profiles`` that are named after the Ceph version in which they were -introduced. For example, the ``firefly`` tunables are first supported -in the firefly release, and will not work with older (e.g., dumpling) -clients. Once a given set of tunables are changed from the legacy -default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older -clients who do not support the new CRUSH features from connecting to -the cluster. - -argonaut (legacy) ------------------ - -The legacy CRUSH behavior used by argonaut and older releases works -fine for most clusters, provided there are not too many OSDs that have -been marked out. - -bobtail (CRUSH_TUNABLES2) -------------------------- - -The bobtail tunable profile fixes a few key misbehaviors: - - * For hierarchies with a small number of devices in the leaf buckets, - some PGs map to fewer than the desired number of replicas. This - commonly happens for hierarchies with "host" nodes with a small - number (1-3) of OSDs nested beneath each one. - - * For large clusters, some small percentages of PGs map to less than - the desired number of OSDs. 
This is more prevalent when there are - several layers of the hierarchy (e.g., row, rack, host, osd). - - * When some OSDs are marked out, the data tends to get redistributed - to nearby OSDs instead of across the entire hierarchy. - -The new tunables are: - - * ``choose_local_tries``: Number of local retries. Legacy value is - 2, optimal value is 0. - - * ``choose_local_fallback_tries``: Legacy value is 5, optimal value - is 0. - - * ``choose_total_tries``: Total number of attempts to choose an item. - Legacy value was 19, subsequent testing indicates that a value of - 50 is more appropriate for typical clusters. For extremely large - clusters, a larger value might be necessary. - - * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt - will retry, or only try once and allow the original placement to - retry. Legacy default is 0, optimal value is 1. - -Migration impact: - - * Moving from argonaut to bobtail tunables triggers a moderate amount - of data movement. Use caution on a cluster that is already - populated with data. - -firefly (CRUSH_TUNABLES3) -------------------------- - -The firefly tunable profile fixes a problem -with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG -mappings with too few results when too many OSDs have been marked out. - -The new tunable is: - - * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will - start with a non-zero value of r, based on how many attempts the - parent has already made. Legacy default is 0, but with this value - CRUSH is sometimes unable to find a mapping. The optimal value (in - terms of computational cost and correctness) is 1. - -Migration impact: - - * For existing clusters that have lots of existing data, changing - from 0 to 1 will cause a lot of data to move; a value of 4 or 5 - will allow CRUSH to find a valid mapping but will make less data - move. - -straw_calc_version tunable (introduced with Firefly too) --------------------------------------------------------- - -There were some problems with the internal weights calculated and -stored in the CRUSH map for ``straw`` buckets. Specifically, when -there were items with a CRUSH weight of 0 or both a mix of weights and -some duplicated weights CRUSH would distribute data incorrectly (i.e., -not in proportion to the weights). - -The new tunable is: - - * ``straw_calc_version``: A value of 0 preserves the old, broken - internal weight calculation; a value of 1 fixes the behavior. - -Migration impact: - - * Moving to straw_calc_version 1 and then adjusting a straw bucket - (by adding, removing, or reweighting an item, or by using the - reweight-all command) can trigger a small to moderate amount of - data movement *if* the cluster has hit one of the problematic - conditions. - -This tunable option is special because it has absolutely no impact -concerning the required kernel version in the client side. - -hammer (CRUSH_V4) ------------------ - -The hammer tunable profile does not affect the -mapping of existing CRUSH maps simply by changing the profile. However: - - * There is a new bucket type (``straw2``) supported. The new - ``straw2`` bucket type fixes several limitations in the original - ``straw`` bucket. Specifically, the old ``straw`` buckets would - change some mappings that should have changed when a weight was - adjusted, while ``straw2`` achieves the original goal of only - changing mappings to or from the bucket item whose weight has - changed. - - * ``straw2`` is the default for any newly created buckets. 
- -Migration impact: - - * Changing a bucket type from ``straw`` to ``straw2`` will result in - a reasonably small amount of data movement, depending on how much - the bucket item weights vary from each other. When the weights are - all the same no data will move, and when item weights vary - significantly there will be more movement. - -jewel (CRUSH_TUNABLES5) ------------------------ - -The jewel tunable profile improves the -overall behavior of CRUSH such that significantly fewer mappings -change when an OSD is marked out of the cluster. - -The new tunable is: - - * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will - use a better value for an inner loop that greatly reduces the number - of mapping changes when an OSD is marked out. The legacy value is 0, - while the new value of 1 uses the new approach. - -Migration impact: - - * Changing this value on an existing cluster will result in a very - large amount of data movement as almost every PG mapping is likely - to change. - - - - -Which client versions support CRUSH_TUNABLES --------------------------------------------- - - * argonaut series, v0.48.1 or later - * v0.49 or later - * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_TUNABLES2 ---------------------------------------------- - - * v0.55 or later, including bobtail series (v0.56.x) - * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_TUNABLES3 ---------------------------------------------- - - * v0.78 (firefly) or later - * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_V4 --------------------------------------- - - * v0.94 (hammer) or later - * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) - -Which client versions support CRUSH_TUNABLES5 ---------------------------------------------- - - * v10.0.2 (jewel) or later - * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) - -Warning when tunables are non-optimal -------------------------------------- - -Starting with version v0.74, Ceph will issue a health warning if the -current CRUSH tunables don't include all the optimal values from the -``default`` profile (see below for the meaning of the ``default`` profile). -To make this warning go away, you have two options: - -1. Adjust the tunables on the existing cluster. Note that this will - result in some data movement (possibly as much as 10%). This is the - preferred route, but should be taken with care on a production cluster - where the data movement may affect performance. You can enable optimal - tunables with:: - - ceph osd crush tunables optimal - - If things go poorly (e.g., too much load) and not very much - progress has been made, or there is a client compatibility problem - (old kernel cephfs or rbd clients, or pre-bobtail librados - clients), you can switch back with:: - - ceph osd crush tunables legacy - -2. 
You can make the warning go away without making any changes to CRUSH by - adding the following option to your ceph.conf ``[mon]`` section:: - - mon warn on legacy crush tunables = false - - For the change to take effect, you will need to restart the monitors, or - apply the option to running monitors with:: - - ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables - - -A few important points ----------------------- - - * Adjusting these values will result in the shift of some PGs between - storage nodes. If the Ceph cluster is already storing a lot of - data, be prepared for some fraction of the data to move. - * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the - feature bits of new connections as soon as they get - the updated map. However, already-connected clients are - effectively grandfathered in, and will misbehave if they do not - support the new feature. - * If the CRUSH tunables are set to non-legacy values and then later - changed back to the defult values, ``ceph-osd`` daemons will not be - required to support the feature. However, the OSD peering process - requires examining and understanding old maps. Therefore, you - should not run old versions of the ``ceph-osd`` daemon - if the cluster has previously used non-legacy CRUSH values, even if - the latest version of the map has been switched back to using the - legacy defaults. - -Tuning CRUSH ------------- - -The simplest way to adjust the crush tunables is by changing to a known -profile. Those are: - - * ``legacy``: the legacy behavior from argonaut and earlier. - * ``argonaut``: the legacy values supported by the original argonaut release - * ``bobtail``: the values supported by the bobtail release - * ``firefly``: the values supported by the firefly release - * ``hammer``: the values supported by the hammer release - * ``jewel``: the values supported by the jewel release - * ``optimal``: the best (ie optimal) values of the current version of Ceph - * ``default``: the default values of a new cluster installed from - scratch. These values, which depend on the current version of Ceph, - are hard coded and are generally a mix of optimal and legacy values. - These values generally match the ``optimal`` profile of the previous - LTS release, or the most recent release for which we generally except - more users to have up to date clients for. - -You can select a profile on a running cluster with the command:: - - ceph osd crush tunables {PROFILE} - -Note that this may result in some data movement. - - -.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf - - -Primary Affinity -================ - -When a Ceph Client reads or writes data, it always contacts the primary OSD in -the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an -OSD is not well suited to act as a primary compared to other OSDs (e.g., it has -a slow disk or a slow controller). To prevent performance bottlenecks -(especially on read operations) while maximizing utilization of your hardware, -you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use -the OSD as a primary in an acting set. :: - - ceph osd primary-affinity <osd-id> <weight> - -Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You -may set the OSD primary range from ``0-1``, where ``0`` means that the OSD may -**NOT** be used as a primary and ``1`` means that an OSD may be used as a -primary. 
When the weight is ``< 1``, it is less likely that CRUSH will select -the Ceph OSD Daemon to act as a primary. - - - diff --git a/src/ceph/doc/rados/operations/data-placement.rst b/src/ceph/doc/rados/operations/data-placement.rst deleted file mode 100644 index 27966b0..0000000 --- a/src/ceph/doc/rados/operations/data-placement.rst +++ /dev/null @@ -1,37 +0,0 @@ -========================= - Data Placement Overview -========================= - -Ceph stores, replicates and rebalances data objects across a RADOS cluster -dynamically. With many different users storing objects in different pools for -different purposes on countless OSDs, Ceph operations require some data -placement planning. The main data placement planning concepts in Ceph include: - -- **Pools:** Ceph stores data within pools, which are logical groups for storing - objects. Pools manage the number of placement groups, the number of replicas, - and the ruleset for the pool. To store data in a pool, you must have - an authenticated user with permissions for the pool. Ceph can snapshot pools. - See `Pools`_ for additional details. - -- **Placement Groups:** Ceph maps objects to placement groups (PGs). - Placement groups (PGs) are shards or fragments of a logical object pool - that place objects as a group into OSDs. Placement groups reduce the amount - of per-object metadata when Ceph stores the data in OSDs. A larger number of - placement groups (e.g., 100 per OSD) leads to better balancing. See - `Placement Groups`_ for additional details. - -- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without - performance bottlenecks, without limitations to scalability, and without a - single point of failure. CRUSH maps provide the physical topology of the - cluster to the CRUSH algorithm to determine where the data for an object - and its replicas should be stored, and how to do so across failure domains - for added data safety among other things. See `CRUSH Maps`_ for additional - details. - -When you initially set up a test cluster, you can use the default values. Once -you begin planning for a large Ceph cluster, refer to pools, placement groups -and CRUSH for data placement operations. - -.. _Pools: ../pools -.. _Placement Groups: ../placement-groups -.. _CRUSH Maps: ../crush-map diff --git a/src/ceph/doc/rados/operations/erasure-code-isa.rst b/src/ceph/doc/rados/operations/erasure-code-isa.rst deleted file mode 100644 index b52933a..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-isa.rst +++ /dev/null @@ -1,105 +0,0 @@ -======================= -ISA erasure code plugin -======================= - -The *isa* plugin encapsulates the `ISA -<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_ -library. It only runs on Intel processors. - -Create an isa profile -===================== - -To create a new *isa* erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=isa \ - technique={reed_sol_van|cauchy} \ - [k={data-chunks}] \ - [m={coding-chunks}] \ - [crush-root={root}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: No. -:Default: 7 - -``m={coding-chunks}`` - -:Description: Compute **coding chunks** for each object and store them - on different OSDs. 
The number of coding chunks is also - the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: No. -:Default: 3 - -``technique={reed_sol_van|cauchy}`` - -:Description: The ISA plugin comes in two `Reed Solomon - <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_ - forms. If *reed_sol_van* is set, it is `Vandermonde - <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if - *cauchy* is set, it is `Cauchy - <https://en.wikipedia.org/wiki/Cauchy_matrix>`_. - -:Type: String -:Required: No. -:Default: reed_sol_van - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For intance **step take default**. - -:Type: String -:Required: No. -:Default: default - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - -``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. - diff --git a/src/ceph/doc/rados/operations/erasure-code-jerasure.rst b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst deleted file mode 100644 index e8da097..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-jerasure.rst +++ /dev/null @@ -1,120 +0,0 @@ -============================ -Jerasure erasure code plugin -============================ - -The *jerasure* plugin is the most generic and flexible plugin, it is -also the default for Ceph erasure coded pools. - -The *jerasure* plugin encapsulates the `Jerasure -<http://jerasure.org>`_ library. It is -recommended to read the *jerasure* documentation to get a better -understanding of the parameters. - -Create a jerasure profile -========================= - -To create a new *jerasure* erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=jerasure \ - k={data-chunks} \ - m={coding-chunks} \ - technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \ - [crush-root={root}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: Yes. -:Example: 4 - -``m={coding-chunks}`` - -:Description: Compute **coding chunks** for each object and store them - on different OSDs. The number of coding chunks is also - the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: Yes. -:Example: 2 - -``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}`` - -:Description: The more flexible technique is *reed_sol_van* : it is - enough to set *k* and *m*. The *cauchy_good* technique - can be faster but you need to chose the *packetsize* - carefully. 
All of *reed_sol_r6_op*, *liberation*, - *blaum_roth*, *liber8tion* are *RAID6* equivalents in - the sense that they can only be configured with *m=2*. - -:Type: String -:Required: No. -:Default: reed_sol_van - -``packetsize={bytes}`` - -:Description: The encoding will be done on packets of *bytes* size at - a time. Chosing the right packet size is difficult. The - *jerasure* documentation contains extensive information - on this topic. - -:Type: Integer -:Required: No. -:Default: 2048 - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For intance **step take default**. - -:Type: String -:Required: No. -:Default: default - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - - ``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. - diff --git a/src/ceph/doc/rados/operations/erasure-code-lrc.rst b/src/ceph/doc/rados/operations/erasure-code-lrc.rst deleted file mode 100644 index 447ce23..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-lrc.rst +++ /dev/null @@ -1,371 +0,0 @@ -====================================== -Locally repairable erasure code plugin -====================================== - -With the *jerasure* plugin, when an erasure coded object is stored on -multiple OSDs, recovering from the loss of one OSD requires reading -from all the others. For instance if *jerasure* is configured with -*k=8* and *m=4*, losing one OSD requires reading from the eleven -others to repair. - -The *lrc* erasure code plugin creates local parity chunks to be able -to recover using less OSDs. For instance if *lrc* is configured with -*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for -every four OSDs. When a single OSD is lost, it can be recovered with -only four OSDs instead of eleven. 
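-
-A profile matching the example above could be created as follows (a
-sketch; the profile name ``LRCprofile`` and pool name ``lrcpool`` are
-illustrative, and the parameters are described later in this document)::
-
-    $ ceph osd erasure-code-profile set LRCprofile \
-         plugin=lrc \
-         k=8 m=4 l=4 \
-         crush-failure-domain=host
-    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile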
- -Erasure code profile examples -============================= - -Reduce recovery bandwidth between hosts ---------------------------------------- - -Although it is probably not an interesting use case when all hosts are -connected to the same switch, reduced bandwidth usage can actually be -observed.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - k=4 m=2 l=3 \ - crush-failure-domain=host - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - -Reduce recovery bandwidth between racks ---------------------------------------- - -In Firefly the reduced bandwidth will only be observed if the primary -OSD is in the same rack as the lost chunk.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - k=4 m=2 l=3 \ - crush-locality=rack \ - crush-failure-domain=host - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - -Create an lrc profile -===================== - -To create a new lrc erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=lrc \ - k={data-chunks} \ - m={coding-chunks} \ - l={locality} \ - [crush-root={root}] \ - [crush-locality={bucket-type}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: Yes. -:Example: 4 - -``m={coding-chunks}`` - -:Description: Compute **coding chunks** for each object and store them - on different OSDs. The number of coding chunks is also - the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: Yes. -:Example: 2 - -``l={locality}`` - -:Description: Group the coding and data chunks into sets of size - **locality**. For instance, for **k=4** and **m=2**, - when **locality=3** two groups of three are created. - Each set can be recovered without reading chunks - from another set. - -:Type: Integer -:Required: Yes. -:Example: 3 - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For intance **step take default**. - -:Type: String -:Required: No. -:Default: default - -``crush-locality={bucket-type}`` - -:Description: The type of the crush bucket in which each set of chunks - defined by **l** will be stored. For instance, if it is - set to **rack**, each group of **l** chunks will be - placed in a different rack. It is used to create a - ruleset step such as **step choose rack**. If it is not - set, no such grouping is done. - -:Type: String -:Required: No. - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - -``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. 
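-
-Putting several of the options above together, a profile that keeps each
-group of **l** chunks within a rack and restricts placement to a single
-device class might look like this (a sketch; the ``hdd`` class and the
-profile and pool names are illustrative)::
-
-    $ ceph osd erasure-code-profile set LRCprofile \
-         plugin=lrc \
-         k=4 m=2 l=3 \
-         crush-locality=rack \
-         crush-failure-domain=host \
-         crush-device-class=hdd
-    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile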
- -Low level plugin configuration -============================== - -The sum of **k** and **m** must be a multiple of the **l** parameter. -The low level configuration parameters do not impose such a -restriction and it may be more convienient to use it for specific -purposes. It is for instance possible to define two groups, one with 4 -chunks and another with 3 chunks. It is also possible to recursively -define locality sets, for instance datacenters and racks into -datacenters. The **k/m/l** are implemented by generating a low level -configuration. - -The *lrc* erasure code plugin recursively applies erasure code -techniques so that recovering from the loss of some chunks only -requires a subset of the available chunks, most of the time. - -For instance, when three coding steps are described as:: - - chunk nr 01234567 - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -where *c* are coding chunks calculated from the data chunks *D*, the -loss of chunk *7* can be recovered with the last four chunks. And the -loss of chunk *2* chunk can be recovered with the first four -chunks. - -Erasure code profile examples using low level configuration -=========================================================== - -Minimal testing ---------------- - -It is strictly equivalent to using the default erasure code profile. The *DD* -implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used -by default.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=DD_ \ - layers='[ [ "DDc", "" ] ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - -Reduce recovery bandwidth between hosts ---------------------------------------- - -Although it is probably not an interesting use case when all hosts are -connected to the same switch, reduced bandwidth usage can actually be -observed. It is equivalent to **k=4**, **m=2** and **l=3** although -the layout of the chunks is different:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=__DD__DD \ - layers='[ - [ "_cDD_cDD", "" ], - [ "cDDD____", "" ], - [ "____cDDD", "" ], - ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - -Reduce recovery bandwidth between racks ---------------------------------------- - -In Firefly the reduced bandwidth will only be observed if the primary -OSD is in the same rack as the lost chunk.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=__DD__DD \ - layers='[ - [ "_cDD_cDD", "" ], - [ "cDDD____", "" ], - [ "____cDDD", "" ], - ]' \ - crush-steps='[ - [ "choose", "rack", 2 ], - [ "chooseleaf", "host", 4 ], - ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - -Testing with different Erasure Code backends --------------------------------------------- - -LRC now uses jerasure as the default EC backend. It is possible to -specify the EC backend/algorithm on a per layer basis using the low -level configuration. The second argument in layers='[ [ "DDc", "" ] ]' -is actually an erasure code profile to be used for this level. 
The -example below specifies the ISA backend with the cauchy technique to -be used in the lrcpool.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=DD_ \ - layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - -You could also use a different erasure code profile for for each -layer.:: - - $ ceph osd erasure-code-profile set LRCprofile \ - plugin=lrc \ - mapping=__DD__DD \ - layers='[ - [ "_cDD_cDD", "plugin=isa technique=cauchy" ], - [ "cDDD____", "plugin=isa" ], - [ "____cDDD", "plugin=jerasure" ], - ]' - $ ceph osd pool create lrcpool 12 12 erasure LRCprofile - - - -Erasure coding and decoding algorithm -===================================== - -The steps found in the layers description:: - - chunk nr 01234567 - - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -are applied in order. For instance, if a 4K object is encoded, it will -first go thru *step 1* and be divided in four 1K chunks (the four -uppercase D). They are stored in the chunks 2, 3, 6 and 7, in -order. From these, two coding chunks are calculated (the two lowercase -c). The coding chunks are stored in the chunks 1 and 5, respectively. - -The *step 2* re-uses the content created by *step 1* in a similar -fashion and stores a single coding chunk *c* at position 0. The last four -chunks, marked with an underscore (*_*) for readability, are ignored. - -The *step 3* stores a single coding chunk *c* at position 4. The three -chunks created by *step 1* are used to compute this coding chunk, -i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. - -If chunk *2* is lost:: - - chunk nr 01234567 - - step 1 _c D_cDD - step 2 cD D____ - step 3 __ _cDDD - -decoding will attempt to recover it by walking the steps in reverse -order: *step 3* then *step 2* and finally *step 1*. - -The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) -and is skipped. - -The coding chunk from *step 2*, stored in chunk *0*, allows it to -recover the content of chunk *2*. There are no more chunks to recover -and the process stops, without considering *step 1*. - -Recovering chunk *2* requires reading chunks *0, 1, 3* and writing -back chunk *2*. - -If chunk *2, 3, 6* are lost:: - - chunk nr 01234567 - - step 1 _c _c D - step 2 cD __ _ - step 3 __ cD D - -The *step 3* can recover the content of chunk *6*:: - - chunk nr 01234567 - - step 1 _c _cDD - step 2 cD ____ - step 3 __ cDDD - -The *step 2* fails to recover and is skipped because there are two -chunks missing (*2, 3*) and it can only recover from one missing -chunk. - -The coding chunk from *step 1*, stored in chunk *1, 5*, allows it to -recover the content of chunk *2, 3*:: - - chunk nr 01234567 - - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -Controlling crush placement -=========================== - -The default crush ruleset provides OSDs that are on different hosts. For instance:: - - chunk nr 01234567 - - step 1 _cDD_cDD - step 2 cDDD____ - step 3 ____cDDD - -needs exactly *8* OSDs, one for each chunk. If the hosts are in two -adjacent racks, the first four chunks can be placed in the first rack -and the last four in the second rack. So that recovering from the loss -of a single OSD does not require using bandwidth between the two -racks. 
- -For instance:: - - crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]' - -will create a ruleset that will select two crush buckets of type -*rack* and for each of them choose four OSDs, each of them located in -different buckets of type *host*. - -The ruleset can also be manually crafted for finer control. diff --git a/src/ceph/doc/rados/operations/erasure-code-profile.rst b/src/ceph/doc/rados/operations/erasure-code-profile.rst deleted file mode 100644 index ddf772d..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-profile.rst +++ /dev/null @@ -1,121 +0,0 @@ -===================== -Erasure code profiles -===================== - -Erasure code is defined by a **profile** and is used when creating an -erasure coded pool and the associated crush ruleset. - -The **default** erasure code profile (which is created when the Ceph -cluster is initialized) provides the same level of redundancy as two -copies but requires 25% less disk space. It is described as a profile -with **k=2** and **m=1**, meaning the information is spread over three -OSD (k+m == 3) and one of them can be lost. - -To improve redundancy without increasing raw storage requirements, a -new profile can be created. For instance, a profile with **k=10** and -**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an -object on fourteen (k+m=14) OSDs. The object is first divided in -**10** chunks (if the object is 10MB, each chunk is 1MB) and **4** -coding chunks are computed, for recovery (each coding chunk has the -same size as the data chunk, i.e. 1MB). The raw space overhead is only -40% and the object will not be lost even if four OSDs break at the -same time. - -.. _list of available plugins: - -.. toctree:: - :maxdepth: 1 - - erasure-code-jerasure - erasure-code-isa - erasure-code-lrc - erasure-code-shec - -osd erasure-code-profile set -============================ - -To create a new erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - [{directory=directory}] \ - [{plugin=plugin}] \ - [{stripe_unit=stripe_unit}] \ - [{key=value} ...] \ - [--force] - -Where: - -``{directory=directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``{plugin=plugin}`` - -:Description: Use the erasure code **plugin** to compute coding chunks - and recover missing chunks. See the `list of available - plugins`_ for more information. - -:Type: String -:Required: No. -:Default: jerasure - -``{stripe_unit=stripe_unit}`` - -:Description: The amount of data in a data chunk, per stripe. For - example, a profile with 2 data chunks and stripe_unit=4K - would put the range 0-4K in chunk 0, 4K-8K in chunk 1, - then 8K-12K in chunk 0 again. This should be a multiple - of 4K for best performance. The default value is taken - from the monitor config option - ``osd_pool_erasure_code_stripe_unit`` when a pool is - created. The stripe_width of a pool using this profile - will be the number of data chunks multiplied by this - stripe_unit. - -:Type: String -:Required: No. - -``{key=value}`` - -:Description: The semantic of the remaining key/value pairs is defined - by the erasure code plugin. - -:Type: String -:Required: No. - -``--force`` - -:Description: Override an existing profile by the same name, and allow - setting a non-4K-aligned stripe_unit. - -:Type: String -:Required: No. 
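-
-For example, a profile using the default *jerasure* plugin that can
-sustain the loss of two OSDs could be created and then inspected like
-this (a sketch; the profile name ``myprofile`` is illustrative and
-``k``, ``m`` and ``crush-failure-domain`` are key/value pairs defined by
-that plugin)::
-
-    $ ceph osd erasure-code-profile set myprofile \
-         plugin=jerasure \
-         k=4 m=2 \
-         crush-failure-domain=host
-    $ ceph osd erasure-code-profile get myprofile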
- -osd erasure-code-profile rm -============================ - -To remove an erasure code profile:: - - ceph osd erasure-code-profile rm {name} - -If the profile is referenced by a pool, the deletion will fail. - -osd erasure-code-profile get -============================ - -To display an erasure code profile:: - - ceph osd erasure-code-profile get {name} - -osd erasure-code-profile ls -=========================== - -To list the names of all erasure code profiles:: - - ceph osd erasure-code-profile ls - diff --git a/src/ceph/doc/rados/operations/erasure-code-shec.rst b/src/ceph/doc/rados/operations/erasure-code-shec.rst deleted file mode 100644 index e3bab37..0000000 --- a/src/ceph/doc/rados/operations/erasure-code-shec.rst +++ /dev/null @@ -1,144 +0,0 @@ -======================== -SHEC erasure code plugin -======================== - -The *shec* plugin encapsulates the `multiple SHEC -<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_ -library. It allows ceph to recover data more efficiently than Reed Solomon codes. - -Create an SHEC profile -====================== - -To create a new *shec* erasure code profile:: - - ceph osd erasure-code-profile set {name} \ - plugin=shec \ - [k={data-chunks}] \ - [m={coding-chunks}] \ - [c={durability-estimator}] \ - [crush-root={root}] \ - [crush-failure-domain={bucket-type}] \ - [crush-device-class={device-class}] \ - [directory={directory}] \ - [--force] - -Where: - -``k={data-chunks}`` - -:Description: Each object is split in **data-chunks** parts, - each stored on a different OSD. - -:Type: Integer -:Required: No. -:Default: 4 - -``m={coding-chunks}`` - -:Description: Compute **coding-chunks** for each object and store them on - different OSDs. The number of **coding-chunks** does not necessarily - equal the number of OSDs that can be down without losing data. - -:Type: Integer -:Required: No. -:Default: 3 - -``c={durability-estimator}`` - -:Description: The number of parity chunks each of which includes each data chunk in its - calculation range. The number is used as a **durability estimator**. - For instance, if c=2, 2 OSDs can be down without losing data. - -:Type: Integer -:Required: No. -:Default: 2 - -``crush-root={root}`` - -:Description: The name of the crush bucket used for the first step of - the ruleset. For intance **step take default**. - -:Type: String -:Required: No. -:Default: default - -``crush-failure-domain={bucket-type}`` - -:Description: Ensure that no two chunks are in a bucket with the same - failure domain. For instance, if the failure domain is - **host** no two chunks will be stored on the same - host. It is used to create a ruleset step such as **step - chooseleaf host**. - -:Type: String -:Required: No. -:Default: host - -``crush-device-class={device-class}`` - -:Description: Restrict placement to devices of a specific class (e.g., - ``ssd`` or ``hdd``), using the crush device class names - in the CRUSH map. - -:Type: String -:Required: No. -:Default: - -``directory={directory}`` - -:Description: Set the **directory** name from which the erasure code - plugin is loaded. - -:Type: String -:Required: No. -:Default: /usr/lib/ceph/erasure-code - -``--force`` - -:Description: Override an existing profile by the same name. - -:Type: String -:Required: No. - -Brief description of SHEC's layouts -=================================== - -Space Efficiency ----------------- - -Space efficiency is a ratio of data chunks to all ones in a object and -represented as k/(k+m). 
-In order to improve space efficiency, you should increase k or decrease m. - -:: - - space efficiency of SHEC(4,3,2) = 4/(4+3) = 0.57 - SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency - -Durability ----------- - -The third parameter of SHEC (=c) is a durability estimator, which approximates -the number of OSDs that can be down without losing data. - -``durability estimator of SHEC(4,3,2) = 2`` - -Recovery Efficiency -------------------- - -Describing calculation of recovery efficiency is beyond the scope of this document, -but at least increasing m without increasing c achieves improvement of recovery efficiency. -(However, we must pay attention to the sacrifice of space efficiency in this case.) - -``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency`` - -Erasure code profile examples -============================= - -:: - - $ ceph osd erasure-code-profile set SHECprofile \ - plugin=shec \ - k=8 m=4 c=3 \ - crush-failure-domain=host - $ ceph osd pool create shecpool 256 256 erasure SHECprofile diff --git a/src/ceph/doc/rados/operations/erasure-code.rst b/src/ceph/doc/rados/operations/erasure-code.rst deleted file mode 100644 index 6ec5a09..0000000 --- a/src/ceph/doc/rados/operations/erasure-code.rst +++ /dev/null @@ -1,195 +0,0 @@ -============= - Erasure code -============= - -A Ceph pool is associated to a type to sustain the loss of an OSD -(i.e. a disk since most of the time there is one OSD per disk). The -default choice when `creating a pool <../pools>`_ is *replicated*, -meaning every object is copied on multiple disks. The `Erasure Code -<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used -instead to save space. - -Creating a sample erasure coded pool ------------------------------------- - -The simplest erasure coded pool is equivalent to `RAID5 -<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and -requires at least three hosts:: - - $ ceph osd pool create ecpool 12 12 erasure - pool 'ecpool' created - $ echo ABCDEFGHI | rados --pool ecpool put NYAN - - $ rados --pool ecpool get NYAN - - ABCDEFGHI - -.. note:: the 12 in *pool create* stands for - `the number of placement groups <../pools>`_. - -Erasure code profiles ---------------------- - -The default erasure code profile sustains the loss of a single OSD. It -is equivalent to a replicated pool of size two but requires 1.5TB -instead of 2TB to store 1TB of data. The default profile can be -displayed with:: - - $ ceph osd erasure-code-profile get default - k=2 - m=1 - plugin=jerasure - crush-failure-domain=host - technique=reed_sol_van - -Choosing the right profile is important because it cannot be modified -after the pool is created: a new pool with a different profile needs -to be created and all objects from the previous pool moved to the new. - -The most important parameters of the profile are *K*, *M* and -*crush-failure-domain* because they define the storage overhead and -the data durability. For instance, if the desired architecture must -sustain the loss of two racks with a storage overhead of 40% overhead, -the following profile can be defined:: - - $ ceph osd erasure-code-profile set myprofile \ - k=3 \ - m=2 \ - crush-failure-domain=rack - $ ceph osd pool create ecpool 12 12 erasure myprofile - $ echo ABCDEFGHI | rados --pool ecpool put NYAN - - $ rados --pool ecpool get NYAN - - ABCDEFGHI - -The *NYAN* object will be divided in three (*K=3*) and two additional -*chunks* will be created (*M=2*). 
The value of *M* defines how many -OSD can be lost simultaneously without losing any data. The -*crush-failure-domain=rack* will create a CRUSH ruleset that ensures -no two *chunks* are stored in the same rack. - -.. ditaa:: - +-------------------+ - name | NYAN | - +-------------------+ - content | ABCDEFGHI | - +--------+----------+ - | - | - v - +------+------+ - +---------------+ encode(3,2) +-----------+ - | +--+--+---+---+ | - | | | | | - | +-------+ | +-----+ | - | | | | | - +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ - name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | - +------+ +------+ +------+ +------+ +------+ - shard | 1 | | 2 | | 3 | | 4 | | 5 | - +------+ +------+ +------+ +------+ +------+ - content | ABC | | DEF | | GHI | | YXY | | QGC | - +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ - | | | | | - | | v | | - | | +--+---+ | | - | | | OSD1 | | | - | | +------+ | | - | | | | - | | +------+ | | - | +------>| OSD2 | | | - | +------+ | | - | | | - | +------+ | | - | | OSD3 |<----+ | - | +------+ | - | | - | +------+ | - | | OSD4 |<--------------+ - | +------+ - | - | +------+ - +----------------->| OSD5 | - +------+ - - -More information can be found in the `erasure code profiles -<../erasure-code-profile>`_ documentation. - - -Erasure Coding with Overwrites ------------------------------- - -By default, erasure coded pools only work with uses like RGW that -perform full object writes and appends. - -Since Luminous, partial writes for an erasure coded pool may be -enabled with a per-pool setting. This lets RBD and Cephfs store their -data in an erasure coded pool:: - - ceph osd pool set ec_pool allow_ec_overwrites true - -This can only be enabled on a pool residing on bluestore OSDs, since -bluestore's checksumming is used to detect bitrot or other corruption -during deep-scrub. In addition to being unsafe, using filestore with -ec overwrites yields low performance compared to bluestore. - -Erasure coded pools do not support omap, so to use them with RBD and -Cephfs you must instruct them to store their data in an ec pool, and -their metadata in a replicated pool. For RBD, this means using the -erasure coded pool as the ``--data-pool`` during image creation:: - - rbd create --size 1G --data-pool ec_pool replicated_pool/image_name - -For Cephfs, using an erasure coded pool means setting that pool in -a `file layout <../../../cephfs/file-layouts>`_. - - -Erasure coded pool and cache tiering ------------------------------------- - -Erasure coded pools require more resources than replicated pools and -lack some functionalities such as omap. To overcome these -limitations, one can set up a `cache tier <../cache-tiering>`_ -before the erasure coded pool. - -For instance, if the pool *hot-storage* is made of fast storage:: - - $ ceph osd tier add ecpool hot-storage - $ ceph osd tier cache-mode hot-storage writeback - $ ceph osd tier set-overlay ecpool hot-storage - -will place the *hot-storage* pool as tier of *ecpool* in *writeback* -mode so that every write and read to the *ecpool* are actually using -the *hot-storage* and benefit from its flexibility and speed. - -More information can be found in the `cache tiering -<../cache-tiering>`_ documentation. - -Glossary --------- - -*chunk* - when the encoding function is called, it returns chunks of the same - size. Data chunks which can be concatenated to reconstruct the original - object and coding chunks which can be used to rebuild a lost chunk. - -*K* - the number of data *chunks*, i.e. 
the number of *chunks* in which the - original object is divided. For instance if *K* = 2 a 10KB object - will be divided into *K* objects of 5KB each. - -*M* - the number of coding *chunks*, i.e. the number of additional *chunks* - computed by the encoding functions. If there are 2 coding *chunks*, - it means 2 OSDs can be out without losing data. - - -Table of content ----------------- - -.. toctree:: - :maxdepth: 1 - - erasure-code-profile - erasure-code-jerasure - erasure-code-isa - erasure-code-lrc - erasure-code-shec diff --git a/src/ceph/doc/rados/operations/health-checks.rst b/src/ceph/doc/rados/operations/health-checks.rst deleted file mode 100644 index c1e2200..0000000 --- a/src/ceph/doc/rados/operations/health-checks.rst +++ /dev/null @@ -1,527 +0,0 @@ - -============= -Health checks -============= - -Overview -======== - -There is a finite set of possible health messages that a Ceph cluster can -raise -- these are defined as *health checks* which have unique identifiers. - -The identifier is a terse pseudo-human-readable (i.e. like a variable name) -string. It is intended to enable tools (such as UIs) to make sense of -health checks, and present them in a way that reflects their meaning. - -This page lists the health checks that are raised by the monitor and manager -daemons. In addition to these, you may also see health checks that originate -from MDS daemons (see :doc:`/cephfs/health-messages`), and health checks -that are defined by ceph-mgr python modules. - -Definitions -=========== - - -OSDs ----- - -OSD_DOWN -________ - -One or more OSDs are marked down. The ceph-osd daemon may have been -stopped, or peer OSDs may be unable to reach the OSD over the network. -Common causes include a stopped or crashed daemon, a down host, or a -network outage. - -Verify the host is healthy, the daemon is started, and network is -functioning. If the daemon has crashed, the daemon log file -(``/var/log/ceph/ceph-osd.*``) may contain debugging information. - -OSD_<crush type>_DOWN -_____________________ - -(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN) - -All the OSDs within a particular CRUSH subtree are marked down, for example -all OSDs on a host. - -OSD_ORPHAN -__________ - -An OSD is referenced in the CRUSH map hierarchy but does not exist. - -The OSD can be removed from the CRUSH hierarchy with:: - - ceph osd crush rm osd.<id> - -OSD_OUT_OF_ORDER_FULL -_____________________ - -The utilization thresholds for `backfillfull`, `nearfull`, `full`, -and/or `failsafe_full` are not ascending. In particular, we expect -`backfillfull < nearfull`, `nearfull < full`, and `full < -failsafe_full`. - -The thresholds can be adjusted with:: - - ceph osd set-backfillfull-ratio <ratio> - ceph osd set-nearfull-ratio <ratio> - ceph osd set-full-ratio <ratio> - - -OSD_FULL -________ - -One or more OSDs has exceeded the `full` threshold and is preventing -the cluster from servicing writes. - -Utilization by pool can be checked with:: - - ceph df - -The currently defined `full` ratio can be seen with:: - - ceph osd dump | grep full_ratio - -A short-term workaround to restore write availability is to raise the full -threshold by a small amount:: - - ceph osd set-full-ratio <ratio> - -New storage should be added to the cluster by deploying more OSDs or -existing data should be deleted in order to free up space. - -OSD_BACKFILLFULL -________________ - -One or more OSDs has exceeded the `backfillfull` threshold, which will -prevent data from being allowed to rebalance to this device. 
This is -an early warning that rebalancing may not be able to complete and that -the cluster is approaching full. - -Utilization by pool can be checked with:: - - ceph df - -OSD_NEARFULL -____________ - -One or more OSDs has exceeded the `nearfull` threshold. This is an early -warning that the cluster is approaching full. - -Utilization by pool can be checked with:: - - ceph df - -OSDMAP_FLAGS -____________ - -One or more cluster flags of interest has been set. These flags include: - -* *full* - the cluster is flagged as full and cannot service writes -* *pauserd*, *pausewr* - paused reads or writes -* *noup* - OSDs are not allowed to start -* *nodown* - OSD failure reports are being ignored, such that the - monitors will not mark OSDs `down` -* *noin* - OSDs that were previously marked `out` will not be marked - back `in` when they start -* *noout* - down OSDs will not automatically be marked out after the - configured interval -* *nobackfill*, *norecover*, *norebalance* - recovery or data - rebalancing is suspended -* *noscrub*, *nodeep_scrub* - scrubbing is disabled -* *notieragent* - cache tiering activity is suspended - -With the exception of *full*, these flags can be set or cleared with:: - - ceph osd set <flag> - ceph osd unset <flag> - -OSD_FLAGS -_________ - -One or more OSDs has a per-OSD flag of interest set. These flags include: - -* *noup*: OSD is not allowed to start -* *nodown*: failure reports for this OSD will be ignored -* *noin*: if this OSD was previously marked `out` automatically - after a failure, it will not be marked in when it stats -* *noout*: if this OSD is down it will not automatically be marked - `out` after the configured interval - -Per-OSD flags can be set and cleared with:: - - ceph osd add-<flag> <osd-id> - ceph osd rm-<flag> <osd-id> - -For example, :: - - ceph osd rm-nodown osd.123 - -OLD_CRUSH_TUNABLES -__________________ - -The CRUSH map is using very old settings and should be updated. The -oldest tunables that can be used (i.e., the oldest client version that -can connect to the cluster) without triggering this health warning is -determined by the ``mon_crush_min_required_version`` config option. -See :doc:`/rados/operations/crush-map/#tunables` for more information. - -OLD_CRUSH_STRAW_CALC_VERSION -____________________________ - -The CRUSH map is using an older, non-optimal method for calculating -intermediate weight values for ``straw`` buckets. - -The CRUSH map should be updated to use the newer method -(``straw_calc_version=1``). See -:doc:`/rados/operations/crush-map/#tunables` for more information. - -CACHE_POOL_NO_HIT_SET -_____________________ - -One or more cache pools is not configured with a *hit set* to track -utilization, which will prevent the tiering agent from identifying -cold objects to flush and evict from the cache. - -Hit sets can be configured on the cache pool with:: - - ceph osd pool set <poolname> hit_set_type <type> - ceph osd pool set <poolname> hit_set_period <period-in-seconds> - ceph osd pool set <poolname> hit_set_count <number-of-hitsets> - ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> - -OSD_NO_SORTBITWISE -__________________ - -No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not -been set. - -The ``sortbitwise`` flag must be set before luminous v12.y.z or newer -OSDs can start. You can safely set the flag with:: - - ceph osd set sortbitwise - -POOL_FULL -_________ - -One or more pools has reached its quota and is no longer allowing writes. 
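-
-If a quota has been set on the affected pool, it can be displayed with, for
-example::
-
-    ceph osd pool get-quota <poolname>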
- -Pool quotas and utilization can be seen with:: - - ceph df detail - -You can either raise the pool quota with:: - - ceph osd pool set-quota <poolname> max_objects <num-objects> - ceph osd pool set-quota <poolname> max_bytes <num-bytes> - -or delete some existing data to reduce utilization. - - -Data health (pools & placement groups) --------------------------------------- - -PG_AVAILABILITY -_______________ - -Data availability is reduced, meaning that the cluster is unable to -service potential read or write requests for some data in the cluster. -Specifically, one or more PGs is in a state that does not allow IO -requests to be serviced. Problematic PG states include *peering*, -*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear -quickly). - -Detailed information about which PGs are affected is available from:: - - ceph health detail - -In most cases the root cause is that one or more OSDs is currently -down; see the dicussion for ``OSD_DOWN`` above. - -The state of specific problematic PGs can be queried with:: - - ceph tell <pgid> query - -PG_DEGRADED -___________ - -Data redundancy is reduced for some data, meaning the cluster does not -have the desired number of replicas for all data (for replicated -pools) or erasure code fragments (for erasure coded pools). -Specifically, one or more PGs: - -* has the *degraded* or *undersized* flag set, meaning there are not - enough instances of that placement group in the cluster; -* has not had the *clean* flag set for some time. - -Detailed information about which PGs are affected is available from:: - - ceph health detail - -In most cases the root cause is that one or more OSDs is currently -down; see the dicussion for ``OSD_DOWN`` above. - -The state of specific problematic PGs can be queried with:: - - ceph tell <pgid> query - - -PG_DEGRADED_FULL -________________ - -Data redundancy may be reduced or at risk for some data due to a lack -of free space in the cluster. Specifically, one or more PGs has the -*backfill_toofull* or *recovery_toofull* flag set, meaning that the -cluster is unable to migrate or recover data because one or more OSDs -is above the *backfillfull* threshold. - -See the discussion for *OSD_BACKFILLFULL* or *OSD_FULL* above for -steps to resolve this condition. - -PG_DAMAGED -__________ - -Data scrubbing has discovered some problems with data consistency in -the cluster. Specifically, one or more PGs has the *inconsistent* or -*snaptrim_error* flag is set, indicating an earlier scrub operation -found a problem, or that the *repair* flag is set, meaning a repair -for such an inconsistency is currently in progress. - -See :doc:`pg-repair` for more information. - -OSD_SCRUB_ERRORS -________________ - -Recent OSD scrubs have uncovered inconsistencies. This error is generally -paired with *PG_DAMANGED* (see above). - -See :doc:`pg-repair` for more information. - -CACHE_POOL_NEAR_FULL -____________________ - -A cache tier pool is nearly full. Full in this context is determined -by the ``target_max_bytes`` and ``target_max_objects`` properties on -the cache pool. Once the pool reaches the target threshold, write -requests to the pool may block while data is flushed and evicted -from the cache, a state that normally leads to very high latencies and -poor performance. 
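-
-The values currently configured for these targets can be inspected with, for
-example::
-
-    ceph osd pool get <cache-pool-name> target_max_bytes
-    ceph osd pool get <cache-pool-name> target_max_objects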
- -The cache pool target size can be adjusted with:: - - ceph osd pool set <cache-pool-name> target_max_bytes <bytes> - ceph osd pool set <cache-pool-name> target_max_objects <objects> - -Normal cache flush and evict activity may also be throttled due to reduced -availability or performance of the base tier, or overall cluster load. - -TOO_FEW_PGS -___________ - -The number of PGs in use in the cluster is below the configurable -threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead -to suboptimizal distribution and balance of data across the OSDs in -the cluster, and similar reduce overall performance. - -This may be an expected condition if data pools have not yet been -created. - -The PG count for existing pools can be increased or new pools can be -created. Please refer to -:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for -more information. - -TOO_MANY_PGS -____________ - -The number of PGs in use in the cluster is above the configurable -threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is -exceed the cluster will not allow new pools to be created, pool `pg_num` to -be increased, or pool replication to be increased (any of which would lead to -more PGs in the cluster). A large number of PGs can lead -to higher memory utilization for OSD daemons, slower peering after -cluster state changes (like OSD restarts, additions, or removals), and -higher load on the Manager and Monitor daemons. - -The simplest way to mitigate the problem is to increase the number of -OSDs in the cluster by adding more hardware. Note that the OSD count -used for the purposes of this health check is the number of "in" OSDs, -so marking "out" OSDs "in" (if there are any) can also help:: - - ceph osd in <osd id(s)> - -Please refer to -:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for -more information. - -SMALLER_PGP_NUM -_______________ - -One or more pools has a ``pgp_num`` value less than ``pg_num``. This -is normally an indication that the PG count was increased without -also increasing the placement behavior. - -This is sometimes done deliberately to separate out the `split` step -when the PG count is adjusted from the data migration that is needed -when ``pgp_num`` is changed. - -This is normally resolved by setting ``pgp_num`` to match ``pg_num``, -triggering the data migration, with:: - - ceph osd pool set <pool> pgp_num <pg-num-value> - -MANY_OBJECTS_PER_PG -___________________ - -One or more pools has an average number of objects per PG that is -significantly higher than the overall cluster average. The specific -threshold is controlled by the ``mon_pg_warn_max_object_skew`` -configuration value. - -This is usually an indication that the pool(s) containing most of the -data in the cluster have too few PGs, and/or that other pools that do -not contain as much data have too many PGs. See the discussion of -*TOO_MANY_PGS* above. - -The threshold can be raised to silence the health warning by adjusting -the ``mon_pg_warn_max_object_skew`` config option on the monitors. - -POOL_APP_NOT_ENABLED -____________________ - -A pool exists that contains one or more objects but has not been -tagged for use by a particular application. - -Resolve this warning by labeling the pool for use by an application. 
For example, if the pool is used by RBD::
-
-    rbd pool init <poolname>
-
-If the pool is being used by a custom application 'foo', you can also label it
-via the low-level command::
-
-    ceph osd pool application enable <poolname> foo
-
-For more information, see :doc:`pools.rst#associate-pool-to-application`.
-
-POOL_FULL
-_________
-
-One or more pools has reached (or is very close to reaching) its
-quota. The threshold to trigger this error condition is controlled by
-the ``mon_pool_quota_crit_threshold`` configuration option.
-
-Pool quotas can be adjusted up or down (or removed) with::
-
-    ceph osd pool set-quota <pool> max_bytes <bytes>
-    ceph osd pool set-quota <pool> max_objects <objects>
-
-Setting the quota value to 0 will disable the quota.
-
-POOL_NEAR_FULL
-______________
-
-One or more pools is approaching its quota. The threshold to trigger
-this warning condition is controlled by the
-``mon_pool_quota_warn_threshold`` configuration option.
-
-Pool quotas can be adjusted up or down (or removed) with::
-
-    ceph osd pool set-quota <pool> max_bytes <bytes>
-    ceph osd pool set-quota <pool> max_objects <objects>
-
-Setting the quota value to 0 will disable the quota.
-
-OBJECT_MISPLACED
-________________
-
-One or more objects in the cluster is not stored on the node the
-cluster would like it to be stored on. This is an indication that
-data migration due to some recent cluster change has not yet completed.
-
-Misplaced data is not a dangerous condition in and of itself; data
-consistency is never at risk, and old copies of objects are never
-removed until the desired number of new copies (in the desired
-locations) are present.
-
-OBJECT_UNFOUND
-______________
-
-One or more objects in the cluster cannot be found. Specifically, the
-OSDs know that a new or updated copy of an object should exist, but a
-copy of that version of the object has not been found on OSDs that are
-currently online.
-
-Read or write requests to unfound objects will block.
-
-Ideally, a down OSD that has a more recent copy of the unfound object can be
-brought back online. Candidate OSDs can be identified from the
-peering state for the PG(s) responsible for the unfound object::
-
-    ceph tell <pgid> query
-
-If the latest copy of the object is not available, the cluster can be
-told to roll back to a previous version of the object. See
-:doc:`troubleshooting-pg#Unfound-objects` for more information.
-
-REQUEST_SLOW
-____________
-
-One or more OSD requests is taking a long time to process. This can
-be an indication of extreme load, a slow storage device, or a software
-bug.
-
-The request queue on the OSD(s) in question can be queried with the
-following command, executed from the OSD host::
-
-    ceph daemon osd.<id> ops
-
-A summary of the slowest recent requests can be seen with::
-
-    ceph daemon osd.<id> dump_historic_ops
-
-The location of an OSD can be found with::
-
-    ceph osd find osd.<id>
-
-REQUEST_STUCK
-_____________
-
-One or more OSD requests has been blocked for an extremely long time.
-This is an indication that either the cluster has been unhealthy for
-an extended period of time (e.g., not enough running OSDs) or there is
-some internal problem with the OSD. See the discussion of
-*REQUEST_SLOW* above.
-
-PG_NOT_SCRUBBED
-_______________
-
-One or more PGs has not been scrubbed recently. PGs are normally
-scrubbed every ``mon_scrub_interval`` seconds, and this warning
-triggers when ``mon_warn_not_scrubbed`` such intervals have elapsed
-without a scrub.
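-
-The time of the most recent scrub of an affected PG is recorded in the
-``last_scrub_stamp`` field of its query output and can be checked with, for
-example::
-
-    ceph tell <pgid> query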
- -PGs will not scrub if they are not flagged as *clean*, which may -happen if they are misplaced or degraded (see *PG_AVAILABILITY* and -*PG_DEGRADED* above). - -You can manually initiate a scrub of a clean PG with:: - - ceph pg scrub <pgid> - -PG_NOT_DEEP_SCRUBBED -____________________ - -One or more PGs has not been deep scrubbed recently. PGs are normally -scrubbed every ``osd_deep_mon_scrub_interval`` seconds, and this warning -triggers when ``mon_warn_not_deep_scrubbed`` such intervals have elapsed -without a scrub. - -PGs will not (deep) scrub if they are not flagged as *clean*, which may -happen if they are misplaced or degraded (see *PG_AVAILABILITY* and -*PG_DEGRADED* above). - -You can manually initiate a scrub of a clean PG with:: - - ceph pg deep-scrub <pgid> diff --git a/src/ceph/doc/rados/operations/index.rst b/src/ceph/doc/rados/operations/index.rst deleted file mode 100644 index aacf764..0000000 --- a/src/ceph/doc/rados/operations/index.rst +++ /dev/null @@ -1,90 +0,0 @@ -==================== - Cluster Operations -==================== - -.. raw:: html - - <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3> - -High-level cluster operations consist primarily of starting, stopping, and -restarting a cluster with the ``ceph`` service; checking the cluster's health; -and, monitoring an operating cluster. - -.. toctree:: - :maxdepth: 1 - - operating - health-checks - monitoring - monitoring-osd-pg - user-management - -.. raw:: html - - </td><td><h3>Data Placement</h3> - -Once you have your cluster up and running, you may begin working with data -placement. Ceph supports petabyte-scale data storage clusters, with storage -pools and placement groups that distribute data across the cluster using Ceph's -CRUSH algorithm. - -.. toctree:: - :maxdepth: 1 - - data-placement - pools - erasure-code - cache-tiering - placement-groups - upmap - crush-map - crush-map-edits - - - -.. raw:: html - - </td></tr><tr><td><h3>Low-level Operations</h3> - -Low-level cluster operations consist of starting, stopping, and restarting a -particular daemon within a cluster; changing the settings of a particular -daemon or subsystem; and, adding a daemon to the cluster or removing a daemon -from the cluster. The most common use cases for low-level operations include -growing or shrinking the Ceph cluster and replacing legacy or failed hardware -with new hardware. - -.. toctree:: - :maxdepth: 1 - - add-or-rm-osds - add-or-rm-mons - Command Reference <control> - - - -.. raw:: html - - </td><td><h3>Troubleshooting</h3> - -Ceph is still on the leading edge, so you may encounter situations that require -you to evaluate your Ceph configuration and modify your logging and debugging -settings to identify and remedy issues you are encountering with your cluster. - -.. toctree:: - :maxdepth: 1 - - ../troubleshooting/community - ../troubleshooting/troubleshooting-mon - ../troubleshooting/troubleshooting-osd - ../troubleshooting/troubleshooting-pg - ../troubleshooting/log-and-debug - ../troubleshooting/cpu-profiling - ../troubleshooting/memory-profiling - - - - -.. 
raw:: html - - </td></tr></tbody></table> - diff --git a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst deleted file mode 100644 index 0107e34..0000000 --- a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst +++ /dev/null @@ -1,617 +0,0 @@ -========================= - Monitoring OSDs and PGs -========================= - -High availability and high reliability require a fault-tolerant approach to -managing hardware and software issues. Ceph has no single point-of-failure, and -can service requests for data in a "degraded" mode. Ceph's `data placement`_ -introduces a layer of indirection to ensure that data doesn't bind directly to -particular OSD addresses. This means that tracking down system faults requires -finding the `placement group`_ and the underlying OSDs at root of the problem. - -.. tip:: A fault in one part of the cluster may prevent you from accessing a - particular object, but that doesn't mean that you cannot access other objects. - When you run into a fault, don't panic. Just follow the steps for monitoring - your OSDs and placement groups. Then, begin troubleshooting. - -Ceph is generally self-repairing. However, when problems persist, monitoring -OSDs and placement groups will help you identify the problem. - - -Monitoring OSDs -=============== - -An OSD's status is either in the cluster (``in``) or out of the cluster -(``out``); and, it is either up and running (``up``), or it is down and not -running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster -(you can read and write data) or it is ``out`` of the cluster. If it was -``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate -placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will -not assign placement groups to the OSD. If an OSD is ``down``, it should also be -``out``. - -.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster - will not be in a healthy state. - -.. ditaa:: +----------------+ +----------------+ - | | | | - | OSD #n In | | OSD #n Up | - | | | | - +----------------+ +----------------+ - ^ ^ - | | - | | - v v - +----------------+ +----------------+ - | | | | - | OSD #n Out | | OSD #n Down | - | | | | - +----------------+ +----------------+ - -If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, -you may notice that the cluster does not always echo back ``HEALTH OK``. Don't -panic. With respect to OSDs, you should expect that the cluster will **NOT** -echo ``HEALTH OK`` in a few expected circumstances: - -#. You haven't started the cluster yet (it won't respond). -#. You have just started or restarted the cluster and it's not ready yet, - because the placement groups are getting created and the OSDs are in - the process of peering. -#. You just added or removed an OSD. -#. You just have modified your cluster map. - -An important aspect of monitoring OSDs is to ensure that when the cluster -is up and running that all OSDs that are ``in`` the cluster are ``up`` and -running, too. To see if all OSDs are running, execute:: - - ceph osd stat - -The result should tell you the map epoch (eNNNN), the total number of OSDs (x), -how many are ``up`` (y) and how many are ``in`` (z). 
:: - - eNNNN: x osds: y up, z in - -If the number of OSDs that are ``in`` the cluster is more than the number of -OSDs that are ``up``, execute the following command to identify the ``ceph-osd`` -daemons that are not running:: - - ceph osd tree - -:: - - dumped osdmap tree epoch 1 - # id weight type name up/down reweight - -1 2 pool openstack - -3 2 rack dell-2950-rack-A - -2 2 host dell-2950-A1 - 0 1 osd.0 up 1 - 1 1 osd.1 down 1 - - -.. tip:: The ability to search through a well-designed CRUSH hierarchy may help - you troubleshoot your cluster by identifying the physcial locations faster. - -If an OSD is ``down``, start it:: - - sudo systemctl start ceph-osd@1 - -See `OSD Not Running`_ for problems associated with OSDs that stopped, or won't -restart. - - -PG Sets -======= - -When CRUSH assigns placement groups to OSDs, it looks at the number of replicas -for the pool and assigns the placement group to OSDs such that each replica of -the placement group gets assigned to a different OSD. For example, if the pool -requires three replicas of a placement group, CRUSH may assign them to -``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a -pseudo-random placement that will take into account failure domains you set in -your `CRUSH map`_, so you will rarely see placement groups assigned to nearest -neighbor OSDs in a large cluster. We refer to the set of OSDs that should -contain the replicas of a particular placement group as the **Acting Set**. In -some cases, an OSD in the Acting Set is ``down`` or otherwise not able to -service requests for objects in the placement group. When these situations -arise, don't panic. Common examples include: - -- You added or removed an OSD. Then, CRUSH reassigned the placement group to - other OSDs--thereby changing the composition of the Acting Set and spawning - the migration of data with a "backfill" process. -- An OSD was ``down``, was restarted, and is now ``recovering``. -- An OSD in the Acting Set is ``down`` or unable to service requests, - and another OSD has temporarily assumed its duties. - -Ceph processes a client request using the **Up Set**, which is the set of OSDs -that will actually handle the requests. In most cases, the Up Set and the Acting -Set are virtually identical. When they are not, it may indicate that Ceph is -migrating data, an OSD is recovering, or that there is a problem (i.e., Ceph -usually echoes a "HEALTH WARN" state with a "stuck stale" message in such -scenarios). - -To retrieve a list of placement groups, execute:: - - ceph pg dump - -To view which OSDs are within the Acting Set or the Up Set for a given placement -group, execute:: - - ceph pg map {pg-num} - -The result should tell you the osdmap epoch (eNNN), the placement group number -({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set -(acting[]). :: - - osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2] - -.. note:: If the Up Set and Acting Set do not match, this may be an indicator - that the cluster rebalancing itself or of a potential problem with - the cluster. - - -Peering -======= - -Before you can write data to a placement group, it must be in an ``active`` -state, and it **should** be in a ``clean`` state. For Ceph to determine the -current state of a placement group, the primary OSD of the placement group -(i.e., the first OSD in the acting set), peers with the secondary and tertiary -OSDs to establish agreement on the current state of the placement group -(assuming a pool with 3 replicas of the PG). - - -.. 
ditaa:: +---------+ +---------+ +-------+ - | OSD 1 | | OSD 2 | | OSD 3 | - +---------+ +---------+ +-------+ - | | | - | Request To | | - | Peer | | - |-------------->| | - |<--------------| | - | Peering | - | | - | Request To | - | Peer | - |----------------------------->| - |<-----------------------------| - | Peering | - -The OSDs also report their status to the monitor. See `Configuring Monitor/OSD -Interaction`_ for details. To troubleshoot peering issues, see `Peering -Failure`_. - - -Monitoring Placement Group States -================================= - -If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, -you may notice that the cluster does not always echo back ``HEALTH OK``. After -you check to see if the OSDs are running, you should also check placement group -states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a -number of placement group peering-related circumstances: - -#. You have just created a pool and placement groups haven't peered yet. -#. The placement groups are recovering. -#. You have just added an OSD to or removed an OSD from the cluster. -#. You have just modified your CRUSH map and your placement groups are migrating. -#. There is inconsistent data in different replicas of a placement group. -#. Ceph is scrubbing a placement group's replicas. -#. Ceph doesn't have enough storage capacity to complete backfilling operations. - -If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't -panic. In many cases, the cluster will recover on its own. In some cases, you -may need to take action. An important aspect of monitoring placement groups is -to ensure that when the cluster is up and running that all placement groups are -``active``, and preferably in the ``clean`` state. To see the status of all -placement groups, execute:: - - ceph pg stat - -The result should tell you the placement group map version (vNNNNNN), the total -number of placement groups (x), and how many placement groups are in a -particular state such as ``active+clean`` (y). :: - - vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail - -.. note:: It is common for Ceph to report multiple states for placement groups. - -In addition to the placement group states, Ceph will also echo back the amount -of data used (aa), the amount of storage capacity remaining (bb), and the total -storage capacity for the placement group. These numbers can be important in a -few cases: - -- You are reaching your ``near full ratio`` or ``full ratio``. -- Your data is not getting distributed across the cluster due to an - error in your CRUSH configuration. - - -.. topic:: Placement Group IDs - - Placement group IDs consist of the pool number (not pool name) followed - by a period (.) and the placement group ID--a hexadecimal number. You - can view pool numbers and their names from the output of ``ceph osd - lspools``. For example, the default pool ``rbd`` corresponds to - pool number ``0``. A fully qualified placement group ID has the - following form:: - - {pool-num}.{pg-id} - - And it typically looks like this:: - - 0.1f - - -To retrieve a list of placement groups, execute the following:: - - ceph pg dump - -You can also format the output in JSON format and save it to a file:: - - ceph pg dump -o {filename} --format=json - -To query a particular placement group, execute the following:: - - ceph pg {poolnum}.{pg-id} query - -Ceph will output the query in JSON format. - -.. 
code-block:: javascript - - { - "state": "active+clean", - "up": [ - 1, - 0 - ], - "acting": [ - 1, - 0 - ], - "info": { - "pgid": "1.e", - "last_update": "4'1", - "last_complete": "4'1", - "log_tail": "0'0", - "last_backfill": "MAX", - "purged_snaps": "[]", - "history": { - "epoch_created": 1, - "last_epoch_started": 537, - "last_epoch_clean": 537, - "last_epoch_split": 534, - "same_up_since": 536, - "same_interval_since": 536, - "same_primary_since": 536, - "last_scrub": "4'1", - "last_scrub_stamp": "2013-01-25 10:12:23.828174" - }, - "stats": { - "version": "4'1", - "reported": "536'782", - "state": "active+clean", - "last_fresh": "2013-01-25 10:12:23.828271", - "last_change": "2013-01-25 10:12:23.828271", - "last_active": "2013-01-25 10:12:23.828271", - "last_clean": "2013-01-25 10:12:23.828271", - "last_unstale": "2013-01-25 10:12:23.828271", - "mapping_epoch": 535, - "log_start": "0'0", - "ondisk_log_start": "0'0", - "created": 1, - "last_epoch_clean": 1, - "parent": "0.0", - "parent_split_bits": 0, - "last_scrub": "4'1", - "last_scrub_stamp": "2013-01-25 10:12:23.828174", - "log_size": 128, - "ondisk_log_size": 128, - "stat_sum": { - "num_bytes": 205, - "num_objects": 1, - "num_object_clones": 0, - "num_object_copies": 0, - "num_objects_missing_on_primary": 0, - "num_objects_degraded": 0, - "num_objects_unfound": 0, - "num_read": 1, - "num_read_kb": 0, - "num_write": 3, - "num_write_kb": 1 - }, - "stat_cat_sum": { - - }, - "up": [ - 1, - 0 - ], - "acting": [ - 1, - 0 - ] - }, - "empty": 0, - "dne": 0, - "incomplete": 0 - }, - "recovery_state": [ - { - "name": "Started\/Primary\/Active", - "enter_time": "2013-01-23 09:35:37.594691", - "might_have_unfound": [ - - ], - "scrub": { - "scrub_epoch_start": "536", - "scrub_active": 0, - "scrub_block_writes": 0, - "finalizing_scrub": 0, - "scrub_waiting_on": 0, - "scrub_waiting_on_whom": [ - - ] - } - }, - { - "name": "Started", - "enter_time": "2013-01-23 09:35:31.581160" - } - ] - } - - - -The following subsections describe common states in greater detail. - -Creating --------- - -When you create a pool, it will create the number of placement groups you -specified. Ceph will echo ``creating`` when it is creating one or more -placement groups. Once they are created, the OSDs that are part of a placement -group's Acting Set will peer. Once peering is complete, the placement group -status should be ``active+clean``, which means a Ceph client can begin writing -to the placement group. - -.. ditaa:: - - /-----------\ /-----------\ /-----------\ - | Creating |------>| Peering |------>| Active | - \-----------/ \-----------/ \-----------/ - -Peering -------- - -When Ceph is Peering a placement group, Ceph is bringing the OSDs that -store the replicas of the placement group into **agreement about the state** -of the objects and metadata in the placement group. When Ceph completes peering, -this means that the OSDs that store the placement group agree about the current -state of the placement group. However, completion of the peering process does -**NOT** mean that each replica has the latest contents. - -.. topic:: Authoratative History - - Ceph will **NOT** acknowledge a write operation to a client, until - all OSDs of the acting set persist the write operation. This practice - ensures that at least one member of the acting set will have a record - of every acknowledged write operation since the last successful - peering operation. 
- - With an accurate record of each acknowledged write operation, Ceph can - construct and disseminate a new authoritative history of the placement - group--a complete, and fully ordered set of operations that, if performed, - would bring an OSD’s copy of a placement group up to date. - - -Active ------- - -Once Ceph completes the peering process, a placement group may become -``active``. The ``active`` state means that the data in the placement group is -generally available in the primary placement group and the replicas for read -and write operations. - - -Clean ------ - -When a placement group is in the ``clean`` state, the primary OSD and the -replica OSDs have successfully peered and there are no stray replicas for the -placement group. Ceph replicated all objects in the placement group the correct -number of times. - - -Degraded --------- - -When a client writes an object to the primary OSD, the primary OSD is -responsible for writing the replicas to the replica OSDs. After the primary OSD -writes the object to storage, the placement group will remain in a ``degraded`` -state until the primary OSD has received an acknowledgement from the replica -OSDs that Ceph created the replica objects successfully. - -The reason a placement group can be ``active+degraded`` is that an OSD may be -``active`` even though it doesn't hold all of the objects yet. If an OSD goes -``down``, Ceph marks each placement group assigned to the OSD as ``degraded``. -The OSDs must peer again when the OSD comes back online. However, a client can -still write a new object to a ``degraded`` placement group if it is ``active``. - -If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the -``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD -to another OSD. The time between being marked ``down`` and being marked ``out`` -is controlled by ``mon osd down out interval``, which is set to ``600`` seconds -by default. - -A placement group can also be ``degraded``, because Ceph cannot find one or more -objects that Ceph thinks should be in the placement group. While you cannot -read or write to unfound objects, you can still access all of the other objects -in the ``degraded`` placement group. - - -Recovering ----------- - -Ceph was designed for fault-tolerance at a scale where hardware and software -problems are ongoing. When an OSD goes ``down``, its contents may fall behind -the current state of other replicas in the placement groups. When the OSD is -back ``up``, the contents of the placement groups must be updated to reflect the -current state. During that time period, the OSD may reflect a ``recovering`` -state. - -Recovery is not always trivial, because a hardware failure might cause a -cascading failure of multiple OSDs. For example, a network switch for a rack or -cabinet may fail, which can cause the OSDs of a number of host machines to fall -behind the current state of the cluster. Each one of the OSDs must recover once -the fault is resolved. - -Ceph provides a number of settings to balance the resource contention between -new service requests and the need to recover data objects and restore the -placement groups to the current state. The ``osd recovery delay start`` setting -allows an OSD to restart, re-peer and even process some replay requests before -starting the recovery process. The ``osd -recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail, -restart and re-peer at staggered rates. 
The ``osd recovery max active`` setting
-limits the number of recovery requests an OSD will entertain simultaneously to
-prevent the OSD from failing to serve requests. The ``osd recovery max chunk``
-setting limits the size of the recovered data chunks to prevent network
-congestion.
-
-
-Back Filling
-------------
-
-When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs
-in the cluster to the newly added OSD. Forcing the new OSD to accept the
-reassigned placement groups immediately can put excessive load on the new OSD.
-Back filling the OSD with the placement groups allows this process to begin in
-the background. Once backfilling is complete, the new OSD will begin serving
-requests when it is ready.
-
-During the backfill operations, you may see one of several states:
-``backfill_wait`` indicates that a backfill operation is pending, but is not
-underway yet; ``backfill`` indicates that a backfill operation is underway;
-and, ``backfill_too_full`` indicates that a backfill operation was requested,
-but couldn't be completed due to insufficient storage capacity. When a
-placement group cannot be backfilled, it may be considered ``incomplete``.
-
-Ceph provides a number of settings to manage the load spike associated with
-reassigning placement groups to an OSD (especially a new OSD). By default,
-``osd_max_backfills`` sets the maximum number of concurrent backfills to or from
-an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a
-backfill request if the OSD is approaching its full ratio (90%, by default);
-it can be changed with the ``ceph osd set-backfillfull-ratio`` command.
-If an OSD refuses a backfill request, the ``osd backfill retry interval``
-enables an OSD to retry the request (after 10 seconds, by default). OSDs can
-also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
-intervals (64 and 512, by default).
-
-
-Remapped
---------
-
-When the Acting Set that services a placement group changes, the data migrates
-from the old acting set to the new acting set. It may take some time for a new
-primary OSD to service requests. So it may ask the old primary to continue to
-service requests until the placement group migration is complete. Once data
-migration completes, the mapping uses the primary OSD of the new acting set.
-
-
-Stale
------
-
-While Ceph uses heartbeats to ensure that hosts and daemons are running, the
-``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
-reporting statistics in a timely manner (e.g., a temporary network fault). By
-default, OSD daemons report their placement group, up thru, boot and failure
-statistics every half second (i.e., ``0.5``), which is more frequent than the
-heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
-fails to report to the monitor or if other OSDs have reported the primary OSD
-``down``, the monitors will mark the placement group ``stale``.
-
-When you start your cluster, it is common to see the ``stale`` state until
-the peering process completes. After your cluster has been running for a while,
-seeing placement groups in the ``stale`` state indicates that the primary OSD
-for those placement groups is ``down`` or not reporting placement group statistics
-to the monitor.
-
-
-Identifying Troubled PGs
-========================
-
-As previously noted, a placement group is not necessarily problematic just
-because its state is not ``active+clean``.
Generally, Ceph's ability to self -repair may not be working when placement groups get stuck. The stuck states -include: - -- **Unclean**: Placement groups contain objects that are not replicated the - desired number of times. They should be recovering. -- **Inactive**: Placement groups cannot process reads or writes because they - are waiting for an OSD with the most up-to-date data to come back ``up``. -- **Stale**: Placement groups are in an unknown state, because the OSDs that - host them have not reported to the monitor cluster in a while (configured - by ``mon osd report timeout``). - -To identify stuck placement groups, execute the following:: - - ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] - -See `Placement Group Subsystem`_ for additional details. To troubleshoot -stuck placement groups, see `Troubleshooting PG Errors`_. - - -Finding an Object Location -========================== - -To store object data in the Ceph Object Store, a Ceph client must: - -#. Set an object name -#. Specify a `pool`_ - -The Ceph client retrieves the latest cluster map and the CRUSH algorithm -calculates how to map the object to a `placement group`_, and then calculates -how to assign the placement group to an OSD dynamically. To find the object -location, all you need is the object name and the pool name. For example:: - - ceph osd map {poolname} {object-name} - -.. topic:: Exercise: Locate an Object - - As an exercise, lets create an object. Specify an object name, a path to a - test file containing some object data and a pool name using the - ``rados put`` command on the command line. For example:: - - rados put {object-name} {file-path} --pool=data - rados put test-object-1 testfile.txt --pool=data - - To verify that the Ceph Object Store stored the object, execute the following:: - - rados -p data ls - - Now, identify the object location:: - - ceph osd map {pool-name} {object-name} - ceph osd map data test-object-1 - - Ceph should output the object's location. For example:: - - osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0] - - To remove the test object, simply delete it using the ``rados rm`` command. - For example:: - - rados rm test-object-1 --pool=data - - -As the cluster evolves, the object location may change dynamically. One benefit -of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform -the migration manually. See the `Architecture`_ section for details. - -.. _data placement: ../data-placement -.. _pool: ../pools -.. _placement group: ../placement-groups -.. _Architecture: ../../../architecture -.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running -.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors -.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering -.. _CRUSH map: ../crush-map -.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ -.. _Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/src/ceph/doc/rados/operations/monitoring.rst b/src/ceph/doc/rados/operations/monitoring.rst deleted file mode 100644 index c291440..0000000 --- a/src/ceph/doc/rados/operations/monitoring.rst +++ /dev/null @@ -1,351 +0,0 @@ -====================== - Monitoring a Cluster -====================== - -Once you have a running cluster, you may use the ``ceph`` tool to monitor your -cluster. 
Monitoring a cluster typically involves checking OSD status, monitor -status, placement group status and metadata server status. - -Using the command line -====================== - -Interactive mode ----------------- - -To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line -with no arguments. For example:: - - ceph - ceph> health - ceph> status - ceph> quorum_status - ceph> mon_status - -Non-default paths ------------------ - -If you specified non-default locations for your configuration or keyring, -you may specify their locations:: - - ceph -c /path/to/conf -k /path/to/keyring health - -Checking a Cluster's Status -=========================== - -After you start your cluster, and before you start reading and/or -writing data, check your cluster's status first. - -To check a cluster's status, execute the following:: - - ceph status - -Or:: - - ceph -s - -In interactive mode, type ``status`` and press **Enter**. :: - - ceph> status - -Ceph will print the cluster status. For example, a tiny Ceph demonstration -cluster with one of each service may print the following: - -:: - - cluster: - id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 - health: HEALTH_OK - - services: - mon: 1 daemons, quorum a - mgr: x(active) - mds: 1/1/1 up {0=a=up:active} - osd: 1 osds: 1 up, 1 in - - data: - pools: 2 pools, 16 pgs - objects: 21 objects, 2246 bytes - usage: 546 GB used, 384 GB / 931 GB avail - pgs: 16 active+clean - - -.. topic:: How Ceph Calculates Data Usage - - The ``usage`` value reflects the *actual* amount of raw storage used. The - ``xxx GB / xxx GB`` value means the amount available (the lesser number) - of the overall storage capacity of the cluster. The notional number reflects - the size of the stored data before it is replicated, cloned or snapshotted. - Therefore, the amount of data actually stored typically exceeds the notional - amount stored, because Ceph creates replicas of the data and may also use - storage capacity for cloning and snapshotting. - - -Watching a Cluster -================== - -In addition to local logging by each daemon, Ceph clusters maintain -a *cluster log* that records high level events about the whole system. -This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by -default), but can also be monitored via the command line. - -To follow the cluster log, use the following command - -:: - - ceph -w - -Ceph will print the status of the system, followed by each log message as it -is emitted. For example: - -:: - - cluster: - id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 - health: HEALTH_OK - - services: - mon: 1 daemons, quorum a - mgr: x(active) - mds: 1/1/1 up {0=a=up:active} - osd: 1 osds: 1 up, 1 in - - data: - pools: 2 pools, 16 pgs - objects: 21 objects, 2246 bytes - usage: 546 GB used, 384 GB / 931 GB avail - pgs: 16 active+clean - - - 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot - 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x - 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available - - -In addition to using ``ceph -w`` to print log lines as they are emitted, -use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster -log. - -Monitoring Health Checks -======================== - -Ceph continously runs various *health checks* against its own status. 
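-
-The set of health checks that are currently raised, along with additional
-detail, can be listed with, for example::
-
-    ceph health detail
-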
When -a health check fails, this is reflected in the output of ``ceph status`` (or -``ceph health``). In addition, messages are sent to the cluster log to -indicate when a check fails, and when the cluster recovers. - -For example, when an OSD goes down, the ``health`` section of the status -output may be updated as follows: - -:: - - health: HEALTH_WARN - 1 osds down - Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded - -At this time, cluster log messages are also emitted to record the failure of the -health checks: - -:: - - 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) - 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) - -When the OSD comes back online, the cluster log records the cluster's return -to a health state: - -:: - - 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED) - 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized) - 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy - - -Detecting configuration issues -============================== - -In addition to the health checks that Ceph continuously runs on its -own status, there are some configuration issues that may only be detected -by an external tool. - -Use the `ceph-medic`_ tool to run these additional checks on your Ceph -cluster's configuration. - -Checking a Cluster's Usage Stats -================================ - -To check a cluster's data usage and data distribution among pools, you can -use the ``df`` option. It is similar to Linux ``df``. Execute -the following:: - - ceph df - -The **GLOBAL** section of the output provides an overview of the amount of -storage your cluster uses for your data. - -- **SIZE:** The overall storage capacity of the cluster. -- **AVAIL:** The amount of free space available in the cluster. -- **RAW USED:** The amount of raw storage used. -- **% RAW USED:** The percentage of raw storage used. Use this number in - conjunction with the ``full ratio`` and ``near full ratio`` to ensure that - you are not reaching your cluster's capacity. See `Storage Capacity`_ for - additional details. - -The **POOLS** section of the output provides a list of pools and the notional -usage of each pool. The output from this section **DOES NOT** reflect replicas, -clones or snapshots. For example, if you store an object with 1MB of data, the -notional usage will be 1MB, but the actual usage may be 2MB or more depending -on the number of replicas, clones and snapshots. - -- **NAME:** The name of the pool. -- **ID:** The pool ID. -- **USED:** The notional amount of data stored in kilobytes, unless the number - appends **M** for megabytes or **G** for gigabytes. -- **%USED:** The notional percentage of storage used per pool. -- **MAX AVAIL:** An estimate of the notional amount of data that can be written - to this pool. -- **Objects:** The notional number of objects stored per pool. - -.. note:: The numbers in the **POOLS** section are notional. They are not - inclusive of the number of replicas, shapshots or clones. 
As a result, - the sum of the **USED** and **%USED** amounts will not add up to the - **RAW USED** and **%RAW USED** amounts in the **GLOBAL** section of the - output. - -.. note:: The **MAX AVAIL** value is a complicated function of the - replication or erasure code used, the CRUSH rule that maps storage - to devices, the utilization of those devices, and the configured - mon_osd_full_ratio. - - - -Checking OSD Status -=================== - -You can check OSDs to ensure they are ``up`` and ``in`` by executing:: - - ceph osd stat - -Or:: - - ceph osd dump - -You can also check view OSDs according to their position in the CRUSH map. :: - - ceph osd tree - -Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up -and their weight. :: - - # id weight type name up/down reweight - -1 3 pool default - -3 3 rack mainrack - -2 3 host osd-host - 0 1 osd.0 up 1 - 1 1 osd.1 up 1 - 2 1 osd.2 up 1 - -For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. - -Checking Monitor Status -======================= - -If your cluster has multiple monitors (likely), you should check the monitor -quorum status after you start the cluster before reading and/or writing data. A -quorum must be present when multiple monitors are running. You should also check -monitor status periodically to ensure that they are running. - -To see display the monitor map, execute the following:: - - ceph mon stat - -Or:: - - ceph mon dump - -To check the quorum status for the monitor cluster, execute the following:: - - ceph quorum_status - -Ceph will return the quorum status. For example, a Ceph cluster consisting of -three monitors may return the following: - -.. code-block:: javascript - - { "election_epoch": 10, - "quorum": [ - 0, - 1, - 2], - "monmap": { "epoch": 1, - "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", - "modified": "2011-12-12 13:28:27.505520", - "created": "2011-12-12 13:28:27.505520", - "mons": [ - { "rank": 0, - "name": "a", - "addr": "127.0.0.1:6789\/0"}, - { "rank": 1, - "name": "b", - "addr": "127.0.0.1:6790\/0"}, - { "rank": 2, - "name": "c", - "addr": "127.0.0.1:6791\/0"} - ] - } - } - -Checking MDS Status -=================== - -Metadata servers provide metadata services for Ceph FS. Metadata servers have -two sets of states: ``up | down`` and ``active | inactive``. To ensure your -metadata servers are ``up`` and ``active``, execute the following:: - - ceph mds stat - -To display details of the metadata cluster, execute the following:: - - ceph fs dump - - -Checking Placement Group States -=============================== - -Placement groups map objects to OSDs. When you monitor your -placement groups, you will want them to be ``active`` and ``clean``. -For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. - -.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg - - -Using the Admin Socket -====================== - -The Ceph admin socket allows you to query a daemon via a socket interface. -By default, Ceph sockets reside under ``/var/run/ceph``. 
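-The sockets present on a host can be listed with, for example::
-
-    ls /var/run/ceph
-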
To access a daemon -via the admin socket, login to the host running the daemon and use the -following command:: - - ceph daemon {daemon-name} - ceph daemon {path-to-socket-file} - -For example, the following are equivalent:: - - ceph daemon osd.0 foo - ceph daemon /var/run/ceph/ceph-osd.0.asok foo - -To view the available admin socket commands, execute the following command:: - - ceph daemon {daemon-name} help - -The admin socket command enables you to show and set your configuration at -runtime. See `Viewing a Configuration at Runtime`_ for details. - -Additionally, you can set configuration values at runtime directly (i.e., the -admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} -injectargs``, which relies on the monitor but doesn't require you to login -directly to the host in question ). - -.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config -.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity -.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/ diff --git a/src/ceph/doc/rados/operations/operating.rst b/src/ceph/doc/rados/operations/operating.rst deleted file mode 100644 index 791941a..0000000 --- a/src/ceph/doc/rados/operations/operating.rst +++ /dev/null @@ -1,251 +0,0 @@ -===================== - Operating a Cluster -===================== - -.. index:: systemd; operating a cluster - - -Running Ceph with systemd -========================== - -For all distributions that support systemd (CentOS 7, Fedora, Debian -Jessie 8 and later, SUSE), ceph daemons are now managed using native -systemd files instead of the legacy sysvinit scripts. For example:: - - sudo systemctl start ceph.target # start all daemons - sudo systemctl status ceph-osd@12 # check status of osd.12 - -To list the Ceph systemd units on a node, execute:: - - sudo systemctl status ceph\*.service ceph\*.target - -Starting all Daemons --------------------- - -To start all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo systemctl start ceph.target - - -Stopping all Daemons --------------------- - -To stop all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo systemctl stop ceph\*.service ceph\*.target - - -Starting all Daemons by Type ----------------------------- - -To start all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo systemctl start ceph-osd.target - sudo systemctl start ceph-mon.target - sudo systemctl start ceph-mds.target - - -Stopping all Daemons by Type ----------------------------- - -To stop all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo systemctl stop ceph-mon\*.service ceph-mon.target - sudo systemctl stop ceph-osd\*.service ceph-osd.target - sudo systemctl stop ceph-mds\*.service ceph-mds.target - - -Starting a Daemon ------------------ - -To start a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo systemctl start ceph-osd@{id} - sudo systemctl start ceph-mon@{hostname} - sudo systemctl start ceph-mds@{hostname} - -For example:: - - sudo systemctl start ceph-osd@1 - sudo systemctl start ceph-mon@ceph-server - sudo systemctl start ceph-mds@ceph-server - - -Stopping a Daemon ------------------ - -To stop a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo systemctl stop ceph-osd@{id} - sudo systemctl stop ceph-mon@{hostname} - sudo systemctl stop ceph-mds@{hostname} - -For example:: - - sudo systemctl stop ceph-osd@1 - 
sudo systemctl stop ceph-mon@ceph-server - sudo systemctl stop ceph-mds@ceph-server - - -.. index:: Ceph service; Upstart; operating a cluster - - - -Running Ceph with Upstart -========================= - -When deploying Ceph with ``ceph-deploy`` on Ubuntu Trusty, you may start and -stop Ceph daemons on a :term:`Ceph Node` using the event-based `Upstart`_. -Upstart does not require you to define daemon instances in the Ceph -configuration file. - -To list the Ceph Upstart jobs and instances on a node, execute:: - - sudo initctl list | grep ceph - -See `initctl`_ for additional details. - - -Starting all Daemons --------------------- - -To start all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo start ceph-all - - -Stopping all Daemons --------------------- - -To stop all daemons on a Ceph Node (irrespective of type), execute the -following:: - - sudo stop ceph-all - - -Starting all Daemons by Type ----------------------------- - -To start all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo start ceph-osd-all - sudo start ceph-mon-all - sudo start ceph-mds-all - - -Stopping all Daemons by Type ----------------------------- - -To stop all daemons of a particular type on a Ceph Node, execute one of the -following:: - - sudo stop ceph-osd-all - sudo stop ceph-mon-all - sudo stop ceph-mds-all - - -Starting a Daemon ------------------ - -To start a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo start ceph-osd id={id} - sudo start ceph-mon id={hostname} - sudo start ceph-mds id={hostname} - -For example:: - - sudo start ceph-osd id=1 - sudo start ceph-mon id=ceph-server - sudo start ceph-mds id=ceph-server - - -Stopping a Daemon ------------------ - -To stop a specific daemon instance on a Ceph Node, execute one of the -following:: - - sudo stop ceph-osd id={id} - sudo stop ceph-mon id={hostname} - sudo stop ceph-mds id={hostname} - -For example:: - - sudo stop ceph-osd id=1 - sudo start ceph-mon id=ceph-server - sudo start ceph-mds id=ceph-server - - -.. index:: Ceph service; sysvinit; operating a cluster - - -Running Ceph -============ - -Each time you to **start**, **restart**, and **stop** Ceph daemons (or your -entire cluster) you must specify at least one option and one command. You may -also specify a daemon type or a daemon instance. :: - - {commandline} [options] [commands] [daemons] - - -The ``ceph`` options include: - -+-----------------+----------+-------------------------------------------------+ -| Option | Shortcut | Description | -+=================+==========+=================================================+ -| ``--verbose`` | ``-v`` | Use verbose logging. | -+-----------------+----------+-------------------------------------------------+ -| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | -+-----------------+----------+-------------------------------------------------+ -| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` | -| | | Otherwise, it only executes on ``localhost``. | -+-----------------+----------+-------------------------------------------------+ -| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | -+-----------------+----------+-------------------------------------------------+ -| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. | -+-----------------+----------+-------------------------------------------------+ -| ``--conf`` | ``-c`` | Use an alternate configuration file. 
| -+-----------------+----------+-------------------------------------------------+ - -The ``ceph`` commands include: - -+------------------+------------------------------------------------------------+ -| Command | Description | -+==================+============================================================+ -| ``start`` | Start the daemon(s). | -+------------------+------------------------------------------------------------+ -| ``stop`` | Stop the daemon(s). | -+------------------+------------------------------------------------------------+ -| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` | -+------------------+------------------------------------------------------------+ -| ``killall`` | Kill all daemons of a particular type. | -+------------------+------------------------------------------------------------+ -| ``cleanlogs`` | Cleans out the log directory. | -+------------------+------------------------------------------------------------+ -| ``cleanalllogs`` | Cleans out **everything** in the log directory. | -+------------------+------------------------------------------------------------+ - -For subsystem operations, the ``ceph`` service can target specific daemon types -by adding a particular daemon type for the ``[daemons]`` option. Daemon types -include: - -- ``mon`` -- ``osd`` -- ``mds`` - - - -.. _Valgrind: http://www.valgrind.org/ -.. _Upstart: http://upstart.ubuntu.com/index.html -.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/src/ceph/doc/rados/operations/pg-concepts.rst b/src/ceph/doc/rados/operations/pg-concepts.rst deleted file mode 100644 index 636d6bf..0000000 --- a/src/ceph/doc/rados/operations/pg-concepts.rst +++ /dev/null @@ -1,102 +0,0 @@ -========================== - Placement Group Concepts -========================== - -When you execute commands like ``ceph -w``, ``ceph osd dump``, and other -commands related to placement groups, Ceph may return values using some -of the following terms: - -*Peering* - The process of bringing all of the OSDs that store - a Placement Group (PG) into agreement about the state - of all of the objects (and their metadata) in that PG. - Note that agreeing on the state does not mean that - they all have the latest contents. - -*Acting Set* - The ordered list of OSDs who are (or were as of some epoch) - responsible for a particular placement group. - -*Up Set* - The ordered list of OSDs responsible for a particular placement - group for a particular epoch according to CRUSH. Normally this - is the same as the *Acting Set*, except when the *Acting Set* has - been explicitly overridden via ``pg_temp`` in the OSD Map. - -*Current Interval* or *Past Interval* - A sequence of OSD map epochs during which the *Acting Set* and *Up - Set* for particular placement group do not change. - -*Primary* - The member (and by convention first) of the *Acting Set*, - that is responsible for coordination peering, and is - the only OSD that will accept client-initiated - writes to objects in a placement group. - -*Replica* - A non-primary OSD in the *Acting Set* for a placement group - (and who has been recognized as such and *activated* by the primary). - -*Stray* - An OSD that is not a member of the current *Acting Set*, but - has not yet been told that it can delete its copies of a - particular placement group. - -*Recovery* - Ensuring that copies of all of the objects in a placement group - are on all of the OSDs in the *Acting Set*. 
Once *Peering* has - been performed, the *Primary* can start accepting write operations, - and *Recovery* can proceed in the background. - -*PG Info* - Basic metadata about the placement group's creation epoch, the version - for the most recent write to the placement group, *last epoch started*, - *last epoch clean*, and the beginning of the *current interval*. Any - inter-OSD communication about placement groups includes the *PG Info*, - such that any OSD that knows a placement group exists (or once existed) - also has a lower bound on *last epoch clean* or *last epoch started*. - -*PG Log* - A list of recent updates made to objects in a placement group. - Note that these logs can be truncated after all OSDs - in the *Acting Set* have acknowledged up to a certain - point. - -*Missing Set* - Each OSD notes update log entries and if they imply updates to - the contents of an object, adds that object to a list of needed - updates. This list is called the *Missing Set* for that ``<OSD,PG>``. - -*Authoritative History* - A complete, and fully ordered set of operations that, if - performed, would bring an OSD's copy of a placement group - up to date. - -*Epoch* - A (monotonically increasing) OSD map version number - -*Last Epoch Start* - The last epoch at which all nodes in the *Acting Set* - for a particular placement group agreed on an - *Authoritative History*. At this point, *Peering* is - deemed to have been successful. - -*up_thru* - Before a *Primary* can successfully complete the *Peering* process, - it must inform a monitor that is alive through the current - OSD map *Epoch* by having the monitor set its *up_thru* in the osd - map. This helps *Peering* ignore previous *Acting Sets* for which - *Peering* never completed after certain sequences of failures, such as - the second interval below: - - - *acting set* = [A,B] - - *acting set* = [A] - - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection) - - *acting set* = [B] (B restarts, A does not) - -*Last Epoch Clean* - The last *Epoch* at which all nodes in the *Acting set* - for a particular placement group were completely - up to date (both placement group logs and object contents). - At this point, *recovery* is deemed to have been - completed. diff --git a/src/ceph/doc/rados/operations/pg-repair.rst b/src/ceph/doc/rados/operations/pg-repair.rst deleted file mode 100644 index 0d6692a..0000000 --- a/src/ceph/doc/rados/operations/pg-repair.rst +++ /dev/null @@ -1,4 +0,0 @@ -Repairing PG inconsistencies -============================ - - diff --git a/src/ceph/doc/rados/operations/pg-states.rst b/src/ceph/doc/rados/operations/pg-states.rst deleted file mode 100644 index 0fbd3dc..0000000 --- a/src/ceph/doc/rados/operations/pg-states.rst +++ /dev/null @@ -1,80 +0,0 @@ -======================== - Placement Group States -======================== - -When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``), -Ceph will report on the status of the placement groups. A placement group has -one or more states. The optimum state for placement groups in the placement group -map is ``active + clean``. - -*Creating* - Ceph is still creating the placement group. - -*Active* - Ceph will process requests to the placement group. - -*Clean* - Ceph replicated all objects in the placement group the correct number of times. - -*Down* - A replica with necessary data is down, so the placement group is offline. - -*Scrubbing* - Ceph is checking the placement group for inconsistencies. 
- -*Degraded* - Ceph has not replicated some objects in the placement group the correct number of times yet. - -*Inconsistent* - Ceph detects inconsistencies in the one or more replicas of an object in the placement group - (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.). - -*Peering* - The placement group is undergoing the peering process - -*Repair* - Ceph is checking the placement group and repairing any inconsistencies it finds (if possible). - -*Recovering* - Ceph is migrating/synchronizing objects and their replicas. - -*Forced-Recovery* - High recovery priority of that PG is enforced by user. - -*Backfill* - Ceph is scanning and synchronizing the entire contents of a placement group - instead of inferring what contents need to be synchronized from the logs of - recent operations. *Backfill* is a special case of recovery. - -*Forced-Backfill* - High backfill priority of that PG is enforced by user. - -*Wait-backfill* - The placement group is waiting in line to start backfill. - -*Backfill-toofull* - A backfill operation is waiting because the destination OSD is over its - full ratio. - -*Incomplete* - Ceph detects that a placement group is missing information about - writes that may have occurred, or does not have any healthy - copies. If you see this state, try to start any failed OSDs that may - contain the needed information. In the case of an erasure coded pool - temporarily reducing min_size may allow recovery. - -*Stale* - The placement group is in an unknown state - the monitors have not received - an update for it since the placement group mapping changed. - -*Remapped* - The placement group is temporarily mapped to a different set of OSDs from what - CRUSH specified. - -*Undersized* - The placement group fewer copies than the configured pool replication level. - -*Peered* - The placement group has peered, but cannot serve client IO due to not having - enough copies to reach the pool's configured min_size parameter. Recovery - may occur in this state, so the pg may heal up to min_size eventually. diff --git a/src/ceph/doc/rados/operations/placement-groups.rst b/src/ceph/doc/rados/operations/placement-groups.rst deleted file mode 100644 index fee833a..0000000 --- a/src/ceph/doc/rados/operations/placement-groups.rst +++ /dev/null @@ -1,469 +0,0 @@ -================== - Placement Groups -================== - -.. _preselection: - -A preselection of pg_num -======================== - -When creating a new pool with:: - - ceph osd pool create {pool-name} pg_num - -it is mandatory to choose the value of ``pg_num`` because it cannot be -calculated automatically. Here are a few values commonly used: - -- Less than 5 OSDs set ``pg_num`` to 128 - -- Between 5 and 10 OSDs set ``pg_num`` to 512 - -- Between 10 and 50 OSDs set ``pg_num`` to 1024 - -- If you have more than 50 OSDs, you need to understand the tradeoffs - and how to calculate the ``pg_num`` value by yourself - -- For calculating ``pg_num`` value by yourself please take help of `pgcalc`_ tool - -As the number of OSDs increases, chosing the right value for pg_num -becomes more important because it has a significant influence on the -behavior of the cluster as well as the durability of the data when -something goes wrong (i.e. the probability that a catastrophic event -leads to data loss). - -How are Placement Groups used ? 
-=============================== - -A placement group (PG) aggregates objects within a pool because -tracking object placement and object metadata on a per-object basis is -computationally expensive--i.e., a system with millions of objects -cannot realistically track placement on a per-object basis. - -.. ditaa:: - /-----\ /-----\ /-----\ /-----\ /-----\ - | obj | | obj | | obj | | obj | | obj | - \-----/ \-----/ \-----/ \-----/ \-----/ - | | | | | - +--------+--------+ +---+----+ - | | - v v - +-----------------------+ +-----------------------+ - | Placement Group #1 | | Placement Group #2 | - | | | | - +-----------------------+ +-----------------------+ - | | - +------------------------------+ - | - v - +-----------------------+ - | Pool | - | | - +-----------------------+ - -The Ceph client will calculate which placement group an object should -be in. It does this by hashing the object ID and applying an operation -based on the number of PGs in the defined pool and the ID of the pool. -See `Mapping PGs to OSDs`_ for details. - -The object's contents within a placement group are stored in a set of -OSDs. For instance, in a replicated pool of size two, each placement -group will store objects on two OSDs, as shown below. - -.. ditaa:: - - +-----------------------+ +-----------------------+ - | Placement Group #1 | | Placement Group #2 | - | | | | - +-----------------------+ +-----------------------+ - | | | | - v v v v - /----------\ /----------\ /----------\ /----------\ - | | | | | | | | - | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | - | | | | | | | | - \----------/ \----------/ \----------/ \----------/ - - -Should OSD #2 fail, another will be assigned to Placement Group #1 and -will be filled with copies of all objects in OSD #1. If the pool size -is changed from two to three, an additional OSD will be assigned to -the placement group and will receive copies of all objects in the -placement group. - -Placement groups do not own the OSD, they share it with other -placement groups from the same pool or even other pools. If OSD #2 -fails, the Placement Group #2 will also have to restore copies of -objects, using OSD #3. - -When the number of placement groups increases, the new placement -groups will be assigned OSDs. The result of the CRUSH function will -also change and some objects from the former placement groups will be -copied over to the new Placement Groups and removed from the old ones. - -Placement Groups Tradeoffs -========================== - -Data durability and even distribution among all OSDs call for more -placement groups but their number should be reduced to the minimum to -save CPU and memory. - -.. _data durability: - -Data durability ---------------- - -After an OSD fails, the risk of data loss increases until the data it -contained is fully recovered. Let's imagine a scenario that causes -permanent data loss in a single placement group: - -- The OSD fails and all copies of the object it contains are lost. - For all objects within the placement group the number of replica - suddently drops from three to two. - -- Ceph starts recovery for this placement group by chosing a new OSD - to re-create the third copy of all objects. - -- Another OSD, within the same placement group, fails before the new - OSD is fully populated with the third copy. Some objects will then - only have one surviving copies. - -- Ceph picks yet another OSD and keeps copying objects to restore the - desired number of copies. 
- -- A third OSD, within the same placement group, fails before recovery - is complete. If this OSD contained the only remaining copy of an - object, it is permanently lost. - -In a cluster containing 10 OSDs with 512 placement groups in a three -replica pool, CRUSH will give each placement groups three OSDs. In the -end, each OSDs will end up hosting (512 * 3) / 10 = ~150 Placement -Groups. When the first OSD fails, the above scenario will therefore -start recovery for all 150 placement groups at the same time. - -The 150 placement groups being recovered are likely to be -homogeneously spread over the 9 remaining OSDs. Each remaining OSD is -therefore likely to send copies of objects to all others and also -receive some new objects to be stored because they became part of a -new placement group. - -The amount of time it takes for this recovery to complete entirely -depends on the architecture of the Ceph cluster. Let say each OSD is -hosted by a 1TB SSD on a single machine and all of them are connected -to a 10Gb/s switch and the recovery for a single OSD completes within -M minutes. If there are two OSDs per machine using spinners with no -SSD journal and a 1Gb/s switch, it will at least be an order of -magnitude slower. - -In a cluster of this size, the number of placement groups has almost -no influence on data durability. It could be 128 or 8192 and the -recovery would not be slower or faster. - -However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs -is likely to speed up recovery and therefore improve data durability -significantly. Each OSD now participates in only ~75 placement groups -instead of ~150 when there were only 10 OSDs and it will still require -all 19 remaining OSDs to perform the same amount of object copies in -order to recover. But where 10 OSDs had to copy approximately 100GB -each, they now have to copy 50GB each instead. If the network was the -bottleneck, recovery will happen twice as fast. In other words, -recovery goes faster when the number of OSDs increases. - -If this cluster grows to 40 OSDs, each of them will only host ~35 -placement groups. If an OSD dies, recovery will keep going faster -unless it is blocked by another bottleneck. However, if this cluster -grows to 200 OSDs, each of them will only host ~7 placement groups. If -an OSD dies, recovery will happen between at most of ~21 (7 * 3) OSDs -in these placement groups: recovery will take longer than when there -were 40 OSDs, meaning the number of placement groups should be -increased. - -No matter how short the recovery time is, there is a chance for a -second OSD to fail while it is in progress. In the 10 OSDs cluster -described above, if any of them fail, then ~17 placement groups -(i.e. ~150 / 9 placement groups being recovered) will only have one -surviving copy. And if any of the 8 remaining OSD fail, the last -objects of two placement groups are likely to be lost (i.e. ~17 / 8 -placement groups with only one remaining copy being recovered). - -When the size of the cluster grows to 20 OSDs, the number of Placement -Groups damaged by the loss of three OSDs drops. The second OSD lost -will degrade ~4 (i.e. ~75 / 19 placement groups being recovered) -instead of ~17 and the third OSD lost will only lose data if it is one -of the four OSDs containing the surviving copy. In other words, if the -probability of losing one OSD is 0.0001% during the recovery time -frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 * -0.0001% in the cluster with 20 OSDs. 
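The reasoning above can be reproduced with a few lines of Python. This is only a
back-of-the-envelope sketch using the illustrative numbers from this section
(512 placement groups, three replicas, and a 0.0001% chance of losing any given
OSD during the recovery window); it is not an official durability model.

.. code-block:: python

    def single_copy_pgs(osds, pg_num=512, size=3):
        """Estimate how many PGs drop to a single surviving copy when a
        second OSD fails while the first failed OSD is being recovered."""
        pgs_per_osd = pg_num * size / osds      # ~150 with 10 OSDs, ~75 with 20
        return pgs_per_osd / (osds - 1)         # spread over the surviving OSDs

    for osds in (10, 20):
        exposed = single_copy_pgs(osds)         # ~17 with 10 OSDs, ~4 with 20
        risk = exposed * osds * 0.0001          # 0.0001% loss chance per OSD
        print("%d OSDs: ~%.0f single-copy PGs, ~%.4f%% risk" % (osds, exposed, risk))
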
- -In a nutshell, more OSDs mean faster recovery and a lower risk of -cascading failures leading to the permanent loss of a Placement -Group. Having 512 or 4096 Placement Groups is roughly equivalent in a -cluster with less than 50 OSDs as far as data durability is concerned. - -Note: It may take a long time for a new OSD added to the cluster to be -populated with placement groups that were assigned to it. However -there is no degradation of any object and it has no impact on the -durability of the data contained in the Cluster. - -.. _object distribution: - -Object distribution within a pool ---------------------------------- - -Ideally objects are evenly distributed in each placement group. Since -CRUSH computes the placement group for each object, but does not -actually know how much data is stored in each OSD within this -placement group, the ratio between the number of placement groups and -the number of OSDs may influence the distribution of the data -significantly. - -For instance, if there was single a placement group for ten OSDs in a -three replica pool, only three OSD would be used because CRUSH would -have no other choice. When more placement groups are available, -objects are more likely to be evenly spread among them. CRUSH also -makes every effort to evenly spread OSDs among all existing Placement -Groups. - -As long as there are one or two orders of magnitude more Placement -Groups than OSDs, the distribution should be even. For instance, 300 -placement groups for 3 OSDs, 1000 placement groups for 10 OSDs etc. - -Uneven data distribution can be caused by factors other than the ratio -between OSDs and placement groups. Since CRUSH does not take into -account the size of the objects, a few very large objects may create -an imbalance. Let say one million 4K objects totaling 4GB are evenly -spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10 -= 400MB on each OSD. If one 400MB object is added to the pool, the -three OSDs supporting the placement group in which the object has been -placed will be filled with 400MB + 400MB = 800MB while the seven -others will remain occupied with only 400MB. - -.. _resource usage: - -Memory, CPU and network usage ------------------------------ - -For each placement group, OSDs and MONs need memory, network and CPU -at all times and even more during recovery. Sharing this overhead by -clustering objects within a placement group is one of the main reasons -they exist. - -Minimizing the number of placement groups saves significant amounts of -resources. - -Choosing the number of Placement Groups -======================================= - -If you have more than 50 OSDs, we recommend approximately 50-100 -placement groups per OSD to balance out resource usage, data -durability and distribution. If you have less than 50 OSDs, chosing -among the `preselection`_ above is best. For a single pool of objects, -you can use the following formula to get a baseline:: - - (OSDs * 100) - Total PGs = ------------ - pool size - -Where **pool size** is either the number of replicas for replicated -pools or the K+M sum for erasure coded pools (as returned by **ceph -osd erasure-code-profile get**). - -You should then check if the result makes sense with the way you -designed your Ceph cluster to maximize `data durability`_, -`object distribution`_ and minimize `resource usage`_. 
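As a rough illustration, the baseline can be computed with a short Python helper.
This is a sketch only: the 10-OSD cluster shown is hypothetical, and the
power-of-two rounding applied at the end is the recommendation discussed just
below.

.. code-block:: python

    import math

    def baseline_pg_count(osds, pool_size):
        """Baseline from the formula above; pool_size is the replica count
        for replicated pools or K+M for erasure coded pools."""
        return (osds * 100) / pool_size

    def round_up_to_power_of_two(n):
        return 2 ** math.ceil(math.log2(n))

    baseline = baseline_pg_count(10, 3)                    # ~333.3
    print(baseline, round_up_to_power_of_two(baseline))    # 333.33... 512
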
- -The result should be **rounded up to the nearest power of two.** -Rounding up is optional, but recommended for CRUSH to evenly balance -the number of objects among placement groups. - -As an example, for a cluster with 200 OSDs and a pool size of 3 -replicas, you would estimate your number of PGs as follows:: - - (200 * 100) - ----------- = 6667. Nearest power of 2: 8192 - 3 - -When using multiple data pools for storing objects, you need to ensure -that you balance the number of placement groups per pool with the -number of placement groups per OSD so that you arrive at a reasonable -total number of placement groups that provides reasonably low variance -per OSD without taxing system resources or making the peering process -too slow. - -For instance a cluster of 10 pools each with 512 placement groups on -ten OSDs is a total of 5,120 placement groups spread over ten OSDs, -that is 512 placement groups per OSD. That does not use too many -resources. However, if 1,000 pools were created with 512 placement -groups each, the OSDs will handle ~50,000 placement groups each and it -would require significantly more resources and time for peering. - -You may find the `PGCalc`_ tool helpful. - - -.. _setting the number of placement groups: - -Set the Number of Placement Groups -================================== - -To set the number of placement groups in a pool, you must specify the -number of placement groups at the time you create the pool. -See `Create a Pool`_ for details. Once you have set placement groups for a -pool, you may increase the number of placement groups (but you cannot -decrease the number of placement groups). To increase the number of -placement groups, execute the following:: - - ceph osd pool set {pool-name} pg_num {pg_num} - -Once you increase the number of placement groups, you must also -increase the number of placement groups for placement (``pgp_num``) -before your cluster will rebalance. The ``pgp_num`` will be the number of -placement groups that will be considered for placement by the CRUSH -algorithm. Increasing ``pg_num`` splits the placement groups but data -will not be migrated to the newer placement groups until placement -groups for placement, ie. ``pgp_num`` is increased. The ``pgp_num`` -should be equal to the ``pg_num``. To increase the number of -placement groups for placement, execute the following:: - - ceph osd pool set {pool-name} pgp_num {pgp_num} - - -Get the Number of Placement Groups -================================== - -To get the number of placement groups in a pool, execute the following:: - - ceph osd pool get {pool-name} pg_num - - -Get a Cluster's PG Statistics -============================= - -To get the statistics for the placement groups in your cluster, execute the following:: - - ceph pg dump [--format {format}] - -Valid formats are ``plain`` (default) and ``json``. - - -Get Statistics for Stuck PGs -============================ - -To get the statistics for all placement groups stuck in a specified state, -execute the following:: - - ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] - -**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD -with the most up-to-date data to come up and in. - -**Unclean** Placement groups contain objects that are not replicated the desired number -of times. They should be recovering. 
- -**Stale** Placement groups are in an unknown state - the OSDs that host them have not -reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``). - -Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number -of seconds the placement group is stuck before including it in the returned statistics -(default 300 seconds). - - -Get a PG Map -============ - -To get the placement group map for a particular placement group, execute the following:: - - ceph pg map {pg-id} - -For example:: - - ceph pg map 1.6c - -Ceph will return the placement group map, the placement group, and the OSD status:: - - osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] - - -Get a PGs Statistics -==================== - -To retrieve statistics for a particular placement group, execute the following:: - - ceph pg {pg-id} query - - -Scrub a Placement Group -======================= - -To scrub a placement group, execute the following:: - - ceph pg scrub {pg-id} - -Ceph checks the primary and any replica nodes, generates a catalog of all objects -in the placement group and compares them to ensure that no objects are missing -or mismatched, and their contents are consistent. Assuming the replicas all -match, a final semantic sweep ensures that all of the snapshot-related object -metadata is consistent. Errors are reported via logs. - -Prioritize backfill/recovery of a Placement Group(s) -==================================================== - -You may run into a situation where a bunch of placement groups will require -recovery and/or backfill, and some particular groups hold data more important -than others (for example, those PGs may hold data for images used by running -machines and other PGs may be used by inactive machines/less relevant data). -In that case, you may want to prioritize recovery of those groups so -performance and/or availability of data stored on those groups is restored -earlier. To do this (mark particular placement group(s) as prioritized during -backfill or recovery), execute the following:: - - ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] - ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] - -This will cause Ceph to perform recovery or backfill on specified placement -groups first, before other placement groups. This does not interrupt currently -ongoing backfills or recovery, but causes specified PGs to be processed -as soon as possible. If you change your mind or prioritize wrong groups, -use:: - - ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] - ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] - -This will remove "force" flag from those PGs and they will be processed -in default order. Again, this doesn't affect currently processed placement -group, only those that are still queued. - -The "force" flag is cleared automatically after recovery or backfill of group -is done. - -Revert Lost -=========== - -If the cluster has lost one or more objects, and you have decided to -abandon the search for the lost data, you must mark the unfound objects -as ``lost``. - -If all possible locations have been queried and objects are still -lost, you may have to give up on the lost objects. This is -possible given unusual combinations of failures that allow the cluster -to learn about writes that were performed before the writes themselves -are recovered. 
- -Currently the only supported option is "revert", which will either roll back to -a previous version of the object or (if it was a new object) forget about it -entirely. To mark the "unfound" objects as "lost", execute the following:: - - ceph pg {pg-id} mark_unfound_lost revert|delete - -.. important:: Use this feature with caution, because it may confuse - applications that expect the object(s) to exist. - - -.. toctree:: - :hidden: - - pg-states - pg-concepts - - -.. _Create a Pool: ../pools#createpool -.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds -.. _pgcalc: http://ceph.com/pgcalc/ diff --git a/src/ceph/doc/rados/operations/pools.rst b/src/ceph/doc/rados/operations/pools.rst deleted file mode 100644 index 7015593..0000000 --- a/src/ceph/doc/rados/operations/pools.rst +++ /dev/null @@ -1,798 +0,0 @@ -======= - Pools -======= - -When you first deploy a cluster without creating a pool, Ceph uses the default -pools for storing data. A pool provides you with: - -- **Resilience**: You can set how many OSD are allowed to fail without losing data. - For replicated pools, it is the desired number of copies/replicas of an object. - A typical configuration stores an object and one additional copy - (i.e., ``size = 2``), but you can determine the number of copies/replicas. - For `erasure coded pools <../erasure-code>`_, it is the number of coding chunks - (i.e. ``m=2`` in the **erasure code profile**) - -- **Placement Groups**: You can set the number of placement groups for the pool. - A typical configuration uses approximately 100 placement groups per OSD to - provide optimal balancing without using up too many computing resources. When - setting up multiple pools, be careful to ensure you set a reasonable number of - placement groups for both the pool and the cluster as a whole. - -- **CRUSH Rules**: When you store data in a pool, a CRUSH ruleset mapped to the - pool enables CRUSH to identify a rule for the placement of the object - and its replicas (or chunks for erasure coded pools) in your cluster. - You can create a custom CRUSH rule for your pool. - -- **Snapshots**: When you create snapshots with ``ceph osd pool mksnap``, - you effectively take a snapshot of a particular pool. - -To organize data into pools, you can list, create, and remove pools. -You can also view the utilization statistics for each pool. - -List Pools -========== - -To list your cluster's pools, execute:: - - ceph osd lspools - -On a freshly installed cluster, only the ``rbd`` pool exists. - - -.. _createpool: - -Create a Pool -============= - -Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_. -Ideally, you should override the default value for the number of placement -groups in your Ceph configuration file, as the default is NOT ideal. -For details on placement group numbers refer to `setting the number of placement groups`_ - -.. note:: Starting with Luminous, all pools need to be associated to the - application using the pool. See `Associate Pool to Application`_ below for - more information. - -For example:: - - osd pool default pg num = 100 - osd pool default pgp num = 100 - -To create a pool, execute:: - - ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] \ - [crush-rule-name] [expected-num-objects] - ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure \ - [erasure-code-profile] [crush-rule-name] [expected_num_objects] - -Where: - -``{pool-name}`` - -:Description: The name of the pool. It must be unique. -:Type: String -:Required: Yes. 
- -``{pg-num}`` - -:Description: The total number of placement groups for the pool. See `Placement - Groups`_ for details on calculating a suitable number. The - default value ``8`` is NOT suitable for most systems. - -:Type: Integer -:Required: Yes. -:Default: 8 - -``{pgp-num}`` - -:Description: The total number of placement groups for placement purposes. This - **should be equal to the total number of placement groups**, except - for placement group splitting scenarios. - -:Type: Integer -:Required: Yes. Picks up default or Ceph configuration value if not specified. -:Default: 8 - -``{replicated|erasure}`` - -:Description: The pool type which may either be **replicated** to - recover from lost OSDs by keeping multiple copies of the - objects or **erasure** to get a kind of - `generalized RAID5 <../erasure-code>`_ capability. - The **replicated** pools require more - raw storage but implement all Ceph operations. The - **erasure** pools require less raw storage but only - implement a subset of the available operations. - -:Type: String -:Required: No. -:Default: replicated - -``[crush-rule-name]`` - -:Description: The name of a CRUSH rule to use for this pool. The specified - rule must exist. - -:Type: String -:Required: No. -:Default: For **replicated** pools it is the ruleset specified by the ``osd - pool default crush replicated ruleset`` config variable. This - ruleset must exist. - For **erasure** pools it is ``erasure-code`` if the ``default`` - `erasure code profile`_ is used or ``{pool-name}`` otherwise. This - ruleset will be created implicitly if it doesn't exist already. - - -``[erasure-code-profile=profile]`` - -.. _erasure code profile: ../erasure-code-profile - -:Description: For **erasure** pools only. Use the `erasure code profile`_. It - must be an existing profile as defined by - **osd erasure-code-profile set**. - -:Type: String -:Required: No. - -When you create a pool, set the number of placement groups to a reasonable value -(e.g., ``100``). Consider the total number of placement groups per OSD too. -Placement groups are computationally expensive, so performance will degrade when -you have many pools with many placement groups (e.g., 50 pools with 100 -placement groups each). The point of diminishing returns depends upon the power -of the OSD host. - -See `Placement Groups`_ for details on calculating an appropriate number of -placement groups for your pool. - -.. _Placement Groups: ../placement-groups - -``[expected-num-objects]`` - -:Description: The expected number of objects for this pool. By setting this value ( - together with a negative **filestore merge threshold**), the PG folder - splitting would happen at the pool creation time, to avoid the latency - impact to do a runtime folder splitting. - -:Type: Integer -:Required: No. -:Default: 0, no splitting at the pool creation time. - -Associate Pool to Application -============================= - -Pools need to be associated with an application before use. Pools that will be -used with CephFS or pools that are automatically created by RGW are -automatically associated. Pools that are intended for use with RBD should be -initialized using the ``rbd`` tool (see `Block Device Commands`_ for more -information). - -For other cases, you can manually associate a free-form application name to -a pool.:: - - ceph osd pool application enable {pool-name} {application-name} - -.. note:: CephFS uses the application name ``cephfs``, RBD uses the - application name ``rbd``, and RGW uses the application name ``rgw``. 
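Putting the two previous steps together, the following Python sketch creates a
replicated pool and then tags it with an application name by invoking the
commands shown above through the ``ceph`` CLI. The pool name ``app-data``, the
placement group count of ``128``, and the application name ``myapp`` are only
illustrative values.

.. code-block:: python

    import subprocess

    # Create a replicated pool (see "Create a Pool" above for the arguments).
    subprocess.run(["ceph", "osd", "pool", "create", "app-data", "128", "128",
                    "replicated"], check=True)

    # Tag the pool with a free-form application name so it can be used.
    subprocess.run(["ceph", "osd", "pool", "application", "enable", "app-data",
                    "myapp"], check=True)
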
- -Set Pool Quotas -=============== - -You can set pool quotas for the maximum number of bytes and/or the maximum -number of objects per pool. :: - - ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] - -For example:: - - ceph osd pool set-quota data max_objects 10000 - -To remove a quota, set its value to ``0``. - - -Delete a Pool -============= - -To delete a pool, execute:: - - ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] - - -To remove a pool the mon_allow_pool_delete flag must be set to true in the Monitor's -configuration. Otherwise they will refuse to remove a pool. - -See `Monitor Configuration`_ for more information. - -.. _Monitor Configuration: ../../configuration/mon-config-ref - -If you created your own rulesets and rules for a pool you created, you should -consider removing them when you no longer need your pool:: - - ceph osd pool get {pool-name} crush_ruleset - -If the ruleset was "123", for example, you can check the other pools like so:: - - ceph osd dump | grep "^pool" | grep "crush_ruleset 123" - -If no other pools use that custom ruleset, then it's safe to delete that -ruleset from the cluster. - -If you created users with permissions strictly for a pool that no longer -exists, you should consider deleting those users too:: - - ceph auth ls | grep -C 5 {pool-name} - ceph auth del {user} - - -Rename a Pool -============= - -To rename a pool, execute:: - - ceph osd pool rename {current-pool-name} {new-pool-name} - -If you rename a pool and you have per-pool capabilities for an authenticated -user, you must update the user's capabilities (i.e., caps) with the new pool -name. - -.. note:: Version ``0.48`` Argonaut and above. - -Show Pool Statistics -==================== - -To show a pool's utilization statistics, execute:: - - rados df - - -Make a Snapshot of a Pool -========================= - -To make a snapshot of a pool, execute:: - - ceph osd pool mksnap {pool-name} {snap-name} - -.. note:: Version ``0.48`` Argonaut and above. - - -Remove a Snapshot of a Pool -=========================== - -To remove a snapshot of a pool, execute:: - - ceph osd pool rmsnap {pool-name} {snap-name} - -.. note:: Version ``0.48`` Argonaut and above. - -.. _setpoolvalues: - - -Set Pool Values -=============== - -To set a value to a pool, execute the following:: - - ceph osd pool set {pool-name} {key} {value} - -You may set values for the following keys: - -.. _compression_algorithm: - -``compression_algorithm`` -:Description: Sets inline compression algorithm to use for underlying BlueStore. - This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression algorithm``. - -:Type: String -:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` - -``compression_mode`` - -:Description: Sets the policy for the inline compression algorithm for underlying BlueStore. - This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression mode``. - -:Type: String -:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` - -``compression_min_blob_size`` - -:Description: Chunks smaller than this are never compressed. - This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression min blob *``. 
- -:Type: Unsigned Integer - -``compression_max_blob_size`` - -:Description: Chunks larger than this are broken into smaller blobs sizing - ``compression_max_blob_size`` before being compressed. - -:Type: Unsigned Integer - -.. _size: - -``size`` - -:Description: Sets the number of replicas for objects in the pool. - See `Set the Number of Object Replicas`_ for further details. - Replicated pools only. - -:Type: Integer - -.. _min_size: - -``min_size`` - -:Description: Sets the minimum number of replicas required for I/O. - See `Set the Number of Object Replicas`_ for further details. - Replicated pools only. - -:Type: Integer -:Version: ``0.54`` and above - -.. _pg_num: - -``pg_num`` - -:Description: The effective number of placement groups to use when calculating - data placement. -:Type: Integer -:Valid Range: Superior to ``pg_num`` current value. - -.. _pgp_num: - -``pgp_num`` - -:Description: The effective number of placement groups for placement to use - when calculating data placement. - -:Type: Integer -:Valid Range: Equal to or less than ``pg_num``. - -.. _crush_ruleset: - -``crush_ruleset`` - -:Description: The ruleset to use for mapping object placement in the cluster. -:Type: Integer - -.. _allow_ec_overwrites: - -``allow_ec_overwrites`` - -:Description: Whether writes to an erasure coded pool can update part - of an object, so cephfs and rbd can use it. See - `Erasure Coding with Overwrites`_ for more details. -:Type: Boolean -:Version: ``12.2.0`` and above - -.. _hashpspool: - -``hashpspool`` - -:Description: Set/Unset HASHPSPOOL flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``0.48`` Argonaut and above. - -.. _nodelete: - -``nodelete`` - -:Description: Set/Unset NODELETE flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``FIXME`` - -.. _nopgchange: - -``nopgchange`` - -:Description: Set/Unset NOPGCHANGE flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``FIXME`` - -.. _nosizechange: - -``nosizechange`` - -:Description: Set/Unset NOSIZECHANGE flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag -:Version: Version ``FIXME`` - -.. _write_fadvise_dontneed: - -``write_fadvise_dontneed`` - -:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag - -.. _noscrub: - -``noscrub`` - -:Description: Set/Unset NOSCRUB flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag - -.. _nodeep-scrub: - -``nodeep-scrub`` - -:Description: Set/Unset NODEEP_SCRUB flag on a given pool. -:Type: Integer -:Valid Range: 1 sets flag, 0 unsets flag - -.. _hit_set_type: - -``hit_set_type`` - -:Description: Enables hit set tracking for cache pools. - See `Bloom Filter`_ for additional information. - -:Type: String -:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` -:Default: ``bloom``. Other values are for testing. - -.. _hit_set_count: - -``hit_set_count`` - -:Description: The number of hit sets to store for cache pools. The higher - the number, the more RAM consumed by the ``ceph-osd`` daemon. - -:Type: Integer -:Valid Range: ``1``. Agent doesn't handle > 1 yet. - -.. _hit_set_period: - -``hit_set_period`` - -:Description: The duration of a hit set period in seconds for cache pools. - The higher the number, the more RAM consumed by the - ``ceph-osd`` daemon. - -:Type: Integer -:Example: ``3600`` 1hr - -.. 
_hit_set_fpp: - -``hit_set_fpp`` - -:Description: The false positive probability for the ``bloom`` hit set type. - See `Bloom Filter`_ for additional information. - -:Type: Double -:Valid Range: 0.0 - 1.0 -:Default: ``0.05`` - -.. _cache_target_dirty_ratio: - -``cache_target_dirty_ratio`` - -:Description: The percentage of the cache pool containing modified (dirty) - objects before the cache tiering agent will flush them to the - backing storage pool. - -:Type: Double -:Default: ``.4`` - -.. _cache_target_dirty_high_ratio: - -``cache_target_dirty_high_ratio`` - -:Description: The percentage of the cache pool containing modified (dirty) - objects before the cache tiering agent will flush them to the - backing storage pool with a higher speed. - -:Type: Double -:Default: ``.6`` - -.. _cache_target_full_ratio: - -``cache_target_full_ratio`` - -:Description: The percentage of the cache pool containing unmodified (clean) - objects before the cache tiering agent will evict them from the - cache pool. - -:Type: Double -:Default: ``.8`` - -.. _target_max_bytes: - -``target_max_bytes`` - -:Description: Ceph will begin flushing or evicting objects when the - ``max_bytes`` threshold is triggered. - -:Type: Integer -:Example: ``1000000000000`` #1-TB - -.. _target_max_objects: - -``target_max_objects`` - -:Description: Ceph will begin flushing or evicting objects when the - ``max_objects`` threshold is triggered. - -:Type: Integer -:Example: ``1000000`` #1M objects - - -``hit_set_grade_decay_rate`` - -:Description: Temperature decay rate between two successive hit_sets -:Type: Integer -:Valid Range: 0 - 100 -:Default: ``20`` - - -``hit_set_search_last_n`` - -:Description: Count at most N appearance in hit_sets for temperature calculation -:Type: Integer -:Valid Range: 0 - hit_set_count -:Default: ``1`` - - -.. _cache_min_flush_age: - -``cache_min_flush_age`` - -:Description: The time (in seconds) before the cache tiering agent will flush - an object from the cache pool to the storage pool. - -:Type: Integer -:Example: ``600`` 10min - -.. _cache_min_evict_age: - -``cache_min_evict_age`` - -:Description: The time (in seconds) before the cache tiering agent will evict - an object from the cache pool. - -:Type: Integer -:Example: ``1800`` 30min - -.. _fast_read: - -``fast_read`` - -:Description: On Erasure Coding pool, if this flag is turned on, the read request - would issue sub reads to all shards, and waits until it receives enough - shards to decode to serve the client. In the case of jerasure and isa - erasure plugins, once the first K replies return, client's request is - served immediately using the data decoded from these replies. This - helps to tradeoff some resources for better performance. Currently this - flag is only supported for Erasure Coding pool. - -:Type: Boolean -:Defaults: ``0`` - -.. _scrub_min_interval: - -``scrub_min_interval`` - -:Description: The minimum interval in seconds for pool scrubbing when - load is low. If it is 0, the value osd_scrub_min_interval - from config is used. - -:Type: Double -:Default: ``0`` - -.. _scrub_max_interval: - -``scrub_max_interval`` - -:Description: The maximum interval in seconds for pool scrubbing - irrespective of cluster load. If it is 0, the value - osd_scrub_max_interval from config is used. - -:Type: Double -:Default: ``0`` - -.. _deep_scrub_interval: - -``deep_scrub_interval`` - -:Description: The interval in seconds for pool “deep” scrubbing. If it - is 0, the value osd_deep_scrub_interval from config is used. 
- -:Type: Double -:Default: ``0`` - - -Get Pool Values -=============== - -To get a value from a pool, execute the following:: - - ceph osd pool get {pool-name} {key} - -You may get values for the following keys: - -``size`` - -:Description: see size_ - -:Type: Integer - -``min_size`` - -:Description: see min_size_ - -:Type: Integer -:Version: ``0.54`` and above - -``pg_num`` - -:Description: see pg_num_ - -:Type: Integer - - -``pgp_num`` - -:Description: see pgp_num_ - -:Type: Integer -:Valid Range: Equal to or less than ``pg_num``. - - -``crush_ruleset`` - -:Description: see crush_ruleset_ - - -``hit_set_type`` - -:Description: see hit_set_type_ - -:Type: String -:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` - -``hit_set_count`` - -:Description: see hit_set_count_ - -:Type: Integer - - -``hit_set_period`` - -:Description: see hit_set_period_ - -:Type: Integer - - -``hit_set_fpp`` - -:Description: see hit_set_fpp_ - -:Type: Double - - -``cache_target_dirty_ratio`` - -:Description: see cache_target_dirty_ratio_ - -:Type: Double - - -``cache_target_dirty_high_ratio`` - -:Description: see cache_target_dirty_high_ratio_ - -:Type: Double - - -``cache_target_full_ratio`` - -:Description: see cache_target_full_ratio_ - -:Type: Double - - -``target_max_bytes`` - -:Description: see target_max_bytes_ - -:Type: Integer - - -``target_max_objects`` - -:Description: see target_max_objects_ - -:Type: Integer - - -``cache_min_flush_age`` - -:Description: see cache_min_flush_age_ - -:Type: Integer - - -``cache_min_evict_age`` - -:Description: see cache_min_evict_age_ - -:Type: Integer - - -``fast_read`` - -:Description: see fast_read_ - -:Type: Boolean - - -``scrub_min_interval`` - -:Description: see scrub_min_interval_ - -:Type: Double - - -``scrub_max_interval`` - -:Description: see scrub_max_interval_ - -:Type: Double - - -``deep_scrub_interval`` - -:Description: see deep_scrub_interval_ - -:Type: Double - - -Set the Number of Object Replicas -================================= - -To set the number of object replicas on a replicated pool, execute the following:: - - ceph osd pool set {poolname} size {num-replicas} - -.. important:: The ``{num-replicas}`` includes the object itself. - If you want the object and two copies of the object for a total of - three instances of the object, specify ``3``. - -For example:: - - ceph osd pool set data size 3 - -You may execute this command for each pool. **Note:** An object might accept -I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum -number of required replicas for I/O, you should use the ``min_size`` setting. -For example:: - - ceph osd pool set data min_size 2 - -This ensures that no object in the data pool will receive I/O with fewer than -``min_size`` replicas. - - -Get the Number of Object Replicas -================================= - -To get the number of object replicas, execute the following:: - - ceph osd dump | grep 'replicated size' - -Ceph will list the pools, with the ``replicated size`` attribute highlighted. -By default, ceph creates two replicas of an object (a total of three copies, or -a size of 3). - - - -.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref -.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter -.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups -.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites -.. 
_Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool - diff --git a/src/ceph/doc/rados/operations/upmap.rst b/src/ceph/doc/rados/operations/upmap.rst deleted file mode 100644 index 58f6322..0000000 --- a/src/ceph/doc/rados/operations/upmap.rst +++ /dev/null @@ -1,75 +0,0 @@ -Using the pg-upmap -================== - -Starting in Luminous v12.2.z there is a new *pg-upmap* exception table -in the OSDMap that allows the cluster to explicitly map specific PGs to -specific OSDs. This allows the cluster to fine-tune the data -distribution to, in most cases, perfectly distributed PGs across OSDs. - -The key caveat to this new mechanism is that it requires that all -clients understand the new *pg-upmap* structure in the OSDMap. - -Enabling --------- - -To allow use of the feature, you must tell the cluster that it only -needs to support luminous (and newer) clients with:: - - ceph osd set-require-min-compat-client luminous - -This command will fail if any pre-luminous clients or daemons are -connected to the monitors. You can see what client versions are in -use with:: - - ceph features - -A word of caution ------------------ - -This is a new feature and not very user friendly. At the time of this -writing we are working on a new `balancer` module for ceph-mgr that -will eventually do all of this automatically. - -Until then, - -Offline optimization --------------------- - -Upmap entries are updated with an offline optimizer built into ``osdmaptool``. - -#. Grab the latest copy of your osdmap:: - - ceph osd getmap -o om - -#. Run the optimizer:: - - osdmaptool om --upmap out.txt [--upmap-pool <pool>] [--upmap-max <max-count>] [--upmap-deviation <max-deviation>] - - It is highly recommended that optimization be done for each pool - individually, or for sets of similarly-utilized pools. You can - specify the ``--upmap-pool`` option multiple times. "Similar pools" - means pools that are mapped to the same devices and store the same - kind of data (e.g., RBD image pools, yes; RGW index pool and RGW - data pool, no). - - The ``max-count`` value is the maximum number of upmap entries to - identify in the run. The default is 100, but you may want to make - this a smaller number so that the tool completes more quickly (but - does less work). If it cannot find any additional changes to make - it will stop early (i.e., when the pool distribution is perfect). - - The ``max-deviation`` value defaults to `.01` (i.e., 1%). If an OSD - utilization varies from the average by less than this amount it - will be considered perfect. - -#. The proposed changes are written to the output file ``out.txt`` in - the example above. These are normal ceph CLI commands that can be - run to apply the changes to the cluster. This can be done with:: - - source out.txt - -The above steps can be repeated as many times as necessary to achieve -a perfect distribution of PGs for each set of pools. - -You can see some (gory) details about what the tool is doing by -passing ``--debug-osd 10`` to ``osdmaptool``. diff --git a/src/ceph/doc/rados/operations/user-management.rst b/src/ceph/doc/rados/operations/user-management.rst deleted file mode 100644 index 8a35a50..0000000 --- a/src/ceph/doc/rados/operations/user-management.rst +++ /dev/null @@ -1,665 +0,0 @@ -================= - User Management -================= - -This document describes :term:`Ceph Client` users, and their authentication and -authorization with the :term:`Ceph Storage Cluster`. 
Users are either -individuals or system actors such as applications, which use Ceph clients to -interact with the Ceph Storage Cluster daemons. - -.. ditaa:: +-----+ - | {o} | - | | - +--+--+ /---------\ /---------\ - | | Ceph | | Ceph | - ---+---*----->| |<------------->| | - | uses | Clients | | Servers | - | \---------/ \---------/ - /--+--\ - | | - | | - actor - - -When Ceph runs with authentication and authorization enabled (enabled by -default), you must specify a user name and a keyring containing the secret key -of the specified user (usually via the command line). If you do not specify a -user name, Ceph will use ``client.admin`` as the default user name. If you do -not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting -in the Ceph configuration. For example, if you execute the ``ceph health`` -command without specifying a user or keyring:: - - ceph health - -Ceph interprets the command like this:: - - ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health - -Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid -re-entry of the user name and secret. - -For details on configuring the Ceph Storage Cluster to use authentication, -see `Cephx Config Reference`_. For details on the architecture of Cephx, see -`Architecture - High Availability Authentication`_. - - -Background -========== - -Irrespective of the type of Ceph client (e.g., Block Device, Object Storage, -Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_. -Ceph users must have access to pools in order to read and write data. -Additionally, Ceph users must have execute permissions to use Ceph's -administrative commands. The following concepts will help you understand Ceph -user management. - - -User ----- - -A user is either an individual or a system actor such as an application. -Creating users allows you to control who (or what) can access your Ceph Storage -Cluster, its pools, and the data within pools. - -Ceph has the notion of a ``type`` of user. For the purposes of user management, -the type will always be ``client``. Ceph identifies users in period (.) -delimited form consisting of the user type and the user ID: for example, -``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing -is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol, -but they are not clients. Distinguishing the user type helps to distinguish -between client users and other users--streamlining access control, user -monitoring and traceability. - -Sometimes Ceph's user type may seem confusing, because the Ceph command line -allows you to specify a user with or without the type, depending upon your -command line usage. If you specify ``--user`` or ``--id``, you can omit the -type. So ``client.user1`` can be entered simply as ``user1``. If you specify -``--name`` or ``-n``, you must specify the type and name, such as -``client.user1``. We recommend using the type and name as a best practice -wherever possible. - -.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage - user or a Ceph Filesystem user. The Ceph Object Gateway uses a Ceph Storage - Cluster user to communicate between the gateway daemon and the storage - cluster, but the gateway has its own user management functionality for end - users. The Ceph Filesystem uses POSIX semantics. The user space associated - with the Ceph Filesystem is not the same as a Ceph Storage Cluster user. 
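-
-As a quick illustration of the naming convention above (``client.user1`` is
-a purely hypothetical user), the following invocations all refer to the same
-Ceph Storage Cluster user::
-
-    ceph --id user1 --keyring /path/to/keyring health
-    ceph --user user1 --keyring /path/to/keyring health
-    ceph --name client.user1 --keyring /path/to/keyring health
-    ceph -n client.user1 --keyring /path/to/keyring health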
- - - -Authorization (Capabilities) ----------------------------- - -Ceph uses the term "capabilities" (caps) to describe authorizing an -authenticated user to exercise the functionality of the monitors, OSDs and -metadata servers. Capabilities can also restrict access to data within a pool or -a namespace within a pool. A Ceph administrative user sets a user's -capabilities when creating or updating a user. - -Capability syntax follows the form:: - - {daemon-type} '{capspec}[, {capspec} ...]' - -- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access - settings or ``profile {name}``. For example:: - - mon 'allow rwx' - mon 'profile osd' - -- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``, - ``class-write`` access settings or ``profile {name}``. Additionally, OSD - capabilities also allow for pool and namespace settings. :: - - osd 'allow {access} [pool={pool-name} [namespace={namespace-name}]]' - osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]]' - -- **Metadata Server Caps:** For administrators, use ``allow *``. For all - other users, such as CephFS clients, consult :doc:`/cephfs/client-auth` - - -.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the - Ceph Storage Cluster, so it is not represented as a Ceph Storage - Cluster daemon type. - -The following entries describe each capability. - -``allow`` - -:Description: Precedes access settings for a daemon. Implies ``rw`` - for MDS only. - - -``r`` - -:Description: Gives the user read access. Required with monitors to retrieve - the CRUSH map. - - -``w`` - -:Description: Gives the user write access to objects. - - -``x`` - -:Description: Gives the user the capability to call class methods - (i.e., both read and write) and to conduct ``auth`` - operations on monitors. - - -``class-read`` - -:Descriptions: Gives the user the capability to call class read methods. - Subset of ``x``. - - -``class-write`` - -:Description: Gives the user the capability to call class write methods. - Subset of ``x``. - - -``*`` - -:Description: Gives the user read, write and execute permissions for a - particular daemon/pool, and the ability to execute - admin commands. - - -``profile osd`` (Monitor only) - -:Description: Gives a user permissions to connect as an OSD to other OSDs or - monitors. Conferred on OSDs to enable OSDs to handle replication - heartbeat traffic and status reporting. - - -``profile mds`` (Monitor only) - -:Description: Gives a user permissions to connect as a MDS to other MDSs or - monitors. - - -``profile bootstrap-osd`` (Monitor only) - -:Description: Gives a user permissions to bootstrap an OSD. Conferred on - deployment tools such as ``ceph-disk``, ``ceph-deploy``, etc. - so that they have permissions to add keys, etc. when - bootstrapping an OSD. - - -``profile bootstrap-mds`` (Monitor only) - -:Description: Gives a user permissions to bootstrap a metadata server. - Conferred on deployment tools such as ``ceph-deploy``, etc. - so they have permissions to add keys, etc. when bootstrapping - a metadata server. - -``profile rbd`` (Monitor and OSD) - -:Description: Gives a user permissions to manipulate RBD images. When used - as a Monitor cap, it provides the minimal privileges required - by an RBD client application. When used as an OSD cap, it - provides read-write access to an RBD client application. - -``profile rbd-read-only`` (OSD only) - -:Description: Gives a user read-only permissions to an RBD image. 
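-
-As a minimal sketch of the capability syntax above (the user and pool names
-are illustrative only), you might grant the RBD profiles to clients like
-this::
-
-    ceph auth get-or-create client.rbd-user mon 'profile rbd' osd 'profile rbd pool=vms'
-    ceph auth get-or-create client.rbd-viewer mon 'profile rbd' osd 'profile rbd-read-only pool=vms'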
- - -Pool ----- - -A pool is a logical partition where users store data. -In Ceph deployments, it is common to create a pool as a logical partition for -similar types of data. For example, when deploying Ceph as a backend for -OpenStack, a typical deployment would have pools for volumes, images, backups -and virtual machines, and users such as ``client.glance``, ``client.cinder``, -etc. - - -Namespace ---------- - -Objects within a pool can be associated to a namespace--a logical group of -objects within the pool. A user's access to a pool can be associated with a -namespace such that reads and writes by the user take place only within the -namespace. Objects written to a namespace within the pool can only be accessed -by users who have access to the namespace. - -.. note:: Namespaces are primarily useful for applications written on top of - ``librados`` where the logical grouping can alleviate the need to create - different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various - metadata objects. - -The rationale for namespaces is that pools can be a computationally expensive -method of segregating data sets for the purposes of authorizing separate sets -of users. For example, a pool should have ~100 placement groups per OSD. So an -exemplary cluster with 1000 OSDs would have 100,000 placement groups for one -pool. Each pool would create another 100,000 placement groups in the exemplary -cluster. By contrast, writing an object to a namespace simply associates the -namespace to the object name with out the computational overhead of a separate -pool. Rather than creating a separate pool for a user or set of users, you may -use a namespace. **Note:** Only available using ``librados`` at this time. - - -Managing Users -============== - -User management functionality provides Ceph Storage Cluster administrators with -the ability to create, update and delete users directly in the Ceph Storage -Cluster. - -When you create or delete users in the Ceph Storage Cluster, you may need to -distribute keys to clients so that they can be added to keyrings. See `Keyring -Management`_ for details. - - -List Users ----------- - -To list the users in your cluster, execute the following:: - - ceph auth ls - -Ceph will list out all users in your cluster. For example, in a two-node -exemplary cluster, ``ceph auth ls`` will output something that looks like -this:: - - installed auth entries: - - osd.0 - key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w== - caps: [mon] allow profile osd - caps: [osd] allow * - osd.1 - key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA== - caps: [mon] allow profile osd - caps: [osd] allow * - client.admin - key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw== - caps: [mds] allow - caps: [mon] allow * - caps: [osd] allow * - client.bootstrap-mds - key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww== - caps: [mon] allow profile bootstrap-mds - client.bootstrap-osd - key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw== - caps: [mon] allow profile bootstrap-osd - - -Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a -user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type -``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user). -Note also that each entry has a ``key: <value>`` entry, and one or more -``caps:`` entries. - -You may use the ``-o {filename}`` option with ``ceph auth ls`` to -save the output to a file. 
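-
-For example, to keep a copy of the full listing for later reference (the
-file name here is just an illustration)::
-
-    ceph auth ls -o /tmp/auth-list.txt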
- - -Get a User ----------- - -To retrieve a specific user, key and capabilities, execute the -following:: - - ceph auth get {TYPE.ID} - -For example:: - - ceph auth get client.admin - -You may also use the ``-o {filename}`` option with ``ceph auth get`` to -save the output to a file. Developers may also execute the following:: - - ceph auth export {TYPE.ID} - -The ``auth export`` command is identical to ``auth get``, but also prints -out the internal ``auid``, which is not relevant to end users. - - - -Add a User ----------- - -Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and -any capabilities included in the command you use to create the user. - -A user's key enables the user to authenticate with the Ceph Storage Cluster. -The user's capabilities authorize the user to read, write, or execute on Ceph -monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``). - -There are a few ways to add a user: - -- ``ceph auth add``: This command is the canonical way to add a user. It - will create the user, generate a key and add any specified capabilities. - -- ``ceph auth get-or-create``: This command is often the most convenient way - to create a user, because it returns a keyfile format with the user name - (in brackets) and the key. If the user already exists, this command - simply returns the user name and key in the keyfile format. You may use the - ``-o {filename}`` option to save the output to a file. - -- ``ceph auth get-or-create-key``: This command is a convenient way to create - a user and return the user's key (only). This is useful for clients that - need the key only (e.g., libvirt). If the user already exists, this command - simply returns the key. You may use the ``-o {filename}`` option to save the - output to a file. - -When creating client users, you may create a user with no capabilities. A user -with no capabilities is useless beyond mere authentication, because the client -cannot retrieve the cluster map from the monitor. However, you can create a -user with no capabilities if you wish to defer adding capabilities later using -the ``ceph auth caps`` command. - -A typical user has at least read capabilities on the Ceph monitor and -read and write capability on Ceph OSDs. Additionally, a user's OSD permissions -are often restricted to accessing a particular pool. :: - - ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool' - ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool' - ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring - ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key - - -.. important:: If you provide a user with capabilities to OSDs, but you DO NOT - restrict access to particular pools, the user will have access to ALL - pools in the cluster! - - -.. _modify-user-capabilities: - -Modify User Capabilities ------------------------- - -The ``ceph auth caps`` command allows you to specify a user and change the -user's capabilities. Setting new capabilities will overwrite current capabilities. -To view current capabilities run ``ceph auth get USERTYPE.USERID``. To add -capabilities, you should also specify the existing capabilities when using the form:: - - ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] 
[pool={pool-name}] [namespace={namespace-name}]'] - -For example:: - - ceph auth get client.john - ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool' - ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool' - ceph auth caps client.brian-manager mon 'allow *' osd 'allow *' - -To remove a capability, you may reset the capability. If you want the user -to have no access to a particular daemon that was previously set, specify -an empty string. For example:: - - ceph auth caps client.ringo mon ' ' osd ' ' - -See `Authorization (Capabilities)`_ for additional details on capabilities. - - -Delete a User -------------- - -To delete a user, use ``ceph auth del``:: - - ceph auth del {TYPE}.{ID} - -Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, -and ``{ID}`` is the user name or ID of the daemon. - - -Print a User's Key ------------------- - -To print a user's authentication key to standard output, execute the following:: - - ceph auth print-key {TYPE}.{ID} - -Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, -and ``{ID}`` is the user name or ID of the daemon. - -Printing a user's key is useful when you need to populate client -software with a user's key (e.g., libvirt). :: - - mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` - - -Import a User(s) ----------------- - -To import one or more users, use ``ceph auth import`` and -specify a keyring:: - - ceph auth import -i /path/to/keyring - -For example:: - - sudo ceph auth import -i /etc/ceph/ceph.keyring - - -.. note:: The ceph storage cluster will add new users, their keys and their - capabilities and will update existing users, their keys and their - capabilities. - - -Keyring Management -================== - -When you access Ceph via a Ceph client, the Ceph client will look for a local -keyring. Ceph presets the ``keyring`` setting with the following four keyring -names by default so you don't have to set them in your Ceph configuration file -unless you want to override the defaults (not recommended): - -- ``/etc/ceph/$cluster.$name.keyring`` -- ``/etc/ceph/$cluster.keyring`` -- ``/etc/ceph/keyring`` -- ``/etc/ceph/keyring.bin`` - -The ``$cluster`` metavariable is your Ceph cluster name as defined by the -name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name -is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user -type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``). - -.. note:: When executing commands that read or write to ``/etc/ceph``, you may - need to use ``sudo`` to execute the command as ``root``. - -After you create a user (e.g., ``client.ringo``), you must get the key and add -it to a keyring on a Ceph client so that the user can access the Ceph Storage -Cluster. - -The `User Management`_ section details how to list, get, add, modify and delete -users directly in the Ceph Storage Cluster. However, Ceph also provides the -``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client. - - -Create a Keyring ----------------- - -When you use the procedures in the `Managing Users`_ section to create users, -you need to provide user keys to the Ceph client(s) so that the Ceph client -can retrieve the key for the specified user and authenticate with the Ceph -Storage Cluster. Ceph Clients access keyrings to lookup a user name and -retrieve the user's key. - -The ``ceph-authtool`` utility allows you to create a keyring. 
To create an -empty keyring, use ``--create-keyring`` or ``-C``. For example:: - - ceph-authtool --create-keyring /path/to/keyring - -When creating a keyring with multiple users, we recommend using the cluster name -(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the -``/etc/ceph`` directory so that the ``keyring`` configuration default setting -will pick up the filename without requiring you to specify it in the local copy -of your Ceph configuration file. For example, create ``ceph.keyring`` by -executing the following:: - - sudo ceph-authtool -C /etc/ceph/ceph.keyring - -When creating a keyring with a single user, we recommend using the cluster name, -the user type and the user name and saving it in the ``/etc/ceph`` directory. -For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user. - -To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means -the file will have ``rw`` permissions for the ``root`` user only, which is -appropriate when the keyring contains administrator keys. However, if you -intend to use the keyring for a particular user or group of users, ensure -that you execute ``chown`` or ``chmod`` to establish appropriate keyring -ownership and access. - - -Add a User to a Keyring ------------------------ - -When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a -User`_ procedure to retrieve a user, key and capabilities and save the user to a -keyring. - -When you only want to use one user per keyring, the `Get a User`_ procedure with -the ``-o`` option will save the output in the keyring file format. For example, -to create a keyring for the ``client.admin`` user, execute the following:: - - sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring - -Notice that we use the recommended file format for an individual user. - -When you want to import users to a keyring, you can use ``ceph-authtool`` -to specify the destination keyring and the source keyring. -For example:: - - sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring - - -Create a User -------------- - -Ceph provides the `Add a User`_ function to create a user directly in the Ceph -Storage Cluster. However, you can also create a user, keys and capabilities -directly on a Ceph client keyring. Then, you can import the user to the Ceph -Storage Cluster. For example:: - - sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring - -See `Authorization (Capabilities)`_ for additional details on capabilities. - -You can also create a keyring and add a new user to the keyring simultaneously. -For example:: - - sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key - -In the foregoing scenarios, the new user ``client.ringo`` is only in the -keyring. To add the new user to the Ceph Storage Cluster, you must still add -the new user to the Ceph Storage Cluster. :: - - sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring - - -Modify a User -------------- - -To modify the capabilities of a user record in a keyring, specify the keyring, -and the user followed by the capabilities. For example:: - - sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' - -To update the user to the Ceph Storage Cluster, you must update the user -in the keyring to the user entry in the the Ceph Storage Cluster. 
:: - - sudo ceph auth import -i /etc/ceph/ceph.keyring - -See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user -from a keyring. - -You may also `Modify User Capabilities`_ directly in the cluster, store the -results to a keyring file; then, import the keyring into your main -``ceph.keyring`` file. - - -Command Line Usage -================== - -Ceph supports the following usage for user name and secret: - -``--id`` | ``--user`` - -:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or - ``client.admin``, ``client.user1``). The ``id``, ``name`` and - ``-n`` options enable you to specify the ID portion of the user - name (e.g., ``admin``, ``user1``, ``foo``, etc.). You can specify - the user with the ``--id`` and omit the type. For example, - to specify user ``client.foo`` enter the following:: - - ceph --id foo --keyring /path/to/keyring health - ceph --user foo --keyring /path/to/keyring health - - -``--name`` | ``-n`` - -:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or - ``client.admin``, ``client.user1``). The ``--name`` and ``-n`` - options enables you to specify the fully qualified user name. - You must specify the user type (typically ``client``) with the - user ID. For example:: - - ceph --name client.foo --keyring /path/to/keyring health - ceph -n client.foo --keyring /path/to/keyring health - - -``--keyring`` - -:Description: The path to the keyring containing one or more user name and - secret. The ``--secret`` option provides the same functionality, - but it does not work with Ceph RADOS Gateway, which uses - ``--secret`` for another purpose. You may retrieve a keyring with - ``ceph auth get-or-create`` and store it locally. This is a - preferred approach, because you can switch user names without - switching the keyring path. For example:: - - sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage - - -.. _pools: ../pools - - -Limitations -=========== - -The ``cephx`` protocol authenticates Ceph clients and servers to each other. It -is not intended to handle authentication of human users or application programs -run on their behalf. If that effect is required to handle your access control -needs, you must have another mechanism, which is likely to be specific to the -front end used to access the Ceph object store. This other mechanism has the -role of ensuring that only acceptable users and programs are able to run on the -machine that Ceph will permit to access its object store. - -The keys used to authenticate Ceph clients and servers are typically stored in -a plain text file with appropriate permissions in a trusted host. - -.. important:: Storing keys in plaintext files has security shortcomings, but - they are difficult to avoid, given the basic authentication methods Ceph - uses in the background. Those setting up Ceph systems should be aware of - these shortcomings. - -In particular, arbitrary user machines, especially portable machines, should not -be configured to interact directly with Ceph, since that mode of use would -require the storage of a plaintext authentication key on an insecure machine. -Anyone who stole that machine or obtained surreptitious access to it could -obtain the key that will allow them to authenticate their own machines to Ceph. 
- -Rather than permitting potentially insecure machines to access a Ceph object -store directly, users should be required to sign in to a trusted machine in -your environment using a method that provides sufficient security for your -purposes. That trusted machine will store the plaintext Ceph keys for the -human users. A future version of Ceph may address these particular -authentication issues more fully. - -At the moment, none of the Ceph authentication protocols provide secrecy for -messages in transit. Thus, an eavesdropper on the wire can hear and understand -all data sent between clients and servers in Ceph, even if it cannot create or -alter them. Further, Ceph does not include options to encrypt user data in the -object store. Users can hand-encrypt and store their own data in the Ceph -object store, of course, but Ceph provides no features to perform object -encryption itself. Those storing sensitive data in Ceph should consider -encrypting their data before providing it to the Ceph system. - - -.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication -.. _Cephx Config Reference: ../../configuration/auth-config-ref diff --git a/src/ceph/doc/rados/troubleshooting/community.rst b/src/ceph/doc/rados/troubleshooting/community.rst deleted file mode 100644 index 9faad13..0000000 --- a/src/ceph/doc/rados/troubleshooting/community.rst +++ /dev/null @@ -1,29 +0,0 @@ -==================== - The Ceph Community -==================== - -The Ceph community is an excellent source of information and help. For -operational issues with Ceph releases we recommend you `subscribe to the -ceph-users email list`_. When you no longer want to receive emails, you can -`unsubscribe from the ceph-users email list`_. - -You may also `subscribe to the ceph-devel email list`_. You should do so if -your issue is: - -- Likely related to a bug -- Related to a development release package -- Related to a development testing package -- Related to your own builds - -If you no longer want to receive emails from the ``ceph-devel`` email list, you -may `unsubscribe from the ceph-devel email list`_. - -.. tip:: The Ceph community is growing rapidly, and community members can help - you if you provide them with detailed information about your problem. You - can attach the output of the ``ceph report`` command to help people understand your issues. - -.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel -.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel -.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com -.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com -.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file diff --git a/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst deleted file mode 100644 index 159f799..0000000 --- a/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst +++ /dev/null @@ -1,67 +0,0 @@ -=============== - CPU Profiling -=============== - -If you built Ceph from source and compiled Ceph for use with `oprofile`_ -you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. - - -Initializing oprofile -===================== - -The first time you use ``oprofile`` you need to initialize it. Locate the -``vmlinux`` image corresponding to the kernel you are now running. :: - - ls /boot - sudo opcontrol --init - sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 - - -Starting oprofile -================= - -To start ``oprofile`` execute the following command:: - - opcontrol --start - -Once you start ``oprofile``, you may run some tests with Ceph. - - -Stopping oprofile -================= - -To stop ``oprofile`` execute the following command:: - - opcontrol --stop - - -Retrieving oprofile Results -=========================== - -To retrieve the top ``cmon`` results, execute the following command:: - - opreport -gal ./cmon | less - - -To retrieve the top ``cmon`` results with call graphs attached, execute the -following command:: - - opreport -cal ./cmon | less - -.. important:: After reviewing results, you should reset ``oprofile`` before - running it again. Resetting ``oprofile`` removes data from the session - directory. - - -Resetting oprofile -================== - -To reset ``oprofile``, execute the following command:: - - sudo opcontrol --reset - -.. important:: You should reset ``oprofile`` after analyzing data so that - you do not commingle results from different tests. - -.. _oprofile: http://oprofile.sourceforge.net/about/ -.. _Installing Oprofile: ../../../dev/cpu-profiler diff --git a/src/ceph/doc/rados/troubleshooting/index.rst b/src/ceph/doc/rados/troubleshooting/index.rst deleted file mode 100644 index 80d14f3..0000000 --- a/src/ceph/doc/rados/troubleshooting/index.rst +++ /dev/null @@ -1,19 +0,0 @@ -================= - Troubleshooting -================= - -Ceph is still on the leading edge, so you may encounter situations that require -you to examine your configuration, modify your logging output, troubleshoot -monitors and OSDs, profile memory and CPU usage, and reach out to the -Ceph community for help. - -.. toctree:: - :maxdepth: 1 - - community - log-and-debug - troubleshooting-mon - troubleshooting-osd - troubleshooting-pg - memory-profiling - cpu-profiling diff --git a/src/ceph/doc/rados/troubleshooting/log-and-debug.rst b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst deleted file mode 100644 index c91f272..0000000 --- a/src/ceph/doc/rados/troubleshooting/log-and-debug.rst +++ /dev/null @@ -1,550 +0,0 @@ -======================= - Logging and Debugging -======================= - -Typically, when you add debugging to your Ceph configuration, you do so at -runtime. You can also add Ceph debug logging to your Ceph configuration file if -you are encountering issues when starting your cluster. You may view Ceph log -files under ``/var/log/ceph`` (the default location). - -.. tip:: When debug output slows down your system, the latency can hide - race conditions. - -Logging is resource intensive. If you are encountering a problem in a specific -area of your cluster, enable logging for that area of the cluster. 
For example, -if your OSDs are running fine, but your metadata servers are not, you should -start by enabling debug logging for the specific metadata server instance(s) -giving you trouble. Enable logging for each subsystem as needed. - -.. important:: Verbose logging can generate over 1GB of data per hour. If your - OS disk reaches its capacity, the node will stop working. - -If you enable or increase the rate of Ceph logging, ensure that you have -sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for -details on rotating log files. When your system is running well, remove -unnecessary debugging settings to ensure your cluster runs optimally. Logging -debug output messages is relatively slow, and a waste of resources when -operating your cluster. - -See `Subsystem, Log and Debug Settings`_ for details on available settings. - -Runtime -======= - -If you would like to see the configuration settings at runtime, you must log -in to a host with a running daemon and execute the following:: - - ceph daemon {daemon-name} config show | less - -For example,:: - - ceph daemon osd.0 config show | less - -To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the -``ceph tell`` command to inject arguments into the runtime configuration:: - - ceph tell {daemon-type}.{daemon id or *} injectargs --{name} {value} [--{name} {value}] - -Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply -the runtime setting to all daemons of a particular type with ``*``, or specify -a specific daemon's ID. For example, to increase -debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: - - ceph tell osd.0 injectargs --debug-osd 0/5 - -The ``ceph tell`` command goes through the monitors. If you cannot bind to the -monitor, you can still make the change by logging into the host of the daemon -whose configuration you'd like to change using ``ceph daemon``. -For example:: - - sudo ceph daemon osd.0 config set debug_osd 0/5 - -See `Subsystem, Log and Debug Settings`_ for details on available settings. - - -Boot Time -========= - -To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must -add settings to your Ceph configuration file. Subsystems common to each daemon -may be set under ``[global]`` in your configuration file. Subsystems for -particular daemons are set under the daemon section in your configuration file -(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: - - [global] - debug ms = 1/5 - - [mon] - debug mon = 20 - debug paxos = 1/5 - debug auth = 2 - - [osd] - debug osd = 1/5 - debug filestore = 1/5 - debug journal = 1 - debug monc = 5/20 - - [mds] - debug mds = 1 - debug mds balancer = 1 - - -See `Subsystem, Log and Debug Settings`_ for details. - - -Accelerating Log Rotation -========================= - -If your OS disk is relatively full, you can accelerate log rotation by modifying -the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting -after the rotation frequency to accelerate log rotation (via cronjob) if your -logs exceed the size setting. For example, the default setting looks like -this:: - - rotate 7 - weekly - compress - sharedscripts - -Modify it by adding a ``size`` setting. :: - - rotate 7 - weekly - size 500M - compress - sharedscripts - -Then, start the crontab editor for your user space. :: - - crontab -e - -Finally, add an entry to check the ``etc/logrotate.d/ceph`` file. 
:: - - 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 - -The preceding example checks the ``etc/logrotate.d/ceph`` file every 30 minutes. - - -Valgrind -======== - -Debugging may also require you to track down memory and threading issues. -You can run a single daemon, a type of daemon, or the whole cluster with -Valgrind. You should only use Valgrind when developing or debugging Ceph. -Valgrind is computationally expensive, and will slow down your system otherwise. -Valgrind messages are logged to ``stderr``. - - -Subsystem, Log and Debug Settings -================================= - -In most cases, you will enable debug logging output via subsystems. - -Ceph Subsystems ---------------- - -Each subsystem has a logging level for its output logs, and for its logs -in-memory. You may set different values for each of these subsystems by setting -a log file level and a memory level for debug logging. Ceph's logging levels -operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is -verbose [#]_ . In general, the logs in-memory are not sent to the output log unless: - -- a fatal signal is raised or -- an ``assert`` in source code is triggered or -- upon requested. Please consult `document on admin socket <http://docs.ceph.com/docs/master/man/8/ceph/#daemon>`_ for more details. - -A debug logging setting can take a single value for the log level and the -memory level, which sets them both as the same value. For example, if you -specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level -of ``5``. You may also specify them separately. The first setting is the log -level, and the second setting is the memory level. You must separate them with -a forward slash (/). For example, if you want to set the ``ms`` subsystem's -debug logging level to ``1`` and its memory level to ``5``, you would specify it -as ``debug ms = 1/5``. For example: - - - -.. code-block:: ini - - debug {subsystem} = {log-level}/{memory-level} - #for example - debug mds balancer = 1/20 - - -The following table provides a list of Ceph subsystems and their default log and -memory levels. Once you complete your logging efforts, restore the subsystems -to their default level or to a level suitable for normal operations. 
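-
-For example, when you have finished debugging your OSDs, one way to return
-the ``osd`` subsystem to its default levels (``0/5``, per the table below)
-at runtime would be::
-
-    ceph tell osd.* injectargs --debug-osd 0/5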
- - -+--------------------+-----------+--------------+ -| Subsystem | Log Level | Memory Level | -+====================+===========+==============+ -| ``default`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``lockdep`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``context`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``crush`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds balancer`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds locker`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds log`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds log expire`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds migrator`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``buffer`` | 0 | 0 | -+--------------------+-----------+--------------+ -| ``timer`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``filer`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``objecter`` | 0 | 0 | -+--------------------+-----------+--------------+ -| ``rados`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``rbd`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``journaler`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``objectcacher`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``client`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``osd`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``optracker`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``objclass`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``filestore`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``journal`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``ms`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``mon`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``monc`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``paxos`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``tp`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``auth`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``finisher`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``heartbeatmap`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``perfcounter`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``rgw`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``javaclient`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``asok`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``throttle`` | 1 | 5 | -+--------------------+-----------+--------------+ - - -Logging Settings ----------------- - -Logging and debugging settings are not required in a Ceph configuration file, -but you may override default settings as needed. Ceph supports the following -settings: - - -``log file`` - -:Description: The location of the logging file for your cluster. -:Type: String -:Required: No -:Default: ``/var/log/ceph/$cluster-$name.log`` - - -``log max new`` - -:Description: The maximum number of new log files. 
-:Type: Integer -:Required: No -:Default: ``1000`` - - -``log max recent`` - -:Description: The maximum number of recent events to include in a log file. -:Type: Integer -:Required: No -:Default: ``1000000`` - - -``log to stderr`` - -:Description: Determines if logging messages should appear in ``stderr``. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``err to stderr`` - -:Description: Determines if error messages should appear in ``stderr``. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``log to syslog`` - -:Description: Determines if logging messages should appear in ``syslog``. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``err to syslog`` - -:Description: Determines if error messages should appear in ``syslog``. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``log flush on exit`` - -:Description: Determines if Ceph should flush the log files after exit. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``clog to monitors`` - -:Description: Determines if ``clog`` messages should be sent to monitors. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``clog to syslog`` - -:Description: Determines if ``clog`` messages should be sent to syslog. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``mon cluster log to syslog`` - -:Description: Determines if the cluster log should be output to the syslog. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``mon cluster log file`` - -:Description: The location of the cluster's log file. -:Type: String -:Required: No -:Default: ``/var/log/ceph/$cluster.log`` - - - -OSD ---- - - -``osd debug drop ping probability`` - -:Description: ? -:Type: Double -:Required: No -:Default: 0 - - -``osd debug drop ping duration`` - -:Description: -:Type: Integer -:Required: No -:Default: 0 - -``osd debug drop pg create probability`` - -:Description: -:Type: Integer -:Required: No -:Default: 0 - -``osd debug drop pg create duration`` - -:Description: ? -:Type: Double -:Required: No -:Default: 1 - - -``osd tmapput sets uses tmap`` - -:Description: Uses ``tmap``. For debug only. -:Type: Boolean -:Required: No -:Default: ``false`` - - -``osd min pg log entries`` - -:Description: The minimum number of log entries for placement groups. -:Type: 32-bit Unsigned Integer -:Required: No -:Default: 1000 - - -``osd op log threshold`` - -:Description: How many op log messages to show up in one pass. -:Type: Integer -:Required: No -:Default: 5 - - - -Filestore ---------- - -``filestore debug omap check`` - -:Description: Debugging check on synchronization. This is an expensive operation. -:Type: Boolean -:Required: No -:Default: 0 - - -MDS ---- - - -``mds debug scatterstat`` - -:Description: Ceph will assert that various recursive stat invariants are true - (for developers only). - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``mds debug frag`` - -:Description: Ceph will verify directory fragmentation invariants when - convenient (developers only). - -:Type: Boolean -:Required: No -:Default: ``false`` - - -``mds debug auth pins`` - -:Description: The debug auth pin invariants (for developers only). -:Type: Boolean -:Required: No -:Default: ``false`` - - -``mds debug subtrees`` - -:Description: The debug subtree invariants (for developers only). -:Type: Boolean -:Required: No -:Default: ``false`` - - - -RADOS Gateway -------------- - - -``rgw log nonexistent bucket`` - -:Description: Should we log a non-existent buckets? 
-:Type: Boolean -:Required: No -:Default: ``false`` - - -``rgw log object name`` - -:Description: Should an object's name be logged. // man date to see codes (a subset are supported) -:Type: String -:Required: No -:Default: ``%Y-%m-%d-%H-%i-%n`` - - -``rgw log object name utc`` - -:Description: Object log name contains UTC? -:Type: Boolean -:Required: No -:Default: ``false`` - - -``rgw enable ops log`` - -:Description: Enables logging of every RGW operation. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``rgw enable usage log`` - -:Description: Enable logging of RGW's bandwidth usage. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``rgw usage log flush threshold`` - -:Description: Threshold to flush pending log data. -:Type: Integer -:Required: No -:Default: ``1024`` - - -``rgw usage log tick interval`` - -:Description: Flush pending log data every ``s`` seconds. -:Type: Integer -:Required: No -:Default: 30 - - -``rgw intent log object name`` - -:Description: -:Type: String -:Required: No -:Default: ``%Y-%m-%d-%i-%n`` - - -``rgw intent log object name utc`` - -:Description: Include a UTC timestamp in the intent log object name. -:Type: Boolean -:Required: No -:Default: ``false`` - -.. [#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/src/ceph/doc/rados/troubleshooting/memory-profiling.rst b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst deleted file mode 100644 index e2396e2..0000000 --- a/src/ceph/doc/rados/troubleshooting/memory-profiling.rst +++ /dev/null @@ -1,142 +0,0 @@ -================== - Memory Profiling -================== - -Ceph MON, OSD and MDS can generate heap profiles using -``tcmalloc``. To generate heap profiles, ensure you have -``google-perftools`` installed:: - - sudo apt-get install google-perftools - -The profiler dumps output to your ``log file`` directory (i.e., -``/var/log/ceph``). See `Logging and Debugging`_ for details. -To view the profiler logs with Google's performance tools, execute the -following:: - - google-pprof --text {path-to-daemon} {log-path/filename} - -For example:: - - $ ceph tell osd.0 heap start_profiler - $ ceph tell osd.0 heap dump - osd.0 tcmalloc heap stats:------------------------------------------------ - MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application - MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist - MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist - MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist - MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists - MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata - MALLOC: ------------ - MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) - MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) - MALLOC: ------------ - MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used - MALLOC: - MALLOC: 231 Spans in use - MALLOC: 56 Thread heaps in use - MALLOC: 8192 Tcmalloc page size - ------------------------------------------------ - Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). - Bytes released to the OS take up virtual address space but no physical memory. - $ google-pprof --text \ - /usr/bin/ceph-osd \ - /var/log/ceph/ceph-osd.0.profile.0001.heap - Total: 3.7 MB - 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry - 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create - 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe - 0.0 0.4% 99.2% 0.0 0.6% decode_message - ... 
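-
-A rough sketch of collecting several profiles from one daemon (assuming
-``osd.0`` and the default log directory) so that successive dumps can be
-compared later::
-
-    ceph tell osd.0 heap start_profiler
-    for i in 1 2 3; do sleep 600; ceph tell osd.0 heap dump; done
-    ls /var/log/ceph/*.heap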
- -Another heap dump on the same daemon will add another file. It is -convenient to compare to a previous heap dump to show what has grown -in the interval. For instance:: - - $ google-pprof --text --base out/osd.0.profile.0001.heap \ - ceph-osd out/osd.0.profile.0003.heap - Total: 0.2 MB - 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry - 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create - 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op - 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate - -Refer to `Google Heap Profiler`_ for additional details. - -Once you have the heap profiler installed, start your cluster and -begin using the heap profiler. You may enable or disable the heap -profiler at runtime, or ensure that it runs continuously. For the -following commandline usage, replace ``{daemon-type}`` with ``mon``, -``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or -the MON or MDS id. - - -Starting the Profiler ---------------------- - -To start the heap profiler, execute the following:: - - ceph tell {daemon-type}.{daemon-id} heap start_profiler - -For example:: - - ceph tell osd.1 heap start_profiler - -Alternatively the profile can be started when the daemon starts -running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in -the environment. - -Printing Stats --------------- - -To print out statistics, execute the following:: - - ceph tell {daemon-type}.{daemon-id} heap stats - -For example:: - - ceph tell osd.0 heap stats - -.. note:: Printing stats does not require the profiler to be running and does - not dump the heap allocation information to a file. - - -Dumping Heap Information ------------------------- - -To dump heap information, execute the following:: - - ceph tell {daemon-type}.{daemon-id} heap dump - -For example:: - - ceph tell mds.a heap dump - -.. note:: Dumping heap information only works when the profiler is running. - - -Releasing Memory ----------------- - -To release memory that ``tcmalloc`` has allocated but which is not being used by -the Ceph daemon itself, execute the following:: - - ceph tell {daemon-type}{daemon-id} heap release - -For example:: - - ceph tell osd.2 heap release - - -Stopping the Profiler ---------------------- - -To stop the heap profiler, execute the following:: - - ceph tell {daemon-type}.{daemon-id} heap stop_profiler - -For example:: - - ceph tell osd.0 heap stop_profiler - -.. _Logging and Debugging: ../log-and-debug -.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst deleted file mode 100644 index 89fb94c..0000000 --- a/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst +++ /dev/null @@ -1,567 +0,0 @@ -================================= - Troubleshooting Monitors -================================= - -.. index:: monitor, high availability - -When a cluster encounters monitor-related troubles there's a tendency to -panic, and some times with good reason. You should keep in mind that losing -a monitor, or a bunch of them, don't necessarily mean that your cluster is -down, as long as a majority is up, running and with a formed quorum. -Regardless of how bad the situation is, the first thing you should do is to -calm down, take a breath and try answering our initial troubleshooting script. 
-
-
-Initial Troubleshooting
-========================
-
-
-**Are the monitors running?**
-
-  First of all, we need to make sure the monitors are running. You would be
-  amazed by how often people forget to run the monitors, or restart them after
-  an upgrade. There's no shame in that, but let's try not to lose a couple of
-  hours chasing an issue that is not there.
-
-**Are you able to connect to the monitor's servers?**
-
-  It doesn't happen often, but sometimes people do have ``iptables`` rules that
-  block access to monitor servers or monitor ports. These are usually leftovers
-  from monitor stress-testing that were forgotten at some point. Try ssh'ing
-  into the server and, if that succeeds, try connecting to the monitor's port
-  using your tool of choice (telnet, nc, ...).
-
-**Does ceph -s run and obtain a reply from the cluster?**
-
-  If the answer is yes then your cluster is up and running. One thing you
-  can take for granted is that the monitors will only answer to a ``status``
-  request if there is a formed quorum.
-
-  If ``ceph -s`` blocks, however, without obtaining a reply from the cluster
-  or showing a lot of ``fault`` messages, then it is likely that your monitors
-  are either down completely or only a portion is up -- a portion that is not
-  enough to form a quorum (keep in mind that a quorum is formed by a majority
-  of monitors).
-
-**What if ceph -s doesn't finish?**
-
-  If you haven't gone through all the steps so far, please go back and do so.
-
-  For those running on Emperor 0.72-rc1 and forward, you will be able to
-  contact each monitor individually asking them for their status, regardless
-  of a quorum being formed. This can be achieved using ``ceph ping mon.ID``,
-  ID being the monitor's identifier. You should perform this for each monitor
-  in the cluster. In section `Understanding mon_status`_ we will explain how
-  to interpret the output of this command.
-
-  For the rest of you who don't tread on the bleeding edge, you will need to
-  ssh into the server and use the monitor's admin socket. Please jump to
-  `Using the monitor's admin socket`_.
-
-For other specific issues, keep on reading.
-
-
-Using the monitor's admin socket
-=================================
-
-The admin socket allows you to interact with a given daemon directly using a
-Unix socket file. This file can be found in your monitor's ``run`` directory.
-By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``
-but this can vary if you defined it otherwise. If you don't find it there,
-please check your ``ceph.conf`` for an alternative path or run::
-
-    ceph-conf --name mon.ID --show-config-value admin_socket
-
-Please bear in mind that the admin socket will only be available while the
-monitor is running. When the monitor is properly shut down, the admin socket
-will be removed. If, however, the monitor is not running and the admin socket
-still persists, it is likely that the monitor was improperly shut down.
-Regardless, if the monitor is not running, you will not be able to use the
-admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``.
-
-Accessing the admin socket is as simple as telling the ``ceph`` tool to use
-the ``asok`` file.
In pre-Dumpling Ceph, this can be achieved by:: - - ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command> - -while in Dumpling and beyond you can use the alternate (and recommended) -format:: - - ceph daemon mon.<id> <command> - -Using ``help`` as the command to the ``ceph`` tool will show you the -supported commands available through the admin socket. Please take a look -at ``config get``, ``config show``, ``mon_status`` and ``quorum_status``, -as those can be enlightening when troubleshooting a monitor. - - -Understanding mon_status -========================= - -``mon_status`` can be obtained through the ``ceph`` tool when you have -a formed quorum, or via the admin socket if you don't. This command will -output a multitude of information about the monitor, including the same -output you would get with ``quorum_status``. - -Take the following example of ``mon_status``:: - - - { "name": "c", - "rank": 2, - "state": "peon", - "election_epoch": 38, - "quorum": [ - 1, - 2], - "outside_quorum": [], - "extra_probe_peers": [], - "sync_provider": [], - "monmap": { "epoch": 3, - "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", - "modified": "2013-10-30 04:12:01.945629", - "created": "2013-10-29 14:14:41.914786", - "mons": [ - { "rank": 0, - "name": "a", - "addr": "127.0.0.1:6789\/0"}, - { "rank": 1, - "name": "b", - "addr": "127.0.0.1:6790\/0"}, - { "rank": 2, - "name": "c", - "addr": "127.0.0.1:6795\/0"}]}} - -A couple of things are obvious: we have three monitors in the monmap (*a*, *b* -and *c*), the quorum is formed by only two monitors, and *c* is in the quorum -as a *peon*. - -Which monitor is out of the quorum? - - The answer would be **a**. - -Why? - - Take a look at the ``quorum`` set. We have two monitors in this set: *1* - and *2*. These are not monitor names. These are monitor ranks, as established - in the current monmap. We are missing the monitor with rank 0, and according - to the monmap that would be ``mon.a``. - -By the way, how are ranks established? - - Ranks are (re)calculated whenever you add or remove monitors and follow a - simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the - rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all - the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. - -Most Common Monitor Issues -=========================== - -Have Quorum but at least one Monitor is down ---------------------------------------------- - -When this happens, depending on the version of Ceph you are running, -you should be seeing something similar to:: - - $ ceph health detail - [snip] - mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) - -How to troubleshoot this? - - First, make sure ``mon.a`` is running. - - Second, make sure you are able to connect to ``mon.a``'s server from the - other monitors' servers. Check the ports as well. Check ``iptables`` on - all your monitor nodes and make sure you are not dropping/rejecting - connections. - - If this initial troubleshooting doesn't solve your problems, then it's - time to go deeper. - - First, check the problematic monitor's ``mon_status`` via the admin - socket as explained in `Using the monitor's admin socket`_ and - `Understanding mon_status`_. - - Considering the monitor is out of the quorum, its state should be one of - ``probing``, ``electing`` or ``synchronizing``. 
If it happens to be either ``leader`` or ``peon``, then the monitor
-  believes it is in the quorum, while the remaining cluster is sure it is
-  not; or maybe it got into the quorum while we were troubleshooting the
-  monitor, so check ``ceph -s`` again just to make sure. Proceed if the
-  monitor is not yet in the quorum.
-
-What if the state is ``probing``?
-
-  This means the monitor is still looking for the other monitors. Every time
-  you start a monitor, the monitor will stay in this state for some time
-  while trying to find the rest of the monitors specified in the ``monmap``.
-  The time a monitor will spend in this state can vary. For instance, on
-  a single-monitor cluster, the monitor will pass through the probing state
-  almost instantaneously, since there are no other monitors around. On a
-  multi-monitor cluster, the monitors will stay in this state until they
-  find enough monitors to form a quorum -- this means that if you have 2 out
-  of 3 monitors down, the one remaining monitor will stay in this state
-  indefinitely until you bring one of the other monitors up.
-
-  If you have a quorum, however, the monitor should be able to find the
-  remaining monitors pretty fast, as long as they can be reached. If your
-  monitor is stuck probing and you have gone through all the communication
-  troubleshooting, then there is a fair chance that the monitor is trying
-  to reach the other monitors on a wrong address. ``mon_status`` outputs the
-  ``monmap`` known to the monitor: check if the other monitors' locations
-  match reality. If they don't, jump to
-  `Recovering a Monitor's Broken monmap`_; if they do, then it may be related
-  to severe clock skews amongst the monitor nodes and you should refer to
-  `Clock Skews`_ first, but if that doesn't solve your problem then it is
-  time to prepare some logs and reach out to the community (please refer
-  to `Preparing your logs`_ on how to best prepare your logs).
-
-
-What if state is ``electing``?
-
-  This means the monitor is in the middle of an election. These should be
-  fast to complete, but at times the monitors can get stuck electing. This
-  is usually a sign of a clock skew among the monitor nodes; jump to
-  `Clock Skews`_ for more information on that. If all your clocks are properly
-  synchronized, it is best if you prepare some logs and reach out to the
-  community. This is not a state that is likely to persist and, aside from
-  (*really*) old bugs, there is no obvious reason besides clock skews why
-  this would happen.
-
-What if state is ``synchronizing``?
-
-  This means the monitor is synchronizing with the rest of the cluster in
-  order to join the quorum. The synchronization process is faster the
-  smaller your monitor store is, so if you have a big store it may
-  take a while. Don't worry, it should be finished soon enough.
-
-  However, if you notice that the monitor jumps from ``synchronizing`` to
-  ``electing`` and then back to ``synchronizing``, then you do have a
-  problem: the cluster state is advancing (i.e., generating new maps) way
-  too fast for the synchronization process to keep up. This was an issue in
-  early Cuttlefish, but since then the synchronization process has been
-  substantially refactored and enhanced to avoid exactly this sort of
-  behavior. If this happens in later versions, let us know, and bring some
-  logs (see `Preparing your logs`_).
-
-What if state is ``leader`` or ``peon``?
-
-  This should not happen.
There is a chance this might happen however, and - it has a lot to do with clock skews -- see `Clock Skews`_. If you are not - suffering from clock skews, then please prepare your logs (see - `Preparing your logs`_) and reach out to us. - - -Recovering a Monitor's Broken monmap -------------------------------------- - -This is how a ``monmap`` usually looks like, depending on the number of -monitors:: - - - epoch 3 - fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 - last_changed 2013-10-30 04:12:01.945629 - created 2013-10-29 14:14:41.914786 - 0: 127.0.0.1:6789/0 mon.a - 1: 127.0.0.1:6790/0 mon.b - 2: 127.0.0.1:6795/0 mon.c - -This may not be what you have however. For instance, in some versions of -early Cuttlefish there was this one bug that could cause your ``monmap`` -to be nullified. Completely filled with zeros. This means that not even -``monmaptool`` would be able to read it because it would find it hard to -make sense of only-zeros. Some other times, you may end up with a monitor -with a severely outdated monmap, thus being unable to find the remaining -monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, -then remove ``mon.a``, then add a new monitor ``mon.e`` and remove -``mon.b``; you will end up with a totally different monmap from the one -``mon.c`` knows). - -In this sort of situations, you have two possible solutions: - -Scrap the monitor and create a new one - - You should only take this route if you are positive that you won't - lose the information kept by that monitor; that you have other monitors - and that they are running just fine so that your new monitor is able - to synchronize from the remaining monitors. Keep in mind that destroying - a monitor, if there are no other copies of its contents, may lead to - loss of data. - -Inject a monmap into the monitor - - Usually the safest path. You should grab the monmap from the remaining - monitors and inject it into the monitor with the corrupted/lost monmap. - - These are the basic steps: - - 1. Is there a formed quorum? If so, grab the monmap from the quorum:: - - $ ceph mon getmap -o /tmp/monmap - - 2. No quorum? Grab the monmap directly from another monitor (this - assumes the monitor you are grabbing the monmap from has id ID-FOO - and has been stopped):: - - $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap - - 3. Stop the monitor you are going to inject the monmap into. - - 4. Inject the monmap:: - - $ ceph-mon -i ID --inject-monmap /tmp/monmap - - 5. Start the monitor - - Please keep in mind that the ability to inject monmaps is a powerful - feature that can cause havoc with your monitors if misused as it will - overwrite the latest, existing monmap kept by the monitor. - - -Clock Skews ------------- - -Monitors can be severely affected by significant clock skews across the -monitor nodes. This usually translates into weird behavior with no obvious -cause. To avoid such issues, you should run a clock synchronization tool -on your monitor nodes. - - -What's the maximum tolerated clock skew? - - By default the monitors will allow clocks to drift up to ``0.05 seconds``. - - -Can I increase the maximum tolerated clock skew? - - This value is configurable via the ``mon-clock-drift-allowed`` option, and - although you *CAN* it doesn't mean you *SHOULD*. The clock skew mechanism - is in place because clock skewed monitor may not properly behave. We, as - developers and QA afficcionados, are comfortable with the current default - value, as it will alert the user before the monitors get out hand. 
Changing - this value without testing it first may cause unforeseen effects on the - stability of the monitors and overall cluster healthiness, although there is - no risk of dataloss. - - -How do I know there's a clock skew? - - The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health - detail`` should show something in the form of:: - - mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s) - - That means that ``mon.c`` has been flagged as suffering from a clock skew. - - -What should I do if there's a clock skew? - - Synchronize your clocks. Running an NTP client may help. If you are already - using one and you hit this sort of issues, check if you are using some NTP - server remote to your network and consider hosting your own NTP server on - your network. This last option tends to reduce the amount of issues with - monitor clock skews. - - -Client Can't Connect or Mount ------------------------------- - -Check your IP tables. Some OS install utilities add a ``REJECT`` rule to -``iptables``. The rule rejects all clients trying to connect to the host except -for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in -place, clients connecting from a separate node will fail to mount with a timeout -error. You need to address ``iptables`` rules that reject clients trying to -connect to Ceph daemons. For example, you would need to address rules that look -like this appropriately:: - - REJECT all -- anywhere anywhere reject-with icmp-host-prohibited - -You may also need to add rules to IP tables on your Ceph hosts to ensure -that clients can access the ports associated with your Ceph monitors (i.e., port -6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For -example:: - - iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT - -Monitor Store Failures -====================== - -Symptoms of store corruption ----------------------------- - -Ceph monitor stores the `cluster map`_ in a key/value store such as LevelDB. If -a monitor fails due to the key/value store corruption, following error messages -might be found in the monitor log:: - - Corruption: error in middle of record - -or:: - - Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb - -Recovery using healthy monitor(s) ---------------------------------- - -If there is any survivers, we can always `replace`_ the corrupted one with a -new one. And after booting up, the new joiner will sync up with a healthy -peer, and once it is fully sync'ed, it will be able to serve the clients. - -Recovery using OSDs -------------------- - -But what if all monitors fail at the same time? Since users are encouraged to -deploy at least three monitors in a Ceph cluster, the chance of simultaneous -failure is rare. But unplanned power-downs in a data center with improperly -configured disk/fs settings could fail the underlying filesystem, and hence -kill all the monitors. 
In this case, we can recover the monitor store with the -information stored in OSDs.:: - - ms=/tmp/mon-store - mkdir $ms - # collect the cluster map from OSDs - for host in $hosts; do - rsync -avz $ms user@host:$ms - rm -rf $ms - ssh user@host <<EOF - for osd in /var/lib/osd/osd-*; do - ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms - done - EOF - rsync -avz user@host:$ms $ms - done - # rebuild the monitor store from the collected map, if the cluster does not - # use cephx authentication, we can skip the following steps to update the - # keyring with the caps, and there is no need to pass the "--keyring" option. - # i.e. just use "ceph-monstore-tool /tmp/mon-store rebuild" instead - ceph-authtool /path/to/admin.keyring -n mon. \ - --cap mon 'allow *' - ceph-authtool /path/to/admin.keyring -n client.admin \ - --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' - ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring - # backup corrupted store.db just in case - mv /var/lib/ceph/mon/mon.0/store.db /var/lib/ceph/mon/mon.0/store.db.corrupted - mv /tmp/mon-store/store.db /var/lib/ceph/mon/mon.0/store.db - chown -R ceph:ceph /var/lib/ceph/mon/mon.0/store.db - -The steps above - -#. collect the map from all OSD hosts, -#. then rebuild the store, -#. fill the entities in keyring file with appropriate caps -#. replace the corrupted store on ``mon.0`` with the recovered copy. - -Known limitations -~~~~~~~~~~~~~~~~~ - -Following information are not recoverable using the steps above: - -- **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command - are recovered from the OSD's copy. And the ``client.admin`` keyring is imported - using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing - in the recovered monitor store. You might need to re-add them manually. - -- **pg settings**: the ``full ratio`` and ``nearfull ratio`` settings configured using - ``ceph pg set_full_ratio`` and ``ceph pg set_nearfull_ratio`` will be lost. - -- **MDS Maps**: the MDS maps are lost. - - -Everything Failed! Now What? -============================= - -Reaching out for help ----------------------- - -You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net) -and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make -sure you have grabbed your logs and have them ready if someone asks: the faster -the interaction and lower the latency in response, the better chances everyone's -time is optimized. - - -Preparing your logs ---------------------- - -Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We -may want them. However, your logs may not have the necessary information. If -you don't find your monitor logs at their default location, you can check -where they should be by running:: - - ceph-conf --name mon.FOO --show-config-value log_file - -The amount of information in the logs are subject to the debug levels being -enforced by your configuration files. If you have not enforced a specific -debug level then Ceph is using the default levels and your logs may not -contain important information to track down you issue. -A first step in getting relevant information into your logs will be to raise -debug levels. In this case we will be interested in the information from the -monitor. -Similarly to what happens on other components, different parts of the monitor -will output their debug information on different subsystems. 
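You can list the monitor's subsystems and their current debug levels through
the admin socket (a sketch; ``mon.FOO`` is a placeholder for your monitor's
id)::

    ceph daemon mon.FOO config show | grep debug
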
- -You will have to raise the debug levels of those subsystems more closely -related to your issue. This may not be an easy task for someone unfamiliar -with troubleshooting Ceph. For most situations, setting the following options -on your monitors will be enough to pinpoint a potential source of the issue:: - - debug mon = 10 - debug ms = 1 - -If we find that these debug levels are not enough, there's a chance we may -ask you to raise them or even define other debug subsystems to obtain infos -from -- but at least we started off with some useful information, instead -of a massively empty log without much to go on with. - -Do I need to restart a monitor to adjust debug levels? ------------------------------------------------------- - -No. You may do it in one of two ways: - -You have quorum - - Either inject the debug option into the monitor you want to debug:: - - ceph tell mon.FOO injectargs --debug_mon 10/10 - - or into all monitors at once:: - - ceph tell mon.* injectargs --debug_mon 10/10 - -No quourm - - Use the monitor's admin socket and directly adjust the configuration - options:: - - ceph daemon mon.FOO config set debug_mon 10/10 - - -Going back to default values is as easy as rerunning the above commands -using the debug level ``1/10`` instead. You can check your current -values using the admin socket and the following commands:: - - ceph daemon mon.FOO config show - -or:: - - ceph daemon mon.FOO config get 'OPTION_NAME' - - -Reproduced the problem with appropriate debug levels. Now what? ----------------------------------------------------------------- - -Ideally you would send us only the relevant portions of your logs. -We realise that figuring out the corresponding portion may not be the -easiest of tasks. Therefore, we won't hold it to you if you provide the -full log, but common sense should be employed. If your log has hundreds -of thousands of lines, it may get tricky to go through the whole thing, -specially if we are not aware at which point, whatever your issue is, -happened. For instance, when reproducing, keep in mind to write down -current time and date and to extract the relevant portions of your logs -based on that. - -Finally, you should reach out to us on the mailing lists, on IRC or file -a new issue on the `tracker`_. - -.. _cluster map: ../../architecture#cluster-map -.. _replace: ../operation/add-or-rm-mons -.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst deleted file mode 100644 index 88307fe..0000000 --- a/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst +++ /dev/null @@ -1,536 +0,0 @@ -====================== - Troubleshooting OSDs -====================== - -Before troubleshooting your OSDs, check your monitors and network first. If -you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns -a health status, it means that the monitors have a quorum. -If you don't have a monitor quorum or if there are errors with the monitor -status, `address the monitor issues first <../troubleshooting-mon>`_. -Check your networks to ensure they -are running properly, because networks may have a significant impact on OSD -operation and performance. - - - -Obtaining Data About OSDs -========================= - -A good first step in troubleshooting your OSDs is to obtain information in -addition to the information you collected while `monitoring your OSDs`_ -(e.g., ``ceph osd tree``). 
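As a quick first pass, the following commands summarize OSD state before you
dig into logs or the admin socket (a minimal sketch; run them from a node
with an admin keyring)::

    ceph -s                  # overall cluster status, including how many OSDs are up and in
    ceph osd tree            # OSD-to-host mapping with up/down status and weights
    ceph osd stat            # one-line summary of OSD counts
    ceph health detail       # per-OSD detail when the cluster is not HEALTH_OK
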
- - -Ceph Logs ---------- - -If you haven't changed the default path, you can find Ceph log files at -``/var/log/ceph``:: - - ls /var/log/ceph - -If you don't get enough log detail, you can change your logging level. See -`Logging and Debugging`_ for details to ensure that Ceph performs adequately -under high logging volume. - - -Admin Socket ------------- - -Use the admin socket tool to retrieve runtime information. For details, list -the sockets for your Ceph processes:: - - ls /var/run/ceph - -Then, execute the following, replacing ``{daemon-name}`` with an actual -daemon (e.g., ``osd.0``):: - - ceph daemon osd.0 help - -Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: - - ceph daemon {socket-file} help - - -The admin socket, among other things, allows you to: - -- List your configuration at runtime -- Dump historic operations -- Dump the operation priority queue state -- Dump operations in flight -- Dump perfcounters - - -Display Freespace ------------------ - -Filesystem issues may arise. To display your filesystem's free space, execute -``df``. :: - - df -h - -Execute ``df --help`` for additional usage. - - -I/O Statistics --------------- - -Use `iostat`_ to identify I/O-related issues. :: - - iostat -x - - -Diagnostic Messages -------------------- - -To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep`` -or ``tail``. For example:: - - dmesg | grep scsi - - -Stopping w/out Rebalancing -========================== - -Periodically, you may need to perform maintenance on a subset of your cluster, -or resolve a problem that affects a failure domain (e.g., a rack). If you do not -want CRUSH to automatically rebalance the cluster as you stop OSDs for -maintenance, set the cluster to ``noout`` first:: - - ceph osd set noout - -Once the cluster is set to ``noout``, you can begin stopping the OSDs within the -failure domain that requires maintenance work. :: - - stop ceph-osd id={num} - -.. note:: Placement groups within the OSDs you stop will become ``degraded`` - while you are addressing issues with within the failure domain. - -Once you have completed your maintenance, restart the OSDs. :: - - start ceph-osd id={num} - -Finally, you must unset the cluster from ``noout``. :: - - ceph osd unset noout - - - -.. _osd-not-running: - -OSD Not Running -=============== - -Under normal circumstances, simply restarting the ``ceph-osd`` daemon will -allow it to rejoin the cluster and recover. - -An OSD Won't Start ------------------- - -If you start your cluster and an OSD won't start, check the following: - -- **Configuration File:** If you were not able to get OSDs running from - a new installation, check your configuration file to ensure it conforms - (e.g., ``host`` not ``hostname``, etc.). - -- **Check Paths:** Check the paths in your configuration, and the actual - paths themselves for data and journals. If you separate the OSD data from - the journal data and there are errors in your configuration file or in the - actual mounts, you may have trouble starting OSDs. If you want to store the - journal on a block device, you should partition your journal disk and assign - one partition per OSD. - -- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be - hitting the default maximum number of threads (e.g., usually 32k), especially - during recovery. 
You can increase the number of threads using ``sysctl`` to - see if increasing the maximum number of threads to the maximum possible - number of threads allowed (i.e., 4194303) will help. For example:: - - sysctl -w kernel.pid_max=4194303 - - If increasing the maximum thread count resolves the issue, you can make it - permanent by including a ``kernel.pid_max`` setting in the - ``/etc/sysctl.conf`` file. For example:: - - kernel.pid_max = 4194303 - -- **Kernel Version:** Identify the kernel version and distribution you - are using. Ceph uses some third party tools by default, which may be - buggy or may conflict with certain distributions and/or kernel - versions (e.g., Google perftools). Check the `OS recommendations`_ - to ensure you have addressed any issues related to your kernel. - -- **Segment Fault:** If there is a segment fault, turn your logging up - (if it is not already), and try again. If it segment faults again, - contact the ceph-devel email list and provide your Ceph configuration - file, your monitor output and the contents of your log file(s). - - - -An OSD Failed -------------- - -When a ``ceph-osd`` process dies, the monitor will learn about the failure -from surviving ``ceph-osd`` daemons and report it via the ``ceph health`` -command:: - - ceph health - HEALTH_WARN 1/3 in osds are down - -Specifically, you will get a warning whenever there are ``ceph-osd`` -processes that are marked ``in`` and ``down``. You can identify which -``ceph-osds`` are ``down`` with:: - - ceph health detail - HEALTH_WARN 1/3 in osds are down - osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 - -If there is a disk -failure or other fault preventing ``ceph-osd`` from functioning or -restarting, an error message should be present in its log file in -``/var/log/ceph``. - -If the daemon stopped because of a heartbeat failure, the underlying -kernel file system may be unresponsive. Check ``dmesg`` output for disk -or other kernel errors. - -If the problem is a software error (failed assertion or other -unexpected error), it should be reported to the `ceph-devel`_ email list. - - -No Free Drive Space -------------------- - -Ceph prevents you from writing to a full OSD so that you don't lose data. -In an operational cluster, you should receive a warning when your cluster -is getting near its full ratio. The ``mon osd full ratio`` defaults to -``0.95``, or 95% of capacity before it stops clients from writing data. -The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90 % of -capacity when it blocks backfills from starting. The -``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity -when it generates a health warning. - -Full cluster issues usually arise when testing how Ceph handles an OSD -failure on a small cluster. When one node has a high percentage of the -cluster's data, the cluster can easily eclipse its nearfull and full ratio -immediately. If you are testing how Ceph reacts to OSD failures on a small -cluster, you should leave ample free disk space and consider temporarily -lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and -``mon osd nearfull ratio``. 
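If you do adjust them, the runtime thresholds can be changed with
``ceph pg set_full_ratio`` and ``ceph pg set_nearfull_ratio`` (a sketch only;
the values below are illustrative and the defaults should be restored once
your testing is done)::

    ceph pg set_nearfull_ratio 0.80
    ceph pg set_full_ratio 0.90
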
- -Full ``ceph-osds`` will be reported by ``ceph health``:: - - ceph health - HEALTH_WARN 1 nearfull osd(s) - -Or:: - - ceph health detail - HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) - osd.3 is full at 97% - osd.4 is backfill full at 91% - osd.2 is near full at 87% - -The best way to deal with a full cluster is to add new ``ceph-osds``, allowing -the cluster to redistribute data to the newly available storage. - -If you cannot start an OSD because it is full, you may delete some data by deleting -some placement group directories in the full OSD. - -.. important:: If you choose to delete a placement group directory on a full OSD, - **DO NOT** delete the same placement group directory on another full OSD, or - **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on - at least one OSD. - -See `Monitor Config Reference`_ for additional details. - - -OSDs are Slow/Unresponsive -========================== - -A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you -have eliminated other troubleshooting possibilities before delving into OSD -performance issues. For example, ensure that your network(s) is working properly -and your OSDs are running. Check to see if OSDs are throttling recovery traffic. - -.. tip:: Newer versions of Ceph provide better recovery handling by preventing - recovering OSDs from using up system resources so that ``up`` and ``in`` - OSDs are not available or are otherwise slow. - - -Networking Issues ------------------ - -Ceph is a distributed storage system, so it depends upon networks to peer with -OSDs, replicate objects, recover from faults and check heartbeats. Networking -issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for -details. - -Ensure that Ceph processes and Ceph-dependent processes are connected and/or -listening. :: - - netstat -a | grep ceph - netstat -l | grep ceph - sudo netstat -p | grep ceph - -Check network statistics. :: - - netstat -s - - -Drive Configuration -------------------- - -A storage drive should only support one OSD. Sequential read and sequential -write throughput can bottleneck if other processes share the drive, including -journals, operating systems, monitors, other OSDs and non-Ceph processes. - -Ceph acknowledges writes *after* journaling, so fast SSDs are an -attractive option to accelerate the response time--particularly when -using the ``XFS`` or ``ext4`` filesystems. By contrast, the ``btrfs`` -filesystem can write and journal simultaneously. (Note, however, that -we recommend against using ``btrfs`` for production deployments.) - -.. note:: Partitioning a drive does not change its total throughput or - sequential read/write limits. Running a journal in a separate partition - may help, but you should prefer a separate physical drive. - - -Bad Sectors / Fragmented Disk ------------------------------ - -Check your disks for bad sectors and fragmentation. This can cause total throughput -to drop substantially. - - -Co-resident Monitors/OSDs -------------------------- - -Monitors are generally light-weight processes, but they do lots of ``fsync()``, -which can interfere with other workloads, particularly if monitors run on the -same drive as your OSDs. Additionally, if you run monitors on the same host as -the OSDs, you may incur performance issues related to: - -- Running an older kernel (pre-3.0) -- Running Argonaut with an old ``glibc`` -- Running a kernel with no syncfs(2) syscall. 
- -In these cases, multiple OSDs running on the same host can drag each other down -by doing lots of commits. That often leads to the bursty writes. - - -Co-resident Processes ---------------------- - -Spinning up co-resident processes such as a cloud-based solution, virtual -machines and other applications that write data to Ceph while operating on the -same hardware as OSDs can introduce significant OSD latency. Generally, we -recommend optimizing a host for use with Ceph and using other hosts for other -processes. The practice of separating Ceph operations from other applications -may help improve performance and may streamline troubleshooting and maintenance. - - -Logging Levels --------------- - -If you turned logging levels up to track an issue and then forgot to turn -logging levels back down, the OSD may be putting a lot of logs onto the disk. If -you intend to keep logging levels high, you may consider mounting a drive to the -default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). - - -Recovery Throttling -------------------- - -Depending upon your configuration, Ceph may reduce recovery rates to maintain -performance or it may increase recovery rates to the point that recovery -impacts OSD performance. Check to see if the OSD is recovering. - - -Kernel Version --------------- - -Check the kernel version you are running. Older kernels may not receive -new backports that Ceph depends upon for better performance. - - -Kernel Issues with SyncFS -------------------------- - -Try running one OSD per host to see if performance improves. Old kernels -might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. - - -Filesystem Issues ------------------ - -Currently, we recommend deploying clusters with XFS. - -We recommend against using btrfs or ext4. The btrfs filesystem has -many attractive features, but bugs in the filesystem may lead to -performance issues and suprious ENOSPC errors. We do not recommend -ext4 because xattr size limitations break our support for long object -names (needed for RGW). - -For more information, see `Filesystem Recommendations`_. - -.. _Filesystem Recommendations: ../configuration/filesystem-recommendations - - -Insufficient RAM ----------------- - -We recommend 1GB of RAM per OSD daemon. You may notice that during normal -operations, the OSD only uses a fraction of that amount (e.g., 100-200MB). -Unused RAM makes it tempting to use the excess RAM for co-resident applications, -VMs and so forth. However, when OSDs go into recovery mode, their memory -utilization spikes. If there is no RAM available, the OSD performance will slow -considerably. - - -Old Requests or Slow Requests ------------------------------ - -If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages -complaining about requests that are taking too long. The warning threshold -defaults to 30 seconds, and is configurable via the ``osd op complaint time`` -option. When this happens, the cluster log will receive messages. 
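You can follow these messages as they arrive, and check the complaint
threshold currently in effect, with something like the following (a sketch;
``osd.0`` is just an example daemon)::

    ceph -w | grep 'slow request'
    ceph daemon osd.0 config get osd_op_complaint_time
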
- -Legacy versions of Ceph complain about 'old requests`:: - - osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops - -New versions of Ceph complain about 'slow requests`:: - - {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs - {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610] - - -Possible causes include: - -- A bad drive (check ``dmesg`` output) -- A bug in the kernel file system bug (check ``dmesg`` output) -- An overloaded cluster (check system load, iostat, etc.) -- A bug in the ``ceph-osd`` daemon. - -Possible solutions - -- Remove VMs Cloud Solutions from Ceph Hosts -- Upgrade Kernel -- Upgrade Ceph -- Restart OSDs - -Debugging Slow Requests ------------------------ - -If you run "ceph daemon osd.<id> dump_historic_ops" or "dump_ops_in_flight", -you will see a set of operations and a list of events each operation went -through. These are briefly described below. - -Events from the Messenger layer: - -- header_read: when the messenger first started reading the message off the wire -- throttled: when the messenger tried to acquire memory throttle space to read - the message into memory -- all_read: when the messenger finished reading the message off the wire -- dispatched: when the messenger gave the message to the OSD -- Initiated: <This is identical to header_read. The existence of both is a - historical oddity. - -Events from the OSD as it prepares operations - -- queued_for_pg: the op has been put into the queue for processing by its PG -- reached_pg: the PG has started doing the op -- waiting for \*: the op is waiting for some other work to complete before it - can proceed (a new OSDMap; for its object target to scrub; for the PG to - finish peering; all as specified in the message) -- started: the op has been accepted as something the OSD should actually do - (reasons not to do it: failed security/permission checks; out-of-date local - state; etc) and is now actually being performed -- waiting for subops from: the op has been sent to replica OSDs - -Events from the FileStore - -- commit_queued_for_journal_write: the op has been given to the FileStore -- write_thread_in_journal_buffer: the op is in the journal's buffer and waiting - to be persisted (as the next disk write) -- journaled_completion_queued: the op was journaled to disk and its callback - queued for invocation - -Events from the OSD after stuff has been given to local disk - -- op_commit: the op has been committed (ie, written to journal) by the - primary OSD -- op_applied: The op has been write()'en to the backing FS (ie, applied in - memory but not flushed out to disk) on the primary -- sub_op_applied: op_applied, but for a replica's "subop" -- sub_op_committed: op_commited, but for a replica's subop (only for EC pools) -- sub_op_commit_rec/sub_op_apply_rec from <X>: the primary marks this when it - hears about the above, but for a particular replica <X> -- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops) - -Many of these events are seemingly redundant, but cross important boundaries in -the internal code (such as passing data across locks into new threads). 
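For example, to capture the operations an OSD is currently tracking, together
with their per-event timestamps (a sketch; replace ``osd.0`` with the daemon
you are investigating)::

    ceph daemon osd.0 dump_ops_in_flight
    ceph daemon osd.0 dump_historic_ops

Each entry reports its duration and the list of events described above, so
comparing the timestamps of consecutive events usually shows at which stage a
request is stalling.
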
- -Flapping OSDs -============= - -We recommend using both a public (front-end) network and a cluster (back-end) -network so that you can better meet the capacity requirements of object -replication. Another advantage is that you can run a cluster network such that -it is not connected to the internet, thereby preventing some denial of service -attacks. When OSDs peer and check heartbeats, they use the cluster (back-end) -network when it's available. See `Monitor/OSD Interaction`_ for details. - -However, if the cluster (back-end) network fails or develops significant latency -while the public (front-end) network operates optimally, OSDs currently do not -handle this situation well. What happens is that OSDs mark each other ``down`` -on the monitor, while marking themselves ``up``. We call this scenario -'flapping`. - -If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and -then ``up`` again), you can force the monitors to stop the flapping with:: - - ceph osd set noup # prevent OSDs from getting marked up - ceph osd set nodown # prevent OSDs from getting marked down - -These flags are recorded in the osdmap structure:: - - ceph osd dump | grep flags - flags no-up,no-down - -You can clear the flags with:: - - ceph osd unset noup - ceph osd unset nodown - -Two other flags are supported, ``noin`` and ``noout``, which prevent -booting OSDs from being marked ``in`` (allocated data) or protect OSDs -from eventually being marked ``out`` (regardless of what the current value for -``mon osd down out interval`` is). - -.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the - sense that once the flags are cleared, the action they were blocking - should occur shortly after. The ``noin`` flag, on the other hand, - prevents OSDs from being marked ``in`` on boot, and any daemons that - started while the flag was set will remain that way. - - - - - - -.. _iostat: http://en.wikipedia.org/wiki/Iostat -.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging -.. _Logging and Debugging: ../log-and-debug -.. _Debugging and Logging: ../debug -.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction -.. _Monitor Config Reference: ../../configuration/mon-config-ref -.. _monitoring your OSDs: ../../operations/monitoring-osd-pg -.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel -.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel -.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com -.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com -.. _OS recommendations: ../../../start/os-recommendations -.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst deleted file mode 100644 index 4241fee..0000000 --- a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst +++ /dev/null @@ -1,668 +0,0 @@ -===================== - Troubleshooting PGs -===================== - -Placement Groups Never Get Clean -================================ - -When you create a cluster and your cluster remains in ``active``, -``active+remapped`` or ``active+degraded`` status and never achieve an -``active+clean`` status, you likely have a problem with your configuration. 
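A quick way to see which placement groups are affected is to list the ones
that are not clean (a sketch; ``dump_stuck`` is described in more detail
under `Stuck Placement Groups`_ below)::

    ceph pg stat
    ceph pg dump_stuck unclean
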
- -You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ -and make appropriate adjustments. - -As a general rule, you should run your cluster with more than one OSD and a -pool size greater than 1 object replica. - -One Node Cluster ----------------- - -Ceph no longer provides documentation for operating on a single node, because -you would never deploy a system designed for distributed computing on a single -node. Additionally, mounting client kernel modules on a single node containing a -Ceph daemon may cause a deadlock due to issues with the Linux kernel itself -(unless you use VMs for the clients). You can experiment with Ceph in a 1-node -configuration, in spite of the limitations as described herein. - -If you are trying to create a cluster on a single node, you must change the -default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning -``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration -file before you create your monitors and OSDs. This tells Ceph that an OSD -can peer with another OSD on the same host. If you are trying to set up a -1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, -Ceph will try to peer the PGs of one OSD with the PGs of another OSD on -another node, chassis, rack, row, or even datacenter depending on the setting. - -.. tip:: DO NOT mount kernel clients directly on the same node as your - Ceph Storage Cluster, because kernel conflicts can arise. However, you - can mount kernel clients within virtual machines (VMs) on a single node. - -If you are creating OSDs using a single disk, you must create directories -for the data manually first. For example:: - - mkdir /var/local/osd0 /var/local/osd1 - ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 - ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 - - -Fewer OSDs than Replicas ------------------------- - -If you have brought up two OSDs to an ``up`` and ``in`` state, but you still -don't see ``active + clean`` placement groups, you may have an -``osd pool default size`` set to greater than ``2``. - -There are a few ways to address this situation. If you want to operate your -cluster in an ``active + degraded`` state with two replicas, you can set the -``osd pool default min size`` to ``2`` so that you can write objects in -an ``active + degraded`` state. You may also set the ``osd pool default size`` -setting to ``2`` so that you only have two stored replicas (the original and -one replica), in which case the cluster should achieve an ``active + clean`` -state. - -.. note:: You can make the changes at runtime. If you make the changes in - your Ceph configuration file, you may need to restart your cluster. - - -Pool Size = 1 -------------- - -If you have the ``osd pool default size`` set to ``1``, you will only have -one copy of the object. OSDs rely on other OSDs to tell them which objects -they should have. If a first OSD has a copy of an object and there is no -second copy, then no second OSD can tell the first OSD that it should have -that copy. For each placement group mapped to the first OSD (see -``ceph pg dump``), you can force the first OSD to notice the placement groups -it needs by running:: - - ceph osd force-create-pg <pgid> - - -CRUSH Map Errors ----------------- - -Another candidate for placement groups remaining unclean involves errors -in your CRUSH map. 
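To rule this out, you can extract the CRUSH map and test the rule used by the
affected pool (a sketch using the same tools shown later in this document for
erasure coded pools; adjust ``--rule`` and ``--num-rep`` to match your pool's
ruleset and size)::

    ceph osd getcrushmap > crush.map
    crushtool --decompile crush.map > crush.txt
    crushtool -i crush.map --test --show-bad-mappings --rule 0 --num-rep 3 --min-x 1 --max-x 1024

Bad mappings in the output indicate that the rule cannot find enough OSDs
with the current cluster topology.
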
- - -Stuck Placement Groups -====================== - -It is normal for placement groups to enter states like "degraded" or "peering" -following a failure. Normally these states indicate the normal progression -through the failure recovery process. However, if a placement group stays in one -of these states for a long time this may be an indication of a larger problem. -For this reason, the monitor will warn when placement groups get "stuck" in a -non-optimal state. Specifically, we check for: - -* ``inactive`` - The placement group has not been ``active`` for too long - (i.e., it hasn't been able to service read/write requests). - -* ``unclean`` - The placement group has not been ``clean`` for too long - (i.e., it hasn't been able to completely recover from a previous failure). - -* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, - indicating that all nodes storing this placement group may be ``down``. - -You can explicitly list stuck placement groups with one of:: - - ceph pg dump_stuck stale - ceph pg dump_stuck inactive - ceph pg dump_stuck unclean - -For stuck ``stale`` placement groups, it is normally a matter of getting the -right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement -groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For -stuck ``unclean`` placement groups, there is usually something preventing -recovery from completing, like unfound objects (see -:ref:`failures-osd-unfound`); - - - -.. _failures-osd-peering: - -Placement Group Down - Peering Failure -====================================== - -In certain cases, the ``ceph-osd`` `Peering` process can run into -problems, preventing a PG from becoming active and usable. For -example, ``ceph health`` might report:: - - ceph health detail - HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down - ... - pg 0.5 is down+peering - pg 1.4 is down+peering - ... - osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 - -We can query the cluster to determine exactly why the PG is marked ``down`` with:: - - ceph pg 0.5 query - -.. code-block:: javascript - - { "state": "down+peering", - ... - "recovery_state": [ - { "name": "Started\/Primary\/Peering\/GetInfo", - "enter_time": "2012-03-06 14:40:16.169679", - "requested_info_from": []}, - { "name": "Started\/Primary\/Peering", - "enter_time": "2012-03-06 14:40:16.169659", - "probing_osds": [ - 0, - 1], - "blocked": "peering is blocked due to down osds", - "down_osds_we_would_probe": [ - 1], - "peering_blocked_by": [ - { "osd": 1, - "current_lost_at": 0, - "comment": "starting or marking this osd lost may let us proceed"}]}, - { "name": "Started", - "enter_time": "2012-03-06 14:40:16.169513"} - ] - } - -The ``recovery_state`` section tells us that peering is blocked due to -down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` -and things will recover. - -Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk -failure), we can tell the cluster that it is ``lost`` and to cope as -best it can. - -.. important:: This is dangerous in that the cluster cannot - guarantee that the other copies of the data are consistent - and up to date. - -To instruct Ceph to continue anyway:: - - ceph osd lost 1 - -Recovery will proceed. - - -.. 
_failures-osd-unfound: - -Unfound Objects -=============== - -Under certain combinations of failures Ceph may complain about -``unfound`` objects:: - - ceph health detail - HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) - pg 2.4 is active+degraded, 78 unfound - -This means that the storage cluster knows that some objects (or newer -copies of existing objects) exist, but it hasn't found copies of them. -One example of how this might come about for a PG whose data is on ceph-osds -1 and 2: - -* 1 goes down -* 2 handles some writes, alone -* 1 comes up -* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. -* Before the new objects are copied, 2 goes down. - -Now 1 knows that these object exist, but there is no live ``ceph-osd`` who -has a copy. In this case, IO to those objects will block, and the -cluster will hope that the failed node comes back soon; this is -assumed to be preferable to returning an IO error to the user. - -First, you can identify which objects are unfound with:: - - ceph pg 2.4 list_missing [starting offset, in json] - -.. code-block:: javascript - - { "offset": { "oid": "", - "key": "", - "snapid": 0, - "hash": 0, - "max": 0}, - "num_missing": 0, - "num_unfound": 0, - "objects": [ - { "oid": "object 1", - "key": "", - "hash": 0, - "max": 0 }, - ... - ], - "more": 0} - -If there are too many objects to list in a single result, the ``more`` -field will be true and you can query for more. (Eventually the -command line tool will hide this from you, but not yet.) - -Second, you can identify which OSDs have been probed or might contain -data:: - - ceph pg 2.4 query - -.. code-block:: javascript - - "recovery_state": [ - { "name": "Started\/Primary\/Active", - "enter_time": "2012-03-06 15:15:46.713212", - "might_have_unfound": [ - { "osd": 1, - "status": "osd is down"}]}, - -In this case, for example, the cluster knows that ``osd.1`` might have -data, but it is ``down``. The full range of possible states include: - -* already probed -* querying -* OSD is down -* not queried (yet) - -Sometimes it simply takes some time for the cluster to query possible -locations. - -It is possible that there are other locations where the object can -exist that are not listed. For example, if a ceph-osd is stopped and -taken out of the cluster, the cluster fully recovers, and due to some -future set of failures ends up with an unfound object, it won't -consider the long-departed ceph-osd as a potential location to -consider. (This scenario, however, is unlikely.) - -If all possible locations have been queried and objects are still -lost, you may have to give up on the lost objects. This, again, is -possible given unusual combinations of failures that allow the cluster -to learn about writes that were performed before the writes themselves -are recovered. To mark the "unfound" objects as "lost":: - - ceph pg 2.5 mark_unfound_lost revert|delete - -This the final argument specifies how the cluster should deal with -lost objects. - -The "delete" option will forget about them entirely. - -The "revert" option (not available for erasure coded pools) will -either roll back to a previous version of the object or (if it was a -new object) forget about it entirely. Use this with caution, as it -may confuse applications that expected the object to exist. - - -Homeless Placement Groups -========================= - -It is possible for all OSDs that had copies of a given placement groups to fail. 
-If that's the case, that subset of the object store is unavailable, and the -monitor will receive no status updates for those placement groups. To detect -this situation, the monitor marks any placement group whose primary OSD has -failed as ``stale``. For example:: - - ceph health - HEALTH_WARN 24 pgs stale; 3/300 in osds are down - -You can identify which placement groups are ``stale``, and what the last OSDs to -store them were, with:: - - ceph health detail - HEALTH_WARN 24 pgs stale; 3/300 in osds are down - ... - pg 2.5 is stuck stale+active+remapped, last acting [2,0] - ... - osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 - osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 - osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 - -If we want to get placement group 2.5 back online, for example, this tells us that -it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` -daemons will allow the cluster to recover that placement group (and, presumably, -many others). - - -Only a Few OSDs Receive Data -============================ - -If you have many nodes in your cluster and only a few of them receive data, -`check`_ the number of placement groups in your pool. Since placement groups get -mapped to OSDs, a small number of placement groups will not distribute across -your cluster. Try creating a pool with a placement group count that is a -multiple of the number of OSDs. See `Placement Groups`_ for details. The default -placement group count for pools is not useful, but you can change it `here`_. - - -Can't Write Data -================ - -If your cluster is up, but some OSDs are down and you cannot write data, -check to ensure that you have the minimum number of OSDs running for the -placement group. If you don't have the minimum number of OSDs running, -Ceph will not allow you to write data because there is no guarantee -that Ceph can replicate your data. See ``osd pool default min size`` -in the `Pool, PG and CRUSH Config Reference`_ for details. - - -PGs Inconsistent -================ - -If you receive an ``active + clean + inconsistent`` state, this may happen -due to an error during scrubbing. As always, we can identify the inconsistent -placement group(s) with:: - - $ ceph health detail - HEALTH_ERR 1 pgs inconsistent; 2 scrub errors - pg 0.6 is active+clean+inconsistent, acting [0,1,2] - 2 scrub errors - -Or if you prefer inspecting the output in a programmatic way:: - - $ rados list-inconsistent-pg rbd - ["0.6"] - -There is only one consistent state, but in the worst case, we could have -different inconsistencies in multiple perspectives found in more than one -objects. If an object named ``foo`` in PG ``0.6`` is truncated, we will have:: - - $ rados list-inconsistent-obj 0.6 --format=json-pretty - -.. 
code-block:: javascript - - { - "epoch": 14, - "inconsistents": [ - { - "object": { - "name": "foo", - "nspace": "", - "locator": "", - "snap": "head", - "version": 1 - }, - "errors": [ - "data_digest_mismatch", - "size_mismatch" - ], - "union_shard_errors": [ - "data_digest_mismatch_oi", - "size_mismatch_oi" - ], - "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", - "shards": [ - { - "osd": 0, - "errors": [], - "size": 968, - "omap_digest": "0xffffffff", - "data_digest": "0xe978e67f" - }, - { - "osd": 1, - "errors": [], - "size": 968, - "omap_digest": "0xffffffff", - "data_digest": "0xe978e67f" - }, - { - "osd": 2, - "errors": [ - "data_digest_mismatch_oi", - "size_mismatch_oi" - ], - "size": 0, - "omap_digest": "0xffffffff", - "data_digest": "0xffffffff" - } - ] - } - ] - } - -In this case, we can learn from the output: - -* The only inconsistent object is named ``foo``, and it is its head that has - inconsistencies. -* The inconsistencies fall into two categories: - - * ``errors``: these errors indicate inconsistencies between shards without a - determination of which shard(s) are bad. Check for the ``errors`` in the - `shards` array, if available, to pinpoint the problem. - - * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is - different from the ones of OSD.0 and OSD.1 - * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while - the size reported by OSD.0 and OSD.1 is 968. - * ``union_shard_errors``: the union of all shard specific ``errors`` in - ``shards`` array. The ``errors`` are set for the given shard that has the - problem. They include errors like ``read_error``. The ``errors`` ending in - ``oi`` indicate a comparison with ``selected_object_info``. Look at the - ``shards`` array to determine which shard has which error(s). - - * ``data_digest_mismatch_oi``: the digest stored in the object-info is not - ``0xffffffff``, which is calculated from the shard read from OSD.2 - * ``size_mismatch_oi``: the size stored in the object-info is different - from the one read from OSD.2. The latter is 0. - -You can repair the inconsistent placement group by executing:: - - ceph pg repair {placement-group-ID} - -Which overwrites the `bad` copies with the `authoritative` ones. In most cases, -Ceph is able to choose authoritative copies from all available replicas using -some predefined criteria. But this does not always work. For example, the stored -data digest could be missing, and the calculated digest will be ignored when -choosing the authoritative copies. So, please use the above command with caution. - -If ``read_error`` is listed in the ``errors`` attribute of a shard, the -inconsistency is likely due to disk errors. You might want to check your disk -used by that OSD. - -If you receive ``active + clean + inconsistent`` states periodically due to -clock skew, you may consider configuring your `NTP`_ daemons on your -monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph -`Clock Settings`_ for additional details. - - -Erasure Coded PGs are not active+clean -====================================== - -When CRUSH fails to find enough OSDs to map to a PG, it will show as a -``2147483647`` which is ITEM_NONE or ``no OSD found``. For instance:: - - [2,1,6,0,5,8,2147483647,7,4] - -Not enough OSDs ---------------- - -If the Ceph cluster only has 8 OSDs and the erasure coded pool needs -9, that is what it will show. 
You can either create another erasure -coded pool that requires less OSDs:: - - ceph osd erasure-code-profile set myprofile k=5 m=3 - ceph osd pool create erasurepool 16 16 erasure myprofile - -or add a new OSDs and the PG will automatically use them. - -CRUSH constraints cannot be satisfied -------------------------------------- - -If the cluster has enough OSDs, it is possible that the CRUSH ruleset -imposes constraints that cannot be satisfied. If there are 10 OSDs on -two hosts and the CRUSH rulesets require that no two OSDs from the -same host are used in the same PG, the mapping may fail because only -two OSD will be found. You can check the constraint by displaying the -ruleset:: - - $ ceph osd crush rule ls - [ - "replicated_ruleset", - "erasurepool"] - $ ceph osd crush rule dump erasurepool - { "rule_id": 1, - "rule_name": "erasurepool", - "ruleset": 1, - "type": 3, - "min_size": 3, - "max_size": 20, - "steps": [ - { "op": "take", - "item": -1, - "item_name": "default"}, - { "op": "chooseleaf_indep", - "num": 0, - "type": "host"}, - { "op": "emit"}]} - - -You can resolve the problem by creating a new pool in which PGs are allowed -to have OSDs residing on the same host with:: - - ceph osd erasure-code-profile set myprofile crush-failure-domain=osd - ceph osd pool create erasurepool 16 16 erasure myprofile - -CRUSH gives up too soon ------------------------ - -If the Ceph cluster has just enough OSDs to map the PG (for instance a -cluster with a total of 9 OSDs and an erasure coded pool that requires -9 OSDs per PG), it is possible that CRUSH gives up before finding a -mapping. It can be resolved by: - -* lowering the erasure coded pool requirements to use less OSDs per PG - (that requires the creation of another pool as erasure code profiles - cannot be dynamically modified). - -* adding more OSDs to the cluster (that does not require the erasure - coded pool to be modified, it will become clean automatically) - -* use a hand made CRUSH ruleset that tries more times to find a good - mapping. It can be done by setting ``set_choose_tries`` to a value - greater than the default. - -You should first verify the problem with ``crushtool`` after -extracting the crushmap from the cluster so your experiments do not -modify the Ceph cluster and only work on a local files:: - - $ ceph osd crush rule dump erasurepool - { "rule_name": "erasurepool", - "ruleset": 1, - "type": 3, - "min_size": 3, - "max_size": 20, - "steps": [ - { "op": "take", - "item": -1, - "item_name": "default"}, - { "op": "chooseleaf_indep", - "num": 0, - "type": "host"}, - { "op": "emit"}]} - $ ceph osd getcrushmap > crush.map - got crush map from osdmap epoch 13 - $ crushtool -i crush.map --test --show-bad-mappings \ - --rule 1 \ - --num-rep 9 \ - --min-x 1 --max-x $((1024 * 1024)) - bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] - bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] - bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] - -Where ``--num-rep`` is the number of OSDs the erasure code crush -ruleset needs, ``--rule`` is the value of the ``ruleset`` field -displayed by ``ceph osd crush rule dump``. The test will try mapping -one million values (i.e. the range defined by ``[--min-x,--max-x]``) -and must display at least one bad mapping. If it outputs nothing it -means all mappings are successfull and you can stop right there: the -problem is elsewhere. 
- -The crush ruleset can be edited by decompiling the crush map:: - - $ crushtool --decompile crush.map > crush.txt - -and adding the following line to the ruleset:: - - step set_choose_tries 100 - -The relevant part of of the ``crush.txt`` file should look something -like:: - - rule erasurepool { - ruleset 1 - type erasure - min_size 3 - max_size 20 - step set_chooseleaf_tries 5 - step set_choose_tries 100 - step take default - step chooseleaf indep 0 type host - step emit - } - -It can then be compiled and tested again:: - - $ crushtool --compile crush.txt -o better-crush.map - -When all mappings succeed, an histogram of the number of tries that -were necessary to find all of them can be displayed with the -``--show-choose-tries`` option of ``crushtool``:: - - $ crushtool -i better-crush.map --test --show-bad-mappings \ - --show-choose-tries \ - --rule 1 \ - --num-rep 9 \ - --min-x 1 --max-x $((1024 * 1024)) - ... - 11: 42 - 12: 44 - 13: 54 - 14: 45 - 15: 35 - 16: 34 - 17: 30 - 18: 25 - 19: 19 - 20: 22 - 21: 20 - 22: 17 - 23: 13 - 24: 16 - 25: 13 - 26: 11 - 27: 11 - 28: 13 - 29: 11 - 30: 10 - 31: 6 - 32: 5 - 33: 10 - 34: 3 - 35: 7 - 36: 5 - 37: 2 - 38: 5 - 39: 5 - 40: 2 - 41: 5 - 42: 4 - 43: 1 - 44: 2 - 45: 2 - 46: 3 - 47: 1 - 48: 0 - ... - 102: 0 - 103: 1 - 104: 0 - ... - -It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped). - -.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups -.. _here: ../../configuration/pool-pg-config-ref -.. _Placement Groups: ../../operations/placement-groups -.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref -.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol -.. _The Network Time Protocol: http://www.ntp.org/ -.. _Clock Settings: ../../configuration/mon-config-ref/#clock - - |