Diffstat (limited to 'src/ceph/doc/rados')
70 files changed, 19407 insertions, 0 deletions
diff --git a/src/ceph/doc/rados/api/index.rst b/src/ceph/doc/rados/api/index.rst new file mode 100644 index 0000000..cccc153 --- /dev/null +++ b/src/ceph/doc/rados/api/index.rst @@ -0,0 +1,22 @@ +=========================== + Ceph Storage Cluster APIs +=========================== + +The :term:`Ceph Storage Cluster` has a messaging layer protocol that enables +clients to interact with a :term:`Ceph Monitor` and a :term:`Ceph OSD Daemon`. +``librados`` provides this functionality to :term:`Ceph Clients` in the form of +a library. All Ceph Clients either use ``librados`` or the same functionality +encapsulated in ``librados`` to interact with the object store. For example, +``librbd`` and ``libcephfs`` leverage this functionality. You may use +``librados`` to interact with Ceph directly (e.g., an application that talks to +Ceph, your own interface to Ceph, etc.). + + +.. toctree:: + :maxdepth: 2 + + Introduction to librados <librados-intro> + librados (C) <librados> + librados (C++) <libradospp> + librados (Python) <python> + object class <objclass-sdk> diff --git a/src/ceph/doc/rados/api/librados-intro.rst b/src/ceph/doc/rados/api/librados-intro.rst new file mode 100644 index 0000000..8405f6e --- /dev/null +++ b/src/ceph/doc/rados/api/librados-intro.rst @@ -0,0 +1,1003 @@ +========================== + Introduction to librados +========================== + +The :term:`Ceph Storage Cluster` provides the basic storage service that allows +:term:`Ceph` to uniquely deliver **object, block, and file storage** in one +unified system. However, you are not limited to using the RESTful, block, or +POSIX interfaces. Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object +Store)`, the ``librados`` API enables you to create your own interface to the +Ceph Storage Cluster. + +The ``librados`` API enables you to interact with the two types of daemons in +the Ceph Storage Cluster: + +- The :term:`Ceph Monitor`, which maintains a master copy of the cluster map. +- The :term:`Ceph OSD Daemon` (OSD), which stores data as objects on a storage node. + +.. ditaa:: + +---------------------------------+ + | Ceph Storage Cluster Protocol | + | (librados) | + +---------------------------------+ + +---------------+ +---------------+ + | OSDs | | Monitors | + +---------------+ +---------------+ + +This guide provides a high-level introduction to using ``librados``. +Refer to :doc:`../../architecture` for additional details of the Ceph +Storage Cluster. To use the API, you need a running Ceph Storage Cluster. +See `Installation (Quick)`_ for details. + + +Step 1: Getting librados +======================== + +Your client application must bind with ``librados`` to connect to the Ceph +Storage Cluster. You must install ``librados`` and any required packages to +write applications that use ``librados``. The ``librados`` API is written in +C++, with additional bindings for C, Python, Java and PHP. + + +Getting librados for C/C++ +-------------------------- + +To install ``librados`` development support files for C/C++ on Debian/Ubuntu +distributions, execute the following:: + + sudo apt-get install librados-dev + +To install ``librados`` development support files for C/C++ on RHEL/CentOS +distributions, execute the following:: + + sudo yum install librados2-devel + +Once you install ``librados`` for developers, you can find the required +headers for C/C++ under ``/usr/include/rados``. 
:: + + ls /usr/include/rados + + +Getting librados for Python +--------------------------- + +The ``rados`` module provides ``librados`` support to Python +applications. The ``librados-dev`` package for Debian/Ubuntu +and the ``librados2-devel`` package for RHEL/CentOS will install the +``python-rados`` package for you. You may install ``python-rados`` +directly too. + +To install ``librados`` development support files for Python on Debian/Ubuntu +distributions, execute the following:: + + sudo apt-get install python-rados + +To install ``librados`` development support files for Python on RHEL/CentOS +distributions, execute the following:: + + sudo yum install python-rados + +You can find the module under ``/usr/share/pyshared`` on Debian systems, +or under ``/usr/lib/python*/site-packages`` on CentOS/RHEL systems. + + +Getting librados for Java +------------------------- + +To install ``librados`` for Java, you need to execute the following procedure: + +#. Install ``jna.jar``. For Debian/Ubuntu, execute:: + + sudo apt-get install libjna-java + + For CentOS/RHEL, execute:: + + sudo yum install jna + + The JAR files are located in ``/usr/share/java``. + +#. Clone the ``rados-java`` repository:: + + git clone --recursive https://github.com/ceph/rados-java.git + +#. Build the ``rados-java`` repository:: + + cd rados-java + ant + + The JAR file is located under ``rados-java/target``. + +#. Copy the JAR for RADOS to a common location (e.g., ``/usr/share/java``) and + ensure that it and the JNA JAR are in your JVM's classpath. For example:: + + sudo cp target/rados-0.1.3.jar /usr/share/java/rados-0.1.3.jar + sudo ln -s /usr/share/java/jna-3.2.7.jar /usr/lib/jvm/default-java/jre/lib/ext/jna-3.2.7.jar + sudo ln -s /usr/share/java/rados-0.1.3.jar /usr/lib/jvm/default-java/jre/lib/ext/rados-0.1.3.jar + +To build the documentation, execute the following:: + + ant docs + + +Getting librados for PHP +------------------------- + +To install the ``librados`` extension for PHP, you need to execute the following procedure: + +#. Install php-dev. For Debian/Ubuntu, execute:: + + sudo apt-get install php5-dev build-essential + + For CentOS/RHEL, execute:: + + sudo yum install php-devel + +#. Clone the ``phprados`` repository:: + + git clone https://github.com/ceph/phprados.git + +#. Build ``phprados``:: + + cd phprados + phpize + ./configure + make + sudo make install + +#. Enable ``phprados`` in php.ini by adding:: + + extension=rados.so + + +Step 2: Configuring a Cluster Handle +==================================== + +A :term:`Ceph Client`, via ``librados``, interacts directly with OSDs to store +and retrieve data. To interact with OSDs, the client app must invoke +``librados`` and connect to a Ceph Monitor. Once connected, ``librados`` +retrieves the :term:`Cluster Map` from the Ceph Monitor. When the client app +wants to read or write data, it creates an I/O context and binds to a +:term:`pool`. The pool has an associated :term:`ruleset` that defines how it +will place data in the storage cluster. Via the I/O context, the client +provides the object name to ``librados``, which takes the object name +and the cluster map (i.e., the topology of the cluster) and `computes`_ the +placement group and `OSD`_ for locating the data. Then the client application +can read or write data. The client app doesn't need to learn about the topology +of the cluster directly. + +.. 
ditaa:: + +--------+ Retrieves +---------------+ + | Client |------------>| Cluster Map | + +--------+ +---------------+ + | + v Writes + /-----\ + | obj | + \-----/ + | To + v + +--------+ +---------------+ + | Pool |---------->| CRUSH Ruleset | + +--------+ Selects +---------------+ + + +The Ceph Storage Cluster handle encapsulates the client configuration, including: + +- The `user ID`_ for ``rados_create()`` or user name for ``rados_create2()`` + (preferred). +- The :term:`cephx` authentication key +- The monitor ID and IP address +- Logging levels +- Debugging levels + +Thus, the first steps in using the cluster from your app are to 1) create +a cluster handle that your app will use to connect to the storage cluster, +and then 2) use that handle to connect. To connect to the cluster, the +app must supply a monitor address, a username and an authentication key +(cephx is enabled by default). + +.. tip:: Talking to different Ceph Storage Clusters – or to the same cluster + with different users – requires different cluster handles. + +RADOS provides a number of ways for you to set the required values. For +the monitor and encryption key settings, an easy way to handle them is to ensure +that your Ceph configuration file contains a ``keyring`` path to a keyring file +and at least one monitor address (e.g,. ``mon host``). For example:: + + [global] + mon host = 192.168.1.1 + keyring = /etc/ceph/ceph.client.admin.keyring + +Once you create the handle, you can read a Ceph configuration file to configure +the handle. You can also pass arguments to your app and parse them with the +function for parsing command line arguments (e.g., ``rados_conf_parse_argv()``), +or parse Ceph environment variables (e.g., ``rados_conf_parse_env()``). Some +wrappers may not implement convenience methods, so you may need to implement +these capabilities. The following diagram provides a high-level flow for the +initial connection. + + +.. ditaa:: +---------+ +---------+ + | Client | | Monitor | + +---------+ +---------+ + | | + |-----+ create | + | | cluster | + |<----+ handle | + | | + |-----+ read | + | | config | + |<----+ file | + | | + | connect | + |-------------->| + | | + |<--------------| + | connected | + | | + + +Once connected, your app can invoke functions that affect the whole cluster +with only the cluster handle. For example, once you have a cluster +handle, you can: + +- Get cluster statistics +- Use Pool Operation (exists, create, list, delete) +- Get and set the configuration + + +One of the powerful features of Ceph is the ability to bind to different pools. +Each pool may have a different number of placement groups, object replicas and +replication strategies. For example, a pool could be set up as a "hot" pool that +uses SSDs for frequently used objects or a "cold" pool that uses erasure coding. + +The main difference in the various ``librados`` bindings is between C and +the object-oriented bindings for C++, Java and Python. The object-oriented +bindings use objects to represent cluster handles, IO Contexts, iterators, +exceptions, etc. + + +C Example +--------- + +For C, creating a simple cluster handle using the ``admin`` user, configuring +it and connecting to the cluster might look something like this: + +.. code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + + /* Declare the cluster handle and required arguments. 
*/ + rados_t cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags; + + /* Initialize the cluster handle with the "ceph" cluster name and the "client.admin" user */ + int err; + err = rados_create2(&cluster, cluster_name, user_name, flags); + + if (err < 0) { + fprintf(stderr, "%s: Couldn't create the cluster handle! %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nCreated a cluster handle.\n"); + } + + + /* Read a Ceph configuration file to configure the cluster handle. */ + err = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the config file.\n"); + } + + /* Read command line arguments */ + err = rados_conf_parse_argv(cluster, argc, argv); + if (err < 0) { + fprintf(stderr, "%s: cannot parse command line arguments: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nRead the command line arguments.\n"); + } + + /* Connect to the cluster */ + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(EXIT_FAILURE); + } else { + printf("\nConnected to the cluster.\n"); + } + + } + +Compile your client and link to ``librados`` using ``-lrados``. For example:: + + gcc ceph-client.c -lrados -o ceph-client + + +C++ Example +----------- + +The Ceph project provides a C++ example in the ``ceph/examples/librados`` +directory. For C++, a simple cluster handle using the ``admin`` user requires +you to initialize a ``librados::Rados`` cluster handle object: + +.. code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + int ret = 0; + + /* Declare the cluster handle and required variables. */ + librados::Rados cluster; + char cluster_name[] = "ceph"; + char user_name[] = "client.admin"; + uint64_t flags = 0; + + /* Initialize the cluster handle with the "ceph" cluster name and "client.admin" user */ + { + ret = cluster.init2(user_name, cluster_name, flags); + if (ret < 0) { + std::cerr << "Couldn't initialize the cluster handle! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Created a cluster handle." << std::endl; + } + } + + /* Read a Ceph configuration file to configure the cluster handle. */ + { + ret = cluster.conf_read_file("/etc/ceph/ceph.conf"); + if (ret < 0) { + std::cerr << "Couldn't read the Ceph configuration file! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Read the Ceph configuration file." << std::endl; + } + } + + /* Read command line arguments */ + { + ret = cluster.conf_parse_argv(argc, argv); + if (ret < 0) { + std::cerr << "Couldn't parse command line options! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Parsed command line options." << std::endl; + } + } + + /* Connect to the cluster */ + { + ret = cluster.connect(); + if (ret < 0) { + std::cerr << "Couldn't connect to cluster! error " << ret << std::endl; + return EXIT_FAILURE; + } else { + std::cout << "Connected to the cluster." << std::endl; + } + } + + return 0; + } + + +Compile the source; then, link ``librados`` using ``-lrados``. 
+For example:: + + g++ -g -c ceph-client.cc -o ceph-client.o + g++ -g ceph-client.o -lrados -o ceph-client + + + +Python Example +-------------- + +Python uses the ``admin`` id and the ``ceph`` cluster name by default, and +will read the standard ``ceph.conf`` file if the conffile parameter is +set to the empty string. The Python binding converts C++ errors +into exceptions. + + +.. code-block:: python + + import rados + + try: + cluster = rados.Rados(conffile='') + except TypeError as e: + print 'Argument validation error: ', e + raise e + + print "Created cluster handle." + + try: + cluster.connect() + except Exception as e: + print "connection error: ", e + raise e + finally: + print "Connected to the cluster." + + +Execute the example to verify that it connects to your cluster. :: + + python ceph-client.py + + +Java Example +------------ + +Java requires you to specify the user ID (``admin``) or user name +(``client.admin``), and uses the ``ceph`` cluster name by default . The Java +binding converts C++-based errors into exceptions. + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +Compile the source; then, run it. If you have copied the JAR to +``/usr/share/java`` and sym linked from your ``ext`` directory, you won't need +to specify the classpath. For example:: + + javac CephClient.java + java CephClient + + +PHP Example +------------ + +With the RADOS extension enabled in PHP you can start creating a new cluster handle very easily: + +.. code-block:: php + + <?php + + $r = rados_create(); + rados_conf_read_file($r, '/etc/ceph/ceph.conf'); + if (!rados_connect($r)) { + echo "Failed to connect to Ceph cluster"; + } else { + echo "Successfully connected to Ceph cluster"; + } + + +Save this as rados.php and run the code:: + + php rados.php + + +Step 3: Creating an I/O Context +=============================== + +Once your app has a cluster handle and a connection to a Ceph Storage Cluster, +you may create an I/O Context and begin reading and writing data. An I/O Context +binds the connection to a specific pool. The user must have appropriate +`CAPS`_ permissions to access the specified pool. For example, a user with read +access but not write access will only be able to read data. I/O Context +functionality includes: + +- Write/read data and extended attributes +- List and iterate over objects and extended attributes +- Snapshot pools, list snapshots, etc. + + +.. 
ditaa:: +---------+ +---------+ +---------+ + | Client | | Monitor | | OSD | + +---------+ +---------+ +---------+ + | | | + |-----+ create | | + | | I/O | | + |<----+ context | | + | | | + | write data | | + |---------------+-------------->| + | | | + | write ack | | + |<--------------+---------------| + | | | + | write xattr | | + |---------------+-------------->| + | | | + | xattr ack | | + |<--------------+---------------| + | | | + | read data | | + |---------------+-------------->| + | | | + | read ack | | + |<--------------+---------------| + | | | + | remove data | | + |---------------+-------------->| + | | | + | remove ack | | + |<--------------+---------------| + + + +RADOS enables you to interact both synchronously and asynchronously. Once your +app has an I/O Context, read/write operations only require you to know the +object/xattr name. The CRUSH algorithm encapsulated in ``librados`` uses the +cluster map to identify the appropriate OSD. OSD daemons handle the replication, +as described in `Smart Daemons Enable Hyperscale`_. The ``librados`` library also +maps objects to placement groups, as described in `Calculating PG IDs`_. + +The following examples use the default ``data`` pool. However, you may also +use the API to list pools, ensure they exist, or create and delete pools. For +the write operations, the examples illustrate how to use synchronous mode. For +the read operations, the examples illustrate how to use asynchronous mode. + +.. important:: Use caution when deleting pools with this API. If you delete + a pool, the pool and ALL DATA in the pool will be lost. + + +C Example +--------- + + +.. code-block:: c + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <rados/librados.h> + + int main (int argc, const char **argv) + { + /* + * Continued from previous C example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + rados_ioctx_t io; + char *poolname = "data"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(EXIT_FAILURE); + } else { + printf("\nCreated I/O context.\n"); + } + + /* Write data to the cluster synchronously. */ + err = rados_write(io, "hw", "Hello World!", 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot write object \"hw\" to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"Hello World\" to object \"hw\".\n"); + } + + char xattr[] = "en_US"; + err = rados_setxattr(io, "hw", "lang", xattr, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot write xattr to pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nWrote \"en_US\" to xattr \"lang\" for object \"hw\".\n"); + } + + /* + * Read data from the cluster asynchronously. + * First, set up asynchronous I/O completion. + */ + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: Could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nCreated AIO completion.\n"); + } + + /* Next, read data using rados_aio_read. 
*/ + char read_res[100]; + err = rados_aio_read(io, "hw", comp, read_res, 12, 0); + if (err < 0) { + fprintf(stderr, "%s: Cannot read object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead object \"hw\". The contents are:\n %s \n", read_res); + } + + /* Wait for the operation to complete */ + rados_aio_wait_for_complete(comp); + + /* Release the asynchronous I/O complete handle to avoid memory leaks. */ + rados_aio_release(comp); + + + char xattr_res[100]; + err = rados_getxattr(io, "hw", "lang", xattr_res, 5); + if (err < 0) { + fprintf(stderr, "%s: Cannot read xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRead xattr \"lang\" for object \"hw\". The contents are:\n %s \n", xattr_res); + } + + err = rados_rmxattr(io, "hw", "lang"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove xattr. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved xattr \"lang\" for object \"hw\".\n"); + } + + err = rados_remove(io, "hw"); + if (err < 0) { + fprintf(stderr, "%s: Cannot remove object. %s %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } else { + printf("\nRemoved object \"hw\".\n"); + } + + } + + + +C++ Example +----------- + + +.. code-block:: c++ + + #include <iostream> + #include <string> + #include <rados/librados.hpp> + + int main(int argc, const char **argv) + { + + /* Continued from previous C++ example, where cluster handle and + * connection are established. First declare an I/O Context. + */ + + librados::IoCtx io_ctx; + const char *pool_name = "data"; + + { + ret = cluster.ioctx_create(pool_name, io_ctx); + if (ret < 0) { + std::cerr << "Couldn't set up ioctx! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Created an ioctx for the pool." << std::endl; + } + } + + + /* Write an object synchronously. */ + { + librados::bufferlist bl; + bl.append("Hello World!"); + ret = io_ctx.write_full("hw", bl); + if (ret < 0) { + std::cerr << "Couldn't write object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Wrote new object 'hw' " << std::endl; + } + } + + + /* + * Add an xattr to the object. + */ + { + librados::bufferlist lang_bl; + lang_bl.append("en_US"); + ret = io_ctx.setxattr("hw", "lang", lang_bl); + if (ret < 0) { + std::cerr << "failed to set xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Set the xattr 'lang' on our object!" << std::endl; + } + } + + + /* + * Read the object back asynchronously. + */ + { + librados::bufferlist read_buf; + int read_len = 4194304; + + //Create I/O Completion. + librados::AioCompletion *read_completion = librados::Rados::aio_create_completion(); + + //Send read request. + ret = io_ctx.aio_read("hw", read_completion, &read_buf, read_len, 0); + if (ret < 0) { + std::cerr << "Couldn't start read object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } + + // Wait for the request to complete, and check that it succeeded. + read_completion->wait_for_complete(); + ret = read_completion->get_return_value(); + if (ret < 0) { + std::cerr << "Couldn't read object! 
error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Read object hw asynchronously with contents.\n" + << read_buf.c_str() << std::endl; + } + } + + + /* + * Read the xattr. + */ + { + librados::bufferlist lang_res; + ret = io_ctx.getxattr("hw", "lang", lang_res); + if (ret < 0) { + std::cerr << "failed to get xattr version entry! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Got the xattr 'lang' from object hw!" + << lang_res.c_str() << std::endl; + } + } + + + /* + * Remove the xattr. + */ + { + ret = io_ctx.rmxattr("hw", "lang"); + if (ret < 0) { + std::cerr << "Failed to remove xattr! error " + << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed the xattr 'lang' from our object!" << std::endl; + } + } + + /* + * Remove the object. + */ + { + ret = io_ctx.remove("hw"); + if (ret < 0) { + std::cerr << "Couldn't remove object! error " << ret << std::endl; + exit(EXIT_FAILURE); + } else { + std::cout << "Removed object 'hw'." << std::endl; + } + } + } + + + +Python Example +-------------- + +.. code-block:: python + + print "\n\nI/O Context and Object Operations" + print "=================================" + + print "\nCreating a context for the 'data' pool" + if not cluster.pool_exists('data'): + raise RuntimeError('No data pool exists') + ioctx = cluster.open_ioctx('data') + + print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." + ioctx.write("hw", "Hello World!") + print "Writing XATTR 'lang' with value 'en_US' to object 'hw'" + ioctx.set_xattr("hw", "lang", "en_US") + + + print "\nWriting object 'bm' with contents 'Bonjour tout le monde!' to pool 'data'." + ioctx.write("bm", "Bonjour tout le monde!") + print "Writing XATTR 'lang' with value 'fr_FR' to object 'bm'" + ioctx.set_xattr("bm", "lang", "fr_FR") + + print "\nContents of object 'hw'\n------------------------" + print ioctx.read("hw") + + print "\n\nGetting XATTR 'lang' from object 'hw'" + print ioctx.get_xattr("hw", "lang") + + print "\nContents of object 'bm'\n------------------------" + print ioctx.read("bm") + + print "Getting XATTR 'lang' from object 'bm'" + print ioctx.get_xattr("bm", "lang") + + + print "\nRemoving object 'hw'" + ioctx.remove_object("hw") + + print "Removing object 'bm'" + ioctx.remove_object("bm") + + +Java-Example +------------ + +.. code-block:: java + + import com.ceph.rados.Rados; + import com.ceph.rados.RadosException; + + import java.io.File; + import com.ceph.rados.IoCTX; + + public class CephClient { + public static void main (String args[]){ + + try { + Rados cluster = new Rados("admin"); + System.out.println("Created cluster handle."); + + File f = new File("/etc/ceph/ceph.conf"); + cluster.confReadFile(f); + System.out.println("Read the configuration file."); + + cluster.connect(); + System.out.println("Connected to the cluster."); + + IoCTX io = cluster.ioCtxCreate("data"); + + String oidone = "hw"; + String contentone = "Hello World!"; + io.write(oidone, contentone); + + String oidtwo = "bm"; + String contenttwo = "Bonjour tout le monde!"; + io.write(oidtwo, contenttwo); + + String[] objects = io.listObjects(); + for (String object: objects) + System.out.println(object); + + io.remove(oidone); + io.remove(oidtwo); + + cluster.ioCtxDestroy(io); + + } catch (RadosException e) { + System.out.println(e.getMessage() + ": " + e.getReturnValue()); + } + } + } + + +PHP Example +----------- + +.. 
code-block:: php + + <?php + + $io = rados_ioctx_create($r, "mypool"); + rados_write_full($io, "oidOne", "mycontents"); + rados_remove("oidOne"); + rados_ioctx_destroy($io); + + +Step 4: Closing Sessions +======================== + +Once your app finishes with the I/O Context and cluster handle, the app should +close the connection and shutdown the handle. For asynchronous I/O, the app +should also ensure that pending asynchronous operations have completed. + + +C Example +--------- + +.. code-block:: c + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +C++ Example +----------- + +.. code-block:: c++ + + io_ctx.close(); + cluster.shutdown(); + + +Java Example +-------------- + +.. code-block:: java + + cluster.ioCtxDestroy(io); + cluster.shutDown(); + + +Python Example +-------------- + +.. code-block:: python + + print "\nClosing the connection." + ioctx.close() + + print "Shutting down the handle." + cluster.shutdown() + +PHP Example +----------- + +.. code-block:: php + + rados_shutdown($r); + + + +.. _user ID: ../../operations/user-management#command-line-usage +.. _CAPS: ../../operations/user-management#authorization-capabilities +.. _Installation (Quick): ../../../start +.. _Smart Daemons Enable Hyperscale: ../../../architecture#smart-daemons-enable-hyperscale +.. _Calculating PG IDs: ../../../architecture#calculating-pg-ids +.. _computes: ../../../architecture#calculating-pg-ids +.. _OSD: ../../../architecture#mapping-pgs-to-osds diff --git a/src/ceph/doc/rados/api/librados.rst b/src/ceph/doc/rados/api/librados.rst new file mode 100644 index 0000000..73d0e42 --- /dev/null +++ b/src/ceph/doc/rados/api/librados.rst @@ -0,0 +1,187 @@ +============== + Librados (C) +============== + +.. highlight:: c + +`librados` provides low-level access to the RADOS service. For an +overview of RADOS, see :doc:`../../architecture`. + + +Example: connecting and writing an object +========================================= + +To use `Librados`, you instantiate a :c:type:`rados_t` variable (a cluster handle) and +call :c:func:`rados_create()` with a pointer to it:: + + int err; + rados_t cluster; + + err = rados_create(&cluster, NULL); + if (err < 0) { + fprintf(stderr, "%s: cannot create a cluster handle: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you configure your :c:type:`rados_t` to connect to your cluster, +either by setting individual values (:c:func:`rados_conf_set()`), +using a configuration file (:c:func:`rados_conf_read_file()`), using +command line options (:c:func:`rados_conf_parse_argv`), or an +environment variable (:c:func:`rados_conf_parse_env()`):: + + err = rados_conf_read_file(cluster, "/path/to/myceph.conf"); + if (err < 0) { + fprintf(stderr, "%s: cannot read config file: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Once the cluster handle is configured, you can connect to the cluster with :c:func:`rados_connect()`:: + + err = rados_connect(cluster); + if (err < 0) { + fprintf(stderr, "%s: cannot connect to cluster: %s\n", argv[0], strerror(-err)); + exit(1); + } + +Then you open an "IO context", a :c:type:`rados_ioctx_t`, with :c:func:`rados_ioctx_create()`:: + + rados_ioctx_t io; + char *poolname = "mypool"; + + err = rados_ioctx_create(cluster, poolname, &io); + if (err < 0) { + fprintf(stderr, "%s: cannot open rados pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_shutdown(cluster); + exit(1); + } + +Note that the pool you try to access must exist. 
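If you are not sure the pool is there, one possible guard (a minimal sketch,
not part of the original example) is to look the pool up with
:c:func:`rados_pool_lookup()` and create it with :c:func:`rados_pool_create()`
before opening the IO context; include ``<errno.h>`` for ``ENOENT``::

    int64_t pool_id = rados_pool_lookup(cluster, poolname);
    if (pool_id == -ENOENT) {
        /* The pool does not exist yet, so create it. */
        err = rados_pool_create(cluster, poolname);
        if (err < 0) {
            fprintf(stderr, "%s: cannot create pool %s: %s\n", argv[0], poolname, strerror(-err));
            rados_shutdown(cluster);
            exit(1);
        }
    } else if (pool_id < 0) {
        /* Some other error occurred while looking up the pool. */
        fprintf(stderr, "%s: cannot look up pool %s: %s\n", argv[0], poolname, strerror((int)-pool_id));
        rados_shutdown(cluster);
        exit(1);
    }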
+ +Then you can use the RADOS data manipulation functions, for example +write into an object called ``greeting`` with +:c:func:`rados_write_full()`:: + + err = rados_write_full(io, "greeting", "hello", 5); + if (err < 0) { + fprintf(stderr, "%s: cannot write pool %s: %s\n", argv[0], poolname, strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +In the end, you will want to close your IO context and connection to RADOS with :c:func:`rados_ioctx_destroy()` and :c:func:`rados_shutdown()`:: + + rados_ioctx_destroy(io); + rados_shutdown(cluster); + + +Asychronous IO +============== + +When doing lots of IO, you often don't need to wait for one operation +to complete before starting the next one. `Librados` provides +asynchronous versions of several operations: + +* :c:func:`rados_aio_write` +* :c:func:`rados_aio_append` +* :c:func:`rados_aio_write_full` +* :c:func:`rados_aio_read` + +For each operation, you must first create a +:c:type:`rados_completion_t` that represents what to do when the +operation is safe or complete by calling +:c:func:`rados_aio_create_completion`. If you don't need anything +special to happen, you can pass NULL:: + + rados_completion_t comp; + err = rados_aio_create_completion(NULL, NULL, NULL, &comp); + if (err < 0) { + fprintf(stderr, "%s: could not create aio completion: %s\n", argv[0], strerror(-err)); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + +Now you can call any of the aio operations, and wait for it to +be in memory or on disk on all replicas:: + + err = rados_aio_write(io, "foo", comp, "bar", 3, 0); + if (err < 0) { + fprintf(stderr, "%s: could not schedule aio write: %s\n", argv[0], strerror(-err)); + rados_aio_release(comp); + rados_ioctx_destroy(io); + rados_shutdown(cluster); + exit(1); + } + rados_aio_wait_for_complete(comp); // in memory + rados_aio_wait_for_safe(comp); // on disk + +Finally, we need to free the memory used by the completion with :c:func:`rados_aio_release`:: + + rados_aio_release(comp); + +You can use the callbacks to tell your application when writes are +durable, or when read buffers are full. 
For example, if you wanted to +measure the latency of each operation when appending to several +objects, you could schedule several writes and store the ack and +commit time in the corresponding callback, then wait for all of them +to complete using :c:func:`rados_aio_flush` before analyzing the +latencies:: + + typedef struct { + struct timeval start; + struct timeval ack_end; + struct timeval commit_end; + } req_duration; + + void ack_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->ack_end, NULL); + } + + void commit_callback(rados_completion_t comp, void *arg) { + req_duration *dur = (req_duration *) arg; + gettimeofday(&dur->commit_end, NULL); + } + + int output_append_latency(rados_ioctx_t io, const char *data, size_t len, size_t num_writes) { + req_duration times[num_writes]; + rados_completion_t comps[num_writes]; + for (size_t i = 0; i < num_writes; ++i) { + gettimeofday(×[i].start, NULL); + int err = rados_aio_create_completion((void*) ×[i], ack_callback, commit_callback, &comps[i]); + if (err < 0) { + fprintf(stderr, "Error creating rados completion: %s\n", strerror(-err)); + return err; + } + char obj_name[100]; + snprintf(obj_name, sizeof(obj_name), "foo%ld", (unsigned long)i); + err = rados_aio_append(io, obj_name, comps[i], data, len); + if (err < 0) { + fprintf(stderr, "Error from rados_aio_append: %s", strerror(-err)); + return err; + } + } + // wait until all requests finish *and* the callbacks complete + rados_aio_flush(io); + // the latencies can now be analyzed + printf("Request # | Ack latency (s) | Commit latency (s)\n"); + for (size_t i = 0; i < num_writes; ++i) { + // don't forget to free the completions + rados_aio_release(comps[i]); + struct timeval ack_lat, commit_lat; + timersub(×[i].ack_end, ×[i].start, &ack_lat); + timersub(×[i].commit_end, ×[i].start, &commit_lat); + printf("%9ld | %8ld.%06ld | %10ld.%06ld\n", (unsigned long) i, ack_lat.tv_sec, ack_lat.tv_usec, commit_lat.tv_sec, commit_lat.tv_usec); + } + return 0; + } + +Note that all the :c:type:`rados_completion_t` must be freed with :c:func:`rados_aio_release` to avoid leaking memory. + + +API calls +========= + + .. autodoxygenfile:: rados_types.h + .. autodoxygenfile:: librados.h diff --git a/src/ceph/doc/rados/api/libradospp.rst b/src/ceph/doc/rados/api/libradospp.rst new file mode 100644 index 0000000..27d3fa7 --- /dev/null +++ b/src/ceph/doc/rados/api/libradospp.rst @@ -0,0 +1,5 @@ +================== + LibradosPP (C++) +================== + +.. todo:: write me! diff --git a/src/ceph/doc/rados/api/objclass-sdk.rst b/src/ceph/doc/rados/api/objclass-sdk.rst new file mode 100644 index 0000000..6b1162f --- /dev/null +++ b/src/ceph/doc/rados/api/objclass-sdk.rst @@ -0,0 +1,37 @@ +=========================== +SDK for Ceph Object Classes +=========================== + +`Ceph` can be extended by creating shared object classes called `Ceph Object +Classes`. The existing framework to build these object classes has dependencies +on the internal functionality of `Ceph`, which restricts users to build object +classes within the tree. The aim of this project is to create an independent +object class interface, which can be used to build object classes outside the +`Ceph` tree. 
This allows us to have two types of object classes, 1) those that +have in-tree dependencies and reside in the tree and 2) those that can make use +of the `Ceph Object Class SDK framework` and can be built outside of the `Ceph` +tree because they do not depend on any internal implementation of `Ceph`. This +project decouples object class development from Ceph and encourages creation +and distribution of object classes as packages. + +In order to demonstrate the use of this framework, we have provided an example +called ``cls_sdk``, which is a very simple object class that makes use of the +SDK framework. This object class resides in the ``src/cls`` directory. + +Installing objclass.h +--------------------- + +The object class interface that enables out-of-tree development of object +classes resides in ``src/include/rados/`` and gets installed with `Ceph` +installation. After running ``make install``, you should be able to see it +in ``<prefix>/include/rados``. :: + + ls /usr/local/include/rados + +Using the SDK example +--------------------- + +The ``cls_sdk`` object class resides in ``src/cls/sdk/``. This gets built and +loaded into Ceph, with the Ceph build process. You can run the +``ceph_test_cls_sdk`` unittest, which resides in ``src/test/cls_sdk/``, +to test this class. diff --git a/src/ceph/doc/rados/api/python.rst b/src/ceph/doc/rados/api/python.rst new file mode 100644 index 0000000..b4fd7e0 --- /dev/null +++ b/src/ceph/doc/rados/api/python.rst @@ -0,0 +1,397 @@ +=================== + Librados (Python) +=================== + +The ``rados`` module is a thin Python wrapper for ``librados``. + +Installation +============ + +To install Python libraries for Ceph, see `Getting librados for Python`_. + + +Getting Started +=============== + +You can create your own Ceph client using Python. The following tutorial will +show you how to import the Ceph Python module, connect to a Ceph cluster, and +perform object operations as a ``client.admin`` user. + +.. note:: To use the Ceph Python bindings, you must have access to a + running Ceph cluster. To set one up quickly, see `Getting Started`_. + +First, create a Python source file for your Ceph client. :: + :linenos: + + sudo vim client.py + + +Import the Module +----------------- + +To use the ``rados`` module, import it into your source file. + +.. code-block:: python + :linenos: + + import rados + + +Configure a Cluster Handle +-------------------------- + +Before connecting to the Ceph Storage Cluster, create a cluster handle. By +default, the cluster handle assumes a cluster named ``ceph`` (i.e., the default +for deployment tools, and our Getting Started guides too), and a +``client.admin`` user name. You may change these defaults to suit your needs. + +To connect to the Ceph Storage Cluster, your application needs to know where to +find the Ceph Monitor. Provide this information to your application by +specifying the path to your Ceph configuration file, which contains the location +of the initial Ceph monitors. + +.. code-block:: python + :linenos: + + import rados, sys + + #Create Handle Examples. + cluster = rados.Rados(conffile='ceph.conf') + cluster = rados.Rados(conffile=sys.argv[1]) + cluster = rados.Rados(conffile = 'ceph.conf', conf = dict (keyring = '/path/to/keyring')) + +Ensure that the ``conffile`` argument provides the path and file name of your +Ceph configuration file. You may use the ``sys`` module to avoid hard-coding the +Ceph configuration path and file name. + +Your Python client also requires a client keyring. 
For this example, we use the +``client.admin`` key by default. If you would like to specify the keyring when +creating the cluster handle, you may use the ``conf`` argument. Alternatively, +you may specify the keyring path in your Ceph configuration file. For example, +you may add something like the following line to you Ceph configuration file:: + + keyring = /path/to/ceph.client.admin.keyring + +For additional details on modifying your configuration via Python, see `Configuration`_. + + +Connect to the Cluster +---------------------- + +Once you have a cluster handle configured, you may connect to the cluster. +With a connection to the cluster, you may execute methods that return +information about the cluster. + +.. code-block:: python + :linenos: + :emphasize-lines: 7 + + import rados, sys + + cluster = rados.Rados(conffile='ceph.conf') + print "\nlibrados version: " + str(cluster.version()) + print "Will attempt to connect to: " + str(cluster.conf_get('mon initial members')) + + cluster.connect() + print "\nCluster ID: " + cluster.get_fsid() + + print "\n\nCluster Statistics" + print "==================" + cluster_stats = cluster.get_cluster_stats() + + for key, value in cluster_stats.iteritems(): + print key, value + + +By default, Ceph authentication is ``on``. Your application will need to know +the location of the keyring. The ``python-ceph`` module doesn't have the default +location, so you need to specify the keyring path. The easiest way to specify +the keyring is to add it to the Ceph configuration file. The following Ceph +configuration file example uses the ``client.admin`` keyring you generated with +``ceph-deploy``. + +.. code-block:: ini + :linenos: + + [global] + ... + keyring=/path/to/keyring/ceph.client.admin.keyring + + +Manage Pools +------------ + +When connected to the cluster, the ``Rados`` API allows you to manage pools. You +can list pools, check for the existence of a pool, create a pool and delete a +pool. + +.. code-block:: python + :linenos: + :emphasize-lines: 6, 13, 18, 25 + + print "\n\nPool Operations" + print "===============" + + print "\nAvailable Pools" + print "----------------" + pools = cluster.list_pools() + + for pool in pools: + print pool + + print "\nCreate 'test' Pool" + print "------------------" + cluster.create_pool('test') + + print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) + print "\nVerify 'test' Pool Exists" + print "-------------------------" + pools = cluster.list_pools() + + for pool in pools: + print pool + + print "\nDelete 'test' Pool" + print "------------------" + cluster.delete_pool('test') + print "\nPool named 'test' exists: " + str(cluster.pool_exists('test')) + + + +Input/Output Context +-------------------- + +Reading from and writing to the Ceph Storage Cluster requires an input/output +context (ioctx). You can create an ioctx with the ``open_ioctx()`` method of the +``Rados`` class. The ``ioctx_name`` parameter is the name of the pool you wish +to use. + +.. code-block:: python + :linenos: + + ioctx = cluster.open_ioctx('data') + + +Once you have an I/O context, you can read/write objects, extended attributes, +and perform a number of other operations. After you complete operations, ensure +that you close the connection. For example: + +.. code-block:: python + :linenos: + + print "\nClosing the connection." + ioctx.close() + + +Writing, Reading and Removing Objects +------------------------------------- + +Once you create an I/O context, you can write objects to the cluster. 
If you +write to an object that doesn't exist, Ceph creates it. If you write to an +object that exists, Ceph overwrites it (except when you specify a range, and +then it only overwrites the range). You may read objects (and object ranges) +from the cluster. You may also remove objects from the cluster. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5, 8 + + print "\nWriting object 'hw' with contents 'Hello World!' to pool 'data'." + ioctx.write_full("hw", "Hello World!") + + print "\n\nContents of object 'hw'\n------------------------\n" + print ioctx.read("hw") + + print "\nRemoving object 'hw'" + ioctx.remove_object("hw") + + +Writing and Reading XATTRS +-------------------------- + +Once you create an object, you can write extended attributes (XATTRs) to +the object and read XATTRs from the object. For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 2, 5 + + print "\n\nWriting XATTR 'lang' with value 'en_US' to object 'hw'" + ioctx.set_xattr("hw", "lang", "en_US") + + print "\n\nGetting XATTR 'lang' from object 'hw'\n" + print ioctx.get_xattr("hw", "lang") + + +Listing Objects +--------------- + +If you want to examine the list of objects in a pool, you may +retrieve the list of objects and iterate over them with the object iterator. +For example: + +.. code-block:: python + :linenos: + :emphasize-lines: 1, 6, 7 + + object_iterator = ioctx.list_objects() + + while True : + + try : + rados_object = object_iterator.next() + print "Object contents = " + rados_object.read() + + except StopIteration : + break + +The ``Object`` class provides a file-like interface to an object, allowing +you to read and write content and extended attributes. Object operations using +the I/O context provide additional functionality and asynchronous capabilities. + + +Cluster Handle API +================== + +The ``Rados`` class provides an interface into the Ceph Storage Daemon. + + +Configuration +------------- + +The ``Rados`` class provides methods for getting and setting configuration +values, reading the Ceph configuration file, and parsing arguments. You +do not need to be connected to the Ceph Storage Cluster to invoke the following +methods. See `Storage Cluster Configuration`_ for details on settings. + +.. currentmodule:: rados +.. automethod:: Rados.conf_get(option) +.. automethod:: Rados.conf_set(option, val) +.. automethod:: Rados.conf_read_file(path=None) +.. automethod:: Rados.conf_parse_argv(args) +.. automethod:: Rados.version() + + +Connection Management +--------------------- + +Once you configure your cluster handle, you may connect to the cluster, check +the cluster ``fsid``, retrieve cluster statistics, and disconnect (shutdown) +from the cluster. You may also assert that the cluster handle is in a particular +state (e.g., "configuring", "connecting", etc.). + + +.. automethod:: Rados.connect(timeout=0) +.. automethod:: Rados.shutdown() +.. automethod:: Rados.get_fsid() +.. automethod:: Rados.get_cluster_stats() +.. automethod:: Rados.require_state(*args) + + +Pool Operations +--------------- + +To use pool operation methods, you must connect to the Ceph Storage Cluster +first. You may list the available pools, create a pool, check to see if a pool +exists, and delete a pool. + +.. automethod:: Rados.list_pools() +.. automethod:: Rados.create_pool(pool_name, auid=None, crush_rule=None) +.. automethod:: Rados.pool_exists() +.. 
automethod:: Rados.delete_pool(pool_name) + + + +Input/Output Context API +======================== + +To write data to and read data from the Ceph Object Store, you must create +an Input/Output context (ioctx). The `Rados` class provides a `open_ioctx()` +method. The remaining ``ioctx`` operations involve invoking methods of the +`Ioctx` and other classes. + +.. automethod:: Rados.open_ioctx(ioctx_name) +.. automethod:: Ioctx.require_ioctx_open() +.. automethod:: Ioctx.get_stats() +.. automethod:: Ioctx.change_auid(auid) +.. automethod:: Ioctx.get_last_version() +.. automethod:: Ioctx.close() + + +.. Pool Snapshots +.. -------------- + +.. The Ceph Storage Cluster allows you to make a snapshot of a pool's state. +.. Whereas, basic pool operations only require a connection to the cluster, +.. snapshots require an I/O context. + +.. Ioctx.create_snap(self, snap_name) +.. Ioctx.list_snaps(self) +.. SnapIterator.next(self) +.. Snap.get_timestamp(self) +.. Ioctx.lookup_snap(self, snap_name) +.. Ioctx.remove_snap(self, snap_name) + +.. not published. This doesn't seem ready yet. + +Object Operations +----------------- + +The Ceph Storage Cluster stores data as objects. You can read and write objects +synchronously or asynchronously. You can read and write from offsets. An object +has a name (or key) and data. + + +.. automethod:: Ioctx.aio_write(object_name, to_write, offset=0, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_write_full(object_name, to_write, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.aio_append(object_name, to_append, oncomplete=None, onsafe=None) +.. automethod:: Ioctx.write(key, data, offset=0) +.. automethod:: Ioctx.write_full(key, data) +.. automethod:: Ioctx.aio_flush() +.. automethod:: Ioctx.set_locator_key(loc_key) +.. automethod:: Ioctx.aio_read(object_name, length, offset, oncomplete) +.. automethod:: Ioctx.read(key, length=8192, offset=0) +.. automethod:: Ioctx.stat(key) +.. automethod:: Ioctx.trunc(key, size) +.. automethod:: Ioctx.remove_object(key) + + +Object Extended Attributes +-------------------------- + +You may set extended attributes (XATTRs) on an object. You can retrieve a list +of objects or XATTRs and iterate over them. + +.. automethod:: Ioctx.set_xattr(key, xattr_name, xattr_value) +.. automethod:: Ioctx.get_xattrs(oid) +.. automethod:: XattrIterator.next() +.. automethod:: Ioctx.get_xattr(key, xattr_name) +.. automethod:: Ioctx.rm_xattr(key, xattr_name) + + + +Object Interface +================ + +From an I/O context, you can retrieve a list of objects from a pool and iterate +over them. The object interface provide makes each object look like a file, and +you may perform synchronous operations on the objects. For asynchronous +operations, you should use the I/O context methods. + +.. automethod:: Ioctx.list_objects() +.. automethod:: ObjectIterator.next() +.. automethod:: Object.read(length = 1024*1024) +.. automethod:: Object.write(string_to_write) +.. automethod:: Object.get_xattrs() +.. automethod:: Object.get_xattr(xattr_name) +.. automethod:: Object.set_xattr(xattr_name, xattr_value) +.. automethod:: Object.rm_xattr(xattr_name) +.. automethod:: Object.stat() +.. automethod:: Object.remove() + + + + +.. _Getting Started: ../../../start +.. _Storage Cluster Configuration: ../../configuration +.. 
_Getting librados for Python: ../librados-intro#getting-librados-for-python diff --git a/src/ceph/doc/rados/command/list-inconsistent-obj.json b/src/ceph/doc/rados/command/list-inconsistent-obj.json new file mode 100644 index 0000000..76ca43e --- /dev/null +++ b/src/ceph/doc/rados/command/list-inconsistent-obj.json @@ -0,0 +1,195 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "object": { + "description": "Identify a Ceph object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "version": { + "type": "integer", + "minimum": 0 + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ "head", "snapdir" ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + } + }, + "required": [ + "name", + "nspace", + "locator", + "version", + "snap" + ] + }, + "selected_object_info": { + "type": "string" + }, + "union_shard_errors": { + "description": "Union of all shard errors", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_oi", + "omap_digest_mismatch_oi", + "size_mismatch_oi", + "ec_hash_error", + "ec_size_error", + "oi_attr_missing", + "oi_attr_corrupted", + "obj_size_oi_mismatch", + "ss_attr_missing", + "ss_attr_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "errors": { + "description": "Errors related to the analysis of this object", + "type": "array", + "items": { + "enum": [ + "object_info_inconsistency", + "data_digest_mismatch", + "omap_digest_mismatch", + "size_mismatch", + "attr_value_mismatch", + "attr_name_mismatch" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "shards": { + "description": "All found or expected shards", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "object_info": { + "type": "string" + }, + "shard": { + "type": "integer" + }, + "osd": { + "type": "integer" + }, + "primary": { + "type": "boolean" + }, + "size": { + "type": "integer" + }, + "omap_digest": { + "description": "Hex representation (e.g. 0x1abd1234)", + "type": "string" + }, + "data_digest": { + "description": "Hex representation (e.g. 
0x1abd1234)", + "type": "string" + }, + "errors": { + "description": "Errors with this shard", + "type": "array", + "items": { + "enum": [ + "missing", + "stat_error", + "read_error", + "data_digest_mismatch_oi", + "omap_digest_mismatch_oi", + "size_mismatch_oi", + "ec_hash_error", + "ec_size_error", + "oi_attr_missing", + "oi_attr_corrupted", + "obj_size_oi_mismatch", + "ss_attr_missing", + "ss_attr_corrupted" + ] + }, + "minItems": 0, + "uniqueItems": true + }, + "attrs": { + "description": "If any shard's attr error is set then all attrs are here", + "type": "array", + "items": { + "description": "Information about a particular shard of object", + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "value": { + "type": "string" + }, + "Base64": { + "type": "boolean" + } + }, + "required": [ + "name", + "value", + "Base64" + ], + "additionalProperties": false, + "minItems": 1 + } + } + }, + "required": [ + "osd", + "primary", + "errors" + ] + } + } + }, + "required": [ + "object", + "union_shard_errors", + "errors", + "shards" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/src/ceph/doc/rados/command/list-inconsistent-snap.json b/src/ceph/doc/rados/command/list-inconsistent-snap.json new file mode 100644 index 0000000..0da6b0f --- /dev/null +++ b/src/ceph/doc/rados/command/list-inconsistent-snap.json @@ -0,0 +1,87 @@ +{ + "$schema": "http://json-schema.org/draft-04/schema#", + "type": "object", + "properties": { + "epoch": { + "description": "Scrub epoch", + "type": "integer" + }, + "inconsistents": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string" + }, + "nspace": { + "type": "string" + }, + "locator": { + "type": "string" + }, + "snap": { + "oneOf": [ + { + "type": "string", + "enum": [ + "head", + "snapdir" + ] + }, + { + "type": "integer", + "minimum": 0 + } + ] + }, + "errors": { + "description": "Errors for this object's snap", + "type": "array", + "items": { + "enum": [ + "ss_attr_missing", + "ss_attr_corrupted", + "oi_attr_missing", + "oi_attr_corrupted", + "snapset_mismatch", + "head_mismatch", + "headless", + "size_mismatch", + "extra_clones", + "clone_missing" + ] + }, + "minItems": 1, + "uniqueItems": true + }, + "missing": { + "description": "List of missing clones if clone_missing error set", + "type": "array", + "items": { + "type": "integer" + } + }, + "extra_clones": { + "description": "List of extra clones if extra_clones error set", + "type": "array", + "items": { + "type": "integer" + } + } + }, + "required": [ + "name", + "nspace", + "locator", + "snap", + "errors" + ] + } + } + }, + "required": [ + "epoch", + "inconsistents" + ] +} diff --git a/src/ceph/doc/rados/configuration/auth-config-ref.rst b/src/ceph/doc/rados/configuration/auth-config-ref.rst new file mode 100644 index 0000000..eb14fa4 --- /dev/null +++ b/src/ceph/doc/rados/configuration/auth-config-ref.rst @@ -0,0 +1,432 @@ +======================== + Cephx Config Reference +======================== + +The ``cephx`` protocol is enabled by default. Cryptographic authentication has +some computational costs, though they should generally be quite low. If the +network environment connecting your client and server hosts is very safe and +you cannot afford authentication, you can turn it off. **This is not generally +recommended**. + +.. 
note:: If you disable authentication, you are at risk of a man-in-the-middle + attack altering your client/server messages, which could lead to disastrous + security effects. + +For creating users, see `User Management`_. For details on the architecture +of Cephx, see `Architecture - High Availability Authentication`_. + + +Deployment Scenarios +==================== + +There are two main scenarios for deploying a Ceph cluster, which impact +how you initially configure Cephx. Most first time Ceph users use +``ceph-deploy`` to create a cluster (easiest). For clusters using +other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need +to use the manual procedures or configure your deployment tool to +bootstrap your monitor(s). + +ceph-deploy +----------- + +When you deploy a cluster with ``ceph-deploy``, you do not have to bootstrap the +monitor manually or create the ``client.admin`` user or keyring. The steps you +execute in the `Storage Cluster Quick Start`_ will invoke ``ceph-deploy`` to do +that for you. + +When you execute ``ceph-deploy new {initial-monitor(s)}``, Ceph will create a +monitor keyring for you (only used to bootstrap monitors), and it will generate +an initial Ceph configuration file for you, which contains the following +authentication settings, indicating that Ceph enables authentication by +default:: + + auth_cluster_required = cephx + auth_service_required = cephx + auth_client_required = cephx + +When you execute ``ceph-deploy mon create-initial``, Ceph will bootstrap the +initial monitor(s), retrieve a ``ceph.client.admin.keyring`` file containing the +key for the ``client.admin`` user. Additionally, it will also retrieve keyrings +that give ``ceph-deploy`` and ``ceph-disk`` utilities the ability to prepare and +activate OSDs and metadata servers. + +When you execute ``ceph-deploy admin {node-name}`` (**note:** Ceph must be +installed first), you are pushing a Ceph configuration file and the +``ceph.client.admin.keyring`` to the ``/etc/ceph`` directory of the node. You +will be able to execute Ceph administrative functions as ``root`` on the command +line of that node. + + +Manual Deployment +----------------- + +When you deploy a cluster manually, you have to bootstrap the monitor manually +and create the ``client.admin`` user and keyring. To bootstrap monitors, follow +the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are +the logical steps you must perform when using third party deployment tools like +Chef, Puppet, Juju, etc. + + +Enabling/Disabling Cephx +======================== + +Enabling Cephx requires that you have deployed keys for your monitors, +OSDs and metadata servers. If you are simply toggling Cephx on / off, +you do not have to repeat the bootstrapping procedures. + + +Enabling Cephx +-------------- + +When ``cephx`` is enabled, Ceph will look for the keyring in the default search +path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override +this location by adding a ``keyring`` option in the ``[global]`` section of +your `Ceph configuration`_ file, but this is not recommended. + +Execute the following procedures to enable ``cephx`` on a cluster with +authentication disabled. If you (or your deployment utility) have already +generated the keys, you may skip the steps related to generating keys. + +#. 
Create a ``client.admin`` key, and save a copy of the key for your client + host:: + + ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring + + **Warning:** This will clobber any existing + ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a + deployment tool has already done it for you. Be careful! + +#. Create a keyring for your monitor cluster and generate a monitor + secret key. :: + + ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' + +#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's + ``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, + use the following:: + + cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring + +#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:: + + ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring + +#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:: + + ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mds/ceph-{$id}/keyring + +#. Enable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + +For details on bootstrapping a monitor manually, see `Manual Deployment`_. + + + +Disabling Cephx +--------------- + +The following procedure describes how to disable Cephx. If your cluster +environment is relatively safe, you can offset the computation expense of +running authentication. **We do not recommend it.** However, it may be easier +during setup and/or troubleshooting to temporarily disable authentication. + +#. Disable ``cephx`` authentication by setting the following options in the + ``[global]`` section of your `Ceph configuration`_ file:: + + auth cluster required = none + auth service required = none + auth client required = none + + +#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details. + + +Configuration Settings +====================== + +Enablement +---------- + + +``auth cluster required`` + +:Description: If enabled, the Ceph Storage Cluster daemons (i.e., ``ceph-mon``, + ``ceph-osd``, and ``ceph-mds``) must authenticate with + each other. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth service required`` + +:Description: If enabled, the Ceph Storage Cluster daemons require Ceph Clients + to authenticate with the Ceph Storage Cluster in order to access + Ceph services. Valid settings are ``cephx`` or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +``auth client required`` + +:Description: If enabled, the Ceph Client requires the Ceph Storage Cluster to + authenticate with the Ceph Client. Valid settings are ``cephx`` + or ``none``. + +:Type: String +:Required: No +:Default: ``cephx``. + + +.. index:: keys; keyring + +Keys +---- + +When you run Ceph with authentication enabled, ``ceph`` administrative commands +and Ceph Clients require authentication keys to access the Ceph Storage Cluster. + +The most common way to provide these keys to the ``ceph`` administrative +commands and clients is to include a Ceph keyring under the ``/etc/ceph`` +directory. 
For Cuttlefish and later releases using ``ceph-deploy``, the filename +is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). +If you include the keyring under the ``/etc/ceph`` directory, you don't need to +specify a ``keyring`` entry in your Ceph configuration file. + +We recommend copying the Ceph Storage Cluster's keyring file to nodes where you +will run administrative commands, because it contains the ``client.admin`` key. + +You may use ``ceph-deploy admin`` to perform this task. See `Create an Admin +Host`_ for details. To perform this step manually, execute the following:: + + sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring + +.. tip:: Ensure the ``ceph.client.admin.keyring`` file has appropriate permissions set + (e.g., ``chmod 644``) on your client machine. + +You may specify the key itself in the Ceph configuration file using the ``key`` +setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. + + +``keyring`` + +:Description: The path to the keyring file. +:Type: String +:Required: No +:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin`` + + +``keyfile`` + +:Description: The path to a key file (i.e., a file containing only the key). +:Type: String +:Required: No +:Default: None + + +``key`` + +:Description: The key (i.e., the text string of the key itself). Not recommended. +:Type: String +:Required: No +:Default: None + + +Daemon Keyrings +--------------- + +Administrative users or deployment tools (e.g., ``ceph-deploy``) may generate +daemon keyrings in the same way as generating user keyrings. By default, Ceph +stores daemon keyrings inside their data directory. The default keyring +locations, and the capabilities necessary for the daemon to function, are shown +below. + +``ceph-mon`` + +:Location: ``$mon_data/keyring`` +:Capabilities: ``mon 'allow *'`` + +``ceph-osd`` + +:Location: ``$osd_data/keyring`` +:Capabilities: ``mon 'allow profile osd' osd 'allow *'`` + +``ceph-mds`` + +:Location: ``$mds_data/keyring`` +:Capabilities: ``mds 'allow' mon 'allow profile mds' osd 'allow rwx'`` + +``radosgw`` + +:Location: ``$rgw_data/keyring`` +:Capabilities: ``mon 'allow rwx' osd 'allow rwx'`` + + +.. note:: The monitor keyring (i.e., ``mon.``) contains a key but no + capabilities, and is not part of the cluster ``auth`` database. + +The daemon data directory locations default to directories of the form:: + + /var/lib/ceph/$type/$cluster-$id + +For example, ``osd.12`` would be:: + + /var/lib/ceph/osd/ceph-12 + +You can override these locations, but it is not recommended. + + +.. index:: signatures + +Signatures +---------- + +In Ceph Bobtail and subsequent versions, we prefer that Ceph authenticate all +ongoing messages between the entities using the session key set up for that +initial authentication. However, Argonaut and earlier Ceph daemons do not know +how to perform ongoing message authentication. To maintain backward +compatibility (e.g., running both Bobtail and Argonaut daemons in the same +cluster), message signing is **off** by default. If you are running Bobtail or +later daemons exclusively, configure Ceph to require signatures. + +Like other parts of Ceph authentication, Ceph provides fine-grained control so +you can enable/disable signatures for service messages between the client and +Ceph, and you can enable/disable signatures for messages between Ceph daemons. 
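One way to confirm what a running daemon is actually using for the settings
described below is to query its admin socket; ``osd.0`` here is only an example
daemon name::

    ceph daemon osd.0 config get cephx_require_signatures
    ceph daemon osd.0 config get cephx_sign_messages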
+ + +``cephx require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between the Ceph Client and the Ceph Storage Cluster, and + between daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx cluster require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph daemons comprising the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx service require signatures`` + +:Description: If set to ``true``, Ceph requires signatures on all message + traffic between Ceph Clients and the Ceph Storage Cluster. + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``cephx sign messages`` + +:Description: If the Ceph version supports message signing, Ceph will sign + all messages so they cannot be spoofed. + +:Type: Boolean +:Default: ``true`` + + +Time to Live +------------ + +``auth service ticket ttl`` + +:Description: When the Ceph Storage Cluster sends a Ceph Client a ticket for + authentication, the Ceph Storage Cluster assigns the ticket a + time to live. + +:Type: Double +:Default: ``60*60`` + + +Backward Compatibility +====================== + +For Cuttlefish and earlier releases, see `Cephx`_. + +In Ceph Argonaut v0.48 and earlier versions, if you enable ``cephx`` +authentication, Ceph only authenticates the initial communication between the +client and daemon; Ceph does not authenticate the subsequent messages they send +to each other, which has security implications. In Ceph Bobtail and subsequent +versions, Ceph authenticates all ongoing messages between the entities using the +session key set up for that initial authentication. + +We identified a backward compatibility issue between Argonaut v0.48 (and prior +versions) and Bobtail (and subsequent versions). During testing, if you +attempted to use Argonaut (and earlier) daemons with Bobtail (and later) +daemons, the Argonaut daemons did not know how to perform ongoing message +authentication, while the Bobtail versions of the daemons insist on +authenticating message traffic subsequent to the initial +request/response--making it impossible for Argonaut (and prior) daemons to +interoperate with Bobtail (and subsequent) daemons. + +We have addressed this potential problem by providing a means for Argonaut (and +prior) systems to interact with Bobtail (and subsequent) systems. Here's how it +works: by default, the newer systems will not insist on seeing signatures from +older systems that do not know how to perform them, but will simply accept such +messages without authenticating them. This new default behavior provides the +advantage of allowing two different releases to interact. **We do not recommend +this as a long term solution**. Allowing newer daemons to forgo ongoing +authentication has the unfortunate security effect that an attacker with control +of some of your machines or some access to your network can disable session +security simply by claiming to be unable to sign messages. + +.. note:: Even if you don't actually run any old versions of Ceph, + the attacker may be able to force some messages to be accepted unsigned in the + default scenario. While running Cephx with the default scenario, Ceph still + authenticates the initial communication, but you lose desirable session security. 
+ +If you know that you are not running older versions of Ceph, or you are willing +to accept that old servers and new servers will not be able to interoperate, you +can eliminate this security risk. If you do so, any Ceph system that is new +enough to support session authentication and that has Cephx enabled will reject +unsigned messages. To preclude new servers from interacting with old servers, +include the following in the ``[global]`` section of your `Ceph +configuration`_ file directly below the line that specifies the use of Cephx +for authentication:: + + cephx require signatures = true ; everywhere possible + +You can also selectively require signatures for cluster internal +communications only, separate from client-facing service:: + + cephx cluster require signatures = true ; for cluster-internal communication + cephx service require signatures = true ; for client-facing service + +An option to make a client require signatures from the cluster is not +yet implemented. + +**We recommend migrating all daemons to the newer versions and enabling the +foregoing flag** at the nearest practical time so that you may avail yourself +of the enhanced authentication. + +.. note:: Ceph kernel modules do not support signatures yet. + + +.. _Storage Cluster Quick Start: ../../../start/quick-ceph-deploy/ +.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping +.. _Operating a Cluster: ../../operations/operating +.. _Manual Deployment: ../../../install/manual-deployment +.. _Cephx: http://docs.ceph.com/docs/cuttlefish/rados/configuration/auth-config-ref/ +.. _Ceph configuration: ../ceph-conf +.. _Create an Admin Host: ../../deployment/ceph-deploy-admin +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _User Management: ../../operations/user-management diff --git a/src/ceph/doc/rados/configuration/bluestore-config-ref.rst b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst new file mode 100644 index 0000000..8d8ace6 --- /dev/null +++ b/src/ceph/doc/rados/configuration/bluestore-config-ref.rst @@ -0,0 +1,297 @@ +========================== +BlueStore Config Reference +========================== + +Devices +======= + +BlueStore manages either one, two, or (in certain cases) three storage +devices. + +In the simplest case, BlueStore consumes a single (primary) storage +device. The storage device is normally partitioned into two parts: + +#. A small partition is formatted with XFS and contains basic metadata + for the OSD. This *data directory* includes information about the + OSD (its identifier, which cluster it belongs to, and its private + keyring). + +#. The rest of the device is normally a large partition, managed directly + by BlueStore, that contains all of the actual data. This *primary device* + is normally identified by a ``block`` symlink in the data directory. + +It is also possible to deploy BlueStore across two additional devices: + +* A *WAL device* can be used for BlueStore's internal journal or + write-ahead log. It is identified by the ``block.wal`` symlink in + the data directory. It is only useful to use a WAL device if the + device is faster than the primary device (e.g., when it is on an SSD + and the primary device is an HDD). +* A *DB device* can be used for storing BlueStore's internal metadata. + BlueStore (or rather, the embedded RocksDB) will put as much + metadata as it can on the DB device to improve performance. If the + DB device fills up, metadata will spill back onto the primary device + (where it would have been otherwise). Again, it is only helpful to + provision a DB device if it is faster than the primary device. + +If there is only a small amount of fast storage available (e.g., less +than a gigabyte), we recommend using it as a WAL device. If there is +more, provisioning a DB device makes more sense. The BlueStore +journal will always be placed on the fastest device available, so +using a DB device will provide the same benefit that the WAL device +would while *also* allowing additional metadata to be stored there (if +it will fit). + +A single-device BlueStore OSD can be provisioned with:: + + ceph-disk prepare --bluestore <device> + +To specify a WAL device and/or DB device, :: + + ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
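For example, a minimal sketch in which the object data lives on a rotational
drive and the RocksDB metadata goes to a faster SSD partition (the device names
here are assumptions, not recommendations)::

    ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1p1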
Cache size +========== + +The amount of memory consumed by each OSD for BlueStore's cache is +determined by the ``bluestore_cache_size`` configuration option. If +that config option is not set (i.e., remains at 0), there is a +different default value that is used depending on whether an HDD or +SSD is used for the primary device (set by the +``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config +options). + +BlueStore and the rest of the Ceph OSD do the best they can currently +to stick to the budgeted memory. Note that on top of the configured +cache size, there is also memory consumed by the OSD itself, and +generally some overhead due to memory fragmentation and other +allocator overhead. + +The configured cache memory budget can be used in a few different ways: + +* Key/Value metadata (i.e., RocksDB's internal cache) +* BlueStore metadata +* BlueStore data (i.e., recently read or written object data) + +Cache memory usage is governed by the following options: +``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and +``bluestore_cache_kv_max``. The fraction of the cache devoted to data +is 1.0 minus the meta and kv ratios. The memory devoted to kv +metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max`` +since our testing indicates there are diminishing returns beyond a +certain point. + +``bluestore_cache_size`` + +:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead. +:Type: Integer +:Required: Yes +:Default: ``0`` + +``bluestore_cache_size_hdd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD. +:Type: Integer +:Required: Yes +:Default: ``1 * 1024 * 1024 * 1024`` (1 GB) + +``bluestore_cache_size_ssd`` + +:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD. +:Type: Integer +:Required: Yes +:Default: ``3 * 1024 * 1024 * 1024`` (3 GB) + +``bluestore_cache_meta_ratio`` + +:Description: The ratio of cache devoted to metadata. +:Type: Floating point +:Required: Yes +:Default: ``.01`` + +``bluestore_cache_kv_ratio`` + +:Description: The ratio of cache devoted to key/value data (rocksdb). +:Type: Floating point +:Required: Yes +:Default: ``.99`` + +``bluestore_cache_kv_max`` + +:Description: The maximum amount of cache devoted to key/value data (rocksdb). +:Type: Floating point +:Required: Yes +:Default: ``512 * 1024 * 1024`` (512 MB)
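As an illustration only (the 4 GB budget below is an assumption, not a
recommendation, and the ratio values shown are simply the defaults listed
above), these options might appear in the ``[osd]`` section of ``ceph.conf``::

    [osd]
    bluestore cache size = 4294967296
    bluestore cache meta ratio = .01
    bluestore cache kv ratio = .99
    bluestore cache kv max = 536870912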
Checksums +========= + +BlueStore checksums all metadata and data written to disk. Metadata +checksumming is handled by RocksDB and uses `crc32c`. Data +checksumming is done by BlueStore and can make use of `crc32c`, +`xxhash32`, or `xxhash64`. The default is `crc32c` and should be +suitable for most purposes. + +Full data checksumming does increase the amount of metadata that +BlueStore must store and manage. When possible, e.g., when clients +hint that data is written and read sequentially, BlueStore will +checksum larger blocks, but in many cases it must store a checksum +value (usually 4 bytes) for every 4 kilobyte block of data. + +It is possible to use a smaller checksum value by truncating the +checksum to two or one byte, reducing the metadata overhead. The +trade-off is that the probability that a random error will not be +detected is higher with a smaller checksum, going from about one in +four billion with a 32-bit (4 byte) checksum to one in 65,536 for a +16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. +The smaller checksum values can be used by selecting `crc32c_16` or +`crc32c_8` as the checksum algorithm. + +The *checksum algorithm* can be set either via a per-pool +``csum_type`` property or the global config option. For example, :: + + ceph osd pool set <pool-name> csum_type <algorithm> + +``bluestore_csum_type`` + +:Description: The default checksum algorithm to use. +:Type: String +:Required: Yes +:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64`` +:Default: ``crc32c`` + + +Inline Compression +================== + +BlueStore supports inline compression using `snappy`, `zlib`, or +`lz4`. Please note that the `lz4` compression plugin is not +distributed in the official release. + +Whether data in BlueStore is compressed is determined by a combination +of the *compression mode* and any hints associated with a write +operation. The modes are: + +* **none**: Never compress data. +* **passive**: Do not compress data unless the write operation has a + *compressible* hint set. +* **aggressive**: Compress data unless the write operation has an + *incompressible* hint set. +* **force**: Try to compress data no matter what. + +For more information about the *compressible* and *incompressible* IO +hints, see :doc:`/api/librados/#rados_set_alloc_hint`. + +Note that regardless of the mode, if the size of the data chunk is not +reduced sufficiently it will not be used and the original +(uncompressed) data will be stored. For example, if the ``bluestore +compression required ratio`` is set to ``.7`` then the compressed data +must be 70% of the size of the original (or smaller). + +The *compression mode*, *compression algorithm*, *compression required +ratio*, *min blob size*, and *max blob size* can be set either via a +per-pool property or a global config option. Pool properties can be +set with:: + + ceph osd pool set <pool-name> compression_algorithm <algorithm> + ceph osd pool set <pool-name> compression_mode <mode> + ceph osd pool set <pool-name> compression_required_ratio <ratio> + ceph osd pool set <pool-name> compression_min_blob_size <size> + ceph osd pool set <pool-name> compression_max_blob_size <size> + +``bluestore compression algorithm`` + +:Description: The default compressor to use (if any) if the per-pool property + ``compression_algorithm`` is not set. Note that zstd is *not* + recommended for bluestore due to high CPU overhead when + compressing small amounts of data. 
+:Type: String +:Required: No +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` +:Default: ``snappy`` + +``bluestore compression mode`` + +:Description: The default policy for using compression if the per-pool property + ``compression_mode`` is not set. ``none`` means never use + compression. ``passive`` means use compression when + `clients hint`_ that data is compressible. ``aggressive`` means + use compression unless clients hint that data is not compressible. + ``force`` means use compression under all circumstances even if + the clients hint that the data is not compressible. +:Type: String +:Required: No +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` +:Default: ``none`` + +``bluestore compression required ratio`` + +:Description: The ratio of the size of the data chunk after + compression relative to the original size must be at + least this small in order to store the compressed + version. + +:Type: Floating point +:Required: No +:Default: .875 + +``bluestore compression min blob size`` + +:Description: Chunks smaller than this are never compressed. + The per-pool property ``compression_min_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression min blob size hdd`` + +:Description: Default value of ``bluestore compression min blob size`` + for rotational media. + +:Type: Unsigned Integer +:Required: No +:Default: 128K + +``bluestore compression min blob size ssd`` + +:Description: Default value of ``bluestore compression min blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 8K + +``bluestore compression max blob size`` + +:Description: Chunks larger than this are broken into smaller blobs sizing + ``bluestore compression max blob size`` before being compressed. + The per-pool property ``compression_max_blob_size`` overrides + this setting. + +:Type: Unsigned Integer +:Required: No +:Default: 0 + +``bluestore compression max blob size hdd`` + +:Description: Default value of ``bluestore compression max blob size`` + for rotational media. + +:Type: Unsigned Integer +:Required: No +:Default: 512K + +``bluestore compression max blob size ssd`` + +:Description: Default value of ``bluestore compression max blob size`` + for non-rotational (solid state) media. + +:Type: Unsigned Integer +:Required: No +:Default: 64K + +.. _clients hint: ../../api/librados/#rados_set_alloc_hint diff --git a/src/ceph/doc/rados/configuration/ceph-conf.rst b/src/ceph/doc/rados/configuration/ceph-conf.rst new file mode 100644 index 0000000..df88452 --- /dev/null +++ b/src/ceph/doc/rados/configuration/ceph-conf.rst @@ -0,0 +1,629 @@ +================== + Configuring Ceph +================== + +When you start the Ceph service, the initialization process activates a series +of daemons that run in the background. A :term:`Ceph Storage Cluster` runs +two types of daemons: + +- :term:`Ceph Monitor` (``ceph-mon``) +- :term:`Ceph OSD Daemon` (``ceph-osd``) + +Ceph Storage Clusters that support the :term:`Ceph Filesystem` run at least one +:term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support :term:`Ceph +Object Storage` run Ceph Gateway daemons (``radosgw``). For your convenience, +each daemon has a series of default values (*i.e.*, many are set by +``ceph/src/common/config_opts.h``). You may override these settings with a Ceph +configuration file. + + +.. 
_ceph-conf-file: + +The Configuration File +====================== + +When you start a Ceph Storage Cluster, each daemon looks for a Ceph +configuration file (i.e., ``ceph.conf`` by default) that provides the cluster's +configuration settings. For manual deployments, you need to create a Ceph +configuration file. For tools that create configuration files for you (*e.g.*, +``ceph-deploy``, Chef, etc.), you may use the information contained herein as a +reference. The Ceph configuration file defines: + +- Cluster Identity +- Authentication settings +- Cluster membership +- Host names +- Host addresses +- Paths to keyrings +- Paths to journals +- Paths to data +- Other runtime options + +The default Ceph configuration file locations in sequential order include: + +#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF`` + environment variable) +#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument) +#. ``/etc/ceph/ceph.conf`` +#. ``~/.ceph/config`` +#. ``./ceph.conf`` (*i.e.,* in the current working directory) + + +The Ceph configuration file uses an *ini* style syntax. You can add comments +by preceding comments with a pound sign (#) or a semi-colon (;). For example: + +.. code-block:: ini + + # <--A number (#) sign precedes a comment. + ; A comment may be anything. + # Comments always follow a semi-colon (;) or a pound (#) on each line. + # The end of the line terminates a comment. + # We recommend that you provide comments in your configuration file(s). + + +.. _ceph-conf-settings: + +Config Sections +=============== + +The configuration file can configure all Ceph daemons in a Ceph Storage Cluster, +or all Ceph daemons of a particular type. To configure a series of daemons, the +settings must be included under the processes that will receive the +configuration as follows: + +``[global]`` + +:Description: Settings under ``[global]`` affect all daemons in a Ceph Storage + Cluster. + +:Example: ``auth supported = cephx`` + +``[osd]`` + +:Description: Settings under ``[osd]`` affect all ``ceph-osd`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``[global]``. + +:Example: ``osd journal size = 1000`` + +``[mon]`` + +:Description: Settings under ``[mon]`` affect all ``ceph-mon`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``[global]``. + +:Example: ``mon addr = 10.0.0.101:6789`` + + +``[mds]`` + +:Description: Settings under ``[mds]`` affect all ``ceph-mds`` daemons in + the Ceph Storage Cluster, and override the same setting in + ``[global]``. + +:Example: ``host = myserver01`` + +``[client]`` + +:Description: Settings under ``[client]`` affect all Ceph Clients + (e.g., mounted Ceph Filesystems, mounted Ceph Block Devices, + etc.). + +:Example: ``log file = /var/log/ceph/radosgw.log`` + + +Global settings affect all instances of all daemon in the Ceph Storage Cluster. +Use the ``[global]`` setting for values that are common for all daemons in the +Ceph Storage Cluster. You can override each ``[global]`` setting by: + +#. Changing the setting in a particular process type + (*e.g.,* ``[osd]``, ``[mon]``, ``[mds]`` ). + +#. Changing the setting in a particular process (*e.g.,* ``[osd.1]`` ). + +Overriding a global setting affects all child processes, except those that +you specifically override in a particular daemon. + +A typical global setting involves activating authentication. For example: + +.. code-block:: ini + + [global] + #Enable authentication between hosts within the cluster. 
+ #v 0.54 and earlier + auth supported = cephx + + #v 0.55 and after + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + + +You can specify settings that apply to a particular type of daemon. When you +specify settings under ``[osd]``, ``[mon]`` or ``[mds]`` without specifying a +particular instance, the setting will apply to all OSDs, monitors or metadata +daemons respectively. + +A typical daemon-wide setting involves setting journal sizes, filestore +settings, etc. For example: + +.. code-block:: ini + + [osd] + osd journal size = 1000 + + +You may specify settings for particular instances of a daemon. You may specify +an instance by entering its type, delimited by a period (.) and by the instance +ID. The instance ID for a Ceph OSD Daemon is always numeric, but it may be +alphanumeric for Ceph Monitors and Ceph Metadata Servers. + +.. code-block:: ini + + [osd.1] + # settings affect osd.1 only. + + [mon.a] + # settings affect mon.a only. + + [mds.b] + # settings affect mds.b only. + + +If the daemon you specify is a Ceph Gateway client, specify the daemon and the +instance, delimited by a period (.). For example:: + + [client.radosgw.instance-name] + # settings affect client.radosgw.instance-name only. + + + +.. _ceph-metavariables: + +Metavariables +============= + +Metavariables simplify Ceph Storage Cluster configuration dramatically. When a +metavariable is set in a configuration value, Ceph expands the metavariable into +a concrete value. Metavariables are very powerful when used within the +``[global]``, ``[osd]``, ``[mon]``, ``[mds]`` or ``[client]`` sections of your +configuration file. Ceph metavariables are similar to Bash shell expansion. + +Ceph supports the following metavariables: + + +``$cluster`` + +:Description: Expands to the Ceph Storage Cluster name. Useful when running + multiple Ceph Storage Clusters on the same hardware. + +:Example: ``/etc/ceph/$cluster.keyring`` +:Default: ``ceph`` + + +``$type`` + +:Description: Expands to one of ``mds``, ``osd``, or ``mon``, depending on the + type of the instant daemon. + +:Example: ``/var/lib/ceph/$type`` + + +``$id`` + +:Description: Expands to the daemon identifier. For ``osd.0``, this would be + ``0``; for ``mds.a``, it would be ``a``. + +:Example: ``/var/lib/ceph/$type/$cluster-$id`` + + +``$host`` + +:Description: Expands to the host name of the instant daemon. + + +``$name`` + +:Description: Expands to ``$type.$id``. +:Example: ``/var/run/ceph/$cluster-$name.asok`` + +``$pid`` + +:Description: Expands to daemon pid. +:Example: ``/var/run/ceph/$cluster-$name-$pid.asok`` + + +.. _ceph-conf-common-settings: + +Common Settings +=============== + +The `Hardware Recommendations`_ section provides some hardware guidelines for +configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph +Node` to run multiple daemons. For example, a single node with multiple drives +may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a +particular type of process. For example, some nodes may run ``ceph-osd`` +daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may +run ``ceph-mon`` daemons. + +Each node has a name identified by the ``host`` setting. Monitors also specify +a network address and port (i.e., domain name or IP address) identified by the +``addr`` setting. A basic configuration file will typically specify only +minimal settings for each instance of monitor daemons. For example: + +.. 
code-block:: ini + + [global] + mon_initial_members = ceph1 + mon_host = 10.0.0.1 + + +.. important:: The ``host`` setting is the short name of the node (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the node. Do not use ``host`` + settings for anything other than initial monitors unless you are deploying + Ceph manually. You **MUST NOT** specify ``host`` under individual daemons + when using deployment tools like ``chef`` or ``ceph-deploy``, as those tools + will enter the appropriate values for you in the cluster map. + + +.. _ceph-network-config: + +Networks +======== + +See the `Network Configuration Reference`_ for a detailed discussion about +configuring a network for use with Ceph. + + +Monitors +======== + +Ceph production clusters typically deploy with a minimum 3 :term:`Ceph Monitor` +daemons to ensure high availability should a monitor instance crash. At least +three (3) monitors ensures that the Paxos algorithm can determine which version +of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph +Monitors in the quorum. + +.. note:: You may deploy Ceph with a single monitor, but if the instance fails, + the lack of other monitors may interrupt data service availability. + +Ceph Monitors typically listen on port ``6789``. For example: + +.. code-block:: ini + + [mon.a] + host = hostName + mon addr = 150.140.130.120:6789 + +By default, Ceph expects that you will store a monitor's data under the +following path:: + + /var/lib/ceph/mon/$cluster-$id + +You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/mon/ceph-a + +For additional details, see the `Monitor Config Reference`_. + +.. _Monitor Config Reference: ../mon-config-ref + + +.. _ceph-osd-config: + + +Authentication +============== + +.. versionadded:: Bobtail 0.56 + +For Bobtail (v 0.56) and beyond, you should expressly enable or disable +authentication in the ``[global]`` section of your Ceph configuration file. :: + + auth cluster required = cephx + auth service required = cephx + auth client required = cephx + +Additionally, you should enable message signing. See `Cephx Config Reference`_ for details. + +.. important:: When upgrading, we recommend expressly disabling authentication + first, then perform the upgrade. Once the upgrade is complete, re-enable + authentication. + +.. _Cephx Config Reference: ../auth-config-ref + + +.. _ceph-monitor-config: + + +OSDs +==== + +Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node +has one OSD daemon running a filestore on one storage drive. A typical +deployment specifies a journal size. For example: + +.. code-block:: ini + + [osd] + osd journal size = 10000 + + [osd.0] + host = {hostname} #manual deployments only. + + +By default, Ceph expects that you will store a Ceph OSD Daemon's data with the +following path:: + + /var/lib/ceph/osd/$cluster-$id + +You or a deployment tool (e.g., ``ceph-deploy``) must create the corresponding +directory. With metavariables fully expressed and a cluster named "ceph", the +foregoing directory would evaluate to:: + + /var/lib/ceph/osd/ceph-0 + +You may override this path using the ``osd data`` setting. We don't recommend +changing the default location. Create the default directory on your OSD host. 
+ +:: + + ssh {osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + +The ``osd data`` path ideally leads to a mount point with a hard disk that is +separate from the hard disk storing and running the operating system and +daemons. If the OSD is for a disk other than the OS disk, prepare it for +use with Ceph, and mount it to the directory you just created:: + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{disk} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + +We recommend using the ``xfs`` file system when running +:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and no +longer tested.) + +See the `OSD Config Reference`_ for additional configuration details. + + +Heartbeats +========== + +During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons +and report their findings to the Ceph Monitor. You do not have to provide any +settings. However, if you have network latency issues, you may wish to modify +the settings. + +See `Configuring Monitor/OSD Interaction`_ for additional details. + + +.. _ceph-logging-and-debugging: + +Logs / Debugging +================ + +Sometimes you may encounter issues with Ceph that require +modifying logging output and using Ceph's debugging. See `Debugging and +Logging`_ for details on log rotation. + +.. _Debugging and Logging: ../../troubleshooting/log-and-debug + + +Example ceph.conf +================= + +.. literalinclude:: demo-ceph.conf + :language: ini + +.. _ceph-runtime-config: + +Runtime Changes +=============== + +Ceph allows you to make changes to the configuration of a ``ceph-osd``, +``ceph-mon``, or ``ceph-mds`` daemon at runtime. This capability is quite +useful for increasing/decreasing logging output, enabling/disabling debug +settings, and even for runtime optimization. The following reflects runtime +configuration usage:: + + ceph tell {daemon-type}.{id or *} injectargs --{name} {value} [--{name} {value}] + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID (i.e., its number or letter). For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 injectargs --debug-osd 20 --debug-ms 1 + +In your ``ceph.conf`` file, you may use spaces when specifying a +setting name. When specifying a setting name on the command line, +ensure that you use an underscore or hyphen (``_`` or ``-``) between +terms (e.g., ``debug osd`` becomes ``--debug-osd``). + + +Viewing a Configuration at Runtime +================================== + +If your Ceph Storage Cluster is running, and you would like to see the +configuration settings from a running daemon, execute the following:: + + ceph daemon {daemon-type}.{id} config show | less + +If you are on a machine where osd.0 is running, the command would be:: + + ceph daemon osd.0 config show | less + +Reading Configuration Metadata at Runtime +========================================= + +Information about the available configuration options is available via +the ``config help`` command: + +:: + + ceph daemon {daemon-type}.{id} config help | less + + +This metadata is primarily intended to be used when integrating other +software with Ceph, such as graphical user interfaces. 
The output is +a list of JSON objects, for example: + +:: + + { + "name": "mon_host", + "type": "std::string", + "level": "basic", + "desc": "list of hosts or addresses to search for a monitor", + "long_desc": "This is a comma, whitespace, or semicolon separated list of IP addresses or hostnames. Hostnames are resolved via DNS and all A or AAAA records are included in the search list.", + "default": "", + "daemon_default": "", + "tags": [], + "services": [ + "common" + ], + "see_also": [], + "enum_values": [], + "min": "", + "max": "" + } + +type +____ + +The type of the setting, given as a C++ type name. + +level +_____ + +One of `basic`, `advanced`, `dev`. The `dev` options are not intended +for use outside of development and testing. + +desc +____ + +A short description -- this is a sentence fragment suitable for display +in small spaces like a single line in a list. + +long_desc +_________ + +A full description of what the setting does, this may be as long as needed. + +default +_______ + +The default value, if any. + +daemon_default +______________ + +An alternative default used for daemons (services) as opposed to clients. + +tags +____ + +A list of strings indicating topics to which this setting relates. Examples +of tags are `performance` and `networking`. + +services +________ + +A list of strings indicating which Ceph services the setting relates to, such +as `osd`, `mds`, `mon`. For settings that are relevant to any Ceph client +or server, `common` is used. + +see_also +________ + +A list of strings indicating other configuration options that may also +be of interest to a user setting this option. + +enum_values +___________ + +Optional: a list of strings indicating the valid settings. + +min, max +________ + +Optional: upper and lower (inclusive) bounds on valid settings. + + + + +Running Multiple Clusters +========================= + +With Ceph, you can run multiple Ceph Storage Clusters on the same hardware. +Running multiple clusters provides a higher level of isolation compared to +using different pools on the same cluster with different CRUSH rulesets. A +separate cluster will have separate monitor, OSD and metadata server processes. +When running Ceph with default settings, the default cluster name is ``ceph``, +which means you would save your Ceph configuration file with the file name +``ceph.conf`` in the ``/etc/ceph`` default directory. + +See `ceph-deploy new`_ for details. +.. _ceph-deploy new:../ceph-deploy-new + +When you run multiple clusters, you must name your cluster and save the Ceph +configuration file with the name of the cluster. For example, a cluster named +``openstack`` will have a Ceph configuration file with the file name +``openstack.conf`` in the ``/etc/ceph`` default directory. + +.. important:: Cluster names must consist of letters a-z and digits 0-9 only. + +Separate clusters imply separate data disks and journals, which are not shared +between clusters. Referring to `Metavariables`_, the ``$cluster`` metavariable +evaluates to the cluster name (i.e., ``openstack`` in the foregoing example). +Various settings use the ``$cluster`` metavariable, including: + +- ``keyring`` +- ``admin socket`` +- ``log file`` +- ``pid file`` +- ``mon data`` +- ``mon cluster log file`` +- ``osd data`` +- ``osd journal`` +- ``mds data`` +- ``rgw data`` + +See `General Settings`_, `OSD Settings`_, `Monitor Settings`_, `MDS Settings`_, +`RGW Settings`_ and `Log Settings`_ for relevant path defaults that use the +``$cluster`` metavariable. + +.. 
_General Settings: ../general-config-ref +.. _OSD Settings: ../osd-config-ref +.. _Monitor Settings: ../mon-config-ref +.. _MDS Settings: ../../../cephfs/mds-config-ref +.. _RGW Settings: ../../../radosgw/config-ref/ +.. _Log Settings: ../../troubleshooting/log-and-debug + + +When creating default directories or files, you should use the cluster +name at the appropriate places in the path. For example:: + + sudo mkdir /var/lib/ceph/osd/openstack-0 + sudo mkdir /var/lib/ceph/mon/openstack-a + +.. important:: When running monitors on the same host, you should use + different ports. By default, monitors use port 6789. If you already + have monitors using port 6789, use a different port for your other cluster(s). + +To invoke a cluster other than the default ``ceph`` cluster, use the +``-c {filename}.conf`` option with the ``ceph`` command. For example:: + + ceph -c {cluster-name}.conf health + ceph -c openstack.conf health + + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Network Configuration Reference: ../network-config-ref +.. _OSD Config Reference: ../osd-config-ref +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _ceph-deploy new: ../../deployment/ceph-deploy-new#naming-a-cluster diff --git a/src/ceph/doc/rados/configuration/demo-ceph.conf b/src/ceph/doc/rados/configuration/demo-ceph.conf new file mode 100644 index 0000000..ba86d53 --- /dev/null +++ b/src/ceph/doc/rados/configuration/demo-ceph.conf @@ -0,0 +1,31 @@ +[global] +fsid = {cluster-id} +mon initial members = {hostname}[, {hostname}] +mon host = {ip-address}[, {ip-address}] + +#All clusters have a front-side public network. +#If you have two NICs, you can configure a back side cluster +#network for OSD object replication, heart beats, backfilling, +#recovery, etc. +public network = {network}[, {network}] +#cluster network = {network}[, {network}] + +#Clusters require authentication by default. +auth cluster required = cephx +auth service required = cephx +auth client required = cephx + +#Choose reasonable numbers for your journals, number of replicas +#and placement groups. +osd journal size = {n} +osd pool default size = {n} # Write an object n times. +osd pool default min size = {n} # Allow writing n copies in a degraded state. +osd pool default pg num = {n} +osd pool default pgp num = {n} + +#Choose a reasonable crush leaf type. +#0 for a 1-node cluster. +#1 for a multi node cluster in a single rack +#2 for a multi node, multi chassis cluster with multiple hosts in a chassis +#3 for a multi node cluster with hosts across racks, etc. +osd crush chooseleaf type = {n}
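#The commented values below sketch how the placeholders above might be
#filled in for a small test cluster. They are illustrative assumptions
#only; substitute numbers that match your own hardware and pools.
#osd journal size = 1024
#osd pool default size = 3
#osd pool default min size = 2
#osd pool default pg num = 128
#osd pool default pgp num = 128
#osd crush chooseleaf type = 1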
\ No newline at end of file diff --git a/src/ceph/doc/rados/configuration/filestore-config-ref.rst b/src/ceph/doc/rados/configuration/filestore-config-ref.rst new file mode 100644 index 0000000..4dff60c --- /dev/null +++ b/src/ceph/doc/rados/configuration/filestore-config-ref.rst @@ -0,0 +1,365 @@ +============================ + Filestore Config Reference +============================ + + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. Expensive. For debugging only. +:Type: Boolean +:Required: No +:Default: ``0`` + + +.. index:: filestore; extended attributes + +Extended Attributes +=================== + +Extended Attributes (XATTRs) are an important aspect of your configuration. +Some file systems have limits on the number of bytes stored in XATTRs. +Additionally, in some cases, the filesystem may not be as fast as an alternative +method of storing XATTRs. The following settings may help improve performance +by using a method of storing XATTRs that is extrinsic to the underlying filesystem. + +Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided +by the underlying file system, if it does not impose a size limit. If +there is a size limit (4KB total on ext4, for instance), some Ceph +XATTRs will be stored in a key/value database when either the +``filestore max inline xattr size`` or ``filestore max inline +xattrs`` threshold is reached. + + +``filestore max inline xattr size`` + +:Description: The maximum size of an XATTR stored in the filesystem (i.e., XFS, + btrfs, ext4, etc.) per object. Should not be larger than the + filesystem can handle. Default value of 0 means to use the value + specific to the underlying filesystem. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore max inline xattr size xfs`` + +:Description: The maximum size of an XATTR stored in the XFS filesystem. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``65536`` + + +``filestore max inline xattr size btrfs`` + +:Description: The maximum size of an XATTR stored in the btrfs filesystem. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``2048`` + + +``filestore max inline xattr size other`` + +:Description: The maximum size of an XATTR stored in other filesystems. + Only used if ``filestore max inline xattr size`` == 0. +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``512`` + + +``filestore max inline xattrs`` + +:Description: The maximum number of XATTRs stored in the filesystem per object. + Default value of 0 means to use the value specific to the + underlying filesystem. +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + +``filestore max inline xattrs xfs`` + +:Description: The maximum number of XATTRs stored in the XFS filesystem per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore max inline xattrs btrfs`` + +:Description: The maximum number of XATTRs stored in the btrfs filesystem per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``10`` + + +``filestore max inline xattrs other`` + +:Description: The maximum number of XATTRs stored in other filesystems per object. + Only used if ``filestore max inline xattrs`` == 0. +:Type: 32-bit Integer +:Required: No +:Default: ``2`` + +.. 
index:: filestore; synchronization + +Synchronization Intervals +========================= + +Periodically, the filestore needs to quiesce writes and synchronize the +filesystem, which creates a consistent commit point. It can then free journal +entries up to the commit point. Synchronizing more frequently tends to reduce +the time required to perform synchronization, and reduces the amount of data +that needs to remain in the journal. Less frequent synchronization allows the +backing filesystem to coalesce small writes and metadata updates more +optimally--potentially resulting in more efficient synchronization. + + +``filestore max sync interval`` + +:Description: The maximum interval in seconds for synchronizing the filestore. +:Type: Double +:Required: No +:Default: ``5`` + + +``filestore min sync interval`` + +:Description: The minimum interval in seconds for synchronizing the filestore. +:Type: Double +:Required: No +:Default: ``.01`` + + +.. index:: filestore; flusher + +Flusher +======= + +The filestore flusher forces data from large writes to be written out using +``sync file range`` before the sync in order to (hopefully) reduce the cost of +the eventual sync. In practice, disabling 'filestore flusher' seems to improve +performance in some cases. + + +``filestore flusher`` + +:Description: Enables the filestore flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore flusher max fds`` + +:Description: Sets the maximum number of file descriptors for the flusher. +:Type: Integer +:Required: No +:Default: ``512`` + +.. deprecated:: v.65 + +``filestore sync flush`` + +:Description: Enables the synchronization flusher. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. deprecated:: v.65 + +``filestore fsync flushes journal data`` + +:Description: Flush journal data during filesystem synchronization. +:Type: Boolean +:Required: No +:Default: ``false`` + + +.. index:: filestore; queue + +Queue +===== + +The following settings provide limits on the size of filestore queue. + +``filestore queue max ops`` + +:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations. +:Type: Integer +:Required: No. Minimal impact on performance. +:Default: ``50`` + + +``filestore queue max bytes`` + +:Description: The maximum number of bytes for an operation. +:Type: Integer +:Required: No +:Default: ``100 << 20`` + + + + +.. index:: filestore; timeouts + +Timeouts +======== + + +``filestore op threads`` + +:Description: The number of filesystem operation threads that execute in parallel. +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore op thread timeout`` + +:Description: The timeout for a filesystem operation thread (in seconds). +:Type: Integer +:Required: No +:Default: ``60`` + + +``filestore op thread suicide timeout`` + +:Description: The timeout for a commit operation before cancelling the commit (in seconds). +:Type: Integer +:Required: No +:Default: ``180`` + + +.. index:: filestore; btrfs + +B-Tree Filesystem +================= + + +``filestore btrfs snap`` + +:Description: Enable snapshots for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +``filestore btrfs clone range`` + +:Description: Enable cloning ranges for a ``btrfs`` filestore. +:Type: Boolean +:Required: No. Only used for ``btrfs``. +:Default: ``true`` + + +.. 
index:: filestore; journal + +Journal +======= + + +``filestore journal parallel`` + +:Description: Enables parallel journaling, default for btrfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore journal writeahead`` + +:Description: Enables writeahead journaling, default for xfs. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore journal trailing`` + +:Description: Deprecated, never use. +:Type: Boolean +:Required: No +:Default: ``false`` + + +Misc +==== + + +``filestore merge threshold`` + +:Description: Min number of files in a subdir before merging into parent + NOTE: A negative value means to disable subdir merging +:Type: Integer +:Required: No +:Default: ``10`` + + +``filestore split multiple`` + +:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16`` + is the maximum number of files in a subdirectory before + splitting into child directories. + +:Type: Integer +:Required: No +:Default: ``2`` + + +``filestore split rand factor`` + +:Description: A random factor added to the split threshold to avoid + too many filestore splits occurring at once. See + ``filestore split multiple`` for details. + This can only be changed for an existing osd offline, + via ceph-objectstore-tool's apply-layout-settings command. + +:Type: Unsigned 32-bit Integer +:Required: No +:Default: ``20`` + + +``filestore update to`` + +:Description: Limits filestore auto upgrade to specified version. +:Type: Integer +:Required: No +:Default: ``1000`` + + +``filestore blackhole`` + +:Description: Drop any new transactions on the floor. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore dump file`` + +:Description: File onto which store transaction dumps. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``filestore kill at`` + +:Description: inject a failure at the n'th opportunity +:Type: String +:Required: No +:Default: ``false`` + + +``filestore fail eio`` + +:Description: Fail/Crash on eio. +:Type: Boolean +:Required: No +:Default: ``true`` + diff --git a/src/ceph/doc/rados/configuration/general-config-ref.rst b/src/ceph/doc/rados/configuration/general-config-ref.rst new file mode 100644 index 0000000..ca09ee5 --- /dev/null +++ b/src/ceph/doc/rados/configuration/general-config-ref.rst @@ -0,0 +1,66 @@ +========================== + General Config Reference +========================== + + +``fsid`` + +:Description: The filesystem ID. One per cluster. +:Type: UUID +:Required: No. +:Default: N/A. Usually generated by deployment tools. + + +``admin socket`` + +:Description: The socket for executing administrative commands on a daemon, + irrespective of whether Ceph Monitors have established a quorum. + +:Type: String +:Required: No +:Default: ``/var/run/ceph/$cluster-$name.asok`` + + +``pid file`` + +:Description: The file in which the mon, osd or mds will write its + PID. For instance, ``/var/run/$cluster/$type.$id.pid`` + will create /var/run/ceph/mon.a.pid for the ``mon`` with + id ``a`` running in the ``ceph`` cluster. The ``pid + file`` is removed when the daemon stops gracefully. If + the process is not daemonized (i.e. runs with the ``-f`` + or ``-d`` option), the ``pid file`` is not created. +:Type: String +:Required: No +:Default: No + + +``chdir`` + +:Description: The directory Ceph daemons change to once they are + up and running. Default ``/`` directory recommended. 
+ +:Type: String +:Required: No +:Default: ``/`` + + +``max open files`` + +:Description: If set, when the :term:`Ceph Storage Cluster` starts, Ceph sets + the ``max open fds`` at the OS level (i.e., the max # of file + descriptors). It helps prevents Ceph OSD Daemons from running out + of file descriptors. + +:Type: 64-bit Integer +:Required: No +:Default: ``0`` + + +``fatal signal handlers`` + +:Description: If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, + FPE, XCPU, XFSZ, SYS signals to generate a useful log message + +:Type: Boolean +:Default: ``true`` diff --git a/src/ceph/doc/rados/configuration/index.rst b/src/ceph/doc/rados/configuration/index.rst new file mode 100644 index 0000000..48b58ef --- /dev/null +++ b/src/ceph/doc/rados/configuration/index.rst @@ -0,0 +1,64 @@ +=============== + Configuration +=============== + +Ceph can run with a cluster containing thousands of Object Storage Devices +(OSDs). A minimal system will have at least two OSDs for data replication. To +configure OSD clusters, you must provide settings in the configuration file. +Ceph provides default values for many settings, which you can override in the +configuration file. Additionally, you can make runtime modification to the +configuration using command-line utilities. + +When Ceph starts, it activates three daemons: + +- ``ceph-mon`` (mandatory) +- ``ceph-osd`` (mandatory) +- ``ceph-mds`` (mandatory for cephfs only) + +Each process, daemon or utility loads the host's configuration file. A process +may have information about more than one daemon instance (*i.e.,* multiple +contexts). A daemon or utility only has information about a single daemon +instance (a single context). + +.. note:: Ceph can run on a single host for evaluation purposes. + + +.. raw:: html + + <table cellpadding="10"><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>Configuring the Object Store</h3> + +For general object store configuration, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Storage devices <storage-devices> + ceph-conf + + +.. raw:: html + + </td><td><h3>Reference</h3> + +To optimize the performance of your cluster, refer to the following: + +.. toctree:: + :maxdepth: 1 + + Network Settings <network-config-ref> + Auth Settings <auth-config-ref> + Monitor Settings <mon-config-ref> + mon-lookup-dns + Heartbeat Settings <mon-osd-interaction> + OSD Settings <osd-config-ref> + BlueStore Settings <bluestore-config-ref> + FileStore Settings <filestore-config-ref> + Journal Settings <journal-ref> + Pool, PG & CRUSH Settings <pool-pg-config-ref.rst> + Messaging Settings <ms-ref> + General Settings <general-config-ref> + + +.. raw:: html + + </td></tr></tbody></table> diff --git a/src/ceph/doc/rados/configuration/journal-ref.rst b/src/ceph/doc/rados/configuration/journal-ref.rst new file mode 100644 index 0000000..97300f4 --- /dev/null +++ b/src/ceph/doc/rados/configuration/journal-ref.rst @@ -0,0 +1,116 @@ +========================== + Journal Config Reference +========================== + +.. index:: journal; journal configuration + +Ceph OSDs use a journal for two reasons: speed and consistency. + +- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes + quickly. Ceph writes small, random i/o to the journal sequentially, which + tends to speed up bursty workloads by allowing the backing filesystem more + time to coalesce writes. 
The Ceph OSD Daemon's journal, however, can lead + to spiky performance with short spurts of high-speed writes followed by + periods without any write progress as the filesystem catches up to the + journal. + +- **Consistency:** Ceph OSD Daemons require a filesystem interface that + guarantees atomic compound operations. Ceph OSD Daemons write a description + of the operation to the journal and apply the operation to the filesystem. + This enables atomic updates to an object (for example, placement group + metadata). Every few seconds--between ``filestore max sync interval`` and + ``filestore min sync interval``--the Ceph OSD Daemon stops writes and + synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to + trim operations from the journal and reuse the space. On failure, Ceph + OSD Daemons replay the journal starting after the last synchronization + operation. + +Ceph OSD Daemons support the following journal settings: + + +``journal dio`` + +:Description: Enables direct i/o to the journal. Requires ``journal block + align`` set to ``true``. + +:Type: Boolean +:Required: Yes when using ``aio``. +:Default: ``true`` + + + +``journal aio`` + +.. versionchanged:: 0.61 Cuttlefish + +:Description: Enables using ``libaio`` for asynchronous writes to the journal. + Requires ``journal dio`` set to ``true``. + +:Type: Boolean +:Required: No. +:Default: Version 0.61 and later, ``true``. Version 0.60 and earlier, ``false``. + + +``journal block align`` + +:Description: Block aligns write operations. Required for ``dio`` and ``aio``. +:Type: Boolean +:Required: Yes when using ``dio`` and ``aio``. +:Default: ``true`` + + +``journal max write bytes`` + +:Description: The maximum number of bytes the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal max write entries`` + +:Description: The maximum number of entries the journal will write at + any one time. + +:Type: Integer +:Required: No +:Default: ``100`` + + +``journal queue max ops`` + +:Description: The maximum number of operations allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``500`` + + +``journal queue max bytes`` + +:Description: The maximum number of bytes allowed in the queue at + any one time. + +:Type: Integer +:Required: No +:Default: ``10 << 20`` + + +``journal align min size`` + +:Description: Align data payloads greater than the specified minimum. +:Type: Integer +:Required: No +:Default: ``64 << 10`` + + +``journal zero on create`` + +:Description: Causes the file store to overwrite the entire journal with + ``0``'s during ``mkfs``. +:Type: Boolean +:Required: No +:Default: ``false`` diff --git a/src/ceph/doc/rados/configuration/mon-config-ref.rst b/src/ceph/doc/rados/configuration/mon-config-ref.rst new file mode 100644 index 0000000..6c8e92b --- /dev/null +++ b/src/ceph/doc/rados/configuration/mon-config-ref.rst @@ -0,0 +1,1222 @@ +========================== + Monitor Config Reference +========================== + +Understanding how to configure a :term:`Ceph Monitor` is an important part of +building a reliable :term:`Ceph Storage Cluster`. **All Ceph Storage Clusters +have at least one monitor**. A monitor configuration usually remains fairly +consistent, but you can add, remove or replace a monitor in a cluster. See +`Adding/Removing a Monitor`_ and `Add/Remove a Monitor (ceph-deploy)`_ for +details. + + +.. 
index:: Ceph Monitor; Paxos + +Background +========== + +Ceph Monitors maintain a "master copy" of the :term:`cluster map`, which means a +:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD +Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and +retrieving a current cluster map. Before Ceph Clients can read from or write to +Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor +first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph +Client can compute the location for any object. The ability to compute object +locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a +very important aspect of Ceph's high scalability and performance. See +`Scalability and High Availability`_ for additional details. + +The primary role of the Ceph Monitor is to maintain a master copy of the cluster +map. Ceph Monitors also provide authentication and logging services. Ceph +Monitors write all changes in the monitor services to a single Paxos instance, +and Paxos writes the changes to a key/value store for strong consistency. Ceph +Monitors can query the most recent version of the cluster map during sync +operations. Ceph Monitors leverage the key/value store's snapshots and iterators +(using leveldb) to perform store-wide synchronization. + +.. ditaa:: + + /-------------\ /-------------\ + | Monitor | Write Changes | Paxos | + | cCCC +-------------->+ cCCC | + | | | | + +-------------+ \------+------/ + | Auth | | + +-------------+ | Write Changes + | Log | | + +-------------+ v + | Monitor Map | /------+------\ + +-------------+ | Key / Value | + | OSD Map | | Store | + +-------------+ | cCCC | + | PG Map | \------+------/ + +-------------+ ^ + | MDS Map | | Read Changes + +-------------+ | + | cCCC |*---------------------+ + \-------------/ + + +.. deprecated:: version 0.58 + +In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for +each service and store the map as a file. + +.. index:: Ceph Monitor; cluster map + +Cluster Maps +------------ + +The cluster map is a composite of maps, including the monitor map, the OSD map, +the placement group map and the metadata server map. The cluster map tracks a +number of important things: which processes are ``in`` the Ceph Storage Cluster; +which processes that are ``in`` the Ceph Storage Cluster are ``up`` and running +or ``down``; whether, the placement groups are ``active`` or ``inactive``, and +``clean`` or in some other state; and, other details that reflect the current +state of the cluster such as the total amount of storage space, and the amount +of storage used. + +When there is a significant change in the state of the cluster--e.g., a Ceph OSD +Daemon goes down, a placement group falls into a degraded state, etc.--the +cluster map gets updated to reflect the current state of the cluster. +Additionally, the Ceph Monitor also maintains a history of the prior states of +the cluster. The monitor map, OSD map, placement group map and metadata server +map each maintain a history of their map versions. We call each version an +"epoch." + +When operating your Ceph Storage Cluster, keeping track of these states is an +important part of your system administration duties. See `Monitoring a Cluster`_ +and `Monitoring OSDs and PGs`_ for additional details. + +.. 
index:: high availability; quorum + +Monitor Quorum +-------------- + +Our Configuring ceph section provides a trivial `Ceph configuration file`_ that +provides for one monitor in the test cluster. A cluster will run fine with a +single monitor; however, **a single monitor is a single-point-of-failure**. To +ensure high availability in a production Ceph Storage Cluster, you should run +Ceph with multiple monitors so that the failure of a single monitor **WILL NOT** +bring down your entire cluster. + +When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, +Ceph Monitors use `Paxos`_ to establish consensus about the master cluster map. +A consensus requires a majority of monitors running to establish a quorum for +consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; +etc.). + +``mon force quorum join`` + +:Description: Force monitor to join quorum even if it has been previously removed from the map +:Type: Boolean +:Default: ``False`` + +.. index:: Ceph Monitor; consistency + +Consistency +----------- + +When you add monitor settings to your Ceph configuration file, you need to be +aware of some of the architectural aspects of Ceph Monitors. **Ceph imposes +strict consistency requirements** for a Ceph monitor when discovering another +Ceph Monitor within the cluster. Whereas, Ceph Clients and other Ceph daemons +use the Ceph configuration file to discover monitors, monitors discover each +other using the monitor map (monmap), not the Ceph configuration file. + +A Ceph Monitor always refers to the local copy of the monmap when discovering +other Ceph Monitors in the Ceph Storage Cluster. Using the monmap instead of the +Ceph configuration file avoids errors that could break the cluster (e.g., typos +in ``ceph.conf`` when specifying a monitor address or port). Since monitors use +monmaps for discovery and they share monmaps with clients and other Ceph +daemons, **the monmap provides monitors with a strict guarantee that their +consensus is valid.** + +Strict consistency also applies to updates to the monmap. As with any other +updates on the Ceph Monitor, changes to the monmap always run through a +distributed consensus algorithm called `Paxos`_. The Ceph Monitors must agree on +each update to the monmap, such as adding or removing a Ceph Monitor, to ensure +that each monitor in the quorum has the same version of the monmap. Updates to +the monmap are incremental so that Ceph Monitors have the latest agreed upon +version, and a set of previous versions. Maintaining a history enables a Ceph +Monitor that has an older version of the monmap to catch up with the current +state of the Ceph Storage Cluster. + +If Ceph Monitors discovered each other through the Ceph configuration file +instead of through the monmap, it would introduce additional risks because the +Ceph configuration files are not updated and distributed automatically. Ceph +Monitors might inadvertently use an older Ceph configuration file, fail to +recognize a Ceph Monitor, fall out of a quorum, or develop a situation where +`Paxos`_ is not able to determine the current state of the system accurately. + + +.. index:: Ceph Monitor; bootstrapping monitors + +Bootstrapping Monitors +---------------------- + +In most configuration and deployment cases, tools that deploy Ceph may help +bootstrap the Ceph Monitors by generating a monitor map for you (e.g., +``ceph-deploy``, etc). 
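If you bootstrap monitors by hand rather than with a deployment tool, the
monitor map is typically created with ``monmaptool``. The following is only a
sketch; the monitor name, address, and ``fsid`` shown are placeholders to
replace with your own values (see `Bootstrapping a Monitor`_ for the full
procedure)::

    uuidgen
    # suppose it prints a7f64266-0894-4f1e-a635-d0aeaca0e993
    monmaptool --create --add a 192.168.0.10:6789 \
        --fsid a7f64266-0894-4f1e-a635-d0aeaca0e993 /tmp/monmap
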
A Ceph Monitor requires a few explicit +settings: + +- **Filesystem ID**: The ``fsid`` is the unique identifier for your + object store. Since you can run multiple clusters on the same + hardware, you must specify the unique ID of the object store when + bootstrapping a monitor. Deployment tools usually do this for you + (e.g., ``ceph-deploy`` can call a tool like ``uuidgen``), but you + may specify the ``fsid`` manually too. + +- **Monitor ID**: A monitor ID is a unique ID assigned to each monitor within + the cluster. It is an alphanumeric value, and by convention the identifier + usually follows an alphabetical increment (e.g., ``a``, ``b``, etc.). This + can be set in a Ceph configuration file (e.g., ``[mon.a]``, ``[mon.b]``, etc.), + by a deployment tool, or using the ``ceph`` commandline. + +- **Keys**: The monitor must have secret keys. A deployment tool such as + ``ceph-deploy`` usually does this for you, but you may + perform this step manually too. See `Monitor Keyrings`_ for details. + +For additional details on bootstrapping, see `Bootstrapping a Monitor`_. + +.. index:: Ceph Monitor; configuring monitors + +Configuring Monitors +==================== + +To apply configuration settings to the entire cluster, enter the configuration +settings under ``[global]``. To apply configuration settings to all monitors in +your cluster, enter the configuration settings under ``[mon]``. To apply +configuration settings to specific monitors, specify the monitor instance +(e.g., ``[mon.a]``). By convention, monitor instance names use alpha notation. + +.. code-block:: ini + + [global] + + [mon] + + [mon.a] + + [mon.b] + + [mon.c] + + +Minimum Configuration +--------------------- + +The bare minimum monitor settings for a Ceph monitor via the Ceph configuration +file include a hostname and a monitor address for each monitor. You can configure +these under ``[mon]`` or under the entry for a specific monitor. + +.. code-block:: ini + + [mon] + mon host = hostname1,hostname2,hostname3 + mon addr = 10.0.0.10:6789,10.0.0.11:6789,10.0.0.12:6789 + + +.. code-block:: ini + + [mon.a] + host = hostname1 + mon addr = 10.0.0.10:6789 + +See the `Network Configuration Reference`_ for details. + +.. note:: This minimum configuration for monitors assumes that a deployment + tool generates the ``fsid`` and the ``mon.`` key for you. + +Once you deploy a Ceph cluster, you **SHOULD NOT** change the IP address of +the monitors. However, if you decide to change the monitor's IP address, you +must follow a specific procedure. See `Changing a Monitor's IP Address`_ for +details. + +Monitors can also be found by clients using DNS SRV records. See `Monitor lookup through DNS`_ for details. + +Cluster ID +---------- + +Each Ceph Storage Cluster has a unique identifier (``fsid``). If specified, it +usually appears under the ``[global]`` section of the configuration file. +Deployment tools usually generate the ``fsid`` and store it in the monitor map, +so the value may not appear in a configuration file. The ``fsid`` makes it +possible to run daemons for multiple clusters on the same hardware. + +``fsid`` + +:Description: The cluster ID. One per cluster. +:Type: UUID +:Required: Yes. +:Default: N/A. May be generated by a deployment tool if not specified. + +.. note:: Do not set this value if you use a deployment tool that does + it for you. + + +.. 
index:: Ceph Monitor; initial members + +Initial Members +--------------- + +We recommend running a production Ceph Storage Cluster with at least three Ceph +Monitors to ensure high availability. When you run multiple monitors, you may +specify the initial monitors that must be members of the cluster in order to +establish a quorum. This may reduce the time it takes for your cluster to come +online. + +.. code-block:: ini + + [mon] + mon initial members = a,b,c + + +``mon initial members`` + +:Description: The IDs of initial monitors in a cluster during startup. If + specified, Ceph requires an odd number of monitors to form an + initial quorum (e.g., 3). + +:Type: String +:Default: None + +.. note:: A *majority* of monitors in your cluster must be able to reach + each other in order to establish a quorum. You can decrease the initial + number of monitors to establish a quorum with this setting. + +.. index:: Ceph Monitor; data path + +Data +---- + +Ceph provides a default path where Ceph Monitors store data. For optimal +performance in a production Ceph Storage Cluster, we recommend running Ceph +Monitors on separate hosts and drives from Ceph OSD Daemons. As leveldb is using +``mmap()`` for writing the data, Ceph Monitors flush their data from memory to disk +very often, which can interfere with Ceph OSD Daemon workloads if the data +store is co-located with the OSD Daemons. + +In Ceph versions 0.58 and earlier, Ceph Monitors store their data in files. This +approach allows users to inspect monitor data with common tools like ``ls`` +and ``cat``. However, it doesn't provide strong consistency. + +In Ceph versions 0.59 and later, Ceph Monitors store their data as key/value +pairs. Ceph Monitors require `ACID`_ transactions. Using a data store prevents +recovering Ceph Monitors from running corrupted versions through Paxos, and it +enables multiple modification operations in one single atomic batch, among other +advantages. + +Generally, we do not recommend changing the default data location. If you modify +the default location, we recommend that you make it uniform across Ceph Monitors +by setting it in the ``[mon]`` section of the configuration file. + + +``mon data`` + +:Description: The monitor's data location. +:Type: String +:Default: ``/var/lib/ceph/mon/$cluster-$id`` + + +``mon data size warn`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log when the monitor's data + store goes over 15GB. +:Type: Integer +:Default: 15*1024*1024*1024* + + +``mon data avail warn`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log when the available disk + space of monitor's data store is lower or equal to this + percentage. +:Type: Integer +:Default: 30 + + +``mon data avail crit`` + +:Description: Issue a ``HEALTH_ERR`` in cluster log when the available disk + space of monitor's data store is lower or equal to this + percentage. +:Type: Integer +:Default: 5 + + +``mon warn on cache pools without hit sets`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if a cache pool does not + have the hitset type set set. + See `hit set type <../operations/pools#hit-set-type>`_ for more + details. +:Type: Boolean +:Default: True + + +``mon warn on crush straw calc version zero`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH's + ``straw_calc_version`` is zero. See + `CRUSH map tunables <../operations/crush-map#tunables>`_ for + details. 
+:Type: Boolean
+:Default: True
+
+
+``mon warn on legacy crush tunables``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if the CRUSH tunables are
+              too old (older than ``mon_min_crush_required_version``).
+:Type: Boolean
+:Default: True
+
+
+``mon crush min required version``
+
+:Description: The minimum tunable profile version required by the cluster.
+              See `CRUSH map tunables <../operations/crush-map#tunables>`_
+              for details.
+:Type: String
+:Default: ``firefly``
+
+
+``mon warn on osd down out interval zero``
+
+:Description: Issue a ``HEALTH_WARN`` in cluster log if
+              ``mon osd down out interval`` is zero. Having this option set to
+              zero on the leader acts much like the ``noout`` flag. It is hard
+              to diagnose a cluster that behaves as if ``noout`` were set
+              without the flag actually being set, so we report a warning in
+              this case.
+:Type: Boolean
+:Default: True
+
+
+``mon cache target full warn ratio``
+
+:Description: The position between a pool's ``cache_target_full`` and
+              ``target_max_object`` at which we start warning.
+:Type: Float
+:Default: ``0.66``
+
+
+``mon health data update interval``
+
+:Description: How often (in seconds) a monitor in quorum shares its health
+              status with its peers. (A negative number disables it.)
+:Type: Float
+:Default: ``60``
+
+
+``mon health to clog``
+
+:Description: Enable sending a health summary to the cluster log periodically.
+:Type: Boolean
+:Default: True
+
+
+``mon health to clog tick interval``
+
+:Description: How often (in seconds) the monitor sends a health summary to the
+              cluster log (a non-positive number disables it). If the current
+              health summary is empty or identical to the previous one, the
+              monitor will not send it to the cluster log.
+:Type: Integer
+:Default: 3600
+
+
+``mon health to clog interval``
+
+:Description: How often (in seconds) the monitor sends a health summary to the
+              cluster log (a non-positive number disables it). The monitor
+              always sends the summary, whether or not it has changed.
+:Type: Integer
+:Default: 60
+
+
+
+.. index:: Ceph Storage Cluster; capacity planning, Ceph Monitor; capacity planning
+
+Storage Capacity
+----------------
+
+When a Ceph Storage Cluster gets close to its maximum capacity (i.e., ``mon osd
+full ratio``), Ceph prevents you from writing to or reading from Ceph OSD
+Daemons as a safety measure to prevent data loss. Therefore, letting a
+production Ceph Storage Cluster approach its full ratio is not a good practice,
+because it sacrifices high availability. The default full ratio is ``.95``, or
+95% of capacity. This is a very aggressive setting for a test cluster with a
+small number of OSDs.
+
+.. tip:: When monitoring your cluster, be alert to warnings related to the
+   ``nearfull`` ratio. A ``nearfull`` warning means that the failure of one or
+   more OSDs could result in a temporary service disruption. Consider adding
+   more OSDs to increase storage capacity.
+
+A common scenario for test clusters involves a system administrator removing a
+Ceph OSD Daemon from the Ceph Storage Cluster to watch the cluster rebalance;
+then removing another Ceph OSD Daemon, and so on, until the Ceph Storage
+Cluster eventually reaches the full ratio and locks up. We recommend a bit of
+capacity planning even with a test cluster. Planning enables you to gauge how
+much spare capacity you will need in order to maintain high availability.
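While you plan, it also helps to watch actual utilization as the cluster
fills. For example, the standard ``ceph`` CLI can report per-pool and per-OSD
usage (output omitted here)::

    ceph df
    ceph osd df
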
Ideally, you want +to plan for a series of Ceph OSD Daemon failures where the cluster can recover +to an ``active + clean`` state without replacing those Ceph OSD Daemons +immediately. You can run a cluster in an ``active + degraded`` state, but this +is not ideal for normal operating conditions. + +The following diagram depicts a simplistic Ceph Storage Cluster containing 33 +Ceph Nodes with one Ceph OSD Daemon per host, each Ceph OSD Daemon reading from +and writing to a 3TB drive. So this exemplary Ceph Storage Cluster has a maximum +actual capacity of 99TB. With a ``mon osd full ratio`` of ``0.95``, if the Ceph +Storage Cluster falls to 5TB of remaining capacity, the cluster will not allow +Ceph Clients to read and write data. So the Ceph Storage Cluster's operating +capacity is 95TB, not 99TB. + +.. ditaa:: + + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | Rack 1 | | Rack 2 | | Rack 3 | | Rack 4 | | Rack 5 | | Rack 6 | + | cCCC | | cF00 | | cCCC | | cCCC | | cCCC | | cCCC | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 1 | | OSD 7 | | OSD 13 | | OSD 19 | | OSD 25 | | OSD 31 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 2 | | OSD 8 | | OSD 14 | | OSD 20 | | OSD 26 | | OSD 32 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 3 | | OSD 9 | | OSD 15 | | OSD 21 | | OSD 27 | | OSD 33 | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 4 | | OSD 10 | | OSD 16 | | OSD 22 | | OSD 28 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 5 | | OSD 11 | | OSD 17 | | OSD 23 | | OSD 29 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + | OSD 6 | | OSD 12 | | OSD 18 | | OSD 24 | | OSD 30 | | Spare | + +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ + +It is normal in such a cluster for one or two OSDs to fail. A less frequent but +reasonable scenario involves a rack's router or power supply failing, which +brings down multiple OSDs simultaneously (e.g., OSDs 7-12). In such a scenario, +you should still strive for a cluster that can remain operational and achieve an +``active + clean`` state--even if that means adding a few hosts with additional +OSDs in short order. If your capacity utilization is too high, you may not lose +data, but you could still sacrifice data availability while resolving an outage +within a failure domain if capacity utilization of the cluster exceeds the full +ratio. For this reason, we recommend at least some rough capacity planning. + +Identify two numbers for your cluster: + +#. The number of OSDs. +#. The total capacity of the cluster + +If you divide the total capacity of your cluster by the number of OSDs in your +cluster, you will find the mean average capacity of an OSD within your cluster. +Consider multiplying that number by the number of OSDs you expect will fail +simultaneously during normal operations (a relatively small number). Finally +multiply the capacity of the cluster by the full ratio to arrive at a maximum +operating capacity; then, subtract the number of amount of data from the OSDs +you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing +process with a higher number of OSD failures (e.g., a rack of OSDs) to arrive at +a reasonable number for a near full ratio. + +.. 
code-block:: ini + + [global] + + mon osd full ratio = .80 + mon osd backfillfull ratio = .75 + mon osd nearfull ratio = .70 + + +``mon osd full ratio`` + +:Description: The percentage of disk space used before an OSD is + considered ``full``. + +:Type: Float +:Default: ``.95`` + + +``mon osd backfillfull ratio`` + +:Description: The percentage of disk space used before an OSD is + considered too ``full`` to backfill. + +:Type: Float +:Default: ``.90`` + + +``mon osd nearfull ratio`` + +:Description: The percentage of disk space used before an OSD is + considered ``nearfull``. + +:Type: Float +:Default: ``.85`` + + +.. tip:: If some OSDs are nearfull, but others have plenty of capacity, you + may have a problem with the CRUSH weight for the nearfull OSDs. + +.. index:: heartbeat + +Heartbeat +--------- + +Ceph monitors know about the cluster by requiring reports from each OSD, and by +receiving reports from OSDs about the status of their neighboring OSDs. Ceph +provides reasonable default settings for monitor/OSD interaction; however, you +may modify them as needed. See `Monitor/OSD Interaction`_ for details. + + +.. index:: Ceph Monitor; leader, Ceph Monitor; provider, Ceph Monitor; requester, Ceph Monitor; synchronization + +Monitor Store Synchronization +----------------------------- + +When you run a production cluster with multiple monitors (recommended), each +monitor checks to see if a neighboring monitor has a more recent version of the +cluster map (e.g., a map in a neighboring monitor with one or more epoch numbers +higher than the most current epoch in the map of the instant monitor). +Periodically, one monitor in the cluster may fall behind the other monitors to +the point where it must leave the quorum, synchronize to retrieve the most +current information about the cluster, and then rejoin the quorum. For the +purposes of synchronization, monitors may assume one of three roles: + +#. **Leader**: The `Leader` is the first monitor to achieve the most recent + Paxos version of the cluster map. + +#. **Provider**: The `Provider` is a monitor that has the most recent version + of the cluster map, but wasn't the first to achieve the most recent version. + +#. **Requester:** A `Requester` is a monitor that has fallen behind the leader + and must synchronize in order to retrieve the most recent information about + the cluster before it can rejoin the quorum. + +These roles enable a leader to delegate synchronization duties to a provider, +which prevents synchronization requests from overloading the leader--improving +performance. In the following diagram, the requester has learned that it has +fallen behind the other monitors. The requester asks the leader to synchronize, +and the leader tells the requester to synchronize with a provider. + + +.. ditaa:: +-----------+ +---------+ +----------+ + | Requester | | Leader | | Provider | + +-----------+ +---------+ +----------+ + | | | + | | | + | Ask to Synchronize | | + |------------------->| | + | | | + |<-------------------| | + | Tell Requester to | | + | Sync with Provider | | + | | | + | Synchronize | + |--------------------+-------------------->| + | | | + |<-------------------+---------------------| + | Send Chunk to Requester | + | (repeat as necessary) | + | Requester Acks Chuck to Provider | + |--------------------+-------------------->| + | | + | Sync Complete | + | Notification | + |------------------->| + | | + |<-------------------| + | Ack | + | | + + +Synchronization always occurs when a new monitor joins the cluster. 
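Synchronization itself is transparent, but you can see which monitors are
currently in quorum, and notice one that has dropped out to synchronize, by
querying the cluster; for example::

    ceph quorum_status --format json-pretty
    ceph mon stat
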
During +runtime operations, monitors may receive updates to the cluster map at different +times. This means the leader and provider roles may migrate from one monitor to +another. If this happens while synchronizing (e.g., a provider falls behind the +leader), the provider can terminate synchronization with a requester. + +Once synchronization is complete, Ceph requires trimming across the cluster. +Trimming requires that the placement groups are ``active + clean``. + + +``mon sync trim timeout`` + +:Description: +:Type: Double +:Default: ``30.0`` + + +``mon sync heartbeat timeout`` + +:Description: +:Type: Double +:Default: ``30.0`` + + +``mon sync heartbeat interval`` + +:Description: +:Type: Double +:Default: ``5.0`` + + +``mon sync backoff timeout`` + +:Description: +:Type: Double +:Default: ``30.0`` + + +``mon sync timeout`` + +:Description: Number of seconds the monitor will wait for the next update + message from its sync provider before it gives up and bootstrap + again. +:Type: Double +:Default: ``30.0`` + + +``mon sync max retries`` + +:Description: +:Type: Integer +:Default: ``5`` + + +``mon sync max payload size`` + +:Description: The maximum size for a sync payload (in bytes). +:Type: 32-bit Integer +:Default: ``1045676`` + + +``paxos max join drift`` + +:Description: The maximum Paxos iterations before we must first sync the + monitor data stores. When a monitor finds that its peer is too + far ahead of it, it will first sync with data stores before moving + on. +:Type: Integer +:Default: ``10`` + +``paxos stash full interval`` + +:Description: How often (in commits) to stash a full copy of the PaxosService state. + Current this setting only affects ``mds``, ``mon``, ``auth`` and ``mgr`` + PaxosServices. +:Type: Integer +:Default: 25 + +``paxos propose interval`` + +:Description: Gather updates for this time interval before proposing + a map update. +:Type: Double +:Default: ``1.0`` + + +``paxos min`` + +:Description: The minimum number of paxos states to keep around +:Type: Integer +:Default: 500 + + +``paxos min wait`` + +:Description: The minimum amount of time to gather updates after a period of + inactivity. +:Type: Double +:Default: ``0.05`` + + +``paxos trim min`` + +:Description: Number of extra proposals tolerated before trimming +:Type: Integer +:Default: 250 + + +``paxos trim max`` + +:Description: The maximum number of extra proposals to trim at a time +:Type: Integer +:Default: 500 + + +``paxos service trim min`` + +:Description: The minimum amount of versions to trigger a trim (0 disables it) +:Type: Integer +:Default: 250 + + +``paxos service trim max`` + +:Description: The maximum amount of versions to trim during a single proposal (0 disables it) +:Type: Integer +:Default: 500 + + +``mon max log epochs`` + +:Description: The maximum amount of log epochs to trim during a single proposal +:Type: Integer +:Default: 500 + + +``mon max pgmap epochs`` + +:Description: The maximum amount of pgmap epochs to trim during a single proposal +:Type: Integer +:Default: 500 + + +``mon mds force trim to`` + +:Description: Force monitor to trim mdsmaps to this point (0 disables it. + dangerous, use with care) +:Type: Integer +:Default: 0 + + +``mon osd force trim to`` + +:Description: Force monitor to trim osdmaps to this point, even if there is + PGs not clean at the specified epoch (0 disables it. 
dangerous, + use with care) +:Type: Integer +:Default: 0 + +``mon osd cache size`` + +:Description: The size of osdmaps cache, not to rely on underlying store's cache +:Type: Integer +:Default: 10 + + +``mon election timeout`` + +:Description: On election proposer, maximum waiting time for all ACKs in seconds. +:Type: Float +:Default: ``5`` + + +``mon lease`` + +:Description: The length (in seconds) of the lease on the monitor's versions. +:Type: Float +:Default: ``5`` + + +``mon lease renew interval factor`` + +:Description: ``mon lease`` \* ``mon lease renew interval factor`` will be the + interval for the Leader to renew the other monitor's leases. The + factor should be less than ``1.0``. +:Type: Float +:Default: ``0.6`` + + +``mon lease ack timeout factor`` + +:Description: The Leader will wait ``mon lease`` \* ``mon lease ack timeout factor`` + for the Providers to acknowledge the lease extension. +:Type: Float +:Default: ``2.0`` + + +``mon accept timeout factor`` + +:Description: The Leader will wait ``mon lease`` \* ``mon accept timeout factor`` + for the Requester(s) to accept a Paxos update. It is also used + during the Paxos recovery phase for similar purposes. +:Type: Float +:Default: ``2.0`` + + +``mon min osdmap epochs`` + +:Description: Minimum number of OSD map epochs to keep at all times. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon max pgmap epochs`` + +:Description: Maximum number of PG map epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + +``mon max log epochs`` + +:Description: Maximum number of Log epochs the monitor should keep. +:Type: 32-bit Integer +:Default: ``500`` + + + +.. index:: Ceph Monitor; clock + +Clock +----- + +Ceph daemons pass critical messages to each other, which must be processed +before daemons reach a timeout threshold. If the clocks in Ceph monitors +are not synchronized, it can lead to a number of anomalies. For example: + +- Daemons ignoring received messages (e.g., timestamps outdated) +- Timeouts triggered too soon/late when a message wasn't received in time. + +See `Monitor Store Synchronization`_ for details. + + +.. tip:: You SHOULD install NTP on your Ceph monitor hosts to + ensure that the monitor cluster operates with synchronized clocks. + +Clock drift may still be noticeable with NTP even though the discrepancy is not +yet harmful. Ceph's clock drift / clock skew warnings may get triggered even +though NTP maintains a reasonable level of synchronization. Increasing your +clock drift may be tolerable under such circumstances; however, a number of +factors such as workload, network latency, configuring overrides to default +timeouts and the `Monitor Store Synchronization`_ settings may influence +the level of acceptable clock drift without compromising Paxos guarantees. + +Ceph provides the following tunable options to allow you to find +acceptable values. + + +``clock offset`` + +:Description: How much to offset the system clock. See ``Clock.cc`` for details. +:Type: Double +:Default: ``0`` + + +.. deprecated:: 0.58 + +``mon tick interval`` + +:Description: A monitor's tick interval in seconds. +:Type: 32-bit Integer +:Default: ``5`` + + +``mon clock drift allowed`` + +:Description: The clock drift in seconds allowed between monitors. 
+:Type: Float +:Default: ``.050`` + + +``mon clock drift warn backoff`` + +:Description: Exponential backoff for clock drift warnings +:Type: Float +:Default: ``5`` + + +``mon timecheck interval`` + +:Description: The time check interval (clock drift check) in seconds + for the Leader. + +:Type: Float +:Default: ``300.0`` + + +``mon timecheck skew interval`` + +:Description: The time check interval (clock drift check) in seconds when in + presence of a skew in seconds for the Leader. +:Type: Float +:Default: ``30.0`` + + +Client +------ + +``mon client hunt interval`` + +:Description: The client will try a new monitor every ``N`` seconds until it + establishes a connection. + +:Type: Double +:Default: ``3.0`` + + +``mon client ping interval`` + +:Description: The client will ping the monitor every ``N`` seconds. +:Type: Double +:Default: ``10.0`` + + +``mon client max log entries per message`` + +:Description: The maximum number of log entries a monitor will generate + per client message. + +:Type: Integer +:Default: ``1000`` + + +``mon client bytes`` + +:Description: The amount of client message data allowed in memory (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``100ul << 20`` + + +Pool settings +============= +Since version v0.94 there is support for pool flags which allow or disallow changes to be made to pools. + +Monitors can also disallow removal of pools if configured that way. + +``mon allow pool delete`` + +:Description: If the monitors should allow pools to be removed. Regardless of what the pool flags say. +:Type: Boolean +:Default: ``false`` + +``osd pool default flag hashpspool`` + +:Description: Set the hashpspool flag on new pools +:Type: Boolean +:Default: ``true`` + +``osd pool default flag nodelete`` + +:Description: Set the nodelete flag on new pools. Prevents allow pool removal with this flag in any way. +:Type: Boolean +:Default: ``false`` + +``osd pool default flag nopgchange`` + +:Description: Set the nopgchange flag on new pools. Does not allow the number of PGs to be changed for a pool. +:Type: Boolean +:Default: ``false`` + +``osd pool default flag nosizechange`` + +:Description: Set the nosizechange flag on new pools. Does not allow the size to be changed of pool. +:Type: Boolean +:Default: ``false`` + +For more information about the pool flags see `Pool values`_. + +Miscellaneous +============= + + +``mon max osd`` + +:Description: The maximum number of OSDs allowed in the cluster. +:Type: 32-bit Integer +:Default: ``10000`` + +``mon globalid prealloc`` + +:Description: The number of global IDs to pre-allocate for clients and daemons in the cluster. +:Type: 32-bit Integer +:Default: ``100`` + +``mon subscribe interval`` + +:Description: The refresh interval (in seconds) for subscriptions. The + subscription mechanism enables obtaining the cluster maps + and log information. + +:Type: Double +:Default: ``300`` + + +``mon stat smooth intervals`` + +:Description: Ceph will smooth statistics over the last ``N`` PG maps. +:Type: Integer +:Default: ``2`` + + +``mon probe timeout`` + +:Description: Number of seconds the monitor will wait to find peers before bootstrapping. +:Type: Double +:Default: ``2.0`` + + +``mon daemon bytes`` + +:Description: The message memory cap for metadata server and OSD messages (in bytes). +:Type: 64-bit Integer Unsigned +:Default: ``400ul << 20`` + + +``mon max log entries per event`` + +:Description: The maximum number of log entries per event. 
+:Type: Integer +:Default: ``4096`` + + +``mon osd prime pg temp`` + +:Description: Enables or disable priming the PGMap with the previous OSDs when an out + OSD comes back into the cluster. With the ``true`` setting the clients + will continue to use the previous OSDs until the newly in OSDs as that + PG peered. +:Type: Boolean +:Default: ``true`` + + +``mon osd prime pg temp max time`` + +:Description: How much time in seconds the monitor should spend trying to prime the + PGMap when an out OSD comes back into the cluster. +:Type: Float +:Default: ``0.5`` + + +``mon osd prime pg temp max time estimate`` + +:Description: Maximum estimate of time spent on each PG before we prime all PGs + in parallel. +:Type: Float +:Default: ``0.25`` + + +``mon osd allow primary affinity`` + +:Description: allow ``primary_affinity`` to be set in the osdmap. +:Type: Boolean +:Default: False + + +``mon osd pool ec fast read`` + +:Description: Whether turn on fast read on the pool or not. It will be used as + the default setting of newly created erasure pools if ``fast_read`` + is not specified at create time. +:Type: Boolean +:Default: False + + +``mon mds skip sanity`` + +:Description: Skip safety assertions on FSMap (in case of bugs where we want to + continue anyway). Monitor terminates if the FSMap sanity check + fails, but we can disable it by enabling this option. +:Type: Boolean +:Default: False + + +``mon max mdsmap epochs`` + +:Description: The maximum amount of mdsmap epochs to trim during a single proposal. +:Type: Integer +:Default: 500 + + +``mon config key max entry size`` + +:Description: The maximum size of config-key entry (in bytes) +:Type: Integer +:Default: 4096 + + +``mon scrub interval`` + +:Description: How often (in seconds) the monitor scrub its store by comparing + the stored checksums with the computed ones of all the stored + keys. +:Type: Integer +:Default: 3600*24 + + +``mon scrub max keys`` + +:Description: The maximum number of keys to scrub each time. +:Type: Integer +:Default: 100 + + +``mon compact on start`` + +:Description: Compact the database used as Ceph Monitor store on + ``ceph-mon`` start. A manual compaction helps to shrink the + monitor database and improve the performance of it if the regular + compaction fails to work. +:Type: Boolean +:Default: False + + +``mon compact on bootstrap`` + +:Description: Compact the database used as Ceph Monitor store on + on bootstrap. Monitor starts probing each other for creating + a quorum after bootstrap. If it times out before joining the + quorum, it will start over and bootstrap itself again. +:Type: Boolean +:Default: False + + +``mon compact on trim`` + +:Description: Compact a certain prefix (including paxos) when we trim its old states. +:Type: Boolean +:Default: True + + +``mon cpu threads`` + +:Description: Number of threads for performing CPU intensive work on monitor. +:Type: Boolean +:Default: True + + +``mon osd mapping pgs per chunk`` + +:Description: We calculate the mapping from placement group to OSDs in chunks. + This option specifies the number of placement groups per chunk. +:Type: Integer +:Default: 4096 + + +``mon osd max split count`` + +:Description: Largest number of PGs per "involved" OSD to let split create. + When we increase the ``pg_num`` of a pool, the placement groups + will be splitted on all OSDs serving that pool. We want to avoid + extreme multipliers on PG splits. 
+:Type: Integer
+:Default: 300
+
+
+``mon session timeout``
+
+:Description: The monitor will terminate inactive sessions that stay idle
+              beyond this time limit.
+:Type: Integer
+:Default: 300
+
+
+
+.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
+.. _Monitor Keyrings: ../../../dev/mon-bootstrap#secret-keys
+.. _Ceph configuration file: ../ceph-conf/#monitors
+.. _Network Configuration Reference: ../network-config-ref
+.. _Monitor lookup through DNS: ../mon-lookup-dns
+.. _ACID: http://en.wikipedia.org/wiki/ACID
+.. _Adding/Removing a Monitor: ../../operations/add-or-rm-mons
+.. _Add/Remove a Monitor (ceph-deploy): ../../deployment/ceph-deploy-mon
+.. _Monitoring a Cluster: ../../operations/monitoring
+.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg
+.. _Bootstrapping a Monitor: ../../../dev/mon-bootstrap
+.. _Changing a Monitor's IP Address: ../../operations/add-or-rm-mons#changing-a-monitor-s-ip-address
+.. _Monitor/OSD Interaction: ../mon-osd-interaction
+.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability
+.. _Pool values: ../../operations/pools/#set-pool-values
diff --git a/src/ceph/doc/rados/configuration/mon-lookup-dns.rst b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst
new file mode 100644
index 0000000..e32b320
--- /dev/null
+++ b/src/ceph/doc/rados/configuration/mon-lookup-dns.rst
@@ -0,0 +1,51 @@
+===============================
+Looking up Monitors through DNS
+===============================
+
+Since version 11.0.0, RADOS supports looking up Monitors through DNS.
+
+This way, daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file.
+
+Using DNS SRV TCP records, clients are able to look up the monitors.
+
+This allows for less configuration on clients and monitors. Using a DNS update, clients and daemons can be made aware of changes in the monitor topology.
+
+By default, clients and daemons will look for the TCP service called *ceph-mon*, which is configured by the *mon_dns_srv_name* configuration directive.
+
+
+``mon dns srv name``
+
+:Description: The service name used when querying the DNS for the monitor hosts/addresses.
+:Type: String
+:Default: ``ceph-mon``
+
+Example
+-------
+When the DNS search domain is set to *example.com*, a DNS zone file might contain the following elements.
+
+First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA).
+
+::
+
+    mon1.example.com. AAAA 2001:db8::100
+    mon2.example.com. AAAA 2001:db8::200
+    mon3.example.com. AAAA 2001:db8::300
+
+::
+
+    mon1.example.com. A 192.168.0.1
+    mon2.example.com. A 192.168.0.2
+    mon3.example.com. A 192.168.0.3
+
+
+With those records in place, we can create the SRV TCP records with the name *ceph-mon* pointing to the three Monitors.
+
+::
+
+    _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon1.example.com.
+    _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon2.example.com.
+    _ceph-mon._tcp.example.com. 60 IN SRV 10 60 6789 mon3.example.com.
+
+In this case the Monitors are running on port *6789*, and their priority and weight are *10* and *60* respectively.
+
+The current implementation in clients and daemons will *only* respect the priority set in SRV records, and it will only connect to the monitors with the lowest-numbered priority. Targets with the same priority will be selected at random.
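A quick way to verify that the SRV records resolve as expected is an ordinary
DNS query, and a client's ``ceph.conf`` can then omit the ``mon host`` and
``mon addr`` directives entirely. Both snippets below are only a sketch; the
zone and the ``fsid`` are placeholders for your own values::

    dig +short SRV _ceph-mon._tcp.example.com

.. code-block:: ini

    [global]
    fsid = {fsid}
    mon dns srv name = ceph-mon
    # no "mon host" or "mon addr" needed; monitors are found via the SRV records
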
diff --git a/src/ceph/doc/rados/configuration/mon-osd-interaction.rst b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst new file mode 100644 index 0000000..e335ff0 --- /dev/null +++ b/src/ceph/doc/rados/configuration/mon-osd-interaction.rst @@ -0,0 +1,408 @@ +===================================== + Configuring Monitor/OSD Interaction +===================================== + +.. index:: heartbeat + +After you have completed your initial Ceph configuration, you may deploy and run +Ceph. When you execute a command such as ``ceph health`` or ``ceph -s``, the +:term:`Ceph Monitor` reports on the current state of the :term:`Ceph Storage +Cluster`. The Ceph Monitor knows about the Ceph Storage Cluster by requiring +reports from each :term:`Ceph OSD Daemon`, and by receiving reports from Ceph +OSD Daemons about the status of their neighboring Ceph OSD Daemons. If the Ceph +Monitor doesn't receive reports, or if it receives reports of changes in the +Ceph Storage Cluster, the Ceph Monitor updates the status of the :term:`Ceph +Cluster Map`. + +Ceph provides reasonable default settings for Ceph Monitor/Ceph OSD Daemon +interaction. However, you may override the defaults. The following sections +describe how Ceph Monitors and Ceph OSD Daemons interact for the purposes of +monitoring the Ceph Storage Cluster. + +.. index:: heartbeat interval + +OSDs Check Heartbeats +===================== + +Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 +seconds. You can change the heartbeat interval by adding an ``osd heartbeat +interval`` setting under the ``[osd]`` section of your Ceph configuration file, +or by setting the value at runtime. If a neighboring Ceph OSD Daemon doesn't +show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may +consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph +Monitor, which will update the Ceph Cluster Map. You may change this grace +period by adding an ``osd heartbeat grace`` setting under the ``[mon]`` +and ``[osd]`` or ``[global]`` section of your Ceph configuration file, +or by setting the value at runtime. + + +.. ditaa:: +---------+ +---------+ + | OSD 1 | | OSD 2 | + +---------+ +---------+ + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |<-------------------| + | Heart Beating | + | | + |----+ Heartbeat | + | | Interval | + |<---+ Exceeded | + | | + | Check | + | Heartbeat | + |------------------->| + | | + |----+ Grace | + | | Period | + |<---+ Exceeded | + | | + |----+ Mark | + | | OSD 2 | + |<---+ Down | + + +.. index:: OSD down report + +OSDs Report Down OSDs +===================== + +By default, two Ceph OSD Daemons from different hosts must report to the Ceph +Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors +acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance +that all the OSDs reporting the failure are hosted in a rack with a bad switch +which has trouble connecting to another OSD. To avoid this sort of false alarm, +we consider the peers reporting a failure a proxy for a potential "subcluster" +over the overall cluster that is similarly laggy. This is clearly not true in +all cases, but will sometimes help us localize the grace correction to a subset +of the system that is unhappy. ``mon osd reporter subtree level`` is used to +group the peers into the "subcluster" by their common ancestor type in CRUSH +map. 
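For example, to require failure reports from OSDs under at least two different
racks before an OSD is marked ``down``, a sketch of the relevant ``[mon]``
settings (adjust to your own CRUSH hierarchy) might look like this:

.. code-block:: ini

    [mon]
    mon osd min down reporters = 2
    mon osd reporter subtree level = rack
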
By default, only two reports from different subtree are required to report +another Ceph OSD Daemon ``down``. You can change the number of reporters from +unique subtrees and the common ancestor type required to report a Ceph OSD +Daemon ``down`` to a Ceph Monitor by adding an ``mon osd min down reporters`` +and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of +your Ceph configuration file, or by setting the value at runtime. + + +.. ditaa:: +---------+ +---------+ +---------+ + | OSD 1 | | OSD 2 | | Monitor | + +---------+ +---------+ +---------+ + | | | + | OSD 3 Is Down | | + |---------------+--------------->| + | | | + | | | + | | OSD 3 Is Down | + | |--------------->| + | | | + | | | + | | |---------+ Mark + | | | | OSD 3 + | | |<--------+ Down + + +.. index:: peering failure + +OSDs Report Peering Failure +=========================== + +If a Ceph OSD Daemon cannot peer with any of the Ceph OSD Daemons defined in its +Ceph configuration file (or the cluster map), it will ping a Ceph Monitor for +the most recent copy of the cluster map every 30 seconds. You can change the +Ceph Monitor heartbeat interval by adding an ``osd mon heartbeat interval`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. + +.. ditaa:: +---------+ +---------+ +-------+ +---------+ + | OSD 1 | | OSD 2 | | OSD 3 | | Monitor | + +---------+ +---------+ +-------+ +---------+ + | | | | + | Request To | | | + | Peer | | | + |-------------->| | | + |<--------------| | | + | Peering | | + | | | + | Request To | | + | Peer | | + |----------------------------->| | + | | + |----+ OSD Monitor | + | | Heartbeat | + |<---+ Interval Exceeded | + | | + | Failed to Peer with OSD 3 | + |-------------------------------------------->| + |<--------------------------------------------| + | Receive New Cluster Map | + + +.. index:: OSD status + +OSDs Report Their Status +======================== + +If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will +consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout`` +elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor when a reportable +event such as a failure, a change in placement group stats, a change in +``up_thru`` or when it boots within 5 seconds. You can change the Ceph OSD +Daemon minimum report interval by adding an ``osd mon report interval min`` +setting under the ``[osd]`` section of your Ceph configuration file, or by +setting the value at runtime. A Ceph OSD Daemon sends a report to a Ceph +Monitor every 120 seconds irrespective of whether any notable changes occur. +You can change the Ceph Monitor report interval by adding an ``osd mon report +interval max`` setting under the ``[osd]`` section of your Ceph configuration +file, or by setting the value at runtime. + + +.. 
ditaa:: +---------+ +---------+ + | OSD 1 | | Monitor | + +---------+ +---------+ + | | + |----+ Report Min | + | | Interval | + |<---+ Exceeded | + | | + |----+ Reportable | + | | Event | + |<---+ Occurs | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Report Max | + | | Interval | + |<---+ Exceeded | + | | + | Report To | + | Monitor | + |------------------->| + | | + |----+ Monitor | + | | Fails | + |<---+ | + +----+ Monitor OSD + | | Report Timeout + |<---+ Exceeded + | + +----+ Mark + | | OSD 1 + |<---+ Down + + + + +Configuration Settings +====================== + +When modifying heartbeat settings, you should include them in the ``[global]`` +section of your configuration file. + +.. index:: monitor heartbeat + +Monitor Settings +---------------- + +``mon osd min up ratio`` + +:Description: The minimum ratio of ``up`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``down``. + +:Type: Double +:Default: ``.3`` + + +``mon osd min in ratio`` + +:Description: The minimum ratio of ``in`` Ceph OSD Daemons before Ceph will + mark Ceph OSD Daemons ``out``. + +:Type: Double +:Default: ``.75`` + + +``mon osd laggy halflife`` + +:Description: The number of seconds laggy estimates will decay. +:Type: Integer +:Default: ``60*60`` + + +``mon osd laggy weight`` + +:Description: The weight for new samples in laggy estimation decay. +:Type: Double +:Default: ``0.3`` + + + +``mon osd laggy max interval`` + +:Description: Maximum value of ``laggy_interval`` in laggy estimations (in seconds). + Monitor uses an adaptive approach to evaluate the ``laggy_interval`` of + a certain OSD. This value will be used to calculate the grace time for + that OSD. +:Type: Integer +:Default: 300 + +``mon osd adjust heartbeat grace`` + +:Description: If set to ``true``, Ceph will scale based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd adjust down out interval`` + +:Description: If set to ``true``, Ceph will scaled based on laggy estimations. +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark in`` + +:Description: Ceph will mark any booting Ceph OSD Daemons as ``in`` + the Ceph Storage Cluster. + +:Type: Boolean +:Default: ``false`` + + +``mon osd auto mark auto out in`` + +:Description: Ceph will mark booting Ceph OSD Daemons auto marked ``out`` + of the Ceph Storage Cluster as ``in`` the cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd auto mark new in`` + +:Description: Ceph will mark booting new Ceph OSD Daemons as ``in`` the + Ceph Storage Cluster. + +:Type: Boolean +:Default: ``true`` + + +``mon osd down out interval`` + +:Description: The number of seconds Ceph waits before marking a Ceph OSD Daemon + ``down`` and ``out`` if it doesn't respond. + +:Type: 32-bit Integer +:Default: ``600`` + + +``mon osd down out subtree limit`` + +:Description: The smallest :term:`CRUSH` unit type that Ceph will **not** + automatically mark out. For instance, if set to ``host`` and if + all OSDs of a host are down, Ceph will not automatically mark out + these OSDs. + +:Type: String +:Default: ``rack`` + + +``mon osd report timeout`` + +:Description: The grace period in seconds before declaring + unresponsive Ceph OSD Daemons ``down``. + +:Type: 32-bit Integer +:Default: ``900`` + +``mon osd min down reporters`` + +:Description: The minimum number of Ceph OSD Daemons required to report a + ``down`` Ceph OSD Daemon. 
+ +:Type: 32-bit Integer +:Default: ``2`` + + +``mon osd reporter subtree level`` + +:Description: In which level of parent bucket the reporters are counted. The OSDs + send failure reports to monitor if they find its peer is not responsive. + And monitor mark the reported OSD out and then down after a grace period. +:Type: String +:Default: ``host`` + + +.. index:: OSD hearbeat + +OSD Settings +------------ + +``osd heartbeat address`` + +:Description: An Ceph OSD Daemon's network address for heartbeats. +:Type: Address +:Default: The host address. + + +``osd heartbeat interval`` + +:Description: How often an Ceph OSD Daemon pings its peers (in seconds). +:Type: 32-bit Integer +:Default: ``6`` + + +``osd heartbeat grace`` + +:Description: The elapsed time when a Ceph OSD Daemon hasn't shown a heartbeat + that the Ceph Storage Cluster considers it ``down``. + This setting has to be set in both the [mon] and [osd] or [global] + section so that it is read by both the MON and OSD daemons. +:Type: 32-bit Integer +:Default: ``20`` + + +``osd mon heartbeat interval`` + +:Description: How often the Ceph OSD Daemon pings a Ceph Monitor if it has no + Ceph OSD Daemon peers. + +:Type: 32-bit Integer +:Default: ``30`` + + +``osd mon report interval max`` + +:Description: The maximum time in seconds that a Ceph OSD Daemon can wait before + it must report to a Ceph Monitor. + +:Type: 32-bit Integer +:Default: ``120`` + + +``osd mon report interval min`` + +:Description: The minimum number of seconds a Ceph OSD Daemon may wait + from startup or another reportable event before reporting + to a Ceph Monitor. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: Should be less than ``osd mon report interval max`` + + +``osd mon ack timeout`` + +:Description: The number of seconds to wait for a Ceph Monitor to acknowledge a + request for statistics. + +:Type: 32-bit Integer +:Default: ``30`` diff --git a/src/ceph/doc/rados/configuration/ms-ref.rst b/src/ceph/doc/rados/configuration/ms-ref.rst new file mode 100644 index 0000000..55d009e --- /dev/null +++ b/src/ceph/doc/rados/configuration/ms-ref.rst @@ -0,0 +1,154 @@ +=========== + Messaging +=========== + +General Settings +================ + +``ms tcp nodelay`` + +:Description: Disables nagle's algorithm on messenger tcp sessions. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms initial backoff`` + +:Description: The initial time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``.2`` + + +``ms max backoff`` + +:Description: The maximum time to wait before reconnecting on a fault. +:Type: Double +:Required: No +:Default: ``15.0`` + + +``ms nocrc`` + +:Description: Disables crc on network messages. May increase performance if cpu limited. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms die on bad msg`` + +:Description: Debug option; do not configure. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms dispatch throttle bytes`` + +:Description: Throttles total size of messages waiting to be dispatched. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``100 << 20`` + + +``ms bind ipv6`` + +:Description: Enable if you want your daemons to bind to IPv6 address instead of IPv4 ones. (Not required if you specify a daemon or cluster IP.) +:Type: Boolean +:Required: No +:Default: ``false`` + + +``ms rwthread stack bytes`` + +:Description: Debug option for stack size; do not configure. 
+:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``1024 << 10`` + + +``ms tcp read timeout`` + +:Description: Controls how long (in seconds) the messenger will wait before closing an idle connection. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``900`` + + +``ms inject socket failures`` + +:Description: Debug option; do not configure. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``0`` + +Async messenger options +======================= + + +``ms async transport type`` + +:Description: Transport type used by Async Messenger. Can be ``posix``, ``dpdk`` + or ``rdma``. Posix uses standard TCP/IP networking and is default. + Other transports may be experimental and support may be limited. +:Type: String +:Required: No +:Default: ``posix`` + + +``ms async op threads`` + +:Description: Initial number of worker threads used by each Async Messenger instance. + Should be at least equal to highest number of replicas, but you can + decrease it if you are low on CPU core count and/or you host a lot of + OSDs on single server. +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``3`` + + +``ms async max op threads`` + +:Description: Maximum number of worker threads used by each Async Messenger instance. + Set to lower values when your machine has limited CPU count, and increase + when your CPUs are underutilized (i. e. one or more of CPUs are + constantly on 100% load during I/O operations). +:Type: 64-bit Unsigned Integer +:Required: No +:Default: ``5`` + + +``ms async set affinity`` + +:Description: Set to true to bind Async Messenger workers to particular CPU cores. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``ms async affinity cores`` + +:Description: When ``ms async set affinity`` is true, this string specifies how Async + Messenger workers are bound to CPU cores. For example, "0,2" will bind + workers #1 and #2 to CPU cores #0 and #2, respectively. + NOTE: when manually setting affinity, make sure to not assign workers to + processors that are virtual CPUs created as an effect of Hyperthreading + or similar technology, because they are slower than regular CPU cores. +:Type: String +:Required: No +:Default: ``(empty)`` + + +``ms async send inline`` + +:Description: Send messages directly from the thread that generated them instead of + queuing and sending from Async Messenger thread. This option is known + to decrease performance on systems with a lot of CPU cores, so it's + disabled by default. +:Type: Boolean +:Required: No +:Default: ``false`` + + diff --git a/src/ceph/doc/rados/configuration/network-config-ref.rst b/src/ceph/doc/rados/configuration/network-config-ref.rst new file mode 100644 index 0000000..2d7f9d6 --- /dev/null +++ b/src/ceph/doc/rados/configuration/network-config-ref.rst @@ -0,0 +1,494 @@ +================================= + Network Configuration Reference +================================= + +Network configuration is critical for building a high performance :term:`Ceph +Storage Cluster`. The Ceph Storage Cluster does not perform request routing or +dispatching on behalf of the :term:`Ceph Client`. Instead, Ceph Clients make +requests directly to Ceph OSD Daemons. Ceph OSD Daemons perform data replication +on behalf of Ceph Clients, which means replication and other factors impose +additional loads on Ceph Storage Cluster networks. + +Our Quick Start configurations provide a trivial `Ceph configuration file`_ that +sets monitor IP addresses and daemon host names only. 
Unless you specify a +cluster network, Ceph assumes a single "public" network. Ceph functions just +fine with a public network only, but you may see significant performance +improvement with a second "cluster" network in a large cluster. + +We recommend running a Ceph Storage Cluster with two networks: a public +(front-side) network and a cluster (back-side) network. To support two networks, +each :term:`Ceph Node` will need to have more than one NIC. See `Hardware +Recommendations - Networks`_ for additional details. + +.. ditaa:: + +-------------+ + | Ceph Client | + +----*--*-----+ + | ^ + Request | : Response + v | + /----------------------------------*--*-------------------------------------\ + | Public Network | + \---*--*------------*--*-------------*--*------------*--*------------*--*---/ + ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ + | | | | | | | | | | + | : | : | : | : | : + v v v v v v v v v v + +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ +---*--*---+ + | Ceph MON | | Ceph MDS | | Ceph OSD | | Ceph OSD | | Ceph OSD | + +----------+ +----------+ +---*--*---+ +---*--*---+ +---*--*---+ + ^ ^ ^ ^ ^ ^ + The cluster network relieves | | | | | | + OSD replication and heartbeat | : | : | : + traffic from the public network. v v v v v v + /------------------------------------*--*------------*--*------------*--*---\ + | cCCC Cluster Network | + \---------------------------------------------------------------------------/ + + +There are several reasons to consider operating two separate networks: + +#. **Performance:** Ceph OSD Daemons handle data replication for the Ceph + Clients. When Ceph OSD Daemons replicate data more than once, the network + load between Ceph OSD Daemons easily dwarfs the network load between Ceph + Clients and the Ceph Storage Cluster. This can introduce latency and + create a performance problem. Recovery and rebalancing can + also introduce significant latency on the public network. See + `Scalability and High Availability`_ for additional details on how Ceph + replicates data. See `Monitor / OSD Interaction`_ for details on heartbeat + traffic. + +#. **Security**: While most people are generally civil, a very tiny segment of + the population likes to engage in what's known as a Denial of Service (DoS) + attack. When traffic between Ceph OSD Daemons gets disrupted, placement + groups may no longer reflect an ``active + clean`` state, which may prevent + users from reading and writing data. A great way to defeat this type of + attack is to maintain a completely separate cluster network that doesn't + connect directly to the internet. Also, consider using `Message Signatures`_ + to defeat spoofing attacks. + + +IP Tables +========= + +By default, daemons `bind`_ to ports within the ``6800:7300`` range. You may +configure this range at your discretion. Before configuring your IP tables, +check the default ``iptables`` configuration. + + sudo iptables -L + +Some Linux distributions include rules that reject all inbound requests +except SSH from all network interfaces. For example:: + + REJECT all -- anywhere anywhere reject-with icmp-host-prohibited + +You will need to delete these rules on both your public and cluster networks +initially, and replace them with appropriate rules when you are ready to +harden the ports on your Ceph Nodes. + + +Monitor IP Tables +----------------- + +Ceph Monitors listen on port ``6789`` by default. Additionally, Ceph Monitors +always operate on the public network. 
When you add the rule using the example +below, make sure you replace ``{iface}`` with the public network interface +(e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP address of the +public network and ``{netmask}`` with the netmask for the public network. :: + + sudo iptables -A INPUT -i {iface} -p tcp -s {ip-address}/{netmask} --dport 6789 -j ACCEPT + + +MDS IP Tables +------------- + +A :term:`Ceph Metadata Server` listens on the first available port on the public +network beginning at port 6800. Note that this behavior is not deterministic, so +if you are running more than one OSD or MDS on the same host, or if you restart +the daemons within a short window of time, the daemons will bind to higher +ports. You should open the entire 6800-7300 range by default. When you add the +rule using the example below, make sure you replace ``{iface}`` with the public +network interface (e.g., ``eth0``, ``eth1``, etc.), ``{ip-address}`` with the IP +address of the public network and ``{netmask}`` with the netmask of the public +network. + +For example:: + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + + +OSD IP Tables +------------- + +By default, Ceph OSD Daemons `bind`_ to the first available ports on a Ceph Node +beginning at port 6800. Note that this behavior is not deterministic, so if you +are running more than one OSD or MDS on the same host, or if you restart the +daemons within a short window of time, the daemons will bind to higher ports. +Each Ceph OSD Daemon on a Ceph Node may use up to four ports: + +#. One for talking to clients and monitors. +#. One for sending data to other OSDs. +#. Two for heartbeating on each interface. + +.. ditaa:: + /---------------\ + | OSD | + | +---+----------------+-----------+ + | | Clients & Monitors | Heartbeat | + | +---+----------------+-----------+ + | | + | +---+----------------+-----------+ + | | Data Replication | Heartbeat | + | +---+----------------+-----------+ + | cCCC | + \---------------/ + +When a daemon fails and restarts without letting go of the port, the restarted +daemon will bind to a new port. You should open the entire 6800-7300 port range +to handle this possibility. + +If you set up separate public and cluster networks, you must add rules for both +the public network and the cluster network, because clients will connect using +the public network and other Ceph OSD Daemons will connect using the cluster +network. When you add the rule using the example below, make sure you replace +``{iface}`` with the network interface (e.g., ``eth0``, ``eth1``, etc.), +``{ip-address}`` with the IP address and ``{netmask}`` with the netmask of the +public or cluster network. For example:: + + sudo iptables -A INPUT -i {iface} -m multiport -p tcp -s {ip-address}/{netmask} --dports 6800:7300 -j ACCEPT + +.. tip:: If you run Ceph Metadata Servers on the same Ceph Node as the + Ceph OSD Daemons, you can consolidate the public network configuration step. + + +Ceph Networks +============= + +To configure Ceph networks, you must add a network configuration to the +``[global]`` section of the configuration file. Our 5-minute Quick Start +provides a trivial `Ceph configuration file`_ that assumes one public network +with client and server on the same network and subnet. Ceph functions just fine +with a public network only. However, Ceph allows you to establish much more +specific criteria, including multiple IP network and subnet masks for your +public network. 
You can also establish a separate cluster network to handle OSD +heartbeat, object replication and recovery traffic. Don't confuse the IP +addresses you set in your configuration with the public-facing IP addresses +network clients may use to access your service. Typical internal IP networks are +often ``192.168.0.0`` or ``10.0.0.0``. + +.. tip:: If you specify more than one IP address and subnet mask for + either the public or the cluster network, the subnets within the network + must be capable of routing to each other. Additionally, make sure you + include each IP address/subnet in your IP tables and open ports for them + as necessary. + +.. note:: Ceph uses `CIDR`_ notation for subnets (e.g., ``10.0.0.0/24``). + +When you have configured your networks, you may restart your cluster or restart +each daemon. Ceph daemons bind dynamically, so you do not have to restart the +entire cluster at once if you change your network configuration. + + +Public Network +-------------- + +To configure a public network, add the following option to the ``[global]`` +section of your Ceph configuration file. + +.. code-block:: ini + + [global] + ... + public network = {public-network/netmask} + + +Cluster Network +--------------- + +If you declare a cluster network, OSDs will route heartbeat, object replication +and recovery traffic over the cluster network. This may improve performance +compared to using a single network. To configure a cluster network, add the +following option to the ``[global]`` section of your Ceph configuration file. + +.. code-block:: ini + + [global] + ... + cluster network = {cluster-network/netmask} + +We prefer that the cluster network is **NOT** reachable from the public network +or the Internet for added security. + + +Ceph Daemons +============ + +Ceph has one network configuration requirement that applies to all daemons: the +Ceph configuration file **MUST** specify the ``host`` for each daemon. Ceph also +requires that a Ceph configuration file specify the monitor IP address and its +port. + +.. important:: Some deployment tools (e.g., ``ceph-deploy``, Chef) may create a + configuration file for you. **DO NOT** set these values if the deployment + tool does it for you. + +.. tip:: The ``host`` setting is the short name of the host (i.e., not + an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on + the command line to retrieve the name of the host. + + +.. code-block:: ini + + [mon.a] + + host = {hostname} + mon addr = {ip-address}:6789 + + [osd.0] + host = {hostname} + + +You do not have to set the host IP address for a daemon. If you have a static IP +configuration and both public and cluster networks running, the Ceph +configuration file may specify the IP address of the host for each daemon. To +set a static IP address for a daemon, the following option(s) should appear in +the daemon instance sections of your ``ceph.conf`` file. + +.. code-block:: ini + + [osd.0] + public addr = {host-public-ip-address} + cluster addr = {host-cluster-ip-address} + + +.. topic:: One NIC OSD in a Two Network Cluster + + Generally, we do not recommend deploying an OSD host with a single NIC in a + cluster with two networks. However, you may accomplish this by forcing the + OSD host to operate on the public network by adding a ``public addr`` entry + to the ``[osd.n]`` section of the Ceph configuration file, where ``n`` + refers to the number of the OSD with one NIC. 
Additionally, the public + network and cluster network must be able to route traffic to each other, + which we don't recommend for security reasons. + + +Network Config Settings +======================= + +Network configuration settings are not required. Ceph assumes a public network +with all hosts operating on it unless you specifically configure a cluster +network. + + +Public Network +-------------- + +The public network configuration allows you specifically define IP addresses +and subnets for the public network. You may specifically assign static IP +addresses or override ``public network`` settings using the ``public addr`` +setting for a specific daemon. + +``public network`` + +:Description: The IP address and netmask of the public (front-side) network + (e.g., ``192.168.0.0/24``). Set in ``[global]``. You may specify + comma-delimited subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``public addr`` + +:Description: The IP address for the public (front-side) network. + Set for each daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +Cluster Network +--------------- + +The cluster network configuration allows you to declare a cluster network, and +specifically define IP addresses and subnets for the cluster network. You may +specifically assign static IP addresses or override ``cluster network`` +settings using the ``cluster addr`` setting for specific OSD daemons. + + +``cluster network`` + +:Description: The IP address and netmask of the cluster (back-side) network + (e.g., ``10.0.0.0/24``). Set in ``[global]``. You may specify + comma-delimited subnets. + +:Type: ``{ip-address}/{netmask} [, {ip-address}/{netmask}]`` +:Required: No +:Default: N/A + + +``cluster addr`` + +:Description: The IP address for the cluster (back-side) network. + Set for each daemon. + +:Type: Address +:Required: No +:Default: N/A + + +Bind +---- + +Bind settings set the default port ranges Ceph OSD and MDS daemons use. The +default range is ``6800:7300``. Ensure that your `IP Tables`_ configuration +allows you to use the configured port range. + +You may also enable Ceph daemons to bind to IPv6 addresses instead of IPv4 +addresses. + + +``ms bind port min`` + +:Description: The minimum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``6800`` +:Required: No + + +``ms bind port max`` + +:Description: The maximum port number to which an OSD or MDS daemon will bind. +:Type: 32-bit Integer +:Default: ``7300`` +:Required: No. + + +``ms bind ipv6`` + +:Description: Enables Ceph daemons to bind to IPv6 addresses. Currently the + messenger *either* uses IPv4 or IPv6, but it cannot do both. +:Type: Boolean +:Default: ``false`` +:Required: No + +``public bind addr`` + +:Description: In some dynamic deployments the Ceph MON daemon might bind + to an IP address locally that is different from the ``public addr`` + advertised to other peers in the network. The environment must ensure + that routing rules are set correclty. If ``public bind addr`` is set + the Ceph MON daemon will bind to it locally and use ``public addr`` + in the monmaps to advertise its address to peers. This behavior is limited + to the MON daemon. + +:Type: IP Address +:Required: No +:Default: N/A + + + +Hosts +----- + +Ceph expects at least one monitor declared in the Ceph configuration file, with +a ``mon addr`` setting under each declared monitor. 
Ceph expects a ``host`` +setting under each declared monitor, metadata server and OSD in the Ceph +configuration file. Optionally, a monitor can be assigned with a priority, and +the clients will always connect to the monitor with lower value of priority if +specified. + + +``mon addr`` + +:Description: A list of ``{hostname}:{port}`` entries that clients can use to + connect to a Ceph monitor. If not set, Ceph searches ``[mon.*]`` + sections. + +:Type: String +:Required: No +:Default: N/A + +``mon priority`` + +:Description: The priority of the declared monitor, the lower value the more + prefered when a client selects a monitor when trying to connect + to the cluster. + +:Type: Unsigned 16-bit Integer +:Required: No +:Default: 0 + +``host`` + +:Description: The hostname. Use this setting for specific daemon instances + (e.g., ``[osd.0]``). + +:Type: String +:Required: Yes, for daemon instances. +:Default: ``localhost`` + +.. tip:: Do not use ``localhost``. To get your host name, execute + ``hostname -s`` on your command line and use the name of your host + (to the first period, not the fully-qualified domain name). + +.. important:: You should not specify any value for ``host`` when using a third + party deployment system that retrieves the host name for you. + + + +TCP +--- + +Ceph disables TCP buffering by default. + + +``ms tcp nodelay`` + +:Description: Ceph enables ``ms tcp nodelay`` so that each request is sent + immediately (no buffering). Disabling `Nagle's algorithm`_ + increases network traffic, which can introduce latency. If you + experience large numbers of small packets, you may try + disabling ``ms tcp nodelay``. + +:Type: Boolean +:Required: No +:Default: ``true`` + + + +``ms tcp rcvbuf`` + +:Description: The size of the socket buffer on the receiving end of a network + connection. Disable by default. + +:Type: 32-bit Integer +:Required: No +:Default: ``0`` + + + +``ms tcp read timeout`` + +:Description: If a client or daemon makes a request to another Ceph daemon and + does not drop an unused connection, the ``ms tcp read timeout`` + defines the connection as idle after the specified number + of seconds. + +:Type: Unsigned 64-bit Integer +:Required: No +:Default: ``900`` 15 minutes. + + + +.. _Scalability and High Availability: ../../../architecture#scalability-and-high-availability +.. _Hardware Recommendations - Networks: ../../../start/hardware-recommendations#networks +.. _Ceph configuration file: ../../../start/quick-ceph-deploy/#create-a-cluster +.. _hardware recommendations: ../../../start/hardware-recommendations +.. _Monitor / OSD Interaction: ../mon-osd-interaction +.. _Message Signatures: ../auth-config-ref#signatures +.. _CIDR: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing +.. _Nagle's Algorithm: http://en.wikipedia.org/wiki/Nagle's_algorithm diff --git a/src/ceph/doc/rados/configuration/osd-config-ref.rst b/src/ceph/doc/rados/configuration/osd-config-ref.rst new file mode 100644 index 0000000..fae7078 --- /dev/null +++ b/src/ceph/doc/rados/configuration/osd-config-ref.rst @@ -0,0 +1,1105 @@ +====================== + OSD Config Reference +====================== + +.. index:: OSD; configuration + +You can configure Ceph OSD Daemons in the Ceph configuration file, but Ceph OSD +Daemons can use the default values and a very minimal configuration. A minimal +Ceph OSD Daemon configuration sets ``osd journal size`` and ``host``, and +uses default values for nearly everything else. 
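+
+If you want to verify which values a running daemon is actually using for any
+of the options below, you can ask it through its admin socket on the node where
+it runs. The commands below are only a sketch; they assume a daemon named
+``osd.0`` and the default admin socket location::
+
+    ceph daemon osd.0 config show | grep osd_journal_size
+    ceph daemon osd.0 config get osd_journal_size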
+ +Ceph OSD Daemons are numerically identified in incremental fashion, beginning +with ``0`` using the following convention. :: + + osd.0 + osd.1 + osd.2 + +In a configuration file, you may specify settings for all Ceph OSD Daemons in +the cluster by adding configuration settings to the ``[osd]`` section of your +configuration file. To add settings directly to a specific Ceph OSD Daemon +(e.g., ``host``), enter it in an OSD-specific section of your configuration +file. For example: + +.. code-block:: ini + + [osd] + osd journal size = 1024 + + [osd.0] + host = osd-host-a + + [osd.1] + host = osd-host-b + + +.. index:: OSD; config settings + +General Settings +================ + +The following settings provide an Ceph OSD Daemon's ID, and determine paths to +data and journals. Ceph deployment scripts typically generate the UUID +automatically. We **DO NOT** recommend changing the default paths for data or +journals, as it makes it more problematic to troubleshoot Ceph later. + +The journal size should be at least twice the product of the expected drive +speed multiplied by ``filestore max sync interval``. However, the most common +practice is to partition the journal drive (often an SSD), and mount it such +that Ceph uses the entire partition for the journal. + + +``osd uuid`` + +:Description: The universally unique identifier (UUID) for the Ceph OSD Daemon. +:Type: UUID +:Default: The UUID. +:Note: The ``osd uuid`` applies to a single Ceph OSD Daemon. The ``fsid`` + applies to the entire cluster. + + +``osd data`` + +:Description: The path to the OSDs data. You must create the directory when + deploying Ceph. You should mount a drive for OSD data at this + mount point. We do not recommend changing the default. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id`` + + +``osd max write size`` + +:Description: The maximum size of a write in megabytes. +:Type: 32-bit Integer +:Default: ``90`` + + +``osd client message size cap`` + +:Description: The largest client data message allowed in memory. +:Type: 64-bit Unsigned Integer +:Default: 500MB default. ``500*1024L*1024L`` + + +``osd class dir`` + +:Description: The class path for RADOS class plug-ins. +:Type: String +:Default: ``$libdir/rados-classes`` + + +.. index:: OSD; file system + +File System Settings +==================== +Ceph builds and mounts file systems which are used for Ceph OSDs. + +``osd mkfs options {fs-type}`` + +:Description: Options used when creating a new Ceph OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``-f -i 2048`` +:Default for other file systems: {empty string} + +For example:: + ``osd mkfs options xfs = -f -d agcount=24`` + +``osd mount options {fs-type}`` + +:Description: Options used when mounting a Ceph OSD of type {fs-type}. + +:Type: String +:Default for xfs: ``rw,noatime,inode64`` +:Default for other file systems: ``rw, noatime`` + +For example:: + ``osd mount options xfs = rw, noatime, inode64, logbufs=8`` + + +.. index:: OSD; journal settings + +Journal Settings +================ + +By default, Ceph expects that you will store an Ceph OSD Daemons journal with +the following path:: + + /var/lib/ceph/osd/$cluster-$id/journal + +Without performance optimization, Ceph stores the journal on the same disk as +the Ceph OSD Daemons data. An Ceph OSD Daemon optimized for performance may use +a separate disk to store journal data (e.g., a solid state drive delivers high +performance journaling). 
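+
+For example, to point a single OSD's journal at a partition on a separate SSD,
+you could add something like the following to ``ceph.conf``. This is only an
+illustration; the device path ``/dev/sdb1`` is an assumption and will differ on
+your hardware.
+
+.. code-block:: ini
+
+    [osd.0]
+    osd journal = /dev/sdb1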
+ +Ceph's default ``osd journal size`` is 0, so you will need to set this in your +``ceph.conf`` file. A journal size should find the product of the ``filestore +max sync interval`` and the expected throughput, and multiply the product by +two (2):: + + osd journal size = {2 * (expected throughput * filestore max sync interval)} + +The expected throughput number should include the expected disk throughput +(i.e., sustained data transfer rate), and network throughput. For example, +a 7200 RPM disk will likely have approximately 100 MB/s. Taking the ``min()`` +of the disk and network throughput should provide a reasonable expected +throughput. Some users just start off with a 10GB journal size. For +example:: + + osd journal size = 10000 + + +``osd journal`` + +:Description: The path to the OSD's journal. This may be a path to a file or a + block device (such as a partition of an SSD). If it is a file, + you must create the directory to contain it. We recommend using a + drive separate from the ``osd data`` drive. + +:Type: String +:Default: ``/var/lib/ceph/osd/$cluster-$id/journal`` + + +``osd journal size`` + +:Description: The size of the journal in megabytes. If this is 0, and the + journal is a block device, the entire block device is used. + Since v0.54, this is ignored if the journal is a block device, + and the entire block device is used. + +:Type: 32-bit Integer +:Default: ``5120`` +:Recommended: Begin with 1GB. Should be at least twice the product of the + expected speed multiplied by ``filestore max sync interval``. + + +See `Journal Config Reference`_ for additional details. + + +Monitor OSD Interaction +======================= + +Ceph OSD Daemons check each other's heartbeats and report to monitors +periodically. Ceph can use default values in many cases. However, if your +network has latency issues, you may need to adopt longer intervals. See +`Configuring Monitor/OSD Interaction`_ for a detailed discussion of heartbeats. + + +Data Placement +============== + +See `Pool & PG Config Reference`_ for details. + + +.. index:: OSD; scrubbing + +Scrubbing +========= + +In addition to making multiple copies of objects, Ceph insures data integrity by +scrubbing placement groups. Ceph scrubbing is analogous to ``fsck`` on the +object storage layer. For each placement group, Ceph generates a catalog of all +objects and compares each primary object and its replicas to ensure that no +objects are missing or mismatched. Light scrubbing (daily) checks the object +size and attributes. Deep scrubbing (weekly) reads the data and uses checksums +to ensure data integrity. + +Scrubbing is important for maintaining data integrity, but it can reduce +performance. You can adjust the following settings to increase or decrease +scrubbing operations. + + +``osd max scrubs`` + +:Description: The maximum number of simultaneous scrub operations for + a Ceph OSD Daemon. + +:Type: 32-bit Int +:Default: ``1`` + +``osd scrub begin hour`` + +:Description: The time of day for the lower bound when a scheduled scrub can be + performed. +:Type: Integer in the range of 0 to 24 +:Default: ``0`` + + +``osd scrub end hour`` + +:Description: The time of day for the upper bound when a scheduled scrub can be + performed. Along with ``osd scrub begin hour``, they define a time + window, in which the scrubs can happen. But a scrub will be performed + no matter the time window allows or not, as long as the placement + group's scrub interval exceeds ``osd scrub max interval``. 
+:Type: Integer in the range of 0 to 24 +:Default: ``24`` + + +``osd scrub during recovery`` + +:Description: Allow scrub during recovery. Setting this to ``false`` will disable + scheduling new scrub (and deep--scrub) while there is active recovery. + Already running scrubs will be continued. This might be useful to reduce + load on busy clusters. +:Type: Boolean +:Default: ``true`` + + +``osd scrub thread timeout`` + +:Description: The maximum time in seconds before timing out a scrub thread. +:Type: 32-bit Integer +:Default: ``60`` + + +``osd scrub finalize thread timeout`` + +:Description: The maximum time in seconds before timing out a scrub finalize + thread. + +:Type: 32-bit Integer +:Default: ``60*10`` + + +``osd scrub load threshold`` + +:Description: The maximum load. Ceph will not scrub when the system load + (as defined by ``getloadavg()``) is higher than this number. + Default is ``0.5``. + +:Type: Float +:Default: ``0.5`` + + +``osd scrub min interval`` + +:Description: The minimal interval in seconds for scrubbing the Ceph OSD Daemon + when the Ceph Storage Cluster load is low. + +:Type: Float +:Default: Once per day. ``60*60*24`` + + +``osd scrub max interval`` + +:Description: The maximum interval in seconds for scrubbing the Ceph OSD Daemon + irrespective of cluster load. + +:Type: Float +:Default: Once per week. ``7*60*60*24`` + + +``osd scrub chunk min`` + +:Description: The minimal number of object store chunks to scrub during single operation. + Ceph blocks writes to single chunk during scrub. + +:Type: 32-bit Integer +:Default: 5 + + +``osd scrub chunk max`` + +:Description: The maximum number of object store chunks to scrub during single operation. + +:Type: 32-bit Integer +:Default: 25 + + +``osd scrub sleep`` + +:Description: Time to sleep before scrubbing next group of chunks. Increasing this value will slow + down whole scrub operation while client operations will be less impacted. + +:Type: Float +:Default: 0 + + +``osd deep scrub interval`` + +:Description: The interval for "deep" scrubbing (fully reading all data). The + ``osd scrub load threshold`` does not affect this setting. + +:Type: Float +:Default: Once per week. ``60*60*24*7`` + + +``osd scrub interval randomize ratio`` + +:Description: Add a random delay to ``osd scrub min interval`` when scheduling + the next scrub job for a placement group. The delay is a random + value less than ``osd scrub min interval`` \* + ``osd scrub interval randomized ratio``. So the default setting + practically randomly spreads the scrubs out in the allowed time + window of ``[1, 1.5]`` \* ``osd scrub min interval``. +:Type: Float +:Default: ``0.5`` + +``osd deep scrub stride`` + +:Description: Read size when doing a deep scrub. +:Type: 32-bit Integer +:Default: 512 KB. ``524288`` + + +.. index:: OSD; operations settings + +Operations +========== + +Operations settings allow you to configure the number of threads for servicing +requests. If you set ``osd op threads`` to ``0``, it disables multi-threading. +By default, Ceph uses two threads with a 30 second timeout and a 30 second +complaint time if an operation doesn't complete within those time parameters. +You can set operations priority weights between client operations and +recovery operations to ensure optimal performance during recovery. + + +``osd op threads`` + +:Description: The number of threads to service Ceph OSD Daemon operations. + Set to ``0`` to disable it. Increasing the number may increase + the request processing rate. 
+ +:Type: 32-bit Integer +:Default: ``2`` + + +``osd op queue`` + +:Description: This sets the type of queue to be used for prioritizing ops + in the OSDs. Both queues feature a strict sub-queue which is + dequeued before the normal queue. The normal queue is different + between implementations. The original PrioritizedQueue (``prio``) uses a + token bucket system which when there are sufficient tokens will + dequeue high priority queues first. If there are not enough + tokens available, queues are dequeued low priority to high priority. + The WeightedPriorityQueue (``wpq``) dequeues all priorities in + relation to their priorities to prevent starvation of any queue. + WPQ should help in cases where a few OSDs are more overloaded + than others. The new mClock based OpClassQueue + (``mclock_opclass``) prioritizes operations based on which class + they belong to (recovery, scrub, snaptrim, client op, osd subop). + And, the mClock based ClientQueue (``mclock_client``) also + incorporates the client identifier in order to promote fairness + between clients. See `QoS Based on mClock`_. Requires a restart. + +:Type: String +:Valid Choices: prio, wpq, mclock_opclass, mclock_client +:Default: ``prio`` + + +``osd op queue cut off`` + +:Description: This selects which priority ops will be sent to the strict + queue verses the normal queue. The ``low`` setting sends all + replication ops and higher to the strict queue, while the ``high`` + option sends only replication acknowledgement ops and higher to + the strict queue. Setting this to ``high`` should help when a few + OSDs in the cluster are very busy especially when combined with + ``wpq`` in the ``osd op queue`` setting. OSDs that are very busy + handling replication traffic could starve primary client traffic + on these OSDs without these settings. Requires a restart. + +:Type: String +:Valid Choices: low, high +:Default: ``low`` + + +``osd client op priority`` + +:Description: The priority set for client operations. It is relative to + ``osd recovery op priority``. + +:Type: 32-bit Integer +:Default: ``63`` +:Valid Range: 1-63 + + +``osd recovery op priority`` + +:Description: The priority set for recovery operations. It is relative to + ``osd client op priority``. + +:Type: 32-bit Integer +:Default: ``3`` +:Valid Range: 1-63 + + +``osd scrub priority`` + +:Description: The priority set for scrub operations. It is relative to + ``osd client op priority``. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + + +``osd snap trim priority`` + +:Description: The priority set for snap trim operations. It is relative to + ``osd client op priority``. + +:Type: 32-bit Integer +:Default: ``5`` +:Valid Range: 1-63 + + +``osd op thread timeout`` + +:Description: The Ceph OSD Daemon operation thread timeout in seconds. +:Type: 32-bit Integer +:Default: ``15`` + + +``osd op complaint time`` + +:Description: An operation becomes complaint worthy after the specified number + of seconds have elapsed. + +:Type: Float +:Default: ``30`` + + +``osd disk threads`` + +:Description: The number of disk threads, which are used to perform background + disk intensive OSD operations such as scrubbing and snap + trimming. + +:Type: 32-bit Integer +:Default: ``1`` + +``osd disk thread ioprio class`` + +:Description: Warning: it will only be used if both ``osd disk thread + ioprio class`` and ``osd disk thread ioprio priority`` are + set to a non default value. Sets the ioprio_set(2) I/O + scheduling ``class`` for the disk thread. 
Acceptable
+              values are ``idle``, ``be`` or ``rt``. The ``idle``
+              class means the disk thread will have lower priority
+              than any other thread in the OSD. This is useful to slow
+              down scrubbing on an OSD that is busy handling client
+              operations. ``be`` is the default and is the same
+              priority as all other threads in the OSD. ``rt`` means
+              the disk thread will have precedence over all other
+              threads in the OSD. Note: Only works with the Linux Kernel
+              CFQ scheduler. Since Jewel, scrubbing is no longer carried
+              out by the disk iothread; see the osd priority options instead.
+:Type: String
+:Default: the empty string
+
+``osd disk thread ioprio priority``
+
+:Description: Warning: it will only be used if both ``osd disk thread
+              ioprio class`` and ``osd disk thread ioprio priority`` are
+              set to a non default value. It sets the ioprio_set(2)
+              I/O scheduling ``priority`` of the disk thread, ranging
+              from 0 (highest) to 7 (lowest). If all OSDs on a given
+              host were in class ``idle`` and compete for I/O
+              (i.e. due to controller congestion), it can be used to
+              lower the disk thread priority of one OSD to 7 so that
+              another OSD with priority 0 can have priority.
+              Note: Only works with the Linux Kernel CFQ scheduler.
+:Type: Integer in the range of 0 to 7 or -1 if not to be used.
+:Default: ``-1``
+
+``osd op history size``
+
+:Description: The maximum number of completed operations to track.
+:Type: 32-bit Unsigned Integer
+:Default: ``20``
+
+
+``osd op history duration``
+
+:Description: The age in seconds of the oldest completed operation to track.
+:Type: 32-bit Unsigned Integer
+:Default: ``600``
+
+
+``osd op log threshold``
+
+:Description: How many operation logs to display at once.
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+QoS Based on mClock
+-------------------
+
+Ceph's use of mClock is currently in the experimental phase and should
+be approached with an exploratory mindset.
+
+Core Concepts
+`````````````
+
+The QoS support of Ceph is implemented using a queueing scheduler
+based on `the dmClock algorithm`_. This algorithm allocates the I/O
+resources of the Ceph cluster in proportion to weights, and enforces
+the constraints of minimum reservation and maximum limitation, so that
+the services can compete for the resources fairly. Currently the
+*mclock_opclass* operation queue divides Ceph services involving I/O
+resources into the following buckets:
+
+- client op: the iops issued by a client
+- osd subop: the iops issued by a primary OSD
+- snap trim: the snap trimming related requests
+- pg recovery: the recovery related requests
+- pg scrub: the scrub related requests
+
+The resources are partitioned using the following three sets of tags. In other
+words, the share of each type of service is controlled by three tags:
+
+#. reservation: the minimum IOPS allocated for the service.
+#. limitation: the maximum IOPS allocated for the service.
+#. weight: the proportional share of capacity if extra capacity is available
+   or the system is oversubscribed.
+
+In Ceph, operations are graded with a "cost", and the resources allocated
+for serving the various services are consumed by these "costs". So, for
+example, the more reservation a service has, the more resource it is
+guaranteed to possess, as long as it requires it. Assuming there are two
+services, recovery and client ops, with the following settings:
+
+- recovery: (r:1, l:5, w:1)
+- client ops: (r:2, l:0, w:9)
+
+The settings above ensure that recovery will not have more than 5
+requests per second serviced, even if it asks for more (see the CURRENT
+IMPLEMENTATION NOTE below) and no other service is competing with it.
+Conversely, if the clients start to issue a large number of I/O
+requests, they will not exhaust all of the I/O resources: 1 request per
+second is always reserved for recovery jobs as long as there are any
+such requests, so recovery jobs will not be starved even in a cluster
+under high load. In the meantime, the client ops can enjoy a larger
+portion of the I/O resource, because their weight is "9" while their
+competitor's is "1". In the case of client ops, they are not clamped by
+the limit setting, so they can make use of all the resources if there is
+no recovery ongoing.
+
+Along with *mclock_opclass*, another mclock operation queue named
+*mclock_client* is available. It divides operations based on category,
+but also divides them based on the client making the request. This
+helps not only to manage the distribution of resources spent on different
+classes of operations but also tries to ensure fairness among clients.
+
+CURRENT IMPLEMENTATION NOTE: the current experimental implementation
+does not enforce the limit values. As a first approximation we decided
+not to prevent operations that would otherwise enter the operation
+sequencer from doing so.
+
+Subtleties of mClock
+````````````````````
+
+The reservation and limit values have a unit of requests per
+second. The weight, however, does not technically have a unit and the
+weights are relative to one another. So if one class of requests has a
+weight of 1 and another a weight of 9, then the latter class of
+requests should have its requests executed at a 9 to 1 ratio relative
+to the first class. However, that will only happen once the
+reservations are met, and those values include the operations executed
+under the reservation phase.
+
+Even though the weights do not have units, one must be careful in
+choosing their values due to how the algorithm assigns weight tags to
+requests. If the weight is *W*, then for a given class of requests,
+the next one that comes in will have a weight tag of *1/W* plus the
+previous weight tag or the current time, whichever is larger. That
+means if *W* is sufficiently large and therefore *1/W* is sufficiently
+small, the calculated tag may never be assigned as it will get a value
+of the current time. The ultimate lesson is that values for weight
+should not be too large. They should be under the number of requests
+one expects to be serviced each second.
+
+Caveats
+```````
+
+There are some factors that can reduce the impact of the mClock op
+queues within Ceph. First, requests to an OSD are sharded by their
+placement group identifier. Each shard has its own mClock queue and
+these queues neither interact nor share information among them. The
+number of shards can be controlled with the configuration options
+``osd_op_num_shards``, ``osd_op_num_shards_hdd``, and
+``osd_op_num_shards_ssd``. A lower number of shards will increase the
+impact of the mClock queues, but may have other deleterious effects.
+
+Second, requests are transferred from the operation queue to the
+operation sequencer, in which they go through the phases of
+execution. The operation queue is where mClock resides and mClock
+determines the next op to transfer to the operation sequencer.
The +number of operations allowed in the operation sequencer is a complex +issue. In general we want to keep enough operations in the sequencer +so it's always getting work done on some operations while it's waiting +for disk and network access to complete on other operations. On the +other hand, once an operation is transferred to the operation +sequencer, mClock no longer has control over it. Therefore to maximize +the impact of mClock, we want to keep as few operations in the +operation sequencer as possible. So we have an inherent tension. + +The configuration options that influence the number of operations in +the operation sequencer are ``bluestore_throttle_bytes``, +``bluestore_throttle_deferred_bytes``, +``bluestore_throttle_cost_per_io``, +``bluestore_throttle_cost_per_io_hdd``, and +``bluestore_throttle_cost_per_io_ssd``. + +A third factor that affects the impact of the mClock algorithm is that +we're using a distributed system, where requests are made to multiple +OSDs and each OSD has (can have) multiple shards. Yet we're currently +using the mClock algorithm, which is not distributed (note: dmClock is +the distributed version of mClock). + +Various organizations and individuals are currently experimenting with +mClock as it exists in this code base along with their modifications +to the code base. We hope you'll share you're experiences with your +mClock and dmClock experiments in the ceph-devel mailing list. + + +``osd push per object cost`` + +:Description: the overhead for serving a push op + +:Type: Unsigned Integer +:Default: 1000 + +``osd recovery max chunk`` + +:Description: the maximum total size of data chunks a recovery op can carry. + +:Type: Unsigned Integer +:Default: 8 MiB + + +``osd op queue mclock client op res`` + +:Description: the reservation of client op. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock client op wgt`` + +:Description: the weight of client op. + +:Type: Float +:Default: 500.0 + + +``osd op queue mclock client op lim`` + +:Description: the limit of client op. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock osd subop res`` + +:Description: the reservation of osd subop. + +:Type: Float +:Default: 1000.0 + + +``osd op queue mclock osd subop wgt`` + +:Description: the weight of osd subop. + +:Type: Float +:Default: 500.0 + + +``osd op queue mclock osd subop lim`` + +:Description: the limit of osd subop. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock snap res`` + +:Description: the reservation of snap trimming. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock snap wgt`` + +:Description: the weight of snap trimming. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock snap lim`` + +:Description: the limit of snap trimming. + +:Type: Float +:Default: 0.001 + + +``osd op queue mclock recov res`` + +:Description: the reservation of recovery. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock recov wgt`` + +:Description: the weight of recovery. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock recov lim`` + +:Description: the limit of recovery. + +:Type: Float +:Default: 0.001 + + +``osd op queue mclock scrub res`` + +:Description: the reservation of scrub jobs. + +:Type: Float +:Default: 0.0 + + +``osd op queue mclock scrub wgt`` + +:Description: the weight of scrub jobs. + +:Type: Float +:Default: 1.0 + + +``osd op queue mclock scrub lim`` + +:Description: the limit of scrub jobs. + +:Type: Float +:Default: 0.001 + +.. 
_the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf + + +.. index:: OSD; backfilling + +Backfilling +=========== + +When you add or remove Ceph OSD Daemons to a cluster, the CRUSH algorithm will +want to rebalance the cluster by moving placement groups to or from Ceph OSD +Daemons to restore the balance. The process of migrating placement groups and +the objects they contain can reduce the cluster's operational performance +considerably. To maintain operational performance, Ceph performs this migration +with 'backfilling', which allows Ceph to set backfill operations to a lower +priority than requests to read or write data. + + +``osd max backfills`` + +:Description: The maximum number of backfills allowed to or from a single OSD. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd backfill scan min`` + +:Description: The minimum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``64`` + + +``osd backfill scan max`` + +:Description: The maximum number of objects per backfill scan. + +:Type: 32-bit Integer +:Default: ``512`` + + +``osd backfill retry interval`` + +:Description: The number of seconds to wait before retrying backfill requests. +:Type: Double +:Default: ``10.0`` + +.. index:: OSD; osdmap + +OSD Map +======= + +OSD maps reflect the OSD daemons operating in the cluster. Over time, the +number of map epochs increases. Ceph provides some settings to ensure that +Ceph performs well as the OSD map grows larger. + + +``osd map dedup`` + +:Description: Enable removing duplicates in the OSD map. +:Type: Boolean +:Default: ``true`` + + +``osd map cache size`` + +:Description: The number of OSD maps to keep cached. +:Type: 32-bit Integer +:Default: ``500`` + + +``osd map cache bl size`` + +:Description: The size of the in-memory OSD map cache in OSD daemons. +:Type: 32-bit Integer +:Default: ``50`` + + +``osd map cache bl inc size`` + +:Description: The size of the in-memory OSD map cache incrementals in + OSD daemons. + +:Type: 32-bit Integer +:Default: ``100`` + + +``osd map message max`` + +:Description: The maximum map entries allowed per MOSDMap message. +:Type: 32-bit Integer +:Default: ``100`` + + + +.. index:: OSD; recovery + +Recovery +======== + +When the cluster starts or when a Ceph OSD Daemon crashes and restarts, the OSD +begins peering with other Ceph OSD Daemons before writes can occur. See +`Monitoring OSDs and PGs`_ for details. + +If a Ceph OSD Daemon crashes and comes back online, usually it will be out of +sync with other Ceph OSD Daemons containing more recent versions of objects in +the placement groups. When this happens, the Ceph OSD Daemon goes into recovery +mode and seeks to get the latest copy of the data and bring its map back up to +date. Depending upon how long the Ceph OSD Daemon was down, the OSD's objects +and placement groups may be significantly out of date. Also, if a failure domain +went down (e.g., a rack), more than one Ceph OSD Daemon may come back online at +the same time. This can make the recovery process time consuming and resource +intensive. + +To maintain operational performance, Ceph performs recovery with limitations on +the number recovery requests, threads and object chunk sizes which allows Ceph +perform well in a degraded state. + + +``osd recovery delay start`` + +:Description: After peering completes, Ceph will delay for the specified number + of seconds before starting to recover objects. 
+ +:Type: Float +:Default: ``0`` + + +``osd recovery max active`` + +:Description: The number of active recovery requests per OSD at one time. More + requests will accelerate recovery, but the requests places an + increased load on the cluster. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd recovery max chunk`` + +:Description: The maximum size of a recovered chunk of data to push. +:Type: 64-bit Unsigned Integer +:Default: ``8 << 20`` + + +``osd recovery max single start`` + +:Description: The maximum number of recovery operations per OSD that will be + newly started when an OSD is recovering. +:Type: 64-bit Unsigned Integer +:Default: ``1`` + + +``osd recovery thread timeout`` + +:Description: The maximum time in seconds before timing out a recovery thread. +:Type: 32-bit Integer +:Default: ``30`` + + +``osd recover clone overlap`` + +:Description: Preserves clone overlap during recovery. Should always be set + to ``true``. + +:Type: Boolean +:Default: ``true`` + + +``osd recovery sleep`` + +:Description: Time in seconds to sleep before next recovery or backfill op. + Increasing this value will slow down recovery operation while + client operations will be less impacted. + +:Type: Float +:Default: ``0`` + + +``osd recovery sleep hdd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for HDDs. + +:Type: Float +:Default: ``0.1`` + + +``osd recovery sleep ssd`` + +:Description: Time in seconds to sleep before next recovery or backfill op + for SSDs. + +:Type: Float +:Default: ``0`` + + +``osd recovery sleep hybrid`` + +:Description: Time in seconds to sleep before next recovery or backfill op + when osd data is on HDD and osd journal is on SSD. + +:Type: Float +:Default: ``0.025`` + +Tiering +======= + +``osd agent max ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the high speed mode. +:Type: 32-bit Integer +:Default: ``4`` + + +``osd agent max low ops`` + +:Description: The maximum number of simultaneous flushing ops per tiering agent + in the low speed mode. +:Type: 32-bit Integer +:Default: ``2`` + +See `cache target dirty high ratio`_ for when the tiering agent flushes dirty +objects within the high speed mode. + +Miscellaneous +============= + + +``osd snap trim thread timeout`` + +:Description: The maximum time in seconds before timing out a snap trim thread. +:Type: 32-bit Integer +:Default: ``60*60*1`` + + +``osd backlog thread timeout`` + +:Description: The maximum time in seconds before timing out a backlog thread. +:Type: 32-bit Integer +:Default: ``60*60*1`` + + +``osd default notify timeout`` + +:Description: The OSD default notification timeout (in seconds). +:Type: 32-bit Unsigned Integer +:Default: ``30`` + + +``osd check for log corruption`` + +:Description: Check log files for corruption. Can be computationally expensive. +:Type: Boolean +:Default: ``false`` + + +``osd remove thread timeout`` + +:Description: The maximum time in seconds before timing out a remove OSD thread. +:Type: 32-bit Integer +:Default: ``60*60`` + + +``osd command thread timeout`` + +:Description: The maximum time in seconds before timing out a command thread. +:Type: 32-bit Integer +:Default: ``10*60`` + + +``osd command max records`` + +:Description: Limits the number of lost objects to return. +:Type: 32-bit Integer +:Default: ``256`` + + +``osd auto upgrade tmap`` + +:Description: Uses ``tmap`` for ``omap`` on old objects. 
+:Type: Boolean +:Default: ``true`` + + +``osd tmapput sets users tmap`` + +:Description: Uses ``tmap`` for debugging only. +:Type: Boolean +:Default: ``false`` + + +``osd fast fail on connection refused`` + +:Description: If this option is enabled, crashed OSDs are marked down + immediately by connected peers and MONs (assuming that the + crashed OSD host survives). Disable it to restore old + behavior, at the expense of possible long I/O stalls when + OSDs crash in the middle of I/O operations. +:Type: Boolean +:Default: ``true`` + + + +.. _pool: ../../operations/pools +.. _Configuring Monitor/OSD Interaction: ../mon-osd-interaction +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Pool & PG Config Reference: ../pool-pg-config-ref +.. _Journal Config Reference: ../journal-ref +.. _cache target dirty high ratio: ../../operations/pools#cache-target-dirty-high-ratio diff --git a/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst new file mode 100644 index 0000000..89a3707 --- /dev/null +++ b/src/ceph/doc/rados/configuration/pool-pg-config-ref.rst @@ -0,0 +1,270 @@ +====================================== + Pool, PG and CRUSH Config Reference +====================================== + +.. index:: pools; configuration + +When you create pools and set the number of placement groups for the pool, Ceph +uses default values when you don't specifically override the defaults. **We +recommend** overridding some of the defaults. Specifically, we recommend setting +a pool's replica size and overriding the default number of placement groups. You +can specifically set these values when running `pool`_ commands. You can also +override the defaults by adding new ones in the ``[global]`` section of your +Ceph configuration file. + + +.. literalinclude:: pool-pg.conf + :language: ini + + + +``mon max pool pg num`` + +:Description: The maximum number of placement groups per pool. +:Type: Integer +:Default: ``65536`` + + +``mon pg create interval`` + +:Description: Number of seconds between PG creation in the same + Ceph OSD Daemon. + +:Type: Float +:Default: ``30.0`` + + +``mon pg stuck threshold`` + +:Description: Number of seconds after which PGs can be considered as + being stuck. + +:Type: 32-bit Integer +:Default: ``300`` + +``mon pg min inactive`` + +:Description: Issue a ``HEALTH_ERR`` in cluster log if the number of PGs stay + inactive longer than ``mon_pg_stuck_threshold`` exceeds this + setting. A non-positive number means disabled, never go into ERR. +:Type: Integer +:Default: ``1`` + + +``mon pg warn min per osd`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number + of PGs per (in) OSD is under this number. (a non-positive number + disables this) +:Type: Integer +:Default: ``30`` + + +``mon pg warn max per osd`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if the average number + of PGs per (in) OSD is above this number. (a non-positive number + disables this) +:Type: Integer +:Default: ``300`` + + +``mon pg warn min objects`` + +:Description: Do not warn if the total number of objects in cluster is below + this number +:Type: Integer +:Default: ``1000`` + + +``mon pg warn min pool objects`` + +:Description: Do not warn on pools whose object number is below this number +:Type: Integer +:Default: ``1000`` + + +``mon pg check down all threshold`` + +:Description: Threshold of down OSDs percentage after which we check all PGs + for stale ones. 
+:Type: Float +:Default: ``0.5`` + + +``mon pg warn max object skew`` + +:Description: Issue a ``HEALTH_WARN`` in cluster log if the average object number + of a certain pool is greater than ``mon pg warn max object skew`` times + the average object number of the whole pool. (a non-positive number + disables this) +:Type: Float +:Default: ``10`` + + +``mon delta reset interval`` + +:Description: Seconds of inactivity before we reset the pg delta to 0. We keep + track of the delta of the used space of each pool, so, for + example, it would be easier for us to understand the progress of + recovery or the performance of cache tier. But if there's no + activity reported for a certain pool, we just reset the history of + deltas of that pool. +:Type: Integer +:Default: ``10`` + + +``mon osd max op age`` + +:Description: Maximum op age before we get concerned (make it a power of 2). + A ``HEALTH_WARN`` will be issued if a request has been blocked longer + than this limit. +:Type: Float +:Default: ``32.0`` + + +``osd pg bits`` + +:Description: Placement group bits per Ceph OSD Daemon. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd pgp bits`` + +:Description: The number of bits per Ceph OSD Daemon for PGPs. +:Type: 32-bit Integer +:Default: ``6`` + + +``osd crush chooseleaf type`` + +:Description: The bucket type to use for ``chooseleaf`` in a CRUSH rule. Uses + ordinal rank rather than name. + +:Type: 32-bit Integer +:Default: ``1``. Typically a host containing one or more Ceph OSD Daemons. + + +``osd crush initial weight`` + +:Description: The initial crush weight for newly added osds into crushmap. + +:Type: Double +:Default: ``the size of newly added osd in TB``. By default, the initial crush + weight for the newly added osd is set to its volume size in TB. + See `Weighting Bucket Items`_ for details. + + +``osd pool default crush replicated ruleset`` + +:Description: The default CRUSH ruleset to use when creating a replicated pool. +:Type: 8-bit Integer +:Default: ``CEPH_DEFAULT_CRUSH_REPLICATED_RULESET``, which means "pick + a ruleset with the lowest numerical ID and use that". This is to + make pool creation work in the absence of ruleset 0. + + +``osd pool erasure code stripe unit`` + +:Description: Sets the default size, in bytes, of a chunk of an object + stripe for erasure coded pools. Every object of size S + will be stored as N stripes, with each data chunk + receiving ``stripe unit`` bytes. Each stripe of ``N * + stripe unit`` bytes will be encoded/decoded + individually. This option can is overridden by the + ``stripe_unit`` setting in an erasure code profile. + +:Type: Unsigned 32-bit Integer +:Default: ``4096`` + + +``osd pool default size`` + +:Description: Sets the number of replicas for objects in the pool. The default + value is the same as + ``ceph osd pool set {pool-name} size {size}``. + +:Type: 32-bit Integer +:Default: ``3`` + + +``osd pool default min size`` + +:Description: Sets the minimum number of written replicas for objects in the + pool in order to acknowledge a write operation to the client. + If minimum is not met, Ceph will not acknowledge the write to the + client. This setting ensures a minimum number of replicas when + operating in ``degraded`` mode. + +:Type: 32-bit Integer +:Default: ``0``, which means no particular minimum. If ``0``, + minimum is ``size - (size / 2)``. + + +``osd pool default pg num`` + +:Description: The default number of placement groups for a pool. The default + value is the same as ``pg_num`` with ``mkpool``. 
+ +:Type: 32-bit Integer +:Default: ``8`` + + +``osd pool default pgp num`` + +:Description: The default number of placement groups for placement for a pool. + The default value is the same as ``pgp_num`` with ``mkpool``. + PG and PGP should be equal (for now). + +:Type: 32-bit Integer +:Default: ``8`` + + +``osd pool default flags`` + +:Description: The default flags for new pools. +:Type: 32-bit Integer +:Default: ``0`` + + +``osd max pgls`` + +:Description: The maximum number of placement groups to list. A client + requesting a large number can tie up the Ceph OSD Daemon. + +:Type: Unsigned 64-bit Integer +:Default: ``1024`` +:Note: Default should be fine. + + +``osd min pg log entries`` + +:Description: The minimum number of placement group logs to maintain + when trimming log files. + +:Type: 32-bit Int Unsigned +:Default: ``1000`` + + +``osd default data pool replay window`` + +:Description: The time (in seconds) for an OSD to wait for a client to replay + a request. + +:Type: 32-bit Integer +:Default: ``45`` + +``osd max pg per osd hard ratio`` + +:Description: The ratio of number of PGs per OSD allowed by the cluster before + OSD refuses to create new PGs. OSD stops creating new PGs if the number + of PGs it serves exceeds + ``osd max pg per osd hard ratio`` \* ``mon max pg per osd``. + +:Type: Float +:Default: ``2`` + +.. _pool: ../../operations/pools +.. _Monitoring OSDs and PGs: ../../operations/monitoring-osd-pg#peering +.. _Weighting Bucket Items: ../../operations/crush-map#weightingbucketitems diff --git a/src/ceph/doc/rados/configuration/pool-pg.conf b/src/ceph/doc/rados/configuration/pool-pg.conf new file mode 100644 index 0000000..5f1b3b7 --- /dev/null +++ b/src/ceph/doc/rados/configuration/pool-pg.conf @@ -0,0 +1,20 @@ +[global] + + # By default, Ceph makes 3 replicas of objects. If you want to make four + # copies of an object the default value--a primary copy and three replica + # copies--reset the default values as shown in 'osd pool default size'. + # If you want to allow Ceph to write a lesser number of copies in a degraded + # state, set 'osd pool default min size' to a number less than the + # 'osd pool default size' value. + + osd pool default size = 4 # Write an object 4 times. + osd pool default min size = 1 # Allow writing one copy in a degraded state. + + # Ensure you have a realistic number of placement groups. We recommend + # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100 + # divided by the number of replicas (i.e., osd pool default size). So for + # 10 OSDs and osd pool default size = 4, we'd recommend approximately + # (100 * 10) / 4 = 250. + + osd pool default pg num = 250 + osd pool default pgp num = 250 diff --git a/src/ceph/doc/rados/configuration/storage-devices.rst b/src/ceph/doc/rados/configuration/storage-devices.rst new file mode 100644 index 0000000..83c0c9b --- /dev/null +++ b/src/ceph/doc/rados/configuration/storage-devices.rst @@ -0,0 +1,83 @@ +================= + Storage Devices +================= + +There are two Ceph daemons that store data on disk: + +* **Ceph OSDs** (or Object Storage Daemons) are where most of the + data is stored in Ceph. Generally speaking, each OSD is backed by + a single storage device, like a traditional hard disk (HDD) or + solid state disk (SSD). OSDs can also be backed by a combination + of devices, like a HDD for most data and an SSD (or partition of an + SSD) for some metadata. 
The number of OSDs in a cluster is + generally a function of how much data will be stored, how big each + storage device will be, and the level and type of redundancy + (replication or erasure coding). +* **Ceph Monitor** daemons manage critical cluster state like cluster + membership and authentication information. For smaller clusters a + few gigabytes is all that is needed, although for larger clusters + the monitor database can reach tens or possibly hundreds of + gigabytes. + + +OSD Backends +============ + +There are two ways that OSDs can manage the data they store. Starting +with the Luminous 12.2.z release, the new default (and recommended) backend is +*BlueStore*. Prior to Luminous, the default (and only option) was +*FileStore*. + +BlueStore +--------- + +BlueStore is a special-purpose storage backend designed specifically +for managing data on disk for Ceph OSD workloads. It is motivated by +experience supporting and managing OSDs using FileStore over the +last ten years. Key BlueStore features include: + +* Direct management of storage devices. BlueStore consumes raw block + devices or partitions. This avoids any intervening layers of + abstraction (such as local file systems like XFS) that may limit + performance or add complexity. +* Metadata management with RocksDB. We embed RocksDB's key/value database + in order to manage internal metadata, such as the mapping from object + names to block locations on disk. +* Full data and metadata checksumming. By default all data and + metadata written to BlueStore is protected by one or more + checksums. No data or metadata will be read from disk or returned + to the user without being verified. +* Inline compression. Data written may be optionally compressed + before being written to disk. +* Multi-device metadata tiering. BlueStore allows its internal + journal (write-ahead log) to be written to a separate, high-speed + device (like an SSD, NVMe, or NVDIMM) to increased performance. If + a significant amount of faster storage is available, internal + metadata can also be stored on the faster device. +* Efficient copy-on-write. RBD and CephFS snapshots rely on a + copy-on-write *clone* mechanism that is implemented efficiently in + BlueStore. This results in efficient IO both for regular snapshots + and for erasure coded pools (which rely on cloning to implement + efficient two-phase commits). + +For more information, see :doc:`bluestore-config-ref`. + +FileStore +--------- + +FileStore is the legacy approach to storing objects in Ceph. It +relies on a standard file system (normally XFS) in combination with a +key/value database (traditionally LevelDB, now RocksDB) for some +metadata. + +FileStore is well-tested and widely used in production but suffers +from many performance deficiencies due to its overall design and +reliance on a traditional file system for storing object data. + +Although FileStore is generally capable of functioning on most +POSIX-compatible file systems (including btrfs and ext4), we only +recommend that XFS be used. Both btrfs and ext4 have known bugs and +deficiencies and their use may lead to data loss. By default all Ceph +provisioning tools will use XFS. + +For more information, see :doc:`filestore-config-ref`. 
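
If you want to confirm which backend an existing OSD is using, the metadata
each OSD daemon reports to the monitors includes its object store type. For
example (the exact JSON field names may vary slightly between releases)::

    ceph osd metadata {osd-id} | grep osd_objectstore
    # for example, for osd.0
    ceph osd metadata 0 | grep osd_objectstore
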
diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst b/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst new file mode 100644 index 0000000..a91f69c --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-admin.rst @@ -0,0 +1,38 @@ +============= + Admin Tasks +============= + +Once you have set up a cluster with ``ceph-deploy``, you may +provide the client admin key and the Ceph configuration file +to another host so that a user on the host may use the ``ceph`` +command line as an administrative user. + + +Create an Admin Host +==================== + +To enable a host to execute ceph commands with administrator +privileges, use the ``admin`` command. :: + + ceph-deploy admin {host-name [host-name]...} + + +Deploy Config File +================== + +To send an updated copy of the Ceph configuration file to hosts +in your cluster, use the ``config push`` command. :: + + ceph-deploy config push {host-name [host-name]...} + +.. tip:: With a base name and increment host-naming convention, + it is easy to deploy configuration files via simple scripts + (e.g., ``ceph-deploy config hostname{1,2,3,4,5}``). + +Retrieve Config File +==================== + +To retrieve a copy of the Ceph configuration file from a host +in your cluster, use the ``config pull`` command. :: + + ceph-deploy config pull {host-name [host-name]...} diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-install.rst b/src/ceph/doc/rados/deployment/ceph-deploy-install.rst new file mode 100644 index 0000000..849d68e --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-install.rst @@ -0,0 +1,46 @@ +==================== + Package Management +==================== + +Install +======= + +To install Ceph packages on your cluster hosts, open a command line on your +client machine and type the following:: + + ceph-deploy install {hostname [hostname] ...} + +Without additional arguments, ``ceph-deploy`` will install the most recent +major release of Ceph to the cluster host(s). To specify a particular package, +you may select from the following: + +- ``--release <code-name>`` +- ``--testing`` +- ``--dev <branch-or-tag>`` + +For example:: + + ceph-deploy install --release cuttlefish hostname1 + ceph-deploy install --testing hostname2 + ceph-deploy install --dev wip-some-branch hostname{1,2,3,4,5} + +For additional usage, execute:: + + ceph-deploy install -h + + +Uninstall +========= + +To uninstall Ceph packages from your cluster hosts, open a terminal on +your admin host and type the following:: + + ceph-deploy uninstall {hostname [hostname] ...} + +On a Debian or Ubuntu system, you may also:: + + ceph-deploy purge {hostname [hostname] ...} + +The tool will unininstall ``ceph`` packages from the specified hosts. Purge +additionally removes configuration files. + diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst b/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst new file mode 100644 index 0000000..3e106c9 --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-keys.rst @@ -0,0 +1,32 @@ +================= + Keys Management +================= + + +Gather Keys +=========== + +Before you can provision a host to run OSDs or metadata servers, you must gather +monitor keys and the OSD and MDS bootstrap keyrings. To gather keys, enter the +following:: + + ceph-deploy gatherkeys {monitor-host} + + +.. note:: To retrieve the keys, you specify a host that has a + Ceph monitor. + +.. 
note:: If you have specified multiple monitors in the setup of the cluster, + make sure, that all monitors are up and running. If the monitors haven't + formed quorum, ``ceph-create-keys`` will not finish and the keys are not + generated. + +Forget Keys +=========== + +When you are no longer using ``ceph-deploy`` (or if you are recreating a +cluster), you should delete the keys in the local directory of your admin host. +To delete keys, enter the following:: + + ceph-deploy forgetkeys + diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst b/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst new file mode 100644 index 0000000..d2afaec --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-mds.rst @@ -0,0 +1,46 @@ +============================ + Add/Remove Metadata Server +============================ + +With ``ceph-deploy``, adding and removing metadata servers is a simple task. You +just add or remove one or more metadata servers on the command line with one +command. + +.. important:: You must deploy at least one metadata server to use CephFS. + There is experimental support for running multiple metadata servers. + Do not run multiple active metadata servers in production. + +See `MDS Config Reference`_ for details on configuring metadata servers. + + +Add a Metadata Server +===================== + +Once you deploy monitors and OSDs you may deploy the metadata server(s). :: + + ceph-deploy mds create {host-name}[:{daemon-name}] [{host-name}[:{daemon-name}] ...] + +You may specify a daemon instance a name (optional) if you would like to run +multiple daemons on a single server. + + +Remove a Metadata Server +======================== + +Coming soon... + +.. If you have a metadata server in your cluster that you'd like to remove, you may use +.. the ``destroy`` option. :: + +.. ceph-deploy mds destroy {host-name}[:{daemon-name}] [{host-name}[:{daemon-name}] ...] + +.. You may specify a daemon instance a name (optional) if you would like to destroy +.. a particular daemon that runs on a single server with multiple MDS daemons. + +.. .. note:: Ensure that if you remove a metadata server, the remaining metadata + servers will be able to service requests from CephFS clients. If that is not + possible, consider adding a metadata server before destroying the metadata + server you would like to take offline. + + +.. _MDS Config Reference: ../../../cephfs/mds-config-ref diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst b/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst new file mode 100644 index 0000000..bda34fe --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-mon.rst @@ -0,0 +1,56 @@ +===================== + Add/Remove Monitors +===================== + +With ``ceph-deploy``, adding and removing monitors is a simple task. You just +add or remove one or more monitors on the command line with one command. Before +``ceph-deploy``, the process of `adding and removing monitors`_ involved +numerous manual steps. Using ``ceph-deploy`` imposes a restriction: **you may +only install one monitor per host.** + +.. note:: We do not recommend comingling monitors and OSDs on + the same host. + +For high availability, you should run a production Ceph cluster with **AT +LEAST** three monitors. Ceph uses the Paxos algorithm, which requires a +consensus among the majority of monitors in a quorum. With Paxos, the monitors +cannot determine a majority for establishing a quorum with only two monitors. 
A +majority of monitors must be counted as such: 1:1, 2:3, 3:4, 3:5, 4:6, etc. + +See `Monitor Config Reference`_ for details on configuring monitors. + + +Add a Monitor +============= + +Once you create a cluster and install Ceph packages to the monitor host(s), you +may deploy the monitor(s) to the monitor host(s). When using ``ceph-deploy``, +the tool enforces a single monitor per host. :: + + ceph-deploy mon create {host-name [host-name]...} + + +.. note:: Ensure that you add monitors such that they may arrive at a consensus + among a majority of monitors, otherwise other steps (like ``ceph-deploy gatherkeys``) + will fail. + +.. note:: When adding a monitor on a host that was not in hosts initially defined + with the ``ceph-deploy new`` command, a ``public network`` statement needs + to be added to the ceph.conf file. + +Remove a Monitor +================ + +If you have a monitor in your cluster that you'd like to remove, you may use +the ``destroy`` option. :: + + ceph-deploy mon destroy {host-name [host-name]...} + + +.. note:: Ensure that if you remove a monitor, the remaining monitors will be + able to establish a consensus. If that is not possible, consider adding a + monitor before removing the monitor you would like to take offline. + + +.. _adding and removing monitors: ../../operations/add-or-rm-mons +.. _Monitor Config Reference: ../../configuration/mon-config-ref diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-new.rst b/src/ceph/doc/rados/deployment/ceph-deploy-new.rst new file mode 100644 index 0000000..5eb37a9 --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-new.rst @@ -0,0 +1,66 @@ +================== + Create a Cluster +================== + +The first step in using Ceph with ``ceph-deploy`` is to create a new Ceph +cluster. A new Ceph cluster has: + +- A Ceph configuration file, and +- A monitor keyring. + +The Ceph configuration file consists of at least: + +- Its own filesystem ID (``fsid``) +- The initial monitor(s) hostname(s), and +- The initial monitor(s) and IP address(es). + +For additional details, see the `Monitor Configuration Reference`_. + +The ``ceph-deploy`` tool also creates a monitor keyring and populates it with a +``[mon.]`` key. For additional details, see the `Cephx Guide`_. + + +Usage +----- + +To create a cluster with ``ceph-deploy``, use the ``new`` command and specify +the host(s) that will be initial members of the monitor quorum. :: + + ceph-deploy new {host [host], ...} + +For example:: + + ceph-deploy new mon1.foo.com + ceph-deploy new mon{1,2,3} + +The ``ceph-deploy`` utility will use DNS to resolve hostnames to IP +addresses. The monitors will be named using the first component of +the name (e.g., ``mon1`` above). It will add the specified host names +to the Ceph configuration file. For additional details, execute:: + + ceph-deploy new -h + + +Naming a Cluster +---------------- + +By default, Ceph clusters have a cluster name of ``ceph``. You can specify +a cluster name if you want to run multiple clusters on the same hardware. For +example, if you want to optimize a cluster for use with block devices, and +another for use with the gateway, you can run two different clusters on the same +hardware if they have a different ``fsid`` and cluster name. :: + + ceph-deploy --cluster {cluster-name} new {host [host], ...} + +For example:: + + ceph-deploy --cluster rbdcluster new ceph-mon1 + ceph-deploy --cluster rbdcluster new ceph-mon{1,2,3} + +.. 
note:: If you run multiple clusters, ensure you adjust the default + port settings and open ports for your additional cluster(s) so that + the networks of the two different clusters don't conflict with each other. + + +.. _Monitor Configuration Reference: ../../configuration/mon-config-ref +.. _Cephx Guide: ../../../dev/mon-bootstrap#secret-keys diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst b/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst new file mode 100644 index 0000000..a4eb4d1 --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-osd.rst @@ -0,0 +1,121 @@ +================= + Add/Remove OSDs +================= + +Adding and removing Ceph OSD Daemons to your cluster may involve a few more +steps when compared to adding and removing other Ceph daemons. Ceph OSD Daemons +write data to the disk and to journals. So you need to provide a disk for the +OSD and a path to the journal partition (i.e., this is the most common +configuration, but you may configure your system to your own needs). + +In Ceph v0.60 and later releases, Ceph supports ``dm-crypt`` on disk encryption. +You may specify the ``--dmcrypt`` argument when preparing an OSD to tell +``ceph-deploy`` that you want to use encryption. You may also specify the +``--dmcrypt-key-dir`` argument to specify the location of ``dm-crypt`` +encryption keys. + +You should test various drive configurations to gauge their throughput before +before building out a large cluster. See `Data Storage`_ for additional details. + + +List Disks +========== + +To list the disks on a node, execute the following command:: + + ceph-deploy disk list {node-name [node-name]...} + + +Zap Disks +========= + +To zap a disk (delete its partition table) in preparation for use with Ceph, +execute the following:: + + ceph-deploy disk zap {osd-server-name}:{disk-name} + ceph-deploy disk zap osdserver1:sdb + +.. important:: This will delete all data. + + +Prepare OSDs +============ + +Once you create a cluster, install Ceph packages, and gather keys, you +may prepare the OSDs and deploy them to the OSD node(s). If you need to +identify a disk or zap it prior to preparing it for use as an OSD, +see `List Disks`_ and `Zap Disks`_. :: + + ceph-deploy osd prepare {node-name}:{data-disk}[:{journal-disk}] + ceph-deploy osd prepare osdserver1:sdb:/dev/ssd + ceph-deploy osd prepare osdserver1:sdc:/dev/ssd + +The ``prepare`` command only prepares the OSD. On most operating +systems, the ``activate`` phase will automatically run when the +partitions are created on the disk (using Ceph ``udev`` rules). If not +use the ``activate`` command. See `Activate OSDs`_ for +details. + +The foregoing example assumes a disk dedicated to one Ceph OSD Daemon, and +a path to an SSD journal partition. We recommend storing the journal on +a separate drive to maximize throughput. You may dedicate a single drive +for the journal too (which may be expensive) or place the journal on the +same disk as the OSD (not recommended as it impairs performance). In the +foregoing example we store the journal on a partitioned solid state drive. + +You can use the settings --fs-type or --bluestore to choose which file system +you want to install in the OSD drive. (More information by running +'ceph-deploy osd prepare --help'). + +.. 
note:: When running multiple Ceph OSD daemons on a single node, and + sharing a partioned journal with each OSD daemon, you should consider + the entire node the minimum failure domain for CRUSH purposes, because + if the SSD drive fails, all of the Ceph OSD daemons that journal to it + will fail too. + + +Activate OSDs +============= + +Once you prepare an OSD you may activate it with the following command. :: + + ceph-deploy osd activate {node-name}:{data-disk-partition}[:{journal-disk-partition}] + ceph-deploy osd activate osdserver1:/dev/sdb1:/dev/ssd1 + ceph-deploy osd activate osdserver1:/dev/sdc1:/dev/ssd2 + +The ``activate`` command will cause your OSD to come ``up`` and be placed +``in`` the cluster. The ``activate`` command uses the path to the partition +created when running the ``prepare`` command. + + +Create OSDs +=========== + +You may prepare OSDs, deploy them to the OSD node(s) and activate them in one +step with the ``create`` command. The ``create`` command is a convenience method +for executing the ``prepare`` and ``activate`` command sequentially. :: + + ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}] + ceph-deploy osd create osdserver1:sdb:/dev/ssd1 + +.. List OSDs +.. ========= + +.. To list the OSDs deployed on a node(s), execute the following command:: + +.. ceph-deploy osd list {node-name} + + +Destroy OSDs +============ + +.. note:: Coming soon. See `Remove OSDs`_ for manual procedures. + +.. To destroy an OSD, execute the following command:: + +.. ceph-deploy osd destroy {node-name}:{path-to-disk}[:{path/to/journal}] + +.. Destroying an OSD will take it ``down`` and ``out`` of the cluster. + +.. _Data Storage: ../../../start/hardware-recommendations#data-storage +.. _Remove OSDs: ../../operations/add-or-rm-osds#removing-osds-manual diff --git a/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst b/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst new file mode 100644 index 0000000..685c3c4 --- /dev/null +++ b/src/ceph/doc/rados/deployment/ceph-deploy-purge.rst @@ -0,0 +1,25 @@ +============== + Purge a Host +============== + +When you remove Ceph daemons and uninstall Ceph, there may still be extraneous +data from the cluster on your server. The ``purge`` and ``purgedata`` commands +provide a convenient means of cleaning up a host. + + +Purge Data +========== + +To remove all data from ``/var/lib/ceph`` (but leave Ceph packages intact), +execute the ``purgedata`` command. + + ceph-deploy purgedata {hostname} [{hostname} ...] + + +Purge +===== + +To remove all data from ``/var/lib/ceph`` and uninstall Ceph packages, execute +the ``purge`` command. + + ceph-deploy purge {hostname} [{hostname} ...]
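
Taken together with ``forgetkeys``, these commands provide a complete
teardown of a test cluster. For example, to wipe two hosts and clear the keys
cached on the admin node (``node1`` and ``node2`` are placeholder
hostnames)::

    ceph-deploy purge node1 node2
    ceph-deploy purgedata node1 node2
    ceph-deploy forgetkeys
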
\ No newline at end of file diff --git a/src/ceph/doc/rados/deployment/index.rst b/src/ceph/doc/rados/deployment/index.rst new file mode 100644 index 0000000..0853e4a --- /dev/null +++ b/src/ceph/doc/rados/deployment/index.rst @@ -0,0 +1,58 @@ +================= + Ceph Deployment +================= + +The ``ceph-deploy`` tool is a way to deploy Ceph relying only upon SSH access to +the servers, ``sudo``, and some Python. It runs on your workstation, and does +not require servers, databases, or any other tools. If you set up and +tear down Ceph clusters a lot, and want minimal extra bureaucracy, +``ceph-deploy`` is an ideal tool. The ``ceph-deploy`` tool is not a generic +deployment system. It was designed exclusively for Ceph users who want to get +Ceph up and running quickly with sensible initial configuration settings without +the overhead of installing Chef, Puppet or Juju. Users who want fine-control +over security settings, partitions or directory locations should use a tool +such as Juju, Puppet, `Chef`_ or Crowbar. + + +With ``ceph-deploy``, you can develop scripts to install Ceph packages on remote +hosts, create a cluster, add monitors, gather (or forget) keys, add OSDs and +metadata servers, configure admin hosts, and tear down the clusters. + +.. raw:: html + + <table cellpadding="10"><tbody valign="top"><tr><td> + +.. toctree:: + + Preflight Checklist <preflight-checklist> + Install Ceph <ceph-deploy-install> + +.. raw:: html + + </td><td> + +.. toctree:: + + Create a Cluster <ceph-deploy-new> + Add/Remove Monitor(s) <ceph-deploy-mon> + Key Management <ceph-deploy-keys> + Add/Remove OSD(s) <ceph-deploy-osd> + Add/Remove MDS(s) <ceph-deploy-mds> + + +.. raw:: html + + </td><td> + +.. toctree:: + + Purge Hosts <ceph-deploy-purge> + Admin Tasks <ceph-deploy-admin> + + +.. raw:: html + + </td></tr></tbody></table> + + +.. _Chef: http://tracker.ceph.com/projects/ceph/wiki/Deploying_Ceph_with_Chef diff --git a/src/ceph/doc/rados/deployment/preflight-checklist.rst b/src/ceph/doc/rados/deployment/preflight-checklist.rst new file mode 100644 index 0000000..64a669f --- /dev/null +++ b/src/ceph/doc/rados/deployment/preflight-checklist.rst @@ -0,0 +1,109 @@ +===================== + Preflight Checklist +===================== + +.. versionadded:: 0.60 + +This **Preflight Checklist** will help you prepare an admin node for use with +``ceph-deploy``, and server nodes for use with passwordless ``ssh`` and +``sudo``. + +Before you can deploy Ceph using ``ceph-deploy``, you need to ensure that you +have a few things set up first on your admin node and on nodes running Ceph +daemons. + + +Install an Operating System +=========================== + +Install a recent release of Debian or Ubuntu (e.g., 12.04 LTS, 14.04 LTS) on +your nodes. For additional details on operating systems or to use other +operating systems other than Debian or Ubuntu, see `OS Recommendations`_. + + +Install an SSH Server +===================== + +The ``ceph-deploy`` utility requires ``ssh``, so your server node(s) require an +SSH server. :: + + sudo apt-get install openssh-server + + +Create a User +============= + +Create a user on nodes running Ceph daemons. + +.. tip:: We recommend a username that brute force attackers won't + guess easily (e.g., something other than ``root``, ``ceph``, etc). + +:: + + ssh user@ceph-server + sudo useradd -d /home/ceph -m ceph + sudo passwd ceph + + +``ceph-deploy`` installs packages onto your nodes. This means that +the user you create requires passwordless ``sudo`` privileges. + +.. 
note:: We **DO NOT** recommend enabling the ``root`` password + for security reasons. + +To provide full privileges to the user, add the following to +``/etc/sudoers.d/ceph``. :: + + echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph + sudo chmod 0440 /etc/sudoers.d/ceph + + +Configure SSH +============= + +Configure your admin machine with password-less SSH access to each node +running Ceph daemons (leave the passphrase empty). :: + + ssh-keygen + Generating public/private key pair. + Enter file in which to save the key (/ceph-client/.ssh/id_rsa): + Enter passphrase (empty for no passphrase): + Enter same passphrase again: + Your identification has been saved in /ceph-client/.ssh/id_rsa. + Your public key has been saved in /ceph-client/.ssh/id_rsa.pub. + +Copy the key to each node running Ceph daemons:: + + ssh-copy-id ceph@ceph-server + +Modify your ~/.ssh/config file of your admin node so that it defaults +to logging in as the user you created when no username is specified. :: + + Host ceph-server + Hostname ceph-server.fqdn-or-ip-address.com + User ceph + + +Install ceph-deploy +=================== + +To install ``ceph-deploy``, execute the following:: + + wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add - + echo deb http://ceph.com/debian-dumpling/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list + sudo apt-get update + sudo apt-get install ceph-deploy + + +Ensure Connectivity +=================== + +Ensure that your Admin node has connectivity to the network and to your Server +node (e.g., ensure ``iptables``, ``ufw`` or other tools that may prevent +connections, traffic forwarding, etc. to allow what you need). + + +Once you have completed this pre-flight checklist, you are ready to begin using +``ceph-deploy``. + +.. _OS Recommendations: ../../../start/os-recommendations diff --git a/src/ceph/doc/rados/index.rst b/src/ceph/doc/rados/index.rst new file mode 100644 index 0000000..929bb7e --- /dev/null +++ b/src/ceph/doc/rados/index.rst @@ -0,0 +1,76 @@ +====================== + Ceph Storage Cluster +====================== + +The :term:`Ceph Storage Cluster` is the foundation for all Ceph deployments. +Based upon :abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, Ceph +Storage Clusters consist of two types of daemons: a :term:`Ceph OSD Daemon` +(OSD) stores data as objects on a storage node; and a :term:`Ceph Monitor` (MON) +maintains a master copy of the cluster map. A Ceph Storage Cluster may contain +thousands of storage nodes. A minimal system will have at least one +Ceph Monitor and two Ceph OSD Daemons for data replication. + +The Ceph Filesystem, Ceph Object Storage and Ceph Block Devices read data from +and write data to the Ceph Storage Cluster. + +.. raw:: html + + <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style> + <table cellpadding="10"><colgroup><col width="33%"><col width="33%"><col width="33%"></colgroup><tbody valign="top"><tr><td><h3>Config and Deploy</h3> + +Ceph Storage Clusters have a few required settings, but most configuration +settings have default values. A typical deployment uses a deployment tool +to define a cluster and bootstrap a monitor. See `Deployment`_ for details +on ``ceph-deploy.`` + +.. toctree:: + :maxdepth: 2 + + Configuration <configuration/index> + Deployment <deployment/index> + +.. raw:: html + + </td><td><h3>Operations</h3> + +Once you have a deployed a Ceph Storage Cluster, you may begin operating +your cluster. + +.. 
toctree:: + :maxdepth: 2 + + + Operations <operations/index> + +.. toctree:: + :maxdepth: 1 + + Man Pages <man/index> + + +.. toctree:: + :hidden: + + troubleshooting/index + +.. raw:: html + + </td><td><h3>APIs</h3> + +Most Ceph deployments use `Ceph Block Devices`_, `Ceph Object Storage`_ and/or the +`Ceph Filesystem`_. You may also develop applications that talk directly to +the Ceph Storage Cluster. + +.. toctree:: + :maxdepth: 2 + + APIs <api/index> + +.. raw:: html + + </td></tr></tbody></table> + +.. _Ceph Block Devices: ../rbd/ +.. _Ceph Filesystem: ../cephfs/ +.. _Ceph Object Storage: ../radosgw/ +.. _Deployment: ../rados/deployment/ diff --git a/src/ceph/doc/rados/man/index.rst b/src/ceph/doc/rados/man/index.rst new file mode 100644 index 0000000..abeb88b --- /dev/null +++ b/src/ceph/doc/rados/man/index.rst @@ -0,0 +1,34 @@ +======================= + Object Store Manpages +======================= + +.. toctree:: + :maxdepth: 1 + + ../../man/8/ceph-disk.rst + ../../man/8/ceph-volume.rst + ../../man/8/ceph-volume-systemd.rst + ../../man/8/ceph.rst + ../../man/8/ceph-deploy.rst + ../../man/8/ceph-rest-api.rst + ../../man/8/ceph-authtool.rst + ../../man/8/ceph-clsinfo.rst + ../../man/8/ceph-conf.rst + ../../man/8/ceph-debugpack.rst + ../../man/8/ceph-dencoder.rst + ../../man/8/ceph-mon.rst + ../../man/8/ceph-osd.rst + ../../man/8/ceph-kvstore-tool.rst + ../../man/8/ceph-run.rst + ../../man/8/ceph-syn.rst + ../../man/8/crushtool.rst + ../../man/8/librados-config.rst + ../../man/8/monmaptool.rst + ../../man/8/osdmaptool.rst + ../../man/8/rados.rst + + +.. toctree:: + :hidden: + + ../../man/8/ceph-post-file.rst diff --git a/src/ceph/doc/rados/operations/add-or-rm-mons.rst b/src/ceph/doc/rados/operations/add-or-rm-mons.rst new file mode 100644 index 0000000..0cdc431 --- /dev/null +++ b/src/ceph/doc/rados/operations/add-or-rm-mons.rst @@ -0,0 +1,370 @@ +========================== + Adding/Removing Monitors +========================== + +When you have a cluster up and running, you may add or remove monitors +from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_ +or `Monitor Bootstrap`_. + +Adding Monitors +=============== + +Ceph monitors are light-weight processes that maintain a master copy of the +cluster map. You can run a cluster with 1 monitor. We recommend at least 3 +monitors for a production cluster. Ceph monitors use a variation of the +`Paxos`_ protocol to establish consensus about maps and other critical +information across the cluster. Due to the nature of Paxos, Ceph requires +a majority of monitors running to establish a quorum (thus establishing +consensus). + +It is advisable to run an odd-number of monitors but not mandatory. An +odd-number of monitors has a higher resiliency to failures than an +even-number of monitors. For instance, on a 2 monitor deployment, no +failures can be tolerated in order to maintain a quorum; with 3 monitors, +one failure can be tolerated; in a 4 monitor deployment, one failure can +be tolerated; with 5 monitors, two failures can be tolerated. This is +why an odd-number is advisable. Summarizing, Ceph needs a majority of +monitors to be running (and able to communicate with each other), but that +majority can be achieved using a single monitor, or 2 out of 2 monitors, +2 out of 3, 3 out of 4, etc. + +For an initial deployment of a multi-node Ceph cluster, it is advisable to +deploy three monitors, increasing the number two at a time if a valid need +for more than three exists. 
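
Once your monitors are deployed, you can confirm how many of them currently
form a quorum. For example (assuming the admin keyring is available on the
host where you run the command)::

    ceph mon stat
    # or, for more detail in JSON form
    ceph quorum_status --format json-pretty
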
+ +Since monitors are light-weight, it is possible to run them on the same +host as an OSD; however, we recommend running them on separate hosts, +because fsync issues with the kernel may impair performance. + +.. note:: A *majority* of monitors in your cluster must be able to + reach each other in order to establish a quorum. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new monitor, see `Hardware +Recommendations`_ for details on minimum recommendations for monitor hardware. +To add a monitor host to your cluster, first make sure you have an up-to-date +version of Linux installed (typically Ubuntu 14.04 or RHEL 7). + +Add your monitor host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Packages`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Packages: ../../../install/install-storage-cluster + + +.. _Adding a Monitor (Manual): + +Adding a Monitor (Manual) +------------------------- + +This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map +and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If +this results in only two monitor daemons, you may add more monitors by +repeating this procedure until you have a sufficient number of ``ceph-mon`` +daemons to achieve a quorum. + +At this point you should define your monitor's id. Traditionally, monitors +have been named with single letters (``a``, ``b``, ``c``, ...), but you are +free to define the id as you see fit. For the purpose of this document, +please take into account that ``{mon-id}`` should be the id you chose, +without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` +on ``mon.a``). + +#. Create the default directory on the machine that will host your + new monitor. :: + + ssh {new-mon-host} + sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} + +#. Create a temporary directory ``{tmp}`` to keep the files needed during + this process. This directory should be different from the monitor's default + directory created in the previous step, and can be removed after all the + steps are executed. :: + + mkdir {tmp} + +#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to + the retrieved keyring, and ``{key-filename}`` is the name of the file + containing the retrieved monitor key. :: + + ceph auth get mon. -o {tmp}/{key-filename} + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{map-filename}`` is the name of the file + containing the retrieved monitor monitor map. :: + + ceph mon getmap -o {tmp}/{map-filename} + +#. Prepare the monitor's data directory created in the first step. You must + specify the path to the monitor map so that you can retrieve the + information about a quorum of monitors and their ``fsid``. You must also + specify a path to the monitor keyring:: + + sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} + + +#. Start the new monitor and it will automatically join the cluster. + The daemon needs to know which address to bind to, either via + ``--public-addr {ip:port}`` or by setting ``mon addr`` in the + appropriate section of ``ceph.conf``. 
For example:: + + ceph-mon -i {mon-id} --public-addr {ip:port} + + +Removing Monitors +================= + +When you remove monitors from a cluster, consider that Ceph monitors use +PAXOS to establish consensus about the master cluster map. You must have +a sufficient number of monitors to establish a quorum for consensus about +the cluster map. + +.. _Removing a Monitor (Manual): + +Removing a Monitor (Manual) +--------------------------- + +This procedure removes a ``ceph-mon`` daemon from your cluster. If this +procedure results in only two monitor daemons, you may add or remove another +monitor until you have a number of ``ceph-mon`` daemons that can achieve a +quorum. + +#. Stop the monitor. :: + + service ceph -a stop mon.{mon-id} + +#. Remove the monitor from the cluster. :: + + ceph mon remove {mon-id} + +#. Remove the monitor entry from ``ceph.conf``. + + +Removing Monitors from an Unhealthy Cluster +------------------------------------------- + +This procedure removes a ``ceph-mon`` daemon from an unhealthy +cluster, for example a cluster where the monitors cannot form a +quorum. + + +#. Stop all ``ceph-mon`` daemons on all monitor hosts. :: + + ssh {mon-host} + service ceph stop mon || stop ceph-mon-all + # and repeat for all mons + +#. Identify a surviving monitor and log in to that host. :: + + ssh {mon-host} + +#. Extract a copy of the monmap file. :: + + ceph-mon -i {mon-id} --extract-monmap {map-path} + # in most cases, that's + ceph-mon -i `hostname` --extract-monmap /tmp/monmap + +#. Remove the non-surviving or problematic monitors. For example, if + you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where + only ``mon.a`` will survive, follow the example below:: + + monmaptool {map-path} --rm {mon-id} + # for example, + monmaptool /tmp/monmap --rm b + monmaptool /tmp/monmap --rm c + +#. Inject the surviving map with the removed monitors into the + surviving monitor(s). For example, to inject a map into monitor + ``mon.a``, follow the example below:: + + ceph-mon -i {mon-id} --inject-monmap {map-path} + # for example, + ceph-mon -i a --inject-monmap /tmp/monmap + +#. Start only the surviving monitors. + +#. Verify the monitors form a quorum (``ceph -s``). + +#. You may wish to archive the removed monitors' data directory in + ``/var/lib/ceph/mon`` in a safe location, or delete it if you are + confident the remaining monitors are healthy and are sufficiently + redundant. + +.. _Changing a Monitor's IP address: + +Changing a Monitor's IP Address +=============================== + +.. important:: Existing monitors are not supposed to change their IP addresses. + +Monitors are critical components of a Ceph cluster, and they need to maintain a +quorum for the whole system to work properly. To establish a quorum, the +monitors need to discover each other. Ceph has strict requirements for +discovering monitors. + +Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. +However, monitors discover each other using the monitor map, not ``ceph.conf``. +For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you +need to obtain the current monmap for the cluster when creating a new monitor, +as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The +following sections explain the consistency requirements for Ceph monitors, and a +few safe ways to change a monitor's IP address. 
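
Before making any change, it can be helpful to inspect the monmap the cluster
currently agrees on. For example (``/tmp/monmap`` is just a temporary file;
sample output is shown later in this section)::

    ceph mon getmap -o /tmp/monmap
    monmaptool --print /tmp/monmap
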
+ + +Consistency Requirements +------------------------ + +A monitor always refers to the local copy of the monmap when discovering other +monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids +errors that could break the cluster (e.g., typos in ``ceph.conf`` when +specifying a monitor address or port). Since monitors use monmaps for discovery +and they share monmaps with clients and other Ceph daemons, the monmap provides +monitors with a strict guarantee that their consensus is valid. + +Strict consistency also applies to updates to the monmap. As with any other +updates on the monitor, changes to the monmap always run through a distributed +consensus algorithm called `Paxos`_. The monitors must agree on each update to +the monmap, such as adding or removing a monitor, to ensure that each monitor in +the quorum has the same version of the monmap. Updates to the monmap are +incremental so that monitors have the latest agreed upon version, and a set of +previous versions, allowing a monitor that has an older version of the monmap to +catch up with the current state of the cluster. + +If monitors discovered each other through the Ceph configuration file instead of +through the monmap, it would introduce additional risks because the Ceph +configuration files are not updated and distributed automatically. Monitors +might inadvertently use an older ``ceph.conf`` file, fail to recognize a +monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able +to determine the current state of the system accurately. Consequently, making +changes to an existing monitor's IP address must be done with great care. + + +Changing a Monitor's IP address (The Right Way) +----------------------------------------------- + +Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to +ensure that other monitors in the cluster will receive the update. To change a +monitor's IP address, you must add a new monitor with the IP address you want +to use (as described in `Adding a Monitor (Manual)`_), ensure that the new +monitor successfully joins the quorum; then, remove the monitor that uses the +old IP address. Then, update the ``ceph.conf`` file to ensure that clients and +other daemons know the IP address of the new monitor. + +For example, lets assume there are three monitors in place, such as :: + + [mon.a] + host = host01 + addr = 10.0.0.1:6789 + [mon.b] + host = host02 + addr = 10.0.0.2:6789 + [mon.c] + host = host03 + addr = 10.0.0.3:6789 + +To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the +steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure +that ``mon.d`` is running before removing ``mon.c``, or it will break the +quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving +all three monitors would thus require repeating this process as many times as +needed. + + +Changing a Monitor's IP address (The Messy Way) +----------------------------------------------- + +There may come a time when the monitors must be moved to a different network, a +different part of the datacenter or a different datacenter altogether. While it +is possible to do it, the process becomes a bit more hazardous. + +In such a case, the solution is to generate a new monmap with updated IP +addresses for all the monitors in the cluster, and inject the new map on each +individual monitor. 
This is not the most user-friendly approach, but we do not +expect this to be something that needs to be done every other week. As it is +clearly stated on the top of this section, monitors are not supposed to change +IP addresses. + +Using the previous monitor configuration as an example, assume you want to move +all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these +networks are unable to communicate. Use the following procedure: + +#. Retrieve the monitor map, where ``{tmp}`` is the path to + the retrieved monitor map, and ``{filename}`` is the name of the file + containing the retrieved monitor monitor map. :: + + ceph mon getmap -o {tmp}/{filename} + +#. The following example demonstrates the contents of the monmap. :: + + $ monmaptool --print {tmp}/{filename} + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.0.0.1:6789/0 mon.a + 1: 10.0.0.2:6789/0 mon.b + 2: 10.0.0.3:6789/0 mon.c + +#. Remove the existing monitors. :: + + $ monmaptool --rm a --rm b --rm c {tmp}/{filename} + + monmaptool: monmap file {tmp}/{filename} + monmaptool: removing a + monmaptool: removing b + monmaptool: removing c + monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) + +#. Add the new monitor locations. :: + + $ monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} + + monmaptool: monmap file {tmp}/{filename} + monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) + +#. Check new contents. :: + + $ monmaptool --print {tmp}/{filename} + + monmaptool: monmap file {tmp}/{filename} + epoch 1 + fsid 224e376d-c5fe-4504-96bb-ea6332a19e61 + last_changed 2012-12-17 02:46:41.591248 + created 2012-12-17 02:46:41.591248 + 0: 10.1.0.1:6789/0 mon.a + 1: 10.1.0.2:6789/0 mon.b + 2: 10.1.0.3:6789/0 mon.c + +At this point, we assume the monitors (and stores) are installed at the new +location. The next step is to propagate the modified monmap to the new +monitors, and inject the modified monmap into each new monitor. + +#. First, make sure to stop all your monitors. Injection must be done while + the daemon is not running. + +#. Inject the monmap. :: + + ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} + +#. Restart the monitors. + +After this step, migration to the new location is complete and +the monitors should operate successfully. + + +.. _Manual Deployment: ../../../install/manual-deployment +.. _Monitor Bootstrap: ../../../dev/mon-bootstrap +.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science) diff --git a/src/ceph/doc/rados/operations/add-or-rm-osds.rst b/src/ceph/doc/rados/operations/add-or-rm-osds.rst new file mode 100644 index 0000000..59ce4c7 --- /dev/null +++ b/src/ceph/doc/rados/operations/add-or-rm-osds.rst @@ -0,0 +1,366 @@ +====================== + Adding/Removing OSDs +====================== + +When you have a cluster up and running, you may add OSDs or remove OSDs +from the cluster at runtime. + +Adding OSDs +=========== + +When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an +OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a +host machine. If your host has multiple storage drives, you may map one +``ceph-osd`` daemon for each drive. + +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. 
As your cluster reaches its ``near +full`` ratio, you should add one or more OSDs to expand your cluster's capacity. + +.. warning:: Do not let your cluster reach its ``full ratio`` before + adding an OSD. OSD failures that occur after the cluster reaches + its ``near full`` ratio may cause the cluster to exceed its + ``full ratio``. + +Deploy your Hardware +-------------------- + +If you are adding a new host when adding a new OSD, see `Hardware +Recommendations`_ for details on minimum recommendations for OSD hardware. To +add an OSD host to your cluster, first make sure you have an up-to-date version +of Linux installed, and you have made some initial preparations for your +storage drives. See `Filesystem Recommendations`_ for details. + +Add your OSD host to a rack in your cluster, connect it to the network +and ensure that it has network connectivity. See the `Network Configuration +Reference`_ for details. + +.. _Hardware Recommendations: ../../../start/hardware-recommendations +.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations +.. _Network Configuration Reference: ../../configuration/network-config-ref + +Install the Required Software +----------------------------- + +For manually deployed clusters, you must install Ceph packages +manually. See `Installing Ceph (Manual)`_ for details. +You should configure SSH to a user with password-less authentication +and root permissions. + +.. _Installing Ceph (Manual): ../../../install + + +Adding an OSD (Manual) +---------------------- + +This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, +and configures the cluster to distribute data to the OSD. If your host has +multiple drives, you may add an OSD for each drive by repeating this procedure. + +To add an OSD, create a data directory for it, mount a drive to that directory, +add the OSD to the cluster, and then add it to the CRUSH map. + +When you add the OSD to the CRUSH map, consider the weight you give to the new +OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger +hard drives than older hosts in the cluster (i.e., they may have greater +weight). + +.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives + of dissimilar size, you can adjust their weights. However, for best + performance, consider a CRUSH hierarchy with drives of the same type/size. + +#. Create the OSD. If no UUID is given, it will be set automatically when the + OSD starts up. The following command will output the OSD number, which you + will need for subsequent steps. :: + + ceph osd create [{uuid} [{id}]] + + If the optional parameter {id} is given it will be used as the OSD id. + Note, in this case the command may fail if the number is already in use. + + .. warning:: In general, explicitly specifying {id} is not recommended. + IDs are allocated as an array, and skipping entries consumes some extra + memory. This can become significant if there are large gaps and/or + clusters are large. If {id} is not specified, the smallest available is + used. + +#. Create the default directory on your new OSD. :: + + ssh {new-osd-host} + sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} + + +#. If the OSD is for a drive other than the OS drive, prepare it + for use with Ceph, and mount it to the directory you just created:: + + ssh {new-osd-host} + sudo mkfs -t {fstype} /dev/{drive} + sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} + + +#. Initialize the OSD data directory. 
:: + + ssh {new-osd-host} + ceph-osd -i {osd-num} --mkfs --mkkey + + The directory must be empty before you can run ``ceph-osd``. + +#. Register the OSD authentication key. The value of ``ceph`` for + ``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your + cluster name differs from ``ceph``, use your cluster name instead.:: + + ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring + + +#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The + ``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy + wherever you wish. If you specify at least one bucket, the command + will place the OSD into the most specific bucket you specify, *and* it will + move that bucket underneath any other buckets you specify. **Important:** If + you specify only the root bucket, the command will attach the OSD directly + to the root, but CRUSH rules expect OSDs to be inside of hosts. + + For Argonaut (v 0.48), execute the following:: + + ceph osd crush add {id} {name} {weight} [{bucket-type}={bucket-name} ...] + + For Bobtail (v 0.56) and later releases, execute the following:: + + ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] + + You may also decompile the CRUSH map, add the OSD to the device list, add the + host as a bucket (if it's not already in the CRUSH map), add the device as an + item in the host, assign it a weight, recompile it and set it. See + `Add/Move an OSD`_ for details. + + +.. topic:: Argonaut (v0.48) Best Practices + + To limit impact on user I/O performance, add an OSD to the CRUSH map + with an initial weight of ``0``. Then, ramp up the CRUSH weight a + little bit at a time. For example, to ramp by increments of ``0.2``, + start with:: + + ceph osd crush reweight {osd-id} .2 + + and allow migration to complete before reweighting to ``0.4``, + ``0.6``, and so on until the desired CRUSH weight is reached. + + To limit the impact of OSD failures, you can set:: + + mon osd down out interval = 0 + + which prevents down OSDs from automatically being marked out, and then + ramp them down manually with:: + + ceph osd reweight {osd-num} .8 + + Again, wait for the cluster to finish migrating data, and then adjust + the weight further until you reach a weight of 0. Note that this + problem prevents the cluster to automatically re-replicate data after + a failure, so please ensure that sufficient monitoring is in place for + an administrator to intervene promptly. + + Note that this practice will no longer be necessary in Bobtail and + subsequent releases. + + +Replacing an OSD +---------------- + +When disks fail, or if an admnistrator wants to reprovision OSDs with a new +backend, for instance, for switching from FileStore to BlueStore, OSDs need to +be replaced. Unlike `Removing the OSD`_, replaced OSD's id and CRUSH map entry +need to be keep intact after the OSD is destroyed for replacement. + +#. Destroy the OSD first:: + + ceph osd destroy {id} --yes-i-really-mean-it + +#. Zap a disk for the new OSD, if the disk was used before for other purposes. + It's not necessary for a new disk:: + + ceph-disk zap /dev/sdX + +#. Prepare the disk for replacement by using the previously destroyed OSD id:: + + ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen` + +#. And activate the OSD:: + + ceph-disk activate /dev/sdX1 + + +Starting the OSD +---------------- + +After you add an OSD to Ceph, the OSD is in your configuration. However, +it is not yet running. 
The OSD is ``down`` and ``in``. You must start +your new OSD before it can begin receiving data. You may use +``service ceph`` from your admin host or start the OSD from its host +machine. + +For Ubuntu Trusty use Upstart. :: + + sudo start ceph-osd id={osd-num} + +For all other distros use systemd. :: + + sudo systemctl start ceph-osd@{osd-num} + + +Once you start your OSD, it is ``up`` and ``in``. + + +Observe the Data Migration +-------------------------- + +Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing +the server by migrating placement groups to your new OSD. You can observe this +process with the `ceph`_ tool. :: + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + + +.. _Add/Move an OSD: ../crush-map#addosd +.. _ceph: ../monitoring + + + +Removing OSDs (Manual) +====================== + +When you want to reduce the size of a cluster or replace hardware, you may +remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd`` +daemon for one storage drive within a host machine. If your host has multiple +storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. +Generally, it's a good idea to check the capacity of your cluster to see if you +are reaching the upper end of its capacity. Ensure that when you remove an OSD +that your cluster is not at its ``near full`` ratio. + +.. warning:: Do not let your cluster reach its ``full ratio`` when + removing an OSD. Removing OSDs could cause the cluster to reach + or exceed its ``full ratio``. + + +Take the OSD out of the Cluster +----------------------------------- + +Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it +out of the cluster so that Ceph can begin rebalancing and copying its data to +other OSDs. :: + + ceph osd out {osd-num} + + +Observe the Data Migration +-------------------------- + +Once you have taken your OSD ``out`` of the cluster, Ceph will begin +rebalancing the cluster by migrating placement groups out of the OSD you +removed. You can observe this process with the `ceph`_ tool. :: + + ceph -w + +You should see the placement group states change from ``active+clean`` to +``active, some degraded objects``, and finally ``active+clean`` when migration +completes. (Control-c to exit.) + +.. note:: Sometimes, typically in a "small" cluster with few hosts (for + instance with a small testing cluster), the fact to take ``out`` the + OSD can spawn a CRUSH corner case where some PGs remain stuck in the + ``active+remapped`` state. If you are in this case, you should mark + the OSD ``in`` with: + + ``ceph osd in {osd-num}`` + + to come back to the initial state and then, instead of marking ``out`` + the OSD, set its weight to 0 with: + + ``ceph osd crush reweight osd.{osd-num} 0`` + + After that, you can observe the data migration which should come to its + end. The difference between marking ``out`` the OSD and reweighting it + to 0 is that in the first case the weight of the bucket which contains + the OSD is not changed whereas in the second case the weight of the bucket + is updated (and decreased of the OSD weight). The reweight command could + be sometimes favoured in the case of a "small" cluster. + + + +Stopping the OSD +---------------- + +After you take an OSD out of the cluster, it may still be running. +That is, the OSD may be ``up`` and ``out``. 
+
+Stopping the OSD
+----------------
+
+After you take an OSD out of the cluster, it may still be running.
+That is, the OSD may be ``up`` and ``out``. You must stop
+your OSD before you remove it from the configuration. ::
+
+	ssh {osd-host}
+	sudo systemctl stop ceph-osd@{osd-num}
+
+Once you stop your OSD, it is ``down``.
+
+
+Removing the OSD
+----------------
+
+This procedure removes an OSD from a cluster map, removes its authentication
+key, removes the OSD from the OSD map, and removes the OSD from the
+``ceph.conf`` file. If your host has multiple drives, you may need to remove an
+OSD for each drive by repeating this procedure. A consolidated example appears
+at the end of this section.
+
+#. Let the cluster forget the OSD first. This step removes the OSD from the
+   CRUSH map, removes its authentication key, and removes it from the OSD map
+   as well. Please note that the `purge subcommand`_ was introduced in Luminous;
+   for older versions, see below. ::
+
+	ceph osd purge {id} --yes-i-really-mean-it
+
+#. Navigate to the host where you keep the master copy of the cluster's
+   ``ceph.conf`` file. ::
+
+	ssh {admin-host}
+	cd /etc/ceph
+	vim ceph.conf
+
+#. Remove the OSD entry from your ``ceph.conf`` file (if it exists). ::
+
+	[osd.1]
+	host = {hostname}
+
+#. From the host where you keep the master copy of the cluster's ``ceph.conf`` file,
+   copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of other
+   hosts in your cluster.
+
+If your Ceph cluster is older than Luminous, instead of using ``ceph osd purge``,
+you need to perform this step manually:
+
+
+#. Remove the OSD from the CRUSH map so that it no longer receives data. You may
+   also decompile the CRUSH map, remove the OSD from the device list, remove the
+   device as an item in the host bucket or remove the host bucket (if it's in the
+   CRUSH map and you intend to remove the host), recompile the map and set it.
+   See `Remove an OSD`_ for details. ::
+
+	ceph osd crush remove {name}
+
+#. Remove the OSD authentication key. ::
+
+	ceph auth del osd.{osd-num}
+
+   The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the ``$cluster-$id``.
+   If your cluster name differs from ``ceph``, use your cluster name instead.
+
+#. Remove the OSD. ::
+
+	ceph osd rm {osd-num}
+	#for example
+	ceph osd rm 1
+
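+Putting the steps in this section together, removing a hypothetical ``osd.1``
+from a Luminous or later cluster might look like this (the id and host are
+examples only)::
+
+	ceph osd out 1
+	ceph -w                                    # wait for rebalancing to finish
+	ssh {osd-host} sudo systemctl stop ceph-osd@1
+	ceph osd purge 1 --yes-i-really-mean-it
+
+Remember to remove any remaining ``[osd.1]`` entry from ``ceph.conf`` afterwards,
+as described above.
+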
+
+
+.. _Remove an OSD: ../crush-map#removeosd
+.. _purge subcommand: /man/8/ceph#osd
diff --git a/src/ceph/doc/rados/operations/cache-tiering.rst b/src/ceph/doc/rados/operations/cache-tiering.rst
new file mode 100644
index 0000000..322c6ff
--- /dev/null
+++ b/src/ceph/doc/rados/operations/cache-tiering.rst
@@ -0,0 +1,461 @@
+===============
+ Cache Tiering
+===============
+
+A cache tier provides Ceph Clients with better I/O performance for a subset of
+the data stored in a backing storage tier. Cache tiering involves creating a
+pool of relatively fast/expensive storage devices (e.g., solid state drives)
+configured to act as a cache tier, and a backing pool of either erasure-coded
+or relatively slower/cheaper devices configured to act as an economical storage
+tier. The Ceph objecter handles where to place the objects and the tiering
+agent determines when to flush objects from the cache to the backing storage
+tier. So the cache tier and the backing storage tier are completely transparent
+to Ceph clients.
+
+
+.. ditaa::
+           +-------------+
+           | Ceph Client |
+           +------+------+
+                  ^
+     Tiering is   |
+    Transparent   |              Faster I/O
+        to Ceph   |           +---------------+
+     Client Ops   |           |               |
+                  |    +----->+   Cache Tier  |
+                  |    |      |               |
+                  |    |      +-----+---+-----+
+                  |    |            |   ^
+                  v    v            |   |   Active Data in Cache Tier
+           +------+----+--+         |   |
+           |   Objecter   |         |   |
+           +-----------+--+         |   |
+                       ^            |   |   Inactive Data in Storage Tier
+                       |            v   |
+                       |      +-----+---+-----+
+                       |      |               |
+                       +----->|  Storage Tier |
+                              |               |
+                              +---------------+
+                                 Slower I/O
+
+The cache tiering agent handles the migration of data between the cache tier
+and the backing storage tier automatically. However, admins have the ability to
+configure how this migration takes place. There are two main scenarios:
+
+- **Writeback Mode:** When admins configure tiers with ``writeback`` mode, Ceph
+  clients write data to the cache tier and receive an ACK from the cache tier.
+  In time, the data written to the cache tier migrates to the storage tier
+  and gets flushed from the cache tier. Conceptually, the cache tier is
+  overlaid "in front" of the backing storage tier. When a Ceph client needs
+  data that resides in the storage tier, the cache tiering agent migrates the
+  data to the cache tier on read, then it is sent to the Ceph client.
+  Thereafter, the Ceph client can perform I/O using the cache tier, until the
+  data becomes inactive. This is ideal for mutable data (e.g., photo/video
+  editing, transactional data, etc.).
+
+- **Read-proxy Mode:** This mode will use any objects that already
+  exist in the cache tier, but if an object is not present in the
+  cache the request will be proxied to the base tier. This is useful
+  for transitioning from ``writeback`` mode to a disabled cache as it
+  allows the workload to function properly while the cache is drained,
+  without adding any new objects to the cache.
+
+A word of caution
+=================
+
+Cache tiering will *degrade* performance for most workloads. Users should use
+extreme caution before using this feature.
+
+* *Workload dependent*: Whether a cache will improve performance is
+  highly dependent on the workload. Because there is a cost
+  associated with moving objects into or out of the cache, it can only
+  be effective when there is a *large skew* in the access pattern in
+  the data set, such that most of the requests touch a small number of
+  objects. The cache pool should be large enough to capture the
+  working set for your workload to avoid thrashing.
+
+* *Difficult to benchmark*: Most benchmarks that users run to measure
+  performance will show terrible performance with cache tiering, in
+  part because very few of them skew requests toward a small set of
+  objects, because it can take a long time for the cache to "warm up,"
+  and because the warm-up cost can be high.
+
+* *Usually slower*: For workloads that are not cache tiering-friendly,
+  performance is often slower than a normal RADOS pool without cache
+  tiering enabled.
+
+* *librados object enumeration*: The librados-level object enumeration
+  API is not meant to be coherent in the presence of a cache. If
+  your application is using librados directly and relies on object
+  enumeration, cache tiering will probably not work as expected.
+  (This is not a problem for RGW, RBD, or CephFS.)
+
+* *Complexity*: Enabling cache tiering means that a lot of additional
+  machinery and complexity within the RADOS cluster is being used.
+ This increases the probability that you will encounter a bug in the system + that other users have not yet encountered and will put your deployment at a + higher level of risk. + +Known Good Workloads +-------------------- + +* *RGW time-skewed*: If the RGW workload is such that almost all read + operations are directed at recently written objects, a simple cache + tiering configuration that destages recently written objects from + the cache to the base tier after a configurable period can work + well. + +Known Bad Workloads +------------------- + +The following configurations are *known to work poorly* with cache +tiering. + +* *RBD with replicated cache and erasure-coded base*: This is a common + request, but usually does not perform well. Even reasonably skewed + workloads still send some small writes to cold objects, and because + small writes are not yet supported by the erasure-coded pool, entire + (usually 4 MB) objects must be migrated into the cache in order to + satisfy a small (often 4 KB) write. Only a handful of users have + successfully deployed this configuration, and it only works for them + because their data is extremely cold (backups) and they are not in + any way sensitive to performance. + +* *RBD with replicated cache and base*: RBD with a replicated base + tier does better than when the base is erasure coded, but it is + still highly dependent on the amount of skew in the workload, and + very difficult to validate. The user will need to have a good + understanding of their workload and will need to tune the cache + tiering parameters carefully. + + +Setting Up Pools +================ + +To set up cache tiering, you must have two pools. One will act as the +backing storage and the other will act as the cache. + + +Setting Up a Backing Storage Pool +--------------------------------- + +Setting up a backing storage pool typically involves one of two scenarios: + +- **Standard Storage**: In this scenario, the pool stores multiple copies + of an object in the Ceph Storage Cluster. + +- **Erasure Coding:** In this scenario, the pool uses erasure coding to + store data much more efficiently with a small performance tradeoff. + +In the standard storage scenario, you can setup a CRUSH ruleset to establish +the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD +Daemons perform optimally when all storage drives in the ruleset are of the +same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_ +for details on creating a ruleset. Once you have created a ruleset, create +a backing storage pool. + +In the erasure coding scenario, the pool creation arguments will generate the +appropriate ruleset automatically. See `Create a Pool`_ for details. + +In subsequent examples, we will refer to the backing storage pool +as ``cold-storage``. + + +Setting Up a Cache Pool +----------------------- + +Setting up a cache pool follows the same procedure as the standard storage +scenario, but with this difference: the drives for the cache tier are typically +high performance drives that reside in their own servers and have their own +ruleset. When setting up a ruleset, it should take account of the hosts that +have the high performance drives while omitting the hosts that don't. See +`Placing Different Pools on Different OSDs`_ for details. + + +In subsequent examples, we will refer to the cache pool as ``hot-storage`` and +the backing pool as ``cold-storage``. + +For cache tier configuration and default values, see +`Pools - Set Pool Values`_. 
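+
+For illustration, the two pools used in the examples of this section could be
+created as follows; the pool names and placement group counts are examples
+only, so choose values appropriate for your cluster (see `Create a Pool`_)::
+
+	ceph osd pool create cold-storage 128
+	ceph osd pool create hot-storage 128
+
+In practice, the ``hot-storage`` pool would also be assigned the CRUSH ruleset
+that targets the high performance drives, as described in the previous section.
+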
+ + +Creating a Cache Tier +===================== + +Setting up a cache tier involves associating a backing storage pool with +a cache pool :: + + ceph osd tier add {storagepool} {cachepool} + +For example :: + + ceph osd tier add cold-storage hot-storage + +To set the cache mode, execute the following:: + + ceph osd tier cache-mode {cachepool} {cache-mode} + +For example:: + + ceph osd tier cache-mode hot-storage writeback + +The cache tiers overlay the backing storage tier, so they require one +additional step: you must direct all client traffic from the storage pool to +the cache pool. To direct client traffic directly to the cache pool, execute +the following:: + + ceph osd tier set-overlay {storagepool} {cachepool} + +For example:: + + ceph osd tier set-overlay cold-storage hot-storage + + +Configuring a Cache Tier +======================== + +Cache tiers have several configuration options. You may set +cache tier configuration options with the following usage:: + + ceph osd pool set {cachepool} {key} {value} + +See `Pools - Set Pool Values`_ for details. + + +Target Size and Type +-------------------- + +Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``:: + + ceph osd pool set {cachepool} hit_set_type bloom + +For example:: + + ceph osd pool set hot-storage hit_set_type bloom + +The ``hit_set_count`` and ``hit_set_period`` define how much time each HitSet +should cover, and how many such HitSets to store. :: + + ceph osd pool set {cachepool} hit_set_count 12 + ceph osd pool set {cachepool} hit_set_period 14400 + ceph osd pool set {cachepool} target_max_bytes 1000000000000 + +.. note:: A larger ``hit_set_count`` results in more RAM consumed by + the ``ceph-osd`` process. + +Binning accesses over time allows Ceph to determine whether a Ceph client +accessed an object at least once, or more than once over a time period +("age" vs "temperature"). + +The ``min_read_recency_for_promote`` defines how many HitSets to check for the +existence of an object when handling a read operation. The checking result is +used to decide whether to promote the object asynchronously. Its value should be +between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted. +If it's set to 1, the current HitSet is checked. And if this object is in the +current HitSet, it's promoted. Otherwise not. For the other values, the exact +number of archive HitSets are checked. The object is promoted if the object is +found in any of the most recent ``min_read_recency_for_promote`` HitSets. + +A similar parameter can be set for the write operation, which is +``min_write_recency_for_promote``. :: + + ceph osd pool set {cachepool} min_read_recency_for_promote 2 + ceph osd pool set {cachepool} min_write_recency_for_promote 2 + +.. note:: The longer the period and the higher the + ``min_read_recency_for_promote`` and + ``min_write_recency_for_promote``values, the more RAM the ``ceph-osd`` + daemon consumes. In particular, when the agent is active to flush + or evict cache objects, all ``hit_set_count`` HitSets are loaded + into RAM. + + +Cache Sizing +------------ + +The cache tiering agent performs two main functions: + +- **Flushing:** The agent identifies modified (or dirty) objects and forwards + them to the storage pool for long-term storage. + +- **Evicting:** The agent identifies objects that haven't been modified + (or clean) and evicts the least recently used among them from the cache. 
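+
+Before tuning the thresholds in the following sections, it can be helpful to
+look at what is currently configured on the cache pool. One way to do this
+(assuming the ``hot-storage`` pool from the earlier examples) is::
+
+	ceph osd dump | grep hot-storage
+	ceph osd pool get hot-storage target_max_bytes
+
+The ``ceph osd pool get`` form works for the other cache tiering values shown
+below as well, though the exact set of readable fields can vary by release.
+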
+
+Absolute Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects based upon the total number
+of bytes or the total number of objects. To specify a maximum number of bytes,
+execute the following::
+
+	ceph osd pool set {cachepool} target_max_bytes {#bytes}
+
+For example, to flush or evict at 1 TB, execute the following::
+
+	ceph osd pool set hot-storage target_max_bytes 1099511627776
+
+
+To specify the maximum number of objects, execute the following::
+
+	ceph osd pool set {cachepool} target_max_objects {#objects}
+
+For example, to flush or evict at 1M objects, execute the following::
+
+	ceph osd pool set hot-storage target_max_objects 1000000
+
+.. note:: Ceph is not able to determine the size of a cache pool automatically,
+   so you must configure an absolute size here; otherwise, flushing and
+   eviction will not work. If you specify both limits, the cache tiering
+   agent will begin flushing or evicting when either threshold is triggered.
+
+.. note:: Client requests are blocked only when ``target_max_bytes`` or
+   ``target_max_objects`` is reached.
+
+Relative Sizing
+~~~~~~~~~~~~~~~
+
+The cache tiering agent can flush or evict objects relative to the size of the
+cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
+`Absolute Sizing`_). When the cache pool consists of a certain percentage of
+modified (or dirty) objects, the cache tiering agent will flush them to the
+storage pool. To set the ``cache_target_dirty_ratio``, execute the following::
+
+	ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}
+
+For example, setting the value to ``0.4`` will begin flushing modified
+(dirty) objects when they reach 40% of the cache pool's capacity::
+
+	ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+
+When the dirty objects reach a certain percentage of the cache pool's capacity,
+the cache tiering agent flushes them at a higher speed. To set the
+``cache_target_dirty_high_ratio``::
+
+	ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}
+
+For example, setting the value to ``0.6`` will begin aggressively flushing
+dirty objects when they reach 60% of the cache pool's capacity. This value
+should be set between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``::
+
+	ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+
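+As a quick worked example of how the relative thresholds combine with the
+absolute limits above (the numbers are illustrative only): with
+``target_max_bytes`` set to 1099511627776 (1 TiB), a ``cache_target_dirty_ratio``
+of ``0.4`` means flushing begins once roughly 440 GB of the cache pool is
+dirty, and a ``cache_target_dirty_high_ratio`` of ``0.6`` makes flushing more
+aggressive at roughly 660 GB::
+
+	ceph osd pool set hot-storage target_max_bytes 1099511627776
+	ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
+	ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
+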
+When the cache pool reaches a certain percentage of its capacity, the cache
+tiering agent will evict objects to maintain free capacity. To set the
+``cache_target_full_ratio``, execute the following::
+
+	ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}
+
+For example, setting the value to ``0.8`` will begin flushing unmodified
+(clean) objects when they reach 80% of the cache pool's capacity::
+
+	ceph osd pool set hot-storage cache_target_full_ratio 0.8
+
+
+Cache Age
+---------
+
+You can specify the minimum age of an object before the cache tiering agent
+flushes a recently modified (or dirty) object to the backing storage pool::
+
+	ceph osd pool set {cachepool} cache_min_flush_age {#seconds}
+
+For example, to flush modified (or dirty) objects after 10 minutes, execute
+the following::
+
+	ceph osd pool set hot-storage cache_min_flush_age 600
+
+You can specify the minimum age of an object before it will be evicted from
+the cache tier::
+
+	ceph osd pool set {cache-tier} cache_min_evict_age {#seconds}
+
+For example, to evict objects after 30 minutes, execute the following::
+
+	ceph osd pool set hot-storage cache_min_evict_age 1800
+
+
+Removing a Cache Tier
+=====================
+
+Removing a cache tier differs depending on whether it is a writeback
+cache or a read-only cache.
+
+
+Removing a Read-Only Cache
+--------------------------
+
+Since a read-only cache does not have modified data, you can disable
+and remove it without losing any recent changes to objects in the cache.
+
+#. Change the cache-mode to ``none`` to disable it. ::
+
+	ceph osd tier cache-mode {cachepool} none
+
+   For example::
+
+	ceph osd tier cache-mode hot-storage none
+
+#. Remove the cache pool from the backing pool. ::
+
+	ceph osd tier remove {storagepool} {cachepool}
+
+   For example::
+
+	ceph osd tier remove cold-storage hot-storage
+
+
+
+Removing a Writeback Cache
+--------------------------
+
+Since a writeback cache may have modified data, you must take steps to ensure
+that you do not lose any recent changes to objects in the cache before you
+disable and remove it.
+
+
+#. Change the cache mode to ``forward`` so that new and modified objects will
+   flush to the backing storage pool. ::
+
+	ceph osd tier cache-mode {cachepool} forward
+
+   For example::
+
+	ceph osd tier cache-mode hot-storage forward
+
+
+#. Ensure that the cache pool has been flushed. This may take a few minutes::
+
+	rados -p {cachepool} ls
+
+   If the cache pool still has objects, you can flush them manually.
+   For example::
+
+	rados -p {cachepool} cache-flush-evict-all
+
+
+#. Remove the overlay so that clients will not direct traffic to the cache. ::
+
+	ceph osd tier remove-overlay {storagepool}
+
+   For example::
+
+	ceph osd tier remove-overlay cold-storage
+
+
+#. Finally, remove the cache tier pool from the backing storage pool. ::
+
+	ceph osd tier remove {storagepool} {cachepool}
+
+   For example::
+
+	ceph osd tier remove cold-storage hot-storage
+
+
+.. _Create a Pool: ../pools#create-a-pool
+.. _Pools - Set Pool Values: ../pools#set-pool-values
+.. _Placing Different Pools on Different OSDs: ../crush-map/#placing-different-pools-on-different-osds
+.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
+.. _CRUSH Maps: ../crush-map
+.. _Absolute Sizing: #absolute-sizing
diff --git a/src/ceph/doc/rados/operations/control.rst b/src/ceph/doc/rados/operations/control.rst
new file mode 100644
index 0000000..1a58076
--- /dev/null
+++ b/src/ceph/doc/rados/operations/control.rst
@@ -0,0 +1,453 @@
+.. 
index:: control, commands + +================== + Control Commands +================== + + +Monitor Commands +================ + +Monitor commands are issued using the ceph utility:: + + ceph [-m monhost] {command} + +The command is usually (though not always) of the form:: + + ceph {subsystem} {command} + + +System Commands +=============== + +Execute the following to display the current status of the cluster. :: + + ceph -s + ceph status + +Execute the following to display a running summary of the status of the cluster, +and major events. :: + + ceph -w + +Execute the following to show the monitor quorum, including which monitors are +participating and which one is the leader. :: + + ceph quorum_status + +Execute the following to query the status of a single monitor, including whether +or not it is in the quorum. :: + + ceph [-m monhost] mon_status + + +Authentication Subsystem +======================== + +To add a keyring for an OSD, execute the following:: + + ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring} + +To list the cluster's keys and their capabilities, execute the following:: + + ceph auth ls + + +Placement Group Subsystem +========================= + +To display the statistics for all placement groups, execute the following:: + + ceph pg dump [--format {format}] + +The valid formats are ``plain`` (default) and ``json``. + +To display the statistics for all placement groups stuck in a specified state, +execute the following:: + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}] + + +``--format`` may be ``plain`` (default) or ``json`` + +``--threshold`` defines how many seconds "stuck" is (default: 300) + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come back. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. + +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by +``mon_osd_report_timeout``). + +Delete "lost" objects or revert them to their prior state, either a previous version +or delete them if they were just created. :: + + ceph pg {pgid} mark_unfound_lost revert|delete + + +OSD Subsystem +============= + +Query OSD subsystem status. :: + + ceph osd stat + +Write a copy of the most recent OSD map to a file. See +`osdmaptool`_. :: + + ceph osd getmap -o file + +.. _osdmaptool: ../../man/8/osdmaptool + +Write a copy of the crush map from the most recent OSD map to +file. :: + + ceph osd getcrushmap -o file + +The foregoing functionally equivalent to :: + + ceph osd getmap -o /tmp/osdmap + osdmaptool /tmp/osdmap --export-crush file + +Dump the OSD map. Valid formats for ``-f`` are ``plain`` and ``json``. If no +``--format`` option is given, the OSD map is dumped as plain text. :: + + ceph osd dump [--format {format}] + +Dump the OSD map as a tree with one line per OSD containing weight +and state. :: + + ceph osd tree [--format {format}] + +Find out where a specific object is or would be stored in the system:: + + ceph osd map <pool-name> <object-name> + +Add or move a new item (OSD) with the given id/name/weight at the specified +location. :: + + ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]] + +Remove an existing item (OSD) from the CRUSH map. :: + + ceph osd crush remove {name} + +Remove an existing bucket from the CRUSH map. 
:: + + ceph osd crush remove {bucket-name} + +Move an existing bucket from one position in the hierarchy to another. :: + + ceph osd crush move {id} {loc1} [{loc2} ...] + +Set the weight of the item given by ``{name}`` to ``{weight}``. :: + + ceph osd crush reweight {name} {weight} + +Mark an OSD as lost. This may result in permanent data loss. Use with caution. :: + + ceph osd lost {id} [--yes-i-really-mean-it] + +Create a new OSD. If no UUID is given, it will be set automatically when the OSD +starts up. :: + + ceph osd create [{uuid}] + +Remove the given OSD(s). :: + + ceph osd rm [{id}...] + +Query the current max_osd parameter in the OSD map. :: + + ceph osd getmaxosd + +Import the given crush map. :: + + ceph osd setcrushmap -i file + +Set the ``max_osd`` parameter in the OSD map. This is necessary when +expanding the storage cluster. :: + + ceph osd setmaxosd + +Mark OSD ``{osd-num}`` down. :: + + ceph osd down {osd-num} + +Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). :: + + ceph osd out {osd-num} + +Mark ``{osd-num}`` in the distribution (i.e. allocated data). :: + + ceph osd in {osd-num} + +Set or clear the pause flags in the OSD map. If set, no IO requests +will be sent to any OSD. Clearing the flags via unpause results in +resending pending requests. :: + + ceph osd pause + ceph osd unpause + +Set the weight of ``{osd-num}`` to ``{weight}``. Two OSDs with the +same weight will receive roughly the same number of I/O requests and +store approximately the same amount of data. ``ceph osd reweight`` +sets an override weight on the OSD. This value is in the range 0 to 1, +and forces CRUSH to re-place (1-weight) of the data that would +otherwise live on this drive. It does not change the weights assigned +to the buckets above the OSD in the crush map, and is a corrective +measure in case the normal CRUSH distribution is not working out quite +right. For instance, if one of your OSDs is at 90% and the others are +at 50%, you could reduce this weight to try and compensate for it. :: + + ceph osd reweight {osd-num} {weight} + +Reweights all the OSDs by reducing the weight of OSDs which are +heavily overused. By default it will adjust the weights downward on +OSDs which have 120% of the average utilization, but if you include +threshold it will use that percentage instead. :: + + ceph osd reweight-by-utilization [threshold] + +Describes what reweight-by-utilization would do. :: + + ceph osd test-reweight-by-utilization + +Adds/removes the address to/from the blacklist. When adding an address, +you can specify how long it should be blacklisted in seconds; otherwise, +it will default to 1 hour. A blacklisted address is prevented from +connecting to any OSD. Blacklisting is most often used to prevent a +lagging metadata server from making bad changes to data on the OSDs. + +These commands are mostly only useful for failure testing, as +blacklists are normally maintained automatically and shouldn't need +manual intervention. :: + + ceph osd blacklist add ADDRESS[:source_port] [TIME] + ceph osd blacklist rm ADDRESS[:source_port] + +Creates/deletes a snapshot of a pool. :: + + ceph osd pool mksnap {pool-name} {snap-name} + ceph osd pool rmsnap {pool-name} {snap-name} + +Creates/deletes/renames a storage pool. :: + + ceph osd pool create {pool-name} pg_num [pgp_num] + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + ceph osd pool rename {old-name} {new-name} + +Changes a pool setting. 
:: + + ceph osd pool set {pool-name} {field} {value} + +Valid fields are: + + * ``size``: Sets the number of copies of data in the pool. + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number when calculating pg placement. + * ``crush_ruleset``: rule number for mapping placement. + +Get the value of a pool setting. :: + + ceph osd pool get {pool-name} {field} + +Valid fields are: + + * ``pg_num``: The placement group number. + * ``pgp_num``: Effective number of placement groups when calculating placement. + * ``lpg_num``: The number of local placement groups. + * ``lpgp_num``: The number used for placing the local placement groups. + + +Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. :: + + ceph osd scrub {osd-num} + +Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. :: + + ceph osd repair N + +Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES`` +in write requests of ``BYTES_PER_WRITE`` each. By default, the test +writes 1 GB in total in 4-MB increments. +The benchmark is non-destructive and will not overwrite existing live +OSD data, but might temporarily affect the performance of clients +concurrently accessing the OSD. :: + + ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE] + + +MDS Subsystem +============= + +Change configuration parameters on a running mds. :: + + ceph tell mds.{mds-id} injectargs --{switch} {value} [--{switch} {value}] + +Example:: + + ceph tell mds.0 injectargs --debug_ms 1 --debug_mds 10 + +Enables debug messages. :: + + ceph mds stat + +Displays the status of all metadata servers. :: + + ceph mds fail 0 + +Marks the active MDS as failed, triggering failover to a standby if present. + +.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap + + +Mon Subsystem +============= + +Show monitor stats:: + + ceph mon stat + + e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c + + +The ``quorum`` list at the end lists monitor nodes that are part of the current quorum. + +This is also available more directly:: + + ceph quorum_status -f json-pretty + +.. code-block:: javascript + + { + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "quorum_names": [ + "a", + "b", + "c" + ], + "quorum_leader_name": "a", + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + + +The above will block until a quorum is reached. + +For a status of just the monitor you connect to (use ``-m HOST:PORT`` +to select):: + + ceph mon_status -f json-pretty + + +.. 
code-block:: javascript + + { + "name": "b", + "rank": 1, + "state": "peon", + "election_epoch": 6, + "quorum": [ + 0, + 1, + 2 + ], + "features": { + "required_con": "9025616074522624", + "required_mon": [ + "kraken" + ], + "quorum_con": "1152921504336314367", + "quorum_mon": [ + "kraken" + ] + }, + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { + "epoch": 2, + "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc", + "modified": "2016-12-26 14:42:09.288066", + "created": "2016-12-26 14:42:03.573585", + "features": { + "persistent": [ + "kraken" + ], + "optional": [] + }, + "mons": [ + { + "rank": 0, + "name": "a", + "addr": "127.0.0.1:40000\/0", + "public_addr": "127.0.0.1:40000\/0" + }, + { + "rank": 1, + "name": "b", + "addr": "127.0.0.1:40001\/0", + "public_addr": "127.0.0.1:40001\/0" + }, + { + "rank": 2, + "name": "c", + "addr": "127.0.0.1:40002\/0", + "public_addr": "127.0.0.1:40002\/0" + } + ] + } + } + +A dump of the monitor state:: + + ceph mon dump + + dumped monmap epoch 2 + epoch 2 + fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc + last_changed 2016-12-26 14:42:09.288066 + created 2016-12-26 14:42:03.573585 + 0: 127.0.0.1:40000/0 mon.a + 1: 127.0.0.1:40001/0 mon.b + 2: 127.0.0.1:40002/0 mon.c + diff --git a/src/ceph/doc/rados/operations/crush-map-edits.rst b/src/ceph/doc/rados/operations/crush-map-edits.rst new file mode 100644 index 0000000..5222270 --- /dev/null +++ b/src/ceph/doc/rados/operations/crush-map-edits.rst @@ -0,0 +1,654 @@ +Manually editing a CRUSH Map +============================ + +.. note:: Manually editing the CRUSH map is considered an advanced + administrator operation. All CRUSH changes that are + necessary for the overwhelming majority of installations are + possible via the standard ceph CLI and do not require manual + CRUSH map edits. If you have identified a use case where + manual edits *are* necessary, consider contacting the Ceph + developers so that future versions of Ceph can make this + unnecessary. + +To edit an existing CRUSH map: + +#. `Get the CRUSH map`_. +#. `Decompile`_ the CRUSH map. +#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. +#. `Recompile`_ the CRUSH map. +#. `Set the CRUSH map`_. + +To activate CRUSH map rules for a specific pool, identify the common ruleset +number for those rules and specify that ruleset number for the pool. See `Set +Pool Values`_ for details. + +.. _Get the CRUSH map: #getcrushmap +.. _Decompile: #decompilecrushmap +.. _Devices: #crushmapdevices +.. _Buckets: #crushmapbuckets +.. _Rules: #crushmaprules +.. _Recompile: #compilecrushmap +.. _Set the CRUSH map: #setcrushmap +.. _Set Pool Values: ../pools#setpoolvalues + +.. _getcrushmap: + +Get a CRUSH Map +--------------- + +To get the CRUSH map for your cluster, execute the following:: + + ceph osd getcrushmap -o {compiled-crushmap-filename} + +Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since +the CRUSH map is in a compiled form, you must decompile it first before you can +edit it. + +.. _decompilecrushmap: + +Decompile a CRUSH Map +--------------------- + +To decompile a CRUSH map, execute the following:: + + crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} + + +Sections +-------- + +There are six main sections to a CRUSH Map. + +#. **tunables:** The preamble at the top of the map described any *tunables* + for CRUSH behavior that vary from the historical/legacy CRUSH behavior. 
These + correct for old bugs, optimizations, or other changes in behavior that have + been made over the years to improve CRUSH's behavior. + +#. **devices:** Devices are individual ``ceph-osd`` daemons that can + store data. + +#. **types**: Bucket ``types`` define the types of buckets used in + your CRUSH hierarchy. Buckets consist of a hierarchical aggregation + of storage locations (e.g., rows, racks, chassis, hosts, etc.) and + their assigned weights. + +#. **buckets:** Once you define bucket types, you must define each node + in the hierarchy, its type, and which devices or other nodes it + containes. + +#. **rules:** Rules define policy about how data is distributed across + devices in the hierarchy. + +#. **choose_args:** Choose_args are alternative weights associated with + the hierarchy that have been adjusted to optimize data placement. A single + choose_args map can be used for the entire cluster, or one can be + created for each individual pool. + + +.. _crushmapdevices: + +CRUSH Map Devices +----------------- + +Devices are individual ``ceph-osd`` daemons that can store data. You +will normally have one defined here for each OSD daemon in your +cluster. Devices are identified by an id (a non-negative integer) and +a name, normally ``osd.N`` where ``N`` is the device id. + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targetted by a +crush rule. + +:: + + # devices + device {num} {osd.name} [class {class}] + +For example:: + + # devices + device 0 osd.0 class ssd + device 1 osd.1 class hdd + device 2 osd.2 + device 3 osd.3 + +In most cases, each device maps to a single ``ceph-osd`` daemon. This +is normally a single storage device, a pair of devices (for example, +one for data and one for a journal or metadata), or in some cases a +small RAID device. + + + + + +CRUSH Map Bucket Types +---------------------- + +The second list in the CRUSH map defines 'bucket' types. Buckets facilitate +a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent +physical locations in a hierarchy. Nodes aggregate other nodes or leaves. +Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage +media. + +.. tip:: The term "bucket" used in the context of CRUSH means a node in + the hierarchy, i.e. a location or a piece of physical hardware. It + is a different concept from the term "bucket" when used in the + context of RADOS Gateway APIs. + +To add a bucket type to the CRUSH map, create a new line under your list of +bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. +By convention, there is one leaf bucket and it is ``type 0``; however, you may +give it any name you like (e.g., osd, disk, drive, storage, etc.):: + + #types + type {num} {bucket-name} + +For example:: + + # types + type 0 osd + type 1 host + type 2 chassis + type 3 rack + type 4 row + type 5 pdu + type 6 pod + type 7 room + type 8 datacenter + type 9 region + type 10 root + + + +.. _crushmapbuckets: + +CRUSH Map Bucket Hierarchy +-------------------------- + +The CRUSH algorithm distributes data objects among storage devices according +to a per-device weight value, approximating a uniform probability distribution. +CRUSH distributes objects and their replicas according to the hierarchical +cluster map you define. Your CRUSH map represents the available storage +devices and the logical elements that contain them. 
+ +To map placement groups to OSDs across failure domains, a CRUSH map defines a +hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH +map). The purpose of creating a bucket hierarchy is to segregate the +leaf nodes by their failure domains, such as hosts, chassis, racks, power +distribution units, pods, rows, rooms, and data centers. With the exception of +the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and +you may define it according to your own needs. + +We recommend adapting your CRUSH map to your firms's hardware naming conventions +and using instances names that reflect the physical hardware. Your naming +practice can make it easier to administer the cluster and troubleshoot +problems when an OSD and/or other hardware malfunctions and the administrator +need access to physical hardware. + +In the following example, the bucket hierarchy has a leaf bucket named ``osd``, +and two node buckets named ``host`` and ``rack`` respectively. + +.. ditaa:: + +-----------+ + | {o}rack | + | Bucket | + +-----+-----+ + | + +---------------+---------------+ + | | + +-----+-----+ +-----+-----+ + | {o}host | | {o}host | + | Bucket | | Bucket | + +-----+-----+ +-----+-----+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd | | osd | | osd | | osd | + | Bucket | | Bucket | | Bucket | | Bucket | + +-----------+ +-----------+ +-----------+ +-----------+ + +.. note:: The higher numbered ``rack`` bucket type aggregates the lower + numbered ``host`` bucket type. + +Since leaf nodes reflect storage devices declared under the ``#devices`` list +at the beginning of the CRUSH map, you do not need to declare them as bucket +instances. The second lowest bucket type in your hierarchy usually aggregates +the devices (i.e., it's usually the computer containing the storage media, and +uses whatever term you prefer to describe it, such as "node", "computer", +"server," "host", "machine", etc.). In high density environments, it is +increasingly common to see multiple hosts/nodes per chassis. You should account +for chassis failure too--e.g., the need to pull a chassis if a node fails may +result in bringing down numerous hosts/nodes and their OSDs. + +When declaring a bucket instance, you must specify its type, give it a unique +name (string), assign it a unique ID expressed as a negative integer (optional), +specify a weight relative to the total capacity/capability of its item(s), +specify the bucket algorithm (usually ``straw``), and the hash (usually ``0``, +reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. +The items may consist of node buckets or leaves. Items may have a weight that +reflects the relative weight of the item. + +You may declare a node bucket with the following syntax:: + + [bucket-type] [bucket-name] { + id [a unique negative numeric ID] + weight [the relative capacity/capability of the item(s)] + alg [the bucket type: uniform | list | tree | straw ] + hash [the hash type: 0 by default] + item [item-name] weight [weight] + } + +For example, using the diagram above, we would define two host buckets +and one rack bucket. 
The OSDs are declared as items within the host buckets:: + + host node1 { + id -1 + alg straw + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host node2 { + id -2 + alg straw + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + rack rack1 { + id -3 + alg straw + hash 0 + item node1 weight 2.00 + item node2 weight 2.00 + } + +.. note:: In the foregoing example, note that the rack bucket does not contain + any OSDs. Rather it contains lower level host buckets, and includes the + sum total of their weight in the item entry. + +.. topic:: Bucket Types + + Ceph supports four bucket types, each representing a tradeoff between + performance and reorganization efficiency. If you are unsure of which bucket + type to use, we recommend using a ``straw`` bucket. For a detailed + discussion of bucket types, refer to + `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, + and more specifically to **Section 3.4**. The bucket types are: + + #. **Uniform:** Uniform buckets aggregate devices with **exactly** the same + weight. For example, when firms commission or decommission hardware, they + typically do so with many machines that have exactly the same physical + configuration (e.g., bulk purchases). When storage devices have exactly + the same weight, you may use the ``uniform`` bucket type, which allows + CRUSH to map replicas into uniform buckets in constant time. With + non-uniform weights, you should use another bucket algorithm. + + #. **List**: List buckets aggregate their content as linked lists. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, + a list is a natural and intuitive choice for an **expanding cluster**: + either an object is relocated to the newest device with some appropriate + probability, or it remains on the older devices as before. The result is + optimal data migration when items are added to the bucket. Items removed + from the middle or tail of the list, however, can result in a significant + amount of unnecessary movement, making list buckets most suitable for + circumstances in which they **never (or very rarely) shrink**. + + #. **Tree**: Tree buckets use a binary search tree. They are more efficient + than list buckets when a bucket contains a larger set of items. Based on + the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, + tree buckets reduce the placement time to O(log :sub:`n`), making them + suitable for managing much larger sets of devices or nested buckets. + + #. **Straw:** List and Tree buckets use a divide and conquer strategy + in a way that either gives certain items precedence (e.g., those + at the beginning of a list) or obviates the need to consider entire + subtrees of items at all. That improves the performance of the replica + placement process, but can also introduce suboptimal reorganization + behavior when the contents of a bucket change due an addition, removal, + or re-weighting of an item. The straw bucket type allows all items to + fairly “compete” against each other for replica placement through a + process analogous to a draw of straws. + +.. topic:: Hash + + Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. + Enter ``0`` as your hash setting to select ``rjenkins1``. + + +.. _weightingbucketitems: + +.. topic:: Weighting Bucket Items + + Ceph expresses bucket weights as doubles, which allows for fine + weighting. A weight is the relative difference between device capacities. 
We + recommend using ``1.00`` as the relative weight for a 1TB storage device. + In such a scenario, a weight of ``0.5`` would represent approximately 500GB, + and a weight of ``3.00`` would represent approximately 3TB. Higher level + buckets have a weight that is the sum total of the leaf items aggregated by + the bucket. + + A bucket item weight is one dimensional, but you may also calculate your + item weights to reflect the performance of the storage drive. For example, + if you have many 1TB drives where some have relatively low data transfer + rate and the others have a relatively high data transfer rate, you may + weight them differently, even though they have the same capacity (e.g., + a weight of 0.80 for the first set of drives with lower total throughput, + and 1.20 for the second set of drives with higher total throughput). + + +.. _crushmaprules: + +CRUSH Map Rules +--------------- + +CRUSH maps support the notion of 'CRUSH rules', which are the rules that +determine data placement for a pool. For large clusters, you will likely create +many pools where each pool may have its own CRUSH ruleset and rules. The default +CRUSH map has a rule for each pool, and one ruleset assigned to each of the +default pools. + +.. note:: In most cases, you will not need to modify the default rules. When + you create a new pool, its default ruleset is ``0``. + + +CRUSH rules define placement and replication strategies or distribution policies +that allow you to specify exactly how CRUSH places object replicas. For +example, you might create a rule selecting a pair of targets for 2-way +mirroring, another rule for selecting three targets in two different data +centers for 3-way mirroring, and yet another rule for erasure coding over six +storage devices. For a detailed discussion of CRUSH rules, refer to +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, +and more specifically to **Section 3.2**. + +A rule takes the following form:: + + rule <rulename> { + + ruleset <ruleset> + type [ replicated | erasure ] + min_size <min-size> + max_size <max-size> + step take <bucket-name> [class <device-class>] + step [choose|chooseleaf] [firstn|indep] <N> <bucket-type> + step emit + } + + +``ruleset`` + +:Description: A means of classifying a rule as belonging to a set of rules. + Activated by `setting the ruleset in a pool`_. + +:Purpose: A component of the rule mask. +:Type: Integer +:Required: Yes +:Default: 0 + +.. _setting the ruleset in a pool: ../pools#setpoolvalues + + +``type`` + +:Description: Describes a rule for either a storage drive (replicated) + or a RAID. + +:Purpose: A component of the rule mask. +:Type: String +:Required: Yes +:Default: ``replicated`` +:Valid Values: Currently only ``replicated`` and ``erasure`` + +``min_size`` + +:Description: If a pool makes fewer replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: ``1`` + +``max_size`` + +:Description: If a pool makes more replicas than this number, CRUSH will + **NOT** select this rule. + +:Type: Integer +:Purpose: A component of the rule mask. +:Required: Yes +:Default: 10 + + +``step take <bucket-name> [class <device-class>]`` + +:Description: Takes a bucket name, and begins iterating down the tree. + If the ``device-class`` is specified, it must match + a class previously used when defining a device. All + devices that do not belong to the class are excluded. +:Purpose: A component of the rule. 
+:Required: Yes +:Example: ``step take data`` + + +``step choose firstn {num} type {bucket-type}`` + +:Description: Selects the number of buckets of the given type. The number is + usually the number of replicas in the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step choose firstn 1 type row`` + + +``step chooseleaf firstn {num} type {bucket-type}`` + +:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf + node from the subtree of each bucket in the set of buckets. The + number of buckets in the set is usually the number of replicas in + the pool (i.e., pool size). + + - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). + - If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. + - If ``{num} < 0``, it means ``pool-num-replicas - {num}``. + +:Purpose: A component of the rule. Usage removes the need to select a device using two steps. +:Prerequisite: Follows ``step take`` or ``step choose``. +:Example: ``step chooseleaf firstn 0 type row`` + + + +``step emit`` + +:Description: Outputs the current value and empties the stack. Typically used + at the end of a rule, but may also be used to pick from different + trees in the same rule. + +:Purpose: A component of the rule. +:Prerequisite: Follows ``step choose``. +:Example: ``step emit`` + +.. important:: To activate one or more rules with a common ruleset number to a + pool, set the ruleset number of the pool. + + +Placing Different Pools on Different OSDS: +========================================== + +Suppose you want to have most pools default to OSDs backed by large hard drives, +but have some pools mapped to OSDs backed by fast solid-state drives (SSDs). +It's possible to have multiple independent CRUSH hierarchies within the same +CRUSH map. 
Define two hierarchies with two different root nodes--one for hard +disks (e.g., "root platter") and one for SSDs (e.g., "root ssd") as shown +below:: + + device 0 osd.0 + device 1 osd.1 + device 2 osd.2 + device 3 osd.3 + device 4 osd.4 + device 5 osd.5 + device 6 osd.6 + device 7 osd.7 + + host ceph-osd-ssd-server-1 { + id -1 + alg straw + hash 0 + item osd.0 weight 1.00 + item osd.1 weight 1.00 + } + + host ceph-osd-ssd-server-2 { + id -2 + alg straw + hash 0 + item osd.2 weight 1.00 + item osd.3 weight 1.00 + } + + host ceph-osd-platter-server-1 { + id -3 + alg straw + hash 0 + item osd.4 weight 1.00 + item osd.5 weight 1.00 + } + + host ceph-osd-platter-server-2 { + id -4 + alg straw + hash 0 + item osd.6 weight 1.00 + item osd.7 weight 1.00 + } + + root platter { + id -5 + alg straw + hash 0 + item ceph-osd-platter-server-1 weight 2.00 + item ceph-osd-platter-server-2 weight 2.00 + } + + root ssd { + id -6 + alg straw + hash 0 + item ceph-osd-ssd-server-1 weight 2.00 + item ceph-osd-ssd-server-2 weight 2.00 + } + + rule data { + ruleset 0 + type replicated + min_size 2 + max_size 2 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule metadata { + ruleset 1 + type replicated + min_size 0 + max_size 10 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule rbd { + ruleset 2 + type replicated + min_size 0 + max_size 10 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule platter { + ruleset 3 + type replicated + min_size 0 + max_size 10 + step take platter + step chooseleaf firstn 0 type host + step emit + } + + rule ssd { + ruleset 4 + type replicated + min_size 0 + max_size 4 + step take ssd + step chooseleaf firstn 0 type host + step emit + } + + rule ssd-primary { + ruleset 5 + type replicated + min_size 5 + max_size 10 + step take ssd + step chooseleaf firstn 1 type host + step emit + step take platter + step chooseleaf firstn -1 type host + step emit + } + +You can then set a pool to use the SSD rule by:: + + ceph osd pool set <poolname> crush_ruleset 4 + +Similarly, using the ``ssd-primary`` rule will cause each placement group in the +pool to be placed with an SSD as the primary and platters as the replicas. + + +Tuning CRUSH, the hard way +-------------------------- + +If you can ensure that all clients are running recent code, you can +adjust the tunables by extracting the CRUSH map, modifying the values, +and reinjecting it into the cluster. + +* Extract the latest CRUSH map:: + + ceph osd getcrushmap -o /tmp/crush + +* Adjust tunables. These values appear to offer the best behavior + for both large and small clusters we tested with. You will need to + additionally specify the ``--enable-unsafe-tunables`` argument to + ``crushtool`` for this to work. Please use this option with + extreme care.:: + + crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new + +* Reinject modified map:: + + ceph osd setcrushmap -i /tmp/crush.new + +Legacy values +------------- + +For reference, the legacy values for the CRUSH tunables can be set +with:: + + crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy + +Again, the special ``--enable-unsafe-tunables`` option is required. 
+Further, as noted above, be careful running old versions of the +``ceph-osd`` daemon after reverting to legacy values as the feature +bit is not perfectly enforced. diff --git a/src/ceph/doc/rados/operations/crush-map.rst b/src/ceph/doc/rados/operations/crush-map.rst new file mode 100644 index 0000000..05fa4ff --- /dev/null +++ b/src/ceph/doc/rados/operations/crush-map.rst @@ -0,0 +1,956 @@ +============ + CRUSH Maps +============ + +The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm +determines how to store and retrieve data by computing data storage locations. +CRUSH empowers Ceph clients to communicate with OSDs directly rather than +through a centralized server or broker. With an algorithmically determined +method of storing and retrieving data, Ceph avoids a single point of failure, a +performance bottleneck, and a physical limit to its scalability. + +CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly +store and retrieve data in OSDs with a uniform distribution of data across the +cluster. For a detailed discussion of CRUSH, see +`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ + +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of +'buckets' for aggregating the devices into physical locations, and a list of +rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By +reflecting the underlying physical organization of the installation, CRUSH can +model—and thereby address—potential sources of correlated device failures. +Typical sources include physical proximity, a shared power source, and a shared +network. By encoding this information into the cluster map, CRUSH placement +policies can separate object replicas across different failure domains while +still maintaining the desired distribution. For example, to address the +possibility of concurrent failures, it may be desirable to ensure that data +replicas are on devices using different shelves, racks, power supplies, +controllers, and/or physical locations. + +When you deploy OSDs they are automatically placed within the CRUSH map under a +``host`` node named with the hostname for the host they are running on. This, +combined with the default CRUSH failure domain, ensures that replicas or erasure +code shards are separated across hosts and a single host failure will not +affect availability. For larger clusters, however, administrators should carefully consider their choice of failure domain. Separating replicas across racks, +for example, is common for mid- to large-sized clusters. + + +CRUSH Location +============== + +The location of an OSD in terms of the CRUSH map's hierarchy is +referred to as a ``crush location``. This location specifier takes the +form of a list of key and value pairs describing a position. For +example, if an OSD is in a particular row, rack, chassis and host, and +is part of the 'default' CRUSH tree (this is the case for the vast +majority of clusters), its crush location could be described as:: + + root=default row=a rack=a2 chassis=a2a host=a2a1 + +Note: + +#. Note that the order of the keys does not matter. +#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default + these include root, datacenter, room, row, pod, pdu, rack, chassis and host, + but those types can be customized to be anything appropriate by modifying + the CRUSH map. +#. Not all keys need to be specified. 
For example, by default, Ceph + automatically sets a ``ceph-osd`` daemon's location to be + ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). + +The crush location for an OSD is normally expressed via the ``crush location`` +config option being set in the ``ceph.conf`` file. Each time the OSD starts, +it verifies it is in the correct location in the CRUSH map and, if it is not, +it moved itself. To disable this automatic CRUSH map management, add the +following to your configuration file in the ``[osd]`` section:: + + osd crush update on start = false + + +Custom location hooks +--------------------- + +A customized location hook can be used to generate a more complete +crush location on startup. The sample ``ceph-crush-location`` utility +will generate a CRUSH location string for a given daemon. The +location is based on, in order of preference: + +#. A ``crush location`` option in ceph.conf. +#. A default of ``root=default host=HOSTNAME`` where the hostname is + generated with the ``hostname -s`` command. + +This is not useful by itself, as the OSD itself has the exact same +behavior. However, the script can be modified to provide additional +location fields (for example, the rack or datacenter), and then the +hook enabled via the config option:: + + crush location hook = /path/to/customized-ceph-crush-location + +This hook is passed several arguments (below) and should output a single line +to stdout with the CRUSH location description.:: + + $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE + +where the cluster name is typically 'ceph', the id is the daemon +identifier (the OSD number), and the daemon type is typically ``osd``. + + +CRUSH structure +=============== + +The CRUSH map consists of, loosely speaking, a hierarchy describing +the physical topology of the cluster, and a set of rules defining +policy about how we place data on those devices. The hierarchy has +devices (``ceph-osd`` daemons) at the leaves, and internal nodes +corresponding to other physical features or groupings: hosts, racks, +rows, datacenters, and so on. The rules describe how replicas are +placed in terms of that hierarchy (e.g., 'three replicas in different +racks'). + +Devices +------- + +Devices are individual ``ceph-osd`` daemons that can store data. You +will normally have one defined here for each OSD daemon in your +cluster. Devices are identified by an id (a non-negative integer) and +a name, normally ``osd.N`` where ``N`` is the device id. + +Devices may also have a *device class* associated with them (e.g., +``hdd`` or ``ssd``), allowing them to be conveniently targetted by a +crush rule. + +Types and Buckets +----------------- + +A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, +racks, rows, etc. The CRUSH map defines a series of *types* that are +used to describe these nodes. By default, these types include: + +- osd (or device) +- host +- chassis +- rack +- row +- pdu +- pod +- room +- datacenter +- region +- root + +Most clusters make use of only a handful of these types, and others +can be defined as needed. + +The hierarchy is built with devices (normally type ``osd``) at the +leaves, interior nodes with non-device types, and a root node of type +``root``. For example, + +.. 
ditaa:: + + +-----------------+ + | {o}root default | + +--------+--------+ + | + +---------------+---------------+ + | | + +-------+-------+ +-----+-------+ + | {o}host foo | | {o}host bar | + +-------+-------+ +-----+-------+ + | | + +-------+-------+ +-------+-------+ + | | | | + +-----+-----+ +-----+-----+ +-----+-----+ +-----+-----+ + | osd.0 | | osd.1 | | osd.2 | | osd.3 | + +-----------+ +-----------+ +-----------+ +-----------+ + +Each node (device or bucket) in the hierarchy has a *weight* +associated with it, indicating the relative proportion of the total +data that device or hierarchy subtree should store. Weights are set +at the leaves, indicating the size of the device, and automatically +sum up the tree from there, such that the weight of the default node +will be the total of all devices contained beneath it. Normally +weights are in units of terabytes (TB). + +You can get a simple view the CRUSH hierarchy for your cluster, +including the weights, with:: + + ceph osd crush tree + +Rules +----- + +Rules define policy about how data is distributed across the devices +in the hierarchy. + +CRUSH rules define placement and replication strategies or +distribution policies that allow you to specify exactly how CRUSH +places object replicas. For example, you might create a rule selecting +a pair of targets for 2-way mirroring, another rule for selecting +three targets in two different data centers for 3-way mirroring, and +yet another rule for erasure coding over six storage devices. For a +detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, +Scalable, Decentralized Placement of Replicated Data`_, and more +specifically to **Section 3.2**. + +In almost all cases, CRUSH rules can be created via the CLI by +specifying the *pool type* they will be used for (replicated or +erasure coded), the *failure domain*, and optionally a *device class*. +In rare cases rules must be written by hand by manually editing the +CRUSH map. + +You can see what rules are defined for your cluster with:: + + ceph osd crush rule ls + +You can view the contents of the rules with:: + + ceph osd crush rule dump + +Device classes +-------------- + +Each device can optionally have a *class* associated with it. By +default, OSDs automatically set their class on startup to either +`hdd`, `ssd`, or `nvme` based on the type of device they are backed +by. + +The device class for one or more OSDs can be explicitly set with:: + + ceph osd crush set-device-class <class> <osd-name> [...] + +Once a device class is set, it cannot be changed to another class +until the old class is unset with:: + + ceph osd crush rm-device-class <osd-name> [...] + +This allows administrators to set device classes without the class +being changed on OSD restart or by some other script. + +A placement rule that targets a specific device class can be created with:: + + ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class> + +A pool can then be changed to use the new rule with:: + + ceph osd pool set <pool-name> crush_rule <rule-name> + +Device classes are implemented by creating a "shadow" CRUSH hierarchy +for each device class in use that contains only devices of that class. +Rules can then distribute data over the shadow hierarchy. One nice +thing about this approach is that it is fully backward compatible with +old Ceph clients. 
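+
+As a rough sketch of how the commands above fit together (the ``fast`` rule
+name and ``mypool`` pool name are only illustrative, and the OSDs are assumed
+not to have a class set already), an SSD-only rule could be created and
+applied as follows::
+
+  ceph osd crush set-device-class ssd osd.0 osd.1 osd.2
+  ceph osd crush rule create-replicated fast default host ssd
+  ceph osd pool set mypool crush_rule fast
+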
You can view the CRUSH hierarchy with shadow items
+with::
+
+  ceph osd crush tree --show-shadow
+
+
+Weight sets
+-----------
+
+A *weight set* is an alternative set of weights to use when
+calculating data placement.  The normal weights associated with each
+device in the CRUSH map are set based on the device size and indicate
+how much data we *should* be storing where.  However, because CRUSH is
+based on a pseudorandom placement process, there is always some
+variation from this ideal distribution, the same way that rolling a
+die sixty times will not result in rolling exactly 10 ones and 10
+sixes.  Weight sets allow the cluster to do a numerical optimization
+based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
+a balanced distribution.
+
+There are two types of weight sets supported:
+
+ #. A **compat** weight set is a single alternative set of weights for
+    each device and node in the cluster.  This is not well-suited for
+    correcting all anomalies (for example, placement groups for
+    different pools may be different sizes and have different load
+    levels, but will be mostly treated the same by the balancer).
+    However, compat weight sets have the huge advantage that they are
+    *backward compatible* with previous versions of Ceph, which means
+    that even though weight sets were first introduced in Luminous
+    v12.2.z, older clients (e.g., firefly) can still connect to the
+    cluster when a compat weight set is being used to balance data.
+ #. A **per-pool** weight set is more flexible in that it allows
+    placement to be optimized for each data pool.  Additionally,
+    weights can be adjusted for each position of placement, allowing
+    the optimizer to correct for a subtle skew of data toward devices
+    with small weights relative to their peers (an effect that is
+    usually only apparent in very large clusters but which can cause
+    balancing problems).
+
+When weight sets are in use, the weights associated with each node in
+the hierarchy are visible as a separate column (labeled either
+``(compat)`` or the pool name) in the output of the command::
+
+  ceph osd crush tree
+
+When both *compat* and *per-pool* weight sets are in use, data
+placement for a particular pool will use its own per-pool weight set
+if present.  If not, it will use the compat weight set if present.  If
+neither is present, it will use the normal CRUSH weights.
+
+Although weight sets can be set up and manipulated by hand, it is
+recommended that the *balancer* module be enabled to do so
+automatically.
+
+
+Modifying the CRUSH map
+=======================
+
+.. _addosd:
+
+Add/Move an OSD
+---------------
+
+.. note:: OSDs are normally automatically added to the CRUSH map when
+   the OSD is created.  This command is rarely needed.
+
+To add or move an OSD in the CRUSH map of a running cluster::
+
+  ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
+
+Where:
+
+``name``
+
+:Description: The full name of the OSD.
+:Type: String
+:Required: Yes
+:Example: ``osd.0``
+
+
+``weight``
+
+:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
+:Type: Double
+:Required: Yes
+:Example: ``2.0``
+
+
+``root``
+
+:Description: The root node of the tree in which the OSD resides (normally ``default``).
+:Type: Key/value pair.
+:Required: Yes
+:Example: ``root=default``
+
+
+``bucket-type``
+
+:Description: You may specify the OSD's location in the CRUSH hierarchy.
+:Type: Key/value pairs.
+:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + + +The following example adds ``osd.0`` to the hierarchy, or moves the +OSD from a previous location. :: + + ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1 + + +Adjust OSD weight +----------------- + +.. note: Normally OSDs automatically add themselves to the CRUSH map + with the correct weight when they are created. This command + is rarely needed. + +To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute +the following:: + + ceph osd crush reweight {name} {weight} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +``weight`` + +:Description: The CRUSH weight for the OSD. +:Type: Double +:Required: Yes +:Example: ``2.0`` + + +.. _removeosd: + +Remove an OSD +------------- + +.. note: OSDs are normally removed from the CRUSH as part of the + ``ceph osd purge`` command. This command is rarely needed. + +To remove an OSD from the CRUSH map of a running cluster, execute the +following:: + + ceph osd crush remove {name} + +Where: + +``name`` + +:Description: The full name of the OSD. +:Type: String +:Required: Yes +:Example: ``osd.0`` + + +Add a Bucket +------------ + +.. note: Buckets are normally implicitly created when an OSD is added + that specifies a ``{bucket-type}={bucket-name}`` as part of its + location and a bucket with that name does not already exist. This + command is typically used when manually adjusting the structure of the + hierarchy after OSDs have been created (for example, to move a + series of hosts underneath a new rack-level bucket). + +To add a bucket in the CRUSH map of a running cluster, execute the +``ceph osd crush add-bucket`` command:: + + ceph osd crush add-bucket {bucket-name} {bucket-type} + +Where: + +``bucket-name`` + +:Description: The full name of the bucket. +:Type: String +:Required: Yes +:Example: ``rack12`` + + +``bucket-type`` + +:Description: The type of the bucket. The type must already exist in the hierarchy. +:Type: String +:Required: Yes +:Example: ``rack`` + + +The following example adds the ``rack12`` bucket to the hierarchy:: + + ceph osd crush add-bucket rack12 rack + +Move a Bucket +------------- + +To move a bucket to a different location or position in the CRUSH map +hierarchy, execute the following:: + + ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...] + +Where: + +``bucket-name`` + +:Description: The name of the bucket to move/reposition. +:Type: String +:Required: Yes +:Example: ``foo-bar-1`` + +``bucket-type`` + +:Description: You may specify the bucket's location in the CRUSH hierarchy. +:Type: Key/value pairs. +:Required: No +:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1`` + +Remove a Bucket +--------------- + +To remove a bucket from the CRUSH map hierarchy, execute the following:: + + ceph osd crush remove {bucket-name} + +.. note:: A bucket must be empty before removing it from the CRUSH hierarchy. + +Where: + +``bucket-name`` + +:Description: The name of the bucket that you'd like to remove. +:Type: String +:Required: Yes +:Example: ``rack12`` + +The following example removes the ``rack12`` bucket from the hierarchy:: + + ceph osd crush remove rack12 + +Creating a compat weight set +---------------------------- + +.. note: This step is normally done automatically by the ``balancer`` + module when enabled. 
+ +To create a *compat* weight set:: + + ceph osd crush weight-set create-compat + +Weights for the compat weight set can be adjusted with:: + + ceph osd crush weight-set reweight-compat {name} {weight} + +The compat weight set can be destroyed with:: + + ceph osd crush weight-set rm-compat + +Creating per-pool weight sets +----------------------------- + +To create a weight set for a specific pool,:: + + ceph osd crush weight-set create {pool-name} {mode} + +.. note:: Per-pool weight sets require that all servers and daemons + run Luminous v12.2.z or later. + +Where: + +``pool-name`` + +:Description: The name of a RADOS pool +:Type: String +:Required: Yes +:Example: ``rbd`` + +``mode`` + +:Description: Either ``flat`` or ``positional``. A *flat* weight set + has a single weight for each device or bucket. A + *positional* weight set has a potentially different + weight for each position in the resulting placement + mapping. For example, if a pool has a replica count of + 3, then a positional weight set will have three weights + for each device and bucket. +:Type: String +:Required: Yes +:Example: ``flat`` + +To adjust the weight of an item in a weight set:: + + ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]} + +To list existing weight sets,:: + + ceph osd crush weight-set ls + +To remove a weight set,:: + + ceph osd crush weight-set rm {pool-name} + +Creating a rule for a replicated pool +------------------------------------- + +For a replicated pool, the primary decision when creating the CRUSH +rule is what the failure domain is going to be. For example, if a +failure domain of ``host`` is selected, then CRUSH will ensure that +each replica of the data is stored on a different host. If ``rack`` +is selected, then each replica will be stored in a different rack. +What failure domain you choose primarily depends on the size of your +cluster and how your hierarchy is structured. + +Normally, the entire cluster hierarchy is nested beneath a root node +named ``default``. If you have customized your hierarchy, you may +want to create a rule nested at some other node in the hierarchy. It +doesn't matter what type is associated with that node (it doesn't have +to be a ``root`` node). + +It is also possible to create a rule that restricts data placement to +a specific *class* of device. By default, Ceph OSDs automatically +classify themselves as either ``hdd`` or ``ssd``, depending on the +underlying type of device being used. These classes can also be +customized. + +To create a replicated rule,:: + + ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}] + +Where: + +``name`` + +:Description: The name of the rule +:Type: String +:Required: Yes +:Example: ``rbd-rule`` + +``root`` + +:Description: The name of the node under which data should be placed. +:Type: String +:Required: Yes +:Example: ``default`` + +``failure-domain-type`` + +:Description: The type of CRUSH nodes across which we should separate replicas. +:Type: String +:Required: Yes +:Example: ``rack`` + +``class`` + +:Description: The device class data should be placed on. +:Type: String +:Required: No +:Example: ``ssd`` + +Creating a rule for an erasure coded pool +----------------------------------------- + +For an erasure-coded pool, the same basic decisions need to be made as +with a replicated pool: what is the failure domain, what node in the +hierarchy will data be placed under (usually ``default``), and will +placement be restricted to a specific device class. 
Erasure code +pools are created a bit differently, however, because they need to be +constructed carefully based on the erasure code being used. For this reason, +you must include this information in the *erasure code profile*. A CRUSH +rule will then be created from that either explicitly or automatically when +the profile is used to create a pool. + +The erasure code profiles can be listed with:: + + ceph osd erasure-code-profile ls + +An existing profile can be viewed with:: + + ceph osd erasure-code-profile get {profile-name} + +Normally profiles should never be modified; instead, a new profile +should be created and used when creating a new pool or creating a new +rule for an existing pool. + +An erasure code profile consists of a set of key=value pairs. Most of +these control the behavior of the erasure code that is encoding data +in the pool. Those that begin with ``crush-``, however, affect the +CRUSH rule that is created. + +The erasure code profile properties of interest are: + + * **crush-root**: the name of the CRUSH node to place data under [default: ``default``]. + * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``]. + * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used]. + * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. + +Once a profile is defined, you can create a CRUSH rule with:: + + ceph osd crush rule create-erasure {name} {profile-name} + +.. note: When creating a new pool, it is not actually necessary to + explicitly create the rule. If the erasure code profile alone is + specified and the rule argument is left off then Ceph will create + the CRUSH rule automatically. + +Deleting rules +-------------- + +Rules that are not in use by pools can be deleted with:: + + ceph osd crush rule rm {rule-name} + + +Tunables +======== + +Over time, we have made (and continue to make) improvements to the +CRUSH algorithm used to calculate the placement of data. In order to +support the change in behavior, we have introduced a series of tunable +options that control whether the legacy or improved variation of the +algorithm is used. + +In order to use newer tunables, both clients and servers must support +the new version of CRUSH. For this reason, we have created +``profiles`` that are named after the Ceph version in which they were +introduced. For example, the ``firefly`` tunables are first supported +in the firefly release, and will not work with older (e.g., dumpling) +clients. Once a given set of tunables are changed from the legacy +default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older +clients who do not support the new CRUSH features from connecting to +the cluster. + +argonaut (legacy) +----------------- + +The legacy CRUSH behavior used by argonaut and older releases works +fine for most clusters, provided there are not too many OSDs that have +been marked out. + +bobtail (CRUSH_TUNABLES2) +------------------------- + +The bobtail tunable profile fixes a few key misbehaviors: + + * For hierarchies with a small number of devices in the leaf buckets, + some PGs map to fewer than the desired number of replicas. This + commonly happens for hierarchies with "host" nodes with a small + number (1-3) of OSDs nested beneath each one. + + * For large clusters, some small percentages of PGs map to less than + the desired number of OSDs. 
This is more prevalent when there are + several layers of the hierarchy (e.g., row, rack, host, osd). + + * When some OSDs are marked out, the data tends to get redistributed + to nearby OSDs instead of across the entire hierarchy. + +The new tunables are: + + * ``choose_local_tries``: Number of local retries. Legacy value is + 2, optimal value is 0. + + * ``choose_local_fallback_tries``: Legacy value is 5, optimal value + is 0. + + * ``choose_total_tries``: Total number of attempts to choose an item. + Legacy value was 19, subsequent testing indicates that a value of + 50 is more appropriate for typical clusters. For extremely large + clusters, a larger value might be necessary. + + * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt + will retry, or only try once and allow the original placement to + retry. Legacy default is 0, optimal value is 1. + +Migration impact: + + * Moving from argonaut to bobtail tunables triggers a moderate amount + of data movement. Use caution on a cluster that is already + populated with data. + +firefly (CRUSH_TUNABLES3) +------------------------- + +The firefly tunable profile fixes a problem +with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG +mappings with too few results when too many OSDs have been marked out. + +The new tunable is: + + * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will + start with a non-zero value of r, based on how many attempts the + parent has already made. Legacy default is 0, but with this value + CRUSH is sometimes unable to find a mapping. The optimal value (in + terms of computational cost and correctness) is 1. + +Migration impact: + + * For existing clusters that have lots of existing data, changing + from 0 to 1 will cause a lot of data to move; a value of 4 or 5 + will allow CRUSH to find a valid mapping but will make less data + move. + +straw_calc_version tunable (introduced with Firefly too) +-------------------------------------------------------- + +There were some problems with the internal weights calculated and +stored in the CRUSH map for ``straw`` buckets. Specifically, when +there were items with a CRUSH weight of 0 or both a mix of weights and +some duplicated weights CRUSH would distribute data incorrectly (i.e., +not in proportion to the weights). + +The new tunable is: + + * ``straw_calc_version``: A value of 0 preserves the old, broken + internal weight calculation; a value of 1 fixes the behavior. + +Migration impact: + + * Moving to straw_calc_version 1 and then adjusting a straw bucket + (by adding, removing, or reweighting an item, or by using the + reweight-all command) can trigger a small to moderate amount of + data movement *if* the cluster has hit one of the problematic + conditions. + +This tunable option is special because it has absolutely no impact +concerning the required kernel version in the client side. + +hammer (CRUSH_V4) +----------------- + +The hammer tunable profile does not affect the +mapping of existing CRUSH maps simply by changing the profile. However: + + * There is a new bucket type (``straw2``) supported. The new + ``straw2`` bucket type fixes several limitations in the original + ``straw`` bucket. Specifically, the old ``straw`` buckets would + change some mappings that should have changed when a weight was + adjusted, while ``straw2`` achieves the original goal of only + changing mappings to or from the bucket item whose weight has + changed. + + * ``straw2`` is the default for any newly created buckets. 
+ +Migration impact: + + * Changing a bucket type from ``straw`` to ``straw2`` will result in + a reasonably small amount of data movement, depending on how much + the bucket item weights vary from each other. When the weights are + all the same no data will move, and when item weights vary + significantly there will be more movement. + +jewel (CRUSH_TUNABLES5) +----------------------- + +The jewel tunable profile improves the +overall behavior of CRUSH such that significantly fewer mappings +change when an OSD is marked out of the cluster. + +The new tunable is: + + * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will + use a better value for an inner loop that greatly reduces the number + of mapping changes when an OSD is marked out. The legacy value is 0, + while the new value of 1 uses the new approach. + +Migration impact: + + * Changing this value on an existing cluster will result in a very + large amount of data movement as almost every PG mapping is likely + to change. + + + + +Which client versions support CRUSH_TUNABLES +-------------------------------------------- + + * argonaut series, v0.48.1 or later + * v0.49 or later + * Linux kernel version v3.6 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES2 +--------------------------------------------- + + * v0.55 or later, including bobtail series (v0.56.x) + * Linux kernel version v3.9 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES3 +--------------------------------------------- + + * v0.78 (firefly) or later + * Linux kernel version v3.15 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_V4 +-------------------------------------- + + * v0.94 (hammer) or later + * Linux kernel version v4.1 or later (for the file system and RBD kernel clients) + +Which client versions support CRUSH_TUNABLES5 +--------------------------------------------- + + * v10.0.2 (jewel) or later + * Linux kernel version v4.5 or later (for the file system and RBD kernel clients) + +Warning when tunables are non-optimal +------------------------------------- + +Starting with version v0.74, Ceph will issue a health warning if the +current CRUSH tunables don't include all the optimal values from the +``default`` profile (see below for the meaning of the ``default`` profile). +To make this warning go away, you have two options: + +1. Adjust the tunables on the existing cluster. Note that this will + result in some data movement (possibly as much as 10%). This is the + preferred route, but should be taken with care on a production cluster + where the data movement may affect performance. You can enable optimal + tunables with:: + + ceph osd crush tunables optimal + + If things go poorly (e.g., too much load) and not very much + progress has been made, or there is a client compatibility problem + (old kernel cephfs or rbd clients, or pre-bobtail librados + clients), you can switch back with:: + + ceph osd crush tunables legacy + +2. 
You can make the warning go away without making any changes to CRUSH by
+   adding the following option to your ceph.conf ``[mon]`` section::
+
+     mon warn on legacy crush tunables = false
+
+   For the change to take effect, you will need to restart the monitors, or
+   apply the option to running monitors with::
+
+     ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
+
+
+A few important points
+----------------------
+
+ * Adjusting these values will result in the shift of some PGs between
+   storage nodes.  If the Ceph cluster is already storing a lot of
+   data, be prepared for some fraction of the data to move.
+ * The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
+   feature bits of new connections as soon as they get
+   the updated map.  However, already-connected clients are
+   effectively grandfathered in, and will misbehave if they do not
+   support the new feature.
+ * If the CRUSH tunables are set to non-legacy values and then later
+   changed back to the default values, ``ceph-osd`` daemons will not be
+   required to support the feature.  However, the OSD peering process
+   requires examining and understanding old maps.  Therefore, you
+   should not run old versions of the ``ceph-osd`` daemon
+   if the cluster has previously used non-legacy CRUSH values, even if
+   the latest version of the map has been switched back to using the
+   legacy defaults.
+
+Tuning CRUSH
+------------
+
+The simplest way to adjust the CRUSH tunables is by changing to a known
+profile.  Those are:
+
+ * ``legacy``: the legacy behavior from argonaut and earlier.
+ * ``argonaut``: the legacy values supported by the original argonaut release
+ * ``bobtail``: the values supported by the bobtail release
+ * ``firefly``: the values supported by the firefly release
+ * ``hammer``: the values supported by the hammer release
+ * ``jewel``: the values supported by the jewel release
+ * ``optimal``: the best (i.e., optimal) values of the current version of Ceph
+ * ``default``: the default values of a new cluster installed from
+   scratch.  These values, which depend on the current version of Ceph,
+   are hard coded and are generally a mix of optimal and legacy values.
+   They generally match the ``optimal`` profile of the previous
+   LTS release, or of the most recent release for which most users are
+   expected to have up-to-date clients.
+
+You can select a profile on a running cluster with the command::
+
+  ceph osd crush tunables {PROFILE}
+
+Note that this may result in some data movement.
+
+
+.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
+
+
+Primary Affinity
+================
+
+When a Ceph Client reads or writes data, it always contacts the primary OSD in
+the acting set.  For set ``[2, 3, 4]``, ``osd.2`` is the primary.  Sometimes an
+OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
+a slow disk or a slow controller).  To prevent performance bottlenecks
+(especially on read operations) while maximizing utilization of your hardware,
+you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
+the OSD as a primary in an acting set. ::
+
+  ceph osd primary-affinity <osd-id> <weight>
+
+Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
+may set an OSD's primary affinity in the range ``0-1``, where ``0`` means that
+the OSD may **NOT** be used as a primary and ``1`` means that an OSD may be used as a
+primary.
When the weight is ``< 1``, it is less likely that CRUSH will select +the Ceph OSD Daemon to act as a primary. + + + diff --git a/src/ceph/doc/rados/operations/data-placement.rst b/src/ceph/doc/rados/operations/data-placement.rst new file mode 100644 index 0000000..27966b0 --- /dev/null +++ b/src/ceph/doc/rados/operations/data-placement.rst @@ -0,0 +1,37 @@ +========================= + Data Placement Overview +========================= + +Ceph stores, replicates and rebalances data objects across a RADOS cluster +dynamically. With many different users storing objects in different pools for +different purposes on countless OSDs, Ceph operations require some data +placement planning. The main data placement planning concepts in Ceph include: + +- **Pools:** Ceph stores data within pools, which are logical groups for storing + objects. Pools manage the number of placement groups, the number of replicas, + and the ruleset for the pool. To store data in a pool, you must have + an authenticated user with permissions for the pool. Ceph can snapshot pools. + See `Pools`_ for additional details. + +- **Placement Groups:** Ceph maps objects to placement groups (PGs). + Placement groups (PGs) are shards or fragments of a logical object pool + that place objects as a group into OSDs. Placement groups reduce the amount + of per-object metadata when Ceph stores the data in OSDs. A larger number of + placement groups (e.g., 100 per OSD) leads to better balancing. See + `Placement Groups`_ for additional details. + +- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without + performance bottlenecks, without limitations to scalability, and without a + single point of failure. CRUSH maps provide the physical topology of the + cluster to the CRUSH algorithm to determine where the data for an object + and its replicas should be stored, and how to do so across failure domains + for added data safety among other things. See `CRUSH Maps`_ for additional + details. + +When you initially set up a test cluster, you can use the default values. Once +you begin planning for a large Ceph cluster, refer to pools, placement groups +and CRUSH for data placement operations. + +.. _Pools: ../pools +.. _Placement Groups: ../placement-groups +.. _CRUSH Maps: ../crush-map diff --git a/src/ceph/doc/rados/operations/erasure-code-isa.rst b/src/ceph/doc/rados/operations/erasure-code-isa.rst new file mode 100644 index 0000000..b52933a --- /dev/null +++ b/src/ceph/doc/rados/operations/erasure-code-isa.rst @@ -0,0 +1,105 @@ +======================= +ISA erasure code plugin +======================= + +The *isa* plugin encapsulates the `ISA +<https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/>`_ +library. It only runs on Intel processors. + +Create an isa profile +===================== + +To create a new *isa* erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=isa \ + technique={reed_sol_van|cauchy} \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 7 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. 
The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``technique={reed_sol_van|cauchy}`` + +:Description: The ISA plugin comes in two `Reed Solomon + <https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction>`_ + forms. If *reed_sol_van* is set, it is `Vandermonde + <https://en.wikipedia.org/wiki/Vandermonde_matrix>`_, if + *cauchy* is set, it is `Cauchy + <https://en.wikipedia.org/wiki/Cauchy_matrix>`_. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the ruleset. For intance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a ruleset step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/src/ceph/doc/rados/operations/erasure-code-jerasure.rst b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst new file mode 100644 index 0000000..e8da097 --- /dev/null +++ b/src/ceph/doc/rados/operations/erasure-code-jerasure.rst @@ -0,0 +1,120 @@ +============================ +Jerasure erasure code plugin +============================ + +The *jerasure* plugin is the most generic and flexible plugin, it is +also the default for Ceph erasure coded pools. + +The *jerasure* plugin encapsulates the `Jerasure +<http://jerasure.org>`_ library. It is +recommended to read the *jerasure* documentation to get a better +understanding of the parameters. + +Create a jerasure profile +========================= + +To create a new *jerasure* erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=jerasure \ + k={data-chunks} \ + m={coding-chunks} \ + technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion} \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}`` + +:Description: The more flexible technique is *reed_sol_van* : it is + enough to set *k* and *m*. The *cauchy_good* technique + can be faster but you need to chose the *packetsize* + carefully. 
All of *reed_sol_r6_op*, *liberation*, + *blaum_roth*, *liber8tion* are *RAID6* equivalents in + the sense that they can only be configured with *m=2*. + +:Type: String +:Required: No. +:Default: reed_sol_van + +``packetsize={bytes}`` + +:Description: The encoding will be done on packets of *bytes* size at + a time. Chosing the right packet size is difficult. The + *jerasure* documentation contains extensive information + on this topic. + +:Type: Integer +:Required: No. +:Default: 2048 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the ruleset. For intance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a ruleset step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + + ``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + diff --git a/src/ceph/doc/rados/operations/erasure-code-lrc.rst b/src/ceph/doc/rados/operations/erasure-code-lrc.rst new file mode 100644 index 0000000..447ce23 --- /dev/null +++ b/src/ceph/doc/rados/operations/erasure-code-lrc.rst @@ -0,0 +1,371 @@ +====================================== +Locally repairable erasure code plugin +====================================== + +With the *jerasure* plugin, when an erasure coded object is stored on +multiple OSDs, recovering from the loss of one OSD requires reading +from all the others. For instance if *jerasure* is configured with +*k=8* and *m=4*, losing one OSD requires reading from the eleven +others to repair. + +The *lrc* erasure code plugin creates local parity chunks to be able +to recover using less OSDs. For instance if *lrc* is configured with +*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for +every four OSDs. When a single OSD is lost, it can be recovered with +only four OSDs instead of eleven. 
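+
+As a hedged illustration of that layout (the profile name is arbitrary, and
+the command form is the one documented in the next section), such a
+configuration could be requested with::
+
+  ceph osd erasure-code-profile set LRCprofile \
+      plugin=lrc \
+      k=8 m=4 l=4 \
+      crush-failure-domain=host
+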
+ +Erasure code profile examples +============================= + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-failure-domain=host + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary +OSD is in the same rack as the lost chunk.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + k=4 m=2 l=3 \ + crush-locality=rack \ + crush-failure-domain=host + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + +Create an lrc profile +===================== + +To create a new lrc erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=lrc \ + k={data-chunks} \ + m={coding-chunks} \ + l={locality} \ + [crush-root={root}] \ + [crush-locality={bucket-type}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: Yes. +:Example: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding chunks** for each object and store them + on different OSDs. The number of coding chunks is also + the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: Yes. +:Example: 2 + +``l={locality}`` + +:Description: Group the coding and data chunks into sets of size + **locality**. For instance, for **k=4** and **m=2**, + when **locality=3** two groups of three are created. + Each set can be recovered without reading chunks + from another set. + +:Type: Integer +:Required: Yes. +:Example: 3 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the ruleset. For intance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-locality={bucket-type}`` + +:Description: The type of the crush bucket in which each set of chunks + defined by **l** will be stored. For instance, if it is + set to **rack**, each group of **l** chunks will be + placed in a different rack. It is used to create a + ruleset step such as **step choose rack**. If it is not + set, no such grouping is done. + +:Type: String +:Required: No. + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a ruleset step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. 
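+
+After setting a profile it can be useful to read it back and confirm the
+stored key/value pairs before creating a pool with it.  A short sketch,
+reusing the ``LRCprofile`` name from the examples above::
+
+    $ ceph osd erasure-code-profile get LRCprofile
+    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
+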
+ +Low level plugin configuration +============================== + +The sum of **k** and **m** must be a multiple of the **l** parameter. +The low level configuration parameters do not impose such a +restriction and it may be more convienient to use it for specific +purposes. It is for instance possible to define two groups, one with 4 +chunks and another with 3 chunks. It is also possible to recursively +define locality sets, for instance datacenters and racks into +datacenters. The **k/m/l** are implemented by generating a low level +configuration. + +The *lrc* erasure code plugin recursively applies erasure code +techniques so that recovering from the loss of some chunks only +requires a subset of the available chunks, most of the time. + +For instance, when three coding steps are described as:: + + chunk nr 01234567 + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +where *c* are coding chunks calculated from the data chunks *D*, the +loss of chunk *7* can be recovered with the last four chunks. And the +loss of chunk *2* chunk can be recovered with the first four +chunks. + +Erasure code profile examples using low level configuration +=========================================================== + +Minimal testing +--------------- + +It is strictly equivalent to using the default erasure code profile. The *DD* +implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used +by default.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "" ] ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + +Reduce recovery bandwidth between hosts +--------------------------------------- + +Although it is probably not an interesting use case when all hosts are +connected to the same switch, reduced bandwidth usage can actually be +observed. It is equivalent to **k=4**, **m=2** and **l=3** although +the layout of the chunks is different:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + +Reduce recovery bandwidth between racks +--------------------------------------- + +In Firefly the reduced bandwidth will only be observed if the primary +OSD is in the same rack as the lost chunk.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "" ], + [ "cDDD____", "" ], + [ "____cDDD", "" ], + ]' \ + crush-steps='[ + [ "choose", "rack", 2 ], + [ "chooseleaf", "host", 4 ], + ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + +Testing with different Erasure Code backends +-------------------------------------------- + +LRC now uses jerasure as the default EC backend. It is possible to +specify the EC backend/algorithm on a per layer basis using the low +level configuration. The second argument in layers='[ [ "DDc", "" ] ]' +is actually an erasure code profile to be used for this level. 
The +example below specifies the ISA backend with the cauchy technique to +be used in the lrcpool.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=DD_ \ + layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + +You could also use a different erasure code profile for for each +layer.:: + + $ ceph osd erasure-code-profile set LRCprofile \ + plugin=lrc \ + mapping=__DD__DD \ + layers='[ + [ "_cDD_cDD", "plugin=isa technique=cauchy" ], + [ "cDDD____", "plugin=isa" ], + [ "____cDDD", "plugin=jerasure" ], + ]' + $ ceph osd pool create lrcpool 12 12 erasure LRCprofile + + + +Erasure coding and decoding algorithm +===================================== + +The steps found in the layers description:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +are applied in order. For instance, if a 4K object is encoded, it will +first go thru *step 1* and be divided in four 1K chunks (the four +uppercase D). They are stored in the chunks 2, 3, 6 and 7, in +order. From these, two coding chunks are calculated (the two lowercase +c). The coding chunks are stored in the chunks 1 and 5, respectively. + +The *step 2* re-uses the content created by *step 1* in a similar +fashion and stores a single coding chunk *c* at position 0. The last four +chunks, marked with an underscore (*_*) for readability, are ignored. + +The *step 3* stores a single coding chunk *c* at position 4. The three +chunks created by *step 1* are used to compute this coding chunk, +i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*. + +If chunk *2* is lost:: + + chunk nr 01234567 + + step 1 _c D_cDD + step 2 cD D____ + step 3 __ _cDDD + +decoding will attempt to recover it by walking the steps in reverse +order: *step 3* then *step 2* and finally *step 1*. + +The *step 3* knows nothing about chunk *2* (i.e. it is an underscore) +and is skipped. + +The coding chunk from *step 2*, stored in chunk *0*, allows it to +recover the content of chunk *2*. There are no more chunks to recover +and the process stops, without considering *step 1*. + +Recovering chunk *2* requires reading chunks *0, 1, 3* and writing +back chunk *2*. + +If chunk *2, 3, 6* are lost:: + + chunk nr 01234567 + + step 1 _c _c D + step 2 cD __ _ + step 3 __ cD D + +The *step 3* can recover the content of chunk *6*:: + + chunk nr 01234567 + + step 1 _c _cDD + step 2 cD ____ + step 3 __ cDDD + +The *step 2* fails to recover and is skipped because there are two +chunks missing (*2, 3*) and it can only recover from one missing +chunk. + +The coding chunk from *step 1*, stored in chunk *1, 5*, allows it to +recover the content of chunk *2, 3*:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +Controlling crush placement +=========================== + +The default crush ruleset provides OSDs that are on different hosts. For instance:: + + chunk nr 01234567 + + step 1 _cDD_cDD + step 2 cDDD____ + step 3 ____cDDD + +needs exactly *8* OSDs, one for each chunk. If the hosts are in two +adjacent racks, the first four chunks can be placed in the first rack +and the last four in the second rack. So that recovering from the loss +of a single OSD does not require using bandwidth between the two +racks. 
+ +For instance:: + + crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]' + +will create a ruleset that will select two crush buckets of type +*rack* and for each of them choose four OSDs, each of them located in +different buckets of type *host*. + +The ruleset can also be manually crafted for finer control. diff --git a/src/ceph/doc/rados/operations/erasure-code-profile.rst b/src/ceph/doc/rados/operations/erasure-code-profile.rst new file mode 100644 index 0000000..ddf772d --- /dev/null +++ b/src/ceph/doc/rados/operations/erasure-code-profile.rst @@ -0,0 +1,121 @@ +===================== +Erasure code profiles +===================== + +Erasure code is defined by a **profile** and is used when creating an +erasure coded pool and the associated crush ruleset. + +The **default** erasure code profile (which is created when the Ceph +cluster is initialized) provides the same level of redundancy as two +copies but requires 25% less disk space. It is described as a profile +with **k=2** and **m=1**, meaning the information is spread over three +OSD (k+m == 3) and one of them can be lost. + +To improve redundancy without increasing raw storage requirements, a +new profile can be created. For instance, a profile with **k=10** and +**m=4** can sustain the loss of four (**m=4**) OSDs by distributing an +object on fourteen (k+m=14) OSDs. The object is first divided in +**10** chunks (if the object is 10MB, each chunk is 1MB) and **4** +coding chunks are computed, for recovery (each coding chunk has the +same size as the data chunk, i.e. 1MB). The raw space overhead is only +40% and the object will not be lost even if four OSDs break at the +same time. + +.. _list of available plugins: + +.. toctree:: + :maxdepth: 1 + + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec + +osd erasure-code-profile set +============================ + +To create a new erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + [{directory=directory}] \ + [{plugin=plugin}] \ + [{stripe_unit=stripe_unit}] \ + [{key=value} ...] \ + [--force] + +Where: + +``{directory=directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``{plugin=plugin}`` + +:Description: Use the erasure code **plugin** to compute coding chunks + and recover missing chunks. See the `list of available + plugins`_ for more information. + +:Type: String +:Required: No. +:Default: jerasure + +``{stripe_unit=stripe_unit}`` + +:Description: The amount of data in a data chunk, per stripe. For + example, a profile with 2 data chunks and stripe_unit=4K + would put the range 0-4K in chunk 0, 4K-8K in chunk 1, + then 8K-12K in chunk 0 again. This should be a multiple + of 4K for best performance. The default value is taken + from the monitor config option + ``osd_pool_erasure_code_stripe_unit`` when a pool is + created. The stripe_width of a pool using this profile + will be the number of data chunks multiplied by this + stripe_unit. + +:Type: String +:Required: No. + +``{key=value}`` + +:Description: The semantic of the remaining key/value pairs is defined + by the erasure code plugin. + +:Type: String +:Required: No. + +``--force`` + +:Description: Override an existing profile by the same name, and allow + setting a non-4K-aligned stripe_unit. + +:Type: String +:Required: No. 
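+
+Putting these options together, a minimal sketch of defining a profile and
+then listing the existing profiles might look like the following (the
+``myprofile`` name and the chosen values are only examples)::
+
+  ceph osd erasure-code-profile set myprofile \
+      plugin=jerasure k=4 m=2 \
+      crush-failure-domain=host
+  ceph osd erasure-code-profile ls
+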
+ +osd erasure-code-profile rm +============================ + +To remove an erasure code profile:: + + ceph osd erasure-code-profile rm {name} + +If the profile is referenced by a pool, the deletion will fail. + +osd erasure-code-profile get +============================ + +To display an erasure code profile:: + + ceph osd erasure-code-profile get {name} + +osd erasure-code-profile ls +=========================== + +To list the names of all erasure code profiles:: + + ceph osd erasure-code-profile ls + diff --git a/src/ceph/doc/rados/operations/erasure-code-shec.rst b/src/ceph/doc/rados/operations/erasure-code-shec.rst new file mode 100644 index 0000000..e3bab37 --- /dev/null +++ b/src/ceph/doc/rados/operations/erasure-code-shec.rst @@ -0,0 +1,144 @@ +======================== +SHEC erasure code plugin +======================== + +The *shec* plugin encapsulates the `multiple SHEC +<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_ +library. It allows ceph to recover data more efficiently than Reed Solomon codes. + +Create an SHEC profile +====================== + +To create a new *shec* erasure code profile:: + + ceph osd erasure-code-profile set {name} \ + plugin=shec \ + [k={data-chunks}] \ + [m={coding-chunks}] \ + [c={durability-estimator}] \ + [crush-root={root}] \ + [crush-failure-domain={bucket-type}] \ + [crush-device-class={device-class}] \ + [directory={directory}] \ + [--force] + +Where: + +``k={data-chunks}`` + +:Description: Each object is split in **data-chunks** parts, + each stored on a different OSD. + +:Type: Integer +:Required: No. +:Default: 4 + +``m={coding-chunks}`` + +:Description: Compute **coding-chunks** for each object and store them on + different OSDs. The number of **coding-chunks** does not necessarily + equal the number of OSDs that can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 3 + +``c={durability-estimator}`` + +:Description: The number of parity chunks each of which includes each data chunk in its + calculation range. The number is used as a **durability estimator**. + For instance, if c=2, 2 OSDs can be down without losing data. + +:Type: Integer +:Required: No. +:Default: 2 + +``crush-root={root}`` + +:Description: The name of the crush bucket used for the first step of + the ruleset. For intance **step take default**. + +:Type: String +:Required: No. +:Default: default + +``crush-failure-domain={bucket-type}`` + +:Description: Ensure that no two chunks are in a bucket with the same + failure domain. For instance, if the failure domain is + **host** no two chunks will be stored on the same + host. It is used to create a ruleset step such as **step + chooseleaf host**. + +:Type: String +:Required: No. +:Default: host + +``crush-device-class={device-class}`` + +:Description: Restrict placement to devices of a specific class (e.g., + ``ssd`` or ``hdd``), using the crush device class names + in the CRUSH map. + +:Type: String +:Required: No. +:Default: + +``directory={directory}`` + +:Description: Set the **directory** name from which the erasure code + plugin is loaded. + +:Type: String +:Required: No. +:Default: /usr/lib/ceph/erasure-code + +``--force`` + +:Description: Override an existing profile by the same name. + +:Type: String +:Required: No. + +Brief description of SHEC's layouts +=================================== + +Space Efficiency +---------------- + +Space efficiency is a ratio of data chunks to all ones in a object and +represented as k/(k+m). 
+In order to improve space efficiency, you should increase k or decrease m.
+
+::
+
+    space efficiency of SHEC(4,3,2) = 4/(4+3) = 0.57
+    SHEC(5,3,2) or SHEC(4,2,2) improves SHEC(4,3,2)'s space efficiency
+
+Durability
+----------
+
+The third parameter of SHEC (=c) is a durability estimator, which approximates
+the number of OSDs that can be down without losing data.
+
+``durability estimator of SHEC(4,3,2) = 2``
+
+Recovery Efficiency
+-------------------
+
+Describing the calculation of recovery efficiency is beyond the scope of this document,
+but at least increasing m without increasing c improves recovery efficiency.
+(However, we must pay attention to the sacrifice of space efficiency in this case.)
+
+``SHEC(4,2,2) -> SHEC(4,3,2) : achieves improvement of recovery efficiency``
+
+Erasure code profile examples
+=============================
+
+::
+
+    $ ceph osd erasure-code-profile set SHECprofile \
+         plugin=shec \
+         k=8 m=4 c=3 \
+         crush-failure-domain=host
+    $ ceph osd pool create shecpool 256 256 erasure SHECprofile
diff --git a/src/ceph/doc/rados/operations/erasure-code.rst b/src/ceph/doc/rados/operations/erasure-code.rst
new file mode 100644
index 0000000..6ec5a09
--- /dev/null
+++ b/src/ceph/doc/rados/operations/erasure-code.rst
@@ -0,0 +1,195 @@
+=============
+ Erasure code
+=============
+
+A Ceph pool is associated with a type that determines how it sustains the
+loss of an OSD (i.e. a disk, since most of the time there is one OSD per disk).  The
+default choice when `creating a pool <../pools>`_ is *replicated*,
+meaning every object is copied on multiple disks.  The `Erasure Code
+<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used
+instead to save space.
+
+Creating a sample erasure coded pool
+------------------------------------
+
+The simplest erasure coded pool is equivalent to `RAID5
+<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
+requires at least three hosts::
+
+    $ ceph osd pool create ecpool 12 12 erasure
+    pool 'ecpool' created
+    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+    $ rados --pool ecpool get NYAN -
+    ABCDEFGHI
+
+.. note:: the 12 in *pool create* stands for
+          `the number of placement groups <../pools>`_.
+
+Erasure code profiles
+---------------------
+
+The default erasure code profile sustains the loss of a single OSD. It
+is equivalent to a replicated pool of size two but requires 1.5TB
+instead of 2TB to store 1TB of data. The default profile can be
+displayed with::
+
+    $ ceph osd erasure-code-profile get default
+    k=2
+    m=1
+    plugin=jerasure
+    crush-failure-domain=host
+    technique=reed_sol_van
+
+Choosing the right profile is important because it cannot be modified
+after the pool is created: a new pool with a different profile needs
+to be created and all objects from the previous pool moved to the new one.
+
+The most important parameters of the profile are *K*, *M* and
+*crush-failure-domain* because they define the storage overhead and
+the data durability. For instance, if the desired architecture must
+sustain the loss of two racks with a storage overhead of 40%,
+the following profile can be defined::
+
+    $ ceph osd erasure-code-profile set myprofile \
+       k=3 \
+       m=2 \
+       crush-failure-domain=rack
+    $ ceph osd pool create ecpool 12 12 erasure myprofile
+    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
+    $ rados --pool ecpool get NYAN -
+    ABCDEFGHI
+
+The *NYAN* object will be divided into three (*K=3*) and two additional
+*chunks* will be created (*M=2*).
The value of *M* defines how many +OSD can be lost simultaneously without losing any data. The +*crush-failure-domain=rack* will create a CRUSH ruleset that ensures +no two *chunks* are stored in the same rack. + +.. ditaa:: + +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | v | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | + | +------+ + +----------------->| OSD5 | + +------+ + + +More information can be found in the `erasure code profiles +<../erasure-code-profile>`_ documentation. + + +Erasure Coding with Overwrites +------------------------------ + +By default, erasure coded pools only work with uses like RGW that +perform full object writes and appends. + +Since Luminous, partial writes for an erasure coded pool may be +enabled with a per-pool setting. This lets RBD and Cephfs store their +data in an erasure coded pool:: + + ceph osd pool set ec_pool allow_ec_overwrites true + +This can only be enabled on a pool residing on bluestore OSDs, since +bluestore's checksumming is used to detect bitrot or other corruption +during deep-scrub. In addition to being unsafe, using filestore with +ec overwrites yields low performance compared to bluestore. + +Erasure coded pools do not support omap, so to use them with RBD and +Cephfs you must instruct them to store their data in an ec pool, and +their metadata in a replicated pool. For RBD, this means using the +erasure coded pool as the ``--data-pool`` during image creation:: + + rbd create --size 1G --data-pool ec_pool replicated_pool/image_name + +For Cephfs, using an erasure coded pool means setting that pool in +a `file layout <../../../cephfs/file-layouts>`_. + + +Erasure coded pool and cache tiering +------------------------------------ + +Erasure coded pools require more resources than replicated pools and +lack some functionalities such as omap. To overcome these +limitations, one can set up a `cache tier <../cache-tiering>`_ +before the erasure coded pool. + +For instance, if the pool *hot-storage* is made of fast storage:: + + $ ceph osd tier add ecpool hot-storage + $ ceph osd tier cache-mode hot-storage writeback + $ ceph osd tier set-overlay ecpool hot-storage + +will place the *hot-storage* pool as tier of *ecpool* in *writeback* +mode so that every write and read to the *ecpool* are actually using +the *hot-storage* and benefit from its flexibility and speed. + +More information can be found in the `cache tiering +<../cache-tiering>`_ documentation. + +Glossary +-------- + +*chunk* + when the encoding function is called, it returns chunks of the same + size. Data chunks which can be concatenated to reconstruct the original + object and coding chunks which can be used to rebuild a lost chunk. + +*K* + the number of data *chunks*, i.e. 
the number of *chunks* in which the + original object is divided. For instance if *K* = 2 a 10KB object + will be divided into *K* objects of 5KB each. + +*M* + the number of coding *chunks*, i.e. the number of additional *chunks* + computed by the encoding functions. If there are 2 coding *chunks*, + it means 2 OSDs can be out without losing data. + + +Table of content +---------------- + +.. toctree:: + :maxdepth: 1 + + erasure-code-profile + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc + erasure-code-shec diff --git a/src/ceph/doc/rados/operations/health-checks.rst b/src/ceph/doc/rados/operations/health-checks.rst new file mode 100644 index 0000000..c1e2200 --- /dev/null +++ b/src/ceph/doc/rados/operations/health-checks.rst @@ -0,0 +1,527 @@ + +============= +Health checks +============= + +Overview +======== + +There is a finite set of possible health messages that a Ceph cluster can +raise -- these are defined as *health checks* which have unique identifiers. + +The identifier is a terse pseudo-human-readable (i.e. like a variable name) +string. It is intended to enable tools (such as UIs) to make sense of +health checks, and present them in a way that reflects their meaning. + +This page lists the health checks that are raised by the monitor and manager +daemons. In addition to these, you may also see health checks that originate +from MDS daemons (see :doc:`/cephfs/health-messages`), and health checks +that are defined by ceph-mgr python modules. + +Definitions +=========== + + +OSDs +---- + +OSD_DOWN +________ + +One or more OSDs are marked down. The ceph-osd daemon may have been +stopped, or peer OSDs may be unable to reach the OSD over the network. +Common causes include a stopped or crashed daemon, a down host, or a +network outage. + +Verify the host is healthy, the daemon is started, and network is +functioning. If the daemon has crashed, the daemon log file +(``/var/log/ceph/ceph-osd.*``) may contain debugging information. + +OSD_<crush type>_DOWN +_____________________ + +(e.g. OSD_HOST_DOWN, OSD_ROOT_DOWN) + +All the OSDs within a particular CRUSH subtree are marked down, for example +all OSDs on a host. + +OSD_ORPHAN +__________ + +An OSD is referenced in the CRUSH map hierarchy but does not exist. + +The OSD can be removed from the CRUSH hierarchy with:: + + ceph osd crush rm osd.<id> + +OSD_OUT_OF_ORDER_FULL +_____________________ + +The utilization thresholds for `backfillfull`, `nearfull`, `full`, +and/or `failsafe_full` are not ascending. In particular, we expect +`backfillfull < nearfull`, `nearfull < full`, and `full < +failsafe_full`. + +The thresholds can be adjusted with:: + + ceph osd set-backfillfull-ratio <ratio> + ceph osd set-nearfull-ratio <ratio> + ceph osd set-full-ratio <ratio> + + +OSD_FULL +________ + +One or more OSDs has exceeded the `full` threshold and is preventing +the cluster from servicing writes. + +Utilization by pool can be checked with:: + + ceph df + +The currently defined `full` ratio can be seen with:: + + ceph osd dump | grep full_ratio + +A short-term workaround to restore write availability is to raise the full +threshold by a small amount:: + + ceph osd set-full-ratio <ratio> + +New storage should be added to the cluster by deploying more OSDs or +existing data should be deleted in order to free up space. + +OSD_BACKFILLFULL +________________ + +One or more OSDs has exceeded the `backfillfull` threshold, which will +prevent data from being allowed to rebalance to this device. 
This is +an early warning that rebalancing may not be able to complete and that +the cluster is approaching full. + +Utilization by pool can be checked with:: + + ceph df + +OSD_NEARFULL +____________ + +One or more OSDs has exceeded the `nearfull` threshold. This is an early +warning that the cluster is approaching full. + +Utilization by pool can be checked with:: + + ceph df + +OSDMAP_FLAGS +____________ + +One or more cluster flags of interest has been set. These flags include: + +* *full* - the cluster is flagged as full and cannot service writes +* *pauserd*, *pausewr* - paused reads or writes +* *noup* - OSDs are not allowed to start +* *nodown* - OSD failure reports are being ignored, such that the + monitors will not mark OSDs `down` +* *noin* - OSDs that were previously marked `out` will not be marked + back `in` when they start +* *noout* - down OSDs will not automatically be marked out after the + configured interval +* *nobackfill*, *norecover*, *norebalance* - recovery or data + rebalancing is suspended +* *noscrub*, *nodeep_scrub* - scrubbing is disabled +* *notieragent* - cache tiering activity is suspended + +With the exception of *full*, these flags can be set or cleared with:: + + ceph osd set <flag> + ceph osd unset <flag> + +OSD_FLAGS +_________ + +One or more OSDs has a per-OSD flag of interest set. These flags include: + +* *noup*: OSD is not allowed to start +* *nodown*: failure reports for this OSD will be ignored +* *noin*: if this OSD was previously marked `out` automatically + after a failure, it will not be marked in when it stats +* *noout*: if this OSD is down it will not automatically be marked + `out` after the configured interval + +Per-OSD flags can be set and cleared with:: + + ceph osd add-<flag> <osd-id> + ceph osd rm-<flag> <osd-id> + +For example, :: + + ceph osd rm-nodown osd.123 + +OLD_CRUSH_TUNABLES +__________________ + +The CRUSH map is using very old settings and should be updated. The +oldest tunables that can be used (i.e., the oldest client version that +can connect to the cluster) without triggering this health warning is +determined by the ``mon_crush_min_required_version`` config option. +See :doc:`/rados/operations/crush-map/#tunables` for more information. + +OLD_CRUSH_STRAW_CALC_VERSION +____________________________ + +The CRUSH map is using an older, non-optimal method for calculating +intermediate weight values for ``straw`` buckets. + +The CRUSH map should be updated to use the newer method +(``straw_calc_version=1``). See +:doc:`/rados/operations/crush-map/#tunables` for more information. + +CACHE_POOL_NO_HIT_SET +_____________________ + +One or more cache pools is not configured with a *hit set* to track +utilization, which will prevent the tiering agent from identifying +cold objects to flush and evict from the cache. + +Hit sets can be configured on the cache pool with:: + + ceph osd pool set <poolname> hit_set_type <type> + ceph osd pool set <poolname> hit_set_period <period-in-seconds> + ceph osd pool set <poolname> hit_set_count <number-of-hitsets> + ceph osd pool set <poolname> hit_set_fpp <target-false-positive-rate> + +OSD_NO_SORTBITWISE +__________________ + +No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not +been set. + +The ``sortbitwise`` flag must be set before luminous v12.y.z or newer +OSDs can start. You can safely set the flag with:: + + ceph osd set sortbitwise + +POOL_FULL +_________ + +One or more pools has reached its quota and is no longer allowing writes. 
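+
+In addition to the cluster-wide view shown below, you can inspect the quota of a
+single pool directly (a quick check, shown here for a hypothetical pool named
+``mypool``)::
+
+  ceph osd pool get-quota mypool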
+
+Pool quotas and utilization can be seen with::
+
+  ceph df detail
+
+You can either raise the pool quota with::
+
+  ceph osd pool set-quota <poolname> max_objects <num-objects>
+  ceph osd pool set-quota <poolname> max_bytes <num-bytes>
+
+or delete some existing data to reduce utilization.
+
+
+Data health (pools & placement groups)
+--------------------------------------
+
+PG_AVAILABILITY
+_______________
+
+Data availability is reduced, meaning that the cluster is unable to
+service potential read or write requests for some data in the cluster.
+Specifically, one or more PGs is in a state that does not allow IO
+requests to be serviced. Problematic PG states include *peering*,
+*stale*, *incomplete*, and the lack of *active* (if those conditions do not clear
+quickly).
+
+Detailed information about which PGs are affected is available from::
+
+  ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+  ceph tell <pgid> query
+
+PG_DEGRADED
+___________
+
+Data redundancy is reduced for some data, meaning the cluster does not
+have the desired number of replicas for all data (for replicated
+pools) or erasure code fragments (for erasure coded pools).
+Specifically, one or more PGs:
+
+* has the *degraded* or *undersized* flag set, meaning there are not
+  enough instances of that placement group in the cluster;
+* has not had the *clean* flag set for some time.
+
+Detailed information about which PGs are affected is available from::
+
+  ceph health detail
+
+In most cases the root cause is that one or more OSDs is currently
+down; see the discussion for ``OSD_DOWN`` above.
+
+The state of specific problematic PGs can be queried with::
+
+  ceph tell <pgid> query
+
+
+PG_DEGRADED_FULL
+________________
+
+Data redundancy may be reduced or at risk for some data due to a lack
+of free space in the cluster. Specifically, one or more PGs has the
+*backfill_toofull* or *recovery_toofull* flag set, meaning that the
+cluster is unable to migrate or recover data because one or more OSDs
+is above the *backfillfull* threshold.
+
+See the discussion for *OSD_BACKFILLFULL* or *OSD_FULL* above for
+steps to resolve this condition.
+
+PG_DAMAGED
+__________
+
+Data scrubbing has discovered some problems with data consistency in
+the cluster. Specifically, one or more PGs has the *inconsistent* or
+*snaptrim_error* flag set, indicating that an earlier scrub operation
+found a problem, or has the *repair* flag set, meaning a repair for
+such an inconsistency is currently in progress.
+
+See :doc:`pg-repair` for more information.
+
+OSD_SCRUB_ERRORS
+________________
+
+Recent OSD scrubs have uncovered inconsistencies. This error is generally
+paired with *PG_DAMAGED* (see above).
+
+See :doc:`pg-repair` for more information.
+
+CACHE_POOL_NEAR_FULL
+____________________
+
+A cache tier pool is nearly full. Full in this context is determined
+by the ``target_max_bytes`` and ``target_max_objects`` properties on
+the cache pool. Once the pool reaches the target threshold, write
+requests to the pool may block while data is flushed and evicted
+from the cache, a state that normally leads to very high latencies and
+poor performance.
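+
+Before adjusting anything, it can help to confirm which thresholds are
+currently set on the cache pool. A minimal check, assuming a cache pool
+named ``hot-storage``::
+
+  ceph osd pool get hot-storage target_max_bytes
+  ceph osd pool get hot-storage target_max_objects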
+ +The cache pool target size can be adjusted with:: + + ceph osd pool set <cache-pool-name> target_max_bytes <bytes> + ceph osd pool set <cache-pool-name> target_max_objects <objects> + +Normal cache flush and evict activity may also be throttled due to reduced +availability or performance of the base tier, or overall cluster load. + +TOO_FEW_PGS +___________ + +The number of PGs in use in the cluster is below the configurable +threshold of ``mon_pg_warn_min_per_osd`` PGs per OSD. This can lead +to suboptimizal distribution and balance of data across the OSDs in +the cluster, and similar reduce overall performance. + +This may be an expected condition if data pools have not yet been +created. + +The PG count for existing pools can be increased or new pools can be +created. Please refer to +:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for +more information. + +TOO_MANY_PGS +____________ + +The number of PGs in use in the cluster is above the configurable +threshold of ``mon_max_pg_per_osd`` PGs per OSD. If this threshold is +exceed the cluster will not allow new pools to be created, pool `pg_num` to +be increased, or pool replication to be increased (any of which would lead to +more PGs in the cluster). A large number of PGs can lead +to higher memory utilization for OSD daemons, slower peering after +cluster state changes (like OSD restarts, additions, or removals), and +higher load on the Manager and Monitor daemons. + +The simplest way to mitigate the problem is to increase the number of +OSDs in the cluster by adding more hardware. Note that the OSD count +used for the purposes of this health check is the number of "in" OSDs, +so marking "out" OSDs "in" (if there are any) can also help:: + + ceph osd in <osd id(s)> + +Please refer to +:doc:`placement-groups#Choosing-the-number-of-Placement-Groups` for +more information. + +SMALLER_PGP_NUM +_______________ + +One or more pools has a ``pgp_num`` value less than ``pg_num``. This +is normally an indication that the PG count was increased without +also increasing the placement behavior. + +This is sometimes done deliberately to separate out the `split` step +when the PG count is adjusted from the data migration that is needed +when ``pgp_num`` is changed. + +This is normally resolved by setting ``pgp_num`` to match ``pg_num``, +triggering the data migration, with:: + + ceph osd pool set <pool> pgp_num <pg-num-value> + +MANY_OBJECTS_PER_PG +___________________ + +One or more pools has an average number of objects per PG that is +significantly higher than the overall cluster average. The specific +threshold is controlled by the ``mon_pg_warn_max_object_skew`` +configuration value. + +This is usually an indication that the pool(s) containing most of the +data in the cluster have too few PGs, and/or that other pools that do +not contain as much data have too many PGs. See the discussion of +*TOO_MANY_PGS* above. + +The threshold can be raised to silence the health warning by adjusting +the ``mon_pg_warn_max_object_skew`` config option on the monitors. + +POOL_APP_NOT_ENABLED +____________________ + +A pool exists that contains one or more objects but has not been +tagged for use by a particular application. + +Resolve this warning by labeling the pool for use by an application. 
For
+example, if the pool is used by RBD::
+
+  rbd pool init <poolname>
+
+If the pool is being used by a custom application 'foo', you can also label
+it via the low-level command::
+
+  ceph osd pool application enable <poolname> foo
+
+For more information, see :doc:`pools.rst#associate-pool-to-application`.
+
+POOL_FULL
+_________
+
+One or more pools has reached (or is very close to reaching) its
+quota. The threshold to trigger this error condition is controlled by
+the ``mon_pool_quota_crit_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+  ceph osd pool set-quota <pool> max_bytes <bytes>
+  ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
+
+POOL_NEAR_FULL
+______________
+
+One or more pools is approaching its quota. The threshold to trigger
+this warning condition is controlled by the
+``mon_pool_quota_warn_threshold`` configuration option.
+
+Pool quotas can be adjusted up or down (or removed) with::
+
+  ceph osd pool set-quota <pool> max_bytes <bytes>
+  ceph osd pool set-quota <pool> max_objects <objects>
+
+Setting the quota value to 0 will disable the quota.
+
+OBJECT_MISPLACED
+________________
+
+One or more objects in the cluster is not stored on the node the
+cluster would like it to be stored on. This is an indication that
+data migration due to some recent cluster change has not yet completed.
+
+Misplaced data is not a dangerous condition in and of itself; data
+consistency is never at risk, and old copies of objects are never
+removed until the desired number of new copies (in the desired
+locations) are present.
+
+OBJECT_UNFOUND
+______________
+
+One or more objects in the cluster cannot be found. Specifically, the
+OSDs know that a new or updated copy of an object should exist, but a
+copy of that version of the object has not been found on OSDs that are
+currently online.
+
+Read or write requests to unfound objects will block.
+
+Ideally, a down OSD that has the more recent copy of the unfound object
+can be brought back online. Candidate OSDs can be identified from the
+peering state for the PG(s) responsible for the unfound object::
+
+  ceph tell <pgid> query
+
+If the latest copy of the object is not available, the cluster can be
+told to roll back to a previous version of the object. See
+:doc:`troubleshooting-pg#Unfound-objects` for more information.
+
+REQUEST_SLOW
+____________
+
+One or more OSD requests is taking a long time to process. This can
+be an indication of extreme load, a slow storage device, or a software
+bug.
+
+The request queue on the OSD(s) in question can be queried with the
+following command, executed from the OSD host::
+
+  ceph daemon osd.<id> ops
+
+A summary of the slowest recent requests can be seen with::
+
+  ceph daemon osd.<id> dump_historic_ops
+
+The location of an OSD can be found with::
+
+  ceph osd find osd.<id>
+
+REQUEST_STUCK
+_____________
+
+One or more OSD requests has been blocked for an extremely long time.
+This is an indication that either the cluster has been unhealthy for
+an extended period of time (e.g., not enough running OSDs) or there is
+some internal problem with the OSD. See the discussion of
+*REQUEST_SLOW* above.
+
+PG_NOT_SCRUBBED
+_______________
+
+One or more PGs has not been scrubbed recently. PGs are normally
+scrubbed every ``mon_scrub_interval`` seconds, and this warning
+triggers when ``mon_warn_not_scrubbed`` such intervals have elapsed
+without a scrub.
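+
+One way to see which PGs are affected is ``ceph health detail``; the last
+scrub timestamp of each PG is also included in the per-PG statistics, which
+can be dumped as follows (a sketch; the exact columns vary by release)::
+
+  ceph health detail
+  ceph pg dump pgs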
+ +PGs will not scrub if they are not flagged as *clean*, which may +happen if they are misplaced or degraded (see *PG_AVAILABILITY* and +*PG_DEGRADED* above). + +You can manually initiate a scrub of a clean PG with:: + + ceph pg scrub <pgid> + +PG_NOT_DEEP_SCRUBBED +____________________ + +One or more PGs has not been deep scrubbed recently. PGs are normally +scrubbed every ``osd_deep_mon_scrub_interval`` seconds, and this warning +triggers when ``mon_warn_not_deep_scrubbed`` such intervals have elapsed +without a scrub. + +PGs will not (deep) scrub if they are not flagged as *clean*, which may +happen if they are misplaced or degraded (see *PG_AVAILABILITY* and +*PG_DEGRADED* above). + +You can manually initiate a scrub of a clean PG with:: + + ceph pg deep-scrub <pgid> diff --git a/src/ceph/doc/rados/operations/index.rst b/src/ceph/doc/rados/operations/index.rst new file mode 100644 index 0000000..aacf764 --- /dev/null +++ b/src/ceph/doc/rados/operations/index.rst @@ -0,0 +1,90 @@ +==================== + Cluster Operations +==================== + +.. raw:: html + + <table><colgroup><col width="50%"><col width="50%"></colgroup><tbody valign="top"><tr><td><h3>High-level Operations</h3> + +High-level cluster operations consist primarily of starting, stopping, and +restarting a cluster with the ``ceph`` service; checking the cluster's health; +and, monitoring an operating cluster. + +.. toctree:: + :maxdepth: 1 + + operating + health-checks + monitoring + monitoring-osd-pg + user-management + +.. raw:: html + + </td><td><h3>Data Placement</h3> + +Once you have your cluster up and running, you may begin working with data +placement. Ceph supports petabyte-scale data storage clusters, with storage +pools and placement groups that distribute data across the cluster using Ceph's +CRUSH algorithm. + +.. toctree:: + :maxdepth: 1 + + data-placement + pools + erasure-code + cache-tiering + placement-groups + upmap + crush-map + crush-map-edits + + + +.. raw:: html + + </td></tr><tr><td><h3>Low-level Operations</h3> + +Low-level cluster operations consist of starting, stopping, and restarting a +particular daemon within a cluster; changing the settings of a particular +daemon or subsystem; and, adding a daemon to the cluster or removing a daemon +from the cluster. The most common use cases for low-level operations include +growing or shrinking the Ceph cluster and replacing legacy or failed hardware +with new hardware. + +.. toctree:: + :maxdepth: 1 + + add-or-rm-osds + add-or-rm-mons + Command Reference <control> + + + +.. raw:: html + + </td><td><h3>Troubleshooting</h3> + +Ceph is still on the leading edge, so you may encounter situations that require +you to evaluate your Ceph configuration and modify your logging and debugging +settings to identify and remedy issues you are encountering with your cluster. + +.. toctree:: + :maxdepth: 1 + + ../troubleshooting/community + ../troubleshooting/troubleshooting-mon + ../troubleshooting/troubleshooting-osd + ../troubleshooting/troubleshooting-pg + ../troubleshooting/log-and-debug + ../troubleshooting/cpu-profiling + ../troubleshooting/memory-profiling + + + + +.. 
raw:: html + + </td></tr></tbody></table> + diff --git a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst new file mode 100644 index 0000000..0107e34 --- /dev/null +++ b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst @@ -0,0 +1,617 @@ +========================= + Monitoring OSDs and PGs +========================= + +High availability and high reliability require a fault-tolerant approach to +managing hardware and software issues. Ceph has no single point-of-failure, and +can service requests for data in a "degraded" mode. Ceph's `data placement`_ +introduces a layer of indirection to ensure that data doesn't bind directly to +particular OSD addresses. This means that tracking down system faults requires +finding the `placement group`_ and the underlying OSDs at root of the problem. + +.. tip:: A fault in one part of the cluster may prevent you from accessing a + particular object, but that doesn't mean that you cannot access other objects. + When you run into a fault, don't panic. Just follow the steps for monitoring + your OSDs and placement groups. Then, begin troubleshooting. + +Ceph is generally self-repairing. However, when problems persist, monitoring +OSDs and placement groups will help you identify the problem. + + +Monitoring OSDs +=============== + +An OSD's status is either in the cluster (``in``) or out of the cluster +(``out``); and, it is either up and running (``up``), or it is down and not +running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster +(you can read and write data) or it is ``out`` of the cluster. If it was +``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate +placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will +not assign placement groups to the OSD. If an OSD is ``down``, it should also be +``out``. + +.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster + will not be in a healthy state. + +.. ditaa:: +----------------+ +----------------+ + | | | | + | OSD #n In | | OSD #n Up | + | | | | + +----------------+ +----------------+ + ^ ^ + | | + | | + v v + +----------------+ +----------------+ + | | | | + | OSD #n Out | | OSD #n Down | + | | | | + +----------------+ +----------------+ + +If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, +you may notice that the cluster does not always echo back ``HEALTH OK``. Don't +panic. With respect to OSDs, you should expect that the cluster will **NOT** +echo ``HEALTH OK`` in a few expected circumstances: + +#. You haven't started the cluster yet (it won't respond). +#. You have just started or restarted the cluster and it's not ready yet, + because the placement groups are getting created and the OSDs are in + the process of peering. +#. You just added or removed an OSD. +#. You just have modified your cluster map. + +An important aspect of monitoring OSDs is to ensure that when the cluster +is up and running that all OSDs that are ``in`` the cluster are ``up`` and +running, too. To see if all OSDs are running, execute:: + + ceph osd stat + +The result should tell you the map epoch (eNNNN), the total number of OSDs (x), +how many are ``up`` (y) and how many are ``in`` (z). 
:: + + eNNNN: x osds: y up, z in + +If the number of OSDs that are ``in`` the cluster is more than the number of +OSDs that are ``up``, execute the following command to identify the ``ceph-osd`` +daemons that are not running:: + + ceph osd tree + +:: + + dumped osdmap tree epoch 1 + # id weight type name up/down reweight + -1 2 pool openstack + -3 2 rack dell-2950-rack-A + -2 2 host dell-2950-A1 + 0 1 osd.0 up 1 + 1 1 osd.1 down 1 + + +.. tip:: The ability to search through a well-designed CRUSH hierarchy may help + you troubleshoot your cluster by identifying the physcial locations faster. + +If an OSD is ``down``, start it:: + + sudo systemctl start ceph-osd@1 + +See `OSD Not Running`_ for problems associated with OSDs that stopped, or won't +restart. + + +PG Sets +======= + +When CRUSH assigns placement groups to OSDs, it looks at the number of replicas +for the pool and assigns the placement group to OSDs such that each replica of +the placement group gets assigned to a different OSD. For example, if the pool +requires three replicas of a placement group, CRUSH may assign them to +``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a +pseudo-random placement that will take into account failure domains you set in +your `CRUSH map`_, so you will rarely see placement groups assigned to nearest +neighbor OSDs in a large cluster. We refer to the set of OSDs that should +contain the replicas of a particular placement group as the **Acting Set**. In +some cases, an OSD in the Acting Set is ``down`` or otherwise not able to +service requests for objects in the placement group. When these situations +arise, don't panic. Common examples include: + +- You added or removed an OSD. Then, CRUSH reassigned the placement group to + other OSDs--thereby changing the composition of the Acting Set and spawning + the migration of data with a "backfill" process. +- An OSD was ``down``, was restarted, and is now ``recovering``. +- An OSD in the Acting Set is ``down`` or unable to service requests, + and another OSD has temporarily assumed its duties. + +Ceph processes a client request using the **Up Set**, which is the set of OSDs +that will actually handle the requests. In most cases, the Up Set and the Acting +Set are virtually identical. When they are not, it may indicate that Ceph is +migrating data, an OSD is recovering, or that there is a problem (i.e., Ceph +usually echoes a "HEALTH WARN" state with a "stuck stale" message in such +scenarios). + +To retrieve a list of placement groups, execute:: + + ceph pg dump + +To view which OSDs are within the Acting Set or the Up Set for a given placement +group, execute:: + + ceph pg map {pg-num} + +The result should tell you the osdmap epoch (eNNN), the placement group number +({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set +(acting[]). :: + + osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2] + +.. note:: If the Up Set and Acting Set do not match, this may be an indicator + that the cluster rebalancing itself or of a potential problem with + the cluster. + + +Peering +======= + +Before you can write data to a placement group, it must be in an ``active`` +state, and it **should** be in a ``clean`` state. For Ceph to determine the +current state of a placement group, the primary OSD of the placement group +(i.e., the first OSD in the acting set), peers with the secondary and tertiary +OSDs to establish agreement on the current state of the placement group +(assuming a pool with 3 replicas of the PG). + + +.. 
ditaa:: +---------+ +---------+ +-------+ + | OSD 1 | | OSD 2 | | OSD 3 | + +---------+ +---------+ +-------+ + | | | + | Request To | | + | Peer | | + |-------------->| | + |<--------------| | + | Peering | + | | + | Request To | + | Peer | + |----------------------------->| + |<-----------------------------| + | Peering | + +The OSDs also report their status to the monitor. See `Configuring Monitor/OSD +Interaction`_ for details. To troubleshoot peering issues, see `Peering +Failure`_. + + +Monitoring Placement Group States +================================= + +If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``, +you may notice that the cluster does not always echo back ``HEALTH OK``. After +you check to see if the OSDs are running, you should also check placement group +states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a +number of placement group peering-related circumstances: + +#. You have just created a pool and placement groups haven't peered yet. +#. The placement groups are recovering. +#. You have just added an OSD to or removed an OSD from the cluster. +#. You have just modified your CRUSH map and your placement groups are migrating. +#. There is inconsistent data in different replicas of a placement group. +#. Ceph is scrubbing a placement group's replicas. +#. Ceph doesn't have enough storage capacity to complete backfilling operations. + +If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't +panic. In many cases, the cluster will recover on its own. In some cases, you +may need to take action. An important aspect of monitoring placement groups is +to ensure that when the cluster is up and running that all placement groups are +``active``, and preferably in the ``clean`` state. To see the status of all +placement groups, execute:: + + ceph pg stat + +The result should tell you the placement group map version (vNNNNNN), the total +number of placement groups (x), and how many placement groups are in a +particular state such as ``active+clean`` (y). :: + + vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail + +.. note:: It is common for Ceph to report multiple states for placement groups. + +In addition to the placement group states, Ceph will also echo back the amount +of data used (aa), the amount of storage capacity remaining (bb), and the total +storage capacity for the placement group. These numbers can be important in a +few cases: + +- You are reaching your ``near full ratio`` or ``full ratio``. +- Your data is not getting distributed across the cluster due to an + error in your CRUSH configuration. + + +.. topic:: Placement Group IDs + + Placement group IDs consist of the pool number (not pool name) followed + by a period (.) and the placement group ID--a hexadecimal number. You + can view pool numbers and their names from the output of ``ceph osd + lspools``. For example, the default pool ``rbd`` corresponds to + pool number ``0``. A fully qualified placement group ID has the + following form:: + + {pool-num}.{pg-id} + + And it typically looks like this:: + + 0.1f + + +To retrieve a list of placement groups, execute the following:: + + ceph pg dump + +You can also format the output in JSON format and save it to a file:: + + ceph pg dump -o {filename} --format=json + +To query a particular placement group, execute the following:: + + ceph pg {poolnum}.{pg-id} query + +Ceph will output the query in JSON format. + +.. 
code-block:: javascript + + { + "state": "active+clean", + "up": [ + 1, + 0 + ], + "acting": [ + 1, + 0 + ], + "info": { + "pgid": "1.e", + "last_update": "4'1", + "last_complete": "4'1", + "log_tail": "0'0", + "last_backfill": "MAX", + "purged_snaps": "[]", + "history": { + "epoch_created": 1, + "last_epoch_started": 537, + "last_epoch_clean": 537, + "last_epoch_split": 534, + "same_up_since": 536, + "same_interval_since": 536, + "same_primary_since": 536, + "last_scrub": "4'1", + "last_scrub_stamp": "2013-01-25 10:12:23.828174" + }, + "stats": { + "version": "4'1", + "reported": "536'782", + "state": "active+clean", + "last_fresh": "2013-01-25 10:12:23.828271", + "last_change": "2013-01-25 10:12:23.828271", + "last_active": "2013-01-25 10:12:23.828271", + "last_clean": "2013-01-25 10:12:23.828271", + "last_unstale": "2013-01-25 10:12:23.828271", + "mapping_epoch": 535, + "log_start": "0'0", + "ondisk_log_start": "0'0", + "created": 1, + "last_epoch_clean": 1, + "parent": "0.0", + "parent_split_bits": 0, + "last_scrub": "4'1", + "last_scrub_stamp": "2013-01-25 10:12:23.828174", + "log_size": 128, + "ondisk_log_size": 128, + "stat_sum": { + "num_bytes": 205, + "num_objects": 1, + "num_object_clones": 0, + "num_object_copies": 0, + "num_objects_missing_on_primary": 0, + "num_objects_degraded": 0, + "num_objects_unfound": 0, + "num_read": 1, + "num_read_kb": 0, + "num_write": 3, + "num_write_kb": 1 + }, + "stat_cat_sum": { + + }, + "up": [ + 1, + 0 + ], + "acting": [ + 1, + 0 + ] + }, + "empty": 0, + "dne": 0, + "incomplete": 0 + }, + "recovery_state": [ + { + "name": "Started\/Primary\/Active", + "enter_time": "2013-01-23 09:35:37.594691", + "might_have_unfound": [ + + ], + "scrub": { + "scrub_epoch_start": "536", + "scrub_active": 0, + "scrub_block_writes": 0, + "finalizing_scrub": 0, + "scrub_waiting_on": 0, + "scrub_waiting_on_whom": [ + + ] + } + }, + { + "name": "Started", + "enter_time": "2013-01-23 09:35:31.581160" + } + ] + } + + + +The following subsections describe common states in greater detail. + +Creating +-------- + +When you create a pool, it will create the number of placement groups you +specified. Ceph will echo ``creating`` when it is creating one or more +placement groups. Once they are created, the OSDs that are part of a placement +group's Acting Set will peer. Once peering is complete, the placement group +status should be ``active+clean``, which means a Ceph client can begin writing +to the placement group. + +.. ditaa:: + + /-----------\ /-----------\ /-----------\ + | Creating |------>| Peering |------>| Active | + \-----------/ \-----------/ \-----------/ + +Peering +------- + +When Ceph is Peering a placement group, Ceph is bringing the OSDs that +store the replicas of the placement group into **agreement about the state** +of the objects and metadata in the placement group. When Ceph completes peering, +this means that the OSDs that store the placement group agree about the current +state of the placement group. However, completion of the peering process does +**NOT** mean that each replica has the latest contents. + +.. topic:: Authoratative History + + Ceph will **NOT** acknowledge a write operation to a client, until + all OSDs of the acting set persist the write operation. This practice + ensures that at least one member of the acting set will have a record + of every acknowledged write operation since the last successful + peering operation. 
+ + With an accurate record of each acknowledged write operation, Ceph can + construct and disseminate a new authoritative history of the placement + group--a complete, and fully ordered set of operations that, if performed, + would bring an OSD’s copy of a placement group up to date. + + +Active +------ + +Once Ceph completes the peering process, a placement group may become +``active``. The ``active`` state means that the data in the placement group is +generally available in the primary placement group and the replicas for read +and write operations. + + +Clean +----- + +When a placement group is in the ``clean`` state, the primary OSD and the +replica OSDs have successfully peered and there are no stray replicas for the +placement group. Ceph replicated all objects in the placement group the correct +number of times. + + +Degraded +-------- + +When a client writes an object to the primary OSD, the primary OSD is +responsible for writing the replicas to the replica OSDs. After the primary OSD +writes the object to storage, the placement group will remain in a ``degraded`` +state until the primary OSD has received an acknowledgement from the replica +OSDs that Ceph created the replica objects successfully. + +The reason a placement group can be ``active+degraded`` is that an OSD may be +``active`` even though it doesn't hold all of the objects yet. If an OSD goes +``down``, Ceph marks each placement group assigned to the OSD as ``degraded``. +The OSDs must peer again when the OSD comes back online. However, a client can +still write a new object to a ``degraded`` placement group if it is ``active``. + +If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the +``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD +to another OSD. The time between being marked ``down`` and being marked ``out`` +is controlled by ``mon osd down out interval``, which is set to ``600`` seconds +by default. + +A placement group can also be ``degraded``, because Ceph cannot find one or more +objects that Ceph thinks should be in the placement group. While you cannot +read or write to unfound objects, you can still access all of the other objects +in the ``degraded`` placement group. + + +Recovering +---------- + +Ceph was designed for fault-tolerance at a scale where hardware and software +problems are ongoing. When an OSD goes ``down``, its contents may fall behind +the current state of other replicas in the placement groups. When the OSD is +back ``up``, the contents of the placement groups must be updated to reflect the +current state. During that time period, the OSD may reflect a ``recovering`` +state. + +Recovery is not always trivial, because a hardware failure might cause a +cascading failure of multiple OSDs. For example, a network switch for a rack or +cabinet may fail, which can cause the OSDs of a number of host machines to fall +behind the current state of the cluster. Each one of the OSDs must recover once +the fault is resolved. + +Ceph provides a number of settings to balance the resource contention between +new service requests and the need to recover data objects and restore the +placement groups to the current state. The ``osd recovery delay start`` setting +allows an OSD to restart, re-peer and even process some replay requests before +starting the recovery process. The ``osd +recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail, +restart and re-peer at staggered rates. 
The ``osd recovery max active`` setting +limits the number of recovery requests an OSD will entertain simultaneously to +prevent the OSD from failing to serve . The ``osd recovery max chunk`` setting +limits the size of the recovered data chunks to prevent network congestion. + + +Back Filling +------------ + +When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs +in the cluster to the newly added OSD. Forcing the new OSD to accept the +reassigned placement groups immediately can put excessive load on the new OSD. +Back filling the OSD with the placement groups allows this process to begin in +the background. Once backfilling is complete, the new OSD will begin serving +requests when it is ready. + +During the backfill operations, you may see one of several states: +``backfill_wait`` indicates that a backfill operation is pending, but is not +underway yet; ``backfill`` indicates that a backfill operation is underway; +and, ``backfill_too_full`` indicates that a backfill operation was requested, +but couldn't be completed due to insufficient storage capacity. When a +placement group cannot be backfilled, it may be considered ``incomplete``. + +Ceph provides a number of settings to manage the load spike associated with +reassigning placement groups to an OSD (especially a new OSD). By default, +``osd_max_backfills`` sets the maximum number of concurrent backfills to or from +an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a +backfill request if the OSD is approaching its full ratio (90%, by default) and +change with ``ceph osd set-backfillfull-ratio`` comand. +If an OSD refuses a backfill request, the ``osd backfill retry interval`` +enables an OSD to retry the request (after 10 seconds, by default). OSDs can +also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan +intervals (64 and 512, by default). + + +Remapped +-------- + +When the Acting Set that services a placement group changes, the data migrates +from the old acting set to the new acting set. It may take some time for a new +primary OSD to service requests. So it may ask the old primary to continue to +service requests until the placement group migration is complete. Once data +migration completes, the mapping uses the primary OSD of the new acting set. + + +Stale +----- + +While Ceph uses heartbeats to ensure that hosts and daemons are running, the +``ceph-osd`` daemons may also get into a ``stuck`` state where they are not +reporting statistics in a timely manner (e.g., a temporary network fault). By +default, OSD daemons report their placement group, up thru, boot and failure +statistics every half second (i.e., ``0.5``), which is more frequent than the +heartbeat thresholds. If the **Primary OSD** of a placement group's acting set +fails to report to the monitor or if other OSDs have reported the primary OSD +``down``, the monitors will mark the placement group ``stale``. + +When you start your cluster, it is common to see the ``stale`` state until +the peering process completes. After your cluster has been running for awhile, +seeing placement groups in the ``stale`` state indicates that the primary OSD +for those placement groups is ``down`` or not reporting placement group statistics +to the monitor. + + +Identifying Troubled PGs +======================== + +As previously noted, a placement group is not necessarily problematic just +because its state is not ``active+clean``. 
Generally, Ceph's ability to self +repair may not be working when placement groups get stuck. The stuck states +include: + +- **Unclean**: Placement groups contain objects that are not replicated the + desired number of times. They should be recovering. +- **Inactive**: Placement groups cannot process reads or writes because they + are waiting for an OSD with the most up-to-date data to come back ``up``. +- **Stale**: Placement groups are in an unknown state, because the OSDs that + host them have not reported to the monitor cluster in a while (configured + by ``mon osd report timeout``). + +To identify stuck placement groups, execute the following:: + + ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] + +See `Placement Group Subsystem`_ for additional details. To troubleshoot +stuck placement groups, see `Troubleshooting PG Errors`_. + + +Finding an Object Location +========================== + +To store object data in the Ceph Object Store, a Ceph client must: + +#. Set an object name +#. Specify a `pool`_ + +The Ceph client retrieves the latest cluster map and the CRUSH algorithm +calculates how to map the object to a `placement group`_, and then calculates +how to assign the placement group to an OSD dynamically. To find the object +location, all you need is the object name and the pool name. For example:: + + ceph osd map {poolname} {object-name} + +.. topic:: Exercise: Locate an Object + + As an exercise, lets create an object. Specify an object name, a path to a + test file containing some object data and a pool name using the + ``rados put`` command on the command line. For example:: + + rados put {object-name} {file-path} --pool=data + rados put test-object-1 testfile.txt --pool=data + + To verify that the Ceph Object Store stored the object, execute the following:: + + rados -p data ls + + Now, identify the object location:: + + ceph osd map {pool-name} {object-name} + ceph osd map data test-object-1 + + Ceph should output the object's location. For example:: + + osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0] + + To remove the test object, simply delete it using the ``rados rm`` command. + For example:: + + rados rm test-object-1 --pool=data + + +As the cluster evolves, the object location may change dynamically. One benefit +of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform +the migration manually. See the `Architecture`_ section for details. + +.. _data placement: ../data-placement +.. _pool: ../pools +.. _placement group: ../placement-groups +.. _Architecture: ../../../architecture +.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running +.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors +.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering +.. _CRUSH map: ../crush-map +.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/ +.. _Placement Group Subsystem: ../control#placement-group-subsystem diff --git a/src/ceph/doc/rados/operations/monitoring.rst b/src/ceph/doc/rados/operations/monitoring.rst new file mode 100644 index 0000000..c291440 --- /dev/null +++ b/src/ceph/doc/rados/operations/monitoring.rst @@ -0,0 +1,351 @@ +====================== + Monitoring a Cluster +====================== + +Once you have a running cluster, you may use the ``ceph`` tool to monitor your +cluster. 
Monitoring a cluster typically involves checking OSD status, monitor +status, placement group status and metadata server status. + +Using the command line +====================== + +Interactive mode +---------------- + +To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line +with no arguments. For example:: + + ceph + ceph> health + ceph> status + ceph> quorum_status + ceph> mon_status + +Non-default paths +----------------- + +If you specified non-default locations for your configuration or keyring, +you may specify their locations:: + + ceph -c /path/to/conf -k /path/to/keyring health + +Checking a Cluster's Status +=========================== + +After you start your cluster, and before you start reading and/or +writing data, check your cluster's status first. + +To check a cluster's status, execute the following:: + + ceph status + +Or:: + + ceph -s + +In interactive mode, type ``status`` and press **Enter**. :: + + ceph> status + +Ceph will print the cluster status. For example, a tiny Ceph demonstration +cluster with one of each service may print the following: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 1 daemons, quorum a + mgr: x(active) + mds: 1/1/1 up {0=a=up:active} + osd: 1 osds: 1 up, 1 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2246 bytes + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + +.. topic:: How Ceph Calculates Data Usage + + The ``usage`` value reflects the *actual* amount of raw storage used. The + ``xxx GB / xxx GB`` value means the amount available (the lesser number) + of the overall storage capacity of the cluster. The notional number reflects + the size of the stored data before it is replicated, cloned or snapshotted. + Therefore, the amount of data actually stored typically exceeds the notional + amount stored, because Ceph creates replicas of the data and may also use + storage capacity for cloning and snapshotting. + + +Watching a Cluster +================== + +In addition to local logging by each daemon, Ceph clusters maintain +a *cluster log* that records high level events about the whole system. +This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by +default), but can also be monitored via the command line. + +To follow the cluster log, use the following command + +:: + + ceph -w + +Ceph will print the status of the system, followed by each log message as it +is emitted. For example: + +:: + + cluster: + id: 477e46f1-ae41-4e43-9c8f-72c918ab0a20 + health: HEALTH_OK + + services: + mon: 1 daemons, quorum a + mgr: x(active) + mds: 1/1/1 up {0=a=up:active} + osd: 1 osds: 1 up, 1 in + + data: + pools: 2 pools, 16 pgs + objects: 21 objects, 2246 bytes + usage: 546 GB used, 384 GB / 931 GB avail + pgs: 16 active+clean + + + 2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot + 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x + 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available + + +In addition to using ``ceph -w`` to print log lines as they are emitted, +use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster +log. + +Monitoring Health Checks +======================== + +Ceph continously runs various *health checks* against its own status. 
When +a health check fails, this is reflected in the output of ``ceph status`` (or +``ceph health``). In addition, messages are sent to the cluster log to +indicate when a check fails, and when the cluster recovers. + +For example, when an OSD goes down, the ``health`` section of the status +output may be updated as follows: + +:: + + health: HEALTH_WARN + 1 osds down + Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded + +At this time, cluster log messages are also emitted to record the failure of the +health checks: + +:: + + 2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) + 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) + +When the OSD comes back online, the cluster log records the cluster's return +to a health state: + +:: + + 2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED) + 2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized) + 2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy + + +Detecting configuration issues +============================== + +In addition to the health checks that Ceph continuously runs on its +own status, there are some configuration issues that may only be detected +by an external tool. + +Use the `ceph-medic`_ tool to run these additional checks on your Ceph +cluster's configuration. + +Checking a Cluster's Usage Stats +================================ + +To check a cluster's data usage and data distribution among pools, you can +use the ``df`` option. It is similar to Linux ``df``. Execute +the following:: + + ceph df + +The **GLOBAL** section of the output provides an overview of the amount of +storage your cluster uses for your data. + +- **SIZE:** The overall storage capacity of the cluster. +- **AVAIL:** The amount of free space available in the cluster. +- **RAW USED:** The amount of raw storage used. +- **% RAW USED:** The percentage of raw storage used. Use this number in + conjunction with the ``full ratio`` and ``near full ratio`` to ensure that + you are not reaching your cluster's capacity. See `Storage Capacity`_ for + additional details. + +The **POOLS** section of the output provides a list of pools and the notional +usage of each pool. The output from this section **DOES NOT** reflect replicas, +clones or snapshots. For example, if you store an object with 1MB of data, the +notional usage will be 1MB, but the actual usage may be 2MB or more depending +on the number of replicas, clones and snapshots. + +- **NAME:** The name of the pool. +- **ID:** The pool ID. +- **USED:** The notional amount of data stored in kilobytes, unless the number + appends **M** for megabytes or **G** for gigabytes. +- **%USED:** The notional percentage of storage used per pool. +- **MAX AVAIL:** An estimate of the notional amount of data that can be written + to this pool. +- **Objects:** The notional number of objects stored per pool. + +.. note:: The numbers in the **POOLS** section are notional. They are not + inclusive of the number of replicas, shapshots or clones. 
As a result, + the sum of the **USED** and **%USED** amounts will not add up to the + **RAW USED** and **%RAW USED** amounts in the **GLOBAL** section of the + output. + +.. note:: The **MAX AVAIL** value is a complicated function of the + replication or erasure code used, the CRUSH rule that maps storage + to devices, the utilization of those devices, and the configured + mon_osd_full_ratio. + + + +Checking OSD Status +=================== + +You can check OSDs to ensure they are ``up`` and ``in`` by executing:: + + ceph osd stat + +Or:: + + ceph osd dump + +You can also check view OSDs according to their position in the CRUSH map. :: + + ceph osd tree + +Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up +and their weight. :: + + # id weight type name up/down reweight + -1 3 pool default + -3 3 rack mainrack + -2 3 host osd-host + 0 1 osd.0 up 1 + 1 1 osd.1 up 1 + 2 1 osd.2 up 1 + +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +Checking Monitor Status +======================= + +If your cluster has multiple monitors (likely), you should check the monitor +quorum status after you start the cluster before reading and/or writing data. A +quorum must be present when multiple monitors are running. You should also check +monitor status periodically to ensure that they are running. + +To see display the monitor map, execute the following:: + + ceph mon stat + +Or:: + + ceph mon dump + +To check the quorum status for the monitor cluster, execute the following:: + + ceph quorum_status + +Ceph will return the quorum status. For example, a Ceph cluster consisting of +three monitors may return the following: + +.. code-block:: javascript + + { "election_epoch": 10, + "quorum": [ + 0, + 1, + 2], + "monmap": { "epoch": 1, + "fsid": "444b489c-4f16-4b75-83f0-cb8097468898", + "modified": "2011-12-12 13:28:27.505520", + "created": "2011-12-12 13:28:27.505520", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6791\/0"} + ] + } + } + +Checking MDS Status +=================== + +Metadata servers provide metadata services for Ceph FS. Metadata servers have +two sets of states: ``up | down`` and ``active | inactive``. To ensure your +metadata servers are ``up`` and ``active``, execute the following:: + + ceph mds stat + +To display details of the metadata cluster, execute the following:: + + ceph fs dump + + +Checking Placement Group States +=============================== + +Placement groups map objects to OSDs. When you monitor your +placement groups, you will want them to be ``active`` and ``clean``. +For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. + +.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg + + +Using the Admin Socket +====================== + +The Ceph admin socket allows you to query a daemon via a socket interface. +By default, Ceph sockets reside under ``/var/run/ceph``. 
To access a daemon +via the admin socket, login to the host running the daemon and use the +following command:: + + ceph daemon {daemon-name} + ceph daemon {path-to-socket-file} + +For example, the following are equivalent:: + + ceph daemon osd.0 foo + ceph daemon /var/run/ceph/ceph-osd.0.asok foo + +To view the available admin socket commands, execute the following command:: + + ceph daemon {daemon-name} help + +The admin socket command enables you to show and set your configuration at +runtime. See `Viewing a Configuration at Runtime`_ for details. + +Additionally, you can set configuration values at runtime directly (i.e., the +admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} +injectargs``, which relies on the monitor but doesn't require you to login +directly to the host in question ). + +.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#ceph-runtime-config +.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity +.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/ diff --git a/src/ceph/doc/rados/operations/operating.rst b/src/ceph/doc/rados/operations/operating.rst new file mode 100644 index 0000000..791941a --- /dev/null +++ b/src/ceph/doc/rados/operations/operating.rst @@ -0,0 +1,251 @@ +===================== + Operating a Cluster +===================== + +.. index:: systemd; operating a cluster + + +Running Ceph with systemd +========================== + +For all distributions that support systemd (CentOS 7, Fedora, Debian +Jessie 8 and later, SUSE), ceph daemons are now managed using native +systemd files instead of the legacy sysvinit scripts. For example:: + + sudo systemctl start ceph.target # start all daemons + sudo systemctl status ceph-osd@12 # check status of osd.12 + +To list the Ceph systemd units on a node, execute:: + + sudo systemctl status ceph\*.service ceph\*.target + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo systemctl start ceph.target + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo systemctl stop ceph\*.service ceph\*.target + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo systemctl start ceph-osd.target + sudo systemctl start ceph-mon.target + sudo systemctl start ceph-mds.target + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo systemctl stop ceph-mon\*.service ceph-mon.target + sudo systemctl stop ceph-osd\*.service ceph-osd.target + sudo systemctl stop ceph-mds\*.service ceph-mds.target + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo systemctl start ceph-osd@{id} + sudo systemctl start ceph-mon@{hostname} + sudo systemctl start ceph-mds@{hostname} + +For example:: + + sudo systemctl start ceph-osd@1 + sudo systemctl start ceph-mon@ceph-server + sudo systemctl start ceph-mds@ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo systemctl stop ceph-osd@{id} + sudo systemctl stop ceph-mon@{hostname} + sudo systemctl stop ceph-mds@{hostname} + +For example:: + + sudo systemctl stop ceph-osd@1 + 
sudo systemctl stop ceph-mon@ceph-server + sudo systemctl stop ceph-mds@ceph-server + + +.. index:: Ceph service; Upstart; operating a cluster + + + +Running Ceph with Upstart +========================= + +When deploying Ceph with ``ceph-deploy`` on Ubuntu Trusty, you may start and +stop Ceph daemons on a :term:`Ceph Node` using the event-based `Upstart`_. +Upstart does not require you to define daemon instances in the Ceph +configuration file. + +To list the Ceph Upstart jobs and instances on a node, execute:: + + sudo initctl list | grep ceph + +See `initctl`_ for additional details. + + +Starting all Daemons +-------------------- + +To start all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo start ceph-all + + +Stopping all Daemons +-------------------- + +To stop all daemons on a Ceph Node (irrespective of type), execute the +following:: + + sudo stop ceph-all + + +Starting all Daemons by Type +---------------------------- + +To start all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd-all + sudo start ceph-mon-all + sudo start ceph-mds-all + + +Stopping all Daemons by Type +---------------------------- + +To stop all daemons of a particular type on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd-all + sudo stop ceph-mon-all + sudo stop ceph-mds-all + + +Starting a Daemon +----------------- + +To start a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo start ceph-osd id={id} + sudo start ceph-mon id={hostname} + sudo start ceph-mds id={hostname} + +For example:: + + sudo start ceph-osd id=1 + sudo start ceph-mon id=ceph-server + sudo start ceph-mds id=ceph-server + + +Stopping a Daemon +----------------- + +To stop a specific daemon instance on a Ceph Node, execute one of the +following:: + + sudo stop ceph-osd id={id} + sudo stop ceph-mon id={hostname} + sudo stop ceph-mds id={hostname} + +For example:: + + sudo stop ceph-osd id=1 + sudo start ceph-mon id=ceph-server + sudo start ceph-mds id=ceph-server + + +.. index:: Ceph service; sysvinit; operating a cluster + + +Running Ceph +============ + +Each time you to **start**, **restart**, and **stop** Ceph daemons (or your +entire cluster) you must specify at least one option and one command. You may +also specify a daemon type or a daemon instance. :: + + {commandline} [options] [commands] [daemons] + + +The ``ceph`` options include: + ++-----------------+----------+-------------------------------------------------+ +| Option | Shortcut | Description | ++=================+==========+=================================================+ +| ``--verbose`` | ``-v`` | Use verbose logging. | ++-----------------+----------+-------------------------------------------------+ +| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | ++-----------------+----------+-------------------------------------------------+ +| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` | +| | | Otherwise, it only executes on ``localhost``. | ++-----------------+----------+-------------------------------------------------+ +| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. | ++-----------------+----------+-------------------------------------------------+ +| ``--conf`` | ``-c`` | Use an alternate configuration file. 
| ++-----------------+----------+-------------------------------------------------+ + +The ``ceph`` commands include: + ++------------------+------------------------------------------------------------+ +| Command | Description | ++==================+============================================================+ +| ``start`` | Start the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``stop`` | Stop the daemon(s). | ++------------------+------------------------------------------------------------+ +| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` | ++------------------+------------------------------------------------------------+ +| ``killall`` | Kill all daemons of a particular type. | ++------------------+------------------------------------------------------------+ +| ``cleanlogs`` | Cleans out the log directory. | ++------------------+------------------------------------------------------------+ +| ``cleanalllogs`` | Cleans out **everything** in the log directory. | ++------------------+------------------------------------------------------------+ + +For subsystem operations, the ``ceph`` service can target specific daemon types +by adding a particular daemon type for the ``[daemons]`` option. Daemon types +include: + +- ``mon`` +- ``osd`` +- ``mds`` + + + +.. _Valgrind: http://www.valgrind.org/ +.. _Upstart: http://upstart.ubuntu.com/index.html +.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html diff --git a/src/ceph/doc/rados/operations/pg-concepts.rst b/src/ceph/doc/rados/operations/pg-concepts.rst new file mode 100644 index 0000000..636d6bf --- /dev/null +++ b/src/ceph/doc/rados/operations/pg-concepts.rst @@ -0,0 +1,102 @@ +========================== + Placement Group Concepts +========================== + +When you execute commands like ``ceph -w``, ``ceph osd dump``, and other +commands related to placement groups, Ceph may return values using some +of the following terms: + +*Peering* + The process of bringing all of the OSDs that store + a Placement Group (PG) into agreement about the state + of all of the objects (and their metadata) in that PG. + Note that agreeing on the state does not mean that + they all have the latest contents. + +*Acting Set* + The ordered list of OSDs who are (or were as of some epoch) + responsible for a particular placement group. + +*Up Set* + The ordered list of OSDs responsible for a particular placement + group for a particular epoch according to CRUSH. Normally this + is the same as the *Acting Set*, except when the *Acting Set* has + been explicitly overridden via ``pg_temp`` in the OSD Map. + +*Current Interval* or *Past Interval* + A sequence of OSD map epochs during which the *Acting Set* and *Up + Set* for particular placement group do not change. + +*Primary* + The member (and by convention first) of the *Acting Set*, + that is responsible for coordination peering, and is + the only OSD that will accept client-initiated + writes to objects in a placement group. + +*Replica* + A non-primary OSD in the *Acting Set* for a placement group + (and who has been recognized as such and *activated* by the primary). + +*Stray* + An OSD that is not a member of the current *Acting Set*, but + has not yet been told that it can delete its copies of a + particular placement group. + +*Recovery* + Ensuring that copies of all of the objects in a placement group + are on all of the OSDs in the *Acting Set*. 
Once *Peering* has + been performed, the *Primary* can start accepting write operations, + and *Recovery* can proceed in the background. + +*PG Info* + Basic metadata about the placement group's creation epoch, the version + for the most recent write to the placement group, *last epoch started*, + *last epoch clean*, and the beginning of the *current interval*. Any + inter-OSD communication about placement groups includes the *PG Info*, + such that any OSD that knows a placement group exists (or once existed) + also has a lower bound on *last epoch clean* or *last epoch started*. + +*PG Log* + A list of recent updates made to objects in a placement group. + Note that these logs can be truncated after all OSDs + in the *Acting Set* have acknowledged up to a certain + point. + +*Missing Set* + Each OSD notes update log entries and if they imply updates to + the contents of an object, adds that object to a list of needed + updates. This list is called the *Missing Set* for that ``<OSD,PG>``. + +*Authoritative History* + A complete, and fully ordered set of operations that, if + performed, would bring an OSD's copy of a placement group + up to date. + +*Epoch* + A (monotonically increasing) OSD map version number + +*Last Epoch Start* + The last epoch at which all nodes in the *Acting Set* + for a particular placement group agreed on an + *Authoritative History*. At this point, *Peering* is + deemed to have been successful. + +*up_thru* + Before a *Primary* can successfully complete the *Peering* process, + it must inform a monitor that is alive through the current + OSD map *Epoch* by having the monitor set its *up_thru* in the osd + map. This helps *Peering* ignore previous *Acting Sets* for which + *Peering* never completed after certain sequences of failures, such as + the second interval below: + + - *acting set* = [A,B] + - *acting set* = [A] + - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection) + - *acting set* = [B] (B restarts, A does not) + +*Last Epoch Clean* + The last *Epoch* at which all nodes in the *Acting set* + for a particular placement group were completely + up to date (both placement group logs and object contents). + At this point, *recovery* is deemed to have been + completed. diff --git a/src/ceph/doc/rados/operations/pg-repair.rst b/src/ceph/doc/rados/operations/pg-repair.rst new file mode 100644 index 0000000..0d6692a --- /dev/null +++ b/src/ceph/doc/rados/operations/pg-repair.rst @@ -0,0 +1,4 @@ +Repairing PG inconsistencies +============================ + + diff --git a/src/ceph/doc/rados/operations/pg-states.rst b/src/ceph/doc/rados/operations/pg-states.rst new file mode 100644 index 0000000..0fbd3dc --- /dev/null +++ b/src/ceph/doc/rados/operations/pg-states.rst @@ -0,0 +1,80 @@ +======================== + Placement Group States +======================== + +When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``), +Ceph will report on the status of the placement groups. A placement group has +one or more states. The optimum state for placement groups in the placement group +map is ``active + clean``. + +*Creating* + Ceph is still creating the placement group. + +*Active* + Ceph will process requests to the placement group. + +*Clean* + Ceph replicated all objects in the placement group the correct number of times. + +*Down* + A replica with necessary data is down, so the placement group is offline. + +*Scrubbing* + Ceph is checking the placement group for inconsistencies. 
+ +*Degraded* + Ceph has not replicated some objects in the placement group the correct number of times yet. + +*Inconsistent* + Ceph detects inconsistencies in the one or more replicas of an object in the placement group + (e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.). + +*Peering* + The placement group is undergoing the peering process + +*Repair* + Ceph is checking the placement group and repairing any inconsistencies it finds (if possible). + +*Recovering* + Ceph is migrating/synchronizing objects and their replicas. + +*Forced-Recovery* + High recovery priority of that PG is enforced by user. + +*Backfill* + Ceph is scanning and synchronizing the entire contents of a placement group + instead of inferring what contents need to be synchronized from the logs of + recent operations. *Backfill* is a special case of recovery. + +*Forced-Backfill* + High backfill priority of that PG is enforced by user. + +*Wait-backfill* + The placement group is waiting in line to start backfill. + +*Backfill-toofull* + A backfill operation is waiting because the destination OSD is over its + full ratio. + +*Incomplete* + Ceph detects that a placement group is missing information about + writes that may have occurred, or does not have any healthy + copies. If you see this state, try to start any failed OSDs that may + contain the needed information. In the case of an erasure coded pool + temporarily reducing min_size may allow recovery. + +*Stale* + The placement group is in an unknown state - the monitors have not received + an update for it since the placement group mapping changed. + +*Remapped* + The placement group is temporarily mapped to a different set of OSDs from what + CRUSH specified. + +*Undersized* + The placement group fewer copies than the configured pool replication level. + +*Peered* + The placement group has peered, but cannot serve client IO due to not having + enough copies to reach the pool's configured min_size parameter. Recovery + may occur in this state, so the pg may heal up to min_size eventually. diff --git a/src/ceph/doc/rados/operations/placement-groups.rst b/src/ceph/doc/rados/operations/placement-groups.rst new file mode 100644 index 0000000..fee833a --- /dev/null +++ b/src/ceph/doc/rados/operations/placement-groups.rst @@ -0,0 +1,469 @@ +================== + Placement Groups +================== + +.. _preselection: + +A preselection of pg_num +======================== + +When creating a new pool with:: + + ceph osd pool create {pool-name} pg_num + +it is mandatory to choose the value of ``pg_num`` because it cannot be +calculated automatically. Here are a few values commonly used: + +- Less than 5 OSDs set ``pg_num`` to 128 + +- Between 5 and 10 OSDs set ``pg_num`` to 512 + +- Between 10 and 50 OSDs set ``pg_num`` to 1024 + +- If you have more than 50 OSDs, you need to understand the tradeoffs + and how to calculate the ``pg_num`` value by yourself + +- For calculating ``pg_num`` value by yourself please take help of `pgcalc`_ tool + +As the number of OSDs increases, chosing the right value for pg_num +becomes more important because it has a significant influence on the +behavior of the cluster as well as the durability of the data when +something goes wrong (i.e. the probability that a catastrophic event +leads to data loss). + +How are Placement Groups used ? 
+=============================== + +A placement group (PG) aggregates objects within a pool because +tracking object placement and object metadata on a per-object basis is +computationally expensive--i.e., a system with millions of objects +cannot realistically track placement on a per-object basis. + +.. ditaa:: + /-----\ /-----\ /-----\ /-----\ /-----\ + | obj | | obj | | obj | | obj | | obj | + \-----/ \-----/ \-----/ \-----/ \-----/ + | | | | | + +--------+--------+ +---+----+ + | | + v v + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | + +------------------------------+ + | + v + +-----------------------+ + | Pool | + | | + +-----------------------+ + +The Ceph client will calculate which placement group an object should +be in. It does this by hashing the object ID and applying an operation +based on the number of PGs in the defined pool and the ID of the pool. +See `Mapping PGs to OSDs`_ for details. + +The object's contents within a placement group are stored in a set of +OSDs. For instance, in a replicated pool of size two, each placement +group will store objects on two OSDs, as shown below. + +.. ditaa:: + + +-----------------------+ +-----------------------+ + | Placement Group #1 | | Placement Group #2 | + | | | | + +-----------------------+ +-----------------------+ + | | | | + v v v v + /----------\ /----------\ /----------\ /----------\ + | | | | | | | | + | OSD #1 | | OSD #2 | | OSD #2 | | OSD #3 | + | | | | | | | | + \----------/ \----------/ \----------/ \----------/ + + +Should OSD #2 fail, another will be assigned to Placement Group #1 and +will be filled with copies of all objects in OSD #1. If the pool size +is changed from two to three, an additional OSD will be assigned to +the placement group and will receive copies of all objects in the +placement group. + +Placement groups do not own the OSD, they share it with other +placement groups from the same pool or even other pools. If OSD #2 +fails, the Placement Group #2 will also have to restore copies of +objects, using OSD #3. + +When the number of placement groups increases, the new placement +groups will be assigned OSDs. The result of the CRUSH function will +also change and some objects from the former placement groups will be +copied over to the new Placement Groups and removed from the old ones. + +Placement Groups Tradeoffs +========================== + +Data durability and even distribution among all OSDs call for more +placement groups but their number should be reduced to the minimum to +save CPU and memory. + +.. _data durability: + +Data durability +--------------- + +After an OSD fails, the risk of data loss increases until the data it +contained is fully recovered. Let's imagine a scenario that causes +permanent data loss in a single placement group: + +- The OSD fails and all copies of the object it contains are lost. + For all objects within the placement group the number of replica + suddently drops from three to two. + +- Ceph starts recovery for this placement group by chosing a new OSD + to re-create the third copy of all objects. + +- Another OSD, within the same placement group, fails before the new + OSD is fully populated with the third copy. Some objects will then + only have one surviving copies. + +- Ceph picks yet another OSD and keeps copying objects to restore the + desired number of copies. 
- A third OSD, within the same placement group, fails before recovery
  is complete. If this OSD contained the only remaining copy of an
  object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three
replica pool, CRUSH will give each placement group three OSDs. In the
end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement
groups. When the first OSD fails, the above scenario will therefore
start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be
homogeneously spread over the 9 remaining OSDs. Each remaining OSD is
therefore likely to send copies of objects to all the others, and also
to receive some new objects to store because it has become part of a
new placement group.

The amount of time it takes for this recovery to complete entirely
depends on the architecture of the Ceph cluster. Let's say each OSD is
hosted by a 1TB SSD on a single machine, all of them are connected
to a 10Gb/s switch, and the recovery for a single OSD completes within
M minutes. If there are two OSDs per machine using spinners with no
SSD journal and a 1Gb/s switch, it will be at least an order of
magnitude slower.

In a cluster of this size, the number of placement groups has almost
no influence on data durability. It could be 128 or 8192 and the
recovery would be neither slower nor faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs
is likely to speed up recovery and therefore improve data durability
significantly. Each OSD now participates in only ~75 placement groups
instead of ~150 when there were only 10 OSDs, and it will still require
all 19 remaining OSDs to perform the same amount of object copying in
order to recover. But where 10 OSDs had to copy approximately 100GB
each, they now have to copy only 50GB each. If the network was the
bottleneck, recovery will happen twice as fast. In other words,
recovery goes faster as the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35
placement groups. If an OSD dies, recovery will keep getting faster
unless it is blocked by another bottleneck. However, if this cluster
grows to 200 OSDs, each of them will only host ~7 placement groups. If
an OSD dies, recovery will involve at most ~21 (7 * 3) OSDs
in these placement groups: recovery will take longer than when there
were 40 OSDs, meaning the number of placement groups should be
increased.

No matter how short the recovery time is, there is a chance for a
second OSD to fail while it is in progress. In the 10-OSD cluster
described above, if any one of them fails, then ~17 placement groups
(i.e. ~150 / 9 placement groups being recovered) will only have one
surviving copy. And if any of the 8 remaining OSDs then fails, the
objects of roughly two placement groups (i.e. ~17 / 8 placement groups
with only one remaining copy being recovered) are likely to be lost.

When the size of the cluster grows to 20 OSDs, the number of placement
groups damaged by the loss of three OSDs drops. The second OSD lost
will degrade ~4 placement groups (i.e. ~75 / 19 placement groups being
recovered) instead of ~17, and the third OSD lost will only lose data
if it is one of the four OSDs containing a surviving copy. In other
words, if the probability of losing one OSD during the recovery time
frame is 0.0001%, the probability of data loss goes from 17 * 10 *
0.0001% in the cluster with 10 OSDs to 4 * 20 * 0.0001% in the cluster
with 20 OSDs.
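If you want to see how many placement groups each OSD in your own cluster
actually hosts, rather than relying on the averages used in these examples,
you can query the cluster directly; recent releases report the per-OSD count
in the ``PGS`` column of the following command::

    ceph osd df tree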
+ +In a nutshell, more OSDs mean faster recovery and a lower risk of +cascading failures leading to the permanent loss of a Placement +Group. Having 512 or 4096 Placement Groups is roughly equivalent in a +cluster with less than 50 OSDs as far as data durability is concerned. + +Note: It may take a long time for a new OSD added to the cluster to be +populated with placement groups that were assigned to it. However +there is no degradation of any object and it has no impact on the +durability of the data contained in the Cluster. + +.. _object distribution: + +Object distribution within a pool +--------------------------------- + +Ideally objects are evenly distributed in each placement group. Since +CRUSH computes the placement group for each object, but does not +actually know how much data is stored in each OSD within this +placement group, the ratio between the number of placement groups and +the number of OSDs may influence the distribution of the data +significantly. + +For instance, if there was single a placement group for ten OSDs in a +three replica pool, only three OSD would be used because CRUSH would +have no other choice. When more placement groups are available, +objects are more likely to be evenly spread among them. CRUSH also +makes every effort to evenly spread OSDs among all existing Placement +Groups. + +As long as there are one or two orders of magnitude more Placement +Groups than OSDs, the distribution should be even. For instance, 300 +placement groups for 3 OSDs, 1000 placement groups for 10 OSDs etc. + +Uneven data distribution can be caused by factors other than the ratio +between OSDs and placement groups. Since CRUSH does not take into +account the size of the objects, a few very large objects may create +an imbalance. Let say one million 4K objects totaling 4GB are evenly +spread among 1000 placement groups on 10 OSDs. They will use 4GB / 10 += 400MB on each OSD. If one 400MB object is added to the pool, the +three OSDs supporting the placement group in which the object has been +placed will be filled with 400MB + 400MB = 800MB while the seven +others will remain occupied with only 400MB. + +.. _resource usage: + +Memory, CPU and network usage +----------------------------- + +For each placement group, OSDs and MONs need memory, network and CPU +at all times and even more during recovery. Sharing this overhead by +clustering objects within a placement group is one of the main reasons +they exist. + +Minimizing the number of placement groups saves significant amounts of +resources. + +Choosing the number of Placement Groups +======================================= + +If you have more than 50 OSDs, we recommend approximately 50-100 +placement groups per OSD to balance out resource usage, data +durability and distribution. If you have less than 50 OSDs, chosing +among the `preselection`_ above is best. For a single pool of objects, +you can use the following formula to get a baseline:: + + (OSDs * 100) + Total PGs = ------------ + pool size + +Where **pool size** is either the number of replicas for replicated +pools or the K+M sum for erasure coded pools (as returned by **ceph +osd erasure-code-profile get**). + +You should then check if the result makes sense with the way you +designed your Ceph cluster to maximize `data durability`_, +`object distribution`_ and minimize `resource usage`_. 
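If you want to sanity check an existing pool against this formula, you can
pull the inputs from the cluster itself (``{pool-name}`` is a placeholder;
for erasure coded pools, read K and M from the relevant profile rather than
``size``)::

    ceph osd stat
    ceph osd pool get {pool-name} size
    ceph osd erasure-code-profile get default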
+ +The result should be **rounded up to the nearest power of two.** +Rounding up is optional, but recommended for CRUSH to evenly balance +the number of objects among placement groups. + +As an example, for a cluster with 200 OSDs and a pool size of 3 +replicas, you would estimate your number of PGs as follows:: + + (200 * 100) + ----------- = 6667. Nearest power of 2: 8192 + 3 + +When using multiple data pools for storing objects, you need to ensure +that you balance the number of placement groups per pool with the +number of placement groups per OSD so that you arrive at a reasonable +total number of placement groups that provides reasonably low variance +per OSD without taxing system resources or making the peering process +too slow. + +For instance a cluster of 10 pools each with 512 placement groups on +ten OSDs is a total of 5,120 placement groups spread over ten OSDs, +that is 512 placement groups per OSD. That does not use too many +resources. However, if 1,000 pools were created with 512 placement +groups each, the OSDs will handle ~50,000 placement groups each and it +would require significantly more resources and time for peering. + +You may find the `PGCalc`_ tool helpful. + + +.. _setting the number of placement groups: + +Set the Number of Placement Groups +================================== + +To set the number of placement groups in a pool, you must specify the +number of placement groups at the time you create the pool. +See `Create a Pool`_ for details. Once you have set placement groups for a +pool, you may increase the number of placement groups (but you cannot +decrease the number of placement groups). To increase the number of +placement groups, execute the following:: + + ceph osd pool set {pool-name} pg_num {pg_num} + +Once you increase the number of placement groups, you must also +increase the number of placement groups for placement (``pgp_num``) +before your cluster will rebalance. The ``pgp_num`` will be the number of +placement groups that will be considered for placement by the CRUSH +algorithm. Increasing ``pg_num`` splits the placement groups but data +will not be migrated to the newer placement groups until placement +groups for placement, ie. ``pgp_num`` is increased. The ``pgp_num`` +should be equal to the ``pg_num``. To increase the number of +placement groups for placement, execute the following:: + + ceph osd pool set {pool-name} pgp_num {pgp_num} + + +Get the Number of Placement Groups +================================== + +To get the number of placement groups in a pool, execute the following:: + + ceph osd pool get {pool-name} pg_num + + +Get a Cluster's PG Statistics +============================= + +To get the statistics for the placement groups in your cluster, execute the following:: + + ceph pg dump [--format {format}] + +Valid formats are ``plain`` (default) and ``json``. + + +Get Statistics for Stuck PGs +============================ + +To get the statistics for all placement groups stuck in a specified state, +execute the following:: + + ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>] + +**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD +with the most up-to-date data to come up and in. + +**Unclean** Placement groups contain objects that are not replicated the desired number +of times. They should be recovering. 
+ +**Stale** Placement groups are in an unknown state - the OSDs that host them have not +reported to the monitor cluster in a while (configured by ``mon_osd_report_timeout``). + +Valid formats are ``plain`` (default) and ``json``. The threshold defines the minimum number +of seconds the placement group is stuck before including it in the returned statistics +(default 300 seconds). + + +Get a PG Map +============ + +To get the placement group map for a particular placement group, execute the following:: + + ceph pg map {pg-id} + +For example:: + + ceph pg map 1.6c + +Ceph will return the placement group map, the placement group, and the OSD status:: + + osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0] + + +Get a PGs Statistics +==================== + +To retrieve statistics for a particular placement group, execute the following:: + + ceph pg {pg-id} query + + +Scrub a Placement Group +======================= + +To scrub a placement group, execute the following:: + + ceph pg scrub {pg-id} + +Ceph checks the primary and any replica nodes, generates a catalog of all objects +in the placement group and compares them to ensure that no objects are missing +or mismatched, and their contents are consistent. Assuming the replicas all +match, a final semantic sweep ensures that all of the snapshot-related object +metadata is consistent. Errors are reported via logs. + +Prioritize backfill/recovery of a Placement Group(s) +==================================================== + +You may run into a situation where a bunch of placement groups will require +recovery and/or backfill, and some particular groups hold data more important +than others (for example, those PGs may hold data for images used by running +machines and other PGs may be used by inactive machines/less relevant data). +In that case, you may want to prioritize recovery of those groups so +performance and/or availability of data stored on those groups is restored +earlier. To do this (mark particular placement group(s) as prioritized during +backfill or recovery), execute the following:: + + ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +This will cause Ceph to perform recovery or backfill on specified placement +groups first, before other placement groups. This does not interrupt currently +ongoing backfills or recovery, but causes specified PGs to be processed +as soon as possible. If you change your mind or prioritize wrong groups, +use:: + + ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...] + ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...] + +This will remove "force" flag from those PGs and they will be processed +in default order. Again, this doesn't affect currently processed placement +group, only those that are still queued. + +The "force" flag is cleared automatically after recovery or backfill of group +is done. + +Revert Lost +=========== + +If the cluster has lost one or more objects, and you have decided to +abandon the search for the lost data, you must mark the unfound objects +as ``lost``. + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. 
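Before giving up on an object, you will usually want to confirm which
placement groups still report unfound objects and what those objects are.
For example (the exact output format varies between releases)::

    ceph health detail
    ceph pg {pg-id} list_unfound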
+ +Currently the only supported option is "revert", which will either roll back to +a previous version of the object or (if it was a new object) forget about it +entirely. To mark the "unfound" objects as "lost", execute the following:: + + ceph pg {pg-id} mark_unfound_lost revert|delete + +.. important:: Use this feature with caution, because it may confuse + applications that expect the object(s) to exist. + + +.. toctree:: + :hidden: + + pg-states + pg-concepts + + +.. _Create a Pool: ../pools#createpool +.. _Mapping PGs to OSDs: ../../../architecture#mapping-pgs-to-osds +.. _pgcalc: http://ceph.com/pgcalc/ diff --git a/src/ceph/doc/rados/operations/pools.rst b/src/ceph/doc/rados/operations/pools.rst new file mode 100644 index 0000000..7015593 --- /dev/null +++ b/src/ceph/doc/rados/operations/pools.rst @@ -0,0 +1,798 @@ +======= + Pools +======= + +When you first deploy a cluster without creating a pool, Ceph uses the default +pools for storing data. A pool provides you with: + +- **Resilience**: You can set how many OSD are allowed to fail without losing data. + For replicated pools, it is the desired number of copies/replicas of an object. + A typical configuration stores an object and one additional copy + (i.e., ``size = 2``), but you can determine the number of copies/replicas. + For `erasure coded pools <../erasure-code>`_, it is the number of coding chunks + (i.e. ``m=2`` in the **erasure code profile**) + +- **Placement Groups**: You can set the number of placement groups for the pool. + A typical configuration uses approximately 100 placement groups per OSD to + provide optimal balancing without using up too many computing resources. When + setting up multiple pools, be careful to ensure you set a reasonable number of + placement groups for both the pool and the cluster as a whole. + +- **CRUSH Rules**: When you store data in a pool, a CRUSH ruleset mapped to the + pool enables CRUSH to identify a rule for the placement of the object + and its replicas (or chunks for erasure coded pools) in your cluster. + You can create a custom CRUSH rule for your pool. + +- **Snapshots**: When you create snapshots with ``ceph osd pool mksnap``, + you effectively take a snapshot of a particular pool. + +To organize data into pools, you can list, create, and remove pools. +You can also view the utilization statistics for each pool. + +List Pools +========== + +To list your cluster's pools, execute:: + + ceph osd lspools + +On a freshly installed cluster, only the ``rbd`` pool exists. + + +.. _createpool: + +Create a Pool +============= + +Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_. +Ideally, you should override the default value for the number of placement +groups in your Ceph configuration file, as the default is NOT ideal. +For details on placement group numbers refer to `setting the number of placement groups`_ + +.. note:: Starting with Luminous, all pools need to be associated to the + application using the pool. See `Associate Pool to Application`_ below for + more information. + +For example:: + + osd pool default pg num = 100 + osd pool default pgp num = 100 + +To create a pool, execute:: + + ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] \ + [crush-rule-name] [expected-num-objects] + ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure \ + [erasure-code-profile] [crush-rule-name] [expected_num_objects] + +Where: + +``{pool-name}`` + +:Description: The name of the pool. It must be unique. +:Type: String +:Required: Yes. 
+ +``{pg-num}`` + +:Description: The total number of placement groups for the pool. See `Placement + Groups`_ for details on calculating a suitable number. The + default value ``8`` is NOT suitable for most systems. + +:Type: Integer +:Required: Yes. +:Default: 8 + +``{pgp-num}`` + +:Description: The total number of placement groups for placement purposes. This + **should be equal to the total number of placement groups**, except + for placement group splitting scenarios. + +:Type: Integer +:Required: Yes. Picks up default or Ceph configuration value if not specified. +:Default: 8 + +``{replicated|erasure}`` + +:Description: The pool type which may either be **replicated** to + recover from lost OSDs by keeping multiple copies of the + objects or **erasure** to get a kind of + `generalized RAID5 <../erasure-code>`_ capability. + The **replicated** pools require more + raw storage but implement all Ceph operations. The + **erasure** pools require less raw storage but only + implement a subset of the available operations. + +:Type: String +:Required: No. +:Default: replicated + +``[crush-rule-name]`` + +:Description: The name of a CRUSH rule to use for this pool. The specified + rule must exist. + +:Type: String +:Required: No. +:Default: For **replicated** pools it is the ruleset specified by the ``osd + pool default crush replicated ruleset`` config variable. This + ruleset must exist. + For **erasure** pools it is ``erasure-code`` if the ``default`` + `erasure code profile`_ is used or ``{pool-name}`` otherwise. This + ruleset will be created implicitly if it doesn't exist already. + + +``[erasure-code-profile=profile]`` + +.. _erasure code profile: ../erasure-code-profile + +:Description: For **erasure** pools only. Use the `erasure code profile`_. It + must be an existing profile as defined by + **osd erasure-code-profile set**. + +:Type: String +:Required: No. + +When you create a pool, set the number of placement groups to a reasonable value +(e.g., ``100``). Consider the total number of placement groups per OSD too. +Placement groups are computationally expensive, so performance will degrade when +you have many pools with many placement groups (e.g., 50 pools with 100 +placement groups each). The point of diminishing returns depends upon the power +of the OSD host. + +See `Placement Groups`_ for details on calculating an appropriate number of +placement groups for your pool. + +.. _Placement Groups: ../placement-groups + +``[expected-num-objects]`` + +:Description: The expected number of objects for this pool. By setting this value ( + together with a negative **filestore merge threshold**), the PG folder + splitting would happen at the pool creation time, to avoid the latency + impact to do a runtime folder splitting. + +:Type: Integer +:Required: No. +:Default: 0, no splitting at the pool creation time. + +Associate Pool to Application +============================= + +Pools need to be associated with an application before use. Pools that will be +used with CephFS or pools that are automatically created by RGW are +automatically associated. Pools that are intended for use with RBD should be +initialized using the ``rbd`` tool (see `Block Device Commands`_ for more +information). + +For other cases, you can manually associate a free-form application name to +a pool.:: + + ceph osd pool application enable {pool-name} {application-name} + +.. note:: CephFS uses the application name ``cephfs``, RBD uses the + application name ``rbd``, and RGW uses the application name ``rgw``. 
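For example, to tag a hypothetical pool named ``my-pool`` for use by RGW and
then verify the association, you could execute::

    ceph osd pool application enable my-pool rgw
    ceph osd pool application get my-pool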
+ +Set Pool Quotas +=============== + +You can set pool quotas for the maximum number of bytes and/or the maximum +number of objects per pool. :: + + ceph osd pool set-quota {pool-name} [max_objects {obj-count}] [max_bytes {bytes}] + +For example:: + + ceph osd pool set-quota data max_objects 10000 + +To remove a quota, set its value to ``0``. + + +Delete a Pool +============= + +To delete a pool, execute:: + + ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it] + + +To remove a pool the mon_allow_pool_delete flag must be set to true in the Monitor's +configuration. Otherwise they will refuse to remove a pool. + +See `Monitor Configuration`_ for more information. + +.. _Monitor Configuration: ../../configuration/mon-config-ref + +If you created your own rulesets and rules for a pool you created, you should +consider removing them when you no longer need your pool:: + + ceph osd pool get {pool-name} crush_ruleset + +If the ruleset was "123", for example, you can check the other pools like so:: + + ceph osd dump | grep "^pool" | grep "crush_ruleset 123" + +If no other pools use that custom ruleset, then it's safe to delete that +ruleset from the cluster. + +If you created users with permissions strictly for a pool that no longer +exists, you should consider deleting those users too:: + + ceph auth ls | grep -C 5 {pool-name} + ceph auth del {user} + + +Rename a Pool +============= + +To rename a pool, execute:: + + ceph osd pool rename {current-pool-name} {new-pool-name} + +If you rename a pool and you have per-pool capabilities for an authenticated +user, you must update the user's capabilities (i.e., caps) with the new pool +name. + +.. note:: Version ``0.48`` Argonaut and above. + +Show Pool Statistics +==================== + +To show a pool's utilization statistics, execute:: + + rados df + + +Make a Snapshot of a Pool +========================= + +To make a snapshot of a pool, execute:: + + ceph osd pool mksnap {pool-name} {snap-name} + +.. note:: Version ``0.48`` Argonaut and above. + + +Remove a Snapshot of a Pool +=========================== + +To remove a snapshot of a pool, execute:: + + ceph osd pool rmsnap {pool-name} {snap-name} + +.. note:: Version ``0.48`` Argonaut and above. + +.. _setpoolvalues: + + +Set Pool Values +=============== + +To set a value to a pool, execute the following:: + + ceph osd pool set {pool-name} {key} {value} + +You may set values for the following keys: + +.. _compression_algorithm: + +``compression_algorithm`` +:Description: Sets inline compression algorithm to use for underlying BlueStore. + This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression algorithm``. + +:Type: String +:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd`` + +``compression_mode`` + +:Description: Sets the policy for the inline compression algorithm for underlying BlueStore. + This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression mode``. + +:Type: String +:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force`` + +``compression_min_blob_size`` + +:Description: Chunks smaller than this are never compressed. + This setting overrides the `global setting <rados/configuration/bluestore-config-ref/#inline-compression>`_ of ``bluestore compression min blob *``. 
+ +:Type: Unsigned Integer + +``compression_max_blob_size`` + +:Description: Chunks larger than this are broken into smaller blobs sizing + ``compression_max_blob_size`` before being compressed. + +:Type: Unsigned Integer + +.. _size: + +``size`` + +:Description: Sets the number of replicas for objects in the pool. + See `Set the Number of Object Replicas`_ for further details. + Replicated pools only. + +:Type: Integer + +.. _min_size: + +``min_size`` + +:Description: Sets the minimum number of replicas required for I/O. + See `Set the Number of Object Replicas`_ for further details. + Replicated pools only. + +:Type: Integer +:Version: ``0.54`` and above + +.. _pg_num: + +``pg_num`` + +:Description: The effective number of placement groups to use when calculating + data placement. +:Type: Integer +:Valid Range: Superior to ``pg_num`` current value. + +.. _pgp_num: + +``pgp_num`` + +:Description: The effective number of placement groups for placement to use + when calculating data placement. + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + +.. _crush_ruleset: + +``crush_ruleset`` + +:Description: The ruleset to use for mapping object placement in the cluster. +:Type: Integer + +.. _allow_ec_overwrites: + +``allow_ec_overwrites`` + +:Description: Whether writes to an erasure coded pool can update part + of an object, so cephfs and rbd can use it. See + `Erasure Coding with Overwrites`_ for more details. +:Type: Boolean +:Version: ``12.2.0`` and above + +.. _hashpspool: + +``hashpspool`` + +:Description: Set/Unset HASHPSPOOL flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``0.48`` Argonaut and above. + +.. _nodelete: + +``nodelete`` + +:Description: Set/Unset NODELETE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nopgchange: + +``nopgchange`` + +:Description: Set/Unset NOPGCHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _nosizechange: + +``nosizechange`` + +:Description: Set/Unset NOSIZECHANGE flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag +:Version: Version ``FIXME`` + +.. _write_fadvise_dontneed: + +``write_fadvise_dontneed`` + +:Description: Set/Unset WRITE_FADVISE_DONTNEED flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _noscrub: + +``noscrub`` + +:Description: Set/Unset NOSCRUB flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _nodeep-scrub: + +``nodeep-scrub`` + +:Description: Set/Unset NODEEP_SCRUB flag on a given pool. +:Type: Integer +:Valid Range: 1 sets flag, 0 unsets flag + +.. _hit_set_type: + +``hit_set_type`` + +:Description: Enables hit set tracking for cache pools. + See `Bloom Filter`_ for additional information. + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` +:Default: ``bloom``. Other values are for testing. + +.. _hit_set_count: + +``hit_set_count`` + +:Description: The number of hit sets to store for cache pools. The higher + the number, the more RAM consumed by the ``ceph-osd`` daemon. + +:Type: Integer +:Valid Range: ``1``. Agent doesn't handle > 1 yet. + +.. _hit_set_period: + +``hit_set_period`` + +:Description: The duration of a hit set period in seconds for cache pools. + The higher the number, the more RAM consumed by the + ``ceph-osd`` daemon. + +:Type: Integer +:Example: ``3600`` 1hr + +.. 
_hit_set_fpp: + +``hit_set_fpp`` + +:Description: The false positive probability for the ``bloom`` hit set type. + See `Bloom Filter`_ for additional information. + +:Type: Double +:Valid Range: 0.0 - 1.0 +:Default: ``0.05`` + +.. _cache_target_dirty_ratio: + +``cache_target_dirty_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool. + +:Type: Double +:Default: ``.4`` + +.. _cache_target_dirty_high_ratio: + +``cache_target_dirty_high_ratio`` + +:Description: The percentage of the cache pool containing modified (dirty) + objects before the cache tiering agent will flush them to the + backing storage pool with a higher speed. + +:Type: Double +:Default: ``.6`` + +.. _cache_target_full_ratio: + +``cache_target_full_ratio`` + +:Description: The percentage of the cache pool containing unmodified (clean) + objects before the cache tiering agent will evict them from the + cache pool. + +:Type: Double +:Default: ``.8`` + +.. _target_max_bytes: + +``target_max_bytes`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_bytes`` threshold is triggered. + +:Type: Integer +:Example: ``1000000000000`` #1-TB + +.. _target_max_objects: + +``target_max_objects`` + +:Description: Ceph will begin flushing or evicting objects when the + ``max_objects`` threshold is triggered. + +:Type: Integer +:Example: ``1000000`` #1M objects + + +``hit_set_grade_decay_rate`` + +:Description: Temperature decay rate between two successive hit_sets +:Type: Integer +:Valid Range: 0 - 100 +:Default: ``20`` + + +``hit_set_search_last_n`` + +:Description: Count at most N appearance in hit_sets for temperature calculation +:Type: Integer +:Valid Range: 0 - hit_set_count +:Default: ``1`` + + +.. _cache_min_flush_age: + +``cache_min_flush_age`` + +:Description: The time (in seconds) before the cache tiering agent will flush + an object from the cache pool to the storage pool. + +:Type: Integer +:Example: ``600`` 10min + +.. _cache_min_evict_age: + +``cache_min_evict_age`` + +:Description: The time (in seconds) before the cache tiering agent will evict + an object from the cache pool. + +:Type: Integer +:Example: ``1800`` 30min + +.. _fast_read: + +``fast_read`` + +:Description: On Erasure Coding pool, if this flag is turned on, the read request + would issue sub reads to all shards, and waits until it receives enough + shards to decode to serve the client. In the case of jerasure and isa + erasure plugins, once the first K replies return, client's request is + served immediately using the data decoded from these replies. This + helps to tradeoff some resources for better performance. Currently this + flag is only supported for Erasure Coding pool. + +:Type: Boolean +:Defaults: ``0`` + +.. _scrub_min_interval: + +``scrub_min_interval`` + +:Description: The minimum interval in seconds for pool scrubbing when + load is low. If it is 0, the value osd_scrub_min_interval + from config is used. + +:Type: Double +:Default: ``0`` + +.. _scrub_max_interval: + +``scrub_max_interval`` + +:Description: The maximum interval in seconds for pool scrubbing + irrespective of cluster load. If it is 0, the value + osd_scrub_max_interval from config is used. + +:Type: Double +:Default: ``0`` + +.. _deep_scrub_interval: + +``deep_scrub_interval`` + +:Description: The interval in seconds for pool “deep” scrubbing. If it + is 0, the value osd_deep_scrub_interval from config is used. 
+ +:Type: Double +:Default: ``0`` + + +Get Pool Values +=============== + +To get a value from a pool, execute the following:: + + ceph osd pool get {pool-name} {key} + +You may get values for the following keys: + +``size`` + +:Description: see size_ + +:Type: Integer + +``min_size`` + +:Description: see min_size_ + +:Type: Integer +:Version: ``0.54`` and above + +``pg_num`` + +:Description: see pg_num_ + +:Type: Integer + + +``pgp_num`` + +:Description: see pgp_num_ + +:Type: Integer +:Valid Range: Equal to or less than ``pg_num``. + + +``crush_ruleset`` + +:Description: see crush_ruleset_ + + +``hit_set_type`` + +:Description: see hit_set_type_ + +:Type: String +:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object`` + +``hit_set_count`` + +:Description: see hit_set_count_ + +:Type: Integer + + +``hit_set_period`` + +:Description: see hit_set_period_ + +:Type: Integer + + +``hit_set_fpp`` + +:Description: see hit_set_fpp_ + +:Type: Double + + +``cache_target_dirty_ratio`` + +:Description: see cache_target_dirty_ratio_ + +:Type: Double + + +``cache_target_dirty_high_ratio`` + +:Description: see cache_target_dirty_high_ratio_ + +:Type: Double + + +``cache_target_full_ratio`` + +:Description: see cache_target_full_ratio_ + +:Type: Double + + +``target_max_bytes`` + +:Description: see target_max_bytes_ + +:Type: Integer + + +``target_max_objects`` + +:Description: see target_max_objects_ + +:Type: Integer + + +``cache_min_flush_age`` + +:Description: see cache_min_flush_age_ + +:Type: Integer + + +``cache_min_evict_age`` + +:Description: see cache_min_evict_age_ + +:Type: Integer + + +``fast_read`` + +:Description: see fast_read_ + +:Type: Boolean + + +``scrub_min_interval`` + +:Description: see scrub_min_interval_ + +:Type: Double + + +``scrub_max_interval`` + +:Description: see scrub_max_interval_ + +:Type: Double + + +``deep_scrub_interval`` + +:Description: see deep_scrub_interval_ + +:Type: Double + + +Set the Number of Object Replicas +================================= + +To set the number of object replicas on a replicated pool, execute the following:: + + ceph osd pool set {poolname} size {num-replicas} + +.. important:: The ``{num-replicas}`` includes the object itself. + If you want the object and two copies of the object for a total of + three instances of the object, specify ``3``. + +For example:: + + ceph osd pool set data size 3 + +You may execute this command for each pool. **Note:** An object might accept +I/Os in degraded mode with fewer than ``pool size`` replicas. To set a minimum +number of required replicas for I/O, you should use the ``min_size`` setting. +For example:: + + ceph osd pool set data min_size 2 + +This ensures that no object in the data pool will receive I/O with fewer than +``min_size`` replicas. + + +Get the Number of Object Replicas +================================= + +To get the number of object replicas, execute the following:: + + ceph osd dump | grep 'replicated size' + +Ceph will list the pools, with the ``replicated size`` attribute highlighted. +By default, ceph creates two replicas of an object (a total of three copies, or +a size of 3). + + + +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter +.. _setting the number of placement groups: ../placement-groups#set-the-number-of-placement-groups +.. _Erasure Coding with Overwrites: ../erasure-code#erasure-coding-with-overwrites +.. 
_Block Device Commands: ../../../rbd/rados-rbd-cmds/#create-a-block-device-pool + diff --git a/src/ceph/doc/rados/operations/upmap.rst b/src/ceph/doc/rados/operations/upmap.rst new file mode 100644 index 0000000..58f6322 --- /dev/null +++ b/src/ceph/doc/rados/operations/upmap.rst @@ -0,0 +1,75 @@ +Using the pg-upmap +================== + +Starting in Luminous v12.2.z there is a new *pg-upmap* exception table +in the OSDMap that allows the cluster to explicitly map specific PGs to +specific OSDs. This allows the cluster to fine-tune the data +distribution to, in most cases, perfectly distributed PGs across OSDs. + +The key caveat to this new mechanism is that it requires that all +clients understand the new *pg-upmap* structure in the OSDMap. + +Enabling +-------- + +To allow use of the feature, you must tell the cluster that it only +needs to support luminous (and newer) clients with:: + + ceph osd set-require-min-compat-client luminous + +This command will fail if any pre-luminous clients or daemons are +connected to the monitors. You can see what client versions are in +use with:: + + ceph features + +A word of caution +----------------- + +This is a new feature and not very user friendly. At the time of this +writing we are working on a new `balancer` module for ceph-mgr that +will eventually do all of this automatically. + +Until then, + +Offline optimization +-------------------- + +Upmap entries are updated with an offline optimizer built into ``osdmaptool``. + +#. Grab the latest copy of your osdmap:: + + ceph osd getmap -o om + +#. Run the optimizer:: + + osdmaptool om --upmap out.txt [--upmap-pool <pool>] [--upmap-max <max-count>] [--upmap-deviation <max-deviation>] + + It is highly recommended that optimization be done for each pool + individually, or for sets of similarly-utilized pools. You can + specify the ``--upmap-pool`` option multiple times. "Similar pools" + means pools that are mapped to the same devices and store the same + kind of data (e.g., RBD image pools, yes; RGW index pool and RGW + data pool, no). + + The ``max-count`` value is the maximum number of upmap entries to + identify in the run. The default is 100, but you may want to make + this a smaller number so that the tool completes more quickly (but + does less work). If it cannot find any additional changes to make + it will stop early (i.e., when the pool distribution is perfect). + + The ``max-deviation`` value defaults to `.01` (i.e., 1%). If an OSD + utilization varies from the average by less than this amount it + will be considered perfect. + +#. The proposed changes are written to the output file ``out.txt`` in + the example above. These are normal ceph CLI commands that can be + run to apply the changes to the cluster. This can be done with:: + + source out.txt + +The above steps can be repeated as many times as necessary to achieve +a perfect distribution of PGs for each set of pools. + +You can see some (gory) details about what the tool is doing by +passing ``--debug-osd 10`` to ``osdmaptool``. diff --git a/src/ceph/doc/rados/operations/user-management.rst b/src/ceph/doc/rados/operations/user-management.rst new file mode 100644 index 0000000..8a35a50 --- /dev/null +++ b/src/ceph/doc/rados/operations/user-management.rst @@ -0,0 +1,665 @@ +================= + User Management +================= + +This document describes :term:`Ceph Client` users, and their authentication and +authorization with the :term:`Ceph Storage Cluster`. 
Users are either +individuals or system actors such as applications, which use Ceph clients to +interact with the Ceph Storage Cluster daemons. + +.. ditaa:: +-----+ + | {o} | + | | + +--+--+ /---------\ /---------\ + | | Ceph | | Ceph | + ---+---*----->| |<------------->| | + | uses | Clients | | Servers | + | \---------/ \---------/ + /--+--\ + | | + | | + actor + + +When Ceph runs with authentication and authorization enabled (enabled by +default), you must specify a user name and a keyring containing the secret key +of the specified user (usually via the command line). If you do not specify a +user name, Ceph will use ``client.admin`` as the default user name. If you do +not specify a keyring, Ceph will look for a keyring via the ``keyring`` setting +in the Ceph configuration. For example, if you execute the ``ceph health`` +command without specifying a user or keyring:: + + ceph health + +Ceph interprets the command like this:: + + ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health + +Alternatively, you may use the ``CEPH_ARGS`` environment variable to avoid +re-entry of the user name and secret. + +For details on configuring the Ceph Storage Cluster to use authentication, +see `Cephx Config Reference`_. For details on the architecture of Cephx, see +`Architecture - High Availability Authentication`_. + + +Background +========== + +Irrespective of the type of Ceph client (e.g., Block Device, Object Storage, +Filesystem, native API, etc.), Ceph stores all data as objects within `pools`_. +Ceph users must have access to pools in order to read and write data. +Additionally, Ceph users must have execute permissions to use Ceph's +administrative commands. The following concepts will help you understand Ceph +user management. + + +User +---- + +A user is either an individual or a system actor such as an application. +Creating users allows you to control who (or what) can access your Ceph Storage +Cluster, its pools, and the data within pools. + +Ceph has the notion of a ``type`` of user. For the purposes of user management, +the type will always be ``client``. Ceph identifies users in period (.) +delimited form consisting of the user type and the user ID: for example, +``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing +is that Ceph Monitors, OSDs, and Metadata Servers also use the Cephx protocol, +but they are not clients. Distinguishing the user type helps to distinguish +between client users and other users--streamlining access control, user +monitoring and traceability. + +Sometimes Ceph's user type may seem confusing, because the Ceph command line +allows you to specify a user with or without the type, depending upon your +command line usage. If you specify ``--user`` or ``--id``, you can omit the +type. So ``client.user1`` can be entered simply as ``user1``. If you specify +``--name`` or ``-n``, you must specify the type and name, such as +``client.user1``. We recommend using the type and name as a best practice +wherever possible. + +.. note:: A Ceph Storage Cluster user is not the same as a Ceph Object Storage + user or a Ceph Filesystem user. The Ceph Object Gateway uses a Ceph Storage + Cluster user to communicate between the gateway daemon and the storage + cluster, but the gateway has its own user management functionality for end + users. The Ceph Filesystem uses POSIX semantics. The user space associated + with the Ceph Filesystem is not the same as a Ceph Storage Cluster user. 
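
As a minimal sketch of the ``TYPE.ID`` convention in practice, the following
Python snippet connects to the cluster as a specific client user via the
``rados`` module. The user ``client.user1`` and the keyring path are
hypothetical placeholders; substitute a user and keyring that actually exist
in your cluster.

.. code-block:: python

    import rados

    # Authenticate as a specific user (TYPE.ID); the keyring must contain that user's key.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          name='client.user1',
                          conf=dict(keyring='/etc/ceph/ceph.client.user1.keyring'))
    cluster.connect()
    try:
        # If authentication and authorization succeed, basic cluster queries work.
        print("Connected to cluster: %s" % cluster.get_fsid())
    finally:
        cluster.shutdown()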
+ + + +Authorization (Capabilities) +---------------------------- + +Ceph uses the term "capabilities" (caps) to describe authorizing an +authenticated user to exercise the functionality of the monitors, OSDs and +metadata servers. Capabilities can also restrict access to data within a pool or +a namespace within a pool. A Ceph administrative user sets a user's +capabilities when creating or updating a user. + +Capability syntax follows the form:: + + {daemon-type} '{capspec}[, {capspec} ...]' + +- **Monitor Caps:** Monitor capabilities include ``r``, ``w``, ``x`` access + settings or ``profile {name}``. For example:: + + mon 'allow rwx' + mon 'profile osd' + +- **OSD Caps:** OSD capabilities include ``r``, ``w``, ``x``, ``class-read``, + ``class-write`` access settings or ``profile {name}``. Additionally, OSD + capabilities also allow for pool and namespace settings. :: + + osd 'allow {access} [pool={pool-name} [namespace={namespace-name}]]' + osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]]' + +- **Metadata Server Caps:** For administrators, use ``allow *``. For all + other users, such as CephFS clients, consult :doc:`/cephfs/client-auth` + + +.. note:: The Ceph Object Gateway daemon (``radosgw``) is a client of the + Ceph Storage Cluster, so it is not represented as a Ceph Storage + Cluster daemon type. + +The following entries describe each capability. + +``allow`` + +:Description: Precedes access settings for a daemon. Implies ``rw`` + for MDS only. + + +``r`` + +:Description: Gives the user read access. Required with monitors to retrieve + the CRUSH map. + + +``w`` + +:Description: Gives the user write access to objects. + + +``x`` + +:Description: Gives the user the capability to call class methods + (i.e., both read and write) and to conduct ``auth`` + operations on monitors. + + +``class-read`` + +:Descriptions: Gives the user the capability to call class read methods. + Subset of ``x``. + + +``class-write`` + +:Description: Gives the user the capability to call class write methods. + Subset of ``x``. + + +``*`` + +:Description: Gives the user read, write and execute permissions for a + particular daemon/pool, and the ability to execute + admin commands. + + +``profile osd`` (Monitor only) + +:Description: Gives a user permissions to connect as an OSD to other OSDs or + monitors. Conferred on OSDs to enable OSDs to handle replication + heartbeat traffic and status reporting. + + +``profile mds`` (Monitor only) + +:Description: Gives a user permissions to connect as a MDS to other MDSs or + monitors. + + +``profile bootstrap-osd`` (Monitor only) + +:Description: Gives a user permissions to bootstrap an OSD. Conferred on + deployment tools such as ``ceph-disk``, ``ceph-deploy``, etc. + so that they have permissions to add keys, etc. when + bootstrapping an OSD. + + +``profile bootstrap-mds`` (Monitor only) + +:Description: Gives a user permissions to bootstrap a metadata server. + Conferred on deployment tools such as ``ceph-deploy``, etc. + so they have permissions to add keys, etc. when bootstrapping + a metadata server. + +``profile rbd`` (Monitor and OSD) + +:Description: Gives a user permissions to manipulate RBD images. When used + as a Monitor cap, it provides the minimal privileges required + by an RBD client application. When used as an OSD cap, it + provides read-write access to an RBD client application. + +``profile rbd-read-only`` (OSD only) + +:Description: Gives a user read-only permissions to an RBD image. 
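
To make the capability syntax concrete, here is a sketch of how these pieces
are typically combined when creating users (see `Add a User`_ below). The
user, pool and namespace names are hypothetical examples only::

    # Read-only on the monitors; read-write restricted to one pool and namespace
    ceph auth get-or-create client.app1 mon 'allow r' osd 'allow rw pool=app-data namespace=app1'

    # Minimal privileges for an RBD client application, using the rbd profiles
    ceph auth get-or-create client.rbd-user mon 'profile rbd' osd 'profile rbd pool=vms'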
+ + +Pool +---- + +A pool is a logical partition where users store data. +In Ceph deployments, it is common to create a pool as a logical partition for +similar types of data. For example, when deploying Ceph as a backend for +OpenStack, a typical deployment would have pools for volumes, images, backups +and virtual machines, and users such as ``client.glance``, ``client.cinder``, +etc. + + +Namespace +--------- + +Objects within a pool can be associated to a namespace--a logical group of +objects within the pool. A user's access to a pool can be associated with a +namespace such that reads and writes by the user take place only within the +namespace. Objects written to a namespace within the pool can only be accessed +by users who have access to the namespace. + +.. note:: Namespaces are primarily useful for applications written on top of + ``librados`` where the logical grouping can alleviate the need to create + different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various + metadata objects. + +The rationale for namespaces is that pools can be a computationally expensive +method of segregating data sets for the purposes of authorizing separate sets +of users. For example, a pool should have ~100 placement groups per OSD. So an +exemplary cluster with 1000 OSDs would have 100,000 placement groups for one +pool. Each pool would create another 100,000 placement groups in the exemplary +cluster. By contrast, writing an object to a namespace simply associates the +namespace to the object name with out the computational overhead of a separate +pool. Rather than creating a separate pool for a user or set of users, you may +use a namespace. **Note:** Only available using ``librados`` at this time. + + +Managing Users +============== + +User management functionality provides Ceph Storage Cluster administrators with +the ability to create, update and delete users directly in the Ceph Storage +Cluster. + +When you create or delete users in the Ceph Storage Cluster, you may need to +distribute keys to clients so that they can be added to keyrings. See `Keyring +Management`_ for details. + + +List Users +---------- + +To list the users in your cluster, execute the following:: + + ceph auth ls + +Ceph will list out all users in your cluster. For example, in a two-node +exemplary cluster, ``ceph auth ls`` will output something that looks like +this:: + + installed auth entries: + + osd.0 + key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w== + caps: [mon] allow profile osd + caps: [osd] allow * + osd.1 + key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA== + caps: [mon] allow profile osd + caps: [osd] allow * + client.admin + key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw== + caps: [mds] allow + caps: [mon] allow * + caps: [osd] allow * + client.bootstrap-mds + key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww== + caps: [mon] allow profile bootstrap-mds + client.bootstrap-osd + key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw== + caps: [mon] allow profile bootstrap-osd + + +Note that the ``TYPE.ID`` notation for users applies such that ``osd.0`` is a +user of type ``osd`` and its ID is ``0``, ``client.admin`` is a user of type +``client`` and its ID is ``admin`` (i.e., the default ``client.admin`` user). +Note also that each entry has a ``key: <value>`` entry, and one or more +``caps:`` entries. + +You may use the ``-o {filename}`` option with ``ceph auth ls`` to +save the output to a file. 
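For example, to keep a copy of the listing under ``/tmp`` (an arbitrary path
chosen here purely for illustration)::

    ceph auth ls -o /tmp/auth.export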
+ + +Get a User +---------- + +To retrieve a specific user, key and capabilities, execute the +following:: + + ceph auth get {TYPE.ID} + +For example:: + + ceph auth get client.admin + +You may also use the ``-o {filename}`` option with ``ceph auth get`` to +save the output to a file. Developers may also execute the following:: + + ceph auth export {TYPE.ID} + +The ``auth export`` command is identical to ``auth get``, but also prints +out the internal ``auid``, which is not relevant to end users. + + + +Add a User +---------- + +Adding a user creates a username (i.e., ``TYPE.ID``), a secret key and +any capabilities included in the command you use to create the user. + +A user's key enables the user to authenticate with the Ceph Storage Cluster. +The user's capabilities authorize the user to read, write, or execute on Ceph +monitors (``mon``), Ceph OSDs (``osd``) or Ceph Metadata Servers (``mds``). + +There are a few ways to add a user: + +- ``ceph auth add``: This command is the canonical way to add a user. It + will create the user, generate a key and add any specified capabilities. + +- ``ceph auth get-or-create``: This command is often the most convenient way + to create a user, because it returns a keyfile format with the user name + (in brackets) and the key. If the user already exists, this command + simply returns the user name and key in the keyfile format. You may use the + ``-o {filename}`` option to save the output to a file. + +- ``ceph auth get-or-create-key``: This command is a convenient way to create + a user and return the user's key (only). This is useful for clients that + need the key only (e.g., libvirt). If the user already exists, this command + simply returns the key. You may use the ``-o {filename}`` option to save the + output to a file. + +When creating client users, you may create a user with no capabilities. A user +with no capabilities is useless beyond mere authentication, because the client +cannot retrieve the cluster map from the monitor. However, you can create a +user with no capabilities if you wish to defer adding capabilities later using +the ``ceph auth caps`` command. + +A typical user has at least read capabilities on the Ceph monitor and +read and write capability on Ceph OSDs. Additionally, a user's OSD permissions +are often restricted to accessing a particular pool. :: + + ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=liverpool' -o george.keyring + ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=liverpool' -o ringo.key + + +.. important:: If you provide a user with capabilities to OSDs, but you DO NOT + restrict access to particular pools, the user will have access to ALL + pools in the cluster! + + +.. _modify-user-capabilities: + +Modify User Capabilities +------------------------ + +The ``ceph auth caps`` command allows you to specify a user and change the +user's capabilities. Setting new capabilities will overwrite current capabilities. +To view current capabilities run ``ceph auth get USERTYPE.USERID``. To add +capabilities, you should also specify the existing capabilities when using the form:: + + ceph auth caps USERTYPE.USERID {daemon} 'allow [r|w|x|*|...] [pool={pool-name}] [namespace={namespace-name}]' [{daemon} 'allow [r|w|x|*|...] 
[pool={pool-name}] [namespace={namespace-name}]'] + +For example:: + + ceph auth get client.john + ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool' + ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=liverpool' + ceph auth caps client.brian-manager mon 'allow *' osd 'allow *' + +To remove a capability, you may reset the capability. If you want the user +to have no access to a particular daemon that was previously set, specify +an empty string. For example:: + + ceph auth caps client.ringo mon ' ' osd ' ' + +See `Authorization (Capabilities)`_ for additional details on capabilities. + + +Delete a User +------------- + +To delete a user, use ``ceph auth del``:: + + ceph auth del {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. + + +Print a User's Key +------------------ + +To print a user's authentication key to standard output, execute the following:: + + ceph auth print-key {TYPE}.{ID} + +Where ``{TYPE}`` is one of ``client``, ``osd``, ``mon``, or ``mds``, +and ``{ID}`` is the user name or ID of the daemon. + +Printing a user's key is useful when you need to populate client +software with a user's key (e.g., libvirt). :: + + mount -t ceph serverhost:/ mountpoint -o name=client.user,secret=`ceph auth print-key client.user` + + +Import a User(s) +---------------- + +To import one or more users, use ``ceph auth import`` and +specify a keyring:: + + ceph auth import -i /path/to/keyring + +For example:: + + sudo ceph auth import -i /etc/ceph/ceph.keyring + + +.. note:: The ceph storage cluster will add new users, their keys and their + capabilities and will update existing users, their keys and their + capabilities. + + +Keyring Management +================== + +When you access Ceph via a Ceph client, the Ceph client will look for a local +keyring. Ceph presets the ``keyring`` setting with the following four keyring +names by default so you don't have to set them in your Ceph configuration file +unless you want to override the defaults (not recommended): + +- ``/etc/ceph/$cluster.$name.keyring`` +- ``/etc/ceph/$cluster.keyring`` +- ``/etc/ceph/keyring`` +- ``/etc/ceph/keyring.bin`` + +The ``$cluster`` metavariable is your Ceph cluster name as defined by the +name of the Ceph configuration file (i.e., ``ceph.conf`` means the cluster name +is ``ceph``; thus, ``ceph.keyring``). The ``$name`` metavariable is the user +type and user ID (e.g., ``client.admin``; thus, ``ceph.client.admin.keyring``). + +.. note:: When executing commands that read or write to ``/etc/ceph``, you may + need to use ``sudo`` to execute the command as ``root``. + +After you create a user (e.g., ``client.ringo``), you must get the key and add +it to a keyring on a Ceph client so that the user can access the Ceph Storage +Cluster. + +The `User Management`_ section details how to list, get, add, modify and delete +users directly in the Ceph Storage Cluster. However, Ceph also provides the +``ceph-authtool`` utility to allow you to manage keyrings from a Ceph client. + + +Create a Keyring +---------------- + +When you use the procedures in the `Managing Users`_ section to create users, +you need to provide user keys to the Ceph client(s) so that the Ceph client +can retrieve the key for the specified user and authenticate with the Ceph +Storage Cluster. Ceph Clients access keyrings to lookup a user name and +retrieve the user's key. + +The ``ceph-authtool`` utility allows you to create a keyring. 
To create an +empty keyring, use ``--create-keyring`` or ``-C``. For example:: + + ceph-authtool --create-keyring /path/to/keyring + +When creating a keyring with multiple users, we recommend using the cluster name +(e.g., ``$cluster.keyring``) for the keyring filename and saving it in the +``/etc/ceph`` directory so that the ``keyring`` configuration default setting +will pick up the filename without requiring you to specify it in the local copy +of your Ceph configuration file. For example, create ``ceph.keyring`` by +executing the following:: + + sudo ceph-authtool -C /etc/ceph/ceph.keyring + +When creating a keyring with a single user, we recommend using the cluster name, +the user type and the user name and saving it in the ``/etc/ceph`` directory. +For example, ``ceph.client.admin.keyring`` for the ``client.admin`` user. + +To create a keyring in ``/etc/ceph``, you must do so as ``root``. This means +the file will have ``rw`` permissions for the ``root`` user only, which is +appropriate when the keyring contains administrator keys. However, if you +intend to use the keyring for a particular user or group of users, ensure +that you execute ``chown`` or ``chmod`` to establish appropriate keyring +ownership and access. + + +Add a User to a Keyring +----------------------- + +When you `Add a User`_ to the Ceph Storage Cluster, you can use the `Get a +User`_ procedure to retrieve a user, key and capabilities and save the user to a +keyring. + +When you only want to use one user per keyring, the `Get a User`_ procedure with +the ``-o`` option will save the output in the keyring file format. For example, +to create a keyring for the ``client.admin`` user, execute the following:: + + sudo ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring + +Notice that we use the recommended file format for an individual user. + +When you want to import users to a keyring, you can use ``ceph-authtool`` +to specify the destination keyring and the source keyring. +For example:: + + sudo ceph-authtool /etc/ceph/ceph.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring + + +Create a User +------------- + +Ceph provides the `Add a User`_ function to create a user directly in the Ceph +Storage Cluster. However, you can also create a user, keys and capabilities +directly on a Ceph client keyring. Then, you can import the user to the Ceph +Storage Cluster. For example:: + + sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.keyring + +See `Authorization (Capabilities)`_ for additional details on capabilities. + +You can also create a keyring and add a new user to the keyring simultaneously. +For example:: + + sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key + +In the foregoing scenarios, the new user ``client.ringo`` is only in the +keyring. To add the new user to the Ceph Storage Cluster, you must still add +the new user to the Ceph Storage Cluster. :: + + sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring + + +Modify a User +------------- + +To modify the capabilities of a user record in a keyring, specify the keyring, +and the user followed by the capabilities. For example:: + + sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo --cap osd 'allow rwx' --cap mon 'allow rwx' + +To update the user to the Ceph Storage Cluster, you must update the user +in the keyring to the user entry in the the Ceph Storage Cluster. 
:: + + sudo ceph auth import -i /etc/ceph/ceph.keyring + +See `Import a User(s)`_ for details on updating a Ceph Storage Cluster user +from a keyring. + +You may also `Modify User Capabilities`_ directly in the cluster, store the +results to a keyring file; then, import the keyring into your main +``ceph.keyring`` file. + + +Command Line Usage +================== + +Ceph supports the following usage for user name and secret: + +``--id`` | ``--user`` + +:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or + ``client.admin``, ``client.user1``). The ``id``, ``name`` and + ``-n`` options enable you to specify the ID portion of the user + name (e.g., ``admin``, ``user1``, ``foo``, etc.). You can specify + the user with the ``--id`` and omit the type. For example, + to specify user ``client.foo`` enter the following:: + + ceph --id foo --keyring /path/to/keyring health + ceph --user foo --keyring /path/to/keyring health + + +``--name`` | ``-n`` + +:Description: Ceph identifies users with a type and an ID (e.g., ``TYPE.ID`` or + ``client.admin``, ``client.user1``). The ``--name`` and ``-n`` + options enables you to specify the fully qualified user name. + You must specify the user type (typically ``client``) with the + user ID. For example:: + + ceph --name client.foo --keyring /path/to/keyring health + ceph -n client.foo --keyring /path/to/keyring health + + +``--keyring`` + +:Description: The path to the keyring containing one or more user name and + secret. The ``--secret`` option provides the same functionality, + but it does not work with Ceph RADOS Gateway, which uses + ``--secret`` for another purpose. You may retrieve a keyring with + ``ceph auth get-or-create`` and store it locally. This is a + preferred approach, because you can switch user names without + switching the keyring path. For example:: + + sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage + + +.. _pools: ../pools + + +Limitations +=========== + +The ``cephx`` protocol authenticates Ceph clients and servers to each other. It +is not intended to handle authentication of human users or application programs +run on their behalf. If that effect is required to handle your access control +needs, you must have another mechanism, which is likely to be specific to the +front end used to access the Ceph object store. This other mechanism has the +role of ensuring that only acceptable users and programs are able to run on the +machine that Ceph will permit to access its object store. + +The keys used to authenticate Ceph clients and servers are typically stored in +a plain text file with appropriate permissions in a trusted host. + +.. important:: Storing keys in plaintext files has security shortcomings, but + they are difficult to avoid, given the basic authentication methods Ceph + uses in the background. Those setting up Ceph systems should be aware of + these shortcomings. + +In particular, arbitrary user machines, especially portable machines, should not +be configured to interact directly with Ceph, since that mode of use would +require the storage of a plaintext authentication key on an insecure machine. +Anyone who stole that machine or obtained surreptitious access to it could +obtain the key that will allow them to authenticate their own machines to Ceph. 
+ +Rather than permitting potentially insecure machines to access a Ceph object +store directly, users should be required to sign in to a trusted machine in +your environment using a method that provides sufficient security for your +purposes. That trusted machine will store the plaintext Ceph keys for the +human users. A future version of Ceph may address these particular +authentication issues more fully. + +At the moment, none of the Ceph authentication protocols provide secrecy for +messages in transit. Thus, an eavesdropper on the wire can hear and understand +all data sent between clients and servers in Ceph, even if it cannot create or +alter them. Further, Ceph does not include options to encrypt user data in the +object store. Users can hand-encrypt and store their own data in the Ceph +object store, of course, but Ceph provides no features to perform object +encryption itself. Those storing sensitive data in Ceph should consider +encrypting their data before providing it to the Ceph system. + + +.. _Architecture - High Availability Authentication: ../../../architecture#high-availability-authentication +.. _Cephx Config Reference: ../../configuration/auth-config-ref diff --git a/src/ceph/doc/rados/troubleshooting/community.rst b/src/ceph/doc/rados/troubleshooting/community.rst new file mode 100644 index 0000000..9faad13 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/community.rst @@ -0,0 +1,29 @@ +==================== + The Ceph Community +==================== + +The Ceph community is an excellent source of information and help. For +operational issues with Ceph releases we recommend you `subscribe to the +ceph-users email list`_. When you no longer want to receive emails, you can +`unsubscribe from the ceph-users email list`_. + +You may also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +may `unsubscribe from the ceph-devel email list`_. + +.. tip:: The Ceph community is growing rapidly, and community members can help + you if you provide them with detailed information about your problem. You + can attach the output of the ``ceph report`` command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file diff --git a/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 0000000..159f799 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,67 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_ +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. + + +Initializing oprofile +===================== + +The first time you use ``oprofile`` you need to initialize it. Locate the +``vmlinux`` image corresponding to the kernel you are now running. :: + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +To start ``oprofile`` execute the following command:: + + opcontrol --start + +Once you start ``oprofile``, you may run some tests with Ceph. + + +Stopping oprofile +================= + +To stop ``oprofile`` execute the following command:: + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +To retrieve the top ``cmon`` results, execute the following command:: + + opreport -gal ./cmon | less + + +To retrieve the top ``cmon`` results with call graphs attached, execute the +following command:: + + opreport -cal ./cmon | less + +.. important:: After reviewing results, you should reset ``oprofile`` before + running it again. Resetting ``oprofile`` removes data from the session + directory. + + +Resetting oprofile +================== + +To reset ``oprofile``, execute the following command:: + + sudo opcontrol --reset + +.. important:: You should reset ``oprofile`` after analyzing data so that + you do not commingle results from different tests. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. _Installing Oprofile: ../../../dev/cpu-profiler diff --git a/src/ceph/doc/rados/troubleshooting/index.rst b/src/ceph/doc/rados/troubleshooting/index.rst new file mode 100644 index 0000000..80d14f3 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +Ceph is still on the leading edge, so you may encounter situations that require +you to examine your configuration, modify your logging output, troubleshoot +monitors and OSDs, profile memory and CPU usage, and reach out to the +Ceph community for help. + +.. toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/src/ceph/doc/rados/troubleshooting/log-and-debug.rst b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 0000000..c91f272 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,550 @@ +======================= + Logging and Debugging +======================= + +Typically, when you add debugging to your Ceph configuration, you do so at +runtime. You can also add Ceph debug logging to your Ceph configuration file if +you are encountering issues when starting your cluster. You may view Ceph log +files under ``/var/log/ceph`` (the default location). + +.. tip:: When debug output slows down your system, the latency can hide + race conditions. + +Logging is resource intensive. If you are encountering a problem in a specific +area of your cluster, enable logging for that area of the cluster. 
For example, +if your OSDs are running fine, but your metadata servers are not, you should +start by enabling debug logging for the specific metadata server instance(s) +giving you trouble. Enable logging for each subsystem as needed. + +.. important:: Verbose logging can generate over 1GB of data per hour. If your + OS disk reaches its capacity, the node will stop working. + +If you enable or increase the rate of Ceph logging, ensure that you have +sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for +details on rotating log files. When your system is running well, remove +unnecessary debugging settings to ensure your cluster runs optimally. Logging +debug output messages is relatively slow, and a waste of resources when +operating your cluster. + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + +Runtime +======= + +If you would like to see the configuration settings at runtime, you must log +in to a host with a running daemon and execute the following:: + + ceph daemon {daemon-name} config show | less + +For example,:: + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the +``ceph tell`` command to inject arguments into the runtime configuration:: + + ceph tell {daemon-type}.{daemon id or *} injectargs --{name} {value} [--{name} {value}] + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID. For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 injectargs --debug-osd 0/5 + +The ``ceph tell`` command goes through the monitors. If you cannot bind to the +monitor, you can still make the change by logging into the host of the daemon +whose configuration you'd like to change using ``ceph daemon``. +For example:: + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + + +Boot Time +========= + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must +add settings to your Ceph configuration file. Subsystems common to each daemon +may be set under ``[global]`` in your configuration file. Subsystems for +particular daemons are set under the daemon section in your configuration file +(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: + + [global] + debug ms = 1/5 + + [mon] + debug mon = 20 + debug paxos = 1/5 + debug auth = 2 + + [osd] + debug osd = 1/5 + debug filestore = 1/5 + debug journal = 1 + debug monc = 5/20 + + [mds] + debug mds = 1 + debug mds balancer = 1 + + +See `Subsystem, Log and Debug Settings`_ for details. + + +Accelerating Log Rotation +========================= + +If your OS disk is relatively full, you can accelerate log rotation by modifying +the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting +after the rotation frequency to accelerate log rotation (via cronjob) if your +logs exceed the size setting. For example, the default setting looks like +this:: + + rotate 7 + weekly + compress + sharedscripts + +Modify it by adding a ``size`` setting. :: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +Then, start the crontab editor for your user space. :: + + crontab -e + +Finally, add an entry to check the ``etc/logrotate.d/ceph`` file. 
:: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +The preceding example checks the ``etc/logrotate.d/ceph`` file every 30 minutes. + + +Valgrind +======== + +Debugging may also require you to track down memory and threading issues. +You can run a single daemon, a type of daemon, or the whole cluster with +Valgrind. You should only use Valgrind when developing or debugging Ceph. +Valgrind is computationally expensive, and will slow down your system otherwise. +Valgrind messages are logged to ``stderr``. + + +Subsystem, Log and Debug Settings +================================= + +In most cases, you will enable debug logging output via subsystems. + +Ceph Subsystems +--------------- + +Each subsystem has a logging level for its output logs, and for its logs +in-memory. You may set different values for each of these subsystems by setting +a log file level and a memory level for debug logging. Ceph's logging levels +operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is +verbose [#]_ . In general, the logs in-memory are not sent to the output log unless: + +- a fatal signal is raised or +- an ``assert`` in source code is triggered or +- upon requested. Please consult `document on admin socket <http://docs.ceph.com/docs/master/man/8/ceph/#daemon>`_ for more details. + +A debug logging setting can take a single value for the log level and the +memory level, which sets them both as the same value. For example, if you +specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level +of ``5``. You may also specify them separately. The first setting is the log +level, and the second setting is the memory level. You must separate them with +a forward slash (/). For example, if you want to set the ``ms`` subsystem's +debug logging level to ``1`` and its memory level to ``5``, you would specify it +as ``debug ms = 1/5``. For example: + + + +.. code-block:: ini + + debug {subsystem} = {log-level}/{memory-level} + #for example + debug mds balancer = 1/20 + + +The following table provides a list of Ceph subsystems and their default log and +memory levels. Once you complete your logging efforts, restore the subsystems +to their default level or to a level suitable for normal operations. 
+ + ++--------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++====================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``lockdep`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``context`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``crush`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds locker`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``buffer`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``timer`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filer`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objecter`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``osd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objclass`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filestore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``journal`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``monc`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``paxos`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``finisher`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``throttle`` | 1 | 5 | ++--------------------+-----------+--------------+ + + +Logging Settings +---------------- + +Logging and debugging settings are not required in a Ceph configuration file, +but you may override default settings as needed. Ceph supports the following +settings: + + +``log file`` + +:Description: The location of the logging file for your cluster. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster-$name.log`` + + +``log max new`` + +:Description: The maximum number of new log files. 
+:Type: Integer +:Required: No +:Default: ``1000`` + + +``log max recent`` + +:Description: The maximum number of recent events to include in a log file. +:Type: Integer +:Required: No +:Default: ``1000000`` + + +``log to stderr`` + +:Description: Determines if logging messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``err to stderr`` + +:Description: Determines if error messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to syslog`` + +:Description: Determines if logging messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to syslog`` + +:Description: Determines if error messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``log flush on exit`` + +:Description: Determines if Ceph should flush the log files after exit. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to monitors`` + +:Description: Determines if ``clog`` messages should be sent to monitors. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to syslog`` + +:Description: Determines if ``clog`` messages should be sent to syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log to syslog`` + +:Description: Determines if the cluster log should be output to the syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log file`` + +:Description: The location of the cluster's log file. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster.log`` + + + +OSD +--- + + +``osd debug drop ping probability`` + +:Description: ? +:Type: Double +:Required: No +:Default: 0 + + +``osd debug drop ping duration`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create probability`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create duration`` + +:Description: ? +:Type: Double +:Required: No +:Default: 1 + + +``osd tmapput sets uses tmap`` + +:Description: Uses ``tmap``. For debug only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``osd min pg log entries`` + +:Description: The minimum number of log entries for placement groups. +:Type: 32-bit Unsigned Integer +:Required: No +:Default: 1000 + + +``osd op log threshold`` + +:Description: How many op log messages to show up in one pass. +:Type: Integer +:Required: No +:Default: 5 + + + +Filestore +--------- + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. This is an expensive operation. +:Type: Boolean +:Required: No +:Default: 0 + + +MDS +--- + + +``mds debug scatterstat`` + +:Description: Ceph will assert that various recursive stat invariants are true + (for developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug frag`` + +:Description: Ceph will verify directory fragmentation invariants when + convenient (developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug auth pins`` + +:Description: The debug auth pin invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug subtrees`` + +:Description: The debug subtree invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + + +RADOS Gateway +------------- + + +``rgw log nonexistent bucket`` + +:Description: Should we log a non-existent buckets? 
+:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw log object name`` + +:Description: Should an object's name be logged. // man date to see codes (a subset are supported) +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%H-%i-%n`` + + +``rgw log object name utc`` + +:Description: Object log name contains UTC? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw enable ops log`` + +:Description: Enables logging of every RGW operation. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw enable usage log`` + +:Description: Enable logging of RGW's bandwidth usage. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw usage log flush threshold`` + +:Description: Threshold to flush pending log data. +:Type: Integer +:Required: No +:Default: ``1024`` + + +``rgw usage log tick interval`` + +:Description: Flush pending log data every ``s`` seconds. +:Type: Integer +:Required: No +:Default: 30 + + +``rgw intent log object name`` + +:Description: +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%i-%n`` + + +``rgw intent log object name utc`` + +:Description: Include a UTC timestamp in the intent log object name. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. [#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/src/ceph/doc/rados/troubleshooting/memory-profiling.rst b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 0000000..e2396e2 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,142 @@ +================== + Memory Profiling +================== + +Ceph MON, OSD and MDS can generate heap profiles using +``tcmalloc``. To generate heap profiles, ensure you have +``google-perftools`` installed:: + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (i.e., +``/var/log/ceph``). See `Logging and Debugging`_ for details. +To view the profiler logs with Google's performance tools, execute the +following:: + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. + $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... 
+ +Another heap dump on the same daemon will add another file. It is +convenient to compare to a previous heap dump to show what has grown +in the interval. For instance:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +Refer to `Google Heap Profiler`_ for additional details. + +Once you have the heap profiler installed, start your cluster and +begin using the heap profiler. You may enable or disable the heap +profiler at runtime, or ensure that it runs continuously. For the +following commandline usage, replace ``{daemon-type}`` with ``mon``, +``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or +the MON or MDS id. + + +Starting the Profiler +--------------------- + +To start the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example:: + + ceph tell osd.1 heap start_profiler + +Alternatively the profile can be started when the daemon starts +running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in +the environment. + +Printing Stats +-------------- + +To print out statistics, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example:: + + ceph tell osd.0 heap stats + +.. note:: Printing stats does not require the profiler to be running and does + not dump the heap allocation information to a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example:: + + ceph tell mds.a heap dump + +.. note:: Dumping heap information only works when the profiler is running. + + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used by +the Ceph daemon itself, execute the following:: + + ceph tell {daemon-type}{daemon-id} heap release + +For example:: + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example:: + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 0000000..89fb94c --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,567 @@ +================================= + Troubleshooting Monitors +================================= + +.. index:: monitor, high availability + +When a cluster encounters monitor-related troubles there's a tendency to +panic, and some times with good reason. You should keep in mind that losing +a monitor, or a bunch of them, don't necessarily mean that your cluster is +down, as long as a majority is up, running and with a formed quorum. +Regardless of how bad the situation is, the first thing you should do is to +calm down, take a breath and try answering our initial troubleshooting script. 
+ + +Initial Troubleshooting +======================== + + +**Are the monitors running?** + + First of all, we need to make sure the monitors are running. You would be + amazed by how often people forget to run the monitors, or restart them after + an upgrade. There's no shame in that, but let's try not losing a couple of + hours chasing an issue that is not there. + +**Are you able to connect to the monitor's servers?** + + Doesn't happen often, but sometimes people do have ``iptables`` rules that + block accesses to monitor servers or monitor ports. Usually leftovers from + monitor stress-testing that were forgotten at some point. Try ssh'ing into + the server and, if that succeeds, try connecting to the monitor's port + using you tool of choice (telnet, nc,...). + +**Does ceph -s run and obtain a reply from the cluster?** + + If the answer is yes then your cluster is up and running. One thing you + can take for granted is that the monitors will only answer to a ``status`` + request if there is a formed quorum. + + If ``ceph -s`` blocked however, without obtaining a reply from the cluster + or showing a lot of ``fault`` messages, then it is likely that your monitors + are either down completely or just a portion is up -- a portion that is not + enough to form a quorum (keep in mind that a quorum if formed by a majority + of monitors). + +**What if ceph -s doesn't finish?** + + If you haven't gone through all the steps so far, please go back and do. + + For those running on Emperor 0.72-rc1 and forward, you will be able to + contact each monitor individually asking them for their status, regardless + of a quorum being formed. This an be achieved using ``ceph ping mon.ID``, + ID being the monitor's identifier. You should perform this for each monitor + in the cluster. In section `Understanding mon_status`_ we will explain how + to interpret the output of this command. + + For the rest of you who don't tread on the bleeding edge, you will need to + ssh into the server and use the monitor's admin socket. Please jump to + `Using the monitor's admin socket`_. + +For other specific issues, keep on reading. + + +Using the monitor's admin socket +================================= + +The admin socket allows you to interact with a given daemon directly using a +Unix socket file. This file can be found in your monitor's ``run`` directory. +By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok`` +but this can vary if you defined it otherwise. If you don't find it there, +please check your ``ceph.conf`` for an alternative path or run:: + + ceph-conf --name mon.ID --show-config-value admin_socket + +Please bear in mind that the admin socket will only be available while the +monitor is running. When the monitor is properly shutdown, the admin socket +will be removed. If however the monitor is not running and the admin socket +still persists, it is likely that the monitor was improperly shutdown. +Regardless, if the monitor is not running, you will not be able to use the +admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``. + +Accessing the admin socket is as simple as telling the ``ceph`` tool to use +the ``asok`` file. 
In pre-Dumpling Ceph, this can be achieved by:: + + ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command> + +while in Dumpling and beyond you can use the alternate (and recommended) +format:: + + ceph daemon mon.<id> <command> + +Using ``help`` as the command to the ``ceph`` tool will show you the +supported commands available through the admin socket. Please take a look +at ``config get``, ``config show``, ``mon_status`` and ``quorum_status``, +as those can be enlightening when troubleshooting a monitor. + + +Understanding mon_status +========================= + +``mon_status`` can be obtained through the ``ceph`` tool when you have +a formed quorum, or via the admin socket if you don't. This command will +output a multitude of information about the monitor, including the same +output you would get with ``quorum_status``. + +Take the following example of ``mon_status``:: + + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +A couple of things are obvious: we have three monitors in the monmap (*a*, *b* +and *c*), the quorum is formed by only two monitors, and *c* is in the quorum +as a *peon*. + +Which monitor is out of the quorum? + + The answer would be **a**. + +Why? + + Take a look at the ``quorum`` set. We have two monitors in this set: *1* + and *2*. These are not monitor names. These are monitor ranks, as established + in the current monmap. We are missing the monitor with rank 0, and according + to the monmap that would be ``mon.a``. + +By the way, how are ranks established? + + Ranks are (re)calculated whenever you add or remove monitors and follow a + simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the + rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all + the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. + +Most Common Monitor Issues +=========================== + +Have Quorum but at least one Monitor is down +--------------------------------------------- + +When this happens, depending on the version of Ceph you are running, +you should be seeing something similar to:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +How to troubleshoot this? + + First, make sure ``mon.a`` is running. + + Second, make sure you are able to connect to ``mon.a``'s server from the + other monitors' servers. Check the ports as well. Check ``iptables`` on + all your monitor nodes and make sure you are not dropping/rejecting + connections. + + If this initial troubleshooting doesn't solve your problems, then it's + time to go deeper. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + Considering the monitor is out of the quorum, its state should be one of + ``probing``, ``electing`` or ``synchronizing``. 
If it happens to be either + ``leader`` or ``peon``, then the monitor believes to be in quorum, while + the remaining cluster is sure it is not; or maybe it got into the quorum + while we were troubleshooting the monitor, so check you ``ceph -s`` again + just to make sure. Proceed if the monitor is not yet in the quorum. + +What if the state is ``probing``? + + This means the monitor is still looking for the other monitors. Every time + you start a monitor, the monitor will stay in this state for some time + while trying to find the rest of the monitors specified in the ``monmap``. + The time a monitor will spend in this state can vary. For instance, when on + a single-monitor cluster, the monitor will pass through the probing state + almost instantaneously, since there are no other monitors around. On a + multi-monitor cluster, the monitors will stay in this state until they + find enough monitors to form a quorum -- this means that if you have 2 out + of 3 monitors down, the one remaining monitor will stay in this state + indefinitively until you bring one of the other monitors up. + + If you have a quorum, however, the monitor should be able to find the + remaining monitors pretty fast, as long as they can be reached. If your + monitor is stuck probing and you have gone through with all the communication + troubleshooting, then there is a fair chance that the monitor is trying + to reach the other monitors on a wrong address. ``mon_status`` outputs the + ``monmap`` known to the monitor: check if the other monitor's locations + match reality. If they don't, jump to + `Recovering a Monitor's Broken monmap`_; if they do, then it may be related + to severe clock skews amongst the monitor nodes and you should refer to + `Clock Skews`_ first, but if that doesn't solve your problem then it is + the time to prepare some logs and reach out to the community (please refer + to `Preparing your logs`_ on how to best prepare your logs). + + +What if state is ``electing``? + + This means the monitor is in the middle of an election. These should be + fast to complete, but at times the monitors can get stuck electing. This + is usually a sign of a clock skew among the monitor nodes; jump to + `Clock Skews`_ for more infos on that. If all your clocks are properly + synchronized, it is best if you prepare some logs and reach out to the + community. This is not a state that is likely to persist and aside from + (*really*) old bugs there is not an obvious reason besides clock skews on + why this would happen. + +What if state is ``synchronizing``? + + This means the monitor is synchronizing with the rest of the cluster in + order to join the quorum. The synchronization process is as faster as + smaller your monitor store is, so if you have a big store it may + take a while. Don't worry, it should be finished soon enough. + + However, if you notice that the monitor jumps from ``synchronizing`` to + ``electing`` and then back to ``synchronizing``, then you do have a + problem: the cluster state is advancing (i.e., generating new maps) way + too fast for the synchronization process to keep up. This used to be a + thing in early Cuttlefish, but since then the synchronization process was + quite refactored and enhanced to avoid just this sort of behavior. If this + happens in later versions let us know. And bring some logs + (see `Preparing your logs`_). + +What if state is ``leader`` or ``peon``? + + This should not happen. 
There is a chance this might happen however, and + it has a lot to do with clock skews -- see `Clock Skews`_. If you are not + suffering from clock skews, then please prepare your logs (see + `Preparing your logs`_) and reach out to us. + + +Recovering a Monitor's Broken monmap +------------------------------------- + +This is how a ``monmap`` usually looks like, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. For instance, in some versions of +early Cuttlefish there was this one bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to read it because it would find it hard to +make sense of only-zeros. Some other times, you may end up with a monitor +with a severely outdated monmap, thus being unable to find the remaining +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). + +In this sort of situations, you have two possible solutions: + +Scrap the monitor and create a new one + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + Usually the safest path. You should grab the monmap from the remaining + monitors and inject it into the monitor with the corrupted/lost monmap. + + These are the basic steps: + + 1. Is there a formed quorum? If so, grab the monmap from the quorum:: + + $ ceph mon getmap -o /tmp/monmap + + 2. No quorum? Grab the monmap directly from another monitor (this + assumes the monitor you are grabbing the monmap from has id ID-FOO + and has been stopped):: + + $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + + 3. Stop the monitor you are going to inject the monmap into. + + 4. Inject the monmap:: + + $ ceph-mon -i ID --inject-monmap /tmp/monmap + + 5. Start the monitor + + Please keep in mind that the ability to inject monmaps is a powerful + feature that can cause havoc with your monitors if misused as it will + overwrite the latest, existing monmap kept by the monitor. + + +Clock Skews +------------ + +Monitors can be severely affected by significant clock skews across the +monitor nodes. This usually translates into weird behavior with no obvious +cause. To avoid such issues, you should run a clock synchronization tool +on your monitor nodes. + + +What's the maximum tolerated clock skew? + + By default the monitors will allow clocks to drift up to ``0.05 seconds``. + + +Can I increase the maximum tolerated clock skew? + + This value is configurable via the ``mon-clock-drift-allowed`` option, and + although you *CAN* it doesn't mean you *SHOULD*. The clock skew mechanism + is in place because clock skewed monitor may not properly behave. We, as + developers and QA afficcionados, are comfortable with the current default + value, as it will alert the user before the monitors get out hand. 
  Changing this value without testing it first may cause unforeseen effects
  on the stability of the monitors and overall cluster health, although
  there is no risk of data loss.


How do I know there's a clock skew?

  The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health
  detail`` should show something in the form of::

      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

  That means that ``mon.c`` has been flagged as suffering from a clock skew.


What should I do if there's a clock skew?

  Synchronize your clocks. Running an NTP client may help. If you are
  already using one and you hit this sort of issue, check whether you are
  using an NTP server remote to your network and consider hosting your own
  NTP server on your network. This last option tends to reduce the number
  of issues with monitor clock skews.


Client Can't Connect or Mount
------------------------------

Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
``iptables``. The rule rejects all clients trying to connect to the host except
for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in
place, clients connecting from a separate node will fail to mount with a timeout
error. You need to address ``iptables`` rules that reject clients trying to
connect to Ceph daemons. For example, you would need to address rules that look
like this appropriately::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors (i.e., port
6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For
example::

    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT


Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

Ceph monitors store the `cluster map`_ in a key/value store such as LevelDB. If
a monitor fails due to key/value store corruption, the following error messages
might be found in the monitor log::

    Corruption: error in middle of record

or::

    Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are any survivors, we can always `replace`_ the corrupted monitor with
a new one. After booting up, the new joiner will sync up with a healthy peer,
and once it is fully synchronized, it will be able to serve clients.

Recovery using OSDs
-------------------

But what if all monitors fail at the same time? Since users are encouraged to
deploy at least three monitors in a Ceph cluster, the chance of simultaneous
failure is small. But unplanned power-downs in a data center combined with
improperly configured disk/fs settings could fail the underlying filesystem,
and hence kill all the monitors.
In this case, we can recover the monitor store with the +information stored in OSDs.:: + + ms=/tmp/mon-store + mkdir $ms + # collect the cluster map from OSDs + for host in $hosts; do + rsync -avz $ms user@host:$ms + rm -rf $ms + ssh user@host <<EOF + for osd in /var/lib/osd/osd-*; do + ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms + done + EOF + rsync -avz user@host:$ms $ms + done + # rebuild the monitor store from the collected map, if the cluster does not + # use cephx authentication, we can skip the following steps to update the + # keyring with the caps, and there is no need to pass the "--keyring" option. + # i.e. just use "ceph-monstore-tool /tmp/mon-store rebuild" instead + ceph-authtool /path/to/admin.keyring -n mon. \ + --cap mon 'allow *' + ceph-authtool /path/to/admin.keyring -n client.admin \ + --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' + ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring + # backup corrupted store.db just in case + mv /var/lib/ceph/mon/mon.0/store.db /var/lib/ceph/mon/mon.0/store.db.corrupted + mv /tmp/mon-store/store.db /var/lib/ceph/mon/mon.0/store.db + chown -R ceph:ceph /var/lib/ceph/mon/mon.0/store.db + +The steps above + +#. collect the map from all OSD hosts, +#. then rebuild the store, +#. fill the entities in keyring file with appropriate caps +#. replace the corrupted store on ``mon.0`` with the recovered copy. + +Known limitations +~~~~~~~~~~~~~~~~~ + +Following information are not recoverable using the steps above: + +- **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command + are recovered from the OSD's copy. And the ``client.admin`` keyring is imported + using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing + in the recovered monitor store. You might need to re-add them manually. + +- **pg settings**: the ``full ratio`` and ``nearfull ratio`` settings configured using + ``ceph pg set_full_ratio`` and ``ceph pg set_nearfull_ratio`` will be lost. + +- **MDS Maps**: the MDS maps are lost. + + +Everything Failed! Now What? +============================= + +Reaching out for help +---------------------- + +You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net) +and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make +sure you have grabbed your logs and have them ready if someone asks: the faster +the interaction and lower the latency in response, the better chances everyone's +time is optimized. + + +Preparing your logs +--------------------- + +Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We +may want them. However, your logs may not have the necessary information. If +you don't find your monitor logs at their default location, you can check +where they should be by running:: + + ceph-conf --name mon.FOO --show-config-value log_file + +The amount of information in the logs are subject to the debug levels being +enforced by your configuration files. If you have not enforced a specific +debug level then Ceph is using the default levels and your logs may not +contain important information to track down you issue. +A first step in getting relevant information into your logs will be to raise +debug levels. In this case we will be interested in the information from the +monitor. +Similarly to what happens on other components, different parts of the monitor +will output their debug information on different subsystems. 

You will have to raise the debug levels of those subsystems more closely
related to your issue. This may not be an easy task for someone unfamiliar
with troubleshooting Ceph. For most situations, setting the following options
on your monitors will be enough to pinpoint a potential source of the issue::

    debug mon = 10
    debug ms = 1

If we find that these debug levels are not enough, there's a chance we may
ask you to raise them or even define other debug subsystems to obtain
information from -- but at least we started off with some useful information,
instead of a mostly empty log without much to go on.

Do I need to restart a monitor to adjust debug levels?
-------------------------------------------------------

No. You may do it in one of two ways:

You have quorum

  Either inject the debug option into the monitor you want to debug::

      ceph tell mon.FOO injectargs --debug_mon 10/10

  or into all monitors at once::

      ceph tell mon.* injectargs --debug_mon 10/10

No quorum

  Use the monitor's admin socket and directly adjust the configuration
  options::

      ceph daemon mon.FOO config set debug_mon 10/10


Going back to default values is as easy as rerunning the above commands
using the debug level ``1/10`` instead. You can check your current
values using the admin socket and the following commands::

    ceph daemon mon.FOO config show

or::

    ceph daemon mon.FOO config get 'OPTION_NAME'


Reproduced the problem with appropriate debug levels. Now what?
----------------------------------------------------------------

Ideally you would send us only the relevant portions of your logs.
We realise that figuring out the corresponding portion may not be the
easiest of tasks. Therefore, we won't hold it against you if you provide the
full log, but common sense should be employed. If your log has hundreds of
thousands of lines, it may get tricky to go through the whole thing,
especially if we are not aware of the point at which your issue happened.
For instance, when reproducing the problem, make a note of the current time
and date so that the relevant portions of your logs can be extracted based
on that.

Finally, you should reach out to us on the mailing lists, on IRC or file
a new issue on the `tracker`_.

.. _cluster map: ../../architecture#cluster-map
.. _replace: ../operation/add-or-rm-mons
.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new
diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
new file mode 100644
index 0000000..88307fe
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -0,0 +1,536 @@
======================
 Troubleshooting OSDs
======================

Before troubleshooting your OSDs, check your monitors and network first. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
a health status, it means that the monitors have a quorum.
If you don't have a monitor quorum or if there are errors with the monitor
status, `address the monitor issues first <../troubleshooting-mon>`_.
Check your networks to ensure they
are running properly, because networks may have a significant impact on OSD
operation and performance.


Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).
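
For example, a quick first pass might look like the following (``osd.0`` is
just a placeholder id; adjust it for your cluster)::

    ceph -s                    # overall cluster status, including OSD up/in counts
    ceph osd tree              # which OSDs are up or down, and where they sit in the CRUSH hierarchy
    ceph osd df                # per-OSD utilization and PG counts
    ceph tell osd.0 version    # confirm that a specific OSD daemon is responsive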
+ + +Ceph Logs +--------- + +If you haven't changed the default path, you can find Ceph log files at +``/var/log/ceph``:: + + ls /var/log/ceph + +If you don't get enough log detail, you can change your logging level. See +`Logging and Debugging`_ for details to ensure that Ceph performs adequately +under high logging volume. + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. For details, list +the sockets for your Ceph processes:: + + ls /var/run/ceph + +Then, execute the following, replacing ``{daemon-name}`` with an actual +daemon (e.g., ``osd.0``):: + + ceph daemon osd.0 help + +Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: + + ceph daemon {socket-file} help + + +The admin socket, among other things, allows you to: + +- List your configuration at runtime +- Dump historic operations +- Dump the operation priority queue state +- Dump operations in flight +- Dump perfcounters + + +Display Freespace +----------------- + +Filesystem issues may arise. To display your filesystem's free space, execute +``df``. :: + + df -h + +Execute ``df --help`` for additional usage. + + +I/O Statistics +-------------- + +Use `iostat`_ to identify I/O-related issues. :: + + iostat -x + + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep`` +or ``tail``. For example:: + + dmesg | grep scsi + + +Stopping w/out Rebalancing +========================== + +Periodically, you may need to perform maintenance on a subset of your cluster, +or resolve a problem that affects a failure domain (e.g., a rack). If you do not +want CRUSH to automatically rebalance the cluster as you stop OSDs for +maintenance, set the cluster to ``noout`` first:: + + ceph osd set noout + +Once the cluster is set to ``noout``, you can begin stopping the OSDs within the +failure domain that requires maintenance work. :: + + stop ceph-osd id={num} + +.. note:: Placement groups within the OSDs you stop will become ``degraded`` + while you are addressing issues with within the failure domain. + +Once you have completed your maintenance, restart the OSDs. :: + + start ceph-osd id={num} + +Finally, you must unset the cluster from ``noout``. :: + + ceph osd unset noout + + + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal circumstances, simply restarting the ``ceph-osd`` daemon will +allow it to rejoin the cluster and recover. + +An OSD Won't Start +------------------ + +If you start your cluster and an OSD won't start, check the following: + +- **Configuration File:** If you were not able to get OSDs running from + a new installation, check your configuration file to ensure it conforms + (e.g., ``host`` not ``hostname``, etc.). + +- **Check Paths:** Check the paths in your configuration, and the actual + paths themselves for data and journals. If you separate the OSD data from + the journal data and there are errors in your configuration file or in the + actual mounts, you may have trouble starting OSDs. If you want to store the + journal on a block device, you should partition your journal disk and assign + one partition per OSD. + +- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be + hitting the default maximum number of threads (e.g., usually 32k), especially + during recovery. 
You can increase the number of threads using ``sysctl`` to + see if increasing the maximum number of threads to the maximum possible + number of threads allowed (i.e., 4194303) will help. For example:: + + sysctl -w kernel.pid_max=4194303 + + If increasing the maximum thread count resolves the issue, you can make it + permanent by including a ``kernel.pid_max`` setting in the + ``/etc/sysctl.conf`` file. For example:: + + kernel.pid_max = 4194303 + +- **Kernel Version:** Identify the kernel version and distribution you + are using. Ceph uses some third party tools by default, which may be + buggy or may conflict with certain distributions and/or kernel + versions (e.g., Google perftools). Check the `OS recommendations`_ + to ensure you have addressed any issues related to your kernel. + +- **Segment Fault:** If there is a segment fault, turn your logging up + (if it is not already), and try again. If it segment faults again, + contact the ceph-devel email list and provide your Ceph configuration + file, your monitor output and the contents of your log file(s). + + + +An OSD Failed +------------- + +When a ``ceph-osd`` process dies, the monitor will learn about the failure +from surviving ``ceph-osd`` daemons and report it via the ``ceph health`` +command:: + + ceph health + HEALTH_WARN 1/3 in osds are down + +Specifically, you will get a warning whenever there are ``ceph-osd`` +processes that are marked ``in`` and ``down``. You can identify which +``ceph-osds`` are ``down`` with:: + + ceph health detail + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +If there is a disk +failure or other fault preventing ``ceph-osd`` from functioning or +restarting, an error message should be present in its log file in +``/var/log/ceph``. + +If the daemon stopped because of a heartbeat failure, the underlying +kernel file system may be unresponsive. Check ``dmesg`` output for disk +or other kernel errors. + +If the problem is a software error (failed assertion or other +unexpected error), it should be reported to the `ceph-devel`_ email list. + + +No Free Drive Space +------------------- + +Ceph prevents you from writing to a full OSD so that you don't lose data. +In an operational cluster, you should receive a warning when your cluster +is getting near its full ratio. The ``mon osd full ratio`` defaults to +``0.95``, or 95% of capacity before it stops clients from writing data. +The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90 % of +capacity when it blocks backfills from starting. The +``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity +when it generates a health warning. + +Full cluster issues usually arise when testing how Ceph handles an OSD +failure on a small cluster. When one node has a high percentage of the +cluster's data, the cluster can easily eclipse its nearfull and full ratio +immediately. If you are testing how Ceph reacts to OSD failures on a small +cluster, you should leave ample free disk space and consider temporarily +lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and +``mon osd nearfull ratio``. 
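
How you adjust these ratios at runtime depends on your Ceph release. As a
sketch, on recent releases the ratios live in the OSDMap and can be changed
with the commands below (the values are illustrative only and are not
recommended for production); on older releases the equivalent knobs were
``ceph pg set_nearfull_ratio`` and ``ceph pg set_full_ratio``::

    ceph osd df                           # check per-OSD utilization first
    ceph osd set-nearfull-ratio 0.90      # threshold for the nearfull health warning
    ceph osd set-backfillfull-ratio 0.95  # threshold at which backfills are refused
    ceph osd set-full-ratio 0.97          # threshold at which client writes are blocked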
+ +Full ``ceph-osds`` will be reported by ``ceph health``:: + + ceph health + HEALTH_WARN 1 nearfull osd(s) + +Or:: + + ceph health detail + HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) + osd.3 is full at 97% + osd.4 is backfill full at 91% + osd.2 is near full at 87% + +The best way to deal with a full cluster is to add new ``ceph-osds``, allowing +the cluster to redistribute data to the newly available storage. + +If you cannot start an OSD because it is full, you may delete some data by deleting +some placement group directories in the full OSD. + +.. important:: If you choose to delete a placement group directory on a full OSD, + **DO NOT** delete the same placement group directory on another full OSD, or + **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on + at least one OSD. + +See `Monitor Config Reference`_ for additional details. + + +OSDs are Slow/Unresponsive +========================== + +A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you +have eliminated other troubleshooting possibilities before delving into OSD +performance issues. For example, ensure that your network(s) is working properly +and your OSDs are running. Check to see if OSDs are throttling recovery traffic. + +.. tip:: Newer versions of Ceph provide better recovery handling by preventing + recovering OSDs from using up system resources so that ``up`` and ``in`` + OSDs are not available or are otherwise slow. + + +Networking Issues +----------------- + +Ceph is a distributed storage system, so it depends upon networks to peer with +OSDs, replicate objects, recover from faults and check heartbeats. Networking +issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for +details. + +Ensure that Ceph processes and Ceph-dependent processes are connected and/or +listening. :: + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +Check network statistics. :: + + netstat -s + + +Drive Configuration +------------------- + +A storage drive should only support one OSD. Sequential read and sequential +write throughput can bottleneck if other processes share the drive, including +journals, operating systems, monitors, other OSDs and non-Ceph processes. + +Ceph acknowledges writes *after* journaling, so fast SSDs are an +attractive option to accelerate the response time--particularly when +using the ``XFS`` or ``ext4`` filesystems. By contrast, the ``btrfs`` +filesystem can write and journal simultaneously. (Note, however, that +we recommend against using ``btrfs`` for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Running a journal in a separate partition + may help, but you should prefer a separate physical drive. + + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your disks for bad sectors and fragmentation. This can cause total throughput +to drop substantially. + + +Co-resident Monitors/OSDs +------------------------- + +Monitors are generally light-weight processes, but they do lots of ``fsync()``, +which can interfere with other workloads, particularly if monitors run on the +same drive as your OSDs. Additionally, if you run monitors on the same host as +the OSDs, you may incur performance issues related to: + +- Running an older kernel (pre-3.0) +- Running Argonaut with an old ``glibc`` +- Running a kernel with no syncfs(2) syscall. 
+ +In these cases, multiple OSDs running on the same host can drag each other down +by doing lots of commits. That often leads to the bursty writes. + + +Co-resident Processes +--------------------- + +Spinning up co-resident processes such as a cloud-based solution, virtual +machines and other applications that write data to Ceph while operating on the +same hardware as OSDs can introduce significant OSD latency. Generally, we +recommend optimizing a host for use with Ceph and using other hosts for other +processes. The practice of separating Ceph operations from other applications +may help improve performance and may streamline troubleshooting and maintenance. + + +Logging Levels +-------------- + +If you turned logging levels up to track an issue and then forgot to turn +logging levels back down, the OSD may be putting a lot of logs onto the disk. If +you intend to keep logging levels high, you may consider mounting a drive to the +default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``). + + +Recovery Throttling +------------------- + +Depending upon your configuration, Ceph may reduce recovery rates to maintain +performance or it may increase recovery rates to the point that recovery +impacts OSD performance. Check to see if the OSD is recovering. + + +Kernel Version +-------------- + +Check the kernel version you are running. Older kernels may not receive +new backports that Ceph depends upon for better performance. + + +Kernel Issues with SyncFS +------------------------- + +Try running one OSD per host to see if performance improves. Old kernels +might not have a recent enough version of ``glibc`` to support ``syncfs(2)``. + + +Filesystem Issues +----------------- + +Currently, we recommend deploying clusters with XFS. + +We recommend against using btrfs or ext4. The btrfs filesystem has +many attractive features, but bugs in the filesystem may lead to +performance issues and suprious ENOSPC errors. We do not recommend +ext4 because xattr size limitations break our support for long object +names (needed for RGW). + +For more information, see `Filesystem Recommendations`_. + +.. _Filesystem Recommendations: ../configuration/filesystem-recommendations + + +Insufficient RAM +---------------- + +We recommend 1GB of RAM per OSD daemon. You may notice that during normal +operations, the OSD only uses a fraction of that amount (e.g., 100-200MB). +Unused RAM makes it tempting to use the excess RAM for co-resident applications, +VMs and so forth. However, when OSDs go into recovery mode, their memory +utilization spikes. If there is no RAM available, the OSD performance will slow +considerably. + + +Old Requests or Slow Requests +----------------------------- + +If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages +complaining about requests that are taking too long. The warning threshold +defaults to 30 seconds, and is configurable via the ``osd op complaint time`` +option. When this happens, the cluster log will receive messages. 
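
To confirm the threshold a given OSD is actually using, or to raise it
temporarily while investigating, you can go through the admin socket
(``osd.0`` is a placeholder; the change lasts only until the daemon
restarts)::

    ceph daemon osd.0 config get osd_op_complaint_time
    ceph daemon osd.0 config set osd_op_complaint_time 60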

Legacy versions of Ceph complain about ``old requests``::

    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

New versions of Ceph complain about ``slow requests``::

    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]


Possible causes include:

- A bad drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.

Possible solutions:

- Remove VMs or other cloud solutions from Ceph hosts
- Upgrade the kernel
- Upgrade Ceph
- Restart OSDs

Debugging Slow Requests
-----------------------

If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id>
dump_ops_in_flight``, you will see a set of operations and a list of events
each operation went through. These are briefly described below.

Events from the Messenger layer:

- header_read: when the messenger first started reading the message off the wire
- throttled: when the messenger tried to acquire memory throttle space to read
  the message into memory
- all_read: when the messenger finished reading the message off the wire
- dispatched: when the messenger gave the message to the OSD
- initiated: identical to header_read; the existence of both is a historical
  oddity

Events from the OSD as it prepares operations:

- queued_for_pg: the op has been put into the queue for processing by its PG
- reached_pg: the PG has started doing the op
- waiting for \*: the op is waiting for some other work to complete before it
  can proceed (a new OSDMap; for its object target to scrub; for the PG to
  finish peering; all as specified in the message)
- started: the op has been accepted as something the OSD should actually do
  (reasons not to do it: failed security/permission checks; out-of-date local
  state; etc.) and is now actually being performed
- waiting for subops from: the op has been sent to replica OSDs

Events from the FileStore:

- commit_queued_for_journal_write: the op has been given to the FileStore
- write_thread_in_journal_buffer: the op is in the journal's buffer and waiting
  to be persisted (as the next disk write)
- journaled_completion_queued: the op was journaled to disk and its callback
  queued for invocation

Events from the OSD after the data has been given to the local disk:

- op_commit: the op has been committed (i.e., written to the journal) by the
  primary OSD
- op_applied: the op has been written (with ``write()``) to the backing FS
  (i.e., applied in memory but not flushed out to disk) on the primary
- sub_op_applied: op_applied, but for a replica's "subop"
- sub_op_committed: op_committed, but for a replica's subop (only for EC pools)
- sub_op_commit_rec/sub_op_apply_rec from <X>: the primary marks this when it
  hears about the above, but for a particular replica <X>
- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops)

Many of these events are seemingly redundant, but cross important boundaries in
the internal code (such as passing data across locks into new threads).
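
As a rough illustration of how this output can be used, the sketch below dumps
the recent operation history from one OSD and lists the slowest operations
first. It assumes ``osd.0``, that the ``jq`` utility is installed, and that the
JSON output contains an ``ops`` array with ``description`` and ``duration``
fields (field names can vary between releases)::

    ceph daemon osd.0 dump_historic_ops > /tmp/ops.json
    # show the five slowest recent operations, without their full event timelines
    jq '.ops | sort_by(-.duration) | .[:5] | .[] | {description, duration}' /tmp/ops.json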
+ +Flapping OSDs +============= + +We recommend using both a public (front-end) network and a cluster (back-end) +network so that you can better meet the capacity requirements of object +replication. Another advantage is that you can run a cluster network such that +it is not connected to the internet, thereby preventing some denial of service +attacks. When OSDs peer and check heartbeats, they use the cluster (back-end) +network when it's available. See `Monitor/OSD Interaction`_ for details. + +However, if the cluster (back-end) network fails or develops significant latency +while the public (front-end) network operates optimally, OSDs currently do not +handle this situation well. What happens is that OSDs mark each other ``down`` +on the monitor, while marking themselves ``up``. We call this scenario +'flapping`. + +If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and +then ``up`` again), you can force the monitors to stop the flapping with:: + + ceph osd set noup # prevent OSDs from getting marked up + ceph osd set nodown # prevent OSDs from getting marked down + +These flags are recorded in the osdmap structure:: + + ceph osd dump | grep flags + flags no-up,no-down + +You can clear the flags with:: + + ceph osd unset noup + ceph osd unset nodown + +Two other flags are supported, ``noin`` and ``noout``, which prevent +booting OSDs from being marked ``in`` (allocated data) or protect OSDs +from eventually being marked ``out`` (regardless of what the current value for +``mon osd down out interval`` is). + +.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the + sense that once the flags are cleared, the action they were blocking + should occur shortly after. The ``noin`` flag, on the other hand, + prevents OSDs from being marked ``in`` on boot, and any daemons that + started while the flag was set will remain that way. + + + + + + +.. _iostat: http://en.wikipedia.org/wiki/Iostat +.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging +.. _Logging and Debugging: ../log-and-debug +.. _Debugging and Logging: ../debug +.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction +.. _Monitor Config Reference: ../../configuration/mon-config-ref +.. _monitoring your OSDs: ../../operations/monitoring-osd-pg +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _OS recommendations: ../../../start/os-recommendations +.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst new file mode 100644 index 0000000..4241fee --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst @@ -0,0 +1,668 @@ +===================== + Troubleshooting PGs +===================== + +Placement Groups Never Get Clean +================================ + +When you create a cluster and your cluster remains in ``active``, +``active+remapped`` or ``active+degraded`` status and never achieve an +``active+clean`` status, you likely have a problem with your configuration. + +You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ +and make appropriate adjustments. 
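
For example, it often helps to confirm the replication and placement group
settings of the affected pool before digging deeper (``rbd`` is only a
placeholder pool name)::

    ceph osd pool get rbd size       # number of replicas
    ceph osd pool get rbd min_size   # replicas required to serve I/O
    ceph osd pool get rbd pg_num     # placement group count
    ceph osd pool ls detail          # the same information for all pools at once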
+ +As a general rule, you should run your cluster with more than one OSD and a +pool size greater than 1 object replica. + +One Node Cluster +---------------- + +Ceph no longer provides documentation for operating on a single node, because +you would never deploy a system designed for distributed computing on a single +node. Additionally, mounting client kernel modules on a single node containing a +Ceph daemon may cause a deadlock due to issues with the Linux kernel itself +(unless you use VMs for the clients). You can experiment with Ceph in a 1-node +configuration, in spite of the limitations as described herein. + +If you are trying to create a cluster on a single node, you must change the +default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning +``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration +file before you create your monitors and OSDs. This tells Ceph that an OSD +can peer with another OSD on the same host. If you are trying to set up a +1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, +Ceph will try to peer the PGs of one OSD with the PGs of another OSD on +another node, chassis, rack, row, or even datacenter depending on the setting. + +.. tip:: DO NOT mount kernel clients directly on the same node as your + Ceph Storage Cluster, because kernel conflicts can arise. However, you + can mount kernel clients within virtual machines (VMs) on a single node. + +If you are creating OSDs using a single disk, you must create directories +for the data manually first. For example:: + + mkdir /var/local/osd0 /var/local/osd1 + ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 + ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 + + +Fewer OSDs than Replicas +------------------------ + +If you have brought up two OSDs to an ``up`` and ``in`` state, but you still +don't see ``active + clean`` placement groups, you may have an +``osd pool default size`` set to greater than ``2``. + +There are a few ways to address this situation. If you want to operate your +cluster in an ``active + degraded`` state with two replicas, you can set the +``osd pool default min size`` to ``2`` so that you can write objects in +an ``active + degraded`` state. You may also set the ``osd pool default size`` +setting to ``2`` so that you only have two stored replicas (the original and +one replica), in which case the cluster should achieve an ``active + clean`` +state. + +.. note:: You can make the changes at runtime. If you make the changes in + your Ceph configuration file, you may need to restart your cluster. + + +Pool Size = 1 +------------- + +If you have the ``osd pool default size`` set to ``1``, you will only have +one copy of the object. OSDs rely on other OSDs to tell them which objects +they should have. If a first OSD has a copy of an object and there is no +second copy, then no second OSD can tell the first OSD that it should have +that copy. For each placement group mapped to the first OSD (see +``ceph pg dump``), you can force the first OSD to notice the placement groups +it needs by running:: + + ceph osd force-create-pg <pgid> + + +CRUSH Map Errors +---------------- + +Another candidate for placement groups remaining unclean involves errors +in your CRUSH map. + + +Stuck Placement Groups +====================== + +It is normal for placement groups to enter states like "degraded" or "peering" +following a failure. 
Normally these states indicate the normal progression +through the failure recovery process. However, if a placement group stays in one +of these states for a long time this may be an indication of a larger problem. +For this reason, the monitor will warn when placement groups get "stuck" in a +non-optimal state. Specifically, we check for: + +* ``inactive`` - The placement group has not been ``active`` for too long + (i.e., it hasn't been able to service read/write requests). + +* ``unclean`` - The placement group has not been ``clean`` for too long + (i.e., it hasn't been able to completely recover from a previous failure). + +* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, + indicating that all nodes storing this placement group may be ``down``. + +You can explicitly list stuck placement groups with one of:: + + ceph pg dump_stuck stale + ceph pg dump_stuck inactive + ceph pg dump_stuck unclean + +For stuck ``stale`` placement groups, it is normally a matter of getting the +right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement +groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For +stuck ``unclean`` placement groups, there is usually something preventing +recovery from completing, like unfound objects (see +:ref:`failures-osd-unfound`); + + + +.. _failures-osd-peering: + +Placement Group Down - Peering Failure +====================================== + +In certain cases, the ``ceph-osd`` `Peering` process can run into +problems, preventing a PG from becoming active and usable. For +example, ``ceph health`` might report:: + + ceph health detail + HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down + ... + pg 0.5 is down+peering + pg 1.4 is down+peering + ... + osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 + +We can query the cluster to determine exactly why the PG is marked ``down`` with:: + + ceph pg 0.5 query + +.. code-block:: javascript + + { "state": "down+peering", + ... + "recovery_state": [ + { "name": "Started\/Primary\/Peering\/GetInfo", + "enter_time": "2012-03-06 14:40:16.169679", + "requested_info_from": []}, + { "name": "Started\/Primary\/Peering", + "enter_time": "2012-03-06 14:40:16.169659", + "probing_osds": [ + 0, + 1], + "blocked": "peering is blocked due to down osds", + "down_osds_we_would_probe": [ + 1], + "peering_blocked_by": [ + { "osd": 1, + "current_lost_at": 0, + "comment": "starting or marking this osd lost may let us proceed"}]}, + { "name": "Started", + "enter_time": "2012-03-06 14:40:16.169513"} + ] + } + +The ``recovery_state`` section tells us that peering is blocked due to +down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` +and things will recover. + +Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk +failure), we can tell the cluster that it is ``lost`` and to cope as +best it can. + +.. important:: This is dangerous in that the cluster cannot + guarantee that the other copies of the data are consistent + and up to date. + +To instruct Ceph to continue anyway:: + + ceph osd lost 1 + +Recovery will proceed. + + +.. 
_failures-osd-unfound: + +Unfound Objects +=============== + +Under certain combinations of failures Ceph may complain about +``unfound`` objects:: + + ceph health detail + HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) + pg 2.4 is active+degraded, 78 unfound + +This means that the storage cluster knows that some objects (or newer +copies of existing objects) exist, but it hasn't found copies of them. +One example of how this might come about for a PG whose data is on ceph-osds +1 and 2: + +* 1 goes down +* 2 handles some writes, alone +* 1 comes up +* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. +* Before the new objects are copied, 2 goes down. + +Now 1 knows that these object exist, but there is no live ``ceph-osd`` who +has a copy. In this case, IO to those objects will block, and the +cluster will hope that the failed node comes back soon; this is +assumed to be preferable to returning an IO error to the user. + +First, you can identify which objects are unfound with:: + + ceph pg 2.4 list_missing [starting offset, in json] + +.. code-block:: javascript + + { "offset": { "oid": "", + "key": "", + "snapid": 0, + "hash": 0, + "max": 0}, + "num_missing": 0, + "num_unfound": 0, + "objects": [ + { "oid": "object 1", + "key": "", + "hash": 0, + "max": 0 }, + ... + ], + "more": 0} + +If there are too many objects to list in a single result, the ``more`` +field will be true and you can query for more. (Eventually the +command line tool will hide this from you, but not yet.) + +Second, you can identify which OSDs have been probed or might contain +data:: + + ceph pg 2.4 query + +.. code-block:: javascript + + "recovery_state": [ + { "name": "Started\/Primary\/Active", + "enter_time": "2012-03-06 15:15:46.713212", + "might_have_unfound": [ + { "osd": 1, + "status": "osd is down"}]}, + +In this case, for example, the cluster knows that ``osd.1`` might have +data, but it is ``down``. The full range of possible states include: + +* already probed +* querying +* OSD is down +* not queried (yet) + +Sometimes it simply takes some time for the cluster to query possible +locations. + +It is possible that there are other locations where the object can +exist that are not listed. For example, if a ceph-osd is stopped and +taken out of the cluster, the cluster fully recovers, and due to some +future set of failures ends up with an unfound object, it won't +consider the long-departed ceph-osd as a potential location to +consider. (This scenario, however, is unlikely.) + +If all possible locations have been queried and objects are still +lost, you may have to give up on the lost objects. This, again, is +possible given unusual combinations of failures that allow the cluster +to learn about writes that were performed before the writes themselves +are recovered. To mark the "unfound" objects as "lost":: + + ceph pg 2.5 mark_unfound_lost revert|delete + +This the final argument specifies how the cluster should deal with +lost objects. + +The "delete" option will forget about them entirely. + +The "revert" option (not available for erasure coded pools) will +either roll back to a previous version of the object or (if it was a +new object) forget about it entirely. Use this with caution, as it +may confuse applications that expected the object to exist. + + +Homeless Placement Groups +========================= + +It is possible for all OSDs that had copies of a given placement groups to fail. 
+If that's the case, that subset of the object store is unavailable, and the +monitor will receive no status updates for those placement groups. To detect +this situation, the monitor marks any placement group whose primary OSD has +failed as ``stale``. For example:: + + ceph health + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + +You can identify which placement groups are ``stale``, and what the last OSDs to +store them were, with:: + + ceph health detail + HEALTH_WARN 24 pgs stale; 3/300 in osds are down + ... + pg 2.5 is stuck stale+active+remapped, last acting [2,0] + ... + osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 + osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 + osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 + +If we want to get placement group 2.5 back online, for example, this tells us that +it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` +daemons will allow the cluster to recover that placement group (and, presumably, +many others). + + +Only a Few OSDs Receive Data +============================ + +If you have many nodes in your cluster and only a few of them receive data, +`check`_ the number of placement groups in your pool. Since placement groups get +mapped to OSDs, a small number of placement groups will not distribute across +your cluster. Try creating a pool with a placement group count that is a +multiple of the number of OSDs. See `Placement Groups`_ for details. The default +placement group count for pools is not useful, but you can change it `here`_. + + +Can't Write Data +================ + +If your cluster is up, but some OSDs are down and you cannot write data, +check to ensure that you have the minimum number of OSDs running for the +placement group. If you don't have the minimum number of OSDs running, +Ceph will not allow you to write data because there is no guarantee +that Ceph can replicate your data. See ``osd pool default min size`` +in the `Pool, PG and CRUSH Config Reference`_ for details. + + +PGs Inconsistent +================ + +If you receive an ``active + clean + inconsistent`` state, this may happen +due to an error during scrubbing. As always, we can identify the inconsistent +placement group(s) with:: + + $ ceph health detail + HEALTH_ERR 1 pgs inconsistent; 2 scrub errors + pg 0.6 is active+clean+inconsistent, acting [0,1,2] + 2 scrub errors + +Or if you prefer inspecting the output in a programmatic way:: + + $ rados list-inconsistent-pg rbd + ["0.6"] + +There is only one consistent state, but in the worst case, we could have +different inconsistencies in multiple perspectives found in more than one +objects. If an object named ``foo`` in PG ``0.6`` is truncated, we will have:: + + $ rados list-inconsistent-obj 0.6 --format=json-pretty + +.. 
code-block:: javascript + + { + "epoch": 14, + "inconsistents": [ + { + "object": { + "name": "foo", + "nspace": "", + "locator": "", + "snap": "head", + "version": 1 + }, + "errors": [ + "data_digest_mismatch", + "size_mismatch" + ], + "union_shard_errors": [ + "data_digest_mismatch_oi", + "size_mismatch_oi" + ], + "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", + "shards": [ + { + "osd": 0, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 1, + "errors": [], + "size": 968, + "omap_digest": "0xffffffff", + "data_digest": "0xe978e67f" + }, + { + "osd": 2, + "errors": [ + "data_digest_mismatch_oi", + "size_mismatch_oi" + ], + "size": 0, + "omap_digest": "0xffffffff", + "data_digest": "0xffffffff" + } + ] + } + ] + } + +In this case, we can learn from the output: + +* The only inconsistent object is named ``foo``, and it is its head that has + inconsistencies. +* The inconsistencies fall into two categories: + + * ``errors``: these errors indicate inconsistencies between shards without a + determination of which shard(s) are bad. Check for the ``errors`` in the + `shards` array, if available, to pinpoint the problem. + + * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is + different from the ones of OSD.0 and OSD.1 + * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while + the size reported by OSD.0 and OSD.1 is 968. + * ``union_shard_errors``: the union of all shard specific ``errors`` in + ``shards`` array. The ``errors`` are set for the given shard that has the + problem. They include errors like ``read_error``. The ``errors`` ending in + ``oi`` indicate a comparison with ``selected_object_info``. Look at the + ``shards`` array to determine which shard has which error(s). + + * ``data_digest_mismatch_oi``: the digest stored in the object-info is not + ``0xffffffff``, which is calculated from the shard read from OSD.2 + * ``size_mismatch_oi``: the size stored in the object-info is different + from the one read from OSD.2. The latter is 0. + +You can repair the inconsistent placement group by executing:: + + ceph pg repair {placement-group-ID} + +Which overwrites the `bad` copies with the `authoritative` ones. In most cases, +Ceph is able to choose authoritative copies from all available replicas using +some predefined criteria. But this does not always work. For example, the stored +data digest could be missing, and the calculated digest will be ignored when +choosing the authoritative copies. So, please use the above command with caution. + +If ``read_error`` is listed in the ``errors`` attribute of a shard, the +inconsistency is likely due to disk errors. You might want to check your disk +used by that OSD. + +If you receive ``active + clean + inconsistent`` states periodically due to +clock skew, you may consider configuring your `NTP`_ daemons on your +monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph +`Clock Settings`_ for additional details. + + +Erasure Coded PGs are not active+clean +====================================== + +When CRUSH fails to find enough OSDs to map to a PG, it will show as a +``2147483647`` which is ITEM_NONE or ``no OSD found``. For instance:: + + [2,1,6,0,5,8,2147483647,7,4] + +Not enough OSDs +--------------- + +If the Ceph cluster only has 8 OSDs and the erasure coded pool needs +9, that is what it will show. 
You can either create another erasure +coded pool that requires less OSDs:: + + ceph osd erasure-code-profile set myprofile k=5 m=3 + ceph osd pool create erasurepool 16 16 erasure myprofile + +or add a new OSDs and the PG will automatically use them. + +CRUSH constraints cannot be satisfied +------------------------------------- + +If the cluster has enough OSDs, it is possible that the CRUSH ruleset +imposes constraints that cannot be satisfied. If there are 10 OSDs on +two hosts and the CRUSH rulesets require that no two OSDs from the +same host are used in the same PG, the mapping may fail because only +two OSD will be found. You can check the constraint by displaying the +ruleset:: + + $ ceph osd crush rule ls + [ + "replicated_ruleset", + "erasurepool"] + $ ceph osd crush rule dump erasurepool + { "rule_id": 1, + "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + + +You can resolve the problem by creating a new pool in which PGs are allowed +to have OSDs residing on the same host with:: + + ceph osd erasure-code-profile set myprofile crush-failure-domain=osd + ceph osd pool create erasurepool 16 16 erasure myprofile + +CRUSH gives up too soon +----------------------- + +If the Ceph cluster has just enough OSDs to map the PG (for instance a +cluster with a total of 9 OSDs and an erasure coded pool that requires +9 OSDs per PG), it is possible that CRUSH gives up before finding a +mapping. It can be resolved by: + +* lowering the erasure coded pool requirements to use less OSDs per PG + (that requires the creation of another pool as erasure code profiles + cannot be dynamically modified). + +* adding more OSDs to the cluster (that does not require the erasure + coded pool to be modified, it will become clean automatically) + +* use a hand made CRUSH ruleset that tries more times to find a good + mapping. It can be done by setting ``set_choose_tries`` to a value + greater than the default. + +You should first verify the problem with ``crushtool`` after +extracting the crushmap from the cluster so your experiments do not +modify the Ceph cluster and only work on a local files:: + + $ ceph osd crush rule dump erasurepool + { "rule_name": "erasurepool", + "ruleset": 1, + "type": 3, + "min_size": 3, + "max_size": 20, + "steps": [ + { "op": "take", + "item": -1, + "item_name": "default"}, + { "op": "chooseleaf_indep", + "num": 0, + "type": "host"}, + { "op": "emit"}]} + $ ceph osd getcrushmap > crush.map + got crush map from osdmap epoch 13 + $ crushtool -i crush.map --test --show-bad-mappings \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0] + bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8] + bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647] + +Where ``--num-rep`` is the number of OSDs the erasure code crush +ruleset needs, ``--rule`` is the value of the ``ruleset`` field +displayed by ``ceph osd crush rule dump``. The test will try mapping +one million values (i.e. the range defined by ``[--min-x,--max-x]``) +and must display at least one bad mapping. If it outputs nothing it +means all mappings are successfull and you can stop right there: the +problem is elsewhere. 
+ +The crush ruleset can be edited by decompiling the crush map:: + + $ crushtool --decompile crush.map > crush.txt + +and adding the following line to the ruleset:: + + step set_choose_tries 100 + +The relevant part of of the ``crush.txt`` file should look something +like:: + + rule erasurepool { + ruleset 1 + type erasure + min_size 3 + max_size 20 + step set_chooseleaf_tries 5 + step set_choose_tries 100 + step take default + step chooseleaf indep 0 type host + step emit + } + +It can then be compiled and tested again:: + + $ crushtool --compile crush.txt -o better-crush.map + +When all mappings succeed, an histogram of the number of tries that +were necessary to find all of them can be displayed with the +``--show-choose-tries`` option of ``crushtool``:: + + $ crushtool -i better-crush.map --test --show-bad-mappings \ + --show-choose-tries \ + --rule 1 \ + --num-rep 9 \ + --min-x 1 --max-x $((1024 * 1024)) + ... + 11: 42 + 12: 44 + 13: 54 + 14: 45 + 15: 35 + 16: 34 + 17: 30 + 18: 25 + 19: 19 + 20: 22 + 21: 20 + 22: 17 + 23: 13 + 24: 16 + 25: 13 + 26: 11 + 27: 11 + 28: 13 + 29: 11 + 30: 10 + 31: 6 + 32: 5 + 33: 10 + 34: 3 + 35: 7 + 36: 5 + 37: 2 + 38: 5 + 39: 5 + 40: 2 + 41: 5 + 42: 4 + 43: 1 + 44: 2 + 45: 2 + 46: 3 + 47: 1 + 48: 0 + ... + 102: 0 + 103: 1 + 104: 0 + ... + +It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped). + +.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups +.. _here: ../../configuration/pool-pg-config-ref +.. _Placement Groups: ../../operations/placement-groups +.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref +.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol +.. _The Network Time Protocol: http://www.ntp.org/ +.. _Clock Settings: ../../configuration/mon-config-ref/#clock + + |