From 7da45d65be36d36b880cc55c5036e96c24b53f00 Mon Sep 17 00:00:00 2001 From: Qiaowei Ren Date: Thu, 1 Mar 2018 14:38:11 +0800 Subject: remove ceph code This patch removes initial ceph code, due to license issue. Change-Id: I092d44f601cdf34aed92300fe13214925563081c Signed-off-by: Qiaowei Ren --- src/ceph/doc/rados/troubleshooting/community.rst | 29 - .../doc/rados/troubleshooting/cpu-profiling.rst | 67 --- src/ceph/doc/rados/troubleshooting/index.rst | 19 - .../doc/rados/troubleshooting/log-and-debug.rst | 550 ----------------- .../doc/rados/troubleshooting/memory-profiling.rst | 142 ----- .../rados/troubleshooting/troubleshooting-mon.rst | 567 ----------------- .../rados/troubleshooting/troubleshooting-osd.rst | 536 ----------------- .../rados/troubleshooting/troubleshooting-pg.rst | 668 --------------------- 8 files changed, 2578 deletions(-) delete mode 100644 src/ceph/doc/rados/troubleshooting/community.rst delete mode 100644 src/ceph/doc/rados/troubleshooting/cpu-profiling.rst delete mode 100644 src/ceph/doc/rados/troubleshooting/index.rst delete mode 100644 src/ceph/doc/rados/troubleshooting/log-and-debug.rst delete mode 100644 src/ceph/doc/rados/troubleshooting/memory-profiling.rst delete mode 100644 src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst delete mode 100644 src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst delete mode 100644 src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst (limited to 'src/ceph/doc/rados/troubleshooting') diff --git a/src/ceph/doc/rados/troubleshooting/community.rst b/src/ceph/doc/rados/troubleshooting/community.rst deleted file mode 100644 index 9faad13..0000000 --- a/src/ceph/doc/rados/troubleshooting/community.rst +++ /dev/null @@ -1,29 +0,0 @@ -==================== - The Ceph Community -==================== - -The Ceph community is an excellent source of information and help. For -operational issues with Ceph releases we recommend you `subscribe to the -ceph-users email list`_. When you no longer want to receive emails, you can -`unsubscribe from the ceph-users email list`_. - -You may also `subscribe to the ceph-devel email list`_. You should do so if -your issue is: - -- Likely related to a bug -- Related to a development release package -- Related to a development testing package -- Related to your own builds - -If you no longer want to receive emails from the ``ceph-devel`` email list, you -may `unsubscribe from the ceph-devel email list`_. - -.. tip:: The Ceph community is growing rapidly, and community members can help - you if you provide them with detailed information about your problem. You - can attach the output of the ``ceph report`` command to help people understand your issues. - -.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel -.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel -.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com -.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com -.. 
_ceph-devel: ceph-devel@vger.kernel.org \ No newline at end of file diff --git a/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst deleted file mode 100644 index 159f799..0000000 --- a/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst +++ /dev/null @@ -1,67 +0,0 @@ -=============== - CPU Profiling -=============== - -If you built Ceph from source and compiled Ceph for use with `oprofile`_ -you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. - - -Initializing oprofile -===================== - -The first time you use ``oprofile`` you need to initialize it. Locate the -``vmlinux`` image corresponding to the kernel you are now running. :: - - ls /boot - sudo opcontrol --init - sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 - - -Starting oprofile -================= - -To start ``oprofile`` execute the following command:: - - opcontrol --start - -Once you start ``oprofile``, you may run some tests with Ceph. - - -Stopping oprofile -================= - -To stop ``oprofile`` execute the following command:: - - opcontrol --stop - - -Retrieving oprofile Results -=========================== - -To retrieve the top ``cmon`` results, execute the following command:: - - opreport -gal ./cmon | less - - -To retrieve the top ``cmon`` results with call graphs attached, execute the -following command:: - - opreport -cal ./cmon | less - -.. important:: After reviewing results, you should reset ``oprofile`` before - running it again. Resetting ``oprofile`` removes data from the session - directory. - - -Resetting oprofile -================== - -To reset ``oprofile``, execute the following command:: - - sudo opcontrol --reset - -.. important:: You should reset ``oprofile`` after analyzing data so that - you do not commingle results from different tests. - -.. _oprofile: http://oprofile.sourceforge.net/about/ -.. _Installing Oprofile: ../../../dev/cpu-profiler diff --git a/src/ceph/doc/rados/troubleshooting/index.rst b/src/ceph/doc/rados/troubleshooting/index.rst deleted file mode 100644 index 80d14f3..0000000 --- a/src/ceph/doc/rados/troubleshooting/index.rst +++ /dev/null @@ -1,19 +0,0 @@ -================= - Troubleshooting -================= - -Ceph is still on the leading edge, so you may encounter situations that require -you to examine your configuration, modify your logging output, troubleshoot -monitors and OSDs, profile memory and CPU usage, and reach out to the -Ceph community for help. - -.. toctree:: - :maxdepth: 1 - - community - log-and-debug - troubleshooting-mon - troubleshooting-osd - troubleshooting-pg - memory-profiling - cpu-profiling diff --git a/src/ceph/doc/rados/troubleshooting/log-and-debug.rst b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst deleted file mode 100644 index c91f272..0000000 --- a/src/ceph/doc/rados/troubleshooting/log-and-debug.rst +++ /dev/null @@ -1,550 +0,0 @@ -======================= - Logging and Debugging -======================= - -Typically, when you add debugging to your Ceph configuration, you do so at -runtime. You can also add Ceph debug logging to your Ceph configuration file if -you are encountering issues when starting your cluster. You may view Ceph log -files under ``/var/log/ceph`` (the default location). - -.. tip:: When debug output slows down your system, the latency can hide - race conditions. - -Logging is resource intensive. 
If you are encountering a problem in a specific -area of your cluster, enable logging for that area of the cluster. For example, -if your OSDs are running fine, but your metadata servers are not, you should -start by enabling debug logging for the specific metadata server instance(s) -giving you trouble. Enable logging for each subsystem as needed. - -.. important:: Verbose logging can generate over 1GB of data per hour. If your - OS disk reaches its capacity, the node will stop working. - -If you enable or increase the rate of Ceph logging, ensure that you have -sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for -details on rotating log files. When your system is running well, remove -unnecessary debugging settings to ensure your cluster runs optimally. Logging -debug output messages is relatively slow, and a waste of resources when -operating your cluster. - -See `Subsystem, Log and Debug Settings`_ for details on available settings. - -Runtime -======= - -If you would like to see the configuration settings at runtime, you must log -in to a host with a running daemon and execute the following:: - - ceph daemon {daemon-name} config show | less - -For example,:: - - ceph daemon osd.0 config show | less - -To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the -``ceph tell`` command to inject arguments into the runtime configuration:: - - ceph tell {daemon-type}.{daemon id or *} injectargs --{name} {value} [--{name} {value}] - -Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply -the runtime setting to all daemons of a particular type with ``*``, or specify -a specific daemon's ID. For example, to increase -debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: - - ceph tell osd.0 injectargs --debug-osd 0/5 - -The ``ceph tell`` command goes through the monitors. If you cannot bind to the -monitor, you can still make the change by logging into the host of the daemon -whose configuration you'd like to change using ``ceph daemon``. -For example:: - - sudo ceph daemon osd.0 config set debug_osd 0/5 - -See `Subsystem, Log and Debug Settings`_ for details on available settings. - - -Boot Time -========= - -To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must -add settings to your Ceph configuration file. Subsystems common to each daemon -may be set under ``[global]`` in your configuration file. Subsystems for -particular daemons are set under the daemon section in your configuration file -(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: - - [global] - debug ms = 1/5 - - [mon] - debug mon = 20 - debug paxos = 1/5 - debug auth = 2 - - [osd] - debug osd = 1/5 - debug filestore = 1/5 - debug journal = 1 - debug monc = 5/20 - - [mds] - debug mds = 1 - debug mds balancer = 1 - - -See `Subsystem, Log and Debug Settings`_ for details. - - -Accelerating Log Rotation -========================= - -If your OS disk is relatively full, you can accelerate log rotation by modifying -the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting -after the rotation frequency to accelerate log rotation (via cronjob) if your -logs exceed the size setting. For example, the default setting looks like -this:: - - rotate 7 - weekly - compress - sharedscripts - -Modify it by adding a ``size`` setting. :: - - rotate 7 - weekly - size 500M - compress - sharedscripts - -Then, start the crontab editor for your user space. 
::

    crontab -e

Finally, add an entry to check the ``/etc/logrotate.d/ceph`` file. ::

    30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1

The preceding example checks the ``/etc/logrotate.d/ceph`` file every 30 minutes.


Valgrind
========

Debugging may also require you to track down memory and threading issues.
You can run a single daemon, a type of daemon, or the whole cluster with
Valgrind. You should only use Valgrind when developing or debugging Ceph;
Valgrind is computationally expensive and will otherwise slow down your system.
Valgrind messages are logged to ``stderr``.


Subsystem, Log and Debug Settings
=================================

In most cases, you will enable debug logging output via subsystems.

Ceph Subsystems
---------------

Each subsystem has a logging level for its output logs and for its in-memory
logs. You may set different values for each of these subsystems by setting a
log file level and a memory level for debug logging. Ceph's logging levels
operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is
verbose [#]_. In general, the in-memory logs are not sent to the output log
unless:

- a fatal signal is raised, or
- an ``assert`` in the source code is triggered, or
- it is explicitly requested, for example through the admin socket (consult the
  admin socket documentation for more details).

A debug logging setting can take a single value for the log level and the
memory level, which sets them both to the same value. For example, if you
specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level
of ``5``. You may also specify them separately. The first setting is the log
level, and the second setting is the memory level. You must separate them with
a forward slash (/). For example, if you want to set the ``ms`` subsystem's
debug logging level to ``1`` and its memory level to ``5``, you would specify it
as ``debug ms = 1/5``. For example:

.. code-block:: ini

    debug {subsystem} = {log-level}/{memory-level}
    # for example
    debug mds balancer = 1/20

The following table provides a list of Ceph subsystems and their default log and
memory levels. Once you complete your logging efforts, restore the subsystems
to their default level or to a level suitable for normal operations.
- - -+--------------------+-----------+--------------+ -| Subsystem | Log Level | Memory Level | -+====================+===========+==============+ -| ``default`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``lockdep`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``context`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``crush`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds balancer`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds locker`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds log`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds log expire`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``mds migrator`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``buffer`` | 0 | 0 | -+--------------------+-----------+--------------+ -| ``timer`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``filer`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``objecter`` | 0 | 0 | -+--------------------+-----------+--------------+ -| ``rados`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``rbd`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``journaler`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``objectcacher`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``client`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``osd`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``optracker`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``objclass`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``filestore`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``journal`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``ms`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``mon`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``monc`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``paxos`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``tp`` | 0 | 5 | -+--------------------+-----------+--------------+ -| ``auth`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``finisher`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``heartbeatmap`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``perfcounter`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``rgw`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``javaclient`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``asok`` | 1 | 5 | -+--------------------+-----------+--------------+ -| ``throttle`` | 1 | 5 | -+--------------------+-----------+--------------+ - - -Logging Settings ----------------- - -Logging and debugging settings are not required in a Ceph configuration file, -but you may override default settings as needed. Ceph supports the following -settings: - - -``log file`` - -:Description: The location of the logging file for your cluster. -:Type: String -:Required: No -:Default: ``/var/log/ceph/$cluster-$name.log`` - - -``log max new`` - -:Description: The maximum number of new log files. 
:Type: Integer
:Required: No
:Default: ``1000``


``log max recent``

:Description: The maximum number of recent events to include in a log file.
:Type: Integer
:Required: No
:Default: ``1000000``


``log to stderr``

:Description: Determines if logging messages should appear in ``stderr``.
:Type: Boolean
:Required: No
:Default: ``true``


``err to stderr``

:Description: Determines if error messages should appear in ``stderr``.
:Type: Boolean
:Required: No
:Default: ``true``


``log to syslog``

:Description: Determines if logging messages should appear in ``syslog``.
:Type: Boolean
:Required: No
:Default: ``false``


``err to syslog``

:Description: Determines if error messages should appear in ``syslog``.
:Type: Boolean
:Required: No
:Default: ``false``


``log flush on exit``

:Description: Determines if Ceph should flush the log files after exit.
:Type: Boolean
:Required: No
:Default: ``true``


``clog to monitors``

:Description: Determines if ``clog`` messages should be sent to monitors.
:Type: Boolean
:Required: No
:Default: ``true``


``clog to syslog``

:Description: Determines if ``clog`` messages should be sent to syslog.
:Type: Boolean
:Required: No
:Default: ``false``


``mon cluster log to syslog``

:Description: Determines if the cluster log should be output to the syslog.
:Type: Boolean
:Required: No
:Default: ``false``


``mon cluster log file``

:Description: The location of the cluster's log file.
:Type: String
:Required: No
:Default: ``/var/log/ceph/$cluster.log``



OSD
---


``osd debug drop ping probability``

:Description: The probability that the OSD will deliberately drop a heartbeat
              ping message (for debugging and testing only).
:Type: Double
:Required: No
:Default: 0


``osd debug drop ping duration``

:Description: How long the OSD keeps dropping heartbeat ping messages once a
              drop has been triggered (for debugging and testing only).
:Type: Integer
:Required: No
:Default: 0

``osd debug drop pg create probability``

:Description: The probability that the OSD will deliberately drop a PG create
              message (for debugging and testing only).
:Type: Integer
:Required: No
:Default: 0

``osd debug drop pg create duration``

:Description: How long the OSD keeps dropping PG create messages once a drop
              has been triggered (for debugging and testing only).
:Type: Double
:Required: No
:Default: 1


``osd tmapput sets uses tmap``

:Description: Causes ``tmapput`` operations to use ``tmap``. For debugging only.
:Type: Boolean
:Required: No
:Default: ``false``


``osd min pg log entries``

:Description: The minimum number of log entries for placement groups.
:Type: 32-bit Unsigned Integer
:Required: No
:Default: 1000


``osd op log threshold``

:Description: How many operations log messages to display at a time.
:Type: Integer
:Required: No
:Default: 5



Filestore
---------

``filestore debug omap check``

:Description: Enables a debugging check on ``omap`` synchronization. This is an
              expensive operation.
:Type: Boolean
:Required: No
:Default: ``false``


MDS
---


``mds debug scatterstat``

:Description: Ceph will assert that various recursive stat invariants are true
              (for developers only).
:Type: Boolean
:Required: No
:Default: ``false``


``mds debug frag``

:Description: Ceph will verify directory fragmentation invariants when
              convenient (for developers only).
:Type: Boolean
:Required: No
:Default: ``false``


``mds debug auth pins``

:Description: Enables debug checks on auth pin invariants (for developers only).
:Type: Boolean
:Required: No
:Default: ``false``


``mds debug subtrees``

:Description: Enables debug checks on subtree invariants (for developers only).
:Type: Boolean
:Required: No
:Default: ``false``



RADOS Gateway
-------------


``rgw log nonexistent bucket``

:Description: Determines whether RGW should log requests for buckets that do not exist.
-:Type: Boolean -:Required: No -:Default: ``false`` - - -``rgw log object name`` - -:Description: Should an object's name be logged. // man date to see codes (a subset are supported) -:Type: String -:Required: No -:Default: ``%Y-%m-%d-%H-%i-%n`` - - -``rgw log object name utc`` - -:Description: Object log name contains UTC? -:Type: Boolean -:Required: No -:Default: ``false`` - - -``rgw enable ops log`` - -:Description: Enables logging of every RGW operation. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``rgw enable usage log`` - -:Description: Enable logging of RGW's bandwidth usage. -:Type: Boolean -:Required: No -:Default: ``true`` - - -``rgw usage log flush threshold`` - -:Description: Threshold to flush pending log data. -:Type: Integer -:Required: No -:Default: ``1024`` - - -``rgw usage log tick interval`` - -:Description: Flush pending log data every ``s`` seconds. -:Type: Integer -:Required: No -:Default: 30 - - -``rgw intent log object name`` - -:Description: -:Type: String -:Required: No -:Default: ``%Y-%m-%d-%i-%n`` - - -``rgw intent log object name utc`` - -:Description: Include a UTC timestamp in the intent log object name. -:Type: Boolean -:Required: No -:Default: ``false`` - -.. [#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/src/ceph/doc/rados/troubleshooting/memory-profiling.rst b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst deleted file mode 100644 index e2396e2..0000000 --- a/src/ceph/doc/rados/troubleshooting/memory-profiling.rst +++ /dev/null @@ -1,142 +0,0 @@ -================== - Memory Profiling -================== - -Ceph MON, OSD and MDS can generate heap profiles using -``tcmalloc``. To generate heap profiles, ensure you have -``google-perftools`` installed:: - - sudo apt-get install google-perftools - -The profiler dumps output to your ``log file`` directory (i.e., -``/var/log/ceph``). See `Logging and Debugging`_ for details. -To view the profiler logs with Google's performance tools, execute the -following:: - - google-pprof --text {path-to-daemon} {log-path/filename} - -For example:: - - $ ceph tell osd.0 heap start_profiler - $ ceph tell osd.0 heap dump - osd.0 tcmalloc heap stats:------------------------------------------------ - MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application - MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist - MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist - MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist - MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists - MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata - MALLOC: ------------ - MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) - MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) - MALLOC: ------------ - MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used - MALLOC: - MALLOC: 231 Spans in use - MALLOC: 56 Thread heaps in use - MALLOC: 8192 Tcmalloc page size - ------------------------------------------------ - Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). - Bytes released to the OS take up virtual address space but no physical memory. - $ google-pprof --text \ - /usr/bin/ceph-osd \ - /var/log/ceph/ceph-osd.0.profile.0001.heap - Total: 3.7 MB - 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry - 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create - 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe - 0.0 0.4% 99.2% 0.0 0.6% decode_message - ... 
Another heap dump on the same daemon will add another file. It is
convenient to compare to a previous heap dump to show what has grown
in the interval. For instance::

    $ google-pprof --text --base out/osd.0.profile.0001.heap \
      ceph-osd out/osd.0.profile.0003.heap
    Total: 0.2 MB
    0.1  50.3% 50.3%  0.1  50.3% ceph::log::Log::create_entry
    0.1  46.6% 96.8%  0.1  46.6% std::string::_Rep::_S_create
    0.0   0.9% 97.7%  0.0  26.1% ReplicatedPG::do_op
    0.0   0.8% 98.5%  0.0   0.8% __gnu_cxx::new_allocator::allocate

Refer to `Google Heap Profiler`_ for additional details.

Once you have the heap profiler installed, start your cluster and
begin using the heap profiler. You may enable or disable the heap
profiler at runtime, or ensure that it runs continuously. For the
following command line usage, replace ``{daemon-type}`` with ``mon``,
``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or
the MON or MDS id.


Starting the Profiler
---------------------

To start the heap profiler, execute the following::

    ceph tell {daemon-type}.{daemon-id} heap start_profiler

For example::

    ceph tell osd.1 heap start_profiler

Alternatively, the profiler can be started when the daemon starts
running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in
the environment.

Printing Stats
--------------

To print out statistics, execute the following::

    ceph tell {daemon-type}.{daemon-id} heap stats

For example::

    ceph tell osd.0 heap stats

.. note:: Printing stats does not require the profiler to be running and does
   not dump the heap allocation information to a file.


Dumping Heap Information
------------------------

To dump heap information, execute the following::

    ceph tell {daemon-type}.{daemon-id} heap dump

For example::

    ceph tell mds.a heap dump

.. note:: Dumping heap information only works when the profiler is running.


Releasing Memory
----------------

To release memory that ``tcmalloc`` has allocated but which is not being used by
the Ceph daemon itself, execute the following::

    ceph tell {daemon-type}.{daemon-id} heap release

For example::

    ceph tell osd.2 heap release


Stopping the Profiler
---------------------

To stop the heap profiler, execute the following::

    ceph tell {daemon-type}.{daemon-id} heap stop_profiler

For example::

    ceph tell osd.0 heap stop_profiler

.. _Logging and Debugging: ../log-and-debug
.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html
diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst
deleted file mode 100644
index 89fb94c..0000000
--- a/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst
+++ /dev/null
@@ -1,567 +0,0 @@
=================================
 Troubleshooting Monitors
=================================

.. index:: monitor, high availability

When a cluster encounters monitor-related trouble, there is a tendency to
panic, and sometimes with good reason. Keep in mind that losing one monitor,
or even several, does not necessarily mean that your cluster is down, as long
as a majority of monitors is up and running with a formed quorum. Regardless
of how bad the situation is, the first thing you should do is stay calm, take
a breath, and work through the initial troubleshooting questions below.
Initial Troubleshooting
========================


**Are the monitors running?**

  First of all, make sure the monitors are running. You would be amazed by how
  often people forget to run the monitors, or forget to restart them after an
  upgrade. There is no shame in that, but let's try not to lose a couple of
  hours chasing an issue that is not there.

**Are you able to connect to the monitors' servers?**

  It doesn't happen often, but sometimes there are ``iptables`` rules that
  block access to the monitor servers or monitor ports, usually leftovers from
  monitor stress-testing that were forgotten at some point. Try sshing into
  the server and, if that succeeds, try connecting to the monitor's port
  using your tool of choice (telnet, nc, ...).

**Does ceph -s run and obtain a reply from the cluster?**

  If the answer is yes, then your cluster is up and running. One thing you
  can take for granted is that the monitors will only answer a ``status``
  request if there is a formed quorum.

  If ``ceph -s`` blocks, however, without obtaining a reply from the cluster
  or showing a lot of ``fault`` messages, then it is likely that your monitors
  are either down completely or only a portion of them is up -- a portion that
  is not enough to form a quorum (keep in mind that a quorum is formed by a
  majority of monitors).

**What if ceph -s doesn't finish?**

  If you haven't gone through all the steps so far, please go back and
  complete them.

  If you are running Emperor 0.72-rc1 or later, you can contact each monitor
  individually and ask it for its status, regardless of whether a quorum has
  formed. This can be achieved using ``ceph ping mon.ID``, ID being the
  monitor's identifier. You should perform this for each monitor in the
  cluster. In section `Understanding mon_status`_ we will explain how to
  interpret the output of this command.

  On earlier versions, you will need to ssh into the server and use the
  monitor's admin socket. Please jump to
  `Using the monitor's admin socket`_.

For other specific issues, keep on reading.


Using the monitor's admin socket
=================================

The admin socket allows you to interact with a given daemon directly using a
Unix socket file. This file can be found in your monitor's ``run`` directory.
By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``,
but this can vary if you defined it otherwise. If you don't find it there,
check your ``ceph.conf`` for an alternative path, or run::

    ceph-conf --name mon.ID --show-config-value admin_socket

Bear in mind that the admin socket is only available while the monitor is
running. When the monitor is properly shut down, the admin socket is removed.
If the monitor is not running but the admin socket still persists, it is
likely that the monitor was improperly shut down. Regardless, if the monitor
is not running, you will not be able to use the admin socket, and ``ceph`` is
likely to return ``Error 111: Connection Refused``.

Accessing the admin socket is as simple as telling the ``ceph`` tool to use
the ``asok`` file. In pre-Dumpling Ceph, this can be achieved by::

    ceph --admin-daemon /var/run/ceph/ceph-mon.{id}.asok {command}

while in Dumpling and beyond you can use the alternate (and recommended)
format::

    ceph daemon mon.{id} {command}

Using ``help`` as the command to the ``ceph`` tool will show you the
supported commands available through the admin socket.
Please take a look -at ``config get``, ``config show``, ``mon_status`` and ``quorum_status``, -as those can be enlightening when troubleshooting a monitor. - - -Understanding mon_status -========================= - -``mon_status`` can be obtained through the ``ceph`` tool when you have -a formed quorum, or via the admin socket if you don't. This command will -output a multitude of information about the monitor, including the same -output you would get with ``quorum_status``. - -Take the following example of ``mon_status``:: - - - { "name": "c", - "rank": 2, - "state": "peon", - "election_epoch": 38, - "quorum": [ - 1, - 2], - "outside_quorum": [], - "extra_probe_peers": [], - "sync_provider": [], - "monmap": { "epoch": 3, - "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", - "modified": "2013-10-30 04:12:01.945629", - "created": "2013-10-29 14:14:41.914786", - "mons": [ - { "rank": 0, - "name": "a", - "addr": "127.0.0.1:6789\/0"}, - { "rank": 1, - "name": "b", - "addr": "127.0.0.1:6790\/0"}, - { "rank": 2, - "name": "c", - "addr": "127.0.0.1:6795\/0"}]}} - -A couple of things are obvious: we have three monitors in the monmap (*a*, *b* -and *c*), the quorum is formed by only two monitors, and *c* is in the quorum -as a *peon*. - -Which monitor is out of the quorum? - - The answer would be **a**. - -Why? - - Take a look at the ``quorum`` set. We have two monitors in this set: *1* - and *2*. These are not monitor names. These are monitor ranks, as established - in the current monmap. We are missing the monitor with rank 0, and according - to the monmap that would be ``mon.a``. - -By the way, how are ranks established? - - Ranks are (re)calculated whenever you add or remove monitors and follow a - simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the - rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all - the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. - -Most Common Monitor Issues -=========================== - -Have Quorum but at least one Monitor is down ---------------------------------------------- - -When this happens, depending on the version of Ceph you are running, -you should be seeing something similar to:: - - $ ceph health detail - [snip] - mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) - -How to troubleshoot this? - - First, make sure ``mon.a`` is running. - - Second, make sure you are able to connect to ``mon.a``'s server from the - other monitors' servers. Check the ports as well. Check ``iptables`` on - all your monitor nodes and make sure you are not dropping/rejecting - connections. - - If this initial troubleshooting doesn't solve your problems, then it's - time to go deeper. - - First, check the problematic monitor's ``mon_status`` via the admin - socket as explained in `Using the monitor's admin socket`_ and - `Understanding mon_status`_. - - Considering the monitor is out of the quorum, its state should be one of - ``probing``, ``electing`` or ``synchronizing``. If it happens to be either - ``leader`` or ``peon``, then the monitor believes to be in quorum, while - the remaining cluster is sure it is not; or maybe it got into the quorum - while we were troubleshooting the monitor, so check you ``ceph -s`` again - just to make sure. Proceed if the monitor is not yet in the quorum. - -What if the state is ``probing``? - - This means the monitor is still looking for the other monitors. 
Every time you start a monitor, the monitor will stay in this state for
  some time while trying to find the rest of the monitors specified in the
  ``monmap``. The time a monitor spends in this state can vary. For instance,
  on a single-monitor cluster, the monitor will pass through the probing state
  almost instantaneously, since there are no other monitors around. On a
  multi-monitor cluster, the monitors will stay in this state until they find
  enough monitors to form a quorum -- this means that if you have 2 out of 3
  monitors down, the one remaining monitor will stay in this state
  indefinitely until you bring one of the other monitors up.

  If you have a quorum, however, the monitor should be able to find the
  remaining monitors pretty fast, as long as they can be reached. If your
  monitor is stuck probing and you have gone through all the communication
  troubleshooting, then there is a fair chance that the monitor is trying
  to reach the other monitors at a wrong address. ``mon_status`` outputs the
  ``monmap`` known to the monitor: check whether the other monitors' locations
  match reality. If they don't, jump to
  `Recovering a Monitor's Broken monmap`_; if they do, then it may be related
  to severe clock skews amongst the monitor nodes, and you should refer to
  `Clock Skews`_ first. If that doesn't solve your problem, then it is time
  to prepare some logs and reach out to the community (please refer to
  `Preparing your logs`_ on how to best prepare your logs).


What if state is ``electing``?

  This means the monitor is in the middle of an election. Elections should be
  fast to complete, but at times the monitors can get stuck electing. This
  is usually a sign of a clock skew among the monitor nodes; jump to
  `Clock Skews`_ for more information. If all your clocks are properly
  synchronized, it is best to prepare some logs and reach out to the
  community. This is not a state that is likely to persist, and aside from
  (*really*) old bugs there is no obvious reason besides clock skews why
  this would happen.

What if state is ``synchronizing``?

  This means the monitor is synchronizing with the rest of the cluster in
  order to join the quorum. The smaller your monitor store, the faster the
  synchronization process, so if you have a big store it may take a while.
  Don't worry; it should finish soon enough.

  However, if you notice that the monitor jumps from ``synchronizing`` to
  ``electing`` and then back to ``synchronizing``, then you do have a
  problem: the cluster state is advancing (i.e., generating new maps) too
  fast for the synchronization process to keep up. This could happen on early
  Cuttlefish, but the synchronization process has since been refactored and
  enhanced to avoid exactly this sort of behavior. If this happens on later
  versions, let us know -- and bring some logs
  (see `Preparing your logs`_).

What if state is ``leader`` or ``peon``?

  This should not happen. If it does happen, however, it likely has a lot to
  do with clock skews -- see `Clock Skews`_. If you are not suffering from
  clock skews, then please prepare your logs (see `Preparing your logs`_) and
  reach out to us.
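When working through the states above, it can help to see every monitor's view
at once. The following is a minimal shell sketch (assuming three monitors named
``a``, ``b`` and ``c`` with default admin socket paths; adjust the IDs to match
your cluster) that polls each monitor's admin socket for its current state::

    # Print each monitor's current state: probing, electing,
    # synchronizing, leader or peon.
    for id in a b c; do
        printf 'mon.%s: ' "$id"
        sudo ceph daemon mon.$id mon_status 2>/dev/null \
            | grep '"state"' || echo 'unreachable (daemon down?)'
    done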
Recovering a Monitor's Broken monmap
-------------------------------------

This is what a ``monmap`` usually looks like, depending on the number of
monitors::

    epoch 3
    fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
    last_changed 2013-10-30 04:12:01.945629
    created 2013-10-29 14:14:41.914786
    0: 127.0.0.1:6789/0 mon.a
    1: 127.0.0.1:6790/0 mon.b
    2: 127.0.0.1:6795/0 mon.c

However, this may not be what you have. For instance, some early Cuttlefish
versions had a bug that could cause your ``monmap`` to be nullified:
completely filled with zeros. In that case, not even ``monmaptool`` would be
able to read it, because it cannot make sense of all zeros. At other times,
you may end up with a monitor with a severely outdated monmap that is unable
to find the remaining monitors (e.g., say ``mon.c`` is down; you add a new
monitor ``mon.d``, then remove ``mon.a``, then add a new monitor ``mon.e`` and
remove ``mon.b``; you will end up with a totally different monmap from the one
``mon.c`` knows).

In these situations, you have two possible solutions:

Scrap the monitor and create a new one

  You should only take this route if you are positive that you won't lose
  the information kept by that monitor; that you have other monitors and
  that they are running just fine, so that your new monitor is able to
  synchronize from the remaining monitors. Keep in mind that destroying a
  monitor, if there are no other copies of its contents, may lead to loss
  of data.

Inject a monmap into the monitor

  This is usually the safest path. You should grab the monmap from the
  remaining monitors and inject it into the monitor with the corrupted or
  lost monmap.

  These are the basic steps:

  1. Is there a formed quorum? If so, grab the monmap from the quorum::

      $ ceph mon getmap -o /tmp/monmap

  2. No quorum? Grab the monmap directly from another monitor (this
     assumes the monitor you are grabbing the monmap from has id ID-FOO
     and has been stopped)::

      $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

  3. Stop the monitor you are going to inject the monmap into.

  4. Inject the monmap::

      $ ceph-mon -i ID --inject-monmap /tmp/monmap

  5. Start the monitor.

  Keep in mind that the ability to inject monmaps is a powerful feature that
  can cause havoc with your monitors if misused, as it will overwrite the
  latest existing monmap kept by the monitor.


Clock Skews
------------

Monitors can be severely affected by significant clock skews across the
monitor nodes. This usually translates into weird behavior with no obvious
cause. To avoid such issues, you should run a clock synchronization tool
on your monitor nodes.


What's the maximum tolerated clock skew?

  By default the monitors will allow clocks to drift up to ``0.05 seconds``.


Can I increase the maximum tolerated clock skew?

  This value is configurable via the ``mon clock drift allowed`` option, and
  although you *CAN* change it, that doesn't mean you *SHOULD*. The clock skew
  mechanism is in place because monitors with skewed clocks may not behave
  properly. We, as developers and QA aficionados, are comfortable with the
  current default value, as it will alert the user before the monitors get
  out of hand. Changing this value without testing it first may cause
  unforeseen effects on the stability of the monitors and overall cluster
  health, although there is no risk of data loss.


How do I know there's a clock skew?
  The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health
  detail`` should show something in the form of::

      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

  That means that ``mon.c`` has been flagged as suffering from a clock skew.


What should I do if there's a clock skew?

  Synchronize your clocks. Running an NTP client may help. If you are already
  using one and you still hit this sort of issue, check whether you are using
  an NTP server remote to your network, and consider hosting your own NTP
  server on your network. This last option tends to reduce the number of
  issues with monitor clock skews.


Client Can't Connect or Mount
------------------------------

Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
``iptables``. The rule rejects all clients trying to connect to the host except
for ``ssh``. If your monitor host's IP tables have such a ``REJECT`` rule in
place, clients connecting from a separate node will fail to mount with a timeout
error. You need to address ``iptables`` rules that reject clients trying to
connect to Ceph daemons. For example, you would need to address rules that look
like this appropriately::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors (i.e., port
6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For
example::

    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT

Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

A Ceph monitor stores the `cluster map`_ in a key/value store such as LevelDB.
If a monitor fails due to key/value store corruption, the following error
messages might be found in the monitor log::

    Corruption: error in middle of record

or::

    Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are any survivors, we can always `replace`_ the corrupted monitor
with a new one. After booting up, the new joiner will synchronize with a
healthy peer, and once it is fully synchronized, it will be able to serve
clients.

Recovery using OSDs
-------------------

But what if all monitors fail at the same time? Since users are encouraged to
deploy at least three monitors in a Ceph cluster, the chance of simultaneous
failure is rare. But unplanned power-downs in a data center with improperly
configured disk/fs settings could fail the underlying filesystem, and hence
kill all the monitors. In this case, we can recover the monitor store with the
information stored in OSDs::

    ms=/tmp/mon-store
    mkdir $ms
    # collect the cluster map from OSDs
    for host in $hosts; do
      rsync -avz $ms user@host:$ms
      rm -rf $ms
      ssh user@host <

Check your networks to ensure they
are running properly, because networks may have a significant impact on OSD
operation and performance.



Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).
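For example, the following read-only commands are a typical first pass and are
safe to run on a live cluster; the exact output format varies by release::

    ceph osd tree          # OSD topology, weights, and up/down state
    ceph osd stat          # summary of how many OSDs are up and in
    ceph pg stat           # placement group summary
    ceph health detail     # which OSDs and PGs are implicated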
- - -Ceph Logs ---------- - -If you haven't changed the default path, you can find Ceph log files at -``/var/log/ceph``:: - - ls /var/log/ceph - -If you don't get enough log detail, you can change your logging level. See -`Logging and Debugging`_ for details to ensure that Ceph performs adequately -under high logging volume. - - -Admin Socket ------------- - -Use the admin socket tool to retrieve runtime information. For details, list -the sockets for your Ceph processes:: - - ls /var/run/ceph - -Then, execute the following, replacing ``{daemon-name}`` with an actual -daemon (e.g., ``osd.0``):: - - ceph daemon osd.0 help - -Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: - - ceph daemon {socket-file} help - - -The admin socket, among other things, allows you to: - -- List your configuration at runtime -- Dump historic operations -- Dump the operation priority queue state -- Dump operations in flight -- Dump perfcounters - - -Display Freespace ------------------ - -Filesystem issues may arise. To display your filesystem's free space, execute -``df``. :: - - df -h - -Execute ``df --help`` for additional usage. - - -I/O Statistics --------------- - -Use `iostat`_ to identify I/O-related issues. :: - - iostat -x - - -Diagnostic Messages -------------------- - -To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep`` -or ``tail``. For example:: - - dmesg | grep scsi - - -Stopping w/out Rebalancing -========================== - -Periodically, you may need to perform maintenance on a subset of your cluster, -or resolve a problem that affects a failure domain (e.g., a rack). If you do not -want CRUSH to automatically rebalance the cluster as you stop OSDs for -maintenance, set the cluster to ``noout`` first:: - - ceph osd set noout - -Once the cluster is set to ``noout``, you can begin stopping the OSDs within the -failure domain that requires maintenance work. :: - - stop ceph-osd id={num} - -.. note:: Placement groups within the OSDs you stop will become ``degraded`` - while you are addressing issues with within the failure domain. - -Once you have completed your maintenance, restart the OSDs. :: - - start ceph-osd id={num} - -Finally, you must unset the cluster from ``noout``. :: - - ceph osd unset noout - - - -.. _osd-not-running: - -OSD Not Running -=============== - -Under normal circumstances, simply restarting the ``ceph-osd`` daemon will -allow it to rejoin the cluster and recover. - -An OSD Won't Start ------------------- - -If you start your cluster and an OSD won't start, check the following: - -- **Configuration File:** If you were not able to get OSDs running from - a new installation, check your configuration file to ensure it conforms - (e.g., ``host`` not ``hostname``, etc.). - -- **Check Paths:** Check the paths in your configuration, and the actual - paths themselves for data and journals. If you separate the OSD data from - the journal data and there are errors in your configuration file or in the - actual mounts, you may have trouble starting OSDs. If you want to store the - journal on a block device, you should partition your journal disk and assign - one partition per OSD. - -- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be - hitting the default maximum number of threads (e.g., usually 32k), especially - during recovery. 
  You can increase the number of threads using ``sysctl`` to see whether
  raising the maximum number of threads to the highest allowed value
  (i.e., 4194303) helps. For example::

    sysctl -w kernel.pid_max=4194303

  If increasing the maximum thread count resolves the issue, you can make it
  permanent by including a ``kernel.pid_max`` setting in the
  ``/etc/sysctl.conf`` file. For example::

    kernel.pid_max = 4194303

- **Kernel Version:** Identify the kernel version and distribution you
  are using. Ceph uses some third party tools by default, which may be
  buggy or may conflict with certain distributions and/or kernel
  versions (e.g., Google perftools). Check the `OS recommendations`_
  to ensure you have addressed any issues related to your kernel.

- **Segmentation Fault:** If there is a segmentation fault, turn your logging
  up (if it is not already) and try again. If it segfaults again,
  contact the ceph-devel email list and provide your Ceph configuration
  file, your monitor output and the contents of your log file(s).



An OSD Failed
-------------

When a ``ceph-osd`` process dies, the monitor will learn about the failure
from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
command::

    ceph health
    HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are ``ceph-osd``
processes that are marked ``in`` and ``down``. You can identify which
``ceph-osds`` are ``down`` with::

    ceph health detail
    HEALTH_WARN 1/3 in osds are down
    osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

If there is a disk failure or other fault preventing ``ceph-osd`` from
functioning or restarting, an error message should be present in its log
file in ``/var/log/ceph``.

If the daemon stopped because of a heartbeat failure, the underlying
kernel file system may be unresponsive. Check ``dmesg`` output for disk
or other kernel errors.

If the problem is a software error (failed assertion or other
unexpected error), it should be reported to the `ceph-devel`_ email list.


No Free Drive Space
-------------------

Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster
is getting near its full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity, before it stops clients from writing data.
The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity, when it blocks backfills from starting. The
``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity,
when it generates a health warning.

Full cluster issues usually arise when testing how Ceph handles an OSD
failure on a small cluster. When one node has a high percentage of the
cluster's data, the cluster can easily exceed its nearfull and full ratios
immediately. If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
``mon osd nearfull ratio``.
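For example, on a disposable test cluster you might lower the thresholds at
runtime so that full-cluster behavior triggers sooner. This is only a sketch:
the values below are arbitrary test settings, and whether injected values take
effect immediately varies between releases, so prefer setting them in
``ceph.conf`` and restarting the monitors if they do not::

    # test-only values; do not use on a production cluster
    ceph tell mon.* injectargs '--mon-osd-nearfull-ratio 0.70'
    ceph tell mon.* injectargs '--mon-osd-backfillfull-ratio 0.75'
    ceph tell mon.* injectargs '--mon-osd-full-ratio 0.80'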
Full ``ceph-osds`` will be reported by ``ceph health``::

    ceph health
    HEALTH_WARN 1 nearfull osd(s)

Or::

    ceph health detail
    HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
    osd.3 is full at 97%
    osd.4 is backfill full at 91%
    osd.2 is near full at 87%

The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
the cluster to redistribute data to the newly available storage.

If you cannot start an OSD because it is full, you may delete some data by
deleting some placement group directories in the full OSD.

.. important:: If you choose to delete a placement group directory on a full
   OSD, **DO NOT** delete the same placement group directory on another full
   OSD, or **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of
   your data on at least one OSD.

See `Monitor Config Reference`_ for additional details.


OSDs are Slow/Unresponsive
==========================

A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
have eliminated other troubleshooting possibilities before delving into OSD
performance issues. For example, ensure that your networks are working properly
and your OSDs are running. Check to see if OSDs are throttling recovery
traffic.

.. tip:: Newer versions of Ceph provide better recovery handling by preventing
   recovering OSDs from using up so many system resources that ``up`` and
   ``in`` OSDs become unavailable or otherwise slow.


Networking Issues
-----------------

Ceph is a distributed storage system, so it depends upon networks to peer with
OSDs, replicate objects, recover from faults and check heartbeats. Networking
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
details.

Ensure that Ceph processes and Ceph-dependent processes are connected and/or
listening. ::

    netstat -a | grep ceph
    netstat -l | grep ceph
    sudo netstat -p | grep ceph

Check network statistics. ::

    netstat -s


Drive Configuration
-------------------

A storage drive should only support one OSD. Sequential read and sequential
write throughput can bottleneck if other processes share the drive, including
journals, operating systems, monitors, other OSDs and non-Ceph processes.

Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
option to accelerate the response time, particularly when using the ``XFS``
or ``ext4`` filesystems. By contrast, the ``btrfs`` filesystem can write and
journal simultaneously. (Note, however, that we recommend against using
``btrfs`` for production deployments.)

.. note:: Partitioning a drive does not change its total throughput or
   sequential read/write limits. Running a journal in a separate partition
   may help, but you should prefer a separate physical drive.


Bad Sectors / Fragmented Disk
-----------------------------

Check your disks for bad sectors and fragmentation. These can cause total
throughput to drop substantially.


Co-resident Monitors/OSDs
-------------------------

Monitors are generally light-weight processes, but they do lots of ``fsync()``,
which can interfere with other workloads, particularly if monitors run on the
same drive as your OSDs. Additionally, if you run monitors on the same host as
the OSDs, you may incur performance issues related to:

- Running an older kernel (pre-3.0)
- Running Argonaut with an old ``glibc``
- Running a kernel with no ``syncfs(2)`` syscall.
In these cases, multiple OSDs running on the same host can drag each other
down by doing lots of commits. That often leads to bursty writes.


Co-resident Processes
---------------------

Spinning up co-resident processes, such as cloud-based solutions, virtual
machines and other applications that write data to Ceph while operating on
the same hardware as OSDs, can introduce significant OSD latency. Generally,
we recommend optimizing a host for use with Ceph and using other hosts for
other processes. Separating Ceph operations from other applications may help
improve performance and may streamline troubleshooting and maintenance.


Logging Levels
--------------

If you turned logging levels up to track an issue and then forgot to turn
logging levels back down, the OSD may be putting a lot of logs onto the disk.
If you intend to keep logging levels high, you may consider mounting a drive
to the default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).


Recovery Throttling
-------------------

Depending upon your configuration, Ceph may reduce recovery rates to maintain
performance or it may increase recovery rates to the point that recovery
impacts OSD performance. Check to see if the OSD is recovering.


Kernel Version
--------------

Check the kernel version you are running. Older kernels may not receive
new backports that Ceph depends upon for better performance.


Kernel Issues with SyncFS
-------------------------

Try running one OSD per host to see if performance improves. Old kernels
might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.


Filesystem Issues
-----------------

Currently, we recommend deploying clusters with XFS.

We recommend against using btrfs or ext4. The btrfs filesystem has
many attractive features, but bugs in the filesystem may lead to
performance issues and spurious ENOSPC errors. We do not recommend
ext4 because xattr size limitations break our support for long object
names (needed for RGW).

For more information, see `Filesystem Recommendations`_.

.. _Filesystem Recommendations: ../configuration/filesystem-recommendations


Insufficient RAM
----------------

We recommend 1GB of RAM per OSD daemon. You may notice that during normal
operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
Unused RAM makes it tempting to use the excess RAM for co-resident
applications, VMs and so forth. However, when OSDs go into recovery mode,
their memory utilization spikes. If there is no RAM available, OSD
performance will slow considerably.


Old Requests or Slow Requests
-----------------------------

If a ``ceph-osd`` daemon is slow to respond to a request, it will generate
log messages complaining about requests that are taking too long. The warning
threshold defaults to 30 seconds, and is configurable via the
``osd op complaint time`` option. When this happens, the cluster log will
receive messages.
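For example, while investigating you can lower the threshold at runtime so
that marginal operations are also reported; this sketch uses the same
``injectargs`` mechanism described in `Logging and Debugging`_, applies it to
all OSDs, and assumes a temporary 10-second threshold::

    ceph tell osd.* injectargs '--osd-op-complaint-time 10'

Remember to restore the default (30 seconds) afterwards; a low threshold can
flood the cluster log with warnings.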
Legacy versions of Ceph complain about ``old requests``::

    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

New versions of Ceph complain about ``slow requests``::

    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]


Possible causes include:

- A bad drive (check ``dmesg`` output)
- A kernel file system bug (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.

Possible solutions:

- Remove VMs and other cloud solutions from Ceph hosts
- Upgrade the kernel
- Upgrade Ceph
- Restart OSDs

Debugging Slow Requests
-----------------------

If you run ``ceph daemon osd.{id} dump_historic_ops`` or
``ceph daemon osd.{id} dump_ops_in_flight``, you will see a set of operations
and a list of events each operation went through. These are briefly described
below.

Events from the Messenger layer:

- header_read: when the messenger first started reading the message off the wire
- throttled: when the messenger tried to acquire memory throttle space to read
  the message into memory
- all_read: when the messenger finished reading the message off the wire
- dispatched: when the messenger gave the message to the OSD
- initiated: identical to header_read; the existence of both is a historical
  oddity
- sub_op_commit_rec: the primary marks this when it hears that a particular
  replica has committed the op
- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops)

Many of these events are seemingly redundant, but cross important boundaries in
the internal code (such as passing data across locks into new threads).

Flapping OSDs
=============

We recommend using both a public (front-end) network and a cluster (back-end)
network so that you can better meet the capacity requirements of object
replication. Another advantage is that you can run a cluster network such that
it is not connected to the internet, thereby preventing some denial of service
attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.

However, if the cluster (back-end) network fails or develops significant
latency while the public (front-end) network operates optimally, OSDs
currently do not handle this situation well. What happens is that OSDs mark
each other ``down`` on the monitor, while marking themselves ``up``. We call
this scenario 'flapping'.

If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to stop the flapping with::

    ceph osd set noup      # prevent OSDs from getting marked up
    ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap structure::

    ceph osd dump | grep flags
    flags no-up,no-down

You can clear the flags with::

    ceph osd unset noup
    ceph osd unset nodown

Two other flags are supported, ``noin`` and ``noout``, which prevent
booting OSDs from being marked ``in`` (allocated data) or protect OSDs
from eventually being marked ``out`` (regardless of what the current value for
``mon osd down out interval`` is).

..
note:: ``noup``, ``noout``, and ``nodown`` are temporary in the - sense that once the flags are cleared, the action they were blocking - should occur shortly after. The ``noin`` flag, on the other hand, - prevents OSDs from being marked ``in`` on boot, and any daemons that - started while the flag was set will remain that way. - - - - - - -.. _iostat: http://en.wikipedia.org/wiki/Iostat -.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging -.. _Logging and Debugging: ../log-and-debug -.. _Debugging and Logging: ../debug -.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction -.. _Monitor Config Reference: ../../configuration/mon-config-ref -.. _monitoring your OSDs: ../../operations/monitoring-osd-pg -.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel -.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel -.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com -.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com -.. _OS recommendations: ../../../start/os-recommendations -.. _ceph-devel: ceph-devel@vger.kernel.org diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst deleted file mode 100644 index 4241fee..0000000 --- a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst +++ /dev/null @@ -1,668 +0,0 @@ -===================== - Troubleshooting PGs -===================== - -Placement Groups Never Get Clean -================================ - -When you create a cluster and your cluster remains in ``active``, -``active+remapped`` or ``active+degraded`` status and never achieve an -``active+clean`` status, you likely have a problem with your configuration. - -You may need to review settings in the `Pool, PG and CRUSH Config Reference`_ -and make appropriate adjustments. - -As a general rule, you should run your cluster with more than one OSD and a -pool size greater than 1 object replica. - -One Node Cluster ----------------- - -Ceph no longer provides documentation for operating on a single node, because -you would never deploy a system designed for distributed computing on a single -node. Additionally, mounting client kernel modules on a single node containing a -Ceph daemon may cause a deadlock due to issues with the Linux kernel itself -(unless you use VMs for the clients). You can experiment with Ceph in a 1-node -configuration, in spite of the limitations as described herein. - -If you are trying to create a cluster on a single node, you must change the -default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning -``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration -file before you create your monitors and OSDs. This tells Ceph that an OSD -can peer with another OSD on the same host. If you are trying to set up a -1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``, -Ceph will try to peer the PGs of one OSD with the PGs of another OSD on -another node, chassis, rack, row, or even datacenter depending on the setting. - -.. tip:: DO NOT mount kernel clients directly on the same node as your - Ceph Storage Cluster, because kernel conflicts can arise. However, you - can mount kernel clients within virtual machines (VMs) on a single node. - -If you are creating OSDs using a single disk, you must create directories -for the data manually first. 
For example:: - - mkdir /var/local/osd0 /var/local/osd1 - ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 - ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1 - - -Fewer OSDs than Replicas ------------------------- - -If you have brought up two OSDs to an ``up`` and ``in`` state, but you still -don't see ``active + clean`` placement groups, you may have an -``osd pool default size`` set to greater than ``2``. - -There are a few ways to address this situation. If you want to operate your -cluster in an ``active + degraded`` state with two replicas, you can set the -``osd pool default min size`` to ``2`` so that you can write objects in -an ``active + degraded`` state. You may also set the ``osd pool default size`` -setting to ``2`` so that you only have two stored replicas (the original and -one replica), in which case the cluster should achieve an ``active + clean`` -state. - -.. note:: You can make the changes at runtime. If you make the changes in - your Ceph configuration file, you may need to restart your cluster. - - -Pool Size = 1 -------------- - -If you have the ``osd pool default size`` set to ``1``, you will only have -one copy of the object. OSDs rely on other OSDs to tell them which objects -they should have. If a first OSD has a copy of an object and there is no -second copy, then no second OSD can tell the first OSD that it should have -that copy. For each placement group mapped to the first OSD (see -``ceph pg dump``), you can force the first OSD to notice the placement groups -it needs by running:: - - ceph osd force-create-pg - - -CRUSH Map Errors ----------------- - -Another candidate for placement groups remaining unclean involves errors -in your CRUSH map. - - -Stuck Placement Groups -====================== - -It is normal for placement groups to enter states like "degraded" or "peering" -following a failure. Normally these states indicate the normal progression -through the failure recovery process. However, if a placement group stays in one -of these states for a long time this may be an indication of a larger problem. -For this reason, the monitor will warn when placement groups get "stuck" in a -non-optimal state. Specifically, we check for: - -* ``inactive`` - The placement group has not been ``active`` for too long - (i.e., it hasn't been able to service read/write requests). - -* ``unclean`` - The placement group has not been ``clean`` for too long - (i.e., it hasn't been able to completely recover from a previous failure). - -* ``stale`` - The placement group status has not been updated by a ``ceph-osd``, - indicating that all nodes storing this placement group may be ``down``. - -You can explicitly list stuck placement groups with one of:: - - ceph pg dump_stuck stale - ceph pg dump_stuck inactive - ceph pg dump_stuck unclean - -For stuck ``stale`` placement groups, it is normally a matter of getting the -right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement -groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For -stuck ``unclean`` placement groups, there is usually something preventing -recovery from completing, like unfound objects (see -:ref:`failures-osd-unfound`); - - - -.. _failures-osd-peering: - -Placement Group Down - Peering Failure -====================================== - -In certain cases, the ``ceph-osd`` `Peering` process can run into -problems, preventing a PG from becoming active and usable. 
For -example, ``ceph health`` might report:: - - ceph health detail - HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down - ... - pg 0.5 is down+peering - pg 1.4 is down+peering - ... - osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651 - -We can query the cluster to determine exactly why the PG is marked ``down`` with:: - - ceph pg 0.5 query - -.. code-block:: javascript - - { "state": "down+peering", - ... - "recovery_state": [ - { "name": "Started\/Primary\/Peering\/GetInfo", - "enter_time": "2012-03-06 14:40:16.169679", - "requested_info_from": []}, - { "name": "Started\/Primary\/Peering", - "enter_time": "2012-03-06 14:40:16.169659", - "probing_osds": [ - 0, - 1], - "blocked": "peering is blocked due to down osds", - "down_osds_we_would_probe": [ - 1], - "peering_blocked_by": [ - { "osd": 1, - "current_lost_at": 0, - "comment": "starting or marking this osd lost may let us proceed"}]}, - { "name": "Started", - "enter_time": "2012-03-06 14:40:16.169513"} - ] - } - -The ``recovery_state`` section tells us that peering is blocked due to -down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd`` -and things will recover. - -Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk -failure), we can tell the cluster that it is ``lost`` and to cope as -best it can. - -.. important:: This is dangerous in that the cluster cannot - guarantee that the other copies of the data are consistent - and up to date. - -To instruct Ceph to continue anyway:: - - ceph osd lost 1 - -Recovery will proceed. - - -.. _failures-osd-unfound: - -Unfound Objects -=============== - -Under certain combinations of failures Ceph may complain about -``unfound`` objects:: - - ceph health detail - HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%) - pg 2.4 is active+degraded, 78 unfound - -This means that the storage cluster knows that some objects (or newer -copies of existing objects) exist, but it hasn't found copies of them. -One example of how this might come about for a PG whose data is on ceph-osds -1 and 2: - -* 1 goes down -* 2 handles some writes, alone -* 1 comes up -* 1 and 2 repeer, and the objects missing on 1 are queued for recovery. -* Before the new objects are copied, 2 goes down. - -Now 1 knows that these object exist, but there is no live ``ceph-osd`` who -has a copy. In this case, IO to those objects will block, and the -cluster will hope that the failed node comes back soon; this is -assumed to be preferable to returning an IO error to the user. - -First, you can identify which objects are unfound with:: - - ceph pg 2.4 list_missing [starting offset, in json] - -.. code-block:: javascript - - { "offset": { "oid": "", - "key": "", - "snapid": 0, - "hash": 0, - "max": 0}, - "num_missing": 0, - "num_unfound": 0, - "objects": [ - { "oid": "object 1", - "key": "", - "hash": 0, - "max": 0 }, - ... - ], - "more": 0} - -If there are too many objects to list in a single result, the ``more`` -field will be true and you can query for more. (Eventually the -command line tool will hide this from you, but not yet.) - -Second, you can identify which OSDs have been probed or might contain -data:: - - ceph pg 2.4 query - -.. 
code-block:: javascript - - "recovery_state": [ - { "name": "Started\/Primary\/Active", - "enter_time": "2012-03-06 15:15:46.713212", - "might_have_unfound": [ - { "osd": 1, - "status": "osd is down"}]}, - -In this case, for example, the cluster knows that ``osd.1`` might have -data, but it is ``down``. The full range of possible states include: - -* already probed -* querying -* OSD is down -* not queried (yet) - -Sometimes it simply takes some time for the cluster to query possible -locations. - -It is possible that there are other locations where the object can -exist that are not listed. For example, if a ceph-osd is stopped and -taken out of the cluster, the cluster fully recovers, and due to some -future set of failures ends up with an unfound object, it won't -consider the long-departed ceph-osd as a potential location to -consider. (This scenario, however, is unlikely.) - -If all possible locations have been queried and objects are still -lost, you may have to give up on the lost objects. This, again, is -possible given unusual combinations of failures that allow the cluster -to learn about writes that were performed before the writes themselves -are recovered. To mark the "unfound" objects as "lost":: - - ceph pg 2.5 mark_unfound_lost revert|delete - -This the final argument specifies how the cluster should deal with -lost objects. - -The "delete" option will forget about them entirely. - -The "revert" option (not available for erasure coded pools) will -either roll back to a previous version of the object or (if it was a -new object) forget about it entirely. Use this with caution, as it -may confuse applications that expected the object to exist. - - -Homeless Placement Groups -========================= - -It is possible for all OSDs that had copies of a given placement groups to fail. -If that's the case, that subset of the object store is unavailable, and the -monitor will receive no status updates for those placement groups. To detect -this situation, the monitor marks any placement group whose primary OSD has -failed as ``stale``. For example:: - - ceph health - HEALTH_WARN 24 pgs stale; 3/300 in osds are down - -You can identify which placement groups are ``stale``, and what the last OSDs to -store them were, with:: - - ceph health detail - HEALTH_WARN 24 pgs stale; 3/300 in osds are down - ... - pg 2.5 is stuck stale+active+remapped, last acting [2,0] - ... - osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080 - osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539 - osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861 - -If we want to get placement group 2.5 back online, for example, this tells us that -it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd`` -daemons will allow the cluster to recover that placement group (and, presumably, -many others). - - -Only a Few OSDs Receive Data -============================ - -If you have many nodes in your cluster and only a few of them receive data, -`check`_ the number of placement groups in your pool. Since placement groups get -mapped to OSDs, a small number of placement groups will not distribute across -your cluster. Try creating a pool with a placement group count that is a -multiple of the number of OSDs. See `Placement Groups`_ for details. The default -placement group count for pools is not useful, but you can change it `here`_. 
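-
-For example, here is a minimal sketch of creating a pool with an explicit
-placement group count (assuming roughly 10 OSDs and a replica count of 3;
-the pool name and the counts below are illustrative only)::
-
-    # a common starting heuristic: (OSDs * 100) / replicas, rounded up to
-    # the next power of two: (10 * 100) / 3 ~= 333, rounded up to 512
-    ceph osd pool create testpool 512 512
-
-    # verify the placement group count of an existing pool
-    ceph osd pool get testpool pg_num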
-
-
-Can't Write Data
-================
-
-If your cluster is up, but some OSDs are down and you cannot write data,
-check to ensure that you have the minimum number of OSDs running for the
-placement group. If you don't have the minimum number of OSDs running,
-Ceph will not allow you to write data because there is no guarantee
-that Ceph can replicate your data. See ``osd pool default min size``
-in the `Pool, PG and CRUSH Config Reference`_ for details.
-
-
-PGs Inconsistent
-================
-
-If a placement group reports an ``active+clean+inconsistent`` state, this is
-usually caused by an error during scrubbing. As always, we can identify the
-inconsistent placement group(s) with::
-
-    $ ceph health detail
-    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
-    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
-    2 scrub errors
-
-Or if you prefer inspecting the output in a programmatic way::
-
-    $ rados list-inconsistent-pg rbd
-    ["0.6"]
-
-There is only one consistent state, but in the worst case we could find
-different inconsistencies, from multiple perspectives, in more than one
-object. If an object named ``foo`` in PG ``0.6`` is truncated, we will have::
-
-    $ rados list-inconsistent-obj 0.6 --format=json-pretty
-
-.. code-block:: javascript
-
-    {
-        "epoch": 14,
-        "inconsistents": [
-            {
-                "object": {
-                    "name": "foo",
-                    "nspace": "",
-                    "locator": "",
-                    "snap": "head",
-                    "version": 1
-                },
-                "errors": [
-                    "data_digest_mismatch",
-                    "size_mismatch"
-                ],
-                "union_shard_errors": [
-                    "data_digest_mismatch_oi",
-                    "size_mismatch_oi"
-                ],
-                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
-                "shards": [
-                    {
-                        "osd": 0,
-                        "errors": [],
-                        "size": 968,
-                        "omap_digest": "0xffffffff",
-                        "data_digest": "0xe978e67f"
-                    },
-                    {
-                        "osd": 1,
-                        "errors": [],
-                        "size": 968,
-                        "omap_digest": "0xffffffff",
-                        "data_digest": "0xe978e67f"
-                    },
-                    {
-                        "osd": 2,
-                        "errors": [
-                            "data_digest_mismatch_oi",
-                            "size_mismatch_oi"
-                        ],
-                        "size": 0,
-                        "omap_digest": "0xffffffff",
-                        "data_digest": "0xffffffff"
-                    }
-                ]
-            }
-        ]
-    }
-
-In this case, we can learn from the output:
-
-* The only inconsistent object is named ``foo``, and it is its head that has
-  inconsistencies.
-* The inconsistencies fall into two categories:
-
-  * ``errors``: these errors indicate inconsistencies between shards, without
-    a determination of which shard(s) are bad. Check for the ``errors`` in
-    the ``shards`` array, if available, to pinpoint the problem.
-
-    * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is
-      different from that of OSD.0 and OSD.1
-    * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while
-      the size reported by OSD.0 and OSD.1 is 968.
-
-  * ``union_shard_errors``: the union of all shard-specific ``errors`` in the
-    ``shards`` array. The ``errors`` are set for the given shard that has the
-    problem. They include errors like ``read_error``. The ``errors`` ending
-    in ``oi`` indicate a comparison with ``selected_object_info``. Look at
-    the ``shards`` array to determine which shard has which error(s).
-
-    * ``data_digest_mismatch_oi``: the digest stored in the object-info does
-      not match the digest ``0xffffffff`` calculated from the shard read
-      from OSD.2
-    * ``size_mismatch_oi``: the size stored in the object-info is different
-      from the one read from OSD.2. The latter is 0.
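-
-To pull just the failing shards out of that JSON, here is a small sketch
-(assuming the ``jq`` utility is installed; the PG id is the one from the
-example above)::
-
-    # list each inconsistent object together with the shards that
-    # carry errors and the error names themselves
-    rados list-inconsistent-obj 0.6 --format=json |
-      jq -r '.inconsistents[]
-             | .object.name as $obj
-             | .shards[]
-             | select((.errors | length) > 0)
-             | "\($obj) osd.\(.osd): \(.errors | join(", "))"'
-
-For the example above, this would print ``foo osd.2: data_digest_mismatch_oi,
-size_mismatch_oi``, pointing at the shard on OSD.2 as the damaged one.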
-
-You can repair the inconsistent placement group by executing::
-
-    ceph pg repair {placement-group-ID}
-
-This command overwrites the `bad` copies with the `authoritative` ones. In
-most cases, Ceph is able to choose authoritative copies from all available
-replicas using some predefined criteria. This does not always work, however.
-For example, the stored data digest could be missing, and the calculated
-digest will then be ignored when choosing the authoritative copies. So,
-please use the above command with caution.
-
-If ``read_error`` is listed in the ``errors`` attribute of a shard, the
-inconsistency is likely due to disk errors. You might want to check the disk
-used by that OSD.
-
-If you receive ``active+clean+inconsistent`` states periodically due to
-clock skew, you may consider configuring the `NTP`_ daemons on your
-monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
-`Clock Settings`_ for additional details.
-
-
-Erasure Coded PGs are not active+clean
-======================================
-
-When CRUSH fails to find enough OSDs to map to a PG, it will show as
-``2147483647``, which is ``ITEM_NONE`` or ``no OSD found``. For instance::
-
-    [2,1,6,0,5,8,2147483647,7,4]
-
-Not enough OSDs
----------------
-
-If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
-9, that is what it will show. You can either create another erasure
-coded pool that requires fewer OSDs::
-
-    ceph osd erasure-code-profile set myprofile k=5 m=3
-    ceph osd pool create erasurepool 16 16 erasure myprofile
-
-or add new OSDs, and the PG will automatically use them.
-
-CRUSH constraints cannot be satisfied
--------------------------------------
-
-If the cluster has enough OSDs, it is possible that the CRUSH ruleset
-imposes constraints that cannot be satisfied. If there are 10 OSDs on
-two hosts and the CRUSH ruleset requires that no two OSDs from the
-same host are used in the same PG, the mapping may fail because only
-two OSDs will be found. You can check the constraint by displaying the
-ruleset::
-
-    $ ceph osd crush rule ls
-    [
-        "replicated_ruleset",
-        "erasurepool"]
-    $ ceph osd crush rule dump erasurepool
-    { "rule_id": 1,
-      "rule_name": "erasurepool",
-      "ruleset": 1,
-      "type": 3,
-      "min_size": 3,
-      "max_size": 20,
-      "steps": [
-            { "op": "take",
-              "item": -1,
-              "item_name": "default"},
-            { "op": "chooseleaf_indep",
-              "num": 0,
-              "type": "host"},
-            { "op": "emit"}]}
-
-
-You can resolve the problem by creating a new pool in which PGs are allowed
-to have OSDs residing on the same host with::
-
-    ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
-    ceph osd pool create erasurepool 16 16 erasure myprofile
-
-CRUSH gives up too soon
------------------------
-
-If the Ceph cluster has just enough OSDs to map the PG (for instance a
-cluster with a total of 9 OSDs and an erasure coded pool that requires
-9 OSDs per PG), it is possible that CRUSH gives up before finding a
-mapping. It can be resolved by:
-
-* lowering the erasure coded pool requirements to use fewer OSDs per PG
-  (that requires the creation of another pool, as erasure code profiles
-  cannot be dynamically modified).
-
-* adding more OSDs to the cluster (that does not require the erasure
-  coded pool to be modified; it will become clean automatically).
-
-* using a hand-made CRUSH ruleset that tries more times to find a good
-  mapping. This can be done by setting ``set_choose_tries`` to a value
-  greater than the default, as shown in the walkthrough below.
-
-You should first verify the problem with ``crushtool`` after
-extracting the crushmap from the cluster, so your experiments do not
-modify the Ceph cluster and work only on local files::
-
-    $ ceph osd crush rule dump erasurepool
-    { "rule_name": "erasurepool",
-      "ruleset": 1,
-      "type": 3,
-      "min_size": 3,
-      "max_size": 20,
-      "steps": [
-            { "op": "take",
-              "item": -1,
-              "item_name": "default"},
-            { "op": "chooseleaf_indep",
-              "num": 0,
-              "type": "host"},
-            { "op": "emit"}]}
-    $ ceph osd getcrushmap > crush.map
-    got crush map from osdmap epoch 13
-    $ crushtool -i crush.map --test --show-bad-mappings \
-       --rule 1 \
-       --num-rep 9 \
-       --min-x 1 --max-x $((1024 * 1024))
-    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
-    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
-    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
-
-Here ``--num-rep`` is the number of OSDs the erasure code CRUSH
-ruleset needs, and ``--rule`` is the value of the ``ruleset`` field
-displayed by ``ceph osd crush rule dump``. The test will try mapping
-one million values (i.e. the range defined by ``[--min-x,--max-x]``)
-and must display at least one bad mapping. If it outputs nothing, it
-means all mappings are successful and you can stop right there: the
-problem is elsewhere.
-
-The CRUSH ruleset can be edited by decompiling the crush map::
-
-    $ crushtool --decompile crush.map > crush.txt
-
-and adding the following line to the ruleset::
-
-    step set_choose_tries 100
-
-The relevant part of the ``crush.txt`` file should look something
-like::
-
-    rule erasurepool {
-            ruleset 1
-            type erasure
-            min_size 3
-            max_size 20
-            step set_chooseleaf_tries 5
-            step set_choose_tries 100
-            step take default
-            step chooseleaf indep 0 type host
-            step emit
-    }
-
-It can then be compiled and tested again::
-
-    $ crushtool --compile crush.txt -o better-crush.map
-
-When all mappings succeed, a histogram of the number of tries that
-were necessary to find all of them can be displayed with the
-``--show-choose-tries`` option of ``crushtool``::
-
-    $ crushtool -i better-crush.map --test --show-bad-mappings \
-       --show-choose-tries \
-       --rule 1 \
-       --num-rep 9 \
-       --min-x 1 --max-x $((1024 * 1024))
-    ...
-    11:        42
-    12:        44
-    13:        54
-    14:        45
-    15:        35
-    16:        34
-    17:        30
-    18:        25
-    19:        19
-    20:        22
-    21:        20
-    22:        17
-    23:        13
-    24:        16
-    25:        13
-    26:        11
-    27:        11
-    28:        13
-    29:        11
-    30:        10
-    31:         6
-    32:         5
-    33:        10
-    34:         3
-    35:         7
-    36:         5
-    37:         2
-    38:         5
-    39:         5
-    40:         2
-    41:         5
-    42:         4
-    43:         1
-    44:         2
-    45:         2
-    46:         3
-    47:         1
-    48:         0
-    ...
-    102:        0
-    103:        1
-    104:        0
-    ...
-
-It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, etc. The highest
-number of tries is the minimum value of ``set_choose_tries`` that prevents
-bad mappings (i.e. 103 in the above output, because it did not take more
-than 103 tries for any PG to be mapped).
-
-.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
-.. _here: ../../configuration/pool-pg-config-ref
-.. _Placement Groups: ../../operations/placement-groups
-.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
-.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol
-.. _The Network Time Protocol: http://www.ntp.org/
-.. _Clock Settings: ../../configuration/mon-config-ref/#clock
--
-
--
cgit 1.2.3-korg