Diffstat (limited to 'src/ceph/doc/rados/troubleshooting')
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/community.rst | 29
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/cpu-profiling.rst | 67
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/index.rst | 19
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/log-and-debug.rst | 550
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/memory-profiling.rst | 142
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst | 567
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst | 536
-rw-r--r-- | src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst | 668
8 files changed, 2578 insertions, 0 deletions
diff --git a/src/ceph/doc/rados/troubleshooting/community.rst b/src/ceph/doc/rados/troubleshooting/community.rst new file mode 100644 index 0000000..9faad13 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/community.rst @@ -0,0 +1,29 @@ +==================== + The Ceph Community +==================== + +The Ceph community is an excellent source of information and help. For +operational issues with Ceph releases we recommend you `subscribe to the +ceph-users email list`_. When you no longer want to receive emails, you can +`unsubscribe from the ceph-users email list`_. + +You may also `subscribe to the ceph-devel email list`_. You should do so if +your issue is: + +- Likely related to a bug +- Related to a development release package +- Related to a development testing package +- Related to your own builds + +If you no longer want to receive emails from the ``ceph-devel`` email list, you +may `unsubscribe from the ceph-devel email list`_. + +.. tip:: The Ceph community is growing rapidly, and community members can help + you if you provide them with detailed information about your problem. You + can attach the output of the ``ceph report`` command to help people understand your issues. + +.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel +.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel +.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com +.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com +.. _ceph-devel: ceph-devel@vger.kernel.org
\ No newline at end of file diff --git a/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst new file mode 100644 index 0000000..159f799 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/cpu-profiling.rst @@ -0,0 +1,67 @@ +=============== + CPU Profiling +=============== + +If you built Ceph from source and compiled Ceph for use with `oprofile`_ +you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details. + + +Initializing oprofile +===================== + +The first time you use ``oprofile`` you need to initialize it. Locate the +``vmlinux`` image corresponding to the kernel you are now running. :: + + ls /boot + sudo opcontrol --init + sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6 + + +Starting oprofile +================= + +To start ``oprofile`` execute the following command:: + + opcontrol --start + +Once you start ``oprofile``, you may run some tests with Ceph. + + +Stopping oprofile +================= + +To stop ``oprofile`` execute the following command:: + + opcontrol --stop + + +Retrieving oprofile Results +=========================== + +To retrieve the top ``cmon`` results, execute the following command:: + + opreport -gal ./cmon | less + + +To retrieve the top ``cmon`` results with call graphs attached, execute the +following command:: + + opreport -cal ./cmon | less + +.. important:: After reviewing results, you should reset ``oprofile`` before + running it again. Resetting ``oprofile`` removes data from the session + directory. + + +Resetting oprofile +================== + +To reset ``oprofile``, execute the following command:: + + sudo opcontrol --reset + +.. important:: You should reset ``oprofile`` after analyzing data so that + you do not commingle results from different tests. + +.. _oprofile: http://oprofile.sourceforge.net/about/ +.. _Installing Oprofile: ../../../dev/cpu-profiler diff --git a/src/ceph/doc/rados/troubleshooting/index.rst b/src/ceph/doc/rados/troubleshooting/index.rst new file mode 100644 index 0000000..80d14f3 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/index.rst @@ -0,0 +1,19 @@ +================= + Troubleshooting +================= + +Ceph is still on the leading edge, so you may encounter situations that require +you to examine your configuration, modify your logging output, troubleshoot +monitors and OSDs, profile memory and CPU usage, and reach out to the +Ceph community for help. + +.. toctree:: + :maxdepth: 1 + + community + log-and-debug + troubleshooting-mon + troubleshooting-osd + troubleshooting-pg + memory-profiling + cpu-profiling diff --git a/src/ceph/doc/rados/troubleshooting/log-and-debug.rst b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst new file mode 100644 index 0000000..c91f272 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/log-and-debug.rst @@ -0,0 +1,550 @@ +======================= + Logging and Debugging +======================= + +Typically, when you add debugging to your Ceph configuration, you do so at +runtime. You can also add Ceph debug logging to your Ceph configuration file if +you are encountering issues when starting your cluster. You may view Ceph log +files under ``/var/log/ceph`` (the default location). + +.. tip:: When debug output slows down your system, the latency can hide + race conditions. + +Logging is resource intensive. If you are encountering a problem in a specific +area of your cluster, enable logging for that area of the cluster. 
For example, +if your OSDs are running fine, but your metadata servers are not, you should +start by enabling debug logging for the specific metadata server instance(s) +giving you trouble. Enable logging for each subsystem as needed. + +.. important:: Verbose logging can generate over 1GB of data per hour. If your + OS disk reaches its capacity, the node will stop working. + +If you enable or increase the rate of Ceph logging, ensure that you have +sufficient disk space on your OS disk. See `Accelerating Log Rotation`_ for +details on rotating log files. When your system is running well, remove +unnecessary debugging settings to ensure your cluster runs optimally. Logging +debug output messages is relatively slow, and a waste of resources when +operating your cluster. + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + +Runtime +======= + +If you would like to see the configuration settings at runtime, you must log +in to a host with a running daemon and execute the following:: + + ceph daemon {daemon-name} config show | less + +For example,:: + + ceph daemon osd.0 config show | less + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at runtime, use the +``ceph tell`` command to inject arguments into the runtime configuration:: + + ceph tell {daemon-type}.{daemon id or *} injectargs --{name} {value} [--{name} {value}] + +Replace ``{daemon-type}`` with one of ``osd``, ``mon`` or ``mds``. You may apply +the runtime setting to all daemons of a particular type with ``*``, or specify +a specific daemon's ID. For example, to increase +debug logging for a ``ceph-osd`` daemon named ``osd.0``, execute the following:: + + ceph tell osd.0 injectargs --debug-osd 0/5 + +The ``ceph tell`` command goes through the monitors. If you cannot bind to the +monitor, you can still make the change by logging into the host of the daemon +whose configuration you'd like to change using ``ceph daemon``. +For example:: + + sudo ceph daemon osd.0 config set debug_osd 0/5 + +See `Subsystem, Log and Debug Settings`_ for details on available settings. + + +Boot Time +========= + +To activate Ceph's debugging output (*i.e.*, ``dout()``) at boot time, you must +add settings to your Ceph configuration file. Subsystems common to each daemon +may be set under ``[global]`` in your configuration file. Subsystems for +particular daemons are set under the daemon section in your configuration file +(*e.g.*, ``[mon]``, ``[osd]``, ``[mds]``). For example:: + + [global] + debug ms = 1/5 + + [mon] + debug mon = 20 + debug paxos = 1/5 + debug auth = 2 + + [osd] + debug osd = 1/5 + debug filestore = 1/5 + debug journal = 1 + debug monc = 5/20 + + [mds] + debug mds = 1 + debug mds balancer = 1 + + +See `Subsystem, Log and Debug Settings`_ for details. + + +Accelerating Log Rotation +========================= + +If your OS disk is relatively full, you can accelerate log rotation by modifying +the Ceph log rotation file at ``/etc/logrotate.d/ceph``. Add a size setting +after the rotation frequency to accelerate log rotation (via cronjob) if your +logs exceed the size setting. For example, the default setting looks like +this:: + + rotate 7 + weekly + compress + sharedscripts + +Modify it by adding a ``size`` setting. :: + + rotate 7 + weekly + size 500M + compress + sharedscripts + +Then, start the crontab editor for your user space. :: + + crontab -e + +Finally, add an entry to check the ``etc/logrotate.d/ceph`` file. 
:: + + 30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1 + +The preceding example checks the ``etc/logrotate.d/ceph`` file every 30 minutes. + + +Valgrind +======== + +Debugging may also require you to track down memory and threading issues. +You can run a single daemon, a type of daemon, or the whole cluster with +Valgrind. You should only use Valgrind when developing or debugging Ceph. +Valgrind is computationally expensive, and will slow down your system otherwise. +Valgrind messages are logged to ``stderr``. + + +Subsystem, Log and Debug Settings +================================= + +In most cases, you will enable debug logging output via subsystems. + +Ceph Subsystems +--------------- + +Each subsystem has a logging level for its output logs, and for its logs +in-memory. You may set different values for each of these subsystems by setting +a log file level and a memory level for debug logging. Ceph's logging levels +operate on a scale of ``1`` to ``20``, where ``1`` is terse and ``20`` is +verbose [#]_ . In general, the logs in-memory are not sent to the output log unless: + +- a fatal signal is raised or +- an ``assert`` in source code is triggered or +- upon requested. Please consult `document on admin socket <http://docs.ceph.com/docs/master/man/8/ceph/#daemon>`_ for more details. + +A debug logging setting can take a single value for the log level and the +memory level, which sets them both as the same value. For example, if you +specify ``debug ms = 5``, Ceph will treat it as a log level and a memory level +of ``5``. You may also specify them separately. The first setting is the log +level, and the second setting is the memory level. You must separate them with +a forward slash (/). For example, if you want to set the ``ms`` subsystem's +debug logging level to ``1`` and its memory level to ``5``, you would specify it +as ``debug ms = 1/5``. For example: + + + +.. code-block:: ini + + debug {subsystem} = {log-level}/{memory-level} + #for example + debug mds balancer = 1/20 + + +The following table provides a list of Ceph subsystems and their default log and +memory levels. Once you complete your logging efforts, restore the subsystems +to their default level or to a level suitable for normal operations. 
+ + ++--------------------+-----------+--------------+ +| Subsystem | Log Level | Memory Level | ++====================+===========+==============+ +| ``default`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``lockdep`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``context`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``crush`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds balancer`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds locker`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds log expire`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``mds migrator`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``buffer`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``timer`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filer`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objecter`` | 0 | 0 | ++--------------------+-----------+--------------+ +| ``rados`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``rbd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``journaler`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objectcacher`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``client`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``osd`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``optracker`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``objclass`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``filestore`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``journal`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``ms`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``mon`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``monc`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``paxos`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``tp`` | 0 | 5 | ++--------------------+-----------+--------------+ +| ``auth`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``finisher`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``heartbeatmap`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``perfcounter`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``rgw`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``javaclient`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``asok`` | 1 | 5 | ++--------------------+-----------+--------------+ +| ``throttle`` | 1 | 5 | ++--------------------+-----------+--------------+ + + +Logging Settings +---------------- + +Logging and debugging settings are not required in a Ceph configuration file, +but you may override default settings as needed. Ceph supports the following +settings: + + +``log file`` + +:Description: The location of the logging file for your cluster. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster-$name.log`` + + +``log max new`` + +:Description: The maximum number of new log files. 
+:Type: Integer +:Required: No +:Default: ``1000`` + + +``log max recent`` + +:Description: The maximum number of recent events to include in a log file. +:Type: Integer +:Required: No +:Default: ``1000000`` + + +``log to stderr`` + +:Description: Determines if logging messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``err to stderr`` + +:Description: Determines if error messages should appear in ``stderr``. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``log to syslog`` + +:Description: Determines if logging messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``err to syslog`` + +:Description: Determines if error messages should appear in ``syslog``. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``log flush on exit`` + +:Description: Determines if Ceph should flush the log files after exit. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to monitors`` + +:Description: Determines if ``clog`` messages should be sent to monitors. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``clog to syslog`` + +:Description: Determines if ``clog`` messages should be sent to syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log to syslog`` + +:Description: Determines if the cluster log should be output to the syslog. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mon cluster log file`` + +:Description: The location of the cluster's log file. +:Type: String +:Required: No +:Default: ``/var/log/ceph/$cluster.log`` + + + +OSD +--- + + +``osd debug drop ping probability`` + +:Description: ? +:Type: Double +:Required: No +:Default: 0 + + +``osd debug drop ping duration`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create probability`` + +:Description: +:Type: Integer +:Required: No +:Default: 0 + +``osd debug drop pg create duration`` + +:Description: ? +:Type: Double +:Required: No +:Default: 1 + + +``osd tmapput sets uses tmap`` + +:Description: Uses ``tmap``. For debug only. +:Type: Boolean +:Required: No +:Default: ``false`` + + +``osd min pg log entries`` + +:Description: The minimum number of log entries for placement groups. +:Type: 32-bit Unsigned Integer +:Required: No +:Default: 1000 + + +``osd op log threshold`` + +:Description: How many op log messages to show up in one pass. +:Type: Integer +:Required: No +:Default: 5 + + + +Filestore +--------- + +``filestore debug omap check`` + +:Description: Debugging check on synchronization. This is an expensive operation. +:Type: Boolean +:Required: No +:Default: 0 + + +MDS +--- + + +``mds debug scatterstat`` + +:Description: Ceph will assert that various recursive stat invariants are true + (for developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug frag`` + +:Description: Ceph will verify directory fragmentation invariants when + convenient (developers only). + +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug auth pins`` + +:Description: The debug auth pin invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + +``mds debug subtrees`` + +:Description: The debug subtree invariants (for developers only). +:Type: Boolean +:Required: No +:Default: ``false`` + + + +RADOS Gateway +------------- + + +``rgw log nonexistent bucket`` + +:Description: Should we log a non-existent buckets? 
+:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw log object name`` + +:Description: Should an object's name be logged. // man date to see codes (a subset are supported) +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%H-%i-%n`` + + +``rgw log object name utc`` + +:Description: Object log name contains UTC? +:Type: Boolean +:Required: No +:Default: ``false`` + + +``rgw enable ops log`` + +:Description: Enables logging of every RGW operation. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw enable usage log`` + +:Description: Enable logging of RGW's bandwidth usage. +:Type: Boolean +:Required: No +:Default: ``true`` + + +``rgw usage log flush threshold`` + +:Description: Threshold to flush pending log data. +:Type: Integer +:Required: No +:Default: ``1024`` + + +``rgw usage log tick interval`` + +:Description: Flush pending log data every ``s`` seconds. +:Type: Integer +:Required: No +:Default: 30 + + +``rgw intent log object name`` + +:Description: +:Type: String +:Required: No +:Default: ``%Y-%m-%d-%i-%n`` + + +``rgw intent log object name utc`` + +:Description: Include a UTC timestamp in the intent log object name. +:Type: Boolean +:Required: No +:Default: ``false`` + +.. [#] there are levels >20 in some rare cases and that they are extremely verbose. diff --git a/src/ceph/doc/rados/troubleshooting/memory-profiling.rst b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst new file mode 100644 index 0000000..e2396e2 --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/memory-profiling.rst @@ -0,0 +1,142 @@ +================== + Memory Profiling +================== + +Ceph MON, OSD and MDS can generate heap profiles using +``tcmalloc``. To generate heap profiles, ensure you have +``google-perftools`` installed:: + + sudo apt-get install google-perftools + +The profiler dumps output to your ``log file`` directory (i.e., +``/var/log/ceph``). See `Logging and Debugging`_ for details. +To view the profiler logs with Google's performance tools, execute the +following:: + + google-pprof --text {path-to-daemon} {log-path/filename} + +For example:: + + $ ceph tell osd.0 heap start_profiler + $ ceph tell osd.0 heap dump + osd.0 tcmalloc heap stats:------------------------------------------------ + MALLOC: 2632288 ( 2.5 MiB) Bytes in use by application + MALLOC: + 499712 ( 0.5 MiB) Bytes in page heap freelist + MALLOC: + 543800 ( 0.5 MiB) Bytes in central cache freelist + MALLOC: + 327680 ( 0.3 MiB) Bytes in transfer cache freelist + MALLOC: + 1239400 ( 1.2 MiB) Bytes in thread cache freelists + MALLOC: + 1142936 ( 1.1 MiB) Bytes in malloc metadata + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Actual memory used (physical + swap) + MALLOC: + 0 ( 0.0 MiB) Bytes released to OS (aka unmapped) + MALLOC: ------------ + MALLOC: = 6385816 ( 6.1 MiB) Virtual address space used + MALLOC: + MALLOC: 231 Spans in use + MALLOC: 56 Thread heaps in use + MALLOC: 8192 Tcmalloc page size + ------------------------------------------------ + Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()). + Bytes released to the OS take up virtual address space but no physical memory. + $ google-pprof --text \ + /usr/bin/ceph-osd \ + /var/log/ceph/ceph-osd.0.profile.0001.heap + Total: 3.7 MB + 1.9 51.1% 51.1% 1.9 51.1% ceph::log::Log::create_entry + 1.8 47.3% 98.4% 1.8 47.3% std::string::_Rep::_S_create + 0.0 0.4% 98.9% 0.0 0.6% SimpleMessenger::add_accept_pipe + 0.0 0.4% 99.2% 0.0 0.6% decode_message + ... 
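The whole session above can be scripted end to end. Below is a minimal
sketch; the pool name ``testpool`` and the 60-second benchmark run are
illustrative assumptions, and the exact ``.heap`` filename will vary with
each dump::

    ceph tell osd.0 heap start_profiler     # begin sampling allocations
    rados bench -p testpool 60 write        # generate some representative load
    ceph tell osd.0 heap dump               # writes a .heap file under /var/log/ceph
    ceph tell osd.0 heap stop_profiler      # stop sampling when done
    google-pprof --text /usr/bin/ceph-osd \
        /var/log/ceph/ceph-osd.0.profile.0001.heap | less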
+ +Another heap dump on the same daemon will add another file. It is +convenient to compare to a previous heap dump to show what has grown +in the interval. For instance:: + + $ google-pprof --text --base out/osd.0.profile.0001.heap \ + ceph-osd out/osd.0.profile.0003.heap + Total: 0.2 MB + 0.1 50.3% 50.3% 0.1 50.3% ceph::log::Log::create_entry + 0.1 46.6% 96.8% 0.1 46.6% std::string::_Rep::_S_create + 0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op + 0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate + +Refer to `Google Heap Profiler`_ for additional details. + +Once you have the heap profiler installed, start your cluster and +begin using the heap profiler. You may enable or disable the heap +profiler at runtime, or ensure that it runs continuously. For the +following commandline usage, replace ``{daemon-type}`` with ``mon``, +``osd`` or ``mds``, and replace ``{daemon-id}`` with the OSD number or +the MON or MDS id. + + +Starting the Profiler +--------------------- + +To start the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap start_profiler + +For example:: + + ceph tell osd.1 heap start_profiler + +Alternatively the profile can be started when the daemon starts +running if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in +the environment. + +Printing Stats +-------------- + +To print out statistics, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stats + +For example:: + + ceph tell osd.0 heap stats + +.. note:: Printing stats does not require the profiler to be running and does + not dump the heap allocation information to a file. + + +Dumping Heap Information +------------------------ + +To dump heap information, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap dump + +For example:: + + ceph tell mds.a heap dump + +.. note:: Dumping heap information only works when the profiler is running. + + +Releasing Memory +---------------- + +To release memory that ``tcmalloc`` has allocated but which is not being used by +the Ceph daemon itself, execute the following:: + + ceph tell {daemon-type}{daemon-id} heap release + +For example:: + + ceph tell osd.2 heap release + + +Stopping the Profiler +--------------------- + +To stop the heap profiler, execute the following:: + + ceph tell {daemon-type}.{daemon-id} heap stop_profiler + +For example:: + + ceph tell osd.0 heap stop_profiler + +.. _Logging and Debugging: ../log-and-debug +.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst new file mode 100644 index 0000000..89fb94c --- /dev/null +++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -0,0 +1,567 @@ +================================= + Troubleshooting Monitors +================================= + +.. index:: monitor, high availability + +When a cluster encounters monitor-related troubles there's a tendency to +panic, and some times with good reason. You should keep in mind that losing +a monitor, or a bunch of them, don't necessarily mean that your cluster is +down, as long as a majority is up, running and with a formed quorum. +Regardless of how bad the situation is, the first thing you should do is to +calm down, take a breath and try answering our initial troubleshooting script. 
Initial Troubleshooting
========================


**Are the monitors running?**

  First of all, we need to make sure the monitors are running. You would be
  amazed by how often people forget to run the monitors, or forget to restart
  them after an upgrade. There's no shame in that, but let's try not to lose a
  couple of hours chasing an issue that is not there.

**Are you able to connect to the monitor's servers?**

  It doesn't happen often, but sometimes there are ``iptables`` rules that
  block access to monitor servers or monitor ports. These are usually
  leftovers from monitor stress-testing that were forgotten at some point.
  Try ssh'ing into the server and, if that succeeds, try connecting to the
  monitor's port using your tool of choice (telnet, nc, ...).

**Does ceph -s run and obtain a reply from the cluster?**

  If the answer is yes, then your cluster is up and running. One thing you
  can take for granted is that the monitors will only answer a ``status``
  request if there is a formed quorum.

  If ``ceph -s`` blocks, however, without obtaining a reply from the cluster
  or while showing a lot of ``fault`` messages, then it is likely that your
  monitors are either down completely or that only a portion is up -- a
  portion that is not enough to form a quorum (keep in mind that a quorum is
  formed by a majority of monitors).

**What if ceph -s doesn't finish?**

  If you haven't gone through all the steps so far, please go back and do so.

  If you are running Emperor 0.72-rc1 or later, you can contact each monitor
  individually and ask for its status, regardless of whether a quorum has
  formed. This can be achieved using ``ceph ping mon.ID``, where ID is the
  monitor's identifier. You should perform this for each monitor in the
  cluster. The section `Understanding mon_status`_ explains how to interpret
  the output of this command.

  On earlier releases, you will instead need to ssh into the server and use
  the monitor's admin socket. Please jump to
  `Using the monitor's admin socket`_.

For other specific issues, keep on reading.


Using the monitor's admin socket
=================================

The admin socket allows you to interact with a given daemon directly using a
Unix socket file. This file can be found in your monitor's ``run`` directory.
By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``,
but this can vary if you defined it otherwise. If you don't find it there,
check your ``ceph.conf`` for an alternative path, or run::

    ceph-conf --name mon.ID --show-config-value admin_socket

Bear in mind that the admin socket is only available while the monitor is
running. When the monitor is properly shut down, the admin socket is removed.
If the monitor is not running but the admin socket file still persists, it is
likely that the monitor was shut down improperly. In any case, if the monitor
is not running, you will not be able to use the admin socket, and ``ceph``
will likely return ``Error 111: Connection Refused``.

Accessing the admin socket is as simple as telling the ``ceph`` tool to use
the ``asok`` file.
In pre-Dumpling Ceph, this can be achieved by:: + + ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command> + +while in Dumpling and beyond you can use the alternate (and recommended) +format:: + + ceph daemon mon.<id> <command> + +Using ``help`` as the command to the ``ceph`` tool will show you the +supported commands available through the admin socket. Please take a look +at ``config get``, ``config show``, ``mon_status`` and ``quorum_status``, +as those can be enlightening when troubleshooting a monitor. + + +Understanding mon_status +========================= + +``mon_status`` can be obtained through the ``ceph`` tool when you have +a formed quorum, or via the admin socket if you don't. This command will +output a multitude of information about the monitor, including the same +output you would get with ``quorum_status``. + +Take the following example of ``mon_status``:: + + + { "name": "c", + "rank": 2, + "state": "peon", + "election_epoch": 38, + "quorum": [ + 1, + 2], + "outside_quorum": [], + "extra_probe_peers": [], + "sync_provider": [], + "monmap": { "epoch": 3, + "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8", + "modified": "2013-10-30 04:12:01.945629", + "created": "2013-10-29 14:14:41.914786", + "mons": [ + { "rank": 0, + "name": "a", + "addr": "127.0.0.1:6789\/0"}, + { "rank": 1, + "name": "b", + "addr": "127.0.0.1:6790\/0"}, + { "rank": 2, + "name": "c", + "addr": "127.0.0.1:6795\/0"}]}} + +A couple of things are obvious: we have three monitors in the monmap (*a*, *b* +and *c*), the quorum is formed by only two monitors, and *c* is in the quorum +as a *peon*. + +Which monitor is out of the quorum? + + The answer would be **a**. + +Why? + + Take a look at the ``quorum`` set. We have two monitors in this set: *1* + and *2*. These are not monitor names. These are monitor ranks, as established + in the current monmap. We are missing the monitor with rank 0, and according + to the monmap that would be ``mon.a``. + +By the way, how are ranks established? + + Ranks are (re)calculated whenever you add or remove monitors and follow a + simple rule: the **greater** the ``IP:PORT`` combination, the **lower** the + rank is. In this case, considering that ``127.0.0.1:6789`` is lower than all + the remaining ``IP:PORT`` combinations, ``mon.a`` has rank 0. + +Most Common Monitor Issues +=========================== + +Have Quorum but at least one Monitor is down +--------------------------------------------- + +When this happens, depending on the version of Ceph you are running, +you should be seeing something similar to:: + + $ ceph health detail + [snip] + mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum) + +How to troubleshoot this? + + First, make sure ``mon.a`` is running. + + Second, make sure you are able to connect to ``mon.a``'s server from the + other monitors' servers. Check the ports as well. Check ``iptables`` on + all your monitor nodes and make sure you are not dropping/rejecting + connections. + + If this initial troubleshooting doesn't solve your problems, then it's + time to go deeper. + + First, check the problematic monitor's ``mon_status`` via the admin + socket as explained in `Using the monitor's admin socket`_ and + `Understanding mon_status`_. + + Considering the monitor is out of the quorum, its state should be one of + ``probing``, ``electing`` or ``synchronizing``. 
  If it happens to be either ``leader`` or ``peon``, then the monitor
  believes it is in quorum while the rest of the cluster is sure it is not;
  or maybe it got into the quorum while we were troubleshooting the monitor,
  so check your ``ceph -s`` again just to make sure. Proceed if the monitor
  is not yet in the quorum.

What if the state is ``probing``?

  This means the monitor is still looking for the other monitors. Every time
  you start a monitor, the monitor will stay in this state for some time
  while trying to find the rest of the monitors specified in the ``monmap``.
  The time a monitor spends in this state can vary. For instance, on a
  single-monitor cluster, the monitor passes through the probing state almost
  instantaneously, since there are no other monitors around. On a
  multi-monitor cluster, the monitors will stay in this state until they
  find enough monitors to form a quorum -- this means that if you have 2 out
  of 3 monitors down, the one remaining monitor will stay in this state
  indefinitely until you bring one of the other monitors up.

  If you have a quorum, however, the monitor should be able to find the
  remaining monitors quickly, as long as they can be reached. If your
  monitor is stuck probing and you have gone through all the communication
  troubleshooting above, then there is a fair chance that the monitor is
  trying to reach the other monitors on a wrong address. ``mon_status``
  outputs the ``monmap`` known to the monitor: check whether the other
  monitors' locations match reality. If they don't, jump to
  `Recovering a Monitor's Broken monmap`_; if they do, then it may be related
  to severe clock skews amongst the monitor nodes, and you should refer to
  `Clock Skews`_ first. If that doesn't solve your problem, then it is time
  to prepare some logs and reach out to the community (please refer to
  `Preparing your logs`_ on how to best prepare your logs).


What if state is ``electing``?

  This means the monitor is in the middle of an election. Elections should
  complete quickly, but at times the monitors can get stuck electing. This
  is usually a sign of a clock skew among the monitor nodes; jump to
  `Clock Skews`_ for more information on that. If all your clocks are
  properly synchronized, it is best to prepare some logs and reach out to
  the community. This is not a state that is likely to persist, and, aside
  from (*really*) old bugs, there is no obvious reason besides clock skews
  for this to happen.

What if state is ``synchronizing``?

  This means the monitor is synchronizing with the rest of the cluster in
  order to join the quorum. The smaller your monitor store is, the faster
  the synchronization completes, so if you have a big store it may take a
  while. Don't worry, it should be finished soon enough.

  However, if you notice that the monitor jumps from ``synchronizing`` to
  ``electing`` and then back to ``synchronizing``, then you do have a
  problem: the cluster state is advancing (i.e., generating new maps) too
  fast for the synchronization process to keep up. This used to be an issue
  in early Cuttlefish, but the synchronization process has since been
  substantially refactored and enhanced to avoid exactly this sort of
  behavior. If this happens in later versions, let us know. And bring some
  logs (see `Preparing your logs`_).

What if state is ``leader`` or ``peon``?

  This should not happen.
There is a chance this might happen however, and + it has a lot to do with clock skews -- see `Clock Skews`_. If you are not + suffering from clock skews, then please prepare your logs (see + `Preparing your logs`_) and reach out to us. + + +Recovering a Monitor's Broken monmap +------------------------------------- + +This is how a ``monmap`` usually looks like, depending on the number of +monitors:: + + + epoch 3 + fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8 + last_changed 2013-10-30 04:12:01.945629 + created 2013-10-29 14:14:41.914786 + 0: 127.0.0.1:6789/0 mon.a + 1: 127.0.0.1:6790/0 mon.b + 2: 127.0.0.1:6795/0 mon.c + +This may not be what you have however. For instance, in some versions of +early Cuttlefish there was this one bug that could cause your ``monmap`` +to be nullified. Completely filled with zeros. This means that not even +``monmaptool`` would be able to read it because it would find it hard to +make sense of only-zeros. Some other times, you may end up with a monitor +with a severely outdated monmap, thus being unable to find the remaining +monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``, +then remove ``mon.a``, then add a new monitor ``mon.e`` and remove +``mon.b``; you will end up with a totally different monmap from the one +``mon.c`` knows). + +In this sort of situations, you have two possible solutions: + +Scrap the monitor and create a new one + + You should only take this route if you are positive that you won't + lose the information kept by that monitor; that you have other monitors + and that they are running just fine so that your new monitor is able + to synchronize from the remaining monitors. Keep in mind that destroying + a monitor, if there are no other copies of its contents, may lead to + loss of data. + +Inject a monmap into the monitor + + Usually the safest path. You should grab the monmap from the remaining + monitors and inject it into the monitor with the corrupted/lost monmap. + + These are the basic steps: + + 1. Is there a formed quorum? If so, grab the monmap from the quorum:: + + $ ceph mon getmap -o /tmp/monmap + + 2. No quorum? Grab the monmap directly from another monitor (this + assumes the monitor you are grabbing the monmap from has id ID-FOO + and has been stopped):: + + $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap + + 3. Stop the monitor you are going to inject the monmap into. + + 4. Inject the monmap:: + + $ ceph-mon -i ID --inject-monmap /tmp/monmap + + 5. Start the monitor + + Please keep in mind that the ability to inject monmaps is a powerful + feature that can cause havoc with your monitors if misused as it will + overwrite the latest, existing monmap kept by the monitor. + + +Clock Skews +------------ + +Monitors can be severely affected by significant clock skews across the +monitor nodes. This usually translates into weird behavior with no obvious +cause. To avoid such issues, you should run a clock synchronization tool +on your monitor nodes. + + +What's the maximum tolerated clock skew? + + By default the monitors will allow clocks to drift up to ``0.05 seconds``. + + +Can I increase the maximum tolerated clock skew? + + This value is configurable via the ``mon-clock-drift-allowed`` option, and + although you *CAN* it doesn't mean you *SHOULD*. The clock skew mechanism + is in place because clock skewed monitor may not properly behave. We, as + developers and QA afficcionados, are comfortable with the current default + value, as it will alert the user before the monitors get out hand. 
  Changing this value without testing it first may have unforeseen effects on
  the stability of the monitors and the overall health of the cluster,
  although there is no risk of data loss.


How do I know there's a clock skew?

  The monitors will warn you in the form of a ``HEALTH_WARN``. ``ceph health
  detail`` should show something in the form of::

      mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

  That means that ``mon.c`` has been flagged as suffering from a clock skew.


What should I do if there's a clock skew?

  Synchronize your clocks. Running an NTP client may help. If you are
  already using one and you still hit this sort of issue, check whether you
  are using an NTP server that is remote to your network, and consider
  hosting your own NTP server on your network. This last option tends to
  reduce the number of issues with monitor clock skews.


Client Can't Connect or Mount
------------------------------

Check your IP tables. Some OS install utilities add a ``REJECT`` rule to
``iptables``. The rule rejects all clients trying to connect to the host
except for ``ssh``. If your monitor host's IP tables have such a ``REJECT``
rule in place, clients connecting from a separate node will fail to mount
with a timeout error. You need to address ``iptables`` rules that reject
clients trying to connect to Ceph daemons. For example, you would need to
address rules that look like this appropriately::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors (i.e.,
port 6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default).
For example::

    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT

Monitor Store Failures
======================

Symptoms of store corruption
----------------------------

Ceph monitors store the `cluster map`_ in a key/value store such as LevelDB.
If a monitor fails due to key/value store corruption, the following error
messages might be found in the monitor log::

    Corruption: error in middle of record

or::

    Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb

Recovery using healthy monitor(s)
---------------------------------

If there are any survivors, we can always `replace`_ the corrupted monitor
with a new one. After booting up, the new monitor will sync up with a healthy
peer, and once it is fully synchronized, it will be able to serve clients.

Recovery using OSDs
-------------------

But what if all monitors fail at the same time? Since users are encouraged to
deploy at least three monitors in a Ceph cluster, simultaneous failure is
rare. But unplanned power-downs in a data center with improperly configured
disk/fs settings could take down the underlying filesystem, and with it all
the monitors.
In this case, we can recover the monitor store using the information stored
in the OSDs::

    ms=/tmp/mon-store
    mkdir $ms
    # collect the cluster map from OSDs
    for host in $hosts; do
      rsync -avz $ms user@host:$ms
      rm -rf $ms
      ssh user@host <<EOF
      for osd in /var/lib/osd/osd-*; do
        ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
      done
    EOF
      rsync -avz user@host:$ms $ms
    done
    # rebuild the monitor store from the collected map. If the cluster does
    # not use cephx authentication, you can skip the following steps that
    # update the keyring with the caps, and there is no need to pass the
    # "--keyring" option, i.e. just use
    # "ceph-monstore-tool /tmp/mon-store rebuild" instead
    ceph-authtool /path/to/admin.keyring -n mon. \
      --cap mon 'allow *'
    ceph-authtool /path/to/admin.keyring -n client.admin \
      --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
    ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring
    # back up the corrupted store.db just in case
    mv /var/lib/ceph/mon/mon.0/store.db /var/lib/ceph/mon/mon.0/store.db.corrupted
    mv /tmp/mon-store/store.db /var/lib/ceph/mon/mon.0/store.db
    chown -R ceph:ceph /var/lib/ceph/mon/mon.0/store.db

The steps above

#. collect the cluster map from all the OSD hosts,
#. rebuild the store,
#. fill the entities in the keyring file with appropriate caps, and
#. replace the corrupted store on ``mon.0`` with the recovered copy.

Known limitations
~~~~~~~~~~~~~~~~~

The following information is not recoverable using the steps above:

- **some added keyrings**: all the OSD keyrings added using the ``ceph auth
  add`` command are recovered from the OSDs' copies, and the ``client.admin``
  keyring is imported using ``ceph-monstore-tool``, but the MDS keyrings and
  other keyrings will be missing in the recovered monitor store. You might
  need to re-add them manually.

- **pg settings**: the ``full ratio`` and ``nearfull ratio`` settings
  configured using ``ceph pg set_full_ratio`` and
  ``ceph pg set_nearfull_ratio`` will be lost.

- **MDS Maps**: the MDS maps are lost.


Everything Failed! Now What?
=============================

Reaching out for help
----------------------

You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net)
and on ``ceph-devel@vger.kernel.org`` and ``ceph-users@lists.ceph.com``. Make
sure you have grabbed your logs and have them ready in case someone asks for
them: the faster the interaction and the lower the response latency, the
better use everyone makes of their time.


Preparing your logs
---------------------

Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We
may want them. However, your logs may not have the necessary information. If
you don't find your monitor logs at their default location, you can check
where they should be by running::

    ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is subject to the debug levels being
enforced by your configuration files. If you have not enforced a specific
debug level, then Ceph is using the default levels and your logs may not
contain the information needed to track down your issue.

A first step in getting relevant information into your logs is to raise the
debug levels. In this case we will be interested in information from the
monitor. As with other components, different parts of the monitor output
their debug information to different subsystems.
You will have to raise the debug levels of those subsystems most closely
related to your issue. This may not be an easy task for someone unfamiliar
with troubleshooting Ceph. For most situations, setting the following options
on your monitors will be enough to pinpoint a potential source of the issue::

    debug mon = 10
    debug ms = 1

If we find that these debug levels are not enough, there's a chance we may
ask you to raise them, or even to define other debug subsystems to obtain
information from -- but at least we started off with some useful information,
instead of a largely empty log without much to go on.

Do I need to restart a monitor to adjust debug levels?
------------------------------------------------------

No. You may do it in one of two ways:

You have quorum

  Either inject the debug option into the monitor you want to debug::

      ceph tell mon.FOO injectargs --debug_mon 10/10

  or into all monitors at once::

      ceph tell mon.* injectargs --debug_mon 10/10

No quorum

  Use the monitor's admin socket and directly adjust the configuration
  options::

      ceph daemon mon.FOO config set debug_mon 10/10


Going back to the default values is as easy as rerunning the above commands
with the debug level ``1/10`` instead. You can check your current values
using the admin socket and the following commands::

    ceph daemon mon.FOO config show

or::

    ceph daemon mon.FOO config get 'OPTION_NAME'


Reproduced the problem with appropriate debug levels. Now what?
----------------------------------------------------------------

Ideally you would send us only the relevant portions of your logs.
We realise that figuring out the corresponding portion may not be the
easiest of tasks. Therefore, we won't hold it against you if you provide the
full log, but common sense should be employed. If your log has hundreds of
thousands of lines, it may get tricky to go through the whole thing,
especially if we do not know at which point your issue happened. For
instance, when reproducing the problem, make a note of the current time and
date, and extract the relevant portions of your logs based on that.

Finally, you should reach out to us on the mailing lists, on IRC, or file a
new issue on the `tracker`_.

.. _cluster map: ../../architecture#cluster-map
.. _replace: ../operation/add-or-rm-mons
.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new

diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
new file mode 100644
index 0000000..88307fe
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -0,0 +1,536 @@
======================
 Troubleshooting OSDs
======================

Before troubleshooting your OSDs, check your monitors and network first. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph
returns a health status, it means that the monitors have a quorum. If you
don't have a monitor quorum, or if there are errors with the monitor status,
`address the monitor issues first <../troubleshooting-mon>`_. Check your
networks to ensure they are running properly, because networks may have a
significant impact on OSD operation and performance.


Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).
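For example, a handful of read-only commands already give a useful overview
of OSD state. The sketch below assumes a recent release (``ceph osd df``, for
instance, is available from Hammer onwards); exact output formats vary::

    ceph osd stat                 # how many OSDs exist, and how many are up/in
    ceph osd tree                 # OSD status and placement in the CRUSH hierarchy
    ceph osd df                   # per-OSD space utilization and PG counts
    ceph osd dump | grep flags    # cluster-wide flags, such as noout

None of these commands change cluster state, so they are safe to run while
you investigate.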
+ + +Ceph Logs +--------- + +If you haven't changed the default path, you can find Ceph log files at +``/var/log/ceph``:: + + ls /var/log/ceph + +If you don't get enough log detail, you can change your logging level. See +`Logging and Debugging`_ for details to ensure that Ceph performs adequately +under high logging volume. + + +Admin Socket +------------ + +Use the admin socket tool to retrieve runtime information. For details, list +the sockets for your Ceph processes:: + + ls /var/run/ceph + +Then, execute the following, replacing ``{daemon-name}`` with an actual +daemon (e.g., ``osd.0``):: + + ceph daemon osd.0 help + +Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``):: + + ceph daemon {socket-file} help + + +The admin socket, among other things, allows you to: + +- List your configuration at runtime +- Dump historic operations +- Dump the operation priority queue state +- Dump operations in flight +- Dump perfcounters + + +Display Freespace +----------------- + +Filesystem issues may arise. To display your filesystem's free space, execute +``df``. :: + + df -h + +Execute ``df --help`` for additional usage. + + +I/O Statistics +-------------- + +Use `iostat`_ to identify I/O-related issues. :: + + iostat -x + + +Diagnostic Messages +------------------- + +To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep`` +or ``tail``. For example:: + + dmesg | grep scsi + + +Stopping w/out Rebalancing +========================== + +Periodically, you may need to perform maintenance on a subset of your cluster, +or resolve a problem that affects a failure domain (e.g., a rack). If you do not +want CRUSH to automatically rebalance the cluster as you stop OSDs for +maintenance, set the cluster to ``noout`` first:: + + ceph osd set noout + +Once the cluster is set to ``noout``, you can begin stopping the OSDs within the +failure domain that requires maintenance work. :: + + stop ceph-osd id={num} + +.. note:: Placement groups within the OSDs you stop will become ``degraded`` + while you are addressing issues with within the failure domain. + +Once you have completed your maintenance, restart the OSDs. :: + + start ceph-osd id={num} + +Finally, you must unset the cluster from ``noout``. :: + + ceph osd unset noout + + + +.. _osd-not-running: + +OSD Not Running +=============== + +Under normal circumstances, simply restarting the ``ceph-osd`` daemon will +allow it to rejoin the cluster and recover. + +An OSD Won't Start +------------------ + +If you start your cluster and an OSD won't start, check the following: + +- **Configuration File:** If you were not able to get OSDs running from + a new installation, check your configuration file to ensure it conforms + (e.g., ``host`` not ``hostname``, etc.). + +- **Check Paths:** Check the paths in your configuration, and the actual + paths themselves for data and journals. If you separate the OSD data from + the journal data and there are errors in your configuration file or in the + actual mounts, you may have trouble starting OSDs. If you want to store the + journal on a block device, you should partition your journal disk and assign + one partition per OSD. + +- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be + hitting the default maximum number of threads (e.g., usually 32k), especially + during recovery. 
You can increase the number of threads using ``sysctl`` to + see if increasing the maximum number of threads to the maximum possible + number of threads allowed (i.e., 4194303) will help. For example:: + + sysctl -w kernel.pid_max=4194303 + + If increasing the maximum thread count resolves the issue, you can make it + permanent by including a ``kernel.pid_max`` setting in the + ``/etc/sysctl.conf`` file. For example:: + + kernel.pid_max = 4194303 + +- **Kernel Version:** Identify the kernel version and distribution you + are using. Ceph uses some third party tools by default, which may be + buggy or may conflict with certain distributions and/or kernel + versions (e.g., Google perftools). Check the `OS recommendations`_ + to ensure you have addressed any issues related to your kernel. + +- **Segment Fault:** If there is a segment fault, turn your logging up + (if it is not already), and try again. If it segment faults again, + contact the ceph-devel email list and provide your Ceph configuration + file, your monitor output and the contents of your log file(s). + + + +An OSD Failed +------------- + +When a ``ceph-osd`` process dies, the monitor will learn about the failure +from surviving ``ceph-osd`` daemons and report it via the ``ceph health`` +command:: + + ceph health + HEALTH_WARN 1/3 in osds are down + +Specifically, you will get a warning whenever there are ``ceph-osd`` +processes that are marked ``in`` and ``down``. You can identify which +``ceph-osds`` are ``down`` with:: + + ceph health detail + HEALTH_WARN 1/3 in osds are down + osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080 + +If there is a disk +failure or other fault preventing ``ceph-osd`` from functioning or +restarting, an error message should be present in its log file in +``/var/log/ceph``. + +If the daemon stopped because of a heartbeat failure, the underlying +kernel file system may be unresponsive. Check ``dmesg`` output for disk +or other kernel errors. + +If the problem is a software error (failed assertion or other +unexpected error), it should be reported to the `ceph-devel`_ email list. + + +No Free Drive Space +------------------- + +Ceph prevents you from writing to a full OSD so that you don't lose data. +In an operational cluster, you should receive a warning when your cluster +is getting near its full ratio. The ``mon osd full ratio`` defaults to +``0.95``, or 95% of capacity before it stops clients from writing data. +The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90 % of +capacity when it blocks backfills from starting. The +``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity +when it generates a health warning. + +Full cluster issues usually arise when testing how Ceph handles an OSD +failure on a small cluster. When one node has a high percentage of the +cluster's data, the cluster can easily eclipse its nearfull and full ratio +immediately. If you are testing how Ceph reacts to OSD failures on a small +cluster, you should leave ample free disk space and consider temporarily +lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and +``mon osd nearfull ratio``. 
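As a sketch of that temporary adjustment: on a Luminous or later test cluster
the effective ratios can be inspected and raised at runtime as shown below
(earlier releases used ``ceph pg set_full_ratio`` and related commands
instead; the values here are examples only)::

    ceph osd dump | grep ratio            # show current full/backfillfull/nearfull ratios
    ceph osd set-nearfull-ratio 0.9       # warn later than the 0.85 default
    ceph osd set-backfillfull-ratio 0.92  # allow backfill closer to full
    ceph osd set-full-ratio 0.97          # block writes later than the 0.95 default

Remember to restore the defaults once testing is done; running a production
cluster close to full is risky.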
+ +Full ``ceph-osds`` will be reported by ``ceph health``:: + + ceph health + HEALTH_WARN 1 nearfull osd(s) + +Or:: + + ceph health detail + HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s) + osd.3 is full at 97% + osd.4 is backfill full at 91% + osd.2 is near full at 87% + +The best way to deal with a full cluster is to add new ``ceph-osds``, allowing +the cluster to redistribute data to the newly available storage. + +If you cannot start an OSD because it is full, you may delete some data by deleting +some placement group directories in the full OSD. + +.. important:: If you choose to delete a placement group directory on a full OSD, + **DO NOT** delete the same placement group directory on another full OSD, or + **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on + at least one OSD. + +See `Monitor Config Reference`_ for additional details. + + +OSDs are Slow/Unresponsive +========================== + +A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you +have eliminated other troubleshooting possibilities before delving into OSD +performance issues. For example, ensure that your network(s) is working properly +and your OSDs are running. Check to see if OSDs are throttling recovery traffic. + +.. tip:: Newer versions of Ceph provide better recovery handling by preventing + recovering OSDs from using up system resources so that ``up`` and ``in`` + OSDs are not available or are otherwise slow. + + +Networking Issues +----------------- + +Ceph is a distributed storage system, so it depends upon networks to peer with +OSDs, replicate objects, recover from faults and check heartbeats. Networking +issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for +details. + +Ensure that Ceph processes and Ceph-dependent processes are connected and/or +listening. :: + + netstat -a | grep ceph + netstat -l | grep ceph + sudo netstat -p | grep ceph + +Check network statistics. :: + + netstat -s + + +Drive Configuration +------------------- + +A storage drive should only support one OSD. Sequential read and sequential +write throughput can bottleneck if other processes share the drive, including +journals, operating systems, monitors, other OSDs and non-Ceph processes. + +Ceph acknowledges writes *after* journaling, so fast SSDs are an +attractive option to accelerate the response time--particularly when +using the ``XFS`` or ``ext4`` filesystems. By contrast, the ``btrfs`` +filesystem can write and journal simultaneously. (Note, however, that +we recommend against using ``btrfs`` for production deployments.) + +.. note:: Partitioning a drive does not change its total throughput or + sequential read/write limits. Running a journal in a separate partition + may help, but you should prefer a separate physical drive. + + +Bad Sectors / Fragmented Disk +----------------------------- + +Check your disks for bad sectors and fragmentation. This can cause total throughput +to drop substantially. + + +Co-resident Monitors/OSDs +------------------------- + +Monitors are generally light-weight processes, but they do lots of ``fsync()``, +which can interfere with other workloads, particularly if monitors run on the +same drive as your OSDs. Additionally, if you run monitors on the same host as +the OSDs, you may incur performance issues related to: + +- Running an older kernel (pre-3.0) +- Running Argonaut with an old ``glibc`` +- Running a kernel with no syncfs(2) syscall. 
+
+In these cases, multiple OSDs running on the same host can drag each other down
+by doing lots of commits. That often leads to bursty writes.
+
+
+Co-resident Processes
+---------------------
+
+Spinning up co-resident processes such as cloud-based solutions, virtual
+machines and other applications that write data to Ceph while operating on the
+same hardware as OSDs can introduce significant OSD latency. Generally, we
+recommend optimizing a host for use with Ceph and using other hosts for other
+processes. Separating Ceph operations from other applications may help improve
+performance and may streamline troubleshooting and maintenance.
+
+
+Logging Levels
+--------------
+
+If you turned logging levels up to track an issue and then forgot to turn
+logging levels back down, the OSD may be writing a great deal of log data to
+disk. If you intend to keep logging levels high, you may consider mounting a
+drive at the default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
+
+
+Recovery Throttling
+-------------------
+
+Depending upon your configuration, Ceph may reduce recovery rates to maintain
+performance or it may increase recovery rates to the point that recovery
+impacts OSD performance. Check to see if the OSD is recovering.
+
+
+Kernel Version
+--------------
+
+Check the kernel version you are running. Older kernels may not receive
+new backports that Ceph depends upon for better performance.
+
+
+Kernel Issues with SyncFS
+-------------------------
+
+Try running one OSD per host to see if performance improves. Old kernels
+might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
+
+
+Filesystem Issues
+-----------------
+
+Currently, we recommend deploying clusters with XFS.
+
+We recommend against using btrfs or ext4. The btrfs filesystem has
+many attractive features, but bugs in the filesystem may lead to
+performance issues and spurious ENOSPC errors. We do not recommend
+ext4 because xattr size limitations break our support for long object
+names (needed for RGW).
+
+For more information, see `Filesystem Recommendations`_.
+
+.. _Filesystem Recommendations: ../configuration/filesystem-recommendations
+
+
+Insufficient RAM
+----------------
+
+We recommend 1GB of RAM per OSD daemon. You may notice that during normal
+operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
+Unused RAM makes it tempting to use the excess RAM for co-resident
+applications, VMs and so forth. However, when OSDs go into recovery mode,
+their memory utilization spikes. If there is no RAM available, OSD
+performance will slow considerably.
+
+
+Old Requests or Slow Requests
+-----------------------------
+
+If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log
+messages complaining about requests that are taking too long. The warning
+threshold defaults to 30 seconds, and is configurable via the
+``osd op complaint time`` option. When this happens, the cluster log will
+receive messages.
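+
+If your hardware legitimately needs longer than 30 seconds for some
+operations, the threshold can be raised at runtime. A minimal sketch,
+assuming a release that supports ``injectargs`` (the value is in seconds,
+and the change does not persist across daemon restarts)::
+
+    ceph tell osd.* injectargs '--osd-op-complaint-time 60'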
+
+Legacy versions of Ceph complain about ``old requests``::
+
+    osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+New versions of Ceph complain about ``slow requests``::
+
+    {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+    {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+Possible causes include:
+
+- A bad drive (check ``dmesg`` output)
+- A kernel file system bug (check ``dmesg`` output)
+- An overloaded cluster (check system load, iostat, etc.)
+- A bug in the ``ceph-osd`` daemon
+
+Possible solutions:
+
+- Remove VMs and other cloud solutions from Ceph hosts
+- Upgrade the kernel
+- Upgrade Ceph
+- Restart OSDs
+
+Debugging Slow Requests
+-----------------------
+
+If you run ``ceph daemon osd.<id> dump_historic_ops`` or
+``ceph daemon osd.<id> dump_ops_in_flight``, you will see a set of operations
+and a list of events each operation went through. These are briefly described
+below.
+
+Events from the Messenger layer:
+
+- header_read: when the messenger first started reading the message off the wire
+- throttled: when the messenger tried to acquire memory throttle space to read
+  the message into memory
+- all_read: when the messenger finished reading the message off the wire
+- dispatched: when the messenger gave the message to the OSD
+- initiated: identical to header_read; the existence of both is a
+  historical oddity
+
+Events from the OSD as it prepares operations:
+
+- queued_for_pg: the op has been put into the queue for processing by its PG
+- reached_pg: the PG has started doing the op
+- waiting for \*: the op is waiting for some other work to complete before it
+  can proceed (a new OSDMap; for its object target to scrub; for the PG to
+  finish peering; all as specified in the message)
+- started: the op has been accepted as something the OSD should actually do
+  (reasons not to do it: failed security/permission checks; out-of-date local
+  state; etc.) and is now actually being performed
+- waiting for subops from: the op has been sent to replica OSDs
+
+Events from the FileStore:
+
+- commit_queued_for_journal_write: the op has been given to the FileStore
+- write_thread_in_journal_buffer: the op is in the journal's buffer and waiting
+  to be persisted (as the next disk write)
+- journaled_completion_queued: the op was journaled to disk and its callback
+  queued for invocation
+
+Events from the OSD after the op has been given to the local disk:
+
+- op_commit: the op has been committed (i.e., written to journal) by the
+  primary OSD
+- op_applied: the op has been write()'en to the backing FS (i.e., applied in
+  memory but not flushed out to disk) on the primary
+- sub_op_applied: op_applied, but for a replica's "subop"
+- sub_op_committed: op_commit, but for a replica's subop (only for EC pools)
+- sub_op_commit_rec/sub_op_apply_rec from <X>: the primary marks this when it
+  hears about the above, but for a particular replica <X>
+- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops)
+
+Many of these events are seemingly redundant, but cross important boundaries in
+the internal code (such as passing data across locks into new threads).
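+
+To collect these dumps, run the commands on the host where the OSD is
+running, via the daemon's admin socket; both return JSON. A minimal sketch
+for ``osd.0`` (substitute the id of the slow OSD)::
+
+    ceph daemon osd.0 dump_ops_in_flight
+    ceph daemon osd.0 dump_historic_ops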
+
+Flapping OSDs
+=============
+
+We recommend using both a public (front-end) network and a cluster (back-end)
+network so that you can better meet the capacity requirements of object
+replication. Another advantage is that you can run a cluster network such that
+it is not connected to the internet, thereby preventing some denial of service
+attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+However, if the cluster (back-end) network fails or develops significant latency
+while the public (front-end) network operates optimally, OSDs currently do not
+handle this situation well. What happens is that OSDs mark each other ``down``
+on the monitor, while marking themselves ``up``. We call this scenario
+``flapping``.
+
+If something is causing OSDs to flap (repeatedly getting marked ``down`` and
+then ``up`` again), you can force the monitors to stop the flapping with::
+
+    ceph osd set noup      # prevent OSDs from getting marked up
+    ceph osd set nodown    # prevent OSDs from getting marked down
+
+These flags are recorded in the osdmap structure::
+
+    ceph osd dump | grep flags
+    flags no-up,no-down
+
+You can clear the flags with::
+
+    ceph osd unset noup
+    ceph osd unset nodown
+
+Two other flags are supported, ``noin`` and ``noout``, which prevent
+booting OSDs from being marked ``in`` (allocated data) or protect OSDs
+from eventually being marked ``out`` (regardless of what the current value for
+``mon osd down out interval`` is).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
+   sense that once the flags are cleared, the action they were blocking
+   should occur shortly after. The ``noin`` flag, on the other hand,
+   prevents OSDs from being marked ``in`` on boot, and any daemons that
+   started while the flag was set will remain that way.
+
+
+.. _iostat: http://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging: ../log-and-debug
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _Monitor Config Reference: ../../configuration/mon-config-ref
+.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
+.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
+.. _OS recommendations: ../../../start/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org
diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst
new file mode 100644
index 0000000..4241fee
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst
@@ -0,0 +1,668 @@
+=====================
+ Troubleshooting PGs
+=====================
+
+Placement Groups Never Get Clean
+================================
+
+When you create a cluster and your cluster remains in ``active``,
+``active+remapped`` or ``active+degraded`` status and never achieves an
+``active+clean`` status, you likely have a problem with your configuration.
+
+You may need to review settings in the `Pool, PG and CRUSH Config Reference`_
+and make appropriate adjustments.
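+
+A quick first check is to confirm what the cluster actually has configured
+and which placement groups are unhealthy. A minimal sketch (``rbd`` below is
+a stand-in for whichever pool is affected)::
+
+    ceph osd pool get rbd size
+    ceph osd pool get rbd min_size
+    ceph pg dump_stuck unclean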
+
+As a general rule, you should run your cluster with more than one OSD and a
+pool size greater than one object replica.
+
+One Node Cluster
+----------------
+
+Ceph no longer provides documentation for operating on a single node, because
+you would never deploy a system designed for distributed computing on a single
+node. Additionally, mounting client kernel modules on a single node containing a
+Ceph daemon may cause a deadlock due to issues with the Linux kernel itself
+(unless you use VMs for the clients). You can experiment with Ceph in a 1-node
+configuration, in spite of the limitations described herein.
+
+If you are trying to create a cluster on a single node, you must change the
+default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning
+``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
+file before you create your monitors and OSDs. This tells Ceph that an OSD
+can peer with another OSD on the same host. If you are trying to set up a
+1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``,
+Ceph will try to peer the PGs of one OSD with the PGs of another OSD on
+another node, chassis, rack, row, or even datacenter depending on the setting.
+
+.. tip:: DO NOT mount kernel clients directly on the same node as your
+   Ceph Storage Cluster, because kernel conflicts can arise. However, you
+   can mount kernel clients within virtual machines (VMs) on a single node.
+
+If you are creating OSDs using a single disk, you must create directories
+for the data manually first. For example::
+
+    mkdir /var/local/osd0 /var/local/osd1
+    ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
+    ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
+
+
+Fewer OSDs than Replicas
+------------------------
+
+If you have brought up two OSDs to an ``up`` and ``in`` state, but you still
+don't see ``active+clean`` placement groups, you may have an
+``osd pool default size`` set to greater than ``2``.
+
+There are a few ways to address this situation. If you want to operate your
+cluster in an ``active+degraded`` state with two replicas, you can set the
+``osd pool default min size`` to ``2`` so that you can write objects in
+an ``active+degraded`` state. You may also set the ``osd pool default size``
+setting to ``2`` so that you only have two stored replicas (the original and
+one replica), in which case the cluster should achieve an ``active+clean``
+state.
+
+.. note:: You can make the changes at runtime. If you make the changes in
+   your Ceph configuration file, you may need to restart your cluster.
+
+
+Pool Size = 1
+-------------
+
+If you have the ``osd pool default size`` set to ``1``, you will only have
+one copy of the object. OSDs rely on other OSDs to tell them which objects
+they should have. If a first OSD has a copy of an object and there is no
+second copy, then no second OSD can tell the first OSD that it should have
+that copy. For each placement group mapped to the first OSD (see
+``ceph pg dump``), you can force the first OSD to notice the placement groups
+it needs by running::
+
+    ceph osd force-create-pg <pgid>
+
+
+CRUSH Map Errors
+----------------
+
+Another candidate for placement groups remaining unclean involves errors
+in your CRUSH map.
+
+
+Stuck Placement Groups
+======================
+
+It is normal for placement groups to enter states like "degraded" or "peering"
+following a failure. These states indicate the normal progression through the
+failure recovery process. However, if a placement group stays in one of these
+states for a long time, this may be an indication of a larger problem. For this
+reason, the monitor will warn when placement groups get "stuck" in a
+non-optimal state. Specifically, we check for:
+
+* ``inactive`` - The placement group has not been ``active`` for too long
+  (i.e., it hasn't been able to service read/write requests).
+
+* ``unclean`` - The placement group has not been ``clean`` for too long
+  (i.e., it hasn't been able to completely recover from a previous failure).
+
+* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
+  indicating that all nodes storing this placement group may be ``down``.
+
+You can explicitly list stuck placement groups with one of::
+
+    ceph pg dump_stuck stale
+    ceph pg dump_stuck inactive
+    ceph pg dump_stuck unclean
+
+For stuck ``stale`` placement groups, it is normally a matter of getting the
+right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
+groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
+stuck ``unclean`` placement groups, there is usually something preventing
+recovery from completing, like unfound objects (see
+:ref:`failures-osd-unfound`).
+
+
+.. _failures-osd-peering:
+
+Placement Group Down - Peering Failure
+======================================
+
+In certain cases, the ``ceph-osd`` `Peering` process can run into
+problems, preventing a PG from becoming active and usable. For
+example, ``ceph health`` might report::
+
+    ceph health detail
+    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
+    ...
+    pg 0.5 is down+peering
+    pg 1.4 is down+peering
+    ...
+    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
+
+We can query the cluster to determine exactly why the PG is marked ``down`` with::
+
+    ceph pg 0.5 query
+
+.. code-block:: javascript
+
+   { "state": "down+peering",
+     ...
+     "recovery_state": [
+          { "name": "Started\/Primary\/Peering\/GetInfo",
+            "enter_time": "2012-03-06 14:40:16.169679",
+            "requested_info_from": []},
+          { "name": "Started\/Primary\/Peering",
+            "enter_time": "2012-03-06 14:40:16.169659",
+            "probing_osds": [
+                  0,
+                  1],
+            "blocked": "peering is blocked due to down osds",
+            "down_osds_we_would_probe": [
+                  1],
+            "peering_blocked_by": [
+                  { "osd": 1,
+                    "current_lost_at": 0,
+                    "comment": "starting or marking this osd lost may let us proceed"}]},
+          { "name": "Started",
+            "enter_time": "2012-03-06 14:40:16.169513"}
+      ]
+   }
+
+The ``recovery_state`` section tells us that peering is blocked due to
+down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start
+that ``ceph-osd`` and things will recover.
+
+Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
+failure), we can tell the cluster that it is ``lost`` and to cope as
+best it can.
+
+.. important:: This is dangerous in that the cluster cannot
+   guarantee that the other copies of the data are consistent
+   and up to date.
+
+To instruct Ceph to continue anyway::
+
+    ceph osd lost 1
+
+Recovery will proceed.
+
+
+.. _failures-osd-unfound:
+
+Unfound Objects
+===============
+
+Under certain combinations of failures Ceph may complain about
+``unfound`` objects::
+
+    ceph health detail
+    HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
+    pg 2.4 is active+degraded, 78 unfound
+
+This means that the storage cluster knows that some objects (or newer
+copies of existing objects) exist, but it hasn't found copies of them.
+One example of how this might come about for a PG whose data is on ceph-osds
+1 and 2:
+
+* 1 goes down
+* 2 handles some writes, alone
+* 1 comes up
+* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
+* Before the new objects are copied, 2 goes down.
+
+Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that
+has a copy. In this case, IO to those objects will block, and the
+cluster will hope that the failed node comes back soon; this is
+assumed to be preferable to returning an IO error to the user.
+
+First, you can identify which objects are unfound with::
+
+    ceph pg 2.4 list_missing [starting offset, in json]
+
+.. code-block:: javascript
+
+   { "offset": { "oid": "",
+         "key": "",
+         "snapid": 0,
+         "hash": 0,
+         "max": 0},
+     "num_missing": 0,
+     "num_unfound": 0,
+     "objects": [
+        { "oid": "object 1",
+          "key": "",
+          "hash": 0,
+          "max": 0 },
+        ...
+     ],
+     "more": 0}
+
+If there are too many objects to list in a single result, the ``more``
+field will be true and you can query for more. (Eventually the
+command line tool will hide this from you, but not yet.)
+
+Second, you can identify which OSDs have been probed or might contain
+data::
+
+    ceph pg 2.4 query
+
+.. code-block:: javascript
+
+   "recovery_state": [
+        { "name": "Started\/Primary\/Active",
+          "enter_time": "2012-03-06 15:15:46.713212",
+          "might_have_unfound": [
+                { "osd": 1,
+                  "status": "osd is down"}]},
+
+In this case, for example, the cluster knows that ``osd.1`` might have
+data, but it is ``down``. The full range of possible states includes:
+
+* already probed
+* querying
+* OSD is down
+* not queried (yet)
+
+Sometimes it simply takes some time for the cluster to query possible
+locations.
+
+It is possible that there are other locations where the object might
+exist that are not listed. For example, if a ceph-osd is stopped and
+taken out of the cluster, the cluster fully recovers, and due to some
+future set of failures ends up with an unfound object, it won't
+consider the long-departed ceph-osd as a potential location to
+consider. (This scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This, again, is
+possible given unusual combinations of failures that allow the cluster
+to learn about writes that were performed before the writes themselves
+are recovered. To mark the "unfound" objects as "lost"::
+
+    ceph pg 2.5 mark_unfound_lost revert|delete
+
+The final argument specifies how the cluster should deal with
+lost objects.
+
+The "delete" option will forget about them entirely.
+
+The "revert" option (not available for erasure coded pools) will
+either roll back to a previous version of the object or (if it was a
+new object) forget about it entirely. Use this with caution, as it
+may confuse applications that expected the object to exist.
+
+
+Homeless Placement Groups
+=========================
+
+It is possible for all OSDs that had copies of a given placement group to fail.
+If that's the case, that subset of the object store is unavailable, and the
+monitor will receive no status updates for those placement groups. To detect
+this situation, the monitor marks any placement group whose primary OSD has
+failed as ``stale``. For example::
+
+    ceph health
+    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+
+You can identify which placement groups are ``stale``, and what the last OSDs to
+store them were, with::
+
+    ceph health detail
+    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
+    ...
+    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+    ...
+    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
+    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
+    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
+
+If we want to get placement group 2.5 back online, for example, this tells us that
+it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
+daemons will allow the cluster to recover that placement group (and, presumably,
+many others).
+
+
+Only a Few OSDs Receive Data
+============================
+
+If you have many nodes in your cluster and only a few of them receive data,
+`check`_ the number of placement groups in your pool. Since placement groups get
+mapped to OSDs, a small number of placement groups will not distribute data
+evenly across your cluster. Try creating a pool with a placement group count
+that is a multiple of the number of OSDs. See `Placement Groups`_ for details.
+The default placement group count for pools is not useful, but you can change it
+`here`_.
+
+
+Can't Write Data
+================
+
+If your cluster is up, but some OSDs are down and you cannot write data,
+check to ensure that you have the minimum number of OSDs running for the
+placement group. If you don't have the minimum number of OSDs running,
+Ceph will not allow you to write data because there is no guarantee
+that Ceph can replicate your data. See ``osd pool default min size``
+in the `Pool, PG and CRUSH Config Reference`_ for details.
+
+
+PGs Inconsistent
+================
+
+If you see an ``active+clean+inconsistent`` state, this may happen
+due to an error during scrubbing. As always, we can identify the inconsistent
+placement group(s) with::
+
+    $ ceph health detail
+    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
+    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
+    2 scrub errors
+
+Or if you prefer inspecting the output in a programmatic way::
+
+    $ rados list-inconsistent-pg rbd
+    ["0.6"]
+
+There is only one consistent state, but in the worst case, we could find
+different inconsistencies, from multiple perspectives, in more than one
+object. If an object named ``foo`` in PG ``0.6`` is truncated, we will have::
+
+    $ rados list-inconsistent-obj 0.6 --format=json-pretty
+
+.. code-block:: javascript
+
+   {
+     "epoch": 14,
+     "inconsistents": [
+       {
+         "object": {
+           "name": "foo",
+           "nspace": "",
+           "locator": "",
+           "snap": "head",
+           "version": 1
+         },
+         "errors": [
+           "data_digest_mismatch",
+           "size_mismatch"
+         ],
+         "union_shard_errors": [
+           "data_digest_mismatch_oi",
+           "size_mismatch_oi"
+         ],
+         "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
+         "shards": [
+           {
+             "osd": 0,
+             "errors": [],
+             "size": 968,
+             "omap_digest": "0xffffffff",
+             "data_digest": "0xe978e67f"
+           },
+           {
+             "osd": 1,
+             "errors": [],
+             "size": 968,
+             "omap_digest": "0xffffffff",
+             "data_digest": "0xe978e67f"
+           },
+           {
+             "osd": 2,
+             "errors": [
+               "data_digest_mismatch_oi",
+               "size_mismatch_oi"
+             ],
+             "size": 0,
+             "omap_digest": "0xffffffff",
+             "data_digest": "0xffffffff"
+           }
+         ]
+       }
+     ]
+   }
+
+In this case, we can learn from the output:
+
+* The only inconsistent object is named ``foo``, and it is its head that has
+  inconsistencies.
+* The inconsistencies fall into two categories:
+
+  * ``errors``: these errors indicate inconsistencies between shards without a
+    determination of which shard(s) are bad. Check for the ``errors`` in the
+    `shards` array, if available, to pinpoint the problem.
+
+    * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is
+      different from the ones of OSD.0 and OSD.1
+    * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while
+      the size reported by OSD.0 and OSD.1 is 968.
+
+  * ``union_shard_errors``: the union of all shard-specific ``errors`` in the
+    ``shards`` array. The ``errors`` are set for the given shard that has the
+    problem. They include errors like ``read_error``. The ``errors`` ending in
+    ``oi`` indicate a comparison with ``selected_object_info``. Look at the
+    ``shards`` array to determine which shard has which error(s).
+
+    * ``data_digest_mismatch_oi``: the digest stored in the object-info is not
+      ``0xffffffff``, which is calculated from the shard read from OSD.2
+    * ``size_mismatch_oi``: the size stored in the object-info is different
+      from the one read from OSD.2. The latter is 0.
+
+You can repair the inconsistent placement group by executing::
+
+    ceph pg repair {placement-group-ID}
+
+This overwrites the `bad` copies with the `authoritative` ones. In most cases,
+Ceph is able to choose authoritative copies from all available replicas using
+some predefined criteria. This does not always work, however: for example, the
+stored data digest could be missing, in which case the calculated digest is
+ignored when choosing the authoritative copies. So, please use the above
+command with caution.
+
+If ``read_error`` is listed in the ``errors`` attribute of a shard, the
+inconsistency is likely due to disk errors. You might want to check the disk
+used by that OSD.
+
+If you receive ``active+clean+inconsistent`` states periodically due to
+clock skew, you may consider configuring your `NTP`_ daemons on your
+monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
+`Clock Settings`_ for additional details.
+
+
+Erasure Coded PGs are not active+clean
+======================================
+
+When CRUSH fails to find enough OSDs to map to a PG, it will show
+``2147483647``, which is ITEM_NONE or ``no OSD found``. For instance::
+
+    [2,1,6,0,5,8,2147483647,7,4]
+
+Not enough OSDs
+---------------
+
+If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
+9, that is what it will show. You can either create another erasure
+coded pool that requires fewer OSDs::
+
+    ceph osd erasure-code-profile set myprofile k=5 m=3
+    ceph osd pool create erasurepool 16 16 erasure myprofile
+
+or add new OSDs, and the PG will automatically use them.
+
+CRUSH constraints cannot be satisfied
+-------------------------------------
+
+If the cluster has enough OSDs, it is possible that the CRUSH ruleset
+imposes constraints that cannot be satisfied. If there are 10 OSDs on
+two hosts and the CRUSH rulesets require that no two OSDs from the
+same host are used in the same PG, the mapping may fail because only
+two OSDs will be found. You can check the constraint by displaying the
+ruleset::
+
+    $ ceph osd crush rule ls
+    [
+        "replicated_ruleset",
+        "erasurepool"]
+    $ ceph osd crush rule dump erasurepool
+    { "rule_id": 1,
+      "rule_name": "erasurepool",
+      "ruleset": 1,
+      "type": 3,
+      "min_size": 3,
+      "max_size": 20,
+      "steps": [
+            { "op": "take",
+              "item": -1,
+              "item_name": "default"},
+            { "op": "chooseleaf_indep",
+              "num": 0,
+              "type": "host"},
+            { "op": "emit"}]}
+
+
+You can resolve the problem by creating a new pool in which PGs are allowed
+to have OSDs residing on the same host with::
+
+    ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
+    ceph osd pool create erasurepool 16 16 erasure myprofile
+
+CRUSH gives up too soon
+-----------------------
+
+If the Ceph cluster has just enough OSDs to map the PG (for instance a
+cluster with a total of 9 OSDs and an erasure coded pool that requires
+9 OSDs per PG), it is possible that CRUSH gives up before finding a
+mapping. It can be resolved by:
+
+* lowering the erasure coded pool requirements to use fewer OSDs per PG
+  (that requires the creation of another pool, as erasure code profiles
+  cannot be dynamically modified).
+
+* adding more OSDs to the cluster (that does not require the erasure
+  coded pool to be modified; it will become clean automatically).
+
+* using a hand-made CRUSH ruleset that tries more times to find a good
+  mapping. This can be done by setting ``set_choose_tries`` to a value
+  greater than the default.
+
+You should first verify the problem with ``crushtool`` after
+extracting the crushmap from the cluster, so your experiments do not
+modify the Ceph cluster and only work on local files::
+
+    $ ceph osd crush rule dump erasurepool
+    { "rule_name": "erasurepool",
+      "ruleset": 1,
+      "type": 3,
+      "min_size": 3,
+      "max_size": 20,
+      "steps": [
+            { "op": "take",
+              "item": -1,
+              "item_name": "default"},
+            { "op": "chooseleaf_indep",
+              "num": 0,
+              "type": "host"},
+            { "op": "emit"}]}
+    $ ceph osd getcrushmap > crush.map
+    got crush map from osdmap epoch 13
+    $ crushtool -i crush.map --test --show-bad-mappings \
+       --rule 1 \
+       --num-rep 9 \
+       --min-x 1 --max-x $((1024 * 1024))
+    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
+    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
+    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
+
+Where ``--num-rep`` is the number of OSDs the erasure code crush
+ruleset needs, and ``--rule`` is the value of the ``ruleset`` field
+displayed by ``ceph osd crush rule dump``. The test will try mapping
+one million values (i.e. the range defined by ``[--min-x,--max-x]``)
+and must display at least one bad mapping. If it outputs nothing it
+means all mappings are successful and you can stop right there: the
+problem is elsewhere.
+
+The crush ruleset can be edited by decompiling the crush map::
+
+    $ crushtool --decompile crush.map > crush.txt
+
+and adding the following line to the ruleset::
+
+    step set_choose_tries 100
+
+The relevant part of the ``crush.txt`` file should look something
+like::
+
+    rule erasurepool {
+            ruleset 1
+            type erasure
+            min_size 3
+            max_size 20
+            step set_chooseleaf_tries 5
+            step set_choose_tries 100
+            step take default
+            step chooseleaf indep 0 type host
+            step emit
+    }
+
+It can then be compiled and tested again::
+
+    $ crushtool --compile crush.txt -o better-crush.map
+
+When all mappings succeed, a histogram of the number of tries that
+were necessary to find all of them can be displayed with the
+``--show-choose-tries`` option of ``crushtool``::
+
+    $ crushtool -i better-crush.map --test --show-bad-mappings \
+       --show-choose-tries \
+       --rule 1 \
+       --num-rep 9 \
+       --min-x 1 --max-x $((1024 * 1024))
+    ...
+    11:        42
+    12:        44
+    13:        54
+    14:        45
+    15:        35
+    16:        34
+    17:        30
+    18:        25
+    19:        19
+    20:        22
+    21:        20
+    22:        17
+    23:        13
+    24:        16
+    25:        13
+    26:        11
+    27:        11
+    28:        13
+    29:        11
+    30:        10
+    31:         6
+    32:         5
+    33:        10
+    34:         3
+    35:         7
+    36:         5
+    37:         2
+    38:         5
+    39:         5
+    40:         2
+    41:         5
+    42:         4
+    43:         1
+    44:         2
+    45:         2
+    46:         3
+    47:         1
+    48:         0
+    ...
+    102:        0
+    103:        1
+    104:        0
+    ...
+
+It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, etc. The highest
+number of tries is the minimum value of ``set_choose_tries`` that prevents
+bad mappings (i.e. 103 in the above output, because it did not take more than
+103 tries for any PG to be mapped).
+
+.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
+.. _here: ../../configuration/pool-pg-config-ref
+.. _Placement Groups: ../../operations/placement-groups
+.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
+.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol
+.. _The Network Time Protocol: http://www.ntp.org/
+.. _Clock Settings: ../../configuration/mon-config-ref/#clock
+