diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
new file mode 100644
index 0000000..88307fe
--- /dev/null
+++ b/src/ceph/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -0,0 +1,536 @@
+======================
+ Troubleshooting OSDs
+======================
+
+Before troubleshooting your OSDs, check your monitors and network first. If
+you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
+a health status, it means that the monitors have a quorum.
+If you don't have a monitor quorum or if there are errors with the monitor
+status, `address the monitor issues first <../troubleshooting-mon>`_.
+Check your networks to ensure they
+are running properly, because networks may have a significant impact on OSD
+operation and performance.
+
+
+
+Obtaining Data About OSDs
+=========================
+
+A good first step in troubleshooting your OSDs is to obtain information in
+addition to the information you collected while `monitoring your OSDs`_
+(e.g., ``ceph osd tree``).
+
+
+Ceph Logs
+---------
+
+If you haven't changed the default path, you can find Ceph log files at
+``/var/log/ceph``::
+
+ ls /var/log/ceph
+
+If the logs do not provide enough detail, you can increase your logging level.
+See `Logging and Debugging`_ for details, including how to ensure that Ceph
+performs adequately under a high logging volume.
+
+
+Admin Socket
+------------
+
+Use the admin socket tool to retrieve runtime information. For details, list
+the sockets for your Ceph processes::
+
+ ls /var/run/ceph
+
+Then, execute the following, replacing ``{daemon-name}`` with an actual
+daemon (e.g., ``osd.0``)::
+
+ ceph daemon osd.0 help
+
+Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/run/ceph``)::
+
+ ceph daemon {socket-file} help
+
+
+The admin socket, among other things, allows you to:
+
+- List your configuration at runtime
+- Dump historic operations
+- Dump the operation priority queue state
+- Dump operations in flight
+- Dump perfcounters
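+
+For example, assuming a daemon named ``osd.0``, commands along the following
+lines cover most of the items above (run ``help`` first, because the exact
+command set varies between releases)::
+
+ ceph daemon osd.0 config show
+ ceph daemon osd.0 dump_historic_ops
+ ceph daemon osd.0 dump_ops_in_flight
+ ceph daemon osd.0 perf dump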
+
+
+Display Freespace
+-----------------
+
+Filesystem issues may arise. To display your filesystem's free space, execute
+``df``. ::
+
+ df -h
+
+Execute ``df --help`` for additional usage.
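+
+If your OSDs use the default data path, you can narrow the output to the OSD
+mount points (adjust the path if your deployment stores data elsewhere)::
+
+ df -h /var/lib/ceph/osd/*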
+
+
+I/O Statistics
+--------------
+
+Use `iostat`_ to identify I/O-related issues. ::
+
+ iostat -x
+
+
+Diagnostic Messages
+-------------------
+
+To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
+or ``tail``. For example::
+
+ dmesg | grep scsi
+
+
+Stopping w/out Rebalancing
+==========================
+
+Periodically, you may need to perform maintenance on a subset of your cluster,
+or resolve a problem that affects a failure domain (e.g., a rack). If you do not
+want CRUSH to automatically rebalance the cluster as you stop OSDs for
+maintenance, set the cluster to ``noout`` first::
+
+ ceph osd set noout
+
+Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
+failure domain that requires maintenance work. ::
+
+ stop ceph-osd id={num}
+
+.. note:: Placement groups within the OSDs you stop will become ``degraded``
+ while you are addressing issues within the failure domain.
+
+Once you have completed your maintenance, restart the OSDs. ::
+
+ start ceph-osd id={num}
+
+Finally, you must unset the cluster from ``noout``. ::
+
+ ceph osd unset noout
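+
+The ``stop``/``start`` commands above assume an Upstart-based system. On
+distributions that manage Ceph with systemd, the equivalents are typically::
+
+ sudo systemctl stop ceph-osd@{num}
+ sudo systemctl start ceph-osd@{num}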
+
+
+
+.. _osd-not-running:
+
+OSD Not Running
+===============
+
+Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
+allow it to rejoin the cluster and recover.
+
+An OSD Won't Start
+------------------
+
+If you start your cluster and an OSD won't start, check the following:
+
+- **Configuration File:** If you were not able to get OSDs running from
+ a new installation, check your configuration file to ensure it conforms
+ (e.g., ``host`` not ``hostname``, etc.).
+
+- **Check Paths:** Check the paths in your configuration, and the actual
+ paths themselves for data and journals. If you separate the OSD data from
+ the journal data and there are errors in your configuration file or in the
+ actual mounts, you may have trouble starting OSDs. If you want to store the
+ journal on a block device, you should partition your journal disk and assign
+ one partition per OSD.
+
+- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
+ hitting the default maximum number of threads (e.g., usually 32k), especially
+ during recovery. You can use ``sysctl`` to check whether raising the maximum
+ number of threads to the largest allowed value (i.e., 4194303) helps. For
+ example::
+
+ sysctl -w kernel.pid_max=4194303
+
+ If increasing the maximum thread count resolves the issue, you can make it
+ permanent by including a ``kernel.pid_max`` setting in the
+ ``/etc/sysctl.conf`` file. For example::
+
+ kernel.pid_max = 4194303
+
+- **Kernel Version:** Identify the kernel version and distribution you
+ are using. Ceph uses some third party tools by default, which may be
+ buggy or may conflict with certain distributions and/or kernel
+ versions (e.g., Google perftools). Check the `OS recommendations`_
+ to ensure you have addressed any issues related to your kernel.
+
+- **Segmentation Fault:** If there is a segmentation fault, turn your logging up
+ (if it is not already), and try again. If it segfaults again,
+ contact the ceph-devel email list and provide your Ceph configuration
+ file, your monitor output and the contents of your log file(s).
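+
+ For example, one way to raise OSD logging at runtime before reproducing the
+ crash (``osd.0`` is a placeholder; see `Logging and Debugging`_ for the full
+ set of subsystems and levels)::
+
+ ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1'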
+
+
+
+An OSD Failed
+-------------
+
+When a ``ceph-osd`` process dies, the monitor will learn about the failure
+from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
+command::
+
+ ceph health
+ HEALTH_WARN 1/3 in osds are down
+
+Specifically, you will get a warning whenever there are ``ceph-osd``
+processes that are marked ``in`` and ``down``. You can identify which
+``ceph-osds`` are ``down`` with::
+
+ ceph health detail
+ HEALTH_WARN 1/3 in osds are down
+ osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+
+If there is a disk
+failure or other fault preventing ``ceph-osd`` from functioning or
+restarting, an error message should be present in its log file in
+``/var/log/ceph``.
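+
+For example, to inspect the log of the failed OSD (default log path and
+cluster name assumed; substitute the OSD id)::
+
+ less /var/log/ceph/ceph-osd.0.log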
+
+If the daemon stopped because of a heartbeat failure, the underlying
+kernel file system may be unresponsive. Check ``dmesg`` output for disk
+or other kernel errors.
+
+If the problem is a software error (failed assertion or other
+unexpected error), it should be reported to the `ceph-devel`_ email list.
+
+
+No Free Drive Space
+-------------------
+
+Ceph prevents you from writing to a full OSD so that you don't lose data.
+In an operational cluster, you should receive a warning when your cluster
+is getting near its full ratio. The ``mon osd full ratio`` defaults to
+``0.95``, or 95% of capacity, at which point it stops clients from writing
+data. The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
+capacity, at which point it blocks backfills from starting. The
+``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity, at which
+point it generates a health warning.
+
+Full cluster issues usually arise when testing how Ceph handles an OSD
+failure on a small cluster. When one node has a high percentage of the
+cluster's data, the cluster can easily eclipse its nearfull and full ratio
+immediately. If you are testing how Ceph reacts to OSD failures on a small
+cluster, you should leave ample free disk space and consider temporarily
+lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
+``mon osd nearfull ratio``.
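+
+To see per-OSD utilization, and, on recent releases, to adjust the ratios at
+runtime on a test cluster, commands along these lines can be used (the
+``set-*-ratio`` commands are not available on older releases)::
+
+ ceph osd df
+ ceph osd set-nearfull-ratio 0.75
+ ceph osd set-backfillfull-ratio 0.80
+ ceph osd set-full-ratio 0.85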
+
+Full ``ceph-osds`` will be reported by ``ceph health``::
+
+ ceph health
+ HEALTH_WARN 1 nearfull osd(s)
+
+Or::
+
+ ceph health detail
+ HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
+ osd.3 is full at 97%
+ osd.4 is backfill full at 91%
+ osd.2 is near full at 87%
+
+The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
+the cluster to redistribute data to the newly available storage.
+
+If you cannot start an OSD because it is full, you may reclaim some space by
+deleting a few placement group directories in the full OSD.
+
+.. important:: If you choose to delete a placement group directory on a full OSD,
+ **DO NOT** delete the same placement group directory on another full OSD, or
+ **YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
+ at least one OSD.
+
+See `Monitor Config Reference`_ for additional details.
+
+
+OSDs are Slow/Unresponsive
+==========================
+
+A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
+have eliminated other troubleshooting possibilities before delving into OSD
+performance issues. For example, ensure that your network(s) is working properly
+and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
+
+.. tip:: Newer versions of Ceph provide better recovery handling by preventing
+ recovering OSDs from using up system resources to the point that ``up`` and
+ ``in`` OSDs become unavailable or respond slowly.
+
+
+Networking Issues
+-----------------
+
+Ceph is a distributed storage system, so it depends upon networks to peer with
+OSDs, replicate objects, recover from faults and check heartbeats. Networking
+issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
+details.
+
+Ensure that Ceph processes and Ceph-dependent processes are connected and/or
+listening. ::
+
+ netstat -a | grep ceph
+ netstat -l | grep ceph
+ sudo netstat -p | grep ceph
+
+Check network statistics. ::
+
+ netstat -s
+
+
+Drive Configuration
+-------------------
+
+A storage drive should only support one OSD. Sequential read and sequential
+write throughput can bottleneck if other processes share the drive, including
+journals, operating systems, monitors, other OSDs and non-Ceph processes.
+
+Ceph acknowledges writes *after* journaling, so fast SSDs are an
+attractive option to accelerate the response time--particularly when
+using the ``XFS`` or ``ext4`` filesystems. By contrast, the ``btrfs``
+filesystem can write and journal simultaneously. (Note, however, that
+we recommend against using ``btrfs`` for production deployments.)
+
+.. note:: Partitioning a drive does not change its total throughput or
+ sequential read/write limits. Running a journal in a separate partition
+ may help, but you should prefer a separate physical drive.
+
+
+Bad Sectors / Fragmented Disk
+-----------------------------
+
+Check your disks for bad sectors and fragmentation. This can cause total throughput
+to drop substantially.
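+
+Tools such as ``smartctl`` (from the smartmontools package) can help with this
+check. For example (replace ``/dev/sdX`` with the drive backing the OSD)::
+
+ sudo smartctl -a /dev/sdX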
+
+
+Co-resident Monitors/OSDs
+-------------------------
+
+Monitors are generally light-weight processes, but they do lots of ``fsync()``,
+which can interfere with other workloads, particularly if monitors run on the
+same drive as your OSDs. Additionally, if you run monitors on the same host as
+the OSDs, you may incur performance issues related to:
+
+- Running an older kernel (pre-3.0)
+- Running Argonaut with an old ``glibc``
+- Running a kernel with no ``syncfs(2)`` syscall
+
+In these cases, multiple OSDs running on the same host can drag each other down
+by doing lots of commits, which often leads to bursty writes.
+
+
+Co-resident Processes
+---------------------
+
+Spinning up co-resident processes such as a cloud-based solution, virtual
+machines and other applications that write data to Ceph while operating on the
+same hardware as OSDs can introduce significant OSD latency. Generally, we
+recommend optimizing a host for use with Ceph and using other hosts for other
+processes. The practice of separating Ceph operations from other applications
+may help improve performance and may streamline troubleshooting and maintenance.
+
+
+Logging Levels
+--------------
+
+If you turned logging levels up to track an issue and then forgot to turn
+them back down, the OSD may write a large volume of log data to disk. If you
+intend to keep logging levels high, you may consider mounting a dedicated
+drive at the default logging location (i.e., ``/var/log/ceph/$cluster-$name.log``).
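+
+To see how much space the logs are consuming and which debug settings are
+currently in effect (``osd.0`` is a placeholder)::
+
+ du -sh /var/log/ceph
+ ceph daemon osd.0 config show | grep debug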
+
+
+Recovery Throttling
+-------------------
+
+Depending upon your configuration, Ceph may reduce recovery rates to maintain
+performance or it may increase recovery rates to the point that recovery
+impacts OSD performance. Check to see if the OSD is recovering.
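+
+For example, ``ceph -s`` (or ``ceph -w``, to watch continuously) reports
+ongoing recovery, and the admin socket shows the current throttle settings
+(``osd.0`` is a placeholder)::
+
+ ceph -s
+ ceph daemon osd.0 config get osd_recovery_max_active
+ ceph daemon osd.0 config get osd_max_backfills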
+
+
+Kernel Version
+--------------
+
+Check the kernel version you are running. Older kernels may not receive
+new backports that Ceph depends upon for better performance.
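+
+For example::
+
+ uname -r
+ cat /etc/os-release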
+
+
+Kernel Issues with SyncFS
+-------------------------
+
+Try running one OSD per host to see if performance improves. Old kernels
+might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
+
+
+Filesystem Issues
+-----------------
+
+Currently, we recommend deploying clusters with XFS.
+
+We recommend against using btrfs or ext4. The btrfs filesystem has
+many attractive features, but bugs in the filesystem may lead to
+performance issues and spurious ENOSPC errors. We do not recommend
+ext4 because xattr size limitations break our support for long object
+names (needed for RGW).
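+
+To confirm which filesystem backs each OSD's data directory (default data path
+assumed)::
+
+ df -T /var/lib/ceph/osd/*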
+
+For more information, see `Filesystem Recommendations`_.
+
+.. _Filesystem Recommendations: ../configuration/filesystem-recommendations
+
+
+Insufficient RAM
+----------------
+
+We recommend 1GB of RAM per OSD daemon. You may notice that during normal
+operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
+Unused RAM makes it tempting to use the excess RAM for co-resident applications,
+VMs and so forth. However, when OSDs go into recovery mode, their memory
+utilization spikes. If there is no RAM available, the OSD performance will slow
+considerably.
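+
+To check memory headroom on the host and per-daemon usage, basic tools
+suffice. For example::
+
+ free -h
+ ps -o pid,rss,cmd -C ceph-osd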
+
+
+Old Requests or Slow Requests
+-----------------------------
+
+If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
+complaining about requests that are taking too long. The warning threshold
+defaults to 30 seconds, and is configurable via the ``osd op complaint time``
+option. When this happens, the cluster log will receive messages.
+
+Legacy versions of Ceph complain about ``old requests``::
+
+ osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
+
+New versions of Ceph complain about ``slow requests``::
+
+ {date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
+ {date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
+
+
+Possible causes include:
+
+- A bad drive (check ``dmesg`` output)
+- A bug in the kernel file system (check ``dmesg`` output)
+- An overloaded cluster (check system load, iostat, etc.)
+- A bug in the ``ceph-osd`` daemon
+
+Possible solutions:
+
+- Remove VMs or other cloud workloads from Ceph hosts
+- Upgrade the kernel
+- Upgrade Ceph
+- Restart OSDs
+
+Debugging Slow Requests
+-----------------------
+
+If you run ``ceph daemon osd.<id> dump_historic_ops`` or
+``ceph daemon osd.<id> dump_ops_in_flight``, you will see a set of operations
+and a list of events each operation went through. These are briefly described
+below.
+
+Events from the Messenger layer:
+
+- header_read: when the messenger first started reading the message off the wire
+- throttled: when the messenger tried to acquire memory throttle space to read
+ the message into memory
+- all_read: when the messenger finished reading the message off the wire
+- dispatched: when the messenger gave the message to the OSD
+- initiated: identical to header_read. The existence of both is a
+ historical oddity.
+
+Events from the OSD as it prepares operations:
+
+- queued_for_pg: the op has been put into the queue for processing by its PG
+- reached_pg: the PG has started doing the op
+- waiting for \*: the op is waiting for some other work to complete before it
+ can proceed (a new OSDMap; for its object target to scrub; for the PG to
+ finish peering; all as specified in the message)
+- started: the op has been accepted as something the OSD should actually do
+ (reasons not to do it: failed security/permission checks; out-of-date local
+ state; etc) and is now actually being performed
+- waiting for subops from: the op has been sent to replica OSDs
+
+Events from the FileStore:
+
+- commit_queued_for_journal_write: the op has been given to the FileStore
+- write_thread_in_journal_buffer: the op is in the journal's buffer and waiting
+ to be persisted (as the next disk write)
+- journaled_completion_queued: the op was journaled to disk and its callback
+ queued for invocation
+
+Events from the OSD after the op has been given to the local disk:
+
+- op_commit: the op has been committed (ie, written to journal) by the
+ primary OSD
+- op_applied: The op has been write()'en to the backing FS (ie, applied in
+ memory but not flushed out to disk) on the primary
+- sub_op_applied: op_applied, but for a replica's "subop"
+- sub_op_committed: op_commit, but for a replica's subop (only for EC pools)
+- sub_op_commit_rec/sub_op_apply_rec from <X>: the primary marks this when it
+ hears about the above, but for a particular replica <X>
+- commit_sent: we sent a reply back to the client (or primary OSD, for sub ops)
+
+Many of these events are seemingly redundant, but cross important boundaries in
+the internal code (such as passing data across locks into new threads).
+
+Flapping OSDs
+=============
+
+We recommend using both a public (front-end) network and a cluster (back-end)
+network so that you can better meet the capacity requirements of object
+replication. Another advantage is that you can run a cluster network such that
+it is not connected to the internet, thereby preventing some denial of service
+attacks. When OSDs peer and check heartbeats, they use the cluster (back-end)
+network when it's available. See `Monitor/OSD Interaction`_ for details.
+
+However, if the cluster (back-end) network fails or develops significant latency
+while the public (front-end) network operates optimally, OSDs currently do not
+handle this situation well. What happens is that OSDs mark each other ``down``
+on the monitor, while marking themselves ``up``. We call this scenario
+``flapping``.
+
+If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and
+then ``up`` again), you can force the monitors to stop the flapping with::
+
+ ceph osd set noup # prevent OSDs from getting marked up
+ ceph osd set nodown # prevent OSDs from getting marked down
+
+These flags are recorded in the osdmap structure::
+
+ ceph osd dump | grep flags
+ flags no-up,no-down
+
+You can clear the flags with::
+
+ ceph osd unset noup
+ ceph osd unset nodown
+
+Two other flags are supported, ``noin`` and ``noout``, which prevent
+booting OSDs from being marked ``in`` (allocated data) or protect OSDs
+from eventually being marked ``out`` (regardless of what the current value for
+``mon osd down out interval`` is).
+
+.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
+ sense that once the flags are cleared, the action they were blocking
+ should occur shortly after. The ``noin`` flag, on the other hand,
+ prevents OSDs from being marked ``in`` on boot, and any daemons that
+ started while the flag was set will remain that way.
+
+
+
+
+
+
+.. _iostat: http://en.wikipedia.org/wiki/Iostat
+.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
+.. _Logging and Debugging: ../log-and-debug
+.. _Debugging and Logging: ../debug
+.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
+.. _Monitor Config Reference: ../../configuration/mon-config-ref
+.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
+.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
+.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
+.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
+.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
+.. _OS recommendations: ../../../start/os-recommendations
+.. _ceph-devel: ceph-devel@vger.kernel.org