From 7da45d65be36d36b880cc55c5036e96c24b53f00 Mon Sep 17 00:00:00 2001
From: Qiaowei Ren
Date: Thu, 1 Mar 2018 14:38:11 +0800
Subject: remove ceph code

This patch removes the initial Ceph code due to a license issue.

Change-Id: I092d44f601cdf34aed92300fe13214925563081c
Signed-off-by: Qiaowei Ren
---
 .../doc/rados/operations/monitoring-osd-pg.rst | 617 ---------------------
 1 file changed, 617 deletions(-)
 delete mode 100644 src/ceph/doc/rados/operations/monitoring-osd-pg.rst

diff --git a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst b/src/ceph/doc/rados/operations/monitoring-osd-pg.rst
deleted file mode 100644
index 0107e34..0000000
--- a/src/ceph/doc/rados/operations/monitoring-osd-pg.rst
+++ /dev/null
@@ -1,617 +0,0 @@

=========================
 Monitoring OSDs and PGs
=========================

High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point of failure, and
it can service requests for data in a "degraded" mode. Ceph's `data placement`_
introduces a layer of indirection to ensure that data doesn't bind directly to
particular OSD addresses. This means that tracking down system faults requires
finding the `placement group`_ and the underlying OSDs at the root of the
problem.

.. tip:: A fault in one part of the cluster may prevent you from accessing a
         particular object, but that doesn't mean that you cannot access other
         objects. When you run into a fault, don't panic. Just follow the steps
         for monitoring your OSDs and placement groups. Then, begin
         troubleshooting.

Ceph is generally self-repairing. However, when problems persist, monitoring
OSDs and placement groups will help you identify the problem.


Monitoring OSDs
===============

An OSD's status is either in the cluster (``in``) or out of the cluster
(``out``); and, it is either up and running (``up``), or it is down and not
running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
(where you can read and write data) or ``out`` of the cluster. If it was
``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
its placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH
will not assign placement groups to it. If an OSD is ``down``, it should also
be ``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
          will not be in a healthy state.

.. ditaa:: +----------------+        +----------------+
           |                |        |                |
           |   OSD #n In    |        |   OSD #n Up    |
           |                |        |                |
           +----------------+        +----------------+
                    ^                         ^
                    |                         |
                    |                         |
                    v                         v
           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n Out   |        |   OSD #n Down  |
           |                |        |                |
           +----------------+        +----------------+

If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH OK``. Don't
panic. With respect to OSDs, you should expect that the cluster will **NOT**
echo ``HEALTH OK`` in a few expected circumstances (a quick way to check is
sketched after the list):

#. You haven't started the cluster yet (it won't respond).
#. You have just started or restarted the cluster and it's not ready yet,
   because the placement groups are getting created and the OSDs are in
   the process of peering.
#. You just added or removed an OSD.
#. You have just modified your cluster map.
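
In any of the circumstances above, a short command sequence is usually enough
to tell whether the cluster is simply converging or genuinely unhealthy. The
following is a minimal sketch, assuming a reachable monitor quorum and an
admin keyring on the node where you run the commands::

   # One-line summary; HEALTH_WARN during peering or backfill is often transient.
   ceph health

   # Expanded explanation of each warning or error currently raised.
   ceph health detail

   # Cluster-wide snapshot: monitor quorum, OSD up/in counts, PG states.
   ceph -s

   # Stream status updates; useful while OSDs are being added or removed.
   ceph -w

If the same warnings persist across several minutes of ``ceph -w`` output, move
on to the OSD and placement group checks described in the following sections.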

An important aspect of monitoring OSDs is to ensure that, when the cluster is
up and running, all OSDs that are ``in`` the cluster are also ``up`` and
running. To see if all OSDs are running, execute::

   ceph osd stat

The result should tell you the map epoch (eNNNN), the total number of OSDs (x),
how many are ``up`` (y) and how many are ``in`` (z). ::

   eNNNN: x osds: y up, z in

If the number of OSDs that are ``in`` the cluster is more than the number of
OSDs that are ``up``, execute the following command to identify the ``ceph-osd``
daemons that are not running::

   ceph osd tree

::

   dumped osdmap tree epoch 1
   # id    weight  type name       up/down reweight
   -1      2       pool openstack
   -3      2               rack dell-2950-rack-A
   -2      2                       host dell-2950-A1
   0       1                               osd.0   up      1
   1       1                               osd.1   down    1


.. tip:: The ability to search through a well-designed CRUSH hierarchy may help
         you troubleshoot your cluster by letting you identify the physical
         locations of problem OSDs faster.

If an OSD is ``down``, start it::

   sudo systemctl start ceph-osd@1

See `OSD Not Running`_ for problems associated with OSDs that have stopped or
won't restart.


PG Sets
=======

When CRUSH assigns placement groups to OSDs, it looks at the number of replicas
for the pool and assigns the placement group to OSDs such that each replica of
the placement group gets assigned to a different OSD. For example, if the pool
requires three replicas of a placement group, CRUSH may assign them to
``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a
pseudo-random placement that takes into account the failure domains you set in
your `CRUSH map`_, so you will rarely see placement groups assigned to
nearest-neighbor OSDs in a large cluster. We refer to the set of OSDs that
should contain the replicas of a particular placement group as the **Acting
Set**. In some cases, an OSD in the Acting Set is ``down`` or otherwise not
able to service requests for objects in the placement group. When these
situations arise, don't panic. Common examples include:

- You added or removed an OSD. Then, CRUSH reassigned the placement group to
  other OSDs--thereby changing the composition of the Acting Set and spawning
  the migration of data with a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
  and another OSD has temporarily assumed its duties.

Ceph processes a client request using the **Up Set**, which is the set of OSDs
that will actually handle the requests. In most cases, the Up Set and the Acting
Set are virtually identical. When they are not, it may indicate that Ceph is
migrating data, an OSD is recovering, or that there is a problem (i.e., Ceph
usually echoes a "HEALTH WARN" state with a "stuck stale" message in such
scenarios).

To retrieve a list of placement groups, execute::

   ceph pg dump

To view which OSDs are within the Acting Set or the Up Set for a given placement
group, execute::

   ceph pg map {pg-num}

The result should tell you the osdmap epoch (eNNN), the placement group number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set
(acting[]). ::

   osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2]

.. note:: If the Up Set and Acting Set do not match, this may be an indicator
          that the cluster is rebalancing itself or that there is a potential
          problem with the cluster. A sketch for spotting such mismatches
          across all placement groups follows below.
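
``ceph pg map`` inspects one placement group at a time. To scan the whole
cluster for placement groups whose Up Set and Acting Set differ, you can
post-process the output of ``ceph pg dump``. The following is a rough sketch,
not an official tool; it assumes the ``pgs_brief`` dump format, whose column
order (PG_STAT, STATE, UP, UP_PRIMARY, ACTING, ACTING_PRIMARY) may vary between
Ceph releases, so verify the header line on your version first::

   # Print the column layout so the awk field numbers below can be checked.
   ceph pg dump pgs_brief | head -1

   # List placement groups whose up set (column 3) differs from the
   # acting set (column 5), skipping the header line.
   ceph pg dump pgs_brief | awk 'NR > 1 && $3 != $5 {print $1, $2, $3, $5}'

An empty result means every placement group is being served by the OSDs that
CRUSH currently maps it to; any lines printed are good candidates for a closer
look with ``ceph pg {pg-num} query``.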


Peering
=======

Before you can write data to a placement group, it must be in an ``active``
state, and it **should** be in a ``clean`` state. For Ceph to determine the
current state of a placement group, the primary OSD of the placement group
(i.e., the first OSD in the Acting Set) peers with the secondary and tertiary
OSDs to establish agreement on the current state of the placement group
(assuming a pool with 3 replicas of the PG).


.. ditaa:: +---------+     +---------+     +-------+
           |  OSD 1  |     |  OSD 2  |     | OSD 3 |
           +---------+     +---------+     +-------+
                |               |              |
                |  Request To   |              |
                |     Peer      |              |
                |-------------->|              |
                |<--------------|              |
                |    Peering                   |
                |                              |
                |         Request To           |
                |            Peer              |
                |----------------------------->|
                |<-----------------------------|
                |           Peering            |

The OSDs also report their status to the monitor. See `Configuring Monitor/OSD
Interaction`_ for details. To troubleshoot peering issues, see `Peering
Failure`_.


Monitoring Placement Group States
=================================

If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH OK``. After
you check to see if the OSDs are running, you should also check placement group
states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a
number of placement group peering-related circumstances:

#. You have just created a pool and placement groups haven't peered yet.
#. The placement groups are recovering.
#. You have just added an OSD to or removed an OSD from the cluster.
#. You have just modified your CRUSH map and your placement groups are migrating.
#. There is inconsistent data in different replicas of a placement group.
#. Ceph is scrubbing a placement group's replicas.
#. Ceph doesn't have enough storage capacity to complete backfilling operations.

If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, you
may need to take action. An important aspect of monitoring placement groups is
to ensure that, when the cluster is up and running, all placement groups are
``active``, and preferably in the ``clean`` state. To see the status of all
placement groups, execute::

   ceph pg stat

The result should tell you the placement group map version (vNNNNNN), the total
number of placement groups (x), and how many placement groups are in a
particular state such as ``active+clean`` (y). ::

   vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail

.. note:: It is common for Ceph to report multiple states for placement groups.

In addition to the placement group states, Ceph will also echo back the amount
of data stored (z), the amount of storage capacity used (aa), the amount of
storage capacity remaining (bb), and the total storage capacity (cc) of the
placement groups. These numbers can be important in a few cases (a quick way to
check them is sketched below):

- You are reaching your ``near full ratio`` or ``full ratio``.
- Your data is not getting distributed across the cluster due to an
  error in your CRUSH configuration.
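
When capacity is the concern, the placement group summary alone does not show
*where* the space is being consumed. A minimal sketch for checking overall
usage and per-OSD distribution, assuming only the standard CLI tools shipped
with Ceph::

   # Overall and per-pool usage; compare %USED against your near full ratio.
   ceph df

   # Per-OSD utilization and variance; a badly skewed %USE column often
   # points at a CRUSH configuration or weighting problem.
   ceph osd df

   # On recent releases the configured full thresholds are recorded in the
   # OSD map and can be read back from it.
   ceph osd dump | grep -i ratio

If one OSD is far above the others in the ``ceph osd df`` output while the
cluster as a whole has plenty of space, review your `CRUSH map`_ and OSD
weights before assuming you are out of capacity.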

.. topic:: Placement Group IDs

   Placement group IDs consist of the pool number (not the pool name) followed
   by a period (.) and the placement group ID--a hexadecimal number. You can
   view pool numbers and their names from the output of ``ceph osd lspools``.
   For example, the default pool ``rbd`` corresponds to pool number ``0``. A
   fully qualified placement group ID has the following form::

      {pool-num}.{pg-id}

   And it typically looks like this::

      0.1f


To retrieve a list of placement groups, execute the following::

   ceph pg dump

You can also request the output in JSON format and save it to a file::

   ceph pg dump -o {filename} --format=json

To query a particular placement group, execute the following::

   ceph pg {poolnum}.{pg-id} query

Ceph will output the query in JSON format.

.. code-block:: javascript

   {
     "state": "active+clean",
     "up": [
       1,
       0
     ],
     "acting": [
       1,
       0
     ],
     "info": {
       "pgid": "1.e",
       "last_update": "4'1",
       "last_complete": "4'1",
       "log_tail": "0'0",
       "last_backfill": "MAX",
       "purged_snaps": "[]",
       "history": {
         "epoch_created": 1,
         "last_epoch_started": 537,
         "last_epoch_clean": 537,
         "last_epoch_split": 534,
         "same_up_since": 536,
         "same_interval_since": 536,
         "same_primary_since": 536,
         "last_scrub": "4'1",
         "last_scrub_stamp": "2013-01-25 10:12:23.828174"
       },
       "stats": {
         "version": "4'1",
         "reported": "536'782",
         "state": "active+clean",
         "last_fresh": "2013-01-25 10:12:23.828271",
         "last_change": "2013-01-25 10:12:23.828271",
         "last_active": "2013-01-25 10:12:23.828271",
         "last_clean": "2013-01-25 10:12:23.828271",
         "last_unstale": "2013-01-25 10:12:23.828271",
         "mapping_epoch": 535,
         "log_start": "0'0",
         "ondisk_log_start": "0'0",
         "created": 1,
         "last_epoch_clean": 1,
         "parent": "0.0",
         "parent_split_bits": 0,
         "last_scrub": "4'1",
         "last_scrub_stamp": "2013-01-25 10:12:23.828174",
         "log_size": 128,
         "ondisk_log_size": 128,
         "stat_sum": {
           "num_bytes": 205,
           "num_objects": 1,
           "num_object_clones": 0,
           "num_object_copies": 0,
           "num_objects_missing_on_primary": 0,
           "num_objects_degraded": 0,
           "num_objects_unfound": 0,
           "num_read": 1,
           "num_read_kb": 0,
           "num_write": 3,
           "num_write_kb": 1
         },
         "stat_cat_sum": {

         },
         "up": [
           1,
           0
         ],
         "acting": [
           1,
           0
         ]
       },
       "empty": 0,
       "dne": 0,
       "incomplete": 0
     },
     "recovery_state": [
       {
         "name": "Started\/Primary\/Active",
         "enter_time": "2013-01-23 09:35:37.594691",
         "might_have_unfound": [

         ],
         "scrub": {
           "scrub_epoch_start": "536",
           "scrub_active": 0,
           "scrub_block_writes": 0,
           "finalizing_scrub": 0,
           "scrub_waiting_on": 0,
           "scrub_waiting_on_whom": [

           ]
         }
       },
       {
         "name": "Started",
         "enter_time": "2013-01-23 09:35:31.581160"
       }
     ]
   }


The following subsections describe common states in greater detail.

Creating
--------

When you create a pool, Ceph will create the number of placement groups you
specified. Ceph will echo ``creating`` while it is creating one or more
placement groups. Once they are created, the OSDs that are part of a placement
group's Acting Set will peer. Once peering is complete, the placement group
status should be ``active+clean``, which means a Ceph client can begin writing
to the placement group.

.. ditaa::

       /-----------\       /-----------\       /-----------\
       | Creating  |------>|  Peering  |------>|  Active   |
       \-----------/       \-----------/       \-----------/
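
To watch this sequence on a live system, one approach (a sketch only; the pool
name ``pgtest`` and the placement group count of ``8`` are arbitrary choices
for a throwaway test pool) is to create a small pool and poll its placement
group states until they settle::

   # Create a small throwaway pool; its PGs start out in the "creating" state.
   ceph osd pool create pgtest 8

   # Summary of PG states; repeat until the new PGs show up as active+clean.
   ceph pg stat

   # Any PGs that have not yet reached an active state.
   ceph pg dump_stuck inactive

   # Confirm the state of one specific PG (substitute a real {pool-num}.{pg-id}).
   ceph pg {pool-num}.{pg-id} query | grep '"state"'

   # Clean up. Note that pool deletion may also require the monitor setting
   # mon_allow_pool_delete to be true on recent releases.
   ceph osd pool delete pgtest pgtest --yes-i-really-really-mean-it

On a healthy cluster the whole sequence normally settles within seconds, which
is why the ``creating`` and ``peering`` states are rarely visible for long.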

Peering
-------

When Ceph is peering a placement group, Ceph is bringing the OSDs that
store the replicas of the placement group into **agreement about the state**
of the objects and metadata in the placement group. When Ceph completes peering,
this means that the OSDs that store the placement group agree about the current
state of the placement group. However, completion of the peering process does
**NOT** mean that each replica has the latest contents.

.. topic:: Authoritative History

   Ceph will **NOT** acknowledge a write operation to a client until all OSDs
   of the acting set persist the write operation. This practice ensures that at
   least one member of the acting set will have a record of every acknowledged
   write operation since the last successful peering operation.

   With an accurate record of each acknowledged write operation, Ceph can
   construct and disseminate a new authoritative history of the placement
   group--a complete and fully ordered set of operations that, if performed,
   would bring an OSD's copy of a placement group up to date.


Active
------

Once Ceph completes the peering process, a placement group may become
``active``. The ``active`` state means that the data in the placement group is
generally available for read and write operations in the primary placement
group and the replicas.


Clean
-----

When a placement group is in the ``clean`` state, the primary OSD and the
replica OSDs have successfully peered and there are no stray replicas for the
placement group. Ceph has replicated all objects in the placement group the
correct number of times.


Degraded
--------

When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the placement group will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.

The reason a placement group can be ``active+degraded`` is that an OSD may be
``active`` even though it doesn't hold all of the objects yet. If an OSD goes
``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
The OSDs must peer again when the OSD comes back online. However, a client can
still write a new object to a ``degraded`` placement group if it is ``active``.

If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
to another OSD. The time between being marked ``down`` and being marked ``out``
is controlled by ``mon osd down out interval``, which is set to ``600`` seconds
by default.

A placement group can also be ``degraded`` because Ceph cannot find one or more
objects that Ceph thinks should be in the placement group. While you cannot
read or write to unfound objects, you can still access all of the other objects
in the ``degraded`` placement group.


Recovering
----------

Ceph was designed for fault-tolerance at a scale where hardware and software
problems are ongoing. When an OSD goes ``down``, its contents may fall behind
the current state of the other replicas in the placement groups. When the OSD
is back ``up``, the contents of the placement groups must be updated to reflect
the current state. During that time period, the OSD may reflect a
``recovering`` state.

Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet may fail, which can cause the OSDs of a number of host machines to fall
behind the current state of the cluster. Each one of the OSDs must recover once
the fault is resolved.
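
While recovery is in progress, the health summary shows which placement groups
are still catching up. A brief sketch, using only standard CLI commands and a
placeholder placement group ID::

   # Per-PG detail for anything degraded, recovering, or otherwise unhealthy.
   ceph health detail

   # Placement groups stuck in a degraded state longer than the default threshold.
   ceph pg dump_stuck degraded

   # For a specific placement group, list the objects Ceph knows it is missing
   # (substitute a real {pool-num}.{pg-id} from the output above).
   ceph pg {pool-num}.{pg-id} list_missing

   # Live view of client and recovery I/O as the cluster catches up.
   ceph -w

Degraded object counts should fall steadily in the ``ceph -w`` output; if they
stall, or if unfound objects are reported, see `Troubleshooting PG Errors`_.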

Ceph provides a number of settings to balance the resource contention between
new service requests and the need to recover data objects and restore the
placement groups to the current state. The ``osd recovery delay start`` setting
allows an OSD to restart, re-peer and even process some replay requests before
starting the recovery process. The ``osd recovery thread timeout`` setting sets
a thread timeout, because multiple OSDs may fail, restart and re-peer at
staggered rates. The ``osd recovery max active`` setting limits the number of
recovery requests an OSD will entertain simultaneously, to prevent the OSD from
failing to serve regular requests. The ``osd recovery max chunk`` setting
limits the size of the recovered data chunks to prevent network congestion.


Back Filling
------------

When a new OSD joins the cluster, CRUSH will reassign placement groups from
OSDs in the cluster to the newly added OSD. Forcing the new OSD to accept the
reassigned placement groups immediately can put excessive load on the new OSD.
Backfilling the OSD with the placement groups allows this process to begin in
the background. Once backfilling is complete, the new OSD will begin serving
requests when it is ready.

During the backfill operations, you may see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
underway yet; ``backfill`` indicates that a backfill operation is underway;
and ``backfill_too_full`` indicates that a backfill operation was requested,
but couldn't be completed due to insufficient storage capacity. When a
placement group cannot be backfilled, it may be considered ``incomplete``.

Ceph provides a number of settings to manage the load spike associated with
reassigning placement groups to an OSD (especially a new OSD); a sketch for
inspecting and adjusting these settings at runtime follows below. By default,
``osd_max_backfills`` sets the maximum number of concurrent backfills to or
from an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (90%, by default);
this ratio can be changed with the ``ceph osd set-backfillfull-ratio`` command.
If an OSD refuses a backfill request, the ``osd backfill retry interval``
setting enables the OSD to retry the request (after 10 seconds, by default).
OSDs can also set ``osd backfill scan min`` and ``osd backfill scan max`` to
manage scan intervals (64 and 512, by default).
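
To see what these throttles are set to on a running OSD, and to change them
temporarily while a large backfill is in flight, one possible approach (a
sketch only; ``osd.0`` is an arbitrary example daemon, the values shown are
illustrative, and ``injectargs`` changes are not persisted across daemon
restarts) is::

   # Inspect the current values via the admin socket; run these on the host
   # where the daemon lives.
   ceph daemon osd.0 config get osd_max_backfills
   ceph daemon osd.0 config get osd_recovery_max_active

   # Temporarily throttle backfill and recovery on all OSDs, e.g. to protect
   # client I/O while a failed host is being re-added.
   ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

   # Raise the backfillfull threshold slightly if backfills are being refused
   # and you are certain the extra headroom exists.
   ceph osd set-backfillfull-ratio 0.92

To make such changes permanent, set the corresponding options in your Ceph
configuration file (or your release's central configuration mechanism) instead.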


Remapped
--------

When the Acting Set that services a placement group changes, the data migrates
from the old Acting Set to the new Acting Set. It may take some time for the
new primary OSD to be able to service requests, so it may ask the old primary
to continue to service requests until the placement group migration is
complete. Once data migration completes, the mapping uses the primary OSD of
the new Acting Set.


Stale
-----

While Ceph uses heartbeats to ensure that hosts and daemons are running, the
``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
reporting statistics in a timely manner (e.g., a temporary network fault). By
default, OSD daemons report their placement group, up thru, boot and failure
statistics every half second (i.e., ``0.5``), which is more frequent than the
heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
fails to report to the monitor, or if other OSDs have reported the primary OSD
``down``, the monitors will mark the placement group ``stale``.

When you start your cluster, it is common to see the ``stale`` state until
the peering process completes. After your cluster has been running for a while,
seeing placement groups in the ``stale`` state indicates that the primary OSD
for those placement groups is ``down`` or not reporting placement group
statistics to the monitor.


Identifying Troubled PGs
========================

As previously noted, a placement group is not necessarily problematic just
because its state is not ``active+clean``. Generally, Ceph's ability to
self-repair may not be working when placement groups get stuck. The stuck
states include:

- **Unclean**: Placement groups contain objects that are not replicated the
  desired number of times. They should be recovering.
- **Inactive**: Placement groups cannot process reads or writes because they
  are waiting for an OSD with the most up-to-date data to come back ``up``.
- **Stale**: Placement groups are in an unknown state, because the OSDs that
  host them have not reported to the monitor cluster in a while (configured
  by ``mon osd report timeout``).

To identify stuck placement groups, execute the following::

   ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

See `Placement Group Subsystem`_ for additional details. To troubleshoot
stuck placement groups, see `Troubleshooting PG Errors`_.


Finding an Object Location
==========================

To store object data in the Ceph Object Store, a Ceph client must:

#. Set an object name
#. Specify a `pool`_

The Ceph client retrieves the latest cluster map, and the CRUSH algorithm
calculates how to map the object to a `placement group`_ and then calculates
how to assign the placement group to an OSD dynamically. To find the object
location, all you need is the object name and the pool name. For example::

   ceph osd map {poolname} {object-name}

.. topic:: Exercise: Locate an Object

   As an exercise, let's create an object. Specify an object name, a path to
   a test file containing some object data and a pool name using the
   ``rados put`` command on the command line. For example::

      rados put {object-name} {file-path} --pool=data
      rados put test-object-1 testfile.txt --pool=data

   To verify that the Ceph Object Store stored the object, execute the
   following::

      rados -p data ls

   Now, identify the object location::

      ceph osd map {pool-name} {object-name}
      ceph osd map data test-object-1

   Ceph should output the object's location. For example::

      osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]

   To remove the test object, simply delete it using the ``rados rm`` command.
   For example::

      rados rm test-object-1 --pool=data


As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
the migration manually. See the `Architecture`_ section for details.

.. _data placement: ../data-placement
.. _pool: ../pools
.. _placement group: ../placement-groups
.. _Architecture: ../../../architecture
.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running
.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors
.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering
.. _CRUSH map: ../crush-map
.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/
.. _Placement Group Subsystem: ../control#placement-group-subsystem