From 7da45d65be36d36b880cc55c5036e96c24b53f00 Mon Sep 17 00:00:00 2001
From: Qiaowei Ren
Date: Thu, 1 Mar 2018 14:38:11 +0800
Subject: remove ceph code

This patch removes initial ceph code, due to a license issue.

Change-Id: I092d44f601cdf34aed92300fe13214925563081c
Signed-off-by: Qiaowei Ren
---
 .../rados/troubleshooting/troubleshooting-pg.rst | 668 ---------------------
 1 file changed, 668 deletions(-)
 delete mode 100644 src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst

diff --git a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst b/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst
deleted file mode 100644
index 4241fee..0000000
--- a/src/ceph/doc/rados/troubleshooting/troubleshooting-pg.rst
+++ /dev/null
@@ -1,668 +0,0 @@
-=====================
- Troubleshooting PGs
-=====================
-
-Placement Groups Never Get Clean
-================================
-
-When you create a cluster and your cluster remains in ``active``,
-``active+remapped`` or ``active+degraded`` status and never achieves an
-``active+clean`` status, you likely have a problem with your configuration.
-
-You may need to review settings in the `Pool, PG and CRUSH Config Reference`_
-and make appropriate adjustments.
-
-As a general rule, you should run your cluster with more than one OSD and a
-pool size greater than ``1`` (i.e., more than one object replica).
-
-One Node Cluster
-----------------
-
-Ceph no longer provides documentation for operating on a single node, because
-you would never deploy a system designed for distributed computing on a single
-node. Additionally, mounting client kernel modules on a single node containing a
-Ceph daemon may cause a deadlock due to issues with the Linux kernel itself
-(unless you use VMs for the clients). You can experiment with Ceph in a 1-node
-configuration, despite the limitations described herein.
-
-If you are trying to create a cluster on a single node, you must change the
-default of the ``osd crush chooseleaf type`` setting from ``1`` (meaning
-``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
-file before you create your monitors and OSDs. This tells Ceph that an OSD
-can peer with another OSD on the same host. If you are trying to set up a
-1-node cluster and ``osd crush chooseleaf type`` is greater than ``0``,
-Ceph will try to peer the PGs of one OSD with the PGs of another OSD on
-another node, chassis, rack, row, or even datacenter depending on the setting.
-
-.. tip:: DO NOT mount kernel clients directly on the same node as your
-   Ceph Storage Cluster, because kernel conflicts can arise. However, you
-   can mount kernel clients within virtual machines (VMs) on a single node.
-
-If you are creating OSDs using a single disk, you must first create the
-directories for the data manually. For example::
-
-    mkdir /var/local/osd0 /var/local/osd1
-    ceph-deploy osd prepare {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
-    ceph-deploy osd activate {localhost-name}:/var/local/osd0 {localhost-name}:/var/local/osd1
-
-
-Fewer OSDs than Replicas
-------------------------
-
-If you have brought up two OSDs to an ``up`` and ``in`` state, but you still
-don't see ``active+clean`` placement groups, you may have an
-``osd pool default size`` set to greater than ``2``.
-
-There are a few ways to address this situation. If you want to operate your
-cluster in an ``active+degraded`` state with two replicas, you can set the
-``osd pool default min size`` to ``2`` so that you can write objects in
-an ``active+degraded`` state. You may also set the ``osd pool default size``
-setting to ``2`` so that you only have two stored replicas (the original and
-one replica), in which case the cluster should achieve an ``active+clean``
-state.
-
-.. note:: You can make the changes at runtime. If you make the changes in
-   your Ceph configuration file, you may need to restart your cluster.
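-For example, assuming a hypothetical replicated pool named ``data``, the
-equivalent adjustments can be applied to an existing pool at runtime with::
-
-    ceph osd pool set data size 2
-    ceph osd pool set data min_size 2
-
-The pool name here is only an illustration; substitute a name reported by
-``ceph osd lspools``.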
-
-
-Pool Size = 1
--------------
-
-If you have the ``osd pool default size`` set to ``1``, you will only have
-one copy of the object. OSDs rely on other OSDs to tell them which objects
-they should have. If a first OSD has a copy of an object and there is no
-second copy, then no second OSD can tell the first OSD that it should have
-that copy. For each placement group mapped to the first OSD (see
-``ceph pg dump``), you can force the first OSD to notice the placement groups
-it needs by running::
-
-    ceph osd force-create-pg <pgid>
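-As a sketch, assuming the first OSD is ``osd.0``, you can list the placement
-groups currently mapped to it with::
-
-    ceph pg ls-by-osd osd.0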
-
-
-CRUSH Map Errors
-----------------
-
-Errors in your CRUSH map are another candidate cause of placement groups
-remaining unclean.
-
-
-Stuck Placement Groups
-======================
-
-It is normal for placement groups to enter states like "degraded" or "peering"
-following a failure. These states usually indicate normal progression through
-the failure recovery process. However, if a placement group stays in one of
-these states for a long time, this may be an indication of a larger problem.
-For this reason, the monitor will warn when placement groups get "stuck" in a
-non-optimal state. Specifically, we check for:
-
-* ``inactive`` - The placement group has not been ``active`` for too long
-  (i.e., it hasn't been able to service read/write requests).
-
-* ``unclean`` - The placement group has not been ``clean`` for too long
-  (i.e., it hasn't been able to completely recover from a previous failure).
-
-* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
-  indicating that all nodes storing this placement group may be ``down``.
-
-You can explicitly list stuck placement groups with one of::
-
-    ceph pg dump_stuck stale
-    ceph pg dump_stuck inactive
-    ceph pg dump_stuck unclean
-
-For stuck ``stale`` placement groups, it is normally a matter of getting the
-right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
-groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
-stuck ``unclean`` placement groups, there is usually something preventing
-recovery from completing, like unfound objects (see
-:ref:`failures-osd-unfound`).
-
-
-
-.. _failures-osd-peering:
-
-Placement Group Down - Peering Failure
-======================================
-
-In certain cases, the ``ceph-osd`` `Peering` process can run into
-problems, preventing a PG from becoming active and usable. For
-example, ``ceph health`` might report::
-
-    ceph health detail
-    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
-    ...
-    pg 0.5 is down+peering
-    pg 1.4 is down+peering
-    ...
-    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
-
-We can query the cluster to determine exactly why the PG is marked ``down`` with::
-
-    ceph pg 0.5 query
-
-.. code-block:: javascript
-
-   { "state": "down+peering",
-     ...
-     "recovery_state": [
-          { "name": "Started\/Primary\/Peering\/GetInfo",
-            "enter_time": "2012-03-06 14:40:16.169679",
-            "requested_info_from": []},
-          { "name": "Started\/Primary\/Peering",
-            "enter_time": "2012-03-06 14:40:16.169659",
-            "probing_osds": [
-                  0,
-                  1],
-            "blocked": "peering is blocked due to down osds",
-            "down_osds_we_would_probe": [
-                  1],
-            "peering_blocked_by": [
-                  { "osd": 1,
-                    "current_lost_at": 0,
-                    "comment": "starting or marking this osd lost may let us proceed"}]},
-          { "name": "Started",
-            "enter_time": "2012-03-06 14:40:16.169513"}
-      ]
-   }
-
-The ``recovery_state`` section tells us that peering is blocked due to
-down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start
-that ``ceph-osd`` and things will recover.
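-For example, on a host where the OSDs are managed by systemd (the exact
-service manager depends on your distribution), a minimal sketch for
-restarting the daemon for ``osd.1`` would be::
-
-    systemctl start ceph-osd@1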
-
-Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
-failure), we can tell the cluster that it is ``lost`` and to cope as
-best it can.
-
-.. important:: This is dangerous in that the cluster cannot
-   guarantee that the other copies of the data are consistent
-   and up to date.
-
-To instruct Ceph to continue anyway::
-
-    ceph osd lost 1
-
-Recovery will proceed.
-
-
-.. _failures-osd-unfound:
-
-Unfound Objects
-===============
-
-Under certain combinations of failures, Ceph may complain about
-``unfound`` objects::
-
-    ceph health detail
-    HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
-    pg 2.4 is active+degraded, 78 unfound
-
-This means that the storage cluster knows that some objects (or newer
-copies of existing objects) exist, but it hasn't found copies of them.
-Here is one example of how this might come about for a PG whose data is
-on ceph-osds 1 and 2:
-
-* 1 goes down
-* 2 handles some writes, alone
-* 1 comes up
-* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
-* Before the new objects are copied, 2 goes down.
-
-Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that
-has a copy. In this case, IO to those objects will block, and the
-cluster will hope that the failed node comes back soon; this is
-assumed to be preferable to returning an IO error to the user.
-
-First, you can identify which objects are unfound with::
-
-    ceph pg 2.4 list_missing [starting offset, in json]
-
-.. code-block:: javascript
-
-   { "offset": { "oid": "",
-         "key": "",
-         "snapid": 0,
-         "hash": 0,
-         "max": 0},
-     "num_missing": 0,
-     "num_unfound": 0,
-     "objects": [
-        { "oid": "object 1",
-          "key": "",
-          "hash": 0,
-          "max": 0 },
-        ...
-     ],
-     "more": 0}
-
-If there are too many objects to list in a single result, the ``more``
-field will be true and you can query for more. (Eventually the
-command line tool will hide this from you, but not yet.)
-
-Second, you can identify which OSDs have been probed or might contain
-data::
-
-    ceph pg 2.4 query
-
-.. code-block:: javascript
-
-   "recovery_state": [
-        { "name": "Started\/Primary\/Active",
-          "enter_time": "2012-03-06 15:15:46.713212",
-          "might_have_unfound": [
-                { "osd": 1,
-                  "status": "osd is down"}]},
-
-In this case, for example, the cluster knows that ``osd.1`` might have
-data, but it is ``down``. The full range of possible states includes:
-
-* already probed
-* querying
-* OSD is down
-* not queried (yet)
-
-Sometimes it simply takes some time for the cluster to query possible
-locations.
-
-It is possible that there are other locations where the object can
-exist that are not listed. For example, if a ceph-osd is stopped and
-taken out of the cluster, the cluster fully recovers, and due to some
-future set of failures ends up with an unfound object, it won't
-consider the long-departed ceph-osd as a potential location.
-(This scenario, however, is unlikely.)
-
-If all possible locations have been queried and objects are still
-lost, you may have to give up on the lost objects. This, again, is
-possible given unusual combinations of failures that allow the cluster
-to learn about writes that were performed before the writes themselves
-are recovered. To mark the "unfound" objects as "lost"::
-
-    ceph pg 2.5 mark_unfound_lost revert|delete
-
-The final argument specifies how the cluster should deal with
-lost objects.
-
-The "delete" option will forget about them entirely.
-
-The "revert" option (not available for erasure coded pools) will
-either roll back to a previous version of the object or (if it was a
-new object) forget about it entirely. Use this with caution, as it
-may confuse applications that expected the object to exist.
-
-
-Homeless Placement Groups
-=========================
-
-It is possible for all OSDs that had copies of a given placement group to fail.
-If that's the case, that subset of the object store is unavailable, and the
-monitor will receive no status updates for those placement groups. To detect
-this situation, the monitor marks any placement group whose primary OSD has
-failed as ``stale``. For example::
-
-    ceph health
-    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
-
-You can identify which placement groups are ``stale``, and what the last OSDs to
-store them were, with::
-
-    ceph health detail
-    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
-    ...
-    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
-    ...
-    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
-    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
-    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
-
-If we want to get placement group 2.5 back online, for example, this tells us that
-it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
-daemons will allow the cluster to recover that placement group (and, presumably,
-many others).
-
-
-Only a Few OSDs Receive Data
-============================
-
-If you have many nodes in your cluster and only a few of them receive data,
-`check`_ the number of placement groups in your pool. Since placement groups get
-mapped to OSDs, a small number of placement groups will not distribute across
-your cluster. Try creating a pool with a placement group count that is a
-multiple of the number of OSDs. See `Placement Groups`_ for details. The default
-placement group count for pools is often too low; you can change it `here`_.
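-For example, a minimal sketch for inspecting and then raising the placement
-group count of a hypothetical pool named ``data``::
-
-    ceph osd pool get data pg_num
-    ceph osd pool set data pg_num 128
-    ceph osd pool set data pgp_num 128
-
-Note that ``pgp_num`` is raised as well: data does not rebalance onto the
-new placement groups until the effective placement count (``pgp_num``)
-matches ``pg_num``.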
-
-
-Can't Write Data
-================
-
-If your cluster is up, but some OSDs are down and you cannot write data,
-check to ensure that you have the minimum number of OSDs running for the
-placement group. If you don't have the minimum number of OSDs running,
-Ceph will not allow you to write data because there is no guarantee
-that Ceph can replicate your data. See ``osd pool default min size``
-in the `Pool, PG and CRUSH Config Reference`_ for details.
-
-
-PGs Inconsistent
-================
-
-If you receive an ``active+clean+inconsistent`` state, this typically happens
-when errors are detected during scrubbing. As always, we can identify the
-inconsistent placement group(s) with::
-
-    $ ceph health detail
-    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
-    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
-    2 scrub errors
-
-Or if you prefer inspecting the output in a programmatic way::
-
-    $ rados list-inconsistent-pg rbd
-    ["0.6"]
-
-There is only one consistent state, but in the worst case, we could find
-different inconsistencies, from multiple perspectives, in more than one
-object. If an object named ``foo`` in PG ``0.6`` is truncated, we will have::
-
-    $ rados list-inconsistent-obj 0.6 --format=json-pretty
-
-.. code-block:: javascript
-
-   {
-       "epoch": 14,
-       "inconsistents": [
-           {
-               "object": {
-                   "name": "foo",
-                   "nspace": "",
-                   "locator": "",
-                   "snap": "head",
-                   "version": 1
-               },
-               "errors": [
-                   "data_digest_mismatch",
-                   "size_mismatch"
-               ],
-               "union_shard_errors": [
-                   "data_digest_mismatch_oi",
-                   "size_mismatch_oi"
-               ],
-               "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
-               "shards": [
-                   {
-                       "osd": 0,
-                       "errors": [],
-                       "size": 968,
-                       "omap_digest": "0xffffffff",
-                       "data_digest": "0xe978e67f"
-                   },
-                   {
-                       "osd": 1,
-                       "errors": [],
-                       "size": 968,
-                       "omap_digest": "0xffffffff",
-                       "data_digest": "0xe978e67f"
-                   },
-                   {
-                       "osd": 2,
-                       "errors": [
-                           "data_digest_mismatch_oi",
-                           "size_mismatch_oi"
-                       ],
-                       "size": 0,
-                       "omap_digest": "0xffffffff",
-                       "data_digest": "0xffffffff"
-                   }
-               ]
-           }
-       ]
-   }
-
-In this case, we can learn from the output:
-
-* The only inconsistent object is named ``foo``, and it is its head that has
-  inconsistencies.
-* The inconsistencies fall into two categories:
-
-  * ``errors``: these errors indicate inconsistencies between shards without a
-    determination of which shard(s) are bad. Check for the ``errors`` in the
-    ``shards`` array, if available, to pinpoint the problem.
-
-    * ``data_digest_mismatch``: the digest of the replica read from OSD.2 is
-      different from that of OSD.0 and OSD.1
-    * ``size_mismatch``: the size of the replica read from OSD.2 is 0, while
-      the size reported by OSD.0 and OSD.1 is 968.
-
-  * ``union_shard_errors``: the union of all shard-specific ``errors`` in the
-    ``shards`` array. The ``errors`` are set for the given shard that has the
-    problem. They include errors like ``read_error``. The ``errors`` ending in
-    ``oi`` indicate a comparison with ``selected_object_info``. Look at the
-    ``shards`` array to determine which shard has which error(s).
-
-    * ``data_digest_mismatch_oi``: the digest ``0xffffffff`` calculated from
-      the shard read from OSD.2 does not match the digest stored in the
-      object-info.
-    * ``size_mismatch_oi``: the size stored in the object-info is different
-      from the one read from OSD.2. The latter is 0.
-
-You can repair the inconsistent placement group by executing::
-
-    ceph pg repair {placement-group-ID}
-
-This overwrites the `bad` copies with the `authoritative` ones. In most cases,
-Ceph is able to choose authoritative copies from all available replicas using
-some predefined criteria. But this does not always work. For example, the stored
-data digest could be missing, and the calculated digest will be ignored when
-choosing the authoritative copies. So, please use the above command with caution.
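-As a sketch for the example above, you could repair PG ``0.6`` and then
-re-run a deep scrub to confirm that it returns to ``active+clean``::
-
-    ceph pg repair 0.6
-    ceph pg deep-scrub 0.6
-    ceph health detail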
-
-If ``read_error`` is listed in the ``errors`` attribute of a shard, the
-inconsistency is likely due to disk errors. You might want to check the disk
-used by that OSD.
-
-If you receive ``active+clean+inconsistent`` states periodically due to
-clock skew, you may consider configuring your `NTP`_ daemons on your
-monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
-`Clock Settings`_ for additional details.
-
-
-Erasure Coded PGs are not active+clean
-======================================
-
-When CRUSH fails to find enough OSDs to map to a PG, it will show as
-``2147483647``, which is ``ITEM_NONE``, meaning ``no OSD found``. For
-instance::
-
-    [2,1,6,0,5,8,2147483647,7,4]
-
-Not enough OSDs
----------------
-
-If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
-9, that is what it will show. You can either create another erasure
-coded pool that requires fewer OSDs::
-
-    ceph osd erasure-code-profile set myprofile k=5 m=3
-    ceph osd pool create erasurepool 16 16 erasure myprofile
-
-or add new OSDs and the PG will automatically use them.
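-To double-check how many OSDs an erasure coded pool requires per PG (the
-sum of ``k`` and ``m``), you can inspect its profile; for the profile
-created above::
-
-    ceph osd erasure-code-profile get myprofile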
-
-CRUSH constraints cannot be satisfied
--------------------------------------
-
-If the cluster has enough OSDs, it is possible that the CRUSH ruleset
-imposes constraints that cannot be satisfied. If there are 10 OSDs on
-two hosts and the CRUSH ruleset requires that no two OSDs from the
-same host are used in the same PG, the mapping may fail because only
-two OSDs will be found. You can check the constraint by displaying the
-ruleset::
-
-    $ ceph osd crush rule ls
-    [
-        "replicated_ruleset",
-        "erasurepool"]
-    $ ceph osd crush rule dump erasurepool
-    { "rule_id": 1,
-      "rule_name": "erasurepool",
-      "ruleset": 1,
-      "type": 3,
-      "min_size": 3,
-      "max_size": 20,
-      "steps": [
-            { "op": "take",
-              "item": -1,
-              "item_name": "default"},
-            { "op": "chooseleaf_indep",
-              "num": 0,
-              "type": "host"},
-            { "op": "emit"}]}
-
-You can resolve the problem by creating a new pool in which PGs are allowed
-to have OSDs residing on the same host with::
-
-    ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
-    ceph osd pool create erasurepool 16 16 erasure myprofile
-
-CRUSH gives up too soon
------------------------
-
-If the Ceph cluster has just enough OSDs to map the PG (for instance a
-cluster with a total of 9 OSDs and an erasure coded pool that requires
-9 OSDs per PG), it is possible that CRUSH gives up before finding a
-mapping. This can be resolved by:
-
-* lowering the erasure coded pool requirements to use fewer OSDs per PG
-  (that requires the creation of another pool, as erasure code profiles
-  cannot be modified dynamically).
-
-* adding more OSDs to the cluster (that does not require modifying the
-  erasure coded pool; it will become clean automatically).
-
-* using a hand-made CRUSH ruleset that tries more times to find a good
-  mapping. This can be done by setting ``set_choose_tries`` to a value
-  greater than the default.
-
-You should first verify the problem with ``crushtool`` after
-extracting the crushmap from the cluster, so your experiments do not
-modify the Ceph cluster and only work on local files::
-
-    $ ceph osd crush rule dump erasurepool
-    { "rule_name": "erasurepool",
-      "ruleset": 1,
-      "type": 3,
-      "min_size": 3,
-      "max_size": 20,
-      "steps": [
-            { "op": "take",
-              "item": -1,
-              "item_name": "default"},
-            { "op": "chooseleaf_indep",
-              "num": 0,
-              "type": "host"},
-            { "op": "emit"}]}
-    $ ceph osd getcrushmap > crush.map
-    got crush map from osdmap epoch 13
-    $ crushtool -i crush.map --test --show-bad-mappings \
-       --rule 1 \
-       --num-rep 9 \
-       --min-x 1 --max-x $((1024 * 1024))
-    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
-    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
-    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
-
-Where ``--num-rep`` is the number of OSDs the erasure code CRUSH
-ruleset needs and ``--rule`` is the value of the ``ruleset`` field
-displayed by ``ceph osd crush rule dump``. The test will try mapping
-one million values (i.e. the range defined by ``[--min-x,--max-x]``)
-and must display at least one bad mapping. If it outputs nothing, it
-means all mappings are successful and you can stop right there: the
-problem is elsewhere.
-
-The CRUSH ruleset can be edited by decompiling the crush map::
-
-    $ crushtool --decompile crush.map > crush.txt
-
-and adding the following line to the ruleset::
-
-    step set_choose_tries 100
-
-The relevant part of the ``crush.txt`` file should look something
-like::
-
-    rule erasurepool {
-            ruleset 1
-            type erasure
-            min_size 3
-            max_size 20
-            step set_chooseleaf_tries 5
-            step set_choose_tries 100
-            step take default
-            step chooseleaf indep 0 type host
-            step emit
-    }
-
-It can then be compiled and tested again::
-
-    $ crushtool --compile crush.txt -o better-crush.map
-
-When all mappings succeed, a histogram of the number of tries that
-were necessary to find all of them can be displayed with the
-``--show-choose-tries`` option of ``crushtool``::
-
-    $ crushtool -i better-crush.map --test --show-bad-mappings \
-       --show-choose-tries \
-       --rule 1 \
-       --num-rep 9 \
-       --min-x 1 --max-x $((1024 * 1024))
-    ...
-    11:        42
-    12:        44
-    13:        54
-    14:        45
-    15:        35
-    16:        34
-    17:        30
-    18:        25
-    19:        19
-    20:        22
-    21:        20
-    22:        17
-    23:        13
-    24:        16
-    25:        13
-    26:        11
-    27:        11
-    28:        13
-    29:        11
-    30:        10
-    31:         6
-    32:         5
-    33:        10
-    34:         3
-    35:         7
-    36:         5
-    37:         2
-    38:         5
-    39:         5
-    40:         2
-    41:         5
-    42:         4
-    43:         1
-    44:         2
-    45:         2
-    46:         3
-    47:         1
-    48:         0
-    ...
-    102:        0
-    103:        1
-    104:        0
-    ...
-
-It took 11 tries to map 42 PGs, 12 tries to map 44 PGs, etc. The highest
-number of tries is the minimum value of ``set_choose_tries`` that prevents
-bad mappings (i.e. 103 in the above output, because it did not take more
-than 103 tries for any PG to be mapped).
-
-.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
-.. _here: ../../configuration/pool-pg-config-ref
-.. _Placement Groups: ../../operations/placement-groups
-.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
-.. _NTP: http://en.wikipedia.org/wiki/Network_Time_Protocol
-.. _The Network Time Protocol: http://www.ntp.org/
-.. _Clock Settings: ../../configuration/mon-config-ref/#clock