Diffstat (limited to 'src/ceph/doc/rados/operations/crush-map.rst')
-rw-r--r-- | src/ceph/doc/rados/operations/crush-map.rst | 956 |
1 files changed, 0 insertions, 956 deletions
diff --git a/src/ceph/doc/rados/operations/crush-map.rst b/src/ceph/doc/rados/operations/crush-map.rst
deleted file mode 100644
index 05fa4ff..0000000
--- a/src/ceph/doc/rados/operations/crush-map.rst
+++ /dev/null
@@ -1,956 +0,0 @@

============
 CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
determines how to store and retrieve data by computing data storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly
store and retrieve data in OSDs with a uniform distribution of data across the
cluster. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of
'buckets' for aggregating the devices into physical locations, and a list of
rules that tell CRUSH how it should replicate data in a Ceph cluster's pools.
By reflecting the underlying physical organization of the installation, CRUSH
can model—and thereby address—potential sources of correlated device failures.
Typical sources include physical proximity, a shared power source, and a shared
network. By encoding this information into the cluster map, CRUSH placement
policies can separate object replicas across different failure domains while
still maintaining the desired distribution. For example, to address the
possibility of concurrent failures, it may be desirable to ensure that data
replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs they are automatically placed within the CRUSH map under a
``host`` node named with the hostname for the host they are running on. This,
combined with the default CRUSH failure domain, ensures that replicas or erasure
code shards are separated across hosts and a single host failure will not
affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas across
racks, for example, is common for mid- to large-sized clusters.


CRUSH Location
==============

The location of an OSD in terms of the CRUSH map's hierarchy is
referred to as a ``crush location``. This location specifier takes the
form of a list of key and value pairs describing a position. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the 'default' CRUSH tree (this is the case for the vast
majority of clusters), its crush location could be described as::

  root=default row=a rack=a2 chassis=a2a host=a2a1

Note:

#. Note that the order of the keys does not matter.
#. The key name (left of ``=``) must be a valid CRUSH ``type``. By default
   these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
   but those types can be customized to be anything appropriate by modifying
   the CRUSH map.
#. Not all keys need to be specified. For example, by default, Ceph
   automatically sets a ``ceph-osd`` daemon's location to be
   ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``).


The crush location for an OSD is normally expressed via the ``crush location``
config option being set in the ``ceph.conf`` file. Each time the OSD starts,
it verifies it is in the correct location in the CRUSH map and, if it is not,
it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the ``[osd]`` section::

  osd crush update on start = false


Custom location hooks
---------------------

A customized location hook can be used to generate a more complete
crush location on startup. The sample ``ceph-crush-location`` utility
will generate a CRUSH location string for a given daemon. The
location is based on, in order of preference:

#. A ``crush location`` option in ceph.conf.
#. A default of ``root=default host=HOSTNAME`` where the hostname is
   generated with the ``hostname -s`` command.

This is not useful by itself, as the OSD itself has the exact same
behavior. However, the script can be modified to provide additional
location fields (for example, the rack or datacenter), and the hook
can then be enabled via the config option::

  crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description::

  $ ceph-crush-location --cluster CLUSTER --id ID --type TYPE

where the cluster name is typically 'ceph', the id is the daemon
identifier (the OSD number), and the daemon type is typically ``osd``.


CRUSH structure
===============

The CRUSH map consists of, loosely speaking, a hierarchy describing
the physical topology of the cluster, and a set of rules defining
policy about how we place data on those devices. The hierarchy has
devices (``ceph-osd`` daemons) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., 'three replicas in different
racks').

Devices
-------

Devices are individual ``ceph-osd`` daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally ``osd.N`` where ``N`` is the device id.

Devices may also have a *device class* associated with them (e.g.,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a
crush rule.

Types and Buckets
-----------------

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of *types* that are
used to describe these nodes. By default, these types include:

- osd (or device)
- host
- chassis
- rack
- row
- pdu
- pod
- room
- datacenter
- region
- root

Most clusters make use of only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type ``osd``) at the
leaves, interior nodes with non-device types, and a root node of type
``root``. For example:


.. ditaa::

                           +-----------------+
                           | {o}root default |
                           +--------+--------+
                                    |
                   +----------------+----------------+
                   |                                 |
           +-------+-------+                 +-------+-------+
           |  {o}host foo  |                 |  {o}host bar  |
           +-------+-------+                 +-------+-------+
                   |                                 |
           +-------+-------+                 +-------+-------+
           |               |                 |               |
     +-----+-----+   +-----+-----+     +-----+-----+   +-----+-----+
     |   osd.0   |   |   osd.1   |     |   osd.2   |   |   osd.3   |
     +-----------+   +-----------+     +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight*
associated with it, indicating the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree from there, such that the weight of the default node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).

You can get a simple view of the CRUSH hierarchy for your cluster,
including the weights, with::

  ceph osd crush tree

Rules
-----

Rules define policy about how data is distributed across the devices
in the hierarchy.

CRUSH rules define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places object replicas. For example, you might create a rule selecting
a pair of targets for 2-way mirroring, another rule for selecting
three targets in two different data centers for 3-way mirroring, and
yet another rule for erasure coding over six storage devices. For a
detailed discussion of CRUSH rules, refer to `CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data`_, and more
specifically to **Section 3.2**.

In almost all cases, CRUSH rules can be created via the CLI by
specifying the *pool type* they will be used for (replicated or
erasure coded), the *failure domain*, and optionally a *device class*.
In rare cases rules must be written by hand by manually editing the
CRUSH map.

You can see what rules are defined for your cluster with::

  ceph osd crush rule ls

You can view the contents of the rules with::

  ceph osd crush rule dump

Device classes
--------------

Each device can optionally have a *class* associated with it. By
default, OSDs automatically set their class on startup to either
`hdd`, `ssd`, or `nvme` based on the type of device they are backed
by.

The device class for one or more OSDs can be explicitly set with::

  ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class is set, it cannot be changed to another class
until the old class is unset with::

  ceph osd crush rm-device-class <osd-name> [...]

This allows administrators to set device classes without the class
being changed on OSD restart or by some other script.

A placement rule that targets a specific device class can be created with::

  ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

A pool can then be changed to use the new rule with::

  ceph osd pool set <pool-name> crush_rule <rule-name>

Device classes are implemented by creating a "shadow" CRUSH hierarchy
for each device class in use that contains only devices of that class.
Rules can then distribute data over the shadow hierarchy. One nice
thing about this approach is that it is fully backward compatible with
old Ceph clients.

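
To tie these commands together, here is a minimal sketch of steering a
replicated pool onto SSD-backed OSDs; the ``fast-rule`` and ``fast-pool``
names (and the specific OSD ids) are purely illustrative::

  # explicitly classify three OSDs as ssd (usually detected automatically)
  ceph osd crush set-device-class ssd osd.0 osd.1 osd.2

  # create a replicated rule that chooses ssd devices, separating
  # replicas across hosts under the default root
  ceph osd crush rule create-replicated fast-rule default host ssd

  # point an existing pool at the new rule
  ceph osd pool set fast-pool crush_rule fast-rule
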

You can view the CRUSH hierarchy with shadow items with::

  ceph osd crush tree --show-shadow


Weight sets
-----------

A *weight set* is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we *should* be storing where. However, because CRUSH is
based on a pseudorandom placement process, there is always some
variation from this ideal distribution, the same way that rolling a
die sixty times will not result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to do a numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.

There are two types of weight sets supported:

#. A **compat** weight set is a single alternative set of weights for
   each device and node in the cluster. This is not well-suited for
   correcting for all anomalies (for example, placement groups for
   different pools may be different sizes and have different load
   levels, but will be mostly treated the same by the balancer).
   However, compat weight sets have the huge advantage that they are
   *backward compatible* with previous versions of Ceph, which means
   that even though weight sets were first introduced in Luminous
   v12.2.z, older clients (e.g., firefly) can still connect to the
   cluster when a compat weight set is being used to balance data.
#. A **per-pool** weight set is more flexible in that it allows
   placement to be optimized for each data pool. Additionally,
   weights can be adjusted for each position of placement, allowing
   the optimizer to correct for a subtle skew of data toward devices
   with small weights relative to their peers (an effect that is
   usually only apparent in very large clusters but which can cause
   balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
``(compat)`` or the pool name) in the output of the command::

  ceph osd crush tree

When both *compat* and *per-pool* weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither are present, it will use the normal CRUSH weights.

Although weight sets can be set up and manipulated by hand, it is
recommended that the *balancer* module be enabled to do so
automatically.


Modifying the CRUSH map
=======================

.. _addosd:

Add/Move an OSD
---------------

.. note:: OSDs are normally automatically added to the CRUSH map when
   the OSD is created. This command is rarely needed.

To add or move an OSD in the CRUSH map of a running cluster::

  ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``


``root``

:Description: The root node of the tree in which the OSD resides (normally ``default``).
:Type: Key/value pair.
:Required: Yes
:Example: ``root=default``


``bucket-type``

:Description: You may specify the OSD's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``


The following example adds ``osd.0`` to the hierarchy, or moves the
OSD from a previous location::

  ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1


Adjust OSD weight
-----------------

.. note:: Normally OSDs automatically add themselves to the CRUSH map
   with the correct weight when they are created. This command
   is rarely needed.

To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute
the following::

  ceph osd crush reweight {name} {weight}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


``weight``

:Description: The CRUSH weight for the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``


.. _removeosd:

Remove an OSD
-------------

.. note:: OSDs are normally removed from the CRUSH map as part of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, execute the
following::

  ceph osd crush remove {name}

Where:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``


Add a Bucket
------------

.. note:: Buckets are normally implicitly created when an OSD is added
   that specifies a ``{bucket-type}={bucket-name}`` as part of its
   location and a bucket with that name does not already exist. This
   command is typically used when manually adjusting the structure of the
   hierarchy after OSDs have been created (for example, to move a
   series of hosts underneath a new rack-level bucket).

To add a bucket in the CRUSH map of a running cluster, execute the
``ceph osd crush add-bucket`` command::

  ceph osd crush add-bucket {bucket-name} {bucket-type}

Where:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``


``bucket-type``

:Description: The type of the bucket. The type must already exist in the hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``


The following example adds the ``rack12`` bucket to the hierarchy::

  ceph osd crush add-bucket rack12 rack

Move a Bucket
-------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, execute the following::

  ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

Where:

``bucket-name``

:Description: The name of the bucket to move/reposition.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: You may specify the bucket's location in the CRUSH hierarchy.
:Type: Key/value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

Remove a Bucket
---------------

To remove a bucket from the CRUSH map hierarchy, execute the following::

  ceph osd crush remove {bucket-name}

.. note:: A bucket must be empty before removing it from the CRUSH hierarchy.

Where:

``bucket-name``

:Description: The name of the bucket that you'd like to remove.
:Type: String
:Required: Yes
:Example: ``rack12``

The following example removes the ``rack12`` bucket from the hierarchy::

  ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: This step is normally done automatically by the ``balancer``
   module when enabled.


To create a *compat* weight set::

  ceph osd crush weight-set create-compat

Weights for the compat weight set can be adjusted with::

  ceph osd crush weight-set reweight-compat {name} {weight}

The compat weight set can be destroyed with::

  ceph osd crush weight-set rm-compat

Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool::

  ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets require that all servers and daemons
   run Luminous v12.2.z or later.

Where:

``pool-name``

:Description: The name of a RADOS pool
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              has a single weight for each device or bucket. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example, if a pool has a replica count of
              3, then a positional weight set will have three weights
              for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set::

  ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets::

  ceph osd crush weight-set ls

To remove a weight set::

  ceph osd crush weight-set rm {pool-name}

Creating a rule for a replicated pool
-------------------------------------

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of ``host`` is selected, then CRUSH will ensure that
each replica of the data is stored on a different host. If ``rack``
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size of your
cluster and how your hierarchy is structured.

Normally, the entire cluster hierarchy is nested beneath a root node
named ``default``. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn't matter what type is associated with that node (it doesn't have
to be a ``root`` node).

It is also possible to create a rule that restricts data placement to
a specific *class* of device. By default, Ceph OSDs automatically
classify themselves as either ``hdd`` or ``ssd``, depending on the
underlying type of device being used. These classes can also be
customized.

To create a replicated rule::

  ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

Where:

``name``

:Description: The name of the rule
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the node under which data should be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes across which we should separate replicas.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class data should be placed on.
:Type: String
:Required: No
:Example: ``ssd``

Creating a rule for an erasure coded pool
-----------------------------------------

For an erasure-coded pool, the same basic decisions need to be made as
with a replicated pool: what is the failure domain, what node in the
hierarchy will data be placed under (usually ``default``), and will
placement be restricted to a specific device class.
Erasure code pools are created a bit differently, however, because they
need to be constructed carefully based on the erasure code being used.
For this reason, you must include this information in the *erasure code
profile*. A CRUSH rule will then be created from that profile, either
explicitly or automatically when the profile is used to create a pool.

The erasure code profiles can be listed with::

  ceph osd erasure-code-profile ls

An existing profile can be viewed with::

  ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with ``crush-``, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

* **crush-root**: the name of the CRUSH node to place data under [default: ``default``].
* **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``].
* **crush-device-class**: the device class to place data on [default: none, meaning all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule.

Once a profile is defined, you can create a CRUSH rule with::

  ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not actually necessary to
   explicitly create the rule. If the erasure code profile alone is
   specified and the rule argument is left off then Ceph will create
   the CRUSH rule automatically.

Deleting rules
--------------

Rules that are not in use by pools can be deleted with::

  ceph osd crush rule rm {rule-name}


Tunables
========

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables are first supported
in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables is changed from the legacy
default behavior, the ``ceph-mon`` and ``ceph-osd`` daemons will prevent
older clients that do not support the new CRUSH features from connecting
to the cluster.

argonaut (legacy)
-----------------

The legacy CRUSH behavior used by argonaut and older releases works
fine for most clusters, provided there are not too many OSDs that have
been marked out.

bobtail (CRUSH_TUNABLES2)
-------------------------

The bobtail tunable profile fixes a few key misbehaviors:

* For hierarchies with a small number of devices in the leaf buckets,
  some PGs map to fewer than the desired number of replicas. This
  commonly happens for hierarchies with "host" nodes with a small
  number (1-3) of OSDs nested beneath each one.

* For large clusters, some small percentages of PGs map to less than
  the desired number of OSDs.
  This is more prevalent when there are
  several layers of the hierarchy (e.g., row, rack, host, osd).

* When some OSDs are marked out, the data tends to get redistributed
  to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

* ``choose_local_tries``: Number of local retries. Legacy value is
  2, optimal value is 0.

* ``choose_local_fallback_tries``: Legacy value is 5, optimal value
  is 0.

* ``choose_total_tries``: Total number of attempts to choose an item.
  Legacy value was 19; subsequent testing indicates that a value of
  50 is more appropriate for typical clusters. For extremely large
  clusters, a larger value might be necessary.

* ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
  will retry, or only try once and allow the original placement to
  retry. Legacy default is 0, optimal value is 1.

Migration impact:

* Moving from argonaut to bobtail tunables triggers a moderate amount
  of data movement. Use caution on a cluster that is already
  populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

The firefly tunable profile fixes a problem with the ``chooseleaf``
CRUSH rule behavior that tends to result in PG mappings with too few
results when too many OSDs have been marked out.

The new tunable is:

* ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
  start with a non-zero value of r, based on how many attempts the
  parent has already made. Legacy default is 0, but with this value
  CRUSH is sometimes unable to find a mapping. The optimal value (in
  terms of computational cost and correctness) is 1.

Migration impact:

* For existing clusters that have lots of existing data, changing
  from 0 to 1 will cause a lot of data to move; a value of 4 or 5
  will allow CRUSH to find a valid mapping but will make less data
  move.

straw_calc_version tunable (introduced with Firefly too)
----------------------------------------------------------

There were some problems with the internal weights calculated and
stored in the CRUSH map for ``straw`` buckets. Specifically, when
there were items with a CRUSH weight of 0, or a mix of weights with
some duplicated, CRUSH would distribute data incorrectly (i.e., not
in proportion to the weights).

The new tunable is:

* ``straw_calc_version``: A value of 0 preserves the old, broken
  internal weight calculation; a value of 1 fixes the behavior.

Migration impact:

* Moving to straw_calc_version 1 and then adjusting a straw bucket
  (by adding, removing, or reweighting an item, or by using the
  reweight-all command) can trigger a small to moderate amount of
  data movement *if* the cluster has hit one of the problematic
  conditions.

This tunable option is special because it has absolutely no impact
on the required kernel version on the client side.

hammer (CRUSH_V4)
-----------------

The hammer tunable profile does not affect the
mapping of existing CRUSH maps simply by changing the profile. However:

* There is a new bucket type (``straw2``) supported. The new
  ``straw2`` bucket type fixes several limitations in the original
  ``straw`` bucket. Specifically, the old ``straw`` buckets would
  change some mappings that should not have changed when a weight was
  adjusted, while ``straw2`` achieves the original goal of only
  changing mappings to or from the bucket item whose weight has
  changed.

* ``straw2`` is the default for any newly created buckets.


Migration impact:

* Changing a bucket type from ``straw`` to ``straw2`` will result in
  a reasonably small amount of data movement, depending on how much
  the bucket item weights vary from each other. When the weights are
  all the same no data will move, and when item weights vary
  significantly there will be more movement.

jewel (CRUSH_TUNABLES5)
-----------------------

The jewel tunable profile improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster.

The new tunable is:

* ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
  use a better value for an inner loop that greatly reduces the number
  of mapping changes when an OSD is marked out. The legacy value is 0,
  while the new value of 1 uses the new approach.

Migration impact:

* Changing this value on an existing cluster will result in a very
  large amount of data movement as almost every PG mapping is likely
  to change.


Which client versions support CRUSH_TUNABLES
--------------------------------------------

* argonaut series, v0.48.1 or later
* v0.49 or later
* Linux kernel version v3.6 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES2
---------------------------------------------

* v0.55 or later, including bobtail series (v0.56.x)
* Linux kernel version v3.9 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES3
---------------------------------------------

* v0.78 (firefly) or later
* Linux kernel version v3.15 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_V4
--------------------------------------

* v0.94 (hammer) or later
* Linux kernel version v4.1 or later (for the file system and RBD kernel clients)

Which client versions support CRUSH_TUNABLES5
---------------------------------------------

* v10.0.2 (jewel) or later
* Linux kernel version v4.5 or later (for the file system and RBD kernel clients)

Warning when tunables are non-optimal
-------------------------------------

Starting with version v0.74, Ceph will issue a health warning if the
current CRUSH tunables don't include all the optimal values from the
``default`` profile (see below for the meaning of the ``default`` profile).
To make this warning go away, you have two options:

1. Adjust the tunables on the existing cluster. Note that this will
   result in some data movement (possibly as much as 10%). This is the
   preferred route, but should be taken with care on a production cluster
   where the data movement may affect performance. You can enable optimal
   tunables with::

     ceph osd crush tunables optimal

   If things go poorly (e.g., too much load) and not very much
   progress has been made, or there is a client compatibility problem
   (old kernel cephfs or rbd clients, or pre-bobtail librados
   clients), you can switch back with::

     ceph osd crush tunables legacy


2. You can make the warning go away without making any changes to CRUSH by
   adding the following option to your ceph.conf ``[mon]`` section::

     mon warn on legacy crush tunables = false

   For the change to take effect, you will need to restart the monitors, or
   apply the option to running monitors with::

     ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables


A few important points
----------------------

* Adjusting these values will result in the shift of some PGs between
  storage nodes. If the Ceph cluster is already storing a lot of
  data, be prepared for some fraction of the data to move.
* The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
  feature bits of new connections as soon as they get
  the updated map. However, already-connected clients are
  effectively grandfathered in, and will misbehave if they do not
  support the new feature.
* If the CRUSH tunables are set to non-legacy values and then later
  changed back to the default values, ``ceph-osd`` daemons will not be
  required to support the feature. However, the OSD peering process
  requires examining and understanding old maps. Therefore, you
  should not run old versions of the ``ceph-osd`` daemon
  if the cluster has previously used non-legacy CRUSH values, even if
  the latest version of the map has been switched back to using the
  legacy defaults.

Tuning CRUSH
------------

The simplest way to adjust the crush tunables is by changing to a known
profile. Those are:

* ``legacy``: the legacy behavior from argonaut and earlier.
* ``argonaut``: the legacy values supported by the original argonaut release
* ``bobtail``: the values supported by the bobtail release
* ``firefly``: the values supported by the firefly release
* ``hammer``: the values supported by the hammer release
* ``jewel``: the values supported by the jewel release
* ``optimal``: the best (i.e., optimal) values of the current version of Ceph
* ``default``: the default values of a new cluster installed from
  scratch. These values, which depend on the current version of Ceph,
  are hard coded and are generally a mix of optimal and legacy values.
  These values generally match the ``optimal`` profile of the previous
  LTS release, or the most recent release for which we can expect most
  users to have up-to-date clients.

You can select a profile on a running cluster with the command::

  ceph osd crush tunables {PROFILE}

Note that this may result in some data movement.


.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf


Primary Affinity
================

When a Ceph Client reads or writes data, it always contacts the primary OSD in
the acting set. For the set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes
an OSD is not well suited to act as a primary compared to other OSDs (e.g., it
has a slow disk or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
the OSD as a primary in an acting set::

  ceph osd primary-affinity <osd-id> <weight>

Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
may set an OSD's primary affinity in the range ``0-1``, where ``0`` means that
the OSD may **NOT** be used as a primary and ``1`` means that an OSD may be
used as a primary.
When the weight is ``< 1``, it is less
likely that CRUSH will select the Ceph OSD Daemon to act as a primary.
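
For example (the OSD id and affinity value here are purely illustrative),
reducing the likelihood that ``osd.2`` is selected as a primary might look
like::

  ceph osd primary-affinity osd.2 0.5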