diff --git a/src/ceph/doc/dev/cache-pool.rst b/src/ceph/doc/dev/cache-pool.rst
new file mode 100644
index 0000000..7dc71c8
--- /dev/null
+++ b/src/ceph/doc/dev/cache-pool.rst
@@ -0,0 +1,200 @@
+Cache pool
+==========
+
+Purpose
+-------
+
+Use a pool of fast storage devices (probably SSDs) as a cache for an
+existing slower and larger pool.
+
+Use a replicated pool as a front-end to service most I/O, and destage
+cold data to a separate erasure coded pool that does not currently (and
+cannot efficiently) handle the workload.
+
+We should be able to create and add a cache pool to an existing pool
+of data, and later remove it, without disrupting service or migrating
+data around.
+
+Use cases
+---------
+
+Read-write pool, writeback
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have an existing data pool and put a fast cache pool "in front" of
+it.  Writes go to the cache pool and are acknowledged immediately.  We
+flush them back to the data pool based on the defined policy.
+
+Read-only pool, weak consistency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have an existing data pool and add one or more read-only cache
+pools. We copy data to the cache pool(s) on read. Writes are
+forwarded to the original data pool. Stale data is expired from the
+cache pools based on the defined policy.
+
+This is likely only useful for specific applications with specific
+data access patterns. It may be a match for rgw, for example.
+
+
+Interface
+---------
+
+Set up a read/write cache pool foo-hot for pool foo::
+
+ ceph osd tier add foo foo-hot
+ ceph osd tier cache-mode foo-hot writeback
+
+Direct all traffic for foo to foo-hot::
+
+ ceph osd tier set-overlay foo foo-hot
+
+Set the target size and enable the tiering agent for foo-hot::
+
+ ceph osd pool set foo-hot hit_set_type bloom
+ ceph osd pool set foo-hot hit_set_count 1
+ ceph osd pool set foo-hot hit_set_period 3600 # 1 hour
+ ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
+ ceph osd pool set foo-hot min_read_recency_for_promote 1
+ ceph osd pool set foo-hot min_write_recency_for_promote 1
+
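+To double-check what was applied, the values can be read back with
+``ceph osd pool get`` (a quick sanity check; the set of variables that can
+be read back this way may vary between releases)::
+
+  ceph osd pool get foo-hot hit_set_period
+  ceph osd pool get foo-hot target_max_bytes
+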
+Drain the cache in preparation for turning it off::
+
+ ceph osd tier cache-mode foo-hot forward
+ rados -p foo-hot cache-flush-evict-all
+
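+One way to confirm that the cache has actually drained (a sketch; ``ceph df``
+output differs between releases) is to look for leftover objects and remaining
+usage in the pool::
+
+  rados -p foo-hot ls
+  ceph df
+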
+When the cache pool is finally empty, disable it::
+
+ ceph osd tier remove-overlay foo
+ ceph osd tier remove foo foo-hot
+
+Read-only pools with lazy consistency::
+
+ ceph osd tier add foo foo-east
+ ceph osd tier cache-mode foo-east readonly
+ ceph osd tier add foo foo-west
+ ceph osd tier cache-mode foo-west readonly
+
+
+
+Tiering agent
+-------------
+
+The tiering policy is defined as properties on the cache pool itself.
+
+HitSet metadata
+~~~~~~~~~~~~~~~
+
+First, the agent requires HitSet information to be tracked on the
+cache pool in order to determine which objects in the pool are being
+accessed. This is enabled with::
+
+ ceph osd pool set foo-hot hit_set_type bloom
+ ceph osd pool set foo-hot hit_set_count 1
+ ceph osd pool set foo-hot hit_set_period 3600 # 1 hour
+
+The supported HitSet types include 'bloom' (a bloom filter, the
+default), 'explicit_hash', and 'explicit_object'. The latter two
+explicitly enumerate accessed objects and are less memory-efficient.
+They exist primarily for debugging and to demonstrate the pluggability
+of the infrastructure.  For the bloom filter type, you can additionally
+define the false positive probability for the bloom filter (default is 0.05)::
+
+ ceph osd pool set foo-hot hit_set_fpp 0.15
+
+The ``hit_set_period`` defines how much time each HitSet should cover,
+and ``hit_set_count`` how many such HitSets to store.  Binning accesses
+over time allows Ceph to independently determine whether an object was
+accessed at least once and whether it was accessed more than once over
+some time period ("age" vs "temperature").
+
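+For example, to bin one hour of accesses into four 15-minute HitSets rather
+than a single one-hour HitSet (illustrative values; the right granularity
+depends on the workload)::
+
+  ceph osd pool set foo-hot hit_set_count 4
+  ceph osd pool set foo-hot hit_set_period 900   # 15 minutes
+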
+The ``min_read_recency_for_promote`` defines how many HitSets to check for
+the existence of an object when handling a read operation.  The result is
+used to decide whether to promote the object asynchronously.  Its value
+should be between 0 and ``hit_set_count``.  If it is set to 0, the object is
+always promoted.  If it is set to 1, only the current HitSet is checked: the
+object is promoted if it is present there, and not promoted otherwise.  For
+higher values, that many archived HitSets are checked, and the object is
+promoted if it is found in any of the most recent
+``min_read_recency_for_promote`` HitSets.
+
+A similar parameter can be set for the write operation, which is
+``min_write_recency_for_promote``. ::
+
+ ceph osd pool set {cachepool} min_read_recency_for_promote 1
+ ceph osd pool set {cachepool} min_write_recency_for_promote 1
+
+Note that the longer the ``hit_set_period`` and the higher the
+``min_read_recency_for_promote``/``min_write_recency_for_promote``, the more
+RAM the ceph-osd process will consume.  In particular, when the agent is
+actively flushing or evicting cache objects, all ``hit_set_count`` HitSets
+are loaded into RAM.
+
+Cache mode
+~~~~~~~~~~
+
+The most important policy is the cache mode::
+
+ ceph osd pool set foo-hot cache-mode writeback
+
+The supported modes are 'none', 'writeback', 'forward', and
+'readonly'. Most installations want 'writeback', which will write
+into the cache tier and only later flush updates back to the base
+tier. Similarly, any object that is read will be promoted into the
+cache tier.
+
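+The mode currently configured on a tier shows up in the pool entry of the
+OSD map; one way to check it (a sketch; the exact fields shown vary by
+release)::
+
+  ceph osd dump | grep foo-hot
+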
+The 'forward' mode is intended for when the cache is being disabled
+and needs to be drained. No new objects will be promoted or written
+to the cache pool unless they are already present. A background
+operation can then do something like::
+
+ rados -p foo-hot cache-try-flush-evict-all
+ rados -p foo-hot cache-flush-evict-all
+
+to force all data to be flushed back to the base tier.
+
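+A simple way to repeat the drain until nothing is left (a hypothetical
+sketch; it assumes no clients are still writing to the tier, and uses only
+the commands above)::
+
+  # keep flushing/evicting until the cache pool reports no objects
+  while [ -n "$(rados -p foo-hot ls)" ]; do
+      rados -p foo-hot cache-flush-evict-all
+      sleep 60
+  done
+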
+The 'readonly' mode is intended for read-only workloads that do not
+require consistency to be enforced by the storage system. Writes will
+be forwarded to the base tier, but objects that are read will get
+promoted to the cache. No attempt is made by Ceph to ensure that the
+contents of the cache tier(s) are consistent in the presence of object
+updates.
+
+Cache sizing
+~~~~~~~~~~~~
+
+The agent performs two basic functions: flushing (writing 'dirty'
+cache objects back to the base tier) and evicting (removing cold and
+clean objects from the cache).
+
+The thresholds at which Ceph will flush or evict objects are specified
+relative to a 'target size' of the pool. For example::
+
+ ceph osd pool set foo-hot cache_target_dirty_ratio .4
+ ceph osd pool set foo-hot cache_target_dirty_high_ratio .6
+ ceph osd pool set foo-hot cache_target_full_ratio .8
+
+will begin flushing dirty objects when 40% of the target size is dirty,
+flush more aggressively once 60% is dirty, and begin evicting clean objects
+when usage reaches 80% of the target size.
+
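+With the 1 TB target size set below, for example, those ratios mean that
+flushing starts around 400 GB of dirty data, becomes more aggressive around
+600 GB, and eviction begins once roughly 800 GB of the cache is in use.
+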
+The target size can be specified either in terms of objects or bytes::
+
+ ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
+ ceph osd pool set foo-hot target_max_objects 1000000 # 1 million objects
+
+Note that if both limits are specified, Ceph will begin flushing or
+evicting when either threshold is triggered.
+
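+With the values above, for instance, eviction would begin at roughly 800,000
+objects or 800 GB of usage (80% of either limit), whichever is reached first.
+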
+Other tunables
+~~~~~~~~~~~~~~
+
+You can specify a minimum object age before a recently updated object is
+flushed to the base tier::
+
+ ceph osd pool set foo-hot cache_min_flush_age 600 # 10 minutes
+
+You can specify the minimum age of an object before it will be evicted from
+the cache tier::
+
+ ceph osd pool set foo-hot cache_min_evict_age 1800 # 30 minutes
+
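+Putting it together
+~~~~~~~~~~~~~~~~~~~
+
+As a recap, a complete writeback tier for ``foo`` can be assembled from the
+commands shown above (all values here are illustrative, not recommendations)::
+
+  ceph osd tier add foo foo-hot
+  ceph osd tier cache-mode foo-hot writeback
+  ceph osd tier set-overlay foo foo-hot
+
+  ceph osd pool set foo-hot hit_set_type bloom
+  ceph osd pool set foo-hot hit_set_count 1
+  ceph osd pool set foo-hot hit_set_period 3600            # 1 hour
+  ceph osd pool set foo-hot min_read_recency_for_promote 1
+  ceph osd pool set foo-hot min_write_recency_for_promote 1
+
+  ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
+  ceph osd pool set foo-hot cache_target_dirty_ratio .4
+  ceph osd pool set foo-hot cache_target_dirty_high_ratio .6
+  ceph osd pool set foo-hot cache_target_full_ratio .8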
+
+