Diffstat (limited to 'src/ceph/doc/dev')
-rw-r--r--  src/ceph/doc/dev/PlanningImplementation.txt | 43
-rw-r--r--  src/ceph/doc/dev/blkin.rst | 167
-rw-r--r--  src/ceph/doc/dev/bluestore.rst | 85
-rw-r--r--  src/ceph/doc/dev/cache-pool.rst | 200
-rw-r--r--  src/ceph/doc/dev/ceph-disk.rst | 61
-rw-r--r--  src/ceph/doc/dev/ceph-volume/index.rst | 13
-rw-r--r--  src/ceph/doc/dev/ceph-volume/lvm.rst | 127
-rw-r--r--  src/ceph/doc/dev/ceph-volume/plugins.rst | 65
-rw-r--r--  src/ceph/doc/dev/ceph-volume/systemd.rst | 37
-rw-r--r--  src/ceph/doc/dev/cephfs-snapshots.rst | 114
-rw-r--r--  src/ceph/doc/dev/cephx_protocol.rst | 335
-rw-r--r--  src/ceph/doc/dev/config.rst | 157
-rw-r--r--  src/ceph/doc/dev/confusing.txt | 36
-rw-r--r--  src/ceph/doc/dev/context.rst | 20
-rw-r--r--  src/ceph/doc/dev/corpus.rst | 95
-rw-r--r--  src/ceph/doc/dev/cpu-profiler.rst | 54
-rw-r--r--  src/ceph/doc/dev/delayed-delete.rst | 12
-rw-r--r--  src/ceph/doc/dev/dev_cluster_deployement.rst | 121
-rw-r--r--  src/ceph/doc/dev/development-workflow.rst | 250
-rw-r--r--  src/ceph/doc/dev/documenting.rst | 132
-rw-r--r--  src/ceph/doc/dev/erasure-coded-pool.rst | 137
-rw-r--r--  src/ceph/doc/dev/file-striping.rst | 161
-rw-r--r--  src/ceph/doc/dev/freebsd.rst | 58
-rw-r--r--  src/ceph/doc/dev/generatedocs.rst | 56
-rw-r--r--  src/ceph/doc/dev/index-old.rst | 42
-rw-r--r--  src/ceph/doc/dev/index.rst | 1558
-rw-r--r--  src/ceph/doc/dev/kernel-client-troubleshooting.rst | 17
-rw-r--r--  src/ceph/doc/dev/libs.rst | 18
-rw-r--r--  src/ceph/doc/dev/logging.rst | 106
-rw-r--r--  src/ceph/doc/dev/logs.rst | 57
-rw-r--r--  src/ceph/doc/dev/mds_internals/data-structures.rst | 36
-rw-r--r--  src/ceph/doc/dev/mds_internals/exports.rst | 76
-rw-r--r--  src/ceph/doc/dev/mds_internals/index.rst | 10
-rw-r--r--  src/ceph/doc/dev/messenger.rst | 33
-rw-r--r--  src/ceph/doc/dev/mon-bootstrap.rst | 199
-rw-r--r--  src/ceph/doc/dev/msgr2.rst | 250
-rw-r--r--  src/ceph/doc/dev/network-encoding.rst | 214
-rw-r--r--  src/ceph/doc/dev/network-protocol.rst | 197
-rw-r--r--  src/ceph/doc/dev/object-store.rst | 70
-rw-r--r--  src/ceph/doc/dev/osd-class-path.rst | 16
-rw-r--r--  src/ceph/doc/dev/osd_internals/backfill_reservation.rst | 38
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding.rst | 82
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst | 223
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst | 207
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst | 33
-rw-r--r--  src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst | 385
-rw-r--r--  src/ceph/doc/dev/osd_internals/index.rst | 10
-rw-r--r--  src/ceph/doc/dev/osd_internals/last_epoch_started.rst | 60
-rw-r--r--  src/ceph/doc/dev/osd_internals/log_based_pg.rst | 206
-rw-r--r--  src/ceph/doc/dev/osd_internals/map_message_handling.rst | 131
-rw-r--r--  src/ceph/doc/dev/osd_internals/osd_overview.rst | 106
-rw-r--r--  src/ceph/doc/dev/osd_internals/osd_throttles.rst | 93
-rw-r--r--  src/ceph/doc/dev/osd_internals/osd_throttles.txt | 21
-rw-r--r--  src/ceph/doc/dev/osd_internals/pg.rst | 31
-rw-r--r--  src/ceph/doc/dev/osd_internals/pg_removal.rst | 56
-rw-r--r--  src/ceph/doc/dev/osd_internals/pgpool.rst | 22
-rw-r--r--  src/ceph/doc/dev/osd_internals/recovery_reservation.rst | 75
-rw-r--r--  src/ceph/doc/dev/osd_internals/scrub.rst | 30
-rw-r--r--  src/ceph/doc/dev/osd_internals/snaps.rst | 128
-rw-r--r--  src/ceph/doc/dev/osd_internals/watch_notify.rst | 81
-rw-r--r--  src/ceph/doc/dev/osd_internals/wbthrottle.rst | 28
-rw-r--r--  src/ceph/doc/dev/peering.rst | 259
-rw-r--r--  src/ceph/doc/dev/perf.rst | 39
-rw-r--r--  src/ceph/doc/dev/perf_counters.rst | 198
-rw-r--r--  src/ceph/doc/dev/perf_histograms.rst | 677
-rw-r--r--  src/ceph/doc/dev/placement-group.rst | 151
-rw-r--r--  src/ceph/doc/dev/quick_guide.rst | 131
-rw-r--r--  src/ceph/doc/dev/rados-client-protocol.rst | 117
-rw-r--r--  src/ceph/doc/dev/radosgw/admin/adminops_nonimplemented.rst | 495
-rw-r--r--  src/ceph/doc/dev/radosgw/index.rst | 13
-rw-r--r--  src/ceph/doc/dev/radosgw/s3_compliance.rst | 304
-rw-r--r--  src/ceph/doc/dev/radosgw/usage.rst | 84
-rw-r--r--  src/ceph/doc/dev/rbd-diff.rst | 146
-rw-r--r--  src/ceph/doc/dev/rbd-export.rst | 84
-rw-r--r--  src/ceph/doc/dev/rbd-layering.rst | 281
-rw-r--r--  src/ceph/doc/dev/release-process.rst | 173
-rw-r--r--  src/ceph/doc/dev/repo-access.rst | 38
-rw-r--r--  src/ceph/doc/dev/sepia.rst | 9
-rw-r--r--  src/ceph/doc/dev/session_authentication.rst | 160
-rw-r--r--  src/ceph/doc/dev/versions.rst | 42
-rw-r--r--  src/ceph/doc/dev/wireshark.rst | 41
81 files changed, 10923 insertions, 0 deletions
diff --git a/src/ceph/doc/dev/PlanningImplementation.txt b/src/ceph/doc/dev/PlanningImplementation.txt
new file mode 100644
index 0000000..871eb5f
--- /dev/null
+++ b/src/ceph/doc/dev/PlanningImplementation.txt
@@ -0,0 +1,43 @@
+ <big>About this Document</big>
+This document contains planning and implementation procedures for Ceph. The audience for this document includes technical support personnel, installation engineers, system administrators, and quality assurance.
+<b>Prerequisites</b>
+Users of this document must be familiar with Linux command line options. They must also be familiar with the overall Ceph product.
+Before You Begin
+Before implementing a new Ceph System, first answer the questions in the Ceph Getting Started Guide to determine your configuration needs. Once you have determined your hardware and configuration needs, the following decisions must be made:
+• Determine what level of technical support you need. Pick from the Ceph Technical Support options in the next section.
+• Determine how much and what level of training your organization needs.
+Ceph Technical Support Options
+The Ceph technical support model provides four tiers of support options:
+1st – This option is for brand new customers that need installation, configuration, and setup on their production environment.
+2nd – This level of support requires a trouble ticket to be generated on a case by case basis as customer difficulties arise. Customers can choose between two maintenance options; they can either purchase a yearly maintenance contract, or pay for each trouble resolution as it occurs.
+3rd – This option comes with our bundled packages for customers who have also purchased our hosting plans. In this case, the customer is a service provider. The Help Desk can generally provide this level of incident resolution. (NEED MORE INFO)
+4th – This level of support requires a Service Level Agreement (SLA) between the customer and Dreamhost. This level is used for handling the most difficult or advanced problems.
+Planning a Ceph Cluster Configuration
+The following section contains guidelines for planning the deployment for a Ceph cluster configuration. A Ceph cluster consists of the following core components:
+• Monitors – These must be an odd number, such as one, three, or five. Three is the preferred configuration.
+• Object Storage Devices (OSD) – used as storage nodes
+• Metadata Servers (MDS)
+For redundancy, you should employ several of these components.
+Monitors
+The monitors handle central cluster management, configuration, and state.
+Hardware Requirements:
+• A few gigs of local disk space
+• A fixed network address
+ Warning: Never configure 2 monitors per cluster. If you do, they will both have to be up all of the time, which will greatly degrade system performance.
+Object Storage Devices
+The OSDs store the actual data on the disks. A minimum of two is required.
+Hardware Requirements:
+• As many disks as possible for faster performance and scalability
+• An SSD or NVRAM for a journal, or a RAID controller with a battery-backed NVRAM.
+• Ample RAM for better file system caching
+• Fast network
+ Metadata Servers
+The metadata server daemons (cmds) act as a distributed, coherent cache of file system metadata. They do not store data locally; all metadata is stored on disk via the storage nodes.
+Metadata servers can be added into the cluster on an as-needed basis. The load is automatically balanced. The max_mds parameter controls how many cmds instances are active. Any additional running instances are put in standby mode and can be activated if one of the active daemons becomes unresponsive.
+Hardware Requirements:
+• Large amount of RAM
+• Fast CPU
+• Fast (low latency) network
+• At least two servers for redundancy and load balancing
+TIPS: If you have just a few nodes, put cmon, cmds, and cosd on the same node. For moderate node configurations, put cmon and cmds together, and cosd on the disk nodes. For large node configurations, put cmon, cmds, and cosd each on their own dedicated machine.
+
diff --git a/src/ceph/doc/dev/blkin.rst b/src/ceph/doc/dev/blkin.rst
new file mode 100644
index 0000000..8e0320f
--- /dev/null
+++ b/src/ceph/doc/dev/blkin.rst
@@ -0,0 +1,167 @@
+=========================
+ Tracing Ceph With BlkKin
+=========================
+
+Ceph can use Blkin, a library created by Marios Kogias and others,
+which enables tracking a specific request from the time it enters
+the system at higher levels till it is finally served by RADOS.
+
+In general, Blkin implements the Dapper_ tracing semantics
+in order to show the causal relationships between the different
+processing phases that an IO request may trigger. The goal is an
+end-to-end visualisation of the request's route in the system,
+accompanied by information concerning latencies in each processing
+phase. Thanks to LTTng this can happen with a minimal overhead and
+in realtime. The LTTng traces can then be visualized with Twitter's
+Zipkin_.
+
+.. _Dapper: http://static.googleusercontent.com/media/research.google.com/el//pubs/archive/36356.pdf
+.. _Zipkin: http://zipkin.io/
+
+
+Installing Blkin
+================
+
+You can install Marios Kogias' upstream Blkin_ by hand.::
+
+ cd blkin/
+ make && make install
+
+or build distribution packages using DistroReadyBlkin_, which also comes with
+pkgconfig support. If you choose the latter, then you must generate the
+configure and make files first.::
+
+ cd blkin
+ autoreconf -i
+
+.. _Blkin: https://github.com/marioskogias/blkin
+.. _DistroReadyBlkin: https://github.com/agshew/blkin
+
+
+Configuring Ceph with Blkin
+===========================
+
+If you built and installed Blkin by hand, rather than building and
+installing packages, then set these variables before configuring
+Ceph.::
+
+ export BLKIN_CFLAGS=-Iblkin/
+ export BLKIN_LIBS=-lzipkin-cpp
+
+Since there are separate lttng and blkin changes to Ceph, you may
+want to configure with something like::
+
+ ./configure --with-blkin --without-lttng --with-debug
+
+
+Testing Blkin
+=============
+
+It's easy to test Ceph's Blkin tracing. Let's assume you don't have
+Ceph already running, and you compiled Ceph with Blkin support but
+you didn't install it. Then launch Ceph with the ``vstart.sh`` script
+in Ceph's src directory so you can see the possible tracepoints.::
+
+ cd src
+ OSD=3 MON=3 RGW=1 ./vstart.sh -n
+ lttng list --userspace
+
+You'll see something like the following::
+
+ UST events:
+ -------------
+ PID: 8987 - Name: ./ceph-osd
+ zipkin:timestamp (loglevel: TRACE_WARNING (4)) (type: tracepoint)
+ zipkin:keyval (loglevel: TRACE_WARNING (4)) (type: tracepoint)
+ ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
+
+ PID: 8407 - Name: ./ceph-mon
+ zipkin:timestamp (loglevel: TRACE_WARNING (4)) (type: tracepoint)
+ zipkin:keyval (loglevel: TRACE_WARNING (4)) (type: tracepoint)
+ ust_baddr_statedump:soinfo (loglevel: TRACE_DEBUG_LINE (13)) (type: tracepoint)
+
+ ...
+
+Next, stop Ceph so that the tracepoints can be enabled.::
+
+ ./stop.sh
+
+Start up an LTTng session and enable the tracepoints.::
+
+ lttng create blkin-test
+ lttng enable-event --userspace zipkin:timestamp
+ lttng enable-event --userspace zipkin:keyval
+ lttng start
+
+Then start up Ceph again.::
+
+ OSD=3 MON=3 RGW=1 ./vstart.sh -n
+
+You may want to check that ceph is up.::
+
+ ./ceph status
+
+Now put something in using rados, check that it made it, get it back, and remove it.::
+
+ ./rados mkpool test-blkin
+ ./rados put test-object-1 ./vstart.sh --pool=test-blkin
+ ./rados -p test-blkin ls
+ ./ceph osd map test-blkin test-object-1
+ ./rados get test-object-1 ./vstart-copy.sh --pool=test-blkin
+ md5sum vstart*
+ ./rados rm test-object-1 --pool=test-blkin
+
+You could also use the example in ``examples/librados/`` or ``rados bench``.
+
+Then stop the LTTng session and see what was collected.::
+
+ lttng stop
+ lttng view
+
+You'll see something like::
+
+ [13:09:07.755054973] (+?.?????????) scruffy zipkin:timestamp: { cpu_id = 5 }, { trace_name = "Main", service_name = "MOSDOp", port_no = 0, ip = "0.0.0.0", trace_id = 7492589359882233221, span_id = 2694140257089376129, parent_span_id = 0, event = "Message allocated" }
+ [13:09:07.755071569] (+0.000016596) scruffy zipkin:keyval: { cpu_id = 5 }, { trace_name = "Main", service_name = "MOSDOp", port_no = 0, ip = "0.0.0.0", trace_id = 7492589359882233221, span_id = 2694140257089376129, parent_span_id = 0, key = "Type", val = "MOSDOp" }
+ [13:09:07.755074217] (+0.000002648) scruffy zipkin:keyval: { cpu_id = 5 }, { trace_name = "Main", service_name = "MOSDOp", port_no = 0, ip = "0.0.0.0", trace_id = 7492589359882233221, span_id = 2694140257089376129, parent_span_id = 0, key = "Reqid", val = "client.4126.0:1" }
+ ...
+
+
+Install Zipkin
+===============
+One of the points of using Blkin is that you can look at the traces
+using Zipkin. Zipkin is run as both a tracepoint collector and a web
+service, which means you need to run three services:
+zipkin-collector, zipkin-query and zipkin-web.
+
+Download Zipkin Package::
+
+ wget https://github.com/twitter/zipkin/archive/1.1.0.tar.gz
+ tar zxf 1.1.0.tar.gz
+ cd zipkin-1.1.0
+ bin/collector cassandra &
+ bin/query cassandra &
+ bin/web &
+
+Check Zipkin::
+
+ bin/test
+ Browse http://${zipkin-web-ip}:8080
+
+
+Show Ceph's Blkin Traces in Zipkin-web
+======================================
+Blkin provides a script which translates LTTng results to Zipkin
+(Dapper) semantics.
+
+Send lttng data to Zipkin::
+
+ python3 babeltrace_zipkin.py ${lttng-traces-dir}/${blkin-test}/ust/uid/0/64-bit/ -p ${zipkin-collector-port(9410 by default)} -s ${zipkin-collector-ip}
+
+Example::
+
+ python3 babeltrace_zipkin.py ~/lttng-traces-dir/blkin-test-20150225-160222/ust/uid/0/64-bit/ -p 9410 -s 127.0.0.1
+
+Check Ceph traces on webpage::
+
+ Browse http://${zipkin-web-ip}:8080
+ Click "Find traces"
diff --git a/src/ceph/doc/dev/bluestore.rst b/src/ceph/doc/dev/bluestore.rst
new file mode 100644
index 0000000..91d71d0
--- /dev/null
+++ b/src/ceph/doc/dev/bluestore.rst
@@ -0,0 +1,85 @@
+===================
+BlueStore Internals
+===================
+
+
+Small write strategies
+----------------------
+
+* *U*: Uncompressed write of a complete, new blob.
+
+ - write to new blob
+ - kv commit
+
+* *P*: Uncompressed partial write to unused region of an existing
+ blob.
+
+ - write to unused chunk(s) of existing blob
+ - kv commit
+
+* *W*: WAL overwrite: commit intent to overwrite, then overwrite
+ async. Must be chunk_size = MAX(block_size, csum_block_size)
+ aligned.
+
+ - kv commit
+ - wal overwrite (chunk-aligned) of existing blob
+
+* *N*: Uncompressed partial write to a new blob. Initially sparsely
+ utilized. Future writes will either be *P* or *W*.
+
+ - write into a new (sparse) blob
+ - kv commit
+
+* *R+W*: Read partial chunk, then to WAL overwrite.
+
+ - read (out to chunk boundaries)
+ - kv commit
+ - wal overwrite (chunk-aligned) of existing blob
+
+* *C*: Compress data, write to new blob.
+
+ - compress and write to new blob
+ - kv commit
+
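+As a rough illustration only (this is not Ceph code), choosing among the
+uncompressed strategies above for a small write could be sketched like this,
+assuming we already know whether a new blob is needed, whether the target
+range is unused, and whether the write is chunk-aligned::
+
+    def pick_small_write_strategy(new_blob_needed, range_unused, chunk_aligned):
+        """Illustrative only: map a small, uncompressed write onto a strategy letter."""
+        if new_blob_needed:
+            return 'N'    # sparse partial write into a new blob
+        if range_unused:
+            return 'P'    # write into unused chunk(s) of an existing blob
+        if chunk_aligned:
+            return 'W'    # commit intent, then WAL overwrite
+        return 'R+W'      # read out to chunk boundaries, then WAL overwrite
+
+    assert pick_small_write_strategy(False, False, False) == 'R+W'
+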
+Possible future modes
+---------------------
+
+* *F*: Fragment lextent space by writing small piece of data into a
+ piecemeal blob (that collects random, noncontiguous bits of data we
+ need to write).
+
+ - write to a piecemeal blob (min_alloc_size or larger, but we use just one block of it)
+ - kv commit
+
+* *X*: WAL read/modify/write on a single block (like legacy
+ bluestore). No checksum.
+
+ - kv commit
+ - wal read/modify/write
+
+Mapping
+-------
+
+This very roughly maps the type of write onto what we do when we
+encounter a given blob. In practice it's a bit more complicated since there
+might be several blobs to consider (e.g., we might be able to *W* into one or
+*P* into another), but it should communicate a rough idea of strategy.
+
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| | raw | raw (cached) | csum (4 KB) | csum (16 KB) | comp (128 KB) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 128+ KB (over)write | U | U | U | U | C |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 64 KB (over)write | U | U | U | U | U or C |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 4 KB overwrite | W | P | W | P | W | P | R+W | P | N (F?) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte overwrite | R+W | P | W | P | R+W | P | R+W | P | N (F?) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte append | R+W | P | W | P | R+W | P | R+W | P | N (F?) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 4 KB clone overwrite | P | N | P | N | P | N | P | N | N (F?) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte clone overwrite | P | N | P | N | P | N | P | N | N (F?) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
diff --git a/src/ceph/doc/dev/cache-pool.rst b/src/ceph/doc/dev/cache-pool.rst
new file mode 100644
index 0000000..7dc71c8
--- /dev/null
+++ b/src/ceph/doc/dev/cache-pool.rst
@@ -0,0 +1,200 @@
+Cache pool
+==========
+
+Purpose
+-------
+
+Use a pool of fast storage devices (probably SSDs) and use it as a
+cache for an existing slower and larger pool.
+
+Use a replicated pool as a front-end to service most I/O, and destage
+cold data to a separate erasure coded pool that does not currently (and
+cannot efficiently) handle the workload.
+
+We should be able to create and add a cache pool to an existing pool
+of data, and later remove it, without disrupting service or migrating
+data around.
+
+Use cases
+---------
+
+Read-write pool, writeback
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have an existing data pool and put a fast cache pool "in front" of
+it. Writes will go to the cache pool and immediately ack. We flush
+them back to the data pool based on the defined policy.
+
+Read-only pool, weak consistency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have an existing data pool and add one or more read-only cache
+pools. We copy data to the cache pool(s) on read. Writes are
+forwarded to the original data pool. Stale data is expired from the
+cache pools based on the defined policy.
+
+This is likely only useful for specific applications with specific
+data access patterns. It may be a match for rgw, for example.
+
+
+Interface
+---------
+
+Set up a read/write cache pool foo-hot for pool foo::
+
+ ceph osd tier add foo foo-hot
+ ceph osd tier cache-mode foo-hot writeback
+
+Direct all traffic for foo to foo-hot::
+
+ ceph osd tier set-overlay foo foo-hot
+
+Set the target size and enable the tiering agent for foo-hot::
+
+ ceph osd pool set foo-hot hit_set_type bloom
+ ceph osd pool set foo-hot hit_set_count 1
+ ceph osd pool set foo-hot hit_set_period 3600 # 1 hour
+ ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
+ ceph osd pool set foo-hot min_read_recency_for_promote 1
+ ceph osd pool set foo-hot min_write_recency_for_promote 1
+
+Drain the cache in preparation for turning it off::
+
+ ceph osd tier cache-mode foo-hot forward
+ rados -p foo-hot cache-flush-evict-all
+
+When cache pool is finally empty, disable it::
+
+ ceph osd tier remove-overlay foo
+ ceph osd tier remove foo foo-hot
+
+Read-only pools with lazy consistency::
+
+ ceph osd tier add foo foo-east
+ ceph osd tier cache-mode foo-east readonly
+ ceph osd tier add foo foo-west
+ ceph osd tier cache-mode foo-west readonly
+
+
+
+Tiering agent
+-------------
+
+The tiering policy is defined as properties on the cache pool itself.
+
+HitSet metadata
+~~~~~~~~~~~~~~~
+
+First, the agent requires HitSet information to be tracked on the
+cache pool in order to determine which objects in the pool are being
+accessed. This is enabled with::
+
+ ceph osd pool set foo-hot hit_set_type bloom
+ ceph osd pool set foo-hot hit_set_count 1
+ ceph osd pool set foo-hot hit_set_period 3600 # 1 hour
+
+The supported HitSet types include 'bloom' (a bloom filter, the
+default), 'explicit_hash', and 'explicit_object'. The latter two
+explicitly enumerate accessed objects and are less memory efficient.
+They are there primarily for debugging and to demonstrate pluggability
+for the infrastructure. For the bloom filter type, you can additionally
+define the false positive probability for the bloom filter (default is 0.05)::
+
+ ceph osd pool set foo-hot hit_set_fpp 0.15
+
+The hit_set_count and hit_set_period define how much time each HitSet
+should cover, and how many such HitSets to store. Binning accesses
+over time allows Ceph to independently determine whether an object was
+accessed at least once and whether it was accessed more than once over
+some time period ("age" vs "temperature").
+
+The ``min_read_recency_for_promote`` defines how many HitSets to check for the
+existence of an object when handling a read operation. The checking result is
+used to decide whether to promote the object asynchronously. Its value should be
+between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted.
+If it's set to 1, only the current HitSet is checked, and the object is promoted if
+it is found there. For larger values, the object is promoted if it is found in any
+of the most recent ``min_read_recency_for_promote`` HitSets.
+
+A similar parameter can be set for the write operation, which is
+``min_write_recency_for_promote``. ::
+
+ ceph osd pool set {cachepool} min_read_recency_for_promote 1
+ ceph osd pool set {cachepool} min_write_recency_for_promote 1
+
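+As a simplified sketch of the promotion check described above (not the actual
+OSD implementation)::
+
+    def should_promote(obj, hit_sets, recency):
+        """hit_sets is ordered newest first; a recency of 0 always promotes."""
+        if recency == 0:
+            return True
+        return any(obj in hs for hs in hit_sets[:recency])
+
+    hit_sets = [{'obj.1'}, {'obj.2'}, set()]          # newest HitSet first
+    print(should_promote('obj.2', hit_sets, 1))       # False: not in the current HitSet
+    print(should_promote('obj.2', hit_sets, 2))       # True: found in an archived HitSet
+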
+Note that the longer the ``hit_set_period`` and the higher the
+``min_read_recency_for_promote``/``min_write_recency_for_promote`` the more RAM
+will be consumed by the ceph-osd process. In particular, when the agent is active
+to flush or evict cache objects, all hit_set_count HitSets are loaded into RAM.
+
+Cache mode
+~~~~~~~~~~
+
+The most important policy is the cache mode::
+
+  ceph osd tier cache-mode foo-hot writeback
+
+The supported modes are 'none', 'writeback', 'forward', and
+'readonly'. Most installations want 'writeback', which will write
+into the cache tier and only later flush updates back to the base
+tier. Similarly, any object that is read will be promoted into the
+cache tier.
+
+The 'forward' mode is intended for when the cache is being disabled
+and needs to be drained. No new objects will be promoted or written
+to the cache pool unless they are already present. A background
+operation can then do something like::
+
+ rados -p foo-hot cache-try-flush-evict-all
+ rados -p foo-hot cache-flush-evict-all
+
+to force all data to be flushed back to the base tier.
+
+The 'readonly' mode is intended for read-only workloads that do not
+require consistency to be enforced by the storage system. Writes will
+be forwarded to the base tier, but objects that are read will get
+promoted to the cache. No attempt is made by Ceph to ensure that the
+contents of the cache tier(s) are consistent in the presence of object
+updates.
+
+Cache sizing
+~~~~~~~~~~~~
+
+The agent performs two basic functions: flushing (writing 'dirty'
+cache objects back to the base tier) and evicting (removing cold and
+clean objects from the cache).
+
+The thresholds at which Ceph will flush or evict objects is specified
+relative to a 'target size' of the pool. For example::
+
+ ceph osd pool set foo-hot cache_target_dirty_ratio .4
+ ceph osd pool set foo-hot cache_target_dirty_high_ratio .6
+ ceph osd pool set foo-hot cache_target_full_ratio .8
+
+will begin flushing dirty objects when 40% of the pool is dirty and begin
+evicting clean objects when we reach 80% of the target size.
+
+The target size can be specified either in terms of objects or bytes::
+
+ ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
+ ceph osd pool set foo-hot target_max_objects 1000000 # 1 million objects
+
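+As a rough worked example (the actual accounting happens per PG inside the
+OSD), the example values above translate into approximately these thresholds::
+
+    target_max_bytes = 1000000000000                  # 1 TB target size
+    dirty_ratio, dirty_high_ratio, full_ratio = 0.4, 0.6, 0.8
+
+    print('start flushing at     ~%d bytes dirty' % (target_max_bytes * dirty_ratio))
+    print('flush aggressively at ~%d bytes dirty' % (target_max_bytes * dirty_high_ratio))
+    print('start evicting at     ~%d bytes used' % (target_max_bytes * full_ratio))
+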
+Note that if both limits are specified, Ceph will begin flushing or
+evicting when either threshold is triggered.
+
+Other tunables
+~~~~~~~~~~~~~~
+
+You can specify a minimum object age before a recently updated object is
+flushed to the base tier::
+
+ ceph osd pool set foo-hot cache_min_flush_age 600 # 10 minutes
+
+You can specify the minimum age of an object before it will be evicted from
+the cache tier::
+
+ ceph osd pool set foo-hot cache_min_evict_age 1800 # 30 minutes
+
+
+
diff --git a/src/ceph/doc/dev/ceph-disk.rst b/src/ceph/doc/dev/ceph-disk.rst
new file mode 100644
index 0000000..a4008aa
--- /dev/null
+++ b/src/ceph/doc/dev/ceph-disk.rst
@@ -0,0 +1,61 @@
+=========
+ceph-disk
+=========
+
+
+device-mapper crypt
+===================
+
+Settings
+--------
+
+``osd_dmcrypt_type``
+
+:Description: this option specifies the mode in which ``cryptsetup`` works. It can be ``luks`` or ``plain``. It kicks in only if the ``--dmcrypt`` option is passed to ``ceph-disk``. See also `cryptsetup document <https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt#configuration-using-cryptsetup>`_ for more details.
+
+:Type: String
+:Default: ``luks``
+
+
+``osd_dmcrypt_key_size``
+
+:Description: the size of the random string in bytes used as the LUKS key. The string is read from ``/dev/urandom`` and then encoded using base64. It will be stored with the key of ``dm-crypt/osd/$uuid/luks`` using config-key.
+
+:Type: String
+:Default: 1024 if ``osd_dmcrypt_type`` is ``luks``, 256 otherwise.
+
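+For illustration only (``ceph-disk`` does this internally), generating such a
+key boils down to something like::
+
+    import base64
+
+    key_size = 1024                                   # osd_dmcrypt_key_size for luks
+    with open('/dev/urandom', 'rb') as urandom:
+        key = base64.b64encode(urandom.read(key_size))
+    # the resulting string is what ends up stored under dm-crypt/osd/$uuid/luks
+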
+lockbox
+-------
+
+``ceph-disk`` supports dmcrypt (device-mapper crypt). If dmcrypt is enabled, the partitions will be encrypted using this machinery. For each OSD device, a lockbox is introduced to hold the information regarding how the dmcrypt key is stored. To prepare a lockbox, ``ceph-disk``
+
+#. creates a dedicated lockbox partition on device, and
+#. populates it with a tiny filesystem, then
+#. automounts it read-only at ``/var/lib/ceph/osd-lockbox/$uuid``, where ``uuid`` is the lockbox's uuid.
+
+Under that mount point, settings are stored as plain files:
+
+- key-management-mode: ``ceph-mon v1``
+- osd-uuid: the OSD's uuid
+- ceph_fsid: the fsid of the cluster
+- keyring: the lockbox's keyring, allowing one to fetch the LUKS key
+- block_uuid: the partition uuid for the block device
+- journal_uuid: the partition uuid for the journal device
+- block.db_uuid: the partition uuid for the block.db device
+- block.wal_uuid: the partition uuid for the block.wal device
+- magic: a magic string indicating that this partition is a lockbox. It's not used currently.
+- ``${space_uuid}``: symbolic links named after the uuid of space partitions pointing to ``/var/lib/ceph/osd-lockbox/$uuid``. In the case of FileStore, the space partitions are the ``data`` and ``journal`` partitions; for BlueStore, they are ``data``, ``block.db`` and ``block.wal``.
+
+Currently, ``ceph-mon v1`` is the only supported key-management-mode. In that case, the LUKS key is stored using the config-key in the monitor store with the key of ``dm-crypt/osd/$uuid/luks``.
+
+
+partitions
+==========
+
+``ceph-disk`` creates partitions for preparing a device for OSD deployment. Their partition numbers are hardcoded. For instance, the data partition's number is always *1*:
+
+1. data partition
+2. journal partition, if co-located with data
+3. block.db for BlueStore, if co-located with data
+4. block.wal for BlueStore, if co-located with data
+5. lockbox
diff --git a/src/ceph/doc/dev/ceph-volume/index.rst b/src/ceph/doc/dev/ceph-volume/index.rst
new file mode 100644
index 0000000..b6f9dc0
--- /dev/null
+++ b/src/ceph/doc/dev/ceph-volume/index.rst
@@ -0,0 +1,13 @@
+===================================
+ceph-volume developer documentation
+===================================
+
+.. rubric:: Contents
+
+.. toctree::
+ :maxdepth: 1
+
+
+ plugins
+ lvm
+ systemd
diff --git a/src/ceph/doc/dev/ceph-volume/lvm.rst b/src/ceph/doc/dev/ceph-volume/lvm.rst
new file mode 100644
index 0000000..f89424a
--- /dev/null
+++ b/src/ceph/doc/dev/ceph-volume/lvm.rst
@@ -0,0 +1,127 @@
+
+.. _ceph-volume-lvm-api:
+
+LVM
+===
+The backend of ``ceph-volume lvm`` is LVM itself. It relies heavily on the usage
+of tags, which are LVM's way of allowing its volume metadata to be extended. These
+values can later be queried against devices, which is how they get discovered
+later.
+
+.. warning:: These APIs are not meant to be public, but are documented so that
+ it is clear what the tool is doing behind the scenes. Do not alter
+ any of these values.
+
+
+.. _ceph-volume-lvm-tag-api:
+
+Tag API
+-------
+The process of identifying logical volumes as part of Ceph relies on applying
+tags on all volumes. It follows a naming convention for the namespace that
+looks like::
+
+ ceph.<tag name>=<tag value>
+
+All tags are prefixed by the ``ceph`` keyword to claim ownership of that
+namespace and make it easily identifiable. This is how the OSD ID would be used
+in the context of lvm tags::
+
+ ceph.osd_id=0
+
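+
+For example, the comma-separated tag string reported by ``lvs -o lv_tags`` could
+be parsed back into a dictionary with a sketch like the following (illustrative
+only, not part of ``ceph-volume``)::
+
+    def parse_ceph_tags(lv_tags):
+        """Turn 'ceph.osd_id=0,ceph.type=osd' into {'ceph.osd_id': '0', 'ceph.type': 'osd'}."""
+        tags = {}
+        for tag in lv_tags.split(','):
+            if tag.startswith('ceph.'):
+                key, _, value = tag.partition('=')
+                tags[key] = value
+        return tags
+
+    print(parse_ceph_tags('ceph.osd_id=0,ceph.type=osd'))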
+
+.. _ceph-volume-lvm-tags:
+
+Metadata
+--------
+The following describes all the metadata from Ceph OSDs that is stored on an
+LVM volume:
+
+
+``type``
+--------
+Describes whether the device is an OSD or a journal, with the ability to expand to
+other types when supported (for example, a lockbox).
+
+Example::
+
+ ceph.type=osd
+
+
+``cluster_fsid``
+----------------
+Example::
+
+ ceph.cluster_fsid=7146B649-AE00-4157-9F5D-1DBFF1D52C26
+
+``data_device``
+---------------
+Example::
+
+ ceph.data_device=/dev/ceph/data-0
+
+``journal_device``
+------------------
+Example::
+
+ ceph.journal_device=/dev/ceph/journal-0
+
+``encrypted``
+-------------
+Example for enabled encryption with ``luks``::
+
+ ceph.encrypted=luks
+
+For plain dmcrypt::
+
+ ceph.encrypted=dmcrypt
+
+For disabled encryption::
+
+ ceph.encrypted=0
+
+``osd_fsid``
+------------
+Example::
+
+ ceph.osd_fsid=88ab9018-f84b-4d62-90b4-ce7c076728ff
+
+``osd_id``
+----------
+Example::
+
+ ceph.osd_id=1
+
+``block``
+---------
+Just used on :term:`bluestore` backends.
+
+Example::
+
+ ceph.block=/dev/mapper/vg-block-0
+
+``db``
+------
+Just used on :term:`bluestore` backends.
+
+Example::
+
+ ceph.db=/dev/mapper/vg-db-0
+
+``wal``
+-------
+Just used on :term:`bluestore` backends.
+
+Example::
+
+ ceph.wal=/dev/mapper/vg-wal-0
+
+
+``lockbox_device``
+------------------
+Only used when encryption is enabled, to store keys in an unencrypted
+volume.
+
+Example::
+
+ ceph.lockbox_device=/dev/mapper/vg-lockbox-0
diff --git a/src/ceph/doc/dev/ceph-volume/plugins.rst b/src/ceph/doc/dev/ceph-volume/plugins.rst
new file mode 100644
index 0000000..95bc761
--- /dev/null
+++ b/src/ceph/doc/dev/ceph-volume/plugins.rst
@@ -0,0 +1,65 @@
+.. _ceph-volume-plugins:
+
+Plugins
+=======
+``ceph-volume`` started initially to provide support for using ``lvm`` as
+the underlying system for an OSD. It is included as part of the tool but it is
+treated like a plugin.
+
+This modularity allows other device or device-like technologies to consume and
+re-use the utilities and workflows provided.
+
+Adding Plugins
+--------------
+As a Python tool, plugins are registered via ``setuptools`` entry points. For a new plugin to be
+available, it should have an entry similar to this in its ``setup.py`` file:
+
+.. code-block:: python
+
+ setup(
+ ...
+ entry_points = dict(
+ ceph_volume_handlers = [
+ 'my_command = my_package.my_module:MyClass',
+ ],
+ ),
+
+``MyClass`` should be a class that accepts ``sys.argv`` as its argument;
+``ceph-volume`` will pass that in at instantiation and call its ``main``
+method.
+
+This is how a plugin for ``ZFS`` could look, for example:
+
+.. code-block:: python
+
+    import argparse
+
+
+    class ZFS(object):
+
+ help_menu = 'Deploy OSDs with ZFS'
+ _help = """
+ Use ZFS as the underlying technology for OSDs
+
+ --verbose Increase the verbosity level
+ """
+
+ def __init__(self, argv):
+ self.argv = argv
+
+ def main(self):
+ parser = argparse.ArgumentParser()
+ args = parser.parse_args(self.argv)
+ ...
+
+And its entry point (via ``setuptools``) in ``setup.py`` would look like:
+
+.. code-block:: python
+
+ entry_points = {
+ 'ceph_volume_handlers': [
+ 'zfs = ceph_volume_zfs.zfs:ZFS',
+ ],
+ },
+
+After installation, the ``zfs`` subcommand would be listed and could be used
+as::
+
+ ceph-volume zfs
diff --git a/src/ceph/doc/dev/ceph-volume/systemd.rst b/src/ceph/doc/dev/ceph-volume/systemd.rst
new file mode 100644
index 0000000..8553430
--- /dev/null
+++ b/src/ceph/doc/dev/ceph-volume/systemd.rst
@@ -0,0 +1,37 @@
+.. _ceph-volume-systemd-api:
+
+systemd
+=======
+The workflow to *"activate"* an OSD is by relying on systemd unit files and its
+ability to persist information as a suffix to the instance name.
+
+``ceph-volume`` exposes the following convention for unit files::
+
+ ceph-volume@<sub command>-<extra metadata>
+
+For example, this is how enabling an OSD could look for the
+:ref:`ceph-volume-lvm` sub command::
+
+ systemctl enable ceph-volume@lvm-0-8715BEB4-15C5-49DE-BA6F-401086EC7B41
+
+
+These 3 pieces of persisted information are needed by the sub-command so that
+it understands what OSD it needs to activate.
+
+Since ``lvm`` is not the only subcommand that will be supported, this
+is how it will allow other device types to be defined.
+
+At some point for example, for plain disks, it could be::
+
+ systemctl enable ceph-volume@disk-0-8715BEB4-15C5-49DE-BA6F-401086EC7B41
+
+At startup, the systemd unit will execute a helper script that will parse the
+suffix and will end up calling ``ceph-volume`` back. Using the previous
+example for lvm, that call will look like::
+
+ ceph-volume lvm activate 0 8715BEB4-15C5-49DE-BA6F-401086EC7B41
+
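+A minimal sketch of how that suffix could be split (illustrative only; the real
+helper is part of ``ceph-volume``)::
+
+    def parse_instance_suffix(suffix):
+        """Split 'lvm-0-<uuid>' into (sub_command, osd_id, osd_uuid)."""
+        sub_command, osd_id, osd_uuid = suffix.split('-', 2)
+        return sub_command, osd_id, osd_uuid
+
+    print(parse_instance_suffix('lvm-0-8715BEB4-15C5-49DE-BA6F-401086EC7B41'))
+    # ('lvm', '0', '8715BEB4-15C5-49DE-BA6F-401086EC7B41')
+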
+
+.. warning:: These workflows are not meant to be public, but are documented so that
+ it is clear what the tool is doing behind the scenes. Do not alter
+ any of these values.
diff --git a/src/ceph/doc/dev/cephfs-snapshots.rst b/src/ceph/doc/dev/cephfs-snapshots.rst
new file mode 100644
index 0000000..7954f66
--- /dev/null
+++ b/src/ceph/doc/dev/cephfs-snapshots.rst
@@ -0,0 +1,114 @@
+CephFS Snapshots
+================
+
+CephFS supports snapshots, generally created by invoking mkdir against the
+(hidden, special) .snap directory.
+
+Overview
+-----------
+
+Generally, snapshots do what they sound like: they create an immutable view
+of the filesystem at the point in time they're taken. There are some headline
+features that make CephFS snapshots different from what you might expect:
+
+* Arbitrary subtrees. Snapshots are created within any directory you choose,
+ and cover all data in the filesystem under that directory.
+* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
+ including from other clients. As a result, "creating" the snapshot is
+ very fast.
+
+Important Data Structures
+-------------------------
+* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
+ point in the hierarchy (or, when a snapshotted inode is moved outside of its
+ parent snapshot). SnapRealms contain an `sr_t srnode`, links to `past_parents`
+ and `past_children`, and all `inodes_with_caps` that are part of the snapshot.
+ Clients also have a SnapRealm concept that maintains less data but is used to
+ associate a `SnapContext` with each open file for writing.
+* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
+ directory and contains sequence counters, timestamps, the list of associated
+ snapshot IDs, and `past_parents`.
+* snaplink_t: `past_parents` et al are stored on-disk as a `snaplink_t`, holding
+ the inode number and first `snapid` of the inode/snapshot referenced.
+
+Creating a snapshot
+-------------------
+To make a snapshot on directory "/1/2/3/foo", the client invokes "mkdir" on
+"/1/2/3/foo/.snaps" directory. This is transmitted to the MDS Server as a
+CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and initially handled in
+Server::handle_client_mksnap(). It allocates a `snapid` from the `SnapServer`,
+projects a new inode with the new SnapRealm, and commits it to the MDLog as
+usual. When committed, it invokes
+`MDCache::do_realm_invalidate_and_update_notify()`, which triggers most of the
+real work of the snapshot.
+
+If there were already snapshots above directory "foo" (rooted at "/1", say),
+the new SnapRealm adds its most immediate ancestor as a `past_parent` on
+creation. After committing to the MDLog, all clients with caps on files in
+"/1/2/3/foo/" are notified (MDCache::send_snaps()) of the new SnapRealm, and
+update the `SnapContext` they are using with that data. Note that this
+*is not* a synchronous part of the snapshot creation!
+
+Updating a snapshot
+-------------------
+If you delete a snapshot, or move data out of the parent snapshot's hierarchy,
+a similar process is followed. Extra code paths check to see if we can break
+the `past_parent` links between SnapRealms, or eliminate them entirely.
+
+Generating a SnapContext
+------------------------
+A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
+the snapshot IDs that an object is already part of. To generate that list, we
+generate a list of all `snapids` associated with the SnapRealm and all its
+`past_parents`.
+
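+Conceptually (a simplified sketch, not the MDS implementation), that amounts to
+taking the union of the realm's own snapids and those of its ``past_parents``::
+
+    def build_snap_context(realm):
+        """realm: {'seq': int, 'snaps': [snapids], 'past_parents': [realms]}."""
+        snaps = set(realm['snaps'])
+        for parent in realm['past_parents']:
+            snaps.update(build_snap_context(parent)[1])
+        # RADOS expects the snapids in descending order
+        return realm['seq'], sorted(snaps, reverse=True)
+
+    parent = {'seq': 4, 'snaps': [2, 4], 'past_parents': []}
+    child = {'seq': 6, 'snaps': [6], 'past_parents': [parent]}
+    print(build_snap_context(child))                  # (6, [6, 4, 2])
+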
+Storing snapshot data
+---------------------
+File data is stored in RADOS "self-managed" snapshots. Clients are careful to
+use the correct `SnapContext` when writing file data to the OSDs.
+
+Storing snapshot metadata
+-------------------------
+Snapshotted dentries (and their inodes) are stored in-line as part of the
+directory they were in at the time of the snapshot. *All dentries* include a
+`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
+will have their `last` set to CEPH_NOSNAP).
+
+Snapshot writeback
+------------------
+There is a great deal of code to handle writeback efficiently. When a Client
+receives an `MClientSnap` message, it updates the local `SnapRealm`
+representation and its links to specific `Inodes`, and generates a `CapSnap`
+for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
+and if there is dirty data the `CapSnap` is used to block fresh data writes
+until the snapshot is completely flushed to the OSDs.
+
+In the MDS, we generate snapshot-representing dentries as part of the regular
+process for flushing them. Dentries with outstanding `CapSnap` data are kept
+pinned and in the journal.
+
+Deleting snapshots
+------------------
+Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
+rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
+you must delete the snapshots first.) Once deleted, they are entered into the
+`OSDMap` list of deleted snapshots and the file data is removed by the OSDs.
+Metadata is cleaned up as the directory objects are read in and written back
+out again.
+
+Hard links
+----------
+Hard links do not interact well with snapshots. A file is snapshotted when its
+primary link is part of a SnapRealm; other links *will not* preserve data.
+Generally the location where a file was first created will be its primary link,
+but if the original link has been deleted it is not easy (nor always
+deterministic) to find which link is now the primary.
+
+Multi-FS
+---------
+Snapshots and multiple filesystems don't interact well. Specifically, each
+MDS cluster allocates `snapids` independently; if you have multiple filesystems
+sharing a single pool (via namespaces), their snapshots *will* collide and
+deleting one will result in missing file data for others. (This may even be
+invisible, not throwing errors to the user.) If each FS gets its own
+pool things probably work, but this isn't tested and may not be true.
diff --git a/src/ceph/doc/dev/cephx_protocol.rst b/src/ceph/doc/dev/cephx_protocol.rst
new file mode 100644
index 0000000..45c7440
--- /dev/null
+++ b/src/ceph/doc/dev/cephx_protocol.rst
@@ -0,0 +1,335 @@
+============================================================
+A Detailed Description of the Cephx Authentication Protocol
+============================================================
+Peter Reiher
+7/13/12
+
+This document provides deeper detail on the Cephx authorization protocol whose high level flow
+is described in the memo by Yehuda (12/19/09). Because this memo discusses details of
+routines called and variables used, it represents a snapshot. The code might be changed
+subsequent to the creation of this document, and the document is not likely to be updated in
+lockstep. With luck, code comments will indicate major changes in the way the protocol is
+implemented.
+
+Introduction
+-------------
+
+The basic idea of the protocol is based on Kerberos. A client wishes to obtain something from
+a server. The server will only offer the requested service to authorized clients. Rather
+than requiring each server to deal with authentication and authorization issues, the system
+uses an authorization server. Thus, the client must first communicate with the authorization
+server to authenticate itself and to obtain credentials that will grant it access to the
+service it wants.
+
+Authorization is not the same as authentication. Authentication provides evidence that some
+party is who it claims to be. Authorization provides evidence that a particular party is
+allowed to do something. Generally, secure authorization implies secure authentication
+(since without authentication, you may authorize something for an imposter), but the reverse
+is not necessarily true. One can authenticate without authorizing. The purpose
+of this protocol is to authorize.
+
+The basic approach is to use symmetric cryptography throughout. Each client C has its own
+secret key, known only to itself and the authorization server A. Each server S has its own
+secret key, known only to itself and the authorization server A. Authorization information
+will be passed in tickets, encrypted with the secret key of the entity that offers the service.
+There will be a ticket that A gives to C, which permits C to ask A for other tickets. This
+ticket will be encrypted with A's key, since A is the one who needs to check it. There will
+later be tickets that A issues that allow C to communicate with S to ask for service. These
+tickets will be encrypted with S's key, since S needs to check them. Since we wish to provide
+security of the communications, as well, session keys are set up along with the tickets.
+Currently, those session keys are only used for authentication purposes during this protocol
+and the handshake between the client C and the server S, when the client provides its service
+ticket. They could be used for authentication or secrecy throughout, with some changes to
+the system.
+
+Several parties need to prove something to each other if this protocol is to achieve its
+desired security effects.
+
+1. The client C must prove to the authenticator A that it really is C. Since everything
+is being done via messages, the client must also prove that the message proving authenticity
+is fresh, and is not being replayed by an attacker.
+
+2. The authenticator A must prove to client C that it really is the authenticator. Again,
+proof that replay is not occurring is also required.
+
+3. A and C must securely share a session key to be used for distribution of later
+authorization material between them. Again, no replay is allowable, and the key must be
+known only to A and C.
+
+4. A must receive evidence from C that allows A to look up C's authorized operations with
+server S.
+
+5. C must receive a ticket from A that will prove to S that C can perform its authorized
+operations. This ticket must be usable only by C.
+
+6. C must receive from A a session key to protect the communications between C and S. The
+session key must be fresh and not the result of a replay.
+
+Getting Started With Authorization
+-----------------------------------
+
+When the client first needs to get service, it contacts the monitor. At the moment, it has
+no tickets. Therefore, it uses the "unknown" protocol to talk to the monitor. This protocol
+is specified as ``CEPH_AUTH_UNKNOWN``. The monitor also takes on the authentication server
+role, A. The remainder of the communications will use the cephx protocol (most of whose code
+will be found in files in ``auth/cephx``). This protocol is responsible for creating and
+communicating the tickets spoken of above.
+
+Currently, this document does not follow the pre-cephx protocol flow. It starts up at the
+point where the client has contacted the server and is ready to start the cephx protocol itself.
+
+Once we are in the cephx protocol, we can get the tickets. First, C needs a ticket that
+allows secure communications with A. This ticket can then be used to obtain other tickets.
+This is phase I of the protocol, and consists of a send from C to A and a response from A to C.
+Then, C needs a ticket to allow it to talk to S to get services. This is phase II of the
+protocol, and consists of a send from C to A and a response from A to C.
+
+Phase I:
+--------
+
+The client is set up to know that it needs certain things, using a variable called ``need``,
+which is part of the ``AuthClientHandler`` class, which the ``CephxClientHandler`` inherits
+from. At this point, one thing that's encoded in the ``need`` variable is
+``CEPH_ENTITY_TYPE_AUTH``, indicating that we need to start the authentication protocol
+from scratch. Since we're always talking to the same authorization server, if we've gone
+through this step of the protocol before (and the resulting ticket/session hasn't timed out),
+we can skip this step and just ask for client tickets. But it must be done initially, and
+we'll assume that we are in that state.
+
+The message C sends to A in phase I is built in ``CephxClientHandler::build_request()`` (in
+``auth/cephx/CephxClientHandler.cc``). This routine is used for more than one purpose.
+In this case, we first call ``validate_tickets()`` (from routine
+``CephXTicketManager::validate_tickets()``, which lives in ``auth/cephx/CephxProtocol.h``).
+This code runs through the list of possible tickets to determine what we need, setting values
+in the ``need`` flag as necessary. Then we call ``ticket.get_handler()``. This routine
+(in ``CephxProtocol.h``) finds a ticket of the specified type (a ticket to perform
+authorization) in the ticket map, creates a ticket handler object for it, and puts the
+handler into the right place in the map. Then we hit specialized code to deal with individual
+cases. The case here is when we still need to authenticate to A (the
+``if (need & CEPH_ENTITY_TYPE_AUTH)`` branch).
+
+We now create a message of type ``CEPH_AUTH_UNKNOWN``. We need to authenticate
+this message with C's secret key, so we fetch that from the local key repository. (It's
+called a key server in the code, but it's not really a separate machine or processing entity.
+It's more like the place where locally used keys are kept.) We create a
+random challenge, whose purpose is to prevent replays. We encrypt that challenge. We already
+have a server challenge (a similar set of random bytes, but created by the server and sent to
+the client) from our pre-cephx stage. We take both challenges and our secret key and
+produce a combined encrypted challenge value, which goes into ``req.key``.
+
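+Conceptually (a rough sketch only; the real code uses the configured cipher
+rather than a plain hash), the value placed in ``req.key`` mixes both
+challenges with the shared secret::
+
+    import hashlib
+    import struct
+
+    def combined_challenge(secret_key, server_challenge, client_challenge):
+        """Mix both 64-bit challenges with the shared secret; a replayed value
+        fails because the server challenge is different every time."""
+        blob = struct.pack('<QQ', server_challenge, client_challenge)
+        return hashlib.sha256(secret_key + blob).digest()
+
+    req_key = combined_challenge(b'client-secret', 0x1234, 0x5678)
+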
+If we have an old ticket, we store it in ``req.old_ticket``. We're about to get a new one.
+
+The entire ``req`` structure, including the old ticket and the cryptographic hash of the two
+challenges, gets put into the message. Then we return from this function, and the
+message is sent.
+
+We now switch over to the authenticator side, A. The server receives the message that was
+sent, of type ``CEPH_AUTH_UNKNOWN``. The message gets handled in ``prep_auth()``,
+in ``mon/AuthMonitor.cc``, which calls ``handle_request()`` in ``CephxServiceHandler.cc`` to
+do most of the work. This routine, also, handles multiple cases.
+
+The control flow is determined by the ``request_type`` in the ``cephx_header`` associated
+with the message. Our case here is ``CEPH_AUTH_UNKNOWN``. We need the
+secret key A shares with C, so we call ``get_secret()`` from our local key repository to get
+it. We should have set up a server challenge already with this client, so we make sure
+we really do have one. (This variable is specific to a ``CephxServiceHandler``, so there
+is a different one for each such structure we create, presumably one per client A is
+dealing with.) If there is no challenge, we'll need to start over, since we need to
+check the client's crypto hash, which depends on a server challenge, in part.
+
+We now call the same routine the client used to calculate the hash, based on the same values:
+the client challenge (which is in the incoming message), the server challenge (which we saved),
+and the client's key (which we just obtained). We check to see if the client sent the same
+thing we expected. If so, we know we're talking to the right client. We know the session is
+fresh, because it used the challenge we sent it to calculate its crypto hash. So we can
+give it an authentication ticket.
+
+We fetch C's ``eauth`` structure. This contains an ID, a key, and a set of caps (capabilities).
+
+The client sent us its old ticket in the message, if it had one. If so, we set a flag,
+``should_enc_ticket``, to true and set the global ID to the global ID in that old ticket.
+If the attempt to decode its old ticket fails (most probably because it didn't have one),
+``should_enc_ticket`` remains false. Now we set up the new ticket, filling in timestamps,
+the name of C, the global ID provided in the method call (unless there was an old ticket), and
+its ``auid``, obtained from the ``eauth`` structure obtained above. We need a new session key
+to help the client communicate securely with us, not using its permanent key. We set the
+service ID to ``CEPH_ENTITY_TYPE_AUTH``, which will tell the client C what to do with the
+message we send it. We build a cephx response header and call
+``cephx_build_service_ticket_reply()``.
+
+``cephx_build_service_ticket_reply()`` is in ``auth/cephx/CephxProtocol.cc``. This
+routine will build up the response message. Much of it copies data from its parameters to
+a message structure. Part of that information (the session key and the validity period)
+gets encrypted with C's permanent key. If the ``should_encrypt_ticket`` flag is set,
+encrypt it using the old ticket's key. Otherwise, there was no old ticket key, so the
+new ticket is not encrypted. (It is, of course, already encrypted with A's permanent key.)
+Presumably the point of this second encryption is to expose less material encrypted with
+permanent keys.
+
+Then we call the key server's ``get_service_caps()`` routine on the entity name, with a
+flag ``CEPH_ENTITY_TYPE_MON``, and capabilities, which will be filled in by this routine.
+The use of that constant flag means we're going to get the client's caps for A, not for some
+other data server. The ticket here is to access the authorizer A, not the service S. The
+result of this call is that the caps variable (a parameter to the routine we're in) is
+filled in with the monitor capabilities that will allow C to access A's authorization services.
+
+``handle_request()`` itself does not send the response message. It builds up the
+``result_bl``, which basically holds that message's contents, and the capabilities structure,
+but it doesn't send the message. We go back to ``prep_auth()``, in ``mon/AuthMonitor.cc``,
+for that. This routine does some fiddling around with the caps structure that just got
+filled in. There's a global ID that comes up as a result of this fiddling that is put into
+the reply message. The reply message is built here (mostly from the ``response_bl`` buffer)
+and sent off.
+
+This completes Phase I of the protocol. At this point, C has authenticated itself to A, and A has generated a new session key and ticket allowing C to obtain server tickets from A.
+
+Phase II
+--------
+
+This phase starts when C receives the message from A containing a new ticket and session key.
+The goal of this phase is to provide C with a session key and ticket allowing it to
+communicate with S.
+
+The message A sent to C is dispatched to ``build_request()`` in ``CephxClientHandler.cc``,
+the same routine that was used early in Phase I to build the first message in the protocol.
+This time, when ``validate_tickets()`` is called, the ``need`` variable will not contain
+``CEPH_ENTITY_TYPE_AUTH``, so a different branch through the bulk of the routine will be
+used. This is the branch indicated by ``if (need)``. We have a ticket for the authorizer,
+but we still need service tickets.
+
+We must send another message to A to obtain the tickets (and session key) for the server
+S. We set the ``request_type`` of the message to ``CEPHX_GET_PRINCIPAL_SESSION_KEY`` and
+call ``ticket_handler.build_authorizer()`` to obtain an authorizer. This routine is in
+``CephxProtocol.cc``. We set the key for this authorizer to be the session key we just got
+from A, and create a new nonce. We put the global ID, the service ID, and the ticket into a
+message buffer that is part of the authorizer. Then we create a new ``CephXAuthorize``
+structure. The nonce we just created goes there. We encrypt this ``CephXAuthorize``
+structure with the current session key and stuff it into the authorizer's buffer. We
+return the authorizer.
+
+Back in ``build_request()``, we take the part of the authorizer that was just built (its
+buffer, not the session key or anything else) and shove it into the buffer we're creating
+for the message that will go to A. Then we delete the authorizer. We put the requirements
+for what we want in ``req.keys``, and we put ``req`` into the buffer. Then we return, and
+the message gets sent.
+
+The authorizer A receives this message which is of type ``CEPHX_GET_PRINCIPAL_SESSION_KEY``.
+The message gets handled in ``prep_auth()``, in ``mon/AuthMonitor.cc``, which again calls
+``handle_request()`` in ``CephxServiceHandler.cc`` to do most of the work.
+
+In this case, ``handle_request()`` will take the ``CEPHX_GET_PRINCIPAL_SESSION_KEY`` case.
+It will call ``cephx_verify_authorizer()`` in ``CephxProtocol.cc``. Here, we will grab
+a bunch of data out of the input buffer, including the global and service IDs and the ticket
+for A. The ticket contains a ``secret_id``, indicating which key is being used for it.
+If the secret ID pulled out of the ticket was -1, the ticket does not specify which secret
+key A should use. In this case, A should use the key for the specific entity that C wants
+to contact, rather than a rotating key shared by all server entities of the same type.
+To get that key, A must consult the key repository to find the right key. Otherwise,
+there's already a structure obtained from the key repository to hold the necessary secret.
+Server secrets rotate on a time expiration basis (key rotation is not covered in this
+document), so run through that structure to find its current secret. Either way, A now
+knows the secret key used to create this ticket. Now decrypt the encrypted part of the
+ticket, using this key. It should be a ticket for A.
+
+The ticket also contains a session key that C should have used to encrypt other parts of
+this message. Use that session key to decrypt the rest of the message.
+
+Create a ``CephXAuthorizeReply`` to hold our reply. Extract the nonce (which was in the stuff
+we just decrypted), add 1 to it, and put the result in the reply. Encrypt the reply and
+put it in the buffer provided in the call to ``cephx_verify_authorizer()`` and return
+to ``handle_request()``. This will be used to prove to C that A (rather than an attacker)
+created this response.
+
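+The check itself is conceptually tiny (a sketch, assuming both sides already
+share the session key used to protect the reply)::
+
+    def verify_authorizer_reply(sent_nonce, reply_nonce):
+        """The server proves it decrypted our authorizer by returning nonce + 1."""
+        return reply_nonce == sent_nonce + 1
+
+    assert verify_authorizer_reply(41, 42)
+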
+Having verified that the message is valid and from C, now we need to build it a ticket for S.
+We need to know what S it wants to communicate with and what services it wants. Pull the
+ticket request that describes those things out of its message. Now run through the ticket
+request to see what it wanted. (It could potentially be asking for multiple different
+services in the same request, but we will assume it's just one, for this discussion.) Once we
+know which service ID it's after, call ``build_session_auth_info()``.
+
+``build_session_auth_info()`` is in ``CephxKeyServer.cc``. It checks to see if the
+secret for the ``service_ID`` of S is available and puts it into the subfield of one of
+the parameters, and calls the similarly named ``_build_session_auth_info()``, located in
+the same file. This routine loads up the new ``auth_info`` structure with the
+ID of S, a ticket, and some timestamps for that ticket. It generates a new session key
+and puts it in the structure. It then calls ``get_caps()`` to fill in the
+``info.ticket`` caps field. ``get_caps()`` is also in ``CephxKeyServer.cc``. It fills the
+``caps_info`` structure it is provided with caps for S allowed to C.
+
+Once ``build_session_auth_info()`` returns, A has a list of the capabilities allowed to
+C for S. We put a validity period based on the current TTL for this context into the info
+structure, and put it into the ``info_vec`` structure we are preparing in response to the
+message.
+
+Now call ``build_cephx_response_header()``, also in ``CephxServiceHandler.cc``. Fill in
+the ``request_type``, which is ``CEPHX_GET_PRINCIPAL_SESSION_KEY``, a status of 0,
+and the result buffer.
+
+Now call ``cephx_build_service_ticket_reply()``, which is in ``CephxProtocol.cc``. The
+same routine was used towards the end of A's handling of its response in phase I. Here,
+the session key (now a session key to talk to S, not A) and the validity period for that
+key will be encrypted with the existing session key shared between C and A.
+The ``should_encrypt_ticket`` parameter is false here, and no key is provided for that
+encryption. The ticket in question, destined for S once C sends it there, is already
+encrypted with S's secret. So, essentially, this routine will put ID information,
+the encrypted session key, and the ticket allowing C to talk to S into the buffer to
+be sent to C.
+
+After this routine returns, we exit from ``handle_request()``, going back to ``prep_auth()``
+and ultimately to the underlying message send code.
+
+The client receives this message. The nonce is checked as the message passes through
+``Pipe::connect()``, which is in ``msg/SimpleMessenger.cc``. In a lengthy ``while(1)`` loop in
+the middle of this routine, it gets an authorizer. If the get was successful, eventually
+it will call ``verify_reply()``, which checks the nonce. ``connect()`` never explicitly
+checks to see if it got an authorizer, which would suggest that failure to provide an
+authorizer would allow an attacker to skip checking of the nonce. However, in many places,
+if there is no authorizer, important connection fields will get set to zero, which will
+ultimately cause the connection to fail to provide data. It would be worth testing, but
+it looks like failure to provide an authorizer, which contains the nonce, would not be helpful
+to an attacker.
+
+The message eventually makes its way through to ``handle_response()``, in
+``CephxClientHandler.cc``. In this routine, we call ``get_handler()`` to get a ticket
+handler to hold the ticket we have just received. This routine is embedded in the definition
+for a ``CephXTicketManager`` structure. It takes a type (``CEPH_ENTITY_TYPE_AUTH``, in
+this case) and looks through the ``tickets_map`` to find that type. There should be one, and
+it should have the session key of the session between C and A in its entry. This key will
+be used to decrypt the information provided by A, particularly the new session key allowing
+C to talk to S.
+
+We then call ``verify_service_ticket_reply()``, in ``CephxProtocol.cc``. This routine
+needs to determine if the ticket is OK and also obtain the session key associated with this
+ticket. It decrypts the encrypted portion of the message buffer, using the session key
+shared with A. This ticket was not encrypted (well, not twice - tickets are always encrypted,
+but sometimes double encrypted, which this one isn't). So it can be stored in a service
+ticket buffer directly. We now grab the ticket out of that buffer.
+
+The stuff we decrypted with the session key shared between C and A included the new session
+key. That's our current session key for this ticket, so set it. Check validity and
+set the expiration times. Now return true, if we got this far.
+
+Back in ``handle_response()``, we now call ``validate_tickets()`` to adjust what we think
+we need, since we now have a ticket we didn't have before. If we've taken care of
+everything we need, we'll return 0.
+
+This ends phase II of the protocol. We have now successfully set up a ticket and session key
+for client C to talk to server S. S will know that C is who it claims to be, since A will
+verify it. C will know it is S it's talking to, again because A verified it. The only
+copies of the session key for C and S to communicate were sent encrypted under the permanent
+keys of C and S, respectively, so no other party (excepting A, who is trusted by all) knows
+that session key. The ticket will securely indicate to S what C is allowed to do, attested
+to by A. The nonces passed back and forth between A and C ensure that they have not been
+subject to a replay attack. C has not yet actually talked to S, but it is ready to.
+
+Much of the security here falls apart if one of the permanent keys is compromised. Compromise
+of C's key means that the attacker can pose as C and obtain all of C's privileges, and can
+eavesdrop on C's legitimate conversations. The attacker can also pretend to be A, but only in
+conversations with C. Since the attacker does not (by hypothesis) have keys for any services,
+it cannot generate any new tickets for services, though it can replay old tickets and session
+keys until S's permanent key is changed or the old tickets time out.
+
+Compromise of S's key means that the attacker can pose as S to anyone, and can eavesdrop on
+any user's conversation with S. Unless some client's key is also compromised, the attacker
+cannot generate new fake client tickets for S, since doing so requires it to authenticate
+itself as A, using a client key it does not know.
diff --git a/src/ceph/doc/dev/config.rst b/src/ceph/doc/dev/config.rst
new file mode 100644
index 0000000..572ba8d
--- /dev/null
+++ b/src/ceph/doc/dev/config.rst
@@ -0,0 +1,157 @@
+=================================
+ Configuration Management System
+=================================
+
+The configuration management system exists to provide every daemon with the
+proper configuration information. The configuration can be viewed as a set of
+key-value pairs.
+
+How can the configuration be set? Well, there are several sources:
+ - the ceph configuration file, usually named ceph.conf
+ - command line arguments, e.g. ``--debug-ms=1``, ``--debug-pg=10``, etc.
+ - arguments injected at runtime by using injectargs (see the example below)
+
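+For example, one way to inject arguments into a running daemon is with
+``ceph tell`` (shown here as an illustration; the daemon name and debug levels
+are examples)::
+
+  ceph tell osd.0 injectargs '--debug-ms 1 --debug-osd 10'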
+
+The Configuration File
+======================
+
+Most configuration settings originate in the Ceph configuration file.
+
+How do we find the configuration file? Well, in order, we check:
+ - the default locations
+ - the environment variable CEPH_CONF
+ - the command line argument -c
+
+Each stanza of the configuration file describes the key-value pairs that will be in
+effect for a particular subset of the daemons. The "global" stanza applies to
+everything. The "mon", "osd", and "mds" stanzas specify settings to take effect
+for all monitors, all OSDs, and all mds servers, respectively. A stanza of the
+form mon.$name, osd.$name, or mds.$name gives settings for the monitor, OSD, or
+MDS of that name, respectively. Configuration values that appear later in the
+file win over earlier ones.
+
+A sample configuration file can be found in src/sample.ceph.conf.
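+
+For illustration, a minimal configuration file using these stanzas might look
+like this (hypothetical values)::
+
+  [global]
+      debug ms = 0
+
+  [osd]
+      debug osd = 5        ; applies to every OSD
+
+  [osd.0]
+      debug osd = 20       ; overrides the [osd] value, but only for osd.0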
+
+
+Metavariables
+=============
+
+The configuration system allows any configuration value to be
+substituted into another value using the ``$varname`` syntax, similar
+to how bash shell expansion works.
+
+A few additional special metavariables are also defined:
+ - $host: expands to the current hostname
+ - $type: expands to one of "mds", "osd", "mon", or "client"
+ - $id: expands to the daemon identifier. For ``osd.0``, this would be ``0``; for ``mds.a``, it would be ``a``; for ``client.admin``, it would be ``admin``.
+ - $num: same as $id
+ - $name: expands to $type.$id
+
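+For example, for the daemon ``osd.0`` (``$type`` = ``osd``, ``$id`` = ``0``,
+``$name`` = ``osd.0``), a hypothetical setting such as::
+
+  [osd]
+      log file = /var/log/ceph/$name.log
+
+would expand to ``/var/log/ceph/osd.0.log``.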
+
+Reading configuration values
+====================================================
+
+There are two ways for Ceph code to get configuration values. One way is to
+read them directly from a variable named "g_conf," or equivalently,
+"g_ceph_context->_conf." The other is to register an observer that will be called
+every time a relevant configuration value changes. This observer will be
+called soon after the initial configuration is read, and every time after that
+when one of the relevant values changes. Each observer tracks a set of keys
+and is invoked only when one of the relevant keys changes.
+
+The interface to implement is found in common/config_obs.h.
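+
+A minimal observer sketch, assuming the ``md_config_obs_t`` interface declared in
+``common/config_obs.h`` (method names and signatures are approximate and may differ
+slightly between releases)::
+
+  class MyObserver : public md_config_obs_t {
+    const char** get_tracked_conf_keys() const override {
+      static const char* keys[] = { "debug_ms", nullptr };   // keys we care about
+      return keys;
+    }
+    void handle_conf_change(const struct md_config_t *conf,
+                            const std::set<std::string>& changed) override {
+      if (changed.count("debug_ms")) {
+        // re-read the value from conf and reconfigure whatever depends on it
+      }
+    }
+  };
+
+  // registration, typically done once at startup:
+  //   g_ceph_context->_conf->add_observer(&my_observer);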
+
+The observer method should be preferred in new code because
+ - It is more flexible, allowing the code to do whatever reinitialization needs
+ to be done to implement the new configuration value.
+ - It is the only way to create a std::string configuration variable that can
+ be changed by injectargs.
+ - Even for int-valued configuration options, changing the values in one thread
+ while another thread is reading them can lead to subtle and
+ impossible-to-diagnose bugs.
+
+For these reasons, reading directly from g_conf should be considered deprecated
+and not done in new code. Do not ever alter g_conf.
+
+Changing configuration values
+====================================================
+
+Configuration values can be changed by calling g_conf->set_val. After changing
+the configuration, you should call g_conf->apply_changes to re-run all the
+affected configuration observers. For convenience, you can call
+g_conf->set_val_or_die to make a configuration change which you think should
+never fail.
+
+Injectargs, parse_argv, and parse_env are three other functions which modify
+the configuration. Just like with set_val, you should call apply_changes after
+calling these functions to make sure your changes get applied.
+
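+A rough sketch of the calling pattern (exact signatures vary between
+releases)::
+
+  g_conf->set_val("debug_ms", "1");
+  g_conf->set_val("debug_osd", "10");
+  g_conf->apply_changes(nullptr);  // re-run the observers affected by these keys
+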
+
+Defining config options
+=======================
+
+New-style config options are defined in common/options.cc. All new config
+options should go here (and not into legacy_config_opts.h).
+
+Levels
+------
+
+The Option constructor takes a "level" value:
+
+* *LEVEL_BASIC* is for basic config options that a normal operator is likely to adjust.
+* *LEVEL_ADVANCED* is for options that an operator *can* adjust, but should not touch unless they understand what they are doing. Adjusting advanced options poorly can lead to problems (performance or even data loss) if done incorrectly.
+* *LEVEL_DEV* is for options in place for use by developers only, either for testing purposes, or to describe constants that no user should adjust but we prefer not to compile into the code.
+
+Description and long description
+--------------------------------
+
+The short description of the option is a sentence fragment, e.g.::
+
+ .set_description("Default checksum algorithm to use")
+
+The long description is complete sentences, perhaps even multiple
+paragraphs, and may include other detailed information or notes. For example::
+
+ .set_long_description("crc32c, xxhash32, and xxhash64 are available. The _16 and _8 variants use only a subset of the bits for more compact (but less reliable) checksumming.")
+
+Default values
+--------------
+
+There is a default value for every config option. In some cases, there may
+also be a *daemon default* that only applies to code that declares itself
+as a daemon (in this case, the regular default only applies to non-daemons).
+
+Safety
+------
+
+If an option can be safely changed at runtime::
+
+ .set_safe()
+
+Service
+-------
+
+Service is a component name, like "common", "osd", "rgw", "mds", etc. It may
+be a list of components, like::
+
+ .add_service("mon mds osd mgr")
+
+For example, the rocksdb options affect both the osd and mon.
+
+Tags
+----
+
+Tags identify options across services that relate in some way. Examples include:
+
+ - network -- options affecting network configuration
+ - mkfs -- options that only matter at mkfs time
+
+Enums
+-----
+
+For options with a defined set of allowed values::
+
+ .set_enum_allowed({"none", "crc32c", "crc32c_16", "crc32c_8", "xxhash32", "xxhash64"})
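+
+Putting these pieces together, a complete option definition in
+``common/options.cc`` looks roughly like this (a sketch assembled from the
+fragments above; the exact builder methods available may vary between
+releases)::
+
+  Option("bluestore_csum_type", Option::TYPE_STR, Option::LEVEL_ADVANCED)
+  .set_default("crc32c")
+  .set_enum_allowed({"none", "crc32c", "crc32c_16", "crc32c_8", "xxhash32", "xxhash64"})
+  .set_description("Default checksum algorithm to use")
+  .set_long_description("crc32c, xxhash32, and xxhash64 are available. "
+                        "The _16 and _8 variants use only a subset of the bits "
+                        "for more compact (but less reliable) checksumming.")
+  .set_safe()
+  .add_service("osd")
+  .add_tag("mkfs"),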
diff --git a/src/ceph/doc/dev/confusing.txt b/src/ceph/doc/dev/confusing.txt
new file mode 100644
index 0000000..a860c25
--- /dev/null
+++ b/src/ceph/doc/dev/confusing.txt
@@ -0,0 +1,36 @@
+About this Document
+This document contains procedures for a new customer to configure a Ceph System.
+Before You Begin
+Before you begin configuring your system for Ceph, use the following checklist to decide what type of system you need.
+1. Identify the amount of storage that you need based on your current data, network traffic, workload, and other parameters
+2. Identify the growth potential for your business so that you can project ahead for future storage needs.
+3. Plan ahead for redundancy and replacement options.
+4. Study market forecasts and how they affect your business.
+Preparing a Ceph Cluster
+A Ceph cluster consists of the following core components:
+1. Monitors – These must be an odd number, such as one, three, or five. Three is the preferred configuration.
+2. Object Storage Devices (OSD) – used as storage nodes
+3. Metadata Servers (MDS)
+Although Ceph is extremely scalable, and nodes can be added any time on an as-needed basis, it is important to first determine the base needs of your configuration prior to setting up your system. This will save time and money in the long run. The following table offers a guideline on how many components your business should obtain prior to configuring your Ceph Cluster.
+Size/Workload    Monitors    OSD    MDS    Bandwidth
+Small/low        1           10     0-1    ???
+Small/average    3           23     1-3    ???
+Small/high       3 or more   30     3-5    ???
+Medium/low       3           20     1      ???
+Medium/average   3           30     1-3    ???
+Medium/high      3 or more   1000   3-5    ???
+Large/low        3           30     1      ???
+Large/average    3 or more   50     1-3    ???
+Large/high       3 or more   2000   3-10   ???
+ Warning: If you are using a low bandwidth system, and are connecting to the cluster over the internet, you must use the librados object level interface, and your OSDs must be located in the same data center.
+Sample Configuration
+The figure below shows a sample Ceph Configuration.
+img cephconfig.jpg
+
+Related Documentation
+Once you have determined your configuration needs, make sure you have access to the following documents:
+• Ceph Installation and Configuration Guide
+• Ceph System Administration Guide
+• Ceph Troubleshooting Manual
+
+
diff --git a/src/ceph/doc/dev/context.rst b/src/ceph/doc/dev/context.rst
new file mode 100644
index 0000000..1a2b2cb
--- /dev/null
+++ b/src/ceph/doc/dev/context.rst
@@ -0,0 +1,20 @@
+=============
+ CephContext
+=============
+
+A CephContext represents a single view of the Ceph cluster. It comes complete
+with a configuration, a set of performance counters (PerfCounters), and a
+heartbeat map. You can find more information about CephContext in
+src/common/ceph_context.h.
+
+Generally, you will have only one CephContext in your application, called
+g_ceph_context. However, in library code, it is possible that the library user
+will initialize multiple CephContexts. For example, this would happen if
+``rados_create`` is called more than once.
+
+A ceph context is required to issue log messages. Why is this? Well, without
+the CephContext, we would not know which log messages were disabled and which
+were enabled. The dout() macro implicitly references g_ceph_context, so it
+can't be used in library code. It is fine to use dout and derr in daemons, but
+in library code, you must use ldout and lderr, and pass in your own CephContext
+object. The compiler will enforce this restriction.
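+
+A minimal sketch of the difference (illustrative only; assume ``cct`` is whatever
+``CephContext*`` your library code was given)::
+
+  // in a daemon: dout() implicitly uses g_ceph_context
+  dout(10) << "starting up" << dendl;
+
+  // in library code: pass the CephContext explicitly
+  ldout(cct, 10) << "starting up" << dendl;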
diff --git a/src/ceph/doc/dev/corpus.rst b/src/ceph/doc/dev/corpus.rst
new file mode 100644
index 0000000..76fa43d
--- /dev/null
+++ b/src/ceph/doc/dev/corpus.rst
@@ -0,0 +1,95 @@
+
+Corpus structure
+================
+
+ceph.git/ceph-object-corpus is a submodule. ::
+
+ bin/ # misc scripts
+ archive/$version/objects/$type/$hash # a sample of encoded objects from a specific version
+
+You can also mark known or deliberate incompatibilities between versions with::
+
+ archive/$version/forward_incompat/$type
+
+The presence of a file indicates that new versions of code cannot
+decode old objects across that $version (this is normally the case).
+
+
+How to generate an object corpus
+--------------------------------
+
+We can generate an object corpus for a particular version of ceph like so.
+
+#. Checkout a clean repo (best not to do this where you normally work)::
+
+ git clone ceph.git
+ cd ceph
+ git submodule update --init --recursive
+
+#. Build with flag to dump objects to /tmp/foo::
+
+ rm -rf /tmp/foo ; mkdir /tmp/foo
+ ./do_autogen.sh -e /tmp/foo
+ make
+
+#. Start via vstart::
+
+ cd src
+ MON=3 OSD=3 MDS=3 RGW=1 ./vstart.sh -n -x
+
+#. Use as much functionality of the cluster as you can, to exercise as many object encoder methods as possible::
+
+ ./rados -p rbd bench 10 write -b 123
+ ./ceph osd out 0
+ ./init-ceph stop osd.1
+ for f in ../qa/workunits/cls/*.sh ; do PATH=".:$PATH" $f ; done
+ ../qa/workunits/rados/test.sh
+ ./ceph_test_librbd
+ ./ceph_test_libcephfs
+ ./init-ceph restart mds.a
+
+   Do some more stuff with rgw if you know how.
+
+#. Stop::
+
+ ./stop.sh
+
+#. Import the corpus (this will take a few minutes)::
+
+ test/encoding/import.sh /tmp/foo `./ceph-dencoder version` ../ceph-object-corpus/archive
+ test/encoding/import-generated.sh ../ceph-object-corpus/archive
+
+#. Prune it! There will be a bazillion copies of various objects, and we only want a representative sample. ::
+
+ pushd ../ceph-object-corpus
+ bin/prune-archive.sh
+ popd
+
+#. Verify the tests pass::
+
+ make check-local
+
+#. Commit it to the corpus repo and push::
+
+ pushd ../ceph-object-corpus
+ git checkout -b wip-new
+ git add archive/`../src/ceph-dencoder version`
+ git commit -m `../src/ceph-dencoder version`
+ git remote add cc ceph.com:/git/ceph-object-corpus.git
+ git push cc wip-new
+ popd
+
+#. Go test it out::
+
+ cd my/regular/tree
+ cd ceph-object-corpus
+ git fetch origin
+ git checkout wip-new
+ cd ../src
+ make check-local
+
+#. If everything looks good, update the submodule master branch, and commit the submodule in ceph.git.
+
+
+
+
diff --git a/src/ceph/doc/dev/cpu-profiler.rst b/src/ceph/doc/dev/cpu-profiler.rst
new file mode 100644
index 0000000..d7a77fe
--- /dev/null
+++ b/src/ceph/doc/dev/cpu-profiler.rst
@@ -0,0 +1,54 @@
+=====================
+ Installing Oprofile
+=====================
+
+The easiest way to profile Ceph's CPU consumption is to use the `oprofile`_
+system-wide profiler.
+
+.. _oprofile: http://oprofile.sourceforge.net/about/
+
+Installation
+============
+
+If you are using a Debian/Ubuntu distribution, you can install ``oprofile`` by
+executing the following::
+
+ sudo apt-get install oprofile oprofile-gui
+
+
+Compiling Ceph for Profiling
+============================
+
+To compile Ceph for profiling, first clean everything. ::
+
+ make distclean
+
+Then, export the following settings so that you can see callgraph output. ::
+
+ export CFLAGS="-fno-omit-frame-pointer -O2 -g"
+
+Finally, compile Ceph. ::
+
+ ./autogen.sh
+ ./configure
+ make
+
+You can use ``make -j`` to execute multiple jobs depending upon your system. For
+example::
+
+ make -j4
+
+
+Ceph Configuration
+==================
+
+Ensure that you disable ``lockdep``. Consider setting logging to
+levels appropriate for a production cluster. See `Ceph Logging and Debugging`_
+for details.
+
+.. _Ceph Logging and Debugging: ../../rados/troubleshooting/log-and-debug
+
+See the `CPU Profiling`_ section of the RADOS Troubleshooting documentation for details on using Oprofile.
+
+
+.. _CPU Profiling: ../../rados/troubleshooting/cpu-profiling \ No newline at end of file
diff --git a/src/ceph/doc/dev/delayed-delete.rst b/src/ceph/doc/dev/delayed-delete.rst
new file mode 100644
index 0000000..bf5f65a
--- /dev/null
+++ b/src/ceph/doc/dev/delayed-delete.rst
@@ -0,0 +1,12 @@
+=========================
+ CephFS delayed deletion
+=========================
+
+When you delete a file, the data is not immediately removed. Each
+object in the file needs to be removed independently, and sending
+``size_of_file / stripe_size * replication_count`` messages would slow
+the client down too much, and use too much of the client's bandwidth. For
+example, deleting a 10 GB file striped into 4 MB objects on a pool with three
+replicas would naively mean on the order of 10 GB / 4 MB * 3 = 7680 messages.
+Additionally, snapshots may mean some objects should not be deleted.
+
+Instead, the file is marked as deleted on the MDS, and deleted lazily.
diff --git a/src/ceph/doc/dev/dev_cluster_deployement.rst b/src/ceph/doc/dev/dev_cluster_deployement.rst
new file mode 100644
index 0000000..0e6244e
--- /dev/null
+++ b/src/ceph/doc/dev/dev_cluster_deployement.rst
@@ -0,0 +1,121 @@
+=================================
+ Deploying a development cluster
+=================================
+
+To develop on Ceph, you can use the *vstart.sh* utility to deploy a
+fake local cluster for development purposes.
+
+Usage
+=====
+
+It deploys a fake local cluster on your machine for development purposes. It starts rgw, mon, osd and/or mds daemons, or all of them if none are specified.
+
+To start your development cluster, type the following::
+
+ vstart.sh [OPTIONS]...
+
+In order to stop the cluster, you can type::
+
+ ./stop.sh
+
+Options
+=======
+
+.. option:: -i ip_address
+
+ Bind to the specified *ip_address* instead of guessing it by resolving the hostname.
+
+.. option:: -k
+
+ Keep old configuration files instead of overwriting them.
+
+.. option:: -l, --localhost
+
+ Use localhost instead of the hostname.
+
+.. option:: -m ip[:port]
+
+ Specifies monitor *ip* address and *port*.
+
+.. option:: -n, --new
+
+ Create a new cluster.
+
+.. option:: -o config
+
+ Add *config* to all sections in the ceph configuration.
+
+.. option:: --nodaemon
+
+ Use ceph-run as wrapper for mon/osd/mds.
+
+.. option:: --smallmds
+
+ Configure the MDS with a small cache size limit.
+
+.. option:: -x
+
+ Enable Cephx (on by default).
+
+.. option:: -X
+
+ Disable Cephx.
+
+.. option:: -d, --debug
+
+ Launch in debug mode
+
+.. option:: --valgrind[_{osd,mds,mon}] 'valgrind_toolname [args...]'
+
+ Launch the osd/mds/mon or all the Ceph binaries using valgrind with the specified tool and arguments.
+
+.. option:: --bluestore
+
+ Use bluestore as the objectstore backend for osds
+
+.. option:: --memstore
+
+ Use memstore as the objectstore backend for osds
+
+.. option:: --cache <pool>
+
+ Set a cache-tier for the specified pool
+
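+A typical invocation combining several of these options might look like this
+(illustrative only; adjust the daemon counts and options to your needs)::
+
+  MON=1 OSD=3 MDS=1 ./vstart.sh -n -x -d --bluestore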
+
+Environment variables
+=====================
+
+{OSD,MDS,MON,RGW}
+
+These environment variables contain the number of instances of each Ceph daemon type you want to start.
+
+Example: ::
+
+ OSD=3 MON=3 RGW=1 vstart.sh
+
+
+============================================================
+ Deploying multiple development clusters on the same machine
+============================================================
+
+In order to bring up multiple ceph clusters on the same machine, *mstart.sh* a
+small wrapper around the above *vstart* can help.
+
+Usage
+=====
+
+To start multiple clusters, run mstart.sh for each cluster you want to deploy.
+It starts the monitors, RGWs, etc. of each cluster on different ports, allowing
+you to run multiple mons, RGWs and so on, on the same machine. Invoke it in
+the following way::
+
+ mstart.sh <cluster-name> <vstart options>
+
+For example::
+
+ ./mstart.sh cluster1 -n -r
+
+
+To stop a cluster, run::
+
+ ./mstop.sh <cluster-name>
diff --git a/src/ceph/doc/dev/development-workflow.rst b/src/ceph/doc/dev/development-workflow.rst
new file mode 100644
index 0000000..b2baea1
--- /dev/null
+++ b/src/ceph/doc/dev/development-workflow.rst
@@ -0,0 +1,250 @@
+=====================
+Development workflows
+=====================
+
+This page explains the workflows a developer is expected to follow to
+implement the goals that are part of the Ceph release cycle. It does not
+go into technical details and is designed to provide a high level view
+instead. Each chapter is about a given goal such as ``Merging bug
+fixes or features`` or ``Publishing point releases and backporting``.
+
+A key aspect of all workflows is that none of them blocks another. For
+instance, a bug fix can be backported and merged to a stable branch
+while the next point release is being published. For that specific
+example to work, a branch should be created to avoid any
+interference. In practice it is not necessary for Ceph because:
+
+* there are few people involved
+* the frequency of backports is not too high
+* the reviewers, who know a release is being published, are unlikely
+ to merge anything that may cause issues
+
+This ad-hoc approach implies the workflows are changed on a regular
+basis to adapt. For instance, ``quality engineers`` were not involved
+in the workflow to publish ``dumpling`` point releases. The number of
+commits being backported to ``firefly`` made it impractical for developers
+tasked to write code or fix bugs to also run and verify the full suite
+of integration tests. Inserting ``quality engineers`` makes it
+possible for someone to participate in the workflow by analyzing test
+results.
+
+The workflows are not enforced when they impose an overhead that does
+not make sense. For instance, if the release notes for a point release
+were not written prior to checking all integration tests, they can be
+committed to the stable branch and the result sent for publication
+without going through another run of integration tests.
+
+Release Cycle
+=============
+
+::
+
+ Ceph hammer infernalis
+ Developer CDS CDS
+ Summit | |
+ | |
+ development | |
+ release | v0.88 v0.89 v0.90 ... | v9.0.0
+ --v--^----^--v---^------^--v- ---v----^----^--- 2015
+ | | | |
+ stable giant | | hammer
+ release v0.87 | | v0.94
+ | |
+ point firefly dumpling
+ release v0.80.8 v0.67.12
+
+
+Four times a year, the development roadmap is discussed online during
+the `Ceph Developer Summit <http://tracker.ceph.com/projects/ceph/wiki/Planning#Ceph-Developer-Summit>`_. A
+new stable release (hammer, infernalis, jewel ...) is published at the same
+frequency. Every other release (firefly, hammer, jewel...) is a `Long Term
+Stable (LTS) <../../releases>`_. See `Understanding the release cycle
+<../../releases#understanding-the-release-cycle>`_ for more information.
+
+Merging bug fixes or features
+=============================
+
+The development branch is ``master`` and the workflow followed by all
+developers can be summarized as follows:
+
+* The developer prepares a series of commits
+* The developer submits the series of commits via a pull request
+* A reviewer is assigned the pull request
+* When the pull request looks good to the reviewer, it is merged into
+ an integration branch by the tester
+* After a successful run of integration tests, the pull request is
+ merged by the tester
+
+The ``developer`` is the author of a series of commits. The
+``reviewer`` is responsible for providing feedback to the developer on
+a regular basis and the developer is invited to ping the reviewer if
+nothing happened after a week. After the ``reviewer`` is satisfied
+with the pull request, (s)he passes it to the ``tester``. The
+``tester`` is responsible for running teuthology integration tests on
+the pull request. If nothing happens within a month the ``reviewer`` is
+invited to ping the ``tester``.
+
+Resolving bug reports and implementing features
+===============================================
+
+All bug reports and feature requests are in the `issue tracker
+<http://tracker.ceph.com>`_ and the workflow can be summarized as
+follows:
+
+* The reporter creates the issue with priority ``Normal``
+* A developer may pick the issue right away
+* During a bi-weekly bug scrub, the team goes over all new issues and
+ assigns them a priority
+* The bugs with higher priority are worked on first
+
+Each ``team`` is responsible for a project, managed by leads_.
+
+.. _leads: index#Leads
+
+The ``developer`` assigned to an issue is responsible for it. The
+status of an open issue can be:
+
+* ``New``: it is unclear if the issue needs work.
+* ``Verified``: the bug can be reproduced or showed up multiple times
+* ``In Progress``: the developer is working on it this week
+* ``Pending Backport``: the fix needs to be backported to the stable
+ releases listed in the backport field
+
+For each ``Pending Backport`` issue, there exists at least one issue
+in the ``Backport`` tracker to record the work done to cherry pick the
+necessary commits from the master branch to the target stable branch.
+See `the backporter manual
+<http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO>`_ for more
+information.
+
+Running and interpreting teuthology integration tests
+=====================================================
+
+The :doc:`/dev/sepia` runs `teuthology
+<https://github.com/ceph/teuthology/>`_ integration tests `on a regular basis <http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_monitor_the_automated_tests_AKA_nightlies#Automated-tests-AKA-nightlies>`_ and the
+results are posted on `pulpito <http://pulpito.ceph.com/>`_ and the
+`ceph-qa mailing list <https://ceph.com/irc/>`_.
+
+* The job failures are `analyzed by quality engineers and developers
+ <http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_monitor_the_automated_tests_AKA_nightlies#List-of-suites-and-watchers>`_
+* If the cause is environmental (e.g. network connectivity), an issue
+ is created in the `sepia lab project
+ <http://tracker.ceph.com/projects/lab/issues/new>`_
+* If the bug is known, a pulpito URL to the failed job is added to the issue
+* If the bug is new, an issue is created
+
+The ``quality engineer`` is either a developer or a member of the QE
+team. There is at least one integration test suite per project:
+
+* `rgw <https://github.com/ceph/ceph/tree/master/qa/suites/rgw>`_ suite
+* `CephFS <https://github.com/ceph/ceph/tree/master/qa/suites/fs>`_ suite
+* `rados <https://github.com/ceph/ceph/tree/master/qa/suites/rados>`_ suite
+* `rbd <https://github.com/ceph/ceph/tree/master/qa/suites/rbd>`_ suite
+
+and many others such as
+
+* `upgrade <https://github.com/ceph/ceph/tree/master/qa/suites/upgrade>`_ suites
+* `power-cycle <https://github.com/ceph/ceph/tree/master/qa/suites/powercycle>`_ suite
+* ...
+
+Preparing a new release
+=======================
+
+A release is prepared in a dedicated branch, different from the
+``master`` branch.
+
+* For a stable release it is the branch matching the release code
+ name (dumpling, firefly, etc.)
+* For a development release it is the ``next`` branch
+
+The workflow expected of all developers to stabilize the release
+candidate is the same as the normal development workflow with the
+following differences:
+
+* The pull requests must target the stable branch or next instead of
+ master
+* The reviewer rejects pull requests that are not bug fixes
+* The ``Backport`` issues matching a teuthology test failure and set
+ with priority ``Urgent`` must be fixed before the release
+
+Cutting a new stable release
+============================
+
+A new stable release can be cut when:
+
+* all ``Backport`` issues with priority ``Urgent`` are fixed
+* integration and upgrade tests run successfully
+
+Publishing a new stable release implies a risk of regression or
+discovering new bugs during the upgrade, no matter how carefully it is
+tested. The decision to cut a release must take this into account: it
+may not be wise to publish a stable release that only fixes a few
+minor bugs. For instance, if only one commit has been backported to a
+stable release that is not an LTS, it is better to wait until there are
+more.
+
+When a stable release is to be retired, it may be safer to
+recommend an upgrade to the next LTS release instead of
+proposing a new point release to fix a problem. For instance, the
+``dumpling`` v0.67.11 release has bugs related to backfilling which have
+been fixed in ``firefly`` v0.80.x. A backport fixing these backfilling
+bugs has been tested in the draft point release ``dumpling`` v0.67.12 but
+they are large enough to introduce a risk of regression. As ``dumpling``
+is to be retired, users suffering from this bug can
+upgrade to ``firefly`` to fix it. Unless users manifest themselves and ask
+for ``dumpling`` v0.67.12, this draft release may never be published.
+
+* The ``Ceph lead`` decides a new stable release must be published
+* The ``release master`` gets approval from all leads
+* The ``release master`` writes and commits the release notes
+* The ``release master`` informs the ``quality engineer`` that the
+ branch is ready for testing
+* The ``quality engineer`` runs additional integration tests
+* If the ``quality engineer`` discovers new bugs that require an
+ ``Urgent Backport``, the release goes back to being prepared; it
+ was not ready after all
+* The ``quality engineer`` informs the ``publisher`` that the branch
+ is ready for release
+* The ``publisher`` `creates the packages and sets the release tag
+ <../release-process>`_
+
+The person responsible for each role is:
+
+* Sage Weil is the ``Ceph lead``
+* Sage Weil is the ``release master`` for major stable releases
+ (``firefly`` 0.80, ``hammer`` 0.94 etc.)
+* Loic Dachary is the ``release master`` for stable point releases
+ (``firefly`` 0.80.10, ``hammer`` 0.94.1 etc.)
+* Yuri Weinstein is the ``quality engineer``
+* Alfredo Deza is the ``publisher``
+
+Cutting a new development release
+=================================
+
+The publication workflow of a development release is the same as
+preparing a new release and cutting it, with the following
+differences:
+
+* The ``next`` branch is reset to the tip of ``master`` after
+ publication
+* The ``quality engineer`` is not required to run additional tests,
+ the ``release master`` directly informs the ``publisher`` that the
+ release is ready to be published.
+
+Publishing point releases and backporting
+=========================================
+
+The publication workflow of the point releases is the same as
+preparing a new release and cutting it, with the following
+differences:
+
+* The ``backport`` field of each issue contains the code name of the
+ stable release
+* There is exactly one issue in the ``Backport`` tracker for each
+ stable release to which the issue is backported
+* All commits are cherry-picked with ``git cherry-pick -x`` to
+ reference the original commit
+
+See `the backporter manual
+<http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO>`_ for more
+information.
diff --git a/src/ceph/doc/dev/documenting.rst b/src/ceph/doc/dev/documenting.rst
new file mode 100644
index 0000000..63e1f19
--- /dev/null
+++ b/src/ceph/doc/dev/documenting.rst
@@ -0,0 +1,132 @@
+==================
+ Documenting Ceph
+==================
+
+User documentation
+==================
+
+The documentation on docs.ceph.com is generated from the restructuredText
+sources in ``/doc/`` in the Ceph git repository.
+
+Please make sure that your changes are written in a way that is intended
+for end users of the software, unless you are making additions in
+``/doc/dev/``, which is the section for developers.
+
+All pull requests that modify user-facing functionality must
+include corresponding updates to documentation: see
+`Submitting Patches`_ for more detail.
+
+Check your .rst syntax is working as expected by using the "View"
+button in the github user interface when looking at a diff on
+an .rst file, or build the docs locally using the ``admin/build-doc``
+script.
+
+For more information about the Ceph documentation, see
+:doc:`/start/documenting-ceph`.
+
+Code Documentation
+==================
+
+C and C++ can be documented with Doxygen_, using the subset of Doxygen
+markup supported by Breathe_.
+
+.. _Doxygen: http://www.stack.nl/~dimitri/doxygen/
+.. _Breathe: https://github.com/michaeljones/breathe
+
+The general format for function documentation is::
+
+ /**
+ * Short description
+ *
+ * Detailed description when necessary
+ *
+ * preconditions, postconditions, warnings, bugs or other notes
+ *
+ * parameter reference
+ * return value (if non-void)
+ */
+
+This should be in the header where the function is declared, and
+functions should be grouped into logical categories. The `librados C
+API`_ provides a complete example. It is pulled into Sphinx by
+`librados.rst`_, which is rendered at :doc:`/rados/api/librados`.
+
+.. _`librados C API`: https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h
+.. _`librados.rst`: https://github.com/ceph/ceph/raw/master/doc/rados/api/librados.rst
+
+Drawing diagrams
+================
+
+Graphviz
+--------
+
+You can use Graphviz_, as explained in the `Graphviz extension documentation`_.
+
+.. _Graphviz: http://graphviz.org/
+.. _`Graphviz extension documentation`: http://sphinx.pocoo.org/ext/graphviz.html
+
+.. graphviz::
+
+ digraph "example" {
+ foo -> bar;
+ bar -> baz;
+ bar -> th
+ }
+
+Most of the time, you'll want to put the actual DOT source in a
+separate file, like this::
+
+ .. graphviz:: myfile.dot
+
+
+Ditaa
+-----
+
+You can use Ditaa_:
+
+.. _Ditaa: http://ditaa.sourceforge.net/
+
+.. ditaa::
+
+ +--------------+ /=----\
+ | hello, world |-->| hi! |
+ +--------------+ \-----/
+
+
+Blockdiag
+---------
+
+If a use arises, we can integrate Blockdiag_. It is a Graphviz-style
+declarative language for drawing things, and includes:
+
+- `block diagrams`_: boxes and arrows (automatic layout, as opposed to
+ Ditaa_)
+- `sequence diagrams`_: timelines and messages between them
+- `activity diagrams`_: subsystems and activities in them
+- `network diagrams`_: hosts, LANs, IP addresses etc (with `Cisco
+ icons`_ if wanted)
+
+.. _Blockdiag: http://blockdiag.com/
+.. _`Cisco icons`: http://pypi.python.org/pypi/blockdiagcontrib-cisco/
+.. _`block diagrams`: http://blockdiag.com/en/blockdiag/
+.. _`sequence diagrams`: http://blockdiag.com/en/seqdiag/index.html
+.. _`activity diagrams`: http://blockdiag.com/en/actdiag/index.html
+.. _`network diagrams`: http://blockdiag.com/en/nwdiag/
+
+
+Inkscape
+--------
+
+You can use Inkscape (http://inkscape.org) to generate scalable vector
+graphics for restructuredText documents.
+
+If you generate diagrams with Inkscape, you should
+commit both the Scalable Vector Graphics (SVG) file and export a
+Portable Network Graphic (PNG) file. Reference the PNG file.
+
+By committing the SVG file, others will be able to update the
+SVG diagrams using Inkscape.
+
+HTML5 supports inline SVG.
+
+.. _Submitting Patches: /SubmittingPatches.rst
diff --git a/src/ceph/doc/dev/erasure-coded-pool.rst b/src/ceph/doc/dev/erasure-coded-pool.rst
new file mode 100644
index 0000000..4694a7a
--- /dev/null
+++ b/src/ceph/doc/dev/erasure-coded-pool.rst
@@ -0,0 +1,137 @@
+Erasure Coded pool
+==================
+
+Purpose
+-------
+
+Erasure-coded pools require less storage space compared to replicated
+pools. The erasure-coding support has higher computational requirements and
+only supports a subset of the operations allowed on an object (for instance,
+partial write is not supported).
+
+Use cases
+---------
+
+Cold storage
+~~~~~~~~~~~~
+
+An erasure-coded pool is created to store a large number of 1GB
+objects (imaging, genomics, etc.) and 10% of them are read per
+month. New objects are added every day and the objects are not
+modified after being written. On average there is one write for 10,000
+reads.
+
+A replicated pool is created and set as a cache tier for the
+erasure coded pool. An agent demotes objects (i.e. moves them from the
+replicated pool to the erasure-coded pool) if they have not been
+accessed in a week.
+
+The erasure-coded pool crush ruleset targets hardware designed for
+cold storage with high latency and slow access time. The replicated
+pool crush ruleset targets faster hardware to provide better response
+times.
+
+Cheap multidatacenter storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ten datacenters are connected with dedicated network links. Each
+datacenter contains the same amount of storage with no power-supply
+backup and no air-cooling system.
+
+An erasure-coded pool is created with a crush map ruleset that will
+ensure no data loss if at most three datacenters fail
+simultaneously. The overhead is 50% with erasure code configured to
+split data in six (k=6) and create three coding chunks (m=3). With
+replication the overhead would be 400% (four replicas).
+
+Interface
+---------
+
+Set up an erasure-coded pool::
+
+ $ ceph osd pool create ecpool 12 12 erasure
+
+Set up an erasure-coded pool and the associated crush ruleset::
+
+ $ ceph osd crush rule create-erasure ecruleset
+ $ ceph osd pool create ecpool 12 12 erasure \
+ default ecruleset
+
+Set the ruleset failure domain to osd (instead of the host which is the default)::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ crush-failure-domain=osd
+ $ ceph osd erasure-code-profile get myprofile
+ k=2
+ m=1
+ plugin=jerasure
+ technique=reed_sol_van
+ crush-failure-domain=osd
+ $ ceph osd pool create ecpool 12 12 erasure myprofile
+
+Control the parameters of the erasure code plugin::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ k=3 m=1
+ $ ceph osd erasure-code-profile get myprofile
+ k=3
+ m=1
+ plugin=jerasure
+ technique=reed_sol_van
+ $ ceph osd pool create ecpool 12 12 erasure \
+ myprofile
+
+Choose an alternate erasure code plugin::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ plugin=example technique=xor
+ $ ceph osd erasure-code-profile get myprofile
+ k=2
+ m=1
+ plugin=example
+ technique=xor
+ $ ceph osd pool create ecpool 12 12 erasure \
+ myprofile
+
+Display the default erasure code profile::
+
+ $ ceph osd erasure-code-profile ls
+ default
+ $ ceph osd erasure-code-profile get default
+ k=2
+ m=1
+ plugin=jerasure
+ technique=reed_sol_van
+
+Create a profile to set the data to be distributed on six OSDs (k+m=6) and sustain the loss of three OSDs (m=3) without losing data::
+
+ $ ceph osd erasure-code-profile set myprofile k=3 m=3
+ $ ceph osd erasure-code-profile get myprofile
+ k=3
+ m=3
+ plugin=jerasure
+ technique=reed_sol_van
+ $ ceph osd erasure-code-profile ls
+ default
+ myprofile
+
+Remove a profile that is no longer in use (otherwise it will fail with EBUSY)::
+
+ $ ceph osd erasure-code-profile ls
+ default
+ myprofile
+ $ ceph osd erasure-code-profile rm myprofile
+ $ ceph osd erasure-code-profile ls
+ default
+
+Set the ruleset to take ssd (instead of default)::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ crush-root=ssd
+ $ ceph osd erasure-code-profile get myprofile
+ k=2
+ m=1
+ plugin=jerasure
+ technique=reed_sol_van
+ crush-root=ssd
+
diff --git a/src/ceph/doc/dev/file-striping.rst b/src/ceph/doc/dev/file-striping.rst
new file mode 100644
index 0000000..405c971
--- /dev/null
+++ b/src/ceph/doc/dev/file-striping.rst
@@ -0,0 +1,161 @@
+File striping
+=============
+
+The text below describes how files from Ceph file system clients are
+stored across objects stored in RADOS.
+
+ceph_file_layout
+----------------
+
+Ceph distributes (stripes) the data for a given file across a number
+of underlying objects. The way file data is mapped to those objects
+is defined by the ceph_file_layout structure. The data distribution
+is a modified RAID 0, where data is striped across a set of objects up
+to a (per-file) fixed size, at which point another set of objects
+holds the file's data. The second set also holds no more than the
+fixed amount of data, and then another set is used, and so on.
+
+Defining some terminology will go a long way toward explaining the
+way file data is laid out across Ceph objects.
+
+- file
+ A collection of contiguous data, named from the perspective of
+ the Ceph client (i.e., a file on a Linux system using Ceph
+ storage). The data for a file is divided into fixed-size
+ "stripe units," which are stored in ceph "objects."
+- stripe unit
+ The size (in bytes) of a block of data used in the RAID 0
+ distribution of a file. All stripe units for a file have equal
+ size. The last stripe unit is typically incomplete--i.e. it
+ represents the data at the end of the file as well as unused
+ "space" beyond it up to the end of the fixed stripe unit size.
+- stripe count
+ The number of consecutive stripe units that constitute a RAID 0
+ "stripe" of file data.
+- stripe
+ A contiguous range of file data, RAID 0 striped across "stripe
+ count" objects in fixed-size "stripe unit" blocks.
+- object
+ A collection of data maintained by Ceph storage. Objects are
+ used to hold portions of Ceph client files.
+- object set
+ A set of objects that together represent a contiguous portion of
+ a file.
+
+Three fields in the ceph_file_layout structure define this mapping::
+
+ u32 fl_stripe_unit;
+ u32 fl_stripe_count;
+ u32 fl_object_size;
+
+(They are actually maintained in their on-disk format, __le32.)
+
+The role of the first two fields should be clear from the
+definitions above.
+
+The third field is the maximum size (in bytes) of an object used to
+back file data. The object size is a multiple of the stripe unit.
+
+A file's data is blocked into stripe units, and consecutive stripe
+units are stored on objects in an object set. The number of objects
+in a set is the same as the stripe count. No object storing file
+data will exceed the file's designated object size, so after some
+fixed number of complete stripes, a new object set is used to store
+subsequent file data.
+
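+The mapping from a byte offset in the file to an object and an offset within
+that object can be sketched as follows (illustrative code only, not the actual
+Ceph implementation; variable names are made up)::
+
+    uint64_t su = layout.fl_stripe_unit;
+    uint64_t sc = layout.fl_stripe_count;
+    uint64_t stripes_per_object = layout.fl_object_size / su;
+
+    uint64_t blockno   = offset / su;      // which stripe unit, counting from 0
+    uint64_t stripeno  = blockno / sc;     // which stripe
+    uint64_t stripepos = blockno % sc;     // position within that stripe
+    uint64_t objectsetno = stripeno / stripes_per_object;
+    uint64_t objectno = objectsetno * sc + stripepos;     // object index
+    uint64_t obj_offset = (stripeno % stripes_per_object) * su + offset % su;
+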
+Note that by default, Ceph uses a simple striping strategy in which
+object_size equals stripe_unit and stripe_count is 1. This simply
+puts one stripe_unit in each object.
+
+Here's a more complex example::
+
+ file size = 1 trillion = 1000000000000 bytes
+
+ fl_stripe_unit = 64KB = 65536 bytes
+ fl_stripe_count = 5 stripe units per stripe
+ fl_object_size = 64GB = 68719476736 bytes
+
+This means::
+
+ file stripe size = 64KB * 5 = 320KB = 327680 bytes
+ each object holds 64GB / 64KB = 1048576 stripe units
+ file object set size = 64GB * 5 = 320GB = 343597383680 bytes
+ (also 1048576 stripe units * 327680 bytes per stripe unit)
+
+So the file's 1 trillion bytes can be divided into complete object
+sets, then complete stripes, then complete stripe units, and finally
+a single incomplete stripe unit::
+
+ - 1 trillion bytes / 320GB per object set = 2 complete object sets
+ (with 312805232640 bytes remaining)
+ - 312805232640 bytes / 320KB per stripe = 954605 complete stripes
+ (with 266240 bytes remaining)
+ - 266240 bytes / 64KB per stripe unit = 4 complete stripe units
+ (with 4096 bytes remaining)
+ - and the final incomplete stripe unit holds those 4096 bytes.
+
+The ASCII art below attempts to capture this::
+
+ _________ _________ _________ _________ _________
+ /object 0\ /object 1\ /object 2\ /object 3\ /object 4\
+ +=========+ +=========+ +=========+ +=========+ +=========+
+ | stripe | | stripe | | stripe | | stripe | | stripe |
+ o | unit | | unit | | unit | | unit | | unit | stripe 0
+ b | 0 | | 1 | | 2 | | 3 | | 4 |
+ j |---------| |---------| |---------| |---------| |---------|
+ e | stripe | | stripe | | stripe | | stripe | | stripe |
+ c | unit | | unit | | unit | | unit | | unit | stripe 1
+ t | 5 | | 6 | | 7 | | 8 | | 9 |
+ |---------| |---------| |---------| |---------| |---------|
+ s | . | | . | | . | | . | | . |
+ e . . . . .
+ t | . | | . | | . | | . | | . |
+ |---------| |---------| |---------| |---------| |---------|
+ 0 | stripe | | stripe | | stripe | | stripe | | stripe | stripe
+ | unit | | unit | | unit | | unit | | unit | 1048575
+ | 5242875 | | 5242876 | | 5242877 | | 5242878 | | 5242879 |
+ \=========/ \=========/ \=========/ \=========/ \=========/
+
+ _________ _________ _________ _________ _________
+ /object 5\ /object 6\ /object 7\ /object 8\ /object 9\
+ +=========+ +=========+ +=========+ +=========+ +=========+
+ | stripe | | stripe | | stripe | | stripe | | stripe | stripe
+ o | unit | | unit | | unit | | unit | | unit | 1048576
+ b | 5242880 | | 5242881 | | 5242882 | | 5242883 | | 5242884 |
+ j |---------| |---------| |---------| |---------| |---------|
+ e | stripe | | stripe | | stripe | | stripe | | stripe | stripe
+ c | unit | | unit | | unit | | unit | | unit | 1048577
+ t | 5242885 | | 5242886 | | 5242887 | | 5242888 | | 5242889 |
+ |---------| |---------| |---------| |---------| |---------|
+ s | . | | . | | . | | . | | . |
+ e . . . . .
+ t | . | | . | | . | | . | | . |
+ |---------| |---------| |---------| |---------| |---------|
+ 1 | stripe | | stripe | | stripe | | stripe | | stripe | stripe
+ | unit | | unit | | unit | | unit | | unit | 2097151
+ | 10485755| | 10485756| | 10485757| | 10485758| | 10485759|
+ \=========/ \=========/ \=========/ \=========/ \=========/
+
+ _________ _________ _________ _________ _________
+ /object 10\ /object 11\ /object 12\ /object 13\ /object 14\
+ +=========+ +=========+ +=========+ +=========+ +=========+
+ | stripe | | stripe | | stripe | | stripe | | stripe | stripe
+ o | unit | | unit | | unit | | unit | | unit | 2097152
+ b | 10485760| | 10485761| | 10485762| | 10485763| | 10485764|
+ j |---------| |---------| |---------| |---------| |---------|
+ e | stripe | | stripe | | stripe | | stripe | | stripe | stripe
+ c | unit | | unit | | unit | | unit | | unit | 2097153
+ t | 10485765| | 10485766| | 10485767| | 10485768| | 10485769|
+ |---------| |---------| |---------| |---------| |---------|
+ s | . | | . | | . | | . | | . |
+ e . . . . .
+ t | . | | . | | . | | . | | . |
+ |---------| |---------| |---------| |---------| |---------|
+ 2 | stripe | | stripe | | stripe | | stripe | | stripe | stripe
+ | unit | | unit | | unit | | unit | | unit | 3051756
+ | 15258780| | 15258781| | 15258782| | 15258783| | 15258784|
+ |---------| |---------| |---------| |---------| |---------|
+ | stripe | | stripe | | stripe | | stripe | | (partial| (partial
+ | unit | | unit | | unit | | unit | | stripe | stripe
+ | 15258785| | 15258786| | 15258787| | 15258788| | unit) | 3051757)
+ \=========/ \=========/ \=========/ \=========/ \=========/
diff --git a/src/ceph/doc/dev/freebsd.rst b/src/ceph/doc/dev/freebsd.rst
new file mode 100644
index 0000000..5a401ee
--- /dev/null
+++ b/src/ceph/doc/dev/freebsd.rst
@@ -0,0 +1,58 @@
+==============================
+FreeBSD Implementation details
+==============================
+
+
+Disk layout
+-----------
+
+Current implementation works on ZFS pools
+
+* created in /var/lib/ceph
+* One ZFS pool per OSD, like::
+
+ gpart create -s GPT ada1
+ gpart add -t freebsd-zfs -l osd1 ada1
+ zpool create -o mountpoint=/var/lib/ceph/osd/osd.1 osd1 gpt/osd1
+
+* Maybe add some cache and log (ZIL)? Assuming that ada2 is an SSD::
+
+ gpart create -s GPT ada2
+ gpart add -t freebsd-zfs -l osd1-log -s 1G ada2
+ zpool add osd1 log gpt/osd1-log
+ gpart add -t freebsd-zfs -l osd1-cache -s 10G ada2
+ zpool add osd1 cache gpt/osd1-cache
+
+* Note: *UFS2 does not allow large xattrs*
+
+
+Configuration
+-------------
+
+As per the FreeBSD defaults, extra software goes into ``/usr/local/``, which
+means that the default location for ``ceph.conf`` is
+``/usr/local/etc/ceph/ceph.conf``. The smartest thing to do is to create a softlink
+from ``/etc/ceph`` to ``/usr/local/etc/ceph``::
+
+ ln -s /usr/local/etc/ceph /etc/ceph
+
+A sample file is provided in ``/usr/local/share/doc/ceph/sample.ceph.conf``
+
+
+MON creation
+------------
+
+Monitors are created by following the manual creation steps on::
+
+ http://docs.ceph.com/docs/master/install/manual-deployment/
+
+
+OSD creation
+------------
+
+OSDs can be created with ``ceph-disk``::
+
+ ceph-disk prepare /var/lib/ceph/osd/osd1
+ ceph-disk activate /var/lib/ceph/osd/osd1
+
+And things should automagically work out.
diff --git a/src/ceph/doc/dev/generatedocs.rst b/src/ceph/doc/dev/generatedocs.rst
new file mode 100644
index 0000000..6badef0
--- /dev/null
+++ b/src/ceph/doc/dev/generatedocs.rst
@@ -0,0 +1,56 @@
+Building Ceph Documentation
+===========================
+
+Ceph utilizes Python's Sphinx documentation tool. For details on
+the Sphinx documentation tool, refer to `The Sphinx Documentation Tool <http://sphinx.pocoo.org/>`_.
+
+To build the Ceph documentation set, you must:
+
+1. Clone the Ceph repository
+2. Install the required tools
+3. Build the documents
+
+Clone the Ceph Repository
+-------------------------
+
+To clone the Ceph repository, you must have ``git`` installed
+on your local host. To install ``git``, execute::
+
+ sudo apt-get install git
+
+To clone the Ceph repository, execute::
+
+ git clone git://github.com/ceph/ceph
+
+You should have a full copy of the Ceph repository.
+
+
+Install the Required Tools
+--------------------------
+
+To build the Ceph documentation, some dependencies are required.
+To know what packages are needed, you can launch this command::
+
+ cd ceph
+ admin/build-doc
+
+If dependencies are missing, the command above will fail
+with a message suggesting a command to install all
+missing dependencies.
+
+
+Build the Documents
+-------------------
+
+Once you have installed all the dependencies, execute the build (the
+same command as above)::
+
+ cd ceph
+ admin/build-doc
+
+Once you have built the documentation set, you may navigate to the output directory to view it::
+
+ cd build-doc/output
+
+There should be an ``html`` directory and a ``man`` directory containing documentation
+in HTML and manpage formats respectively.
diff --git a/src/ceph/doc/dev/index-old.rst b/src/ceph/doc/dev/index-old.rst
new file mode 100644
index 0000000..8192516
--- /dev/null
+++ b/src/ceph/doc/dev/index-old.rst
@@ -0,0 +1,42 @@
+:orphan:
+
+==================================
+ Internal developer documentation
+==================================
+
+.. note:: If you're looking for how to use Ceph as a library from your
+ own software, please see :doc:`/api/index`.
+
+You can start a development mode Ceph cluster, after compiling the source, with::
+
+ cd src
+ install -d -m0755 out dev/osd0
+ ./vstart.sh -n -x -l
+ # check that it's there
+ ./ceph health
+
+.. todo:: vstart is woefully undocumented and full of sharp sticks to poke yourself with.
+
+
+.. _mailing-list:
+
+.. rubric:: Mailing list
+
+The official development email list is ``ceph-devel@vger.kernel.org``. Subscribe by sending
+a message to ``majordomo@vger.kernel.org`` with the line::
+
+ subscribe ceph-devel
+
+in the body of the message.
+
+
+.. rubric:: Contents
+
+.. toctree::
+ :glob:
+
+ *
+ osd_internals/index*
+ mds_internals/index*
+ radosgw/index*
+ ceph-volume/index*
diff --git a/src/ceph/doc/dev/index.rst b/src/ceph/doc/dev/index.rst
new file mode 100644
index 0000000..b76f2f2
--- /dev/null
+++ b/src/ceph/doc/dev/index.rst
@@ -0,0 +1,1558 @@
+============================================
+Contributing to Ceph: A Guide for Developers
+============================================
+
+:Author: Loic Dachary
+:Author: Nathan Cutler
+:License: Creative Commons Attribution-ShareAlike (CC BY-SA)
+
+.. note:: The old (pre-2016) developer documentation has been moved to :doc:`/dev/index-old`.
+
+.. contents::
+ :depth: 3
+
+Introduction
+============
+
+This guide has two aims. First, it should lower the barrier to entry for
+software developers who wish to get involved in the Ceph project. Second,
+it should serve as a reference for Ceph developers.
+
+We assume that readers are already familiar with Ceph (the distributed
+object store and file system designed to provide excellent performance,
+reliability and scalability). If not, please refer to the `project website`_
+and especially the `publications list`_.
+
+.. _`project website`: http://ceph.com
+.. _`publications list`: https://ceph.com/resources/publications/
+
+Since this document is to be consumed by developers, who are assumed to
+have Internet access, topics covered elsewhere, either within the Ceph
+documentation or elsewhere on the web, are treated by linking. If you
+notice that a link is broken or if you know of a better link, please
+`report it as a bug`_.
+
+.. _`report it as a bug`: http://tracker.ceph.com/projects/ceph/issues/new
+
+Essentials (tl;dr)
+==================
+
+This chapter presents essential information that every Ceph developer needs
+to know.
+
+Leads
+-----
+
+The Ceph project is led by Sage Weil. In addition, each major project
+component has its own lead. The following table shows all the leads and
+their nicks on `GitHub`_:
+
+.. _github: https://github.com/
+
+========= ================ =============
+Scope Lead GitHub nick
+========= ================ =============
+Ceph Sage Weil liewegas
+RADOS Samuel Just athanatos
+RGW Yehuda Sadeh yehudasa
+RBD Jason Dillaman dillaman
+CephFS Patrick Donnelly batrick
+Build/Ops Ken Dreyer ktdreyer
+========= ================ =============
+
+The Ceph-specific acronyms in the table are explained in
+:doc:`/architecture`.
+
+History
+-------
+
+See the `History chapter of the Wikipedia article`_.
+
+.. _`History chapter of the Wikipedia article`: https://en.wikipedia.org/wiki/Ceph_%28software%29#History
+
+Licensing
+---------
+
+Ceph is free software.
+
+Unless stated otherwise, the Ceph source code is distributed under the terms of
+the LGPL2.1. For full details, see `the file COPYING in the top-level
+directory of the source-code tree`_.
+
+.. _`the file COPYING in the top-level directory of the source-code tree`:
+ https://github.com/ceph/ceph/blob/master/COPYING
+
+Source code repositories
+------------------------
+
+The source code of Ceph lives on `GitHub`_ in a number of repositories below
+the `Ceph "organization"`_.
+
+.. _`Ceph "organization"`: https://github.com/ceph
+
+To make a meaningful contribution to the project as a developer, a working
+knowledge of git_ is essential.
+
+.. _git: https://git-scm.com/documentation
+
+Although the `Ceph "organization"`_ includes several software repositories,
+this document covers only one: https://github.com/ceph/ceph.
+
+Redmine issue tracker
+---------------------
+
+Although `GitHub`_ is used for code, Ceph-related issues (Bugs, Features,
+Backports, Documentation, etc.) are tracked at http://tracker.ceph.com,
+which is powered by `Redmine`_.
+
+.. _Redmine: http://www.redmine.org
+
+The tracker has a Ceph project with a number of subprojects loosely
+corresponding to the various architectural components (see
+:doc:`/architecture`).
+
+Mere `registration`_ in the tracker automatically grants permissions
+sufficient to open new issues and comment on existing ones.
+
+.. _registration: http://tracker.ceph.com/account/register
+
+To report a bug or propose a new feature, `jump to the Ceph project`_ and
+click on `New issue`_.
+
+.. _`jump to the Ceph project`: http://tracker.ceph.com/projects/ceph
+.. _`New issue`: http://tracker.ceph.com/projects/ceph/issues/new
+
+Mailing list
+------------
+
+Ceph development email discussions take place on the mailing list
+``ceph-devel@vger.kernel.org``. The list is open to all. Subscribe by
+sending a message to ``majordomo@vger.kernel.org`` with the line: ::
+
+ subscribe ceph-devel
+
+in the body of the message.
+
+There are also `other Ceph-related mailing lists`_.
+
+.. _`other Ceph-related mailing lists`: https://ceph.com/irc/
+
+IRC
+---
+
+In addition to mailing lists, the Ceph community also communicates in real
+time using `Internet Relay Chat`_.
+
+.. _`Internet Relay Chat`: http://www.irchelp.org/
+
+See https://ceph.com/irc/ for how to set up your IRC
+client and a list of channels.
+
+Submitting patches
+------------------
+
+The canonical instructions for submitting patches are contained in the
+`the file CONTRIBUTING.rst in the top-level directory of the source-code
+tree`_. There may be some overlap between this guide and that file.
+
+.. _`the file CONTRIBUTING.rst in the top-level directory of the source-code tree`:
+ https://github.com/ceph/ceph/blob/master/CONTRIBUTING.rst
+
+All newcomers are encouraged to read that file carefully.
+
+Building from source
+--------------------
+
+See instructions at :doc:`/install/build-ceph`.
+
+Using ccache to speed up local builds
+-------------------------------------
+
+Rebuilds of the Ceph source tree can benefit significantly from the use of
+`ccache`_. When switching between branches, one might see build failures on
+certain older branches, mostly due to stale build artifacts; such rebuilds
+are exactly where ccache helps the most. For a fully clean source tree, one
+could do ::
+
+ $ make clean
+
+ # note the following will nuke everything in the source tree that
+ # isn't tracked by git, so make sure to backup any log files /conf options
+
+ $ git clean -fdx; git submodule foreach git clean -fdx
+
+ccache is available as a package in most distros. To build ceph with ccache one
+can::
+
+ $ cmake -DWITH_CCACHE=ON ..
+
+ccache can also be used to speed up all builds on the system. For more
+details, refer to the `run modes`_ section of the ccache manual. The default
+settings of ``ccache`` can be displayed with ``ccache -s``.
+
+.. note:: It is recommended to raise ``max_size``, the maximum size of the
+ cache (10G by default), to a larger value such as 25G. Refer to the
+ `configuration`_ section of the ccache manual.
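+
+For example, one way to check the current settings and raise the maximum
+cache size is::
+
+ $ ccache -s          # show statistics, including the current max cache size
+ $ ccache -M 25G      # raise the maximum cache size to 25G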
+
+.. _`ccache`: https://ccache.samba.org/
+.. _`run modes`: https://ccache.samba.org/manual.html#_run_modes
+.. _`configuration`: https://ccache.samba.org/manual.html#_configuration
+
+Development-mode cluster
+------------------------
+
+See :doc:`/dev/quick_guide`.
+
+Backporting
+-----------
+
+All bugfixes should be merged to the ``master`` branch before being backported.
+To flag a bugfix for backporting, make sure it has a `tracker issue`_
+associated with it and set the ``Backport`` field to a comma-separated list of
+previous releases (e.g. "hammer,jewel") that you think need the backport.
+The rest (including the actual backporting) will be taken care of by the
+`Stable Releases and Backports`_ team.
+
+.. _`tracker issue`: http://tracker.ceph.com/
+.. _`Stable Releases and Backports`: http://tracker.ceph.com/projects/ceph-releases/wiki
+
+Guidance for use of cluster log
+-------------------------------
+
+If your patches emit messages to the Ceph cluster log, please consult
+this guidance: :doc:`/dev/logging`.
+
+
+What is merged where and when ?
+===============================
+
+Commits are merged into branches according to criteria that change
+during the lifecycle of a Ceph release. This chapter is the inventory
+of what can be merged in which branch at a given point in time.
+
+Development releases (i.e. x.0.z)
+---------------------------------
+
+What ?
+^^^^^^
+
+* features
+* bug fixes
+
+Where ?
+^^^^^^^
+
+Features are merged to the master branch. Bug fixes should be merged
+to the corresponding named branch (e.g. "jewel" for 10.0.z, "kraken"
+for 11.0.z, etc.). However, this is not mandatory - bug fixes can be
+merged to the master branch as well, since the master branch is
+periodically merged to the named branch during the development
+releases phase. In either case, if the bugfix is important it can also
+be flagged for backport to one or more previous stable releases.
+
+When ?
+^^^^^^
+
+After the stable release candidates of the previous release enter
+phase 2 (see below). For example: the "jewel" named branch was
+created when the infernalis release candidates entered phase 2. From
+this point on, master was no longer associated with infernalis. As
+soon as the named branch of the next stable release is created, master
+starts getting periodically merged into it.
+
+Branch merges
+^^^^^^^^^^^^^
+
+* The branch of the stable release is merged periodically into master.
+* The master branch is merged periodically into the branch of the
+ stable release.
+* The master is merged into the branch of the stable release
+ immediately after each development x.0.z release.
+
+Stable release candidates (i.e. x.1.z) phase 1
+----------------------------------------------
+
+What ?
+^^^^^^
+
+* bug fixes only
+
+Where ?
+^^^^^^^
+
+The branch of the stable release (e.g. "jewel" for 10.0.z, "kraken"
+for 11.0.z, etc.) or master. Bug fixes should be merged to the named
+branch corresponding to the stable release candidate (e.g. "jewel" for
+10.1.z) or to master. During this phase, all commits to master will be
+merged to the named branch, and vice versa. In other words, it makes
+no difference whether a commit is merged to the named branch or to
+master - it will make it into the next release candidate either way.
+
+When ?
+^^^^^^
+
+After the first stable release candidate is published, i.e. after the
+x.1.0 tag is set in the release branch.
+
+Branch merges
+^^^^^^^^^^^^^
+
+* The branch of the stable release is merged periodically into master.
+* The master branch is merged periodically into the branch of the
+ stable release.
+* The master is merged into the branch of the stable release
+ immediately after each x.1.z release candidate.
+
+Stable release candidates (i.e. x.1.z) phase 2
+----------------------------------------------
+
+What ?
+^^^^^^
+
+* bug fixes only
+
+Where ?
+^^^^^^^
+
+The branch of the stable release (e.g. "jewel" for 10.0.z, "kraken"
+for 11.0.z, etc.). During this phase, all commits to the named branch
+will be merged into master. Cherry-picking to the named branch during
+release candidate phase 2 is done manually since the official
+backporting process only begins when the release is pronounced
+"stable".
+
+When ?
+^^^^^^
+
+After Sage Weil decides it is time for phase 2 to happen.
+
+Branch merges
+^^^^^^^^^^^^^
+
+* The branch of the stable release is merged periodically into master.
+
+Stable releases (i.e. x.2.z)
+----------------------------
+
+What ?
+^^^^^^
+
+* bug fixes
+* features are sometimes accepted
+* commits should be cherry-picked from master when possible
+* commits that are not cherry-picked from master must be about a bug unique to the stable release
+* see also `the backport HOWTO`_
+
+.. _`the backport HOWTO`:
+ http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO#HOWTO
+
+Where ?
+^^^^^^^
+
+The branch of the stable release (hammer for 0.94.x, infernalis for 9.2.x, etc.).
+
+When ?
+^^^^^^
+
+After the stable release is published, i.e. after the "vx.2.0" tag is
+set in the release branch.
+
+Branch merges
+^^^^^^^^^^^^^
+
+Never
+
+Issue tracker
+=============
+
+See `Redmine issue tracker`_ for a brief introduction to the Ceph Issue Tracker.
+
+Ceph developers use the issue tracker to
+
+1. keep track of issues - bugs, fix requests, feature requests, backport
+requests, etc.
+
+2. communicate with other developers and keep them informed as work
+on the issues progresses.
+
+Issue tracker conventions
+-------------------------
+
+When you start working on an existing issue, it's nice to let the other
+developers know this - to avoid duplication of labor. Typically, this is
+done by changing the :code:`Assignee` field (to yourself) and changing the
+:code:`Status` to *In progress*. Newcomers to the Ceph community typically do not
+have sufficient privileges to update these fields, however: they can
+simply update the issue with a brief note.
+
+.. table:: Meanings of some commonly used statuses
+
+ ================ ===========================================
+ Status Meaning
+ ================ ===========================================
+ New Initial status
+ In Progress Somebody is working on it
+ Need Review Pull request is open with a fix
+ Pending Backport Fix has been merged, backport(s) pending
+ Resolved Fix and backports (if any) have been merged
+ ================ ===========================================
+
+Basic workflow
+==============
+
+The following chart illustrates basic development workflow:
+
+.. ditaa::
+
+ Upstream Code Your Local Environment
+
+ /----------\ git clone /-------------\
+ | Ceph | -------------------------> | ceph/master |
+ \----------/ \-------------/
+ ^ |
+ | | git branch fix_1
+ | git merge |
+ | v
+ /----------------\ git commit --amend /-------------\
+ | make check |---------------------> | ceph/fix_1 |
+ | ceph--qa--suite| \-------------/
+ \----------------/ |
+ ^ | fix changes
+ | | test changes
+ | review | git commit
+ | |
+ | v
+ /--------------\ /-------------\
+ | github |<---------------------- | ceph/fix_1 |
+ | pull request | git push \-------------/
+ \--------------/
+
+Below we present an explanation of this chart. The explanation is written
+with the assumption that you, the reader, are a beginning developer who
+has an idea for a bugfix but does not know exactly how to proceed.
+
+Update the tracker
+------------------
+
+Before you start, you should know the `Issue tracker`_ number of the bug
+you intend to fix. If there is no tracker issue, now is the time to create
+one.
+
+The tracker is there to explain the issue (bug) to your fellow Ceph
+developers and keep them informed as you make progress toward resolution.
+To this end, then, provide a descriptive title as well as sufficient
+information and details in the description.
+
+If you have sufficient tracker permissions, assign the bug to yourself by
+changing the ``Assignee`` field. If your tracker permissions have not yet
+been elevated, simply add a comment to the issue with a short message like
+"I am working on this issue".
+
+Upstream code
+-------------
+
+This section, and the ones that follow, correspond to the nodes in the
+above chart.
+
+The upstream code lives in https://github.com/ceph/ceph.git, which is
+sometimes referred to as the "upstream repo", or simply "upstream". As the
+chart illustrates, we will make a local copy of this code, modify it, test
+our modifications, and submit the modifications back to the upstream repo
+for review.
+
+A local copy of the upstream code is made by
+
+1. forking the upstream repo on GitHub, and
+2. cloning your fork to make a local working copy
+
+See the `the GitHub documentation
+<https://help.github.com/articles/fork-a-repo/#platform-linux>`_ for
+detailed instructions on forking. In short, if your GitHub username is
+"mygithubaccount", your fork of the upstream repo will show up at
+https://github.com/mygithubaccount/ceph. Once you have created your fork,
+you clone it by doing:
+
+.. code::
+
+ $ git clone https://github.com/mygithubaccount/ceph
+
+While it is possible to clone the upstream repo directly, in this case you
+must fork it first. Forking is what enables us to open a `GitHub pull
+request`_.
+
+For more information on using GitHub, refer to `GitHub Help
+<https://help.github.com/>`_.
+
+Local environment
+-----------------
+
+In the local environment created in the previous step, you now have a
+copy of the ``master`` branch in ``remotes/origin/master``. Since the fork
+(https://github.com/mygithubaccount/ceph.git) is frozen in time and the
+upstream repo (https://github.com/ceph/ceph.git, typically abbreviated to
+``ceph/ceph.git``) is updated frequently by other developers, you will need
+to sync your fork periodically. To do this, first add the upstream repo as
+a "remote" and fetch it::
+
+ $ git remote add ceph https://github.com/ceph/ceph.git
+ $ git fetch ceph
+
+Fetching downloads all objects (commits, branches) that were added since
+the last sync. After running these commands, all the branches from
+``ceph/ceph.git`` are downloaded to the local git repo as
+``remotes/ceph/$BRANCH_NAME`` and can be referenced as
+``ceph/$BRANCH_NAME`` in certain git commands.
+
+For example, your local ``master`` branch can be reset to the upstream Ceph
+``master`` branch by doing::
+
+ $ git fetch ceph
+ $ git checkout master
+ $ git reset --hard ceph/master
+
+Finally, the ``master`` branch of your fork can then be synced to upstream
+master by::
+
+ $ git push -u origin master
+
+Bugfix branch
+-------------
+
+Next, create a branch for the bugfix:
+
+.. code::
+
+ $ git checkout master
+ $ git checkout -b fix_1
+ $ git push -u origin fix_1
+
+This creates a ``fix_1`` branch locally and in our GitHub fork. At this
+point, the ``fix_1`` branch is identical to the ``master`` branch, but not
+for long! You are now ready to modify the code.
+
+Fix bug locally
+---------------
+
+At this point, change the status of the tracker issue to "In progress" to
+communicate to the other Ceph developers that you have begun working on a
+fix. If you don't have permission to change that field, your comment that
+you are working on the issue is sufficient.
+
+Possibly, your fix is very simple and requires only minimal testing.
+More likely, it will be an iterative process involving trial and error, not
+to mention skill. An explanation of how to fix bugs is beyond the
+scope of this document. Instead, we focus on the mechanics of the process
+in the context of the Ceph project.
+
+For a detailed discussion of the tools available for validating your bugfixes,
+see the `Testing`_ chapter.
+
+For now, let us just assume that you have finished work on the bugfix and
+that you have tested it and believe it works. Commit the changes to your local
+branch using the ``--signoff`` option::
+
+ $ git commit -as
+
+and push the changes to your fork::
+
+ $ git push origin fix_1
+
+GitHub pull request
+-------------------
+
+The next step is to open a GitHub pull request. The purpose of this step is
+to make your bugfix available to the community of Ceph developers. They
+will review it and may do additional testing on it.
+
+In short, this is the point where you "go public" with your modifications.
+Psychologically, you should be prepared to receive suggestions and
+constructive criticism. Don't worry! In our experience, the Ceph project is
+a friendly place!
+
+If you are uncertain how to use pull requests, you may read
+`this GitHub pull request tutorial`_.
+
+.. _`this GitHub pull request tutorial`:
+ https://help.github.com/articles/using-pull-requests/
+
+For some ideas on what constitutes a "good" pull request, see
+the `Git Commit Good Practice`_ article at the `OpenStack Project Wiki`_.
+
+.. _`Git Commit Good Practice`: https://wiki.openstack.org/wiki/GitCommitMessages
+.. _`OpenStack Project Wiki`: https://wiki.openstack.org/wiki/Main_Page
+
+Once your pull request (PR) is opened, update the `Issue tracker`_ by
+adding a comment to the bug pointing the other developers to your PR. The
+update can be as simple as::
+
+ *PR*: https://github.com/ceph/ceph/pull/$NUMBER_OF_YOUR_PULL_REQUEST
+
+Automated PR validation
+-----------------------
+
+When your PR hits GitHub, the Ceph project's `Continuous Integration (CI)
+<https://en.wikipedia.org/wiki/Continuous_integration>`_
+infrastructure will test it automatically. At the time of this writing
+(March 2016), the automated CI testing included a test to check that the
+commits in the PR are properly signed (see `Submitting patches`_) and a
+`make check`_ test.
+
+The latter, `make check`_, builds the PR and runs it through a battery of
+tests. These tests run on machines operated by the Ceph Continuous
+Integration (CI) team. When the tests complete, the result will be shown
+on GitHub in the pull request itself.
+
+You can (and should) also test your modifications before you open a PR.
+Refer to the `Testing`_ chapter for details.
+
+Notes on PR make check test
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The GitHub `make check`_ test is driven by a Jenkins instance.
+
+Jenkins merges the PR branch into the latest version of the base branch before
+starting the build, so you don't have to rebase the PR to pick up any fixes.
+
+You can trigger the PR tests at any time by adding a comment to the PR - the
+comment should contain the string "test this please". Since a human subscribed
+to the PR might interpret that as a request for him or her to test the PR, it's
+good to write the request as "Jenkins, test this please".
+
+The `make check`_ log is the place to go if there is a failure and you're not
+sure what caused it. To reach it, first click on "details" (next to the `make
+check`_ test in the PR) to get into the Jenkins web GUI, and then click on
+"Console Output" (on the left).
+
+Jenkins is set up to grep the log for strings known to have been associated
+with `make check`_ failures in the past. However, there is no guarantee that
+the strings are associated with any given `make check`_ failure. You have to
+dig into the log to be sure.
+
+Integration tests AKA ceph-qa-suite
+-----------------------------------
+
+Since Ceph is a complex beast, it may also be necessary to test your fix to
+see how it behaves on real clusters running either on real or virtual
+hardware. Tests designed for this purpose live in the `ceph/qa
+sub-directory`_ and are run via the `teuthology framework`_.
+
+.. _`ceph/qa sub-directory`: https://github.com/ceph/ceph/tree/master/qa/
+.. _`teuthology repository`: https://github.com/ceph/teuthology
+.. _`teuthology framework`: https://github.com/ceph/teuthology
+
+If you have access to an OpenStack tenant, you are encouraged to run the
+integration tests yourself using `ceph-workbench ceph-qa-suite`_,
+and to post the test results to the PR.
+
+.. _`ceph-workbench ceph-qa-suite`: http://ceph-workbench.readthedocs.org/
+
+The Ceph community has access to the `Sepia lab
+<http://ceph.github.io/sepia/>`_ where integration tests can be run on
+real hardware. Other developers may add tags like "needs-qa" to your PR.
+This allows PRs that need testing to be merged into a single branch and
+tested all at the same time. Since teuthology suites can take hours
+(even days in some cases) to run, this can save a lot of time.
+
+Integration testing is discussed in more detail in the `Testing`_ chapter.
+
+Code review
+-----------
+
+Once your bugfix has been thoroughly tested, or even during this process,
+it will be subjected to code review by other developers. This typically
+takes the form of correspondence in the PR itself, but can be supplemented
+by discussions on `IRC`_ and the `Mailing list`_.
+
+Amending your PR
+----------------
+
+While your PR is going through `Testing`_ and `Code review`_, you can
+modify it at any time by editing files in your local branch.
+
+After the changes are committed locally (to the ``fix_1`` branch in our
+example), they need to be pushed to GitHub so they appear in the PR.
+
+Modifying the PR is done by adding commits to the ``fix_1`` branch upon
+which it is based, often followed by rebasing to modify the branch's git
+history. See `this tutorial
+<https://www.atlassian.com/git/tutorials/rewriting-history>`_ for a good
+introduction to rebasing. When you are done with your modifications, you
+will need to force push your branch with:
+
+.. code::
+
+ $ git push --force origin fix_1
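+
+Putting it together, one possible amend cycle looks like this (a sketch; it
+assumes the ``ceph`` remote added earlier and uses an interactive rebase to
+squash fixup commits)::
+
+ $ git checkout fix_1
+ $ git commit -as                # commit your new changes
+ $ git fetch ceph
+ $ git rebase -i ceph/master     # reorder/squash commits as needed
+ $ git push --force origin fix_1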
+
+Merge
+-----
+
+The bugfixing process culminates when one of the project leads decides to
+merge your PR.
+
+When this happens, it is a signal for you (or the lead who merged the PR)
+to change the `Issue tracker`_ status to "Resolved". Some issues may be
+flagged for backporting, in which case the status should be changed to
+"Pending Backport" (see the `Backporting`_ chapter for details).
+
+
+Testing
+=======
+
+Ceph has two types of tests: `make check`_ tests and integration tests.
+The former are run via `GNU Make <https://www.gnu.org/software/make/>`_,
+and the latter are run via the `teuthology framework`_. The following two
+chapters examine the `make check`_ and integration tests in detail.
+
+.. _`make check`:
+
+Testing - make check
+====================
+
+After compiling Ceph, the `make check`_ command can be used to run the
+code through a battery of tests covering various aspects of Ceph. For
+inclusion in `make check`_, a test must:
+
+* bind ports that do not conflict with other tests
+* not require root access
+* not require more than one machine to run
+* complete within a few minutes
+
+While it is possible to run `make check`_ directly, it can be tricky to
+correctly set up your environment. Fortunately, a script is provided to
+make it easier to run `make check`_ on your code. It can be run from the
+top-level directory of the Ceph source tree by doing::
+
+ $ ./run-make-check.sh
+
+You will need a minimum of 8GB of RAM and 32GB of free disk space for this
+command to complete successfully on x86_64 (other architectures may have
+different constraints). Depending on your hardware, it can take from 20
+minutes to three hours to complete, but it's worth the wait.
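+
+If you later want to re-run only a subset of the tests, one possibility
+(assuming a cmake build, where the individual tests are registered with
+ctest) is to invoke ctest directly from the build directory with a name
+filter, for example::
+
+ $ cd build
+ $ ctest -R unittest --output-on-failure  # run only tests whose name matches "unittest"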
+
+Caveats
+-------
+
+1. Unlike the various Ceph daemons and ``ceph-fuse``, the `make check`_ tests
+ are linked against the default memory allocator (glibc) unless explicitly
+ linked against something else. This enables tools like valgrind to be used
+ in the tests.
+
+Testing - integration tests
+===========================
+
+When a test requires multiple machines, root access or lasts for a
+longer time (for example, to simulate a realistic Ceph deployment), it
+is deemed to be an integration test. Integration tests are organized into
+"suites", which are defined in the `ceph/qa sub-directory`_ and run with
+the ``teuthology-suite`` command.
+
+The ``teuthology-suite`` command is part of the `teuthology framework`_.
+In the sections that follow we attempt to provide a detailed introduction
+to that framework from the perspective of a beginning Ceph developer.
+
+Teuthology consumes packages
+----------------------------
+
+It may take some time to understand the significance of this fact, but it
+is `very` significant. It means that automated tests can be conducted on
+multiple platforms using the same packages (RPM, DEB) that can be
+installed on any machine running those platforms.
+
+Teuthology has a `list of platforms that it supports
+<https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_ (as
+of March 2016 the list consisted of "CentOS 7.2" and "Ubuntu 14.04"). It
+expects to be provided pre-built Ceph packages for these platforms.
+Teuthology deploys these platforms on machines (bare-metal or
+cloud-provisioned), installs the packages on them, and deploys Ceph
+clusters on them - all as called for by the test.
+
+The nightlies
+-------------
+
+A number of integration tests are run on a regular basis in the `Sepia
+lab`_ against the official Ceph repositories (on the ``master`` development
+branch and the stable branches). Traditionally, these tests are called "the
+nightlies" because the Ceph core developers used to live and work in
+the same time zone and from their perspective the tests were run overnight.
+
+The results of the nightlies are published at http://pulpito.ceph.com/ and
+http://pulpito.ovh.sepia.ceph.com:8081/. The developer's nick shows up in the
+test results URL and in the first column of the Pulpito dashboard. The
+results are also reported on the `ceph-qa mailing list
+<https://ceph.com/irc/>`_ for analysis.
+
+Suites inventory
+----------------
+
+The ``suites`` directory of the `ceph/qa sub-directory`_ contains
+all the integration tests, for all the Ceph components.
+
+`ceph-deploy <https://github.com/ceph/ceph/tree/master/qa/suites/ceph-deploy>`_
+ install a Ceph cluster with ``ceph-deploy`` (`ceph-deploy man page`_)
+
+`ceph-disk <https://github.com/ceph/ceph/tree/master/qa/suites/ceph-disk>`_
+ verify init scripts (upstart etc.) and udev integration with
+ ``ceph-disk`` (`ceph-disk man page`_), with and without `dmcrypt
+ <https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt>`_ support.
+
+`dummy <https://github.com/ceph/ceph/tree/master/qa/suites/dummy>`_
+ get a machine, do nothing and return success (commonly used to
+ verify the integration testing infrastructure works as expected)
+
+`fs <https://github.com/ceph/ceph/tree/master/qa/suites/fs>`_
+ test CephFS
+
+`kcephfs <https://github.com/ceph/ceph/tree/master/qa/suites/kcephfs>`_
+ test the CephFS kernel module
+
+`krbd <https://github.com/ceph/ceph/tree/master/qa/suites/krbd>`_
+ test the RBD kernel module
+
+`powercycle <https://github.com/ceph/ceph/tree/master/qa/suites/powercycle>`_
+ verify the Ceph cluster behaves when machines are powered off
+ and on again
+
+`rados <https://github.com/ceph/ceph/tree/master/qa/suites/rados>`_
+ run Ceph clusters including OSDs and MONs, under various conditions of
+ stress
+
+`rbd <https://github.com/ceph/ceph/tree/master/qa/suites/rbd>`_
+ run RBD tests using actual Ceph clusters, with and without qemu
+
+`rgw <https://github.com/ceph/ceph/tree/master/qa/suites/rgw>`_
+ run RGW tests using actual Ceph clusters
+
+`smoke <https://github.com/ceph/ceph/tree/master/qa/suites/smoke>`_
+ run tests that exercise the Ceph API with an actual Ceph cluster
+
+`teuthology <https://github.com/ceph/ceph/tree/master/qa/suites/teuthology>`_
+ verify that teuthology can run integration tests, with and without OpenStack
+
+`upgrade <https://github.com/ceph/ceph/tree/master/qa/suites/upgrade>`_
+ for various versions of Ceph, verify that upgrades can happen
+ without disrupting an ongoing workload
+
+.. _`ceph-deploy man page`: ../../man/8/ceph-deploy
+.. _`ceph-disk man page`: ../../man/8/ceph-disk
+
+teuthology-describe-tests
+-------------------------
+
+In February 2016, a new feature called ``teuthology-describe-tests`` was
+added to the `teuthology framework`_ to facilitate documentation and better
+understanding of integration tests (`feature announcement
+<http://article.gmane.org/gmane.comp.file-systems.ceph.devel/29287>`_).
+
+The upshot is that tests can be documented by embedding ``meta:``
+annotations in the yaml files used to define the tests. The results can be
+seen in the `ceph-qa-suite wiki
+<http://tracker.ceph.com/projects/ceph-qa-suite/wiki/>`_.
+
+Since this is a new feature, many yaml files have yet to be annotated.
+Developers are encouraged to improve the documentation, in terms of both
+coverage and quality.
+
+How integration tests are run
+-----------------------------
+
+Given that - as a new Ceph developer - you will typically not have access
+to the `Sepia lab`_, you may rightly ask how you can run the integration
+tests in your own environment.
+
+One option is to set up a teuthology cluster on bare metal. Though this is
+a non-trivial task, it `is` possible. Here are `some notes
+<http://docs.ceph.com/teuthology/docs/LAB_SETUP.html>`_ to get you started
+if you decide to go this route.
+
+If you have access to an OpenStack tenant, you have another option: the
+`teuthology framework`_ has an OpenStack backend, which is documented `here
+<https://github.com/dachary/teuthology/tree/openstack#openstack-backend>`__.
+This OpenStack backend can build packages from a given git commit or
+branch, provision VMs, install the packages and run integration tests
+on those VMs. This process is controlled using a tool called
+`ceph-workbench ceph-qa-suite`_. This tool also automates publishing of
+test results at http://teuthology-logs.public.ceph.com.
+
+Running integration tests on your code contributions and publishing the
+results allows reviewers to verify that changes to the code base do not
+cause regressions, or to analyze test failures when they do occur.
+
+Every teuthology cluster, whether bare-metal or cloud-provisioned, has a
+so-called "teuthology machine" from which tests suites are triggered using the
+``teuthology-suite`` command.
+
+A detailed and up-to-date description of each `teuthology-suite`_ option is
+available by running the following command on the teuthology machine::
+
+ $ teuthology-suite --help
+
+.. _teuthology-suite: http://docs.ceph.com/teuthology/docs/teuthology.suite.html
+
+How integration tests are defined
+---------------------------------
+
+Integration tests are defined by yaml files found in the ``suites``
+subdirectory of the `ceph/qa sub-directory`_ and implemented by python
+code found in the ``tasks`` subdirectory. Some tests ("standalone tests")
+are defined in a single yaml file, while other tests are defined by a
+directory tree containing yaml files that are combined, at runtime, into a
+larger yaml file.
+
+Reading a standalone test
+-------------------------
+
+Let us first examine a standalone test, or "singleton".
+
+Here is a commented example using the integration test
+`rados/singleton/all/admin-socket.yaml
+<https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/admin-socket.yaml>`_
+::
+
+ roles:
+ - - mon.a
+ - osd.0
+ - osd.1
+ tasks:
+ - install:
+ - ceph:
+ - admin_socket:
+ osd.0:
+ version:
+ git_version:
+ help:
+ config show:
+ config set filestore_dump_file /tmp/foo:
+ perf dump:
+ perf schema:
+
+The ``roles`` array determines the composition of the cluster (how
+many MONs, OSDs, etc.) on which this test is designed to run, as well
+as how these roles will be distributed over the machines in the
+testing cluster. In this case, there is only one element in the
+top-level array: therefore, only one machine is allocated to the
+test. The nested array declares that this machine shall run a MON with
+id ``a`` (that is the ``mon.a`` in the list of roles) and two OSDs
+(``osd.0`` and ``osd.1``).
+
+The body of the test is in the ``tasks`` array: each element is
+evaluated in order, causing the corresponding python file found in the
+``tasks`` subdirectory of the `teuthology repository`_ or
+`ceph/qa sub-directory`_ to be run. "Running" in this case means calling
+the ``task()`` function defined in that file.
+
+In this case, the `install
+<https://github.com/ceph/teuthology/blob/master/teuthology/task/install/__init__.py>`_
+task comes first. It installs the Ceph packages on each machine (as
+defined by the ``roles`` array). A full description of the ``install``
+task is `found in the python file
+<https://github.com/ceph/teuthology/blob/master/teuthology/task/install/__init__.py>`_
+(search for "def task").
+
+The ``ceph`` task, which is documented `here
+<https://github.com/ceph/ceph/blob/master/qa/tasks/ceph.py>`__ (again,
+search for "def task"), starts OSDs and MONs (and possibly MDSs as well)
+as required by the ``roles`` array. In this example, it will start one MON
+(``mon.a``) and two OSDs (``osd.0`` and ``osd.1``), all on the same
+machine. Control moves to the next task when the Ceph cluster reaches
+``HEALTH_OK`` state.
+
+The next task is ``admin_socket`` (`source code
+<https://github.com/ceph/ceph/blob/master/qa/tasks/admin_socket.py>`_).
+The parameter of the ``admin_socket`` task (and any other task) is a
+structure which is interpreted as documented in the task. In this example
+the parameter is a set of commands to be sent to the admin socket of
+``osd.0``. The task verifies that each of them returns on success (i.e.
+exit code zero).
+
+This test can be run with::
+
+ $ teuthology-suite --suite rados/singleton/all/admin-socket.yaml fs/ext4.yaml
+
+Test descriptions
+-----------------
+
+Each test has a "test description", which is similar to a directory path,
+but not the same. In the case of a standalone test, like the one in
+`Reading a standalone test`_, the test description is identical to the
+relative path (starting from the ``suites/`` directory of the
+`ceph/qa sub-directory`_) of the yaml file defining the test.
+
+Much more commonly, tests are defined not by a single yaml file, but by a
+`directory tree of yaml files`. At runtime, the tree is walked and all yaml
+files (facets) are combined into larger yaml "programs" that define the
+tests. A full listing of the yaml defining the test is included at the
+beginning of every test log.
+
+In these cases, the description of each test consists of the
+subdirectory under `suites/
+<https://github.com/ceph/ceph/tree/master/qa/suites>`_ containing the
+yaml facets, followed by an expression in curly braces (``{}``) consisting of
+a list of yaml facets in order of concatenation. For instance the
+test description::
+
+ ceph-disk/basic/{distros/centos_7.0.yaml tasks/ceph-disk.yaml}
+
+signifies the concatenation of two files:
+
+* ceph-disk/basic/distros/centos_7.0.yaml
+* ceph-disk/basic/tasks/ceph-disk.yaml
+
+How are tests built from directories?
+-------------------------------------
+
+As noted in the previous section, most tests are not defined in a single
+yaml file, but rather as a `combination` of files collected from a
+directory tree within the ``suites/`` subdirectory of the `ceph/qa sub-directory`_.
+
+The set of all tests defined by a given subdirectory of ``suites/`` is
+called an "integration test suite", or a "teuthology suite".
+
+Combination of yaml facets is controlled by special files (``%`` and
+``+``) that are placed within the directory tree and can be thought of as
+operators. The ``%`` file is the "convolution" operator and ``+``
+signifies concatenation.
+
+Convolution operator
+--------------------
+
+The convolution operator, implemented as an empty file called ``%``, tells
+teuthology to construct a test matrix from yaml facets found in
+subdirectories below the directory containing the operator.
+
+For example, the `ceph-disk suite
+<https://github.com/ceph/ceph/tree/jewel/qa/suites/ceph-disk/>`_ is
+defined by the ``suites/ceph-disk/`` tree, which consists of the files and
+subdirectories in the following structure::
+
+ directory: ceph-disk/basic
+ file: %
+ directory: distros
+ file: centos_7.0.yaml
+ file: ubuntu_14.04.yaml
+ directory: tasks
+ file: ceph-disk.yaml
+
+This is interpreted as a 2x1 matrix consisting of two tests:
+
+1. ceph-disk/basic/{distros/centos_7.0.yaml tasks/ceph-disk.yaml}
+2. ceph-disk/basic/{distros/ubuntu_14.04.yaml tasks/ceph-disk.yaml}
+
+i.e. the concatenation of centos_7.0.yaml and ceph-disk.yaml and
+the concatenation of ubuntu_14.04.yaml and ceph-disk.yaml, respectively.
+In human terms, this means that the task found in ``ceph-disk.yaml`` is
+intended to run on both CentOS 7.0 and Ubuntu 14.04.
+
+Without the special file percent (``%``), the ``ceph-disk`` tree would be interpreted as
+three standalone tests:
+
+* ceph-disk/basic/distros/centos_7.0.yaml
+* ceph-disk/basic/distros/ubuntu_14.04.yaml
+* ceph-disk/basic/tasks/ceph-disk.yaml
+
+(which would of course be wrong in this case).
+
+Referring to the `ceph/qa sub-directory`_, you will notice that the
+``centos_7.0.yaml`` and ``ubuntu_14.04.yaml`` files in the
+``suites/ceph-disk/basic/distros/`` directory are implemented as symlinks.
+By using symlinks instead of copying, a single file can appear in multiple
+suites. This eases the maintenance of the test framework as a whole.
+
+All the tests generated from the ``suites/ceph-disk/`` directory tree
+(also known as the "ceph-disk suite") can be run with::
+
+ $ teuthology-suite --suite ceph-disk
+
+An individual test from the `ceph-disk suite`_ can be run by adding the
+``--filter`` option::
+
+ $ teuthology-suite \
+ --suite ceph-disk/basic \
+ --filter 'ceph-disk/basic/{distros/ubuntu_14.04.yaml tasks/ceph-disk.yaml}'
+
+.. note:: To run a standalone test like the one in `Reading a standalone
+ test`_, ``--suite`` alone is sufficient. If you want to run a single
+ test from a suite that is defined as a directory tree, ``--suite`` must
+ be combined with ``--filter``. This is because the ``--suite`` option
+ understands POSIX relative paths only.
+
+Concatenation operator
+----------------------
+
+For even greater flexibility in sharing yaml files between suites, the
+special file plus (``+``) can be used to concatenate files within a
+directory. For instance, consider the `suites/rbd/thrash
+<https://github.com/ceph/ceph/tree/master/qa/suites/rbd/thrash>`_
+tree::
+
+ directory: rbd/thrash
+ file: %
+ directory: clusters
+ file: +
+ file: fixed-2.yaml
+ file: openstack.yaml
+ directory: workloads
+ file: rbd_api_tests_copy_on_read.yaml
+ file: rbd_api_tests.yaml
+
+This creates two tests:
+
+* rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}
+* rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests.yaml}
+
+Because the ``clusters/`` subdirectory contains the special file plus
+(``+``), all the other files in that subdirectory (``fixed-2.yaml`` and
+``openstack.yaml`` in this case) are concatenated together
+and treated as a single file. Without the special file plus, they would
+have been convolved with the files from the workloads directory to create
+a 2x2 matrix:
+
+* rbd/thrash/{clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}
+* rbd/thrash/{clusters/openstack.yaml workloads/rbd_api_tests.yaml}
+* rbd/thrash/{clusters/fixed-2.yaml workloads/rbd_api_tests_copy_on_read.yaml}
+* rbd/thrash/{clusters/fixed-2.yaml workloads/rbd_api_tests.yaml}
+
+The ``clusters/fixed-2.yaml`` file is shared among many suites to
+define the following ``roles``::
+
+ roles:
+ - [mon.a, mon.c, osd.0, osd.1, osd.2, client.0]
+ - [mon.b, osd.3, osd.4, osd.5, client.1]
+
+The ``rbd/thrash`` suite as defined above, consisting of two tests,
+can be run with::
+
+ $ teuthology-suite --suite rbd/thrash
+
+A single test from the rbd/thrash suite can be run by adding the
+``--filter`` option::
+
+ $ teuthology-suite \
+ --suite rbd/thrash \
+ --filter 'rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}'
+
+Filtering tests by their description
+------------------------------------
+
+When a few jobs fail and need to be run again, the ``--filter`` option
+can be used to select tests with a matching description. For instance, if the
+``rados`` suite fails the `all/peer.yaml <https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/peer.yaml>`_ test, the following will only run the tests that contain this file::
+
+ teuthology-suite --suite rados --filter all/peer.yaml
+
+The ``--filter-out`` option does the opposite (it matches tests that do
+`not` contain a given string), and can be combined with the ``--filter``
+option.
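+
+For instance, to re-run the failed ``all/peer.yaml`` jobs while excluding any
+test whose description contains a given string (``centos`` here is just an
+illustrative substring), the two options might be combined like this::
+
+ teuthology-suite --suite rados --filter all/peer.yaml --filter-out centos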
+
+Both ``--filter`` and ``--filter-out`` take a comma-separated list of strings (which
+means the comma character is implicitly forbidden in filenames found in the
+`ceph/qa sub-directory`_). For instance::
+
+ teuthology-suite --suite rados --filter all/peer.yaml,all/rest-api.yaml
+
+will run tests that contain either
+`all/peer.yaml <https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/peer.yaml>`_
+or
+`all/rest-api.yaml <https://github.com/ceph/ceph/blob/master/qa/suites/rados/singleton/all/rest-api.yaml>`_
+
+Each string is looked up anywhere in the test description and has to
+be an exact match: they are not regular expressions.
+
+Reducing the number of tests
+----------------------------
+
+The ``rados`` suite generates thousands of tests out of a few hundred
+files. This happens because teuthology constructs test matrices from
+subdirectories wherever it encounters a file named ``%``. For instance,
+all tests in the `rados/basic suite
+<https://github.com/ceph/ceph/tree/master/qa/suites/rados/basic>`_
+run with different messenger types: ``simple``, ``async`` and
+``random``, because they are combined (via the special file ``%``) with
+the `msgr directory
+<https://github.com/ceph/ceph/tree/master/qa/suites/rados/basic/msgr>`_.
+
+All integration tests are required to be run before a Ceph release is published.
+When merely verifying whether a contribution can be merged without
+risking a trivial regression, it is enough to run a subset. The ``--subset`` option can be used to
+reduce the number of tests that are triggered. For instance::
+
+ teuthology-suite --suite rados --subset 0/4000
+
+will run as few tests as possible. The tradeoff in this case is that
+not all combinations of test variations will be tested together,
+but no matter how small a ratio is provided in the ``--subset``,
+teuthology will still ensure that all files in the suite are in at
+least one test. Understanding the actual logic that drives this
+requires reading the teuthology source code.
+
+The ``--limit`` option only runs the first ``N`` tests in the suite:
+this is rarely useful, however, because there is no way to control which
+test will be first.
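+
+For completeness, a ``--limit`` invocation looks like this (here it would run
+only the first 100 generated tests of the ``rados`` suite)::
+
+ teuthology-suite --suite rados --limit 100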
+
+Testing in the cloud
+====================
+
+In this chapter, we will explain in detail how to use an OpenStack
+tenant as an environment for Ceph integration testing.
+
+Assumptions and caveat
+----------------------
+
+We assume that:
+
+1. you are the only person using the tenant
+2. you have the credentials
+3. the tenant supports the ``nova`` and ``cinder`` APIs
+
+Caveat: be aware that, as of this writing (July 2016), testing in
+OpenStack clouds is a new feature. Things may not work as advertised.
+If you run into trouble, ask for help on `IRC`_ or the `Mailing list`_, or
+open a bug report at the `ceph-workbench bug tracker`_.
+
+.. _`ceph-workbench bug tracker`: http://ceph-workbench.dachary.org/root/ceph-workbench/issues
+
+Prepare tenant
+--------------
+
+If you have not tried to use ``ceph-workbench`` with this tenant before,
+proceed to the next step.
+
+To start with a clean slate, login to your tenant via the Horizon dashboard and:
+
+* terminate the ``teuthology`` and ``packages-repository`` instances, if any
+* delete the ``teuthology`` and ``teuthology-worker`` security groups, if any
+* delete the ``teuthology`` and ``teuthology-myself`` key pairs, if any
+
+Also do the above if you ever get key-related errors ("invalid key", etc.) when
+trying to schedule suites.
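+
+If you prefer the command line over Horizon, a roughly equivalent cleanup
+(a sketch using the standard ``openstack`` client; resource names as listed
+above) could look like this::
+
+ openstack server delete teuthology packages-repository
+ openstack security group delete teuthology teuthology-worker
+ openstack keypair delete teuthology
+ openstack keypair delete teuthology-myself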
+
+Getting ceph-workbench
+----------------------
+
+Since testing in the cloud is done using the `ceph-workbench
+ceph-qa-suite`_ tool, you will need to install that first. It is designed
+to be installed via Docker, so if you don't have Docker running on your
+development machine, take care of that first. You can follow `the official
+tutorial <https://docs.docker.com/engine/installation/>`_ to install it if
+you have not done so already.
+
+Once Docker is up and running, install ``ceph-workbench`` by following the
+`Installation instructions in the ceph-workbench documentation
+<http://ceph-workbench.readthedocs.org/en/latest/#installation>`_.
+
+Linking ceph-workbench with your OpenStack tenant
+-------------------------------------------------
+
+Before you can trigger your first teuthology suite, you will need to link
+``ceph-workbench`` with your OpenStack account.
+
+First, download a ``openrc.sh`` file by clicking on the "Download OpenStack
+RC File" button, which can be found in the "API Access" tab of the "Access
+& Security" dialog of the OpenStack Horizon dashboard.
+
+Second, create a ``~/.ceph-workbench`` directory, set its permissions to
+700, and move the ``openrc.sh`` file into it. Make sure that the filename
+is exactly ``~/.ceph-workbench/openrc.sh``.
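+
+For example (assuming your browser saved the file to ``~/Downloads``)::
+
+ mkdir -m 700 ~/.ceph-workbench
+ mv ~/Downloads/openrc.sh ~/.ceph-workbench/openrc.sh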
+
+Third, edit the file so it does not ask for your OpenStack password
+interactively. Comment out the relevant lines and replace them with
+something like::
+
+ export OS_PASSWORD="aiVeth0aejee3eep8rogho3eep7Pha6ek"
+
+When `ceph-workbench ceph-qa-suite`_ connects to your OpenStack tenant for
+the first time, it will generate two keypairs: ``teuthology-myself`` and
+``teuthology``.
+
+.. If this is not the first time you have tried to use
+.. `ceph-workbench ceph-qa-suite`_ with this tenant, make sure to delete any
+.. stale keypairs with these names!
+
+Run the dummy suite
+-------------------
+
+You are now ready to take your OpenStack teuthology setup for a test
+drive::
+
+ $ ceph-workbench ceph-qa-suite --suite dummy
+
+Be forewarned that the first run of `ceph-workbench ceph-qa-suite`_ on a
+pristine tenant will take a long time to complete because it downloads a VM
+image and during this time the command may not produce any output.
+
+The images are cached in OpenStack, so they are only downloaded once.
+Subsequent runs of the same command will complete faster.
+
+Although the ``dummy`` suite does not run any tests, in all other respects it
+behaves just like a teuthology suite and produces some of the same
+artifacts.
+
+The last bit of output should look something like this::
+
+ pulpito web interface: http://149.202.168.201:8081/
+ ssh access : ssh -i /home/smithfarm/.ceph-workbench/teuthology-myself.pem ubuntu@149.202.168.201 # logs in /usr/share/nginx/html
+
+What this means is that `ceph-workbench ceph-qa-suite`_ triggered the test
+suite run. It does not mean that the suite run has completed. To monitor
+progress of the run, check the Pulpito web interface URL periodically, or
+if you are impatient, ssh to the teuthology machine using the ssh command
+shown and do::
+
+ $ tail -f /var/log/teuthology.*
+
+The ``/usr/share/nginx/html`` directory contains the complete logs of the
+test suite. If we had provided the ``--upload`` option to the
+`ceph-workbench ceph-qa-suite`_ command, these logs would have been
+uploaded to http://teuthology-logs.public.ceph.com.
+
+Run a standalone test
+---------------------
+
+The standalone test explained in `Reading a standalone test`_ can be run
+with the following command::
+
+ $ ceph-workbench ceph-qa-suite --suite rados/singleton/all/admin-socket.yaml
+
+This will run the suite shown on the current ``master`` branch of
+``ceph/ceph.git``. You can specify a different branch with the ``--ceph``
+option, and even a different git repo with the ``--ceph-git-url`` option. (Run
+``ceph-workbench ceph-qa-suite --help`` for an up-to-date list of available
+options.)
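+
+For example, to run the same standalone test against a work-in-progress
+branch in your own fork (the branch name ``wip-fix-1`` and the fork URL below
+are hypothetical), you might invoke something like::
+
+ $ ceph-workbench ceph-qa-suite \
+     --suite rados/singleton/all/admin-socket.yaml \
+     --ceph wip-fix-1 \
+     --ceph-git-url https://github.com/mygithubaccount/ceph.git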
+
+The first run of a suite will also take a long time, because Ceph packages
+have to be built first. Again, the packages built this way are cached, and
+`ceph-workbench ceph-qa-suite`_ will not build identical packages a second
+time.
+
+Interrupt a running suite
+-------------------------
+
+Teuthology suites take time to run. From time to time one may wish to
+interrupt a running suite. One obvious way to do this is::
+
+ ceph-workbench ceph-qa-suite --teardown
+
+This destroys all VMs created by `ceph-workbench ceph-qa-suite`_ and
+returns the OpenStack tenant to a "clean slate".
+
+Sometimes you may wish to interrupt the running suite, but keep the logs,
+the teuthology VM, the packages-repository VM, etc. To do this, you can
+``ssh`` to the teuthology VM (using the ``ssh access`` command reported
+when you triggered the suite -- see `Run the dummy suite`_) and, once
+there::
+
+ sudo /etc/init.d/teuthology restart
+
+This will keep the teuthology machine, the logs and the packages-repository
+instance but nuke everything else.
+
+Upload logs to archive server
+-----------------------------
+
+Since the teuthology instance in OpenStack is only semi-permanent, with limited
+space for storing logs, ``teuthology-openstack`` provides an ``--upload``
+option which, if included in the ``ceph-workbench ceph-qa-suite`` command,
+will cause logs from all failed jobs to be uploaded to the log archive server
+maintained by the Ceph project. The logs will appear at the URL::
+
+ http://teuthology-logs.public.ceph.com/$RUN
+
+where ``$RUN`` is the name of the run. It will be a string like this::
+
+ ubuntu-2016-07-23_16:08:12-rados-hammer-backports---basic-openstack
+
+Even if you don't provide the ``--upload`` option, however, all the logs can
+still be found on the teuthology machine in the directory
+``/usr/share/nginx/html``.
+
+Provision VMs ad hoc
+--------------------
+
+From the teuthology VM, it is possible to provision machines on an "ad hoc"
+basis, to use however you like. The magic incantation is::
+
+ teuthology-lock --lock-many $NUMBER_OF_MACHINES \
+ --os-type $OPERATING_SYSTEM \
+ --os-version $OS_VERSION \
+ --machine-type openstack \
+ --owner $EMAIL_ADDRESS
+
+The command must be issued from the ``~/teuthology`` directory. The possible
+values for ``OPERATING_SYSTEM`` and ``OS_VERSION`` can be found by examining
+the contents of the directory ``teuthology/openstack/``. For example::
+
+ teuthology-lock --lock-many 1 --os-type ubuntu --os-version 16.04 \
+ --machine-type openstack --owner foo@example.com
+
+When you are finished with the machine, find it in the list of machines::
+
+ openstack server list
+
+to determine the name or ID, and then terminate it with::
+
+ openstack server delete $NAME_OR_ID
+
+Deploy a cluster for manual testing
+-----------------------------------
+
+The `teuthology framework`_ and `ceph-workbench ceph-qa-suite`_ are
+versatile tools that automatically provision Ceph clusters in the cloud and
+run various tests on them in an automated fashion. This enables a single
+engineer, in a matter of hours, to perform thousands of tests that would
+keep dozens of human testers occupied for days or weeks if conducted
+manually.
+
+However, there are times when the automated tests do not cover a particular
+scenario and manual testing is desired. It turns out that it is simple to
+adapt a test to stop and wait after the Ceph installation phase, and the
+engineer can then ssh into the running cluster. Simply add the following
+snippet in the desired place within the test YAML and schedule a run with the
+test::
+
+ tasks:
+ - exec:
+ client.0:
+ - sleep 1000000000 # forever
+
+(Make sure you have a ``client.0`` defined in your ``roles`` stanza or adapt
+accordingly.)
+
+The same effect can be achieved using the ``interactive`` task::
+
+ tasks:
+ - interactive
+
+By following the test log, you can determine when the test cluster has entered
+the "sleep forever" condition. At that point, you can ssh to the teuthology
+machine and from there to one of the target VMs (OpenStack) or teuthology
+worker machines (Sepia) where the test cluster is running.
+
+The VMs (or "instances" in OpenStack terminology) created by
+`ceph-workbench ceph-qa-suite`_ are named as follows:
+
+``teuthology`` - the teuthology machine
+
+``packages-repository`` - VM where packages are stored
+
+``ceph-*`` - VM where packages are built
+
+``target*`` - machines where tests are run
+
+The VMs named ``target*`` are used by tests. If you are monitoring the
+teuthology log for a given test, the hostnames of these target machines can
+be found out by searching for the string ``Locked targets``::
+
+ 2016-03-20T11:39:06.166 INFO:teuthology.task.internal:Locked targets:
+ target149202171058.teuthology: null
+ target149202171059.teuthology: null
+
+The IP addresses of the target machines can be found by running ``openstack
+server list`` on the teuthology machine, but the target VM hostnames (e.g.
+``target149202171058.teuthology``) are resolvable within the teuthology
+cluster.
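+
+For example, starting from your workstation, reaching a target VM might look
+something like this (the IP address and hostname are the hypothetical values
+from the examples above; the exact login details can vary)::
+
+ ssh -i ~/.ceph-workbench/teuthology-myself.pem ubuntu@149.202.168.201
+ ssh target149202171058.teuthology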
+
+
+Testing - how to run s3-tests locally
+=====================================
+
+RGW code can be tested by building Ceph locally from source, starting a vstart
+cluster, and running the "s3-tests" suite against it.
+
+The following instructions should work on jewel and above.
+
+Step 1 - build Ceph
+-------------------
+
+Refer to :doc:`/install/build-ceph`.
+
+You can do step 2 separately while it is building.
+
+Step 2 - vstart
+---------------
+
+When the build completes, and still in the top-level directory of the git
+clone where you built Ceph, do the following, for cmake builds::
+
+ cd build/
+ RGW=1 ../vstart.sh -n
+
+This will produce a lot of output as the vstart cluster is started up. At the
+end you should see a message like::
+
+ started. stop.sh to stop. see out/* (e.g. 'tail -f out/????') for debug output.
+
+This means the cluster is running.
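+
+As a quick sanity check (assuming a cmake build, still in the ``build/``
+directory where vstart wrote its ``ceph.conf``), you could run::
+
+ ./bin/ceph -s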
+
+
+Step 3 - run s3-tests
+---------------------
+
+To run the s3-tests suite, do the following::
+
+ $ ../qa/workunits/rgw/run-s3tests.sh
+
+.. WIP
+.. ===
+..
+.. Building RPM packages
+.. ---------------------
+..
+.. Ceph is regularly built and packaged for a number of major Linux
+.. distributions. At the time of this writing, these included CentOS, Debian,
+.. Fedora, openSUSE, and Ubuntu.
+..
+.. Architecture
+.. ============
+..
+.. Ceph is a collection of components built on top of RADOS and provide
+.. services (RBD, RGW, CephFS) and APIs (S3, Swift, POSIX) for the user to
+.. store and retrieve data.
+..
+.. See :doc:`/architecture` for an overview of Ceph architecture. The
+.. following sections treat each of the major architectural components
+.. in more detail, with links to code and tests.
+..
+.. FIXME The following are just stubs. These need to be developed into
+.. detailed descriptions of the various high-level components (RADOS, RGW,
+.. etc.) with breakdowns of their respective subcomponents.
+..
+.. FIXME Later, in the Testing chapter I would like to take another look
+.. at these components/subcomponents with a focus on how they are tested.
+..
+.. RADOS
+.. -----
+..
+.. RADOS stands for "Reliable, Autonomic Distributed Object Store". In a Ceph
+.. cluster, all data are stored in objects, and RADOS is the component responsible
+.. for that.
+..
+.. RADOS itself can be further broken down into Monitors, Object Storage Daemons
+.. (OSDs), and client APIs (librados). Monitors and OSDs are introduced at
+.. :doc:`/start/intro`. The client library is explained at
+.. :doc:`/rados/api/index`.
+..
+.. RGW
+.. ---
+..
+.. RGW stands for RADOS Gateway. Using the embedded HTTP server civetweb_ or
+.. Apache FastCGI, RGW provides a REST interface to RADOS objects.
+..
+.. .. _civetweb: https://github.com/civetweb/civetweb
+..
+.. A more thorough introduction to RGW can be found at :doc:`/radosgw/index`.
+..
+.. RBD
+.. ---
+..
+.. RBD stands for RADOS Block Device. It enables a Ceph cluster to store disk
+.. images, and includes in-kernel code enabling RBD images to be mounted.
+..
+.. To delve further into RBD, see :doc:`/rbd/rbd`.
+..
+.. CephFS
+.. ------
+..
+.. CephFS is a distributed file system that enables a Ceph cluster to be used as a NAS.
+..
+.. File system metadata is managed by Meta Data Server (MDS) daemons. The Ceph
+.. file system is explained in more detail at :doc:`/cephfs/index`.
+..
diff --git a/src/ceph/doc/dev/kernel-client-troubleshooting.rst b/src/ceph/doc/dev/kernel-client-troubleshooting.rst
new file mode 100644
index 0000000..59e4761
--- /dev/null
+++ b/src/ceph/doc/dev/kernel-client-troubleshooting.rst
@@ -0,0 +1,17 @@
+====================================
+ Kernel client troubleshooting (FS)
+====================================
+
+If there is an issue with the cephfs kernel client, the most important thing is
+figuring out whether the problem is with the client or the MDS. Generally,
+this is easy to work out. If the kernel client broke directly, there
+will be output in dmesg. Collect it and any appropriate kernel state. If
+the problem is with the MDS, there will be hung requests that the client
+is waiting on. Look in ``/sys/kernel/debug/ceph/*/`` and cat the ``mdsc`` file to
+get a listing of requests in progress. If one of them remains there
+persistently, the MDS has probably "forgotten" it.
+We can get hints about what's going on by dumping the MDS cache::
+
+  ceph mds tell 0 dumpcache /tmp/dump.txt
+
+And if high logging levels are set on the MDS, that will almost certainly
+hold the information we need to diagnose and solve the issue.
diff --git a/src/ceph/doc/dev/libs.rst b/src/ceph/doc/dev/libs.rst
new file mode 100644
index 0000000..203dd38
--- /dev/null
+++ b/src/ceph/doc/dev/libs.rst
@@ -0,0 +1,18 @@
+======================
+ Library architecture
+======================
+
+Ceph is structured into libraries which are built and then combined together to
+make executables and other libraries.
+
+- libcommon: a collection of utilities which are available to nearly every ceph
+ library and executable. In general, libcommon should not contain global
+ variables, because it is intended to be linked into libraries such as
+ libcephfs.so.
+
+- libglobal: a collection of utilities focused on the needs of Ceph daemon
+ programs. In here you will find pidfile management functions, signal
+ handlers, and so forth.
+
+.. todo:: document other libraries
+
diff --git a/src/ceph/doc/dev/logging.rst b/src/ceph/doc/dev/logging.rst
new file mode 100644
index 0000000..9c2a6f3
--- /dev/null
+++ b/src/ceph/doc/dev/logging.rst
@@ -0,0 +1,106 @@
+
+Use of the cluster log
+======================
+
+(Note: none of this applies to the local "dout" logging. This is about
+the cluster log that we send through the mon daemons)
+
+Severity
+--------
+
+Use ERR for situations where the cluster cannot do its job for some reason.
+For example: we tried to do a write, but it returned an error, or we tried
+to read something, but it's corrupt so we can't, or we scrubbed a PG but
+the data was inconsistent so we can't recover.
+
+Use WRN for incidents that the cluster can handle, but which have some
+abnormal/negative aspect, such as a temporary degradation of service, or an
+unexpected internal value. For example, a metadata error that can be
+auto-fixed, or a slow operation.
+
+Use INFO for ordinary cluster operations that do not indicate a fault in
+Ceph. It is especially important that INFO level messages are clearly
+worded and do not cause confusion or alarm.
+
+Frequency
+---------
+
+It is important that messages of all severities are not excessively
+frequent. Consumers may be using a rotating log buffer that contains
+messages of all severities, so even DEBUG messages could interfere
+with proper display of the latest INFO messages if the DEBUG messages
+are too frequent.
+
+Remember that if you have a bad state (as opposed to event), that is
+what health checks are for -- do not spam the cluster log to indicate
+a continuing unhealthy state.
+
+Do not emit cluster log messages for events that scale with
+the number of clients or level of activity on the system, or for
+events that occur regularly in normal operation. For example, it
+would be inappropriate to emit an INFO message about every
+new client that connects (scales with #clients), or to emit an INFO
+message about every CephFS subtree migration (occurs regularly).
+
+Language and formatting
+-----------------------
+
+(Note: these guidelines matter much less for DEBUG-level messages than
+ for INFO and above. Concentrate your efforts on making INFO/WRN/ERR
+ messages as readable as possible.)
+
+Use the passive voice. For example, use "Object xyz could not be read", rather
+than "I could not read the object xyz".
+
+Print long/big identifiers, such as inode numbers, as hex, prefixed
+with an 0x so that the user can tell it is hex. We do this because
+the 0x makes it unambiguous (no equivalent for decimal), and because
+the hex form is more likely to fit on the screen.
+
+Print size quantities as a human readable MB/GB/etc, including the unit
+at the end of the number. Exception: if you are specifying an offset,
+where precision is essential to the meaning, then you can specify
+the value in bytes (but print it as hex).
+
+Make a good faith effort to fit your message on a single line. It does
+not have to be guaranteed, but it should at least usually be
+the case. That means, generally, no printing of lists unless there
+are only a few items in the list.
+
+Use nouns that are meaningful to the user, and defined in the
+documentation. Common acronyms are OK -- don't waste screen space
+typing "Rados Object Gateway" instead of RGW. Do not use internal
+class names like "MDCache" or "Objecter". It is okay to mention
+internal structures if they are the direct subject of the message,
+for example in a corruption, but use plain English.
+Example: instead of "Objecter requests", say "OSD client requests".
+Example: it is okay to mention an internal structure in the context
+of "Corrupt session table" (but don't say "Corrupt SessionTable").
+
+Where possible, describe the consequence for system availability, rather
+than only describing the underlying state. For example, rather than
+saying "MDS myfs.0 is replaying", say that "myfs is degraded, waiting
+for myfs.0 to finish starting".
+
+While common acronyms are fine, don't randomly truncate words. It's not
+"dir ino", it's "directory inode".
+
+If you're logging something that "should never happen", i.e. a situation
+where it would be an assertion, but we're helpfully not crashing, then
+make that clear in the language -- this is probably not a situation
+that the user can remediate themselves.
+
+Avoid UNIX/programmer jargon. Instead of "errno", just say "error" (or
+preferably give something more descriptive than the number!).
+
+Do not mention cluster map epochs unless they are essential to
+the meaning of the message. For example, "OSDMap epoch 123 is corrupt"
+would be okay (the epoch is the point of the message), but saying "OSD
+123 is down in OSDMap epoch 456" would not be (the osdmap and epoch
+concepts are an implementation detail, the down-ness of the OSD
+is the real message). Feel free to send additional detail to
+the daemon's local log (via `dout`/`derr`).
+
+If you log a problem that may go away in the future, make sure you
+also log when it goes away. Whatever priority you logged the original
+message at, log the "going away" message at INFO.
+
diff --git a/src/ceph/doc/dev/logs.rst b/src/ceph/doc/dev/logs.rst
new file mode 100644
index 0000000..7fda64f
--- /dev/null
+++ b/src/ceph/doc/dev/logs.rst
@@ -0,0 +1,57 @@
+============
+ Debug logs
+============
+
+The main debugging tool for Ceph is the dout and derr logging functions.
+Collectively, these are referred to as "dout logging."
+
+Dout has several log facilities, which can be set at various log
+levels using the configuration management system. So it is possible to enable
+debugging just for the messenger, by setting debug_ms to 10, for example.
+
+Dout is implemented mainly in common/DoutStreambuf.cc
+
+The dout macro avoids even generating log messages which are not going to be
+used, by enclosing them in an "if" statement. What this means is that if you
+have the debug level set at 0, and you run this code::
+
+ dout(20) << "myfoo() = " << myfoo() << dendl;
+
+
+myfoo() will not be called here.
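+
+As an illustration of this lazy-evaluation idea, here is a minimal,
+self-contained C++ sketch of a guarded logging macro (this is not Ceph's
+actual dout implementation; the names are made up for the example)::
+
+  #include <iostream>
+
+  // Stand-in for the configured debug level of a subsystem.
+  static int g_debug_level = 0;
+
+  // The stream expression after the macro is only evaluated when the
+  // requested level is enabled, so expensive calls are skipped entirely.
+  #define SKETCH_DOUT(level) \
+    if ((level) <= g_debug_level) std::cout
+
+  int myfoo() {
+    std::cout << "myfoo() was called\n";
+    return 42;
+  }
+
+  int main() {
+    // g_debug_level is 0, so the condition is false and myfoo() never runs.
+    SKETCH_DOUT(20) << "myfoo() = " << myfoo() << "\n";
+    return 0;
+  }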
+
+Unfortunately, the performance of debug logging is relatively low. This is
+because there is a single, process-wide mutex which every debug output
+statement takes, and every debug output statement leads to a write() system
+call or a call to syslog(). There is also a computational overhead to using C++
+streams to consider. So you will need to be parsimonious in your logging to get
+the best performance.
+
+Sometimes, enabling logging can hide race conditions and other bugs by changing
+the timing of events. Keep this in mind when debugging.
+
+Performance counters
+====================
+
+Ceph daemons use performance counters to track key statistics like number of
+inodes pinned. Performance counters are essentially sets of integers and floats
+which can be set, incremented, and read using the PerfCounters API.
+
+A PerfCounters object is usually associated with a single subsystem. It
+contains multiple counters. This object is thread-safe because it is protected
+by an internal mutex. You can create multiple PerfCounters objects.
+
+Currently, three types of performance counters are supported: u64 counters,
+float counters, and long-run floating-point average counters. These are created
+by PerfCountersBuilder::add_u64, PerfCountersBuilder::add_fl, and
+PerfCountersBuilder::add_fl_avg, respectively. u64 and float counters simply
+provide a single value which can be updated, incremented, and read atomically.
+floating-pointer average counters provide two values: the current total, and
+the number of times the total has been changed. This is intended to provide a
+long-run average value.
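+
+To make the long-run average idea concrete, here is a small self-contained
+C++ sketch of such a counter. It models the concept only and is not the
+PerfCounters API itself (the real object also guards updates with a mutex)::
+
+  #include <cstdint>
+  #include <iostream>
+
+  // Conceptual long-run average counter: it keeps the running total and
+  // the number of samples so a consumer can compute total / count.
+  struct AvgCounterSketch {
+    double total = 0.0;
+    uint64_t count = 0;
+
+    void add_sample(double value) {
+      total += value;
+      ++count;
+    }
+
+    double average() const {
+      return count ? total / count : 0.0;
+    }
+  };
+
+  int main() {
+    AvgCounterSketch op_latency;
+    op_latency.add_sample(0.004);  // e.g. request latencies in seconds
+    op_latency.add_sample(0.006);
+    std::cout << "samples=" << op_latency.count
+              << " avg=" << op_latency.average() << "\n";  // avg=0.005
+    return 0;
+  }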
+
+Performance counter information can be read in JSON format from the
+administrative socket (admin_sock). This is implemented as a UNIX domain
+socket. The Ceph performance counter plugin for collectd shows an example of how
+to access this information. Another example can be found in the unit tests for
+the administrative sockets.
diff --git a/src/ceph/doc/dev/mds_internals/data-structures.rst b/src/ceph/doc/dev/mds_internals/data-structures.rst
new file mode 100644
index 0000000..1197b62
--- /dev/null
+++ b/src/ceph/doc/dev/mds_internals/data-structures.rst
@@ -0,0 +1,36 @@
+MDS internal data structures
+==============================
+
+*CInode*
+ CInode contains the metadata of a file; there is one CInode for each file.
+ The CInode stores information such as who owns the file and how big the file is.
+
+*CDentry*
+ CDentry is the glue that holds inodes and files together by relating inode to
+ file/directory names. A CDentry links to at most one CInode (it may not link
+ to any CInode). A CInode may be linked by multiple CDentries.
+
+*CDir*
+ CDir only exists for a directory inode; it is used to link the CDentries under
+ that directory. A CInode can have multiple CDirs when the directory is fragmented.
+
+These data structures are linked together as::
+
+ CInode
+ CDir
+ | \
+ | \
+ | \
+ CDentry CDentry
+ CInode CInode
+ CDir CDir
+ | | \
+ | | \
+ | | \
+ CDentry CDentry CDentry
+ CInode CInode CInode
+
+At the time of writing, a CInode is about 1400 bytes, a CDentry is about 400
+bytes, and a CDir is about 700 bytes. These data structures are quite large.
+Please be careful if you want to add new fields to them.
+
diff --git a/src/ceph/doc/dev/mds_internals/exports.rst b/src/ceph/doc/dev/mds_internals/exports.rst
new file mode 100644
index 0000000..c5b0e39
--- /dev/null
+++ b/src/ceph/doc/dev/mds_internals/exports.rst
@@ -0,0 +1,76 @@
+
+===============
+Subtree exports
+===============
+
+Normal Migration
+----------------
+
+The exporter begins by doing some checks in export_dir() to verify
+that it is permissible to export the subtree at this time. In
+particular, the cluster must not be degraded, the subtree root may not
+be freezing or frozen (i.e. already exporting, or nested beneath
+something that is exporting), and the path must be pinned (i.e. not
+conflicted with a rename). If these conditions are met, the subtree
+freeze is initiated, and the exporter is committed to the subtree
+migration, barring an intervening failure of the importer or itself.
+
+The MExportDirDiscover serves simply to ensure that the base directory
+being exported is open on the destination node. It is pinned by the
+importer to prevent it from being trimmed. This occurs before the
+exporter completes the freeze of the subtree to ensure that the
+importer is able to replicate the necessary metadata. When the
+exporter receives the MExportDirDiscoverAck, it allows the freeze to proceed.
+
+The MExportDirPrep message then follows to populate a spanning tree that
+includes all dirs, inodes, and dentries necessary to reach any nested
+exports within the exported region. This replicates metadata as well,
+but it is pushed out by the exporter, avoiding deadlock with the
+regular discover and replication process. The importer is responsible
+for opening the bounding directories from any third parties before
+acknowledging. This ensures that the importer has correct dir_auth
+information about where authority is delegated for all points nested
+within the subtree being migrated. While processing the MExportDirPrep,
+the importer freezes the entire subtree region to prevent any new
+replication or cache expiration.
+
+The warning stage occurs only if the base subtree directory is open by
+nodes other than the importer and exporter. If so, then a
+MExportDirNotify message informs any bystanders that the authority for
+the region is temporarily ambiguous. In particular, bystanders who
+are trimming items from their cache must send MCacheExpire messages to
+both the old and new authorities. This is necessary to ensure that
+the surviving authority reliably receives all expirations even if the
+importer or exporter fails. While the subtree is frozen (on both the
+importer and exporter), expirations will not be immediately processed;
+instead, they will be queued until the region is unfrozen and it can
+be determined that the node is or is not authoritative for the region.
+
+The MExportDir message sends the actual subtree metadata to the importer.
+Upon receipt, the importer inserts the data into its cache, logs a
+copy in the EImportStart, and replies with an MExportDirAck. The exporter
+can now log an EExport, which ultimately specifies that
+the export was a success. In the presence of failures, it is the
+existence of the EExport that disambiguates authority during recovery.
+
+Once logged, the exporter will send an MExportDirNotify to any
+bystanders, informing them that the authority is no longer ambiguous
+and cache expirations should be sent only to the new authority (the
+importer). Once these are acknowledged, implicitly flushing the
+bystander to exporter message streams of any stray expiration notices,
+the exporter unfreezes the subtree, cleans up its state, and sends a
+final MExportDirFinish to the importer. Upon receipt, the importer logs
+an EImportFinish(true), unfreezes its subtree, and cleans up its
+state.
+
+
+PARTIAL FAILURE RECOVERY
+
+
+
+RECOVERY FROM JOURNAL
+
+
+
+
+
diff --git a/src/ceph/doc/dev/mds_internals/index.rst b/src/ceph/doc/dev/mds_internals/index.rst
new file mode 100644
index 0000000..c8c82ad
--- /dev/null
+++ b/src/ceph/doc/dev/mds_internals/index.rst
@@ -0,0 +1,10 @@
+==============================
+MDS developer documentation
+==============================
+
+.. rubric:: Contents
+
+.. toctree::
+ :glob:
+
+ *
diff --git a/src/ceph/doc/dev/messenger.rst b/src/ceph/doc/dev/messenger.rst
new file mode 100644
index 0000000..2b1a888
--- /dev/null
+++ b/src/ceph/doc/dev/messenger.rst
@@ -0,0 +1,33 @@
+============================
+ Messenger notes
+============================
+
+Messenger is the Ceph network layer implementation. Currently Ceph supports
+three messenger types: "simple", "async" and "xio". The latter two are
+experimental features and should not be used in production environments.
+
+ceph_perf_msgr
+==============
+
+ceph_perf_msgr benchmarks the messenger module in isolation and can help find
+bottlenecks or time-consuming code within it. Much like "iperf", the
+server-side program must be started first::
+
+  # ./ceph_perf_msgr_server 172.16.30.181:10001 0
+
+The first argument is the ip:port pair that the client needs to specify as the
+destination address. The second argument is the "think time" applied when
+dispatching messages. Since Giant, the CEPH_OSD_OP message, which is the actual
+client read/write IO request, is fast dispatched without being queued to the
+Dispatcher, in order to achieve better performance. CEPH_OSD_OP messages are
+therefore processed inline, and the "think time" is used to mock this inline
+processing.
+
+The client is then started::
+
+  # ./ceph_perf_msgr_client 172.16.30.181:10001 1 32 10000 10 4096
+
+The first argument specifies the server ip:port, and the second specifies the
+number of client threads. The third argument specifies the concurrency (the
+maximum number of inflight messages for each client thread), and the fourth the
+number of IOs issued to the server per client thread. The fifth argument is the
+"think time" for a client thread when receiving messages, which is also used to
+mock the client fast dispatch process. The last argument specifies the message
+data length to issue.
diff --git a/src/ceph/doc/dev/mon-bootstrap.rst b/src/ceph/doc/dev/mon-bootstrap.rst
new file mode 100644
index 0000000..75c8a5e
--- /dev/null
+++ b/src/ceph/doc/dev/mon-bootstrap.rst
@@ -0,0 +1,199 @@
+===================
+ Monitor bootstrap
+===================
+
+Terminology:
+
+* ``cluster``: a set of monitors
+* ``quorum``: an active set of monitors consisting of a majority of the cluster
+
+In order to initialize a new monitor, it must always be fed:
+
+#. a logical name
+#. secret keys
+#. a cluster fsid (uuid)
+
+In addition, a monitor needs to know two things:
+
+#. what address to bind to
+#. who its peers are (if any)
+
+There are a range of ways to do both.
+
+Logical id
+==========
+
+The logical id should be unique across the cluster. It will be
+appended to ``mon.`` to logically describe the monitor in the Ceph
+cluster. For example, if the logical id is ``foo``, the monitor's
+name will be ``mon.foo``.
+
+For most users, there is no more than one monitor per host, which
+makes the short hostname a logical choice.
+
+Secret keys
+===========
+
+The ``mon.`` secret key is stored in a ``keyring`` file in the ``mon data`` directory. It can be generated
+with a command like::
+
+ ceph-authtool --create-keyring /path/to/keyring --gen-key -n mon.
+
+When creating a new monitor cluster, the keyring should also contain a ``client.admin`` key that can be used
+to administer the system::
+
+ ceph-authtool /path/to/keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
+
+The resulting keyring is fed to ``ceph-mon --mkfs`` with the ``--keyring <keyring>`` command-line argument.
+
+Cluster fsid
+============
+
+The cluster fsid is a normal uuid, like that generated by the ``uuidgen`` command. It
+can be provided to the monitor in two ways:
+
+#. via the ``--fsid <uuid>`` command-line argument (or config file option)
+#. via a monmap provided to the new monitor via the ``--monmap <path>`` command-line argument.
+
+Monitor address
+===============
+
+The monitor address can be provided in several ways.
+
+#. via the ``--public-addr <ip[:port]>`` command-line option (or config file option)
+#. via the ``--public-network <cidr>`` command-line option (or config file option)
+#. via the monmap provided via ``--monmap <path>``, if it includes a monitor with our name
+#. via the bootstrap monmap (provided via ``--inject-monmap <path>`` or generated from ``--mon-host <list>``) if it includes a monitor with no name (``noname-<something>``) and an address configured on the local host.
+
+Peers
+=====
+
+The monitor peers are provided in several ways:
+
+#. via the initial monmap, provided via ``--monmap <filename>``
+#. via the bootstrap monmap generated from ``--mon-host <list>``
+#. via the bootstrap monmap generated from ``[mon.*]`` sections with ``mon addr`` in the config file
+#. dynamically via the admin socket
+
+However, these methods are not completely interchangeable because of
+the complexity of creating a new monitor cluster without danger of
+races.
+
+Cluster creation
+================
+
+There are three basic approaches to creating a cluster:
+
+#. Create a new cluster by specifying the monitor names and addresses ahead of time.
+#. Create a new cluster by specifying the monitor names ahead of time, and dynamically setting the addresses as ``ceph-mon`` daemons configure themselves.
+#. Create a new cluster by specifying the monitor addresses ahead of time.
+
+
+Names and addresses
+-------------------
+
+Generate a monmap using ``monmaptool`` with the names and addresses of the initial
+monitors. The generated monmap will also include a cluster fsid. Feed that monmap
+to each monitor daemon::
+
+ ceph-mon --mkfs -i <name> --monmap <initial_monmap> --keyring <initial_keyring>
+
+When the daemons start, they will know exactly who they and their peers are.
+
+
+Addresses only
+--------------
+
+The initial monitor addresses can be specified with the ``mon host`` configuration value,
+either via a config file or the command-line argument. This method has the advantage that
+a single global config file for the cluster can have a line like::
+
+ mon host = a.foo.com, b.foo.com, c.foo.com
+
+and will also serve to inform any ceph clients or daemons who the monitors are.
+
+The ``ceph-mon`` daemons will need to be fed the initial keyring and cluster fsid to
+initialize themselves::
+
+ ceph-mon --mkfs -i <name> --fsid <uuid> --keyring <initial_keyring>
+
+When the daemons first start up, they will share their names with each other and form a
+new cluster.
+
+Names only
+----------
+
+In dynamic "cloud" environments, the cluster creator may not (yet)
+know what the addresses of the monitors are going to be. Instead,
+they may want machines to configure and start themselves in parallel
+and, as they come up, form a new cluster on their own. The problem is
+that the monitor cluster relies on strict majorities to keep itself
+consistent, and in order to "create" a new cluster, it needs to know
+what the *initial* set of monitors will be.
+
+This can be done with the ``mon initial members`` config option, which
+should list the ids of the initial monitors that are allowed to create
+the cluster::
+
+ mon initial members = foo, bar, baz
+
+The monitors can then be initialized by providing the other pieces of
+information (the keyring, cluster fsid, and a way of determining
+their own address). For example::
+
+ ceph-mon --mkfs -i <name> --mon-initial-hosts 'foo,bar,baz' --keyring <initial_keyring> --public-addr <ip>
+
+When these daemons are started, they will know their own address, but
+not their peers. They can learn those addresses via the admin socket::
+
+ ceph daemon mon.<id> add_bootstrap_peer_hint <peer ip>
+
+Once they learn enough of their peers from the initial member set,
+they will be able to create the cluster.
+
+
+Cluster expansion
+=================
+
+Cluster expansion is slightly less demanding than creation, because
+the creation of the initial quorum is not an issue and there is no
+worry about creating separate, independent clusters.
+
+New nodes can be forced to join an existing cluster in two ways:
+
+#. by providing no initial monitor peer addresses, and feeding them dynamically.
+#. by specifying the ``mon initial members`` config option to prevent the new nodes from forming a new, independent cluster, and feeding some existing monitors via any available method.
+
+Initially peerless expansion
+----------------------------
+
+Create a new monitor and give it no peer addresses other than its own. For
+example::
+
+ ceph-mon --mkfs -i <myid> --fsid <fsid> --keyring <mon secret key> --public-addr <ip>
+
+Once the daemon starts, you can give it one or more peer addresses to join with::
+
+ ceph daemon mon.<id> add_bootstrap_peer_hint <peer ip>
+
+This monitor will never participate in cluster creation; it can only join an existing
+cluster.
+
+Expanding with initial members
+------------------------------
+
+You can feed the new monitor some peer addresses initially and avoid badness by also
+setting ``mon initial members``. For example::
+
+ ceph-mon --mkfs -i <myid> --fsid <fsid> --keyring <mon secret key> --public-addr <ip> --mon-host foo,bar,baz
+
+When the daemon is started, ``mon initial members`` must be set via the command line or config file::
+
+ ceph-mon -i <myid> --mon-initial-members foo,bar,baz
+
+to prevent any risk of split-brain.
+
+
+
+
+
diff --git a/src/ceph/doc/dev/msgr2.rst b/src/ceph/doc/dev/msgr2.rst
new file mode 100644
index 0000000..584ce7d
--- /dev/null
+++ b/src/ceph/doc/dev/msgr2.rst
@@ -0,0 +1,250 @@
+msgr2 protocol
+==============
+
+This is a revision of the legacy Ceph on-wire protocol that was
+implemented by the SimpleMessenger. It addresses performance and
+security issues.
+
+Definitions
+-----------
+
+* *client* (C): the party initiating a (TCP) connection
+* *server* (S): the party accepting a (TCP) connection
+* *connection*: an instance of a (TCP) connection between two processes.
+* *entity*: a ceph entity instantiation, e.g. 'osd.0'. each entity
+ has one or more unique entity_addr_t's by virtue of the 'nonce'
+ field, which is typically a pid or random value.
+* *stream*: an exchange, passed over a connection, between two unique
+ entities. in the future multiple entities may coexist within the
+ same process.
+* *session*: a stateful session between two entities in which message
+ exchange is ordered and lossless. A session might span multiple
+ connections (and streams) if there is an interruption (TCP connection
+ disconnect).
+* *frame*: a discrete message sent between the peers. Each frame
+ consists of a tag (type code), stream id, payload, and (if signing
+ or encryption is enabled) some other fields. See below for the
+ structure.
+* *stream id*: a 32-bit value that uniquely identifies a stream within
+ a given connection. The stream id is implicitly instantiated when the
+ sender sends a frame using that id.
+* *tag*: a single-byte type code associated with a frame. The tag
+ determines the structure of the payload.
+
+Phases
+------
+
+A connection has two distinct phases:
+
+#. banner
+#. frame exchange for one or more streams
+
+A stream has three distinct phases:
+
+#. authentication
+#. message flow handshake
+#. message exchange
+
+Banner
+------
+
+Both the client and server, upon connecting, send a banner::
+
+ "ceph %x %x\n", protocol_features_suppored, protocol_features_required
+
+The protocol features are a new, distinct namespace. Initially no
+features are defined or required, so this will be "ceph 0 0\n".
+
+If the remote party advertises required features we don't support, we
+can disconnect.
+
+Frame format
+------------
+
+All further data sent or received is contained by a frame. Each frame has
+the form::
+
+ stream_id (le32)
+ frame_len (le32)
+ tag (TAG_* byte)
+ payload
+ [payload padding -- only present after stream auth phase]
+ [signature -- only present after stream auth phase]
+
+* frame_len includes everything after the frame_len le32 up to the end of the
+ frame (all payloads, signatures, and padding).
+
+* The payload format and length is determined by the tag.
+
+* The signature portion is only present in a given stream if the
+ authentication phase has completed (TAG_AUTH_DONE has been sent) and
+ signatures are enabled.
+
+
+Authentication
+--------------
+
+* TAG_AUTH_METHODS (server only): list authentication methods (none, cephx, ...)::
+
+ __le32 num_methods;
+ __le32 methods[num_methods]; // CEPH_AUTH_{NONE, CEPHX}
+
+* TAG_AUTH_SET_METHOD (client only): set auth method for this connection::
+
+ __le32 method;
+
+ - The selected auth method determines the sig_size and block_size in any
+ subsequent messages (TAG_AUTH_DONE and non-auth messages).
+
+* TAG_AUTH_BAD_METHOD (server only): reject client-selected auth method::
+
+ __le32 method
+
+* TAG_AUTH: client->server or server->client auth message::
+
+ __le32 len;
+ method specific payload
+
+* TAG_AUTH_DONE::
+
+ confounder (block_size bytes of random garbage)
+ __le64 flags
+ FLAG_ENCRYPTED 1
+ FLAG_SIGNED 2
+ signature
+
+ - The client first says AUTH_DONE, and the server replies to
+ acknowledge it.
+
+
+Message frame format
+--------------------
+
+The frame format is fixed (see above), but can take three different
+forms, depending on the AUTH_DONE flags:
+
+* If neither FLAG_SIGNED nor FLAG_ENCRYPTED is specified, things are simple::
+
+ stream_id
+ frame_len
+ tag
+ payload
+ payload_padding (out to auth block_size)
+
+* If FLAG_SIGNED has been specified::
+
+ stream_id
+ frame_len
+ tag
+ payload
+ payload_padding (out to auth block_size)
+ signature (sig_size bytes)
+
+ Here the padding just makes life easier for the signature. It can be
+ random data to add additional confounder. Note also that the
+ signature input must include some state from the session key and the
+ previous message.
+
+* If FLAG_ENCRYPTED has been specified::
+
+ stream_id
+ frame_len
+ {
+ payload_sig_length
+ payload
+ payload_padding (out to auth block_size)
+ } ^ stream cipher
+
+ Note that the padding ensures that the total frame is a multiple of
+ the auth method's block_size so that the message can be sent out over
+ the wire without waiting for the next frame in the stream.
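+
+The amount of padding needed to round a payload up to the cipher block size
+is simple modular arithmetic; a minimal C++ sketch (the function name is
+illustrative, not taken from the Ceph source)::
+
+  #include <cstdint>
+  #include <iostream>
+
+  // Bytes of padding required so that (payload_len + padding) is a
+  // multiple of block_size; returns 0 when it already is.
+  static uint32_t padding_for(uint32_t payload_len, uint32_t block_size) {
+    return (block_size - payload_len % block_size) % block_size;
+  }
+
+  int main() {
+    // A 29 byte payload with a 16 byte block size needs 3 bytes of
+    // padding to reach 32 bytes.
+    std::cout << padding_for(29, 16) << "\n";  // prints 3
+    return 0;
+  }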
+
+
+Message flow handshake
+----------------------
+
+In this phase the peers identify each other and (if desired) reconnect to
+an established session.
+
+* TAG_IDENT: identify ourselves::
+
+ entity_addrvec_t addr(s)
+ __u8 my type (CEPH_ENTITY_TYPE_*)
+ __le32 protocol version
+ __le64 features supported (CEPH_FEATURE_* bitmask)
+ __le64 features required (CEPH_FEATURE_* bitmask)
+ __le64 flags (CEPH_MSG_CONNECT_* bitmask)
+ __le64 cookie (a client identifier, assigned by the sender. unique on the sender.)
+
+ - client will send first, server will reply with same.
+
+* TAG_IDENT_MISSING_FEATURES (server only): complain about a TAG_IDENT with too few features::
+
+ __le64 features we require that peer didn't advertise
+
+* TAG_IDENT_BAD_PROTOCOL (server only): complain about an old protocol version::
+
+ __le32 protocol_version (our protocol version)
+
+* TAG_RECONNECT (client only): reconnect to an established session::
+
+ __le64 cookie
+ __le64 global_seq
+ __le64 connect_seq
+ __le64 msg_seq (the last msg seq received)
+
+* TAG_RECONNECT_OK (server only): acknowledge a reconnect attempt::
+
+ __le64 msg_seq (last msg seq received)
+
+* TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq
+
+* TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq
+
+* TAG_RECONNECT_WAIT (server only): fail reconnect due to connect race.
+
+ - Indicates that the server is already connecting to the client, and
+ that direction should win the race. The client should wait for that
+ connection to complete.
+
+Message exchange
+----------------
+
+Once a session is established, we can exchange messages.
+
+* TAG_MSG: a message::
+
+ ceph_msg_header2
+ front
+ middle
+ data
+
+ - The ceph_msg_header is modified in ceph_msg_header2 to include an
+ ack_seq. This avoids the need for a TAG_ACK message most of the time.
+
+* TAG_ACK: acknowledge receipt of message(s)::
+
+ __le64 seq
+
+ - This is only used for stateful sessions.
+
+* TAG_KEEPALIVE2: check for connection liveness::
+
+ ceph_timespec stamp
+
+ - Time stamp is local to sender.
+
+* TAG_KEEPALIVE2_ACK: reply to a keepalive2::
+
+ ceph_timestamp stamp
+
+ - Time stamp is from the TAG_KEEPALIVE2 we are responding to.
+
+* TAG_CLOSE: terminate a stream
+
+ Indicates that a stream should be terminated. This is equivalent to
+ a hangup or reset (i.e., should trigger ms_handle_reset). It isn't
+ strictly necessary or useful if there is only a single stream as we
+ could just disconnect the TCP connection, although one could
+ certainly use it creatively (e.g., reset the stream state and retry
+ an authentication handshake).
diff --git a/src/ceph/doc/dev/network-encoding.rst b/src/ceph/doc/dev/network-encoding.rst
new file mode 100644
index 0000000..51d030e
--- /dev/null
+++ b/src/ceph/doc/dev/network-encoding.rst
@@ -0,0 +1,214 @@
+==================
+ Network Encoding
+==================
+
+This describes the encoding used to serialize data. It doesn't cover specific
+objects/messages but focuses on the base types.
+
+The types are not self-documenting in any way. They cannot be decoded unless
+you know what they are.
+
+Conventions
+===========
+
+Integers
+--------
+
+The integer types used will be named ``{signed}{size}{endian}``. For example
+``u16le`` is an unsigned 16 bit integer encoded in little endian byte order
+while ``s64be`` is a signed 64 bit integer in big endian. Additionally ``u8``
+and ``s8`` will represent signed and unsigned bytes respectively. Signed
+integers use two's complement encoding.
+
+Complex Types
+-------------
+
+This document will use a C-like syntax for describing structures. The
+structure represents the data that will go over the wire. There will be no
+padding between the elements and the elements will be sent in the order they
+appear. For example::
+
+ struct foo {
+ u8 tag;
+ u32le data;
+ }
+
+When encoding the values ``0x05`` and ``0x12345678`` respectively will appear on
+the wire as ``05 78 56 34 12``.
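+
+The same bytes can be produced with a short, self-contained C++ sketch
+(illustrative only; the helper name is made up for this example)::
+
+  #include <cstdint>
+  #include <cstdio>
+  #include <vector>
+
+  // Append a 32 bit value in little endian byte order.
+  static void put_u32le(std::vector<uint8_t> &out, uint32_t v) {
+    for (int i = 0; i < 4; ++i)
+      out.push_back(static_cast<uint8_t>(v >> (8 * i)));
+  }
+
+  int main() {
+    std::vector<uint8_t> wire;
+    wire.push_back(0x05);         // u8 tag
+    put_u32le(wire, 0x12345678);  // u32le data
+    for (uint8_t b : wire)
+      std::printf("%02x ", b);    // prints: 05 78 56 34 12
+    std::printf("\n");
+    return 0;
+  }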
+
+Variable Arrays
+---------------
+
+Unlike C, variable-length arrays can be used anywhere in structures and are
+inlined in the protocol. Furthermore, the length may be given by an earlier
+item in the structure.
+
+::
+
+ struct blob {
+ u32le size;
+ u8 data[size];
+ u32le checksum;
+ }
+
+This structure is encoded as a 32 bit size, followed by ``size`` data bytes,
+then a 32 bit checksum.
+
+Primitive Aliases
+-----------------
+
+These types are just aliases for primitive types.
+
+::
+
+ // From /src/include/types.h
+
+ typedef u32le epoch_t;
+ typedef u32le ceph_seq_t;
+ typedef u64le ceph_tid_t;
+ typedef u64le version_t;
+
+
+Structures
+==========
+
+These are the way structures are encoded. Note that these structures don't
+actually exist in the source but are the way that different types are encoded.
+
+Optional
+--------
+
+Optionals are represented as a presence byte, followed by the item if it exists.
+
+::
+
+ struct ceph_optional<T> {
+ u8 present;
+ T element[present? 1 : 0]; // Only if present is non-zero.
+ }
+
+Optionals are used to encode ``boost::optional``.
+
+Pair
+----
+
+Pairs are simply the first item followed by the second.
+
+::
+
+ struct ceph_pair<A,B> {
+ A a;
+ B b;
+ }
+
+Pairs are used to encode ``std::pair``.
+
+Triple
+------
+
+Triples are simply the three elements one after another.
+
+::
+
+ struct ceph_triple<A,B,C> {
+ A a;
+ B b;
+ C c;
+ }
+
+Triples are used to encode ``ceph::triple``.
+
+
+List
+----
+
+Lists are represented as an element count followed by that many elements.
+
+::
+
+ struct ceph_list<T> {
+ u32le length;
+ T elements[length];
+ }
+
+.. note::
+ The sizes of the elements in the list are not necessarily uniform.
+
+Lists are used to encode ``std::list``, ``std::vector``, ``std::deque``,
+``std::set`` and ``ceph::unordered_set``.
+
+Blob
+----
+
+A Blob is simply a list of bytes.
+
+::
+
+ struct ceph_string {
+ ceph_list<u8>;
+ }
+
+ // AKA
+
+ struct ceph_string {
+ u32le size;
+ u8 data[size];
+ }
+
+Blobs are used to encode ``std::string``, ``const char *`` and ``bufferlist``.
+
+.. note::
+ The content of a Blob is arbitrary binary data.
+
+Map
+---
+
+Maps are a list of pairs.
+
+::
+
+ struct ceph_map<K,V> {
+ ceph_list<ceph_pair<K,V>>;
+ }
+
+ // AKA
+
+ struct ceph_map<K,V> {
+ u32le length;
+ ceph_pair<K,V> entries[length];
+ }
+
+Maps are used to encode ``std::map``, ``std::multimap`` and
+``ceph::unordered_map``.
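+
+Putting the list, pair and blob rules together, a self-contained C++ sketch
+of encoding a small map of string keys to ``u32le`` values might look like
+this (illustrative only; the helper names are made up)::
+
+  #include <cstdint>
+  #include <cstdio>
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  static void put_u32le(std::vector<uint8_t> &out, uint32_t v) {
+    for (int i = 0; i < 4; ++i)
+      out.push_back(static_cast<uint8_t>(v >> (8 * i)));
+  }
+
+  // ceph_string: u32le length followed by the raw bytes.
+  static void put_string(std::vector<uint8_t> &out, const std::string &s) {
+    put_u32le(out, static_cast<uint32_t>(s.size()));
+    out.insert(out.end(), s.begin(), s.end());
+  }
+
+  int main() {
+    std::map<std::string, uint32_t> m = {{"a", 1}, {"bc", 2}};
+    std::vector<uint8_t> wire;
+
+    // ceph_map: u32le entry count, then each (key, value) pair in order.
+    put_u32le(wire, static_cast<uint32_t>(m.size()));
+    for (const auto &kv : m) {
+      put_string(wire, kv.first);  // key as a length-prefixed blob
+      put_u32le(wire, kv.second);  // value as u32le
+    }
+
+    for (uint8_t b : wire)
+      std::printf("%02x ", b);
+    std::printf("\n");
+    return 0;
+  }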
+
+Complex Types
+=============
+
+These aren't hard to find in the source but the common ones are listed here for
+convenience.
+
+utime_t
+-------
+
+::
+
+ // From /src/include/utime.h
+ struct utime_t {
+ u32le tv_sec; // Seconds since epoch.
+ u32le tv_nsec; // Nanoseconds since the last second.
+ }
+
+ceph_entity_name
+----------------
+
+::
+
+ // From /src/include/msgr.h
+ struct ceph_entity_name {
+ u8 type; // CEPH_ENTITY_TYPE_*
+ u64le num;
+ }
+
+ // CEPH_ENTITY_TYPE_* defined in /src/include/msgr.h
+
+.. vi: textwidth=80 noexpandtab
diff --git a/src/ceph/doc/dev/network-protocol.rst b/src/ceph/doc/dev/network-protocol.rst
new file mode 100644
index 0000000..3a59b80
--- /dev/null
+++ b/src/ceph/doc/dev/network-protocol.rst
@@ -0,0 +1,197 @@
+==================
+ Network Protocol
+==================
+
+This file describes the network protocol used by Ceph. In order to understand
+the way the structures are defined it is recommended to read the introduction
+of :doc:`/dev/network-encoding` first.
+
+Hello
+=====
+
+The protocol starts with a handshake that confirms that both nodes are talking
+ceph and shares some basic information.
+
+Banner
+------
+
+The first action is the server sending its banner to the client. The banner is
+defined in ``CEPH_BANNER`` from ``src/include/msgr.h``. This is followed by
+the server's then the client's address, each encoded as an ``entity_addr_t``.
+
+Once the client verifies that the server's banner matches its own, it replies with
+its banner and its address.
+
+Connect
+-------
+
+Once the banners have been verified and the addresses exchanged the connection
+negotiation begins. First the client sends a ``ceph_msg_connect`` structure
+with its information.
+
+::
+
+ // From src/include/msgr.h
+ struct ceph_msg_connect {
+ u64le features; // Supported features (CEPH_FEATURE_*)
+ u32le host_type; // CEPH_ENTITY_TYPE_*
+ u32le global_seq; // Number of connections initiated by this host.
+ u32le connect_seq; // Number of connections initiated in this session.
+ u32le protocol_version;
+ u32le authorizer_protocol;
+ u32le authorizer_len;
+ u8 flags; // CEPH_MSG_CONNECT_*
+ u8 authorizer[authorizer_len];
+ }
+
+Connect Reply
+-------------
+
+Once the connect has been sent the connection has effectively been opened,
+however the first message the server sends must be a connect reply message.
+
+::
+
+ struct ceph_msg_connect_reply {
+ u8 tag; // Tag indicating response code.
+ u64le features;
+ u32le global_seq;
+ u32le connect_seq;
+ u32le protocol_version;
+ u32le authorizer_len;
+ u8 flags;
+ u8 authorizer[authorizer_len];
+ }
+
+MSGR Protocol
+=============
+
+This is a low level protocol over which messages are delivered. The messages
+at this level consist of a tag byte, identifying the type of message, followed
+by the message data.
+
+::
+
+ // Virtual structure.
+ struct {
+ u8 tag; // CEPH_MSGR_TAG_*
+ u8 data[]; // Length depends on tag and data.
+ }
+
+The length of ``data`` is determined by the tag byte and, depending on the
+message type, by information in the ``data`` array itself.
+
+.. note::
+ There is no way to determine the length of the message if you do not
+ understand the type of message.
+
+The message tags are defined in ``src/include/msgr.h`` and the current ones
+are listed below along with the data they include. Note that the defined
+structures don't exist in the source and are merely for representing the
+protocol.
+
+CEPH_MSGR_TAG_CLOSE (0x06)
+--------------------------
+
+::
+
+ struct ceph_msgr_close {
+ u8 tag = 0x06;
+ u8 data[0]; // No data.
+ }
+
+The close message indicates that the connection is being closed.
+
+CEPH_MSGR_TAG_MSG (0x07)
+------------------------
+
+::
+
+ struct ceph_msgr_msg {
+ u8 tag = 0x07;
+ ceph_msg_header header;
+ u8 front [header.front_len ];
+ u8 middle[header.middle_len];
+ u8 data [header.data_len ];
+ ceph_msg_footer footer;
+ }
+
+ // From src/include/msgr.h
+ struct ceph_msg_header {
+ u64le seq; // Sequence number.
+ u64le tid; // Transaction ID.
+ u16le type; // Message type (CEPH_MSG_* or MSG_*).
+ u16le priority; // Priority (higher is more important).
+ u16le version; // Version of message encoding.
+
+ u32le front_len; // The size of the front section.
+ u32le middle_len; // The size of the middle section.
+ u32le data_len; // The size of the data section.
+ u16le data_off; // The way data should be aligned by the receiver.
+
+ ceph_entity_name src; // Information about the sender.
+
+ u16le compat_version; // Oldest compatible encoding version.
+ u16le reserved; // Unused.
+ u32le crc; // CRC of header.
+ }
+
+ // From src/include/msgr.h
+ struct ceph_msg_footer {
+ u32le front_crc; // Checksums of the various sections.
+ u32le middle_crc; //
+ u32le data_crc; //
+ u64le sig; // Cryptographic signature.
+ u8 flags;
+ }
+
+Messages are the business logic of Ceph. They are what is used to send data and
+requests between nodes. The message header contains the length of the message
+so unknown messages can be handled gracefully.
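+
+For example, once the fixed-size header has been decoded, the number of
+bytes still to read (or skip, for an unknown type) follows directly from the
+three section lengths plus the fixed footer size. A rough C++ sketch, with
+made-up names and the section lengths assumed to be already decoded::
+
+  #include <cstdint>
+  #include <iostream>
+
+  // Section lengths as decoded from ceph_msg_header.
+  struct DecodedHeaderSketch {
+    uint32_t front_len;
+    uint32_t middle_len;
+    uint32_t data_len;
+  };
+
+  // Bytes remaining after the header: the three payload sections plus
+  // the footer, whose on-wire size is fixed by the struct shown above.
+  static uint64_t remaining_after_header(const DecodedHeaderSketch &h,
+                                         uint64_t footer_size) {
+    return static_cast<uint64_t>(h.front_len) + h.middle_len + h.data_len +
+           footer_size;
+  }
+
+  int main() {
+    DecodedHeaderSketch h{128, 0, 4096};
+    // footer: three u32le CRCs + a u64le sig + a u8 flags = 21 bytes.
+    std::cout << remaining_after_header(h, 21) << "\n";  // prints 4245
+    return 0;
+  }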
+
+There are two names for the message type constants ``CEPH_MSG_*`` and ``MSG_*``.
+The only difference between the two is that the former are considered "public"
+while the latter are for internal use only. There is no protocol-level
+difference.
+
+CEPH_MSGR_TAG_ACK (0x08)
+------------------------
+
+::
+
+ struct ceph_msgr_ack {
+ u8 tag = 0x08;
+ u64le seq; // The sequence number of the message being acknowledged.
+ }
+
+CEPH_MSGR_TAG_KEEPALIVE (0x09)
+------------------------------
+
+::
+
+ struct ceph_msgr_keepalive {
+ u8 tag = 0x09;
+ u8 data[0]; // No data.
+ }
+
+CEPH_MSGR_TAG_KEEPALIVE2 (0x0E)
+-------------------------------
+
+::
+
+ struct ceph_msgr_keepalive2 {
+ u8 tag = 0x0E;
+ utime_t timestamp;
+ }
+
+CEPH_MSGR_TAG_KEEPALIVE2_ACK (0x0F)
+-----------------------------------
+
+::
+
+ struct ceph_msgr_keepalive2_ack {
+ u8 tag = 0x0F;
+ utime_t timestamp;
+ }
+
+.. vi: textwidth=80 noexpandtab
diff --git a/src/ceph/doc/dev/object-store.rst b/src/ceph/doc/dev/object-store.rst
new file mode 100644
index 0000000..355f515
--- /dev/null
+++ b/src/ceph/doc/dev/object-store.rst
@@ -0,0 +1,70 @@
+====================================
+ Object Store Architecture Overview
+====================================
+
+.. graphviz::
+
+ /*
+ * Rough outline of object store module dependencies
+ */
+
+ digraph object_store {
+ size="7,7";
+ node [color=lightblue2, style=filled, fontname="Serif"];
+
+ "testrados" -> "librados"
+ "testradospp" -> "librados"
+
+ "rbd" -> "librados"
+
+ "radostool" -> "librados"
+
+ "radosgw-admin" -> "radosgw"
+
+ "radosgw" -> "librados"
+
+ "radosacl" -> "librados"
+
+ "librados" -> "objecter"
+
+ "ObjectCacher" -> "Filer"
+
+ "dumpjournal" -> "Journaler"
+
+ "Journaler" -> "Filer"
+
+ "SyntheticClient" -> "Filer"
+ "SyntheticClient" -> "objecter"
+
+ "Filer" -> "objecter"
+
+ "objecter" -> "OSDMap"
+
+ "ceph-osd" -> "PG"
+ "ceph-osd" -> "ObjectStore"
+
+ "crushtool" -> "CrushWrapper"
+
+ "OSDMap" -> "CrushWrapper"
+
+ "OSDMapTool" -> "OSDMap"
+
+ "PG" -> "PrimaryLogPG"
+ "PG" -> "ObjectStore"
+ "PG" -> "OSDMap"
+
+ "PrimaryLogPG" -> "ObjectStore"
+ "PrimaryLogPG" -> "OSDMap"
+
+ "ObjectStore" -> "FileStore"
+ "ObjectStore" -> "BlueStore"
+
+ "BlueStore" -> "rocksdb"
+
+ "FileStore" -> "xfs"
+ "FileStore" -> "btrfs"
+ "FileStore" -> "ext4"
+ }
+
+
+.. todo:: write more here
diff --git a/src/ceph/doc/dev/osd-class-path.rst b/src/ceph/doc/dev/osd-class-path.rst
new file mode 100644
index 0000000..6e209bc
--- /dev/null
+++ b/src/ceph/doc/dev/osd-class-path.rst
@@ -0,0 +1,16 @@
+=======================
+ OSD class path issues
+=======================
+
+::
+
+ 2011-12-05 17:41:00.994075 7ffe8b5c3760 librbd: failed to assign a block name for image
+ create error: error 5: Input/output error
+
+This usually happens because your OSDs can't find ``cls_rbd.so``. They
+search for it in ``osd_class_dir``, which may not be set correctly by
+default (http://tracker.newdream.net/issues/1722).
+
+Most likely it's looking in ``/usr/lib/rados-classes`` instead of
+``/usr/lib64/rados-classes`` - change ``osd_class_dir`` in your
+``ceph.conf`` and restart the OSDs to fix it.
diff --git a/src/ceph/doc/dev/osd_internals/backfill_reservation.rst b/src/ceph/doc/dev/osd_internals/backfill_reservation.rst
new file mode 100644
index 0000000..cf9dab4
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/backfill_reservation.rst
@@ -0,0 +1,38 @@
+====================
+Backfill Reservation
+====================
+
+When a new osd joins a cluster, all pgs containing it must eventually backfill
+to it. If all of these backfills happen simultaneously, they would put excessive
+load on the osd. osd_max_backfills limits the number of outgoing and the number
+of incoming backfills on a single node, each to at most osd_max_backfills.
+Therefore there can be a maximum of osd_max_backfills * 2 simultaneous
+backfills on one osd.
+
+Each OSDService now has two AsyncReserver instances: one for backfills going
+from the osd (local_reserver) and one for backfills going to the osd
+(remote_reserver). An AsyncReserver (common/AsyncReserver.h) manages a queue
+by priority of waiting items and a set of current reservation holders. When a
+slot frees up, the AsyncReserver queues the Context* associated with the next
+item on the highest priority queue in the finisher provided to the constructor.
+
+For a primary to initiate a backfill, it must first obtain a reservation from
+its own local_reserver. Then, it must obtain a reservation from the backfill
+target's remote_reserver via a MBackfillReserve message. This process is
+managed by substates of Active and ReplicaActive (see the substates of Active
+in PG.h). The reservations are dropped either on the Backfilled event (which
+is sent on the primary before calling recovery_complete, and on the replica on
+receipt of the BackfillComplete progress message), or upon leaving Active or
+ReplicaActive.
+
+It's important that we always grab the local reservation before the remote
+reservation in order to prevent a circular dependency.
+
+We want to minimize the risk of data loss by prioritizing the order in
+which PGs are recovered. The highest priority is log based recovery
+(OSD_RECOVERY_PRIORITY_MAX) since this must always complete before
+backfill can start. The next priority is backfill of degraded PGs and
+is a function of the degradation. A backfill for a PG missing two
+replicas will have a priority higher than a backfill for a PG missing
+one replica. The lowest priority is backfill of non-degraded PGs.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding.rst b/src/ceph/doc/dev/osd_internals/erasure_coding.rst
new file mode 100644
index 0000000..7263cc3
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding.rst
@@ -0,0 +1,82 @@
+==============================
+Erasure Coded Placement Groups
+==============================
+
+Glossary
+--------
+
+*chunk*
+ when the encoding function is called, it returns chunks of the same
+ size: data chunks, which can be concatenated to reconstruct the original
+ object, and coding chunks, which can be used to rebuild a lost chunk.
+
+*chunk rank*
+ the index of a chunk when returned by the encoding function. The
+ rank of the first chunk is 0, the rank of the second chunk is 1
+ etc.
+
+*stripe*
+ when an object is too large to be encoded with a single call,
+ each set of chunks created by a call to the encoding function is
+ called a stripe.
+
+*shard|strip*
+ an ordered sequence of chunks of the same rank from the same
+ object. For a given placement group, each OSD contains shards of
+ the same rank. When dealing with objects that are encoded with a
+ single operation, *chunk* is sometimes used instead of *shard*
+ because the shard is made of a single chunk. The *chunks* in a
+ *shard* are ordered according to the rank of the stripe they belong
+ to.
+
+*K*
+ the number of data *chunks*, i.e. the number of *chunks* in which the
+ original object is divided. For instance, if *K* = 2, a 10KB object
+ will be divided into *K* chunks of 5KB each.
+
+*M*
+ the number of coding *chunks*, i.e. the number of additional *chunks*
+ computed by the encoding functions. If there are 2 coding *chunks*,
+ it means 2 OSDs can be out without losing data.
+
+*N*
+ the number of data *chunks* plus the number of coding *chunks*,
+ i.e. *K+M*.
+
+*rate*
+ the proportion of the *chunks* that contains useful information, i.e. *K/N*.
+ For instance, for *K* = 9 and *M* = 3 (i.e. *K+M* = *N* = 12) the rate is
+ *K/N* = 9/12 = 0.75, i.e. 75% of the chunks contain useful information.
+
+The definitions are illustrated as follows (PG stands for placement group):
+::
+
+ OSD 40 OSD 33
+ +-------------------------+ +-------------------------+
+ | shard 0 - PG 10 | | shard 1 - PG 10 |
+ |+------ object O -------+| |+------ object O -------+|
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 0 ||| stripe 0 ||| ||| stripe 0 |||
+ ||+---------------------+|| ||+---------------------+||
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 1 ||| stripe 1 ||| ||| stripe 1 |||
+ ||+---------------------+|| ||+---------------------+||
+ ||+---------------------+|| ||+---------------------+||
+ stripe||| chunk 0 ||| ||| chunk 1 ||| ...
+ 2 ||| stripe 2 ||| ||| stripe 2 |||
+ ||+---------------------+|| ||+---------------------+||
+ |+-----------------------+| |+-----------------------+|
+ | ... | | ... |
+ +-------------------------+ +-------------------------+
+
+Table of contents
+-----------------
+
+.. toctree::
+ :maxdepth: 1
+
+ Developer notes <erasure_coding/developer_notes>
+ Jerasure plugin <erasure_coding/jerasure>
+ High level design document <erasure_coding/ecbackend>
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst
new file mode 100644
index 0000000..770ff4a
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/developer_notes.rst
@@ -0,0 +1,223 @@
+============================
+Erasure Code developer notes
+============================
+
+Introduction
+------------
+
+Each chapter of this document explains an aspect of the implementation
+of the erasure code within Ceph. It is mostly based on examples being
+explained to demonstrate how things work.
+
+Reading and writing encoded chunks from and to OSDs
+---------------------------------------------------
+
+An erasure coded pool stores each object as K+M chunks. It is divided
+into K data chunks and M coding chunks. The pool is configured to have
+a size of K+M so that each chunk is stored in an OSD in the acting
+set. The rank of the chunk is stored as an attribute of the object.
+
+Let's say an erasure coded pool is created to use five OSDs ( K+M =
+5 ) and sustain the loss of two of them ( M = 2 ).
+
+When the object *NYAN* containing *ABCDEFGHI* is written to it, the
+erasure encoding function splits the content into three data chunks,
+simply by dividing the content in three: the first contains *ABC*,
+the second *DEF* and the last *GHI*. The content will be padded if the
+content length is not a multiple of K. The function also creates two
+coding chunks: the fourth with *YXY* and the fifth with *QGC*. Each
+chunk is stored in an OSD in the acting set. The chunks are stored in
+objects that have the same name ( *NYAN* ) but reside on different
+OSDs. The order in which the chunks were created must be preserved and
+is stored as an attribute of the object ( shard_t ), in addition to its
+name. Chunk *1* contains *ABC* and is stored on *OSD5* while chunk *4*
+contains *YXY* and is stored on *OSD3*.
+
+::
+
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ |
+ |
+ v
+ +------+------+
+ +---------------+ encode(3,2) +-----------+
+ | +--+--+---+---+ |
+ | | | | |
+ | +-------+ | +-----+ |
+ | | | | |
+ +--v---+ +--v---+ +--v---+ +--v---+ +--v---+
+ name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 | | 5 |
+ +------+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY | | QGC |
+ +--+---+ +--+---+ +--+---+ +--+---+ +--+---+
+ | | | | |
+ | | | | |
+ | | +--+---+ | |
+ | | | OSD1 | | |
+ | | +------+ | |
+ | | +------+ | |
+ | +------>| OSD2 | | |
+ | +------+ | |
+ | +------+ | |
+ | | OSD3 |<----+ |
+ | +------+ |
+ | +------+ |
+ | | OSD4 |<--------------+
+ | +------+
+ | +------+
+ +----------------->| OSD5 |
+ +------+
+
+
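+A toy C++ sketch of the split step only (no real erasure coding; the names
+and the padding byte are chosen for illustration): the content is padded to
+a multiple of K and cut into K equal data chunks::
+
+  #include <iostream>
+  #include <string>
+  #include <vector>
+
+  // Split `content` into k equal data chunks, padding the tail when the
+  // length is not a multiple of k.
+  static std::vector<std::string> split_into_chunks(std::string content,
+                                                    size_t k) {
+    while (content.size() % k != 0)
+      content.push_back('*');  // padding byte chosen only for readability
+    const size_t chunk_size = content.size() / k;
+    std::vector<std::string> chunks;
+    for (size_t i = 0; i < k; ++i)
+      chunks.push_back(content.substr(i * chunk_size, chunk_size));
+    return chunks;
+  }
+
+  int main() {
+    for (const auto &chunk : split_into_chunks("ABCDEFGHI", 3))
+      std::cout << chunk << "\n";  // prints ABC, DEF and GHI
+    return 0;
+  }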
+
+
+When the object *NYAN* is read from the erasure coded pool, the
+decoding function reads three chunks: chunk *1* containing *ABC*,
+chunk *3* containing *GHI* and chunk *4* containing *YXY*, and rebuilds
+the original content of the object, *ABCDEFGHI*. The decoding function
+is informed that the chunks *2* and *5* are missing (they are called
+*erasures*). The chunk *5* could not be read because *OSD4* is
+*out*.
+
+The decoding function could be called as soon as three chunks are
+read: *OSD2* was the slowest and its chunk does not need to be taken into
+account. This optimization is not implemented in Firefly.
+
+::
+
+ +-------------------+
+ name | NYAN |
+ +-------------------+
+ content | ABCDEFGHI |
+ +--------+----------+
+ ^
+ |
+ |
+ +------+------+
+ | decode(3,2) |
+ | erasures 2,5|
+ +-------------->| |
+ | +-------------+
+ | ^ ^
+ | | +-----+
+ | | |
+ +--+---+ +------+ +--+---+ +--+---+
+ name | NYAN | | NYAN | | NYAN | | NYAN |
+ +------+ +------+ +------+ +------+
+ shard | 1 | | 2 | | 3 | | 4 |
+ +------+ +------+ +------+ +------+
+ content | ABC | | DEF | | GHI | | YXY |
+ +--+---+ +--+---+ +--+---+ +--+---+
+ ^ . ^ ^
+ | TOO . | |
+ | SLOW . +--+---+ |
+ | ^ | OSD1 | |
+ | | +------+ |
+ | | +------+ |
+ | +-------| OSD2 | |
+ | +------+ |
+ | +------+ |
+ | | OSD3 |-----+
+ | +------+
+ | +------+
+ | | OSD4 | OUT
+ | +------+
+ | +------+
+ +------------------| OSD5 |
+ +------+
+
+
+Erasure code library
+--------------------
+
+Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_,
+with parameters K+M, object O is encoded by dividing it into data chunks O1,
+O2, ... OK and computing coding chunks P1, P2, ... PM. Any K chunks
+out of the available K+M chunks can be used to obtain the original
+object. If data chunk O2 or coding chunk P2 are lost, they can be
+repaired using any K chunks out of the K+M chunks. If more than M
+chunks are lost, it is not possible to recover the object.
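+
+As a much simplified illustration of the recovery idea (a single XOR parity
+chunk, i.e. M = 1, rather than real Reed-Solomon), any one lost chunk can be
+rebuilt from the surviving ones::
+
+  #include <cstdint>
+  #include <iostream>
+  #include <string>
+  #include <vector>
+
+  using Chunk = std::vector<uint8_t>;
+
+  // XOR of all chunks, element by element. With one parity chunk this is
+  // both the encode step and the repair step.
+  static Chunk xor_chunks(const std::vector<Chunk> &chunks) {
+    Chunk out(chunks.front().size(), 0);
+    for (const Chunk &c : chunks)
+      for (size_t i = 0; i < out.size(); ++i)
+        out[i] ^= c[i];
+    return out;
+  }
+
+  int main() {
+    Chunk o1 = {'A', 'B', 'C'}, o2 = {'D', 'E', 'F'}, o3 = {'G', 'H', 'I'};
+    Chunk parity = xor_chunks({o1, o2, o3});  // encode: P = O1 ^ O2 ^ O3
+
+    // Pretend O2 was lost: rebuild it from the survivors plus the parity.
+    Chunk rebuilt = xor_chunks({o1, o3, parity});
+    std::cout << std::string(rebuilt.begin(), rebuilt.end()) << "\n";  // DEF
+    return 0;
+  }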
+
+Reading the original content of object O can be a simple
+concatenation of O1, O2, ... OM, because the plugins are using
+`systematic codes
+<http://en.wikipedia.org/wiki/Systematic_code>`_. Otherwise the chunks
+must be given to the erasure code library *decode* method to retrieve
+the content of the object.
+
+Performance depends on the parameters to the encoding functions and
+is also influenced by the packet sizes used when calling the encoding
+functions (for Cauchy or Liberation for instance): smaller packets
+mean more calls and more overhead.
+
+Although Reed-Solomon is provided as a default, Ceph uses it via an
+`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_ designed to
+allow each pool to choose the plugin that implements it using
+key=value pairs stored in an `erasure code profile`_.
+
+.. _erasure code profile: ../../../erasure-coded-pool
+
+::
+
+ $ ceph osd erasure-code-profile set myprofile \
+ crush-failure-domain=osd
+ $ ceph osd erasure-code-profile get myprofile
+ directory=/usr/lib/ceph/erasure-code
+ k=2
+ m=1
+ plugin=jerasure
+ technique=reed_sol_van
+ crush-failure-domain=osd
+ $ ceph osd pool create ecpool 12 12 erasure myprofile
+
+The *plugin* is dynamically loaded from *directory* and expected to
+implement the *int __erasure_code_init(char *plugin_name, char *directory)* function
+which is responsible for registering an object derived from *ErasureCodePlugin*
+in the registry. The `ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ plugin reads:
+
+::
+
+ ErasureCodePluginRegistry &instance =
+ ErasureCodePluginRegistry::instance();
+ instance.add(plugin_name, new ErasureCodePluginExample());
+
+The *ErasureCodePlugin* derived object must provide a factory method
+from which the concrete implementation of the *ErasureCodeInterface*
+object can be generated. The `ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_ reads:
+
+::
+
+ virtual int factory(const map<std::string,std::string> &parameters,
+ ErasureCodeInterfaceRef *erasure_code) {
+ *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
+ return 0;
+ }
+
+The *parameters* argument is the list of *key=value* pairs that were
+set in the erasure code profile, before the pool was created.
+
+::
+
+ ceph osd erasure-code-profile set myprofile \
+    directory=<dir>        \ # mandatory
+    plugin=jerasure        \ # mandatory
+    m=10                   \ # optional and plugin dependent
+    k=3                    \ # optional and plugin dependent
+    technique=reed_sol_van   # optional and plugin dependent
+
+Notes
+-----
+
+If the objects are large, it may be impractical to encode and decode
+them in memory. However, when using *RBD* a 1TB device is divided into
+many individual 4MB objects and *RGW* does the same.
+
+Encoding and decoding is implemented in the OSD. Although it could be
+implemented client side for reads and writes, the OSD must be able to
+encode and decode on its own when scrubbing.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst
new file mode 100644
index 0000000..624ec21
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst
@@ -0,0 +1,207 @@
+=================================
+ECBackend Implementation Strategy
+=================================
+
+Misc initial design notes
+=========================
+
+The initial design for ec pools (and still the effective one for ec
+pools without the hacky ec overwrites debug flag enabled) restricted
+EC pools to operations which can be easily rolled back:
+
+- CEPH_OSD_OP_APPEND: We can roll back an append locally by
+ including the previous object size as part of the PG log event.
+- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
+ requires that we retain the deleted object until all replicas have
+  persisted the deletion event. The erasure coded backend will therefore
+ need to store objects with the version at which they were created
+ included in the key provided to the filestore. Old versions of an
+ object can be pruned when all replicas have committed up to the log
+ event deleting the object.
+- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
+ to be set or removed, we can roll back these operations locally.
+
+Log entries contain a structure explaining how to locally undo the
+operation they represent
+(see osd_types.h:TransactionInfo::LocalRollBack).
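+
+As an illustration of the kind of information such a structure has to
+carry, a minimal sketch follows; the names and fields below are
+invented, and only osd_types.h is authoritative:
+
+::
+
+  #include <cstdint>
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  // Illustrative stand-ins for the per-operation rollback records described
+  // above; the real definitions live in osd_types.h and carry more detail.
+  struct AppendRollback {        // CEPH_OSD_OP_APPEND
+    uint64_t old_size;           // truncate back to this size to undo the append
+  };
+
+  struct DeleteRollback {        // CEPH_OSD_OP_DELETE
+    uint64_t created_at_version; // the retained object is keyed by this version
+  };
+
+  struct AttrRollback {          // CEPH_OSD_OP_(SET|RM)ATTR
+    // prior value per attr; an empty entry means the attr did not exist before
+    std::map<std::string, std::vector<char>> prior_values;
+  };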
+
+PGTemp and Crush
+----------------
+
+Primaries are able to request a temp acting set mapping in order to
+allow an up-to-date OSD to serve requests while a new primary is
+backfilled (and for other reasons). An erasure coded pg needs to be
+able to designate a primary for these reasons without putting it in
+the first position of the acting set. It also needs to be able to
+leave holes in the requested acting set.
+
+Core Changes:
+
+- OSDMap::pg_to_*_osds needs to separately return a primary. For most
+ cases, this can continue to be acting[0].
+- MOSDPGTemp (and related OSD structures) needs to be able to specify
+ a primary as well as an acting set.
+- Much of the existing code base assumes that acting[0] is the primary
+ and that all elements of acting are valid. This needs to be cleaned
+ up since the acting set may contain holes.
+
+Distinguished acting set positions
+----------------------------------
+
+With the replicated strategy, all replicas of a PG are
+interchangeable. With erasure coding, different positions in the
+acting set have different pieces of the erasure coding scheme and are
+not interchangeable. Worse, crush might cause chunk 2 to be written
+to an OSD which already happens to contain an (old) copy of chunk 4.
+This means that the OSD and PG messages need to work in terms of a
+type like pair<shard_t, pg_t> in order to distinguish different pg
+chunks on a single OSD.
+
+Because the mapping of object name to object in the filestore must
+be 1-to-1, we must ensure that the objects in chunk 2 and the objects
+in chunk 4 have different names. To that end, the objectstore must
+include the chunk id in the object key.
+
+Core changes:
+
+- The objectstore `ghobject_t needs to also include a chunk id
+ <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
+ tuple<hobject_t, gen_t, shard_t>.
+- coll_t needs to include a shard_t.
+- The OSD pg_map and similar pg mappings need to work in terms of a
+  spg_t (essentially pair<pg_t, shard_t>). Similarly, pg->pg messages
+  need to include a shard_t.
+- For client->PG messages, the OSD will need a way to know which PG
+  chunk should get the message since the OSD may contain both a
+  primary and non-primary chunk for the same pg.
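+
+A rough sketch of the shard-qualified id those changes call for (field
+names simplified; the real pg_t, shard_id_t and spg_t in the source
+carry more state):
+
+::
+
+  #include <cstdint>
+  #include <tuple>
+
+  typedef int8_t shard_id_t;          // -1 conventionally means "no shard"
+
+  struct pg_t {
+    uint64_t pool;
+    uint32_t seed;
+  };
+
+  struct spg_t {                      // essentially pair<pg_t, shard_t>
+    pg_t pgid;
+    shard_id_t shard;
+  };
+
+  // Ordering so spg_t can key maps such as the OSD pg_map: chunk 2 and
+  // chunk 4 of the same pg on one OSD compare as distinct entries.
+  inline bool operator<(const spg_t &l, const spg_t &r) {
+    return std::tie(l.pgid.pool, l.pgid.seed, l.shard) <
+           std::tie(r.pgid.pool, r.pgid.seed, r.shard);
+  }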
+
+Object Classes
+--------------
+
+Reads from object classes will return ENOTSUP on ec pools by invoking
+a special SYNC read.
+
+Scrub
+-----
+
+The main catch for ec pools is that sending a crc32 of the
+stored chunk on a replica isn't particularly helpful since the chunks
+on different replicas presumably store different data. Because we
+don't support overwrites except via DELETE, however, we have the
+option of maintaining a crc32 on each chunk through each append.
+Thus, each replica instead simply computes a crc32 of its own stored
+chunk and compares it with the locally stored checksum. The replica
+then reports to the primary whether the checksums match.
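+
+Because chunks only ever grow by appends, the stored checksum can be
+maintained incrementally and the write path never needs to re-read the
+whole chunk. A minimal sketch of that chaining using zlib's crc32 (the
+OSD itself uses Ceph's own crc32c helpers, so this is only an
+illustration):
+
+::
+
+  #include <zlib.h>
+  #include <cstdio>
+  #include <string>
+
+  int main() {
+    // Running checksum maintained across appends, as described above.
+    uLong stored_crc = crc32(0L, Z_NULL, 0);
+
+    std::string appends[] = {"ABC", "DEF", "GHI"};
+    for (const std::string &chunk : appends)
+      stored_crc = crc32(stored_crc,
+                         reinterpret_cast<const Bytef*>(chunk.data()),
+                         chunk.size());
+
+    // At scrub time, re-read the whole chunk and compare with stored_crc.
+    std::string whole = "ABCDEFGHI";
+    uLong scrub_crc = crc32(crc32(0L, Z_NULL, 0),
+                            reinterpret_cast<const Bytef*>(whole.data()),
+                            whole.size());
+    std::printf("match: %s\n", stored_crc == scrub_crc ? "yes" : "no");
+    return 0;
+  }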
+
+With overwrites, all scrubs are disabled for now until we work out
+what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst).
+
+Crush
+-----
+
+If crush is unable to generate a replacement for a down member of an
+acting set, the acting set should have a hole at that position rather
+than shifting the other elements of the acting set out of position.
+
+=========
+ECBackend
+=========
+
+MAIN OPERATION OVERVIEW
+=======================
+
+A RADOS put operation can span
+multiple stripes of a single object. There must be code that
+tessellates the application level write into a set of per-stripe write
+operations -- some whole-stripes and up to two partial
+stripes. Without loss of generality, for the remainder of this
+document we will focus exclusively on writing a single stripe (whole
+or partial). We will use the symbol "W" to represent the number of
+blocks within a stripe that are being written, i.e., W <= K.
+
+There are two data flows for handling a write into an EC stripe. The
+choice between the two is based on the size of the write operation and
+the arithmetic properties of the selected parity-generation algorithm.
+
+(1) the whole stripe is written/overwritten
+(2) a read-modify-write operation is performed.
+
+WHOLE STRIPE WRITE
+------------------
+
+This is the simple case, and is already performed in the existing code
+(for appends, that is). The primary receives all of the data for the
+stripe in the RADOS request, computes the appropriate parity blocks
+and sends the data and parity blocks to their destination shards, which
+write them. This is essentially the current EC code.
+
+READ-MODIFY-WRITE
+-----------------
+
+The primary determines which of the K-W blocks will remain unmodified,
+and reads them from the shards. Once all of the data is received it is
+combined with the received new data and new parity blocks are
+computed. The modified blocks are sent to their respective shards and
+written. The RADOS operation is acknowledged.
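+
+A rough sketch of that flow at the primary, with all messaging and
+transaction plumbing stripped out and the parity computation replaced
+by a single XOR parity block for brevity (real pools use the
+configured EC plugin):
+
+::
+
+  #include <cstddef>
+  #include <map>
+  #include <vector>
+
+  typedef std::vector<char> block_t;
+
+  // One-stripe read-modify-write: 'updates' maps data block index -> new
+  // contents for the W modified blocks; 'read_shard' supplies the K-W
+  // unmodified blocks read back from the shards.
+  template <typename ReadShard>
+  std::vector<block_t> rmw_stripe(unsigned k,
+                                  const std::map<unsigned, block_t> &updates,
+                                  ReadShard read_shard)
+  {
+    std::vector<block_t> blocks(k);
+    for (unsigned i = 0; i < k; ++i) {
+      auto u = updates.find(i);
+      blocks[i] = (u != updates.end()) ? u->second      // new data from client
+                                       : read_shard(i); // unmodified, read back
+    }
+    // Recompute parity over the whole stripe (blocks assumed equal-sized).
+    block_t parity(blocks[0].size(), 0);
+    for (const block_t &b : blocks)
+      for (std::size_t j = 0; j < parity.size(); ++j)
+        parity[j] ^= b[j];
+    blocks.push_back(parity);  // modified data blocks + new parity go to shards
+    return blocks;
+  }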
+
+OSD Object Write and Consistency
+--------------------------------
+
+Regardless of the algorithm chosen above, writing of the data is a two
+phase process: commit and rollforward. The primary sends the log
+entries with the operation described (see
+osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)).
+In all cases, the "commit" is performed in place, possibly leaving some
+information required for a rollback in a write-aside object. The
+rollforward phase occurs once all acting set replicas have committed
+the commit (sorry, overloaded term) and removes the rollback information.
+
+In the case of overwrites of existing stripes, the rollback information
+has the form of a sparse object containing the old values of the
+overwritten extents populated using clone_range. This is essentially
+a place-holder implementation; in real life, bluestore will have an
+efficient primitive for this.
+
+The rollforward part can be delayed since we report the operation as
+committed once all replicas have committed. Currently, whenever we
+send a write, we also indicate that all previously committed
+operations should be rolled forward (see
+ECBackend::try_reads_to_commit). If there aren't any in the pipeline
+when we arrive at the waiting_rollforward queue, we start a dummy
+write to move things along (see the Pipeline section later on and
+ECBackend::try_finish_rmw).
+
+ExtentCache
+-----------
+
+It's pretty important to be able to pipeline writes on the same
+object. For this reason, there is a cache of extents written by
+cacheable operations. Each extent remains pinned until the operations
+referring to it are committed. The pipeline prevents rmw operations
+from running until uncacheable transactions (clones, etc) are flushed
+from the pipeline.
+
+See ExtentCache.h for a detailed explanation of how the cache
+states correspond to the higher level invariants about the conditions
+under which concurrent operations can refer to the same object.
+
+Pipeline
+--------
+
+Reading src/osd/ExtentCache.h should have given a good idea of how
+operations might overlap. There are several states involved in
+processing a write operation and an important invariant, not
+enforced by PrimaryLogPG at a higher level, which need to be
+managed by ECBackend. The important invariant is that we can't
+have uncacheable and rmw operations running at the same time
+on the same object. For simplicity, we simply enforce that any
+operation which contains an rmw operation must wait until
+all in-progress uncacheable operations complete.
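+
+A toy sketch of that rule, purely illustrative; the real bookkeeping
+is the ECBackend::waiting_* pipeline mentioned below:
+
+::
+
+  #include <deque>
+  #include <functional>
+
+  // rmw operations queue up behind any in-progress uncacheable operations
+  // (clones etc.), per the invariant above.
+  struct RMWGate {
+    unsigned uncacheable_in_progress;
+    std::deque<std::function<void()>> waiting_rmw;
+
+    RMWGate() : uncacheable_in_progress(0) {}
+
+    void start_uncacheable()  { ++uncacheable_in_progress; }
+
+    void finish_uncacheable() {
+      if (--uncacheable_in_progress == 0) {
+        while (!waiting_rmw.empty()) {        // flush everything blocked on us
+          waiting_rmw.front()();
+          waiting_rmw.pop_front();
+        }
+      }
+    }
+
+    void submit_rmw(std::function<void()> op) {
+      if (uncacheable_in_progress)
+        waiting_rmw.push_back(std::move(op)); // must wait for the flush
+      else
+        op();                                 // nothing uncacheable in flight
+    }
+  };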
+
+There are improvements to be made here in the future.
+
+For more details, see ECBackend::waiting_* and
+ECBackend::try_<from>_to_<to>.
+
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst
new file mode 100644
index 0000000..27669a0
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/jerasure.rst
@@ -0,0 +1,33 @@
+===============
+jerasure plugin
+===============
+
+Introduction
+------------
+
+The parameters interpreted by the jerasure plugin are:
+
+::
+
+ ceph osd erasure-code-profile set myprofile \
+    directory=<dir>        \ # plugin directory absolute path
+    plugin=jerasure        \ # plugin name (only jerasure)
+    k=<k>                  \ # data chunks (default 2)
+    m=<m>                  \ # coding chunks (default 2)
+    technique=<technique>    # coding technique
+
+The coding techniques can be chosen among *reed_sol_van*,
+*reed_sol_r6_op*, *cauchy_orig*, *cauchy_good*, *liberation*,
+*blaum_roth* and *liber8tion*.
+
+The *src/erasure-code/jerasure* directory contains the
+implementation. It is a wrapper around the code found at
+`https://github.com/ceph/jerasure <https://github.com/ceph/jerasure>`_
+and `https://github.com/ceph/gf-complete
+<https://github.com/ceph/gf-complete>`_ , pinned to the latest stable
+version in *.gitmodules*. These repositories are copies of the
+upstream repositories `http://jerasure.org/jerasure/jerasure
+<http://jerasure.org/jerasure/jerasure>`_ and
+`http://jerasure.org/jerasure/gf-complete
+<http://jerasure.org/jerasure/gf-complete>`_ . The difference
+between the two, if any, should match pull requests against upstream.
diff --git a/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst b/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst
new file mode 100644
index 0000000..52f98e8
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst
@@ -0,0 +1,385 @@
+:orphan:
+
+=================================
+Proposed Next Steps for ECBackend
+=================================
+
+PARITY-DELTA-WRITE
+------------------
+
+RMW operations currently require 4 network hops (2 round trips). In
+principle, for some codes, we can reduce this to 3 by sending the
+update to the replicas holding the data blocks and having them
+compute a delta to forward onto the parity blocks.
+
+The primary reads the current values of the "W" blocks and then uses
+the new values of the "W" blocks to compute parity-deltas for each of
+the parity blocks. The W blocks and the parity delta-blocks are sent
+to their respective shards.
+
+The choice of whether to use a read-modify-write or a
+parity-delta-write is a complex policy issue that is TBD in the details
+and is likely to be heavily dependent on the computational costs
+associated with a parity-delta vs. a regular parity-generation
+operation. However, it is believed that the parity-delta scheme is
+likely to be the preferred choice, when available.
+
+The internal interface to the erasure coding library plug-ins needs to
+be extended to support the ability to query if parity-delta
+computation is possible for a selected algorithm as well as an
+interface to the actual parity-delta computation algorithm when
+available.
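+
+One possible shape for that extension, shown only to make the idea
+concrete; none of these names exist in the current plugin interface:
+
+::
+
+  #include <vector>
+
+  typedef std::vector<char> block_t;
+
+  // Hypothetical additions to the erasure code plugin interface: a
+  // capability query plus the two halves of a parity-delta update.
+  struct ParityDeltaSupport {
+    // Can this code turn (old block, new block) into parity deltas?
+    virtual bool supports_parity_delta() const = 0;
+
+    // Computed on the data shard: the delta for one rewritten data block.
+    virtual block_t compute_delta(const block_t &old_block,
+                                  const block_t &new_block) const = 0;
+
+    // Applied on the parity shard: fold the forwarded delta into a parity
+    // block in place.
+    virtual void apply_delta(unsigned data_block_index,
+                             const block_t &delta,
+                             block_t *parity_block) const = 0;
+
+    virtual ~ParityDeltaSupport() {}
+  };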
+
+Stripe Cache
+------------
+
+It may be a good idea to extend the current ExtentCache usage to
+cache some data past when the pinning operation releases it.
+One application pattern that is important to optimize is the small
+block sequential write operation (think of the journal of a journaling
+file system or a database transaction log). Regardless of the chosen
+redundancy algorithm, it is advantageous for the primary to
+retain/buffer recently read/written portions of a stripe in order to
+reduce network traffic. The dynamic contents of this cache may be used
+in the determination of whether a read-modify-write or a
+parity-delta-write is performed. The sizing of this cache is TBD, but
+we should plan on allowing at least a few full stripes per active
+client. Limiting the cache occupancy on a per-client basis will reduce
+the noisy neighbor problem.
+
+Recovery and Rollback Details
+=============================
+
+Implementing a Rollback-able Prepare Operation
+----------------------------------------------
+
+The prepare operation is implemented at each OSD through a simulation
+of a versioning or copy-on-write capability for modifying a portion of
+an object.
+
+When a prepare operation is performed, the new data is written into a
+temporary object. The PG log for the
+operation will contain a reference to the temporary object so that it
+can be located for recovery purposes as well as a record of all of the
+shards which are involved in the operation.
+
+In order to avoid fragmentation (and hence poor future read
+performance), creation of the temporary object needs special
+attention. The name of the temporary object affects its location
+within the KV store. Right now it's unclear whether it's desirable for
+the name to locate near the base object or whether a separate subset
+of keyspace should be used for temporary objects. Sam believes that
+colocation with the base object is preferred (he suggests using the
+generation counter of the ghobject for temporaries), whereas Allen
+believes that using a separate subset of keyspace is desirable since
+these keys are ephemeral and we don't want to actually colocate them
+with the base object keys. Perhaps some modeling here can help resolve
+this issue. The data of the temporary object wants to be located as
+close to the data of the base object as possible. This may be best
+performed by adding a new ObjectStore creation primitive that takes
+the base object as an additional parameter that is a hint to the
+allocator.
+
+Sam: I think that the short lived thing may be a red herring. We'll
+be updating the donor and primary objects atomically, so it seems like
+we'd want them adjacent in the key space, regardless of the donor's
+lifecycle.
+
+The apply operation moves the data from the temporary object into the
+correct position within the base object and deletes the associated
+temporary object. This operation is done using a specialized
+ObjectStore primitive. In the current ObjectStore interface, this can
+be done using the clonerange function followed by a delete, but can be
+done more efficiently with a specialized move primitive.
+Implementation of the specialized primitive on FileStore can be done
+by copying the data. Some file systems have extensions that might also
+be able to implement this operation (like a defrag API that swaps
+chunks between files). It is expected that NewStore will be able to
+support this efficiently and natively. (It has been noted that this
+sequence requires that temporary object allocations, which tend to be
+small, be efficiently converted into blocks for main objects and that
+blocks that were formerly inside of main objects must be reusable with
+minimal overhead.)
+
+The prepare and apply operations can be separated arbitrarily in
+time. If a read operation accesses an object that has been altered by
+a prepare operation (but without a corresponding apply operation) it
+must return the data after the prepare operation. This is done by
+creating an in-memory database of objects which have had a prepare
+operation without a corresponding apply operation. All read operations
+must consult this in-memory data structure in order to get the correct
+data. It should be explicitly recognized that it is likely that there
+will be multiple prepare operations against a single base object and
+the code must handle this case correctly. This code is implemented as
+a layer between ObjectStore and all existing readers. Annoyingly,
+we'll want to trash this state when the interval changes, so the first
+thing that needs to happen after activation is that the primary and
+replicas apply up to last_update so that the empty cache will be
+correct.
+
+During peering, it is now obvious that an unapplied prepare operation
+can easily be rolled back simply by deleting the associated temporary
+object and removing that entry from the in-memory data structure.
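+
+A minimal sketch of that in-memory structure; the names are invented
+and it only illustrates the read-redirect and rollback-by-delete
+behaviour described above:
+
+::
+
+  #include <cstdint>
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  // Prepared-but-unapplied extents per base object. Every read consults
+  // this; rolling back an unapplied prepare just drops the entry (and
+  // would also delete the temporary object).
+  struct PreparedExtent {
+    uint64_t off, len;
+    std::string temp_object;       // where the prepared data currently lives
+  };
+
+  static std::map<std::string, std::vector<PreparedExtent>> prepared;
+
+  void prepare(const std::string &obj, uint64_t off, uint64_t len,
+               const std::string &temp_object) {
+    prepared[obj].push_back(PreparedExtent{off, len, temp_object});
+  }
+
+  // Returns the temp object a range must be read from, or nullptr for the
+  // base object.
+  const std::string *read_redirect(const std::string &obj,
+                                   uint64_t off, uint64_t len) {
+    auto it = prepared.find(obj);
+    if (it == prepared.end())
+      return nullptr;
+    for (const auto &e : it->second)
+      if (off < e.off + e.len && e.off < off + len)
+        return &e.temp_object;
+    return nullptr;
+  }
+
+  void rollback(const std::string &obj) {
+    prepared.erase(obj);           // the temporary objects get unlinked as well
+  }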
+
+Partial Application Peering/Recovery modifications
+--------------------------------------------------
+
+Some writes will be small enough to not require updating all of the
+shards holding data blocks. For write amplification minimization
+reasons, it would be best to avoid writing to those shards at all,
+and delay even sending the log entries until the next write which
+actually hits that shard.
+
+The delaying (buffering) of the transmission of the prepare and apply
+operations for witnessing OSDs creates new situations that peering
+must handle. In particular the logic for determining the authoritative
+last_update value (and hence the selection of the OSD which has the
+authoritative log) must be modified to account for the valid but
+missing (i.e., delayed/buffered) pglog entries which the
+authoritative OSD only witnessed.
+
+Because a partial write might complete without persisting a log entry
+on every replica, we have to do a bit more work to determine an
+authoritative last_update. The constraint (as with a replicated PG)
+is that last_update >= the most recent log entry for which a commit
+was sent to the client (call this actual_last_update). Secondarily,
+we want last_update to be as small as possible since any log entry
+past actual_last_update (we do not apply a log entry until we have
+sent the commit to the client) must be able to be rolled back. Thus,
+the smaller a last_update we choose, the less recovery will need to
+happen (we can always roll back, but rolling a replica forward may
+require an object rebuild). Thus, we will set last_update to 1 before
+the oldest log entry we can prove cannot have been committed. In
+current master, this is simply the last_update of the shortest log
+from that interval (because that log did not persist any entry past
+that point -- a precondition for sending a commit to the client). For
+this design, we must consider the possibility that any log is missing
+at its head log entries in which it did not participate. Thus, we
+must determine the most recent interval in which we went active
+(essentially, this is what find_best_info currently does). We then
+pull the log from each live osd from that interval back to the minimum
+last_update among them. Then, we extend all logs from the
+authoritative interval until each hits an entry in which it should
+have participated, but did not record. The shortest of these extended
+logs must therefore contain any log entry for which we sent a commit
+to the client -- and the last entry gives us our last_update.
+
+Deep scrub support
+------------------
+
+The simple answer here is probably our best bet. EC pools can't use
+the omap namespace at all right now. The simplest solution would be
+to take a prefix of the omap space and pack N M byte L bit checksums
+into each key/value. The prefixing seems like a sensible precaution
+against eventually wanting to store something else in the omap space.
+It seems like any write will need to read at least the blocks
+containing the modified range. However, with a code able to compute
+parity deltas, we may not need to read a whole stripe. Even without
+that, we don't want to have to write to blocks not participating in
+the write. Thus, each shard should store checksums only for itself.
+It seems like you'd be able to store checksums for all shards on the
+parity blocks, but there may not be distinguished parity blocks which
+are modified on all writes (LRC or shec provide two examples). L
+should probably have a fixed number of options (16, 32, 64?) and be
+configurable per-pool at pool creation. N and M should likewise be
+configurable at pool creation with sensible defaults.
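+
+To make the packing concrete, a sketch assuming L=32 bit checksums,
+N=64 checksums per key/value, and an invented ``cs_`` key prefix; all
+three are placeholders for the per-pool configurables discussed above:
+
+::
+
+  #include <cstdint>
+  #include <cstdio>
+  #include <cstring>
+  #include <string>
+
+  static const unsigned N = 64;              // checksums packed per key/value
+
+  // The key encodes which run of N stripes the value covers.
+  std::string key_for_stripe(uint64_t stripe) {
+    char buf[32];
+    std::snprintf(buf, sizeof(buf), "cs_%016llx",
+                  (unsigned long long)(stripe / N));
+    return std::string(buf);
+  }
+
+  // Value layout: N uint32_t checksums in host byte order, slot = stripe % N.
+  void set_checksum(std::string *value, uint64_t stripe, uint32_t crc) {
+    if (value->size() < N * sizeof(uint32_t))
+      value->resize(N * sizeof(uint32_t), '\0');
+    std::memcpy(&(*value)[(stripe % N) * sizeof(uint32_t)], &crc, sizeof(crc));
+  }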
+
+We need to handle online upgrade. I think the right answer is that
+the first overwrite to an object with an append only checksum
+removes the append only checksum and writes in whatever stripe
+checksums actually got written. The next deep scrub then writes
+out the full checksum omap entries.
+
+RADOS Client Acknowledgement Generation Optimization
+====================================================
+
+Now that the recovery scheme is understood, we can discuss the
+generation of the RADOS operation acknowledgement (ACK) by the
+primary ("sufficient" from above). It is NOT required that the primary
+wait for all shards to complete their respective prepare
+operations. Using our example where the RADOS operation writes only
+"W" chunks of the stripe, the primary will generate and send W+M
+prepare operations (possibly including a send-to-self). The primary
+need only wait for enough shards to be written to ensure recovery of
+the data. Thus, after writing W+M chunks, it can afford the loss of M
+chunks. Hence the primary can generate the RADOS ACK after W+M-M = W
+of those prepare operations have completed.
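+
+A trivial sketch of that counting, using made-up numbers (W=3, M=2)
+purely as a worked example:
+
+::
+
+  #include <cassert>
+
+  // The ACK can go out once W of the W+M prepares have committed, since
+  // losing the M still-outstanding ones is survivable.
+  struct AckCounter {
+    unsigned w, m, completed;
+    bool ackable() const { return completed >= w; }
+    void on_prepare_complete() { ++completed; }
+  };
+
+  int main() {
+    AckCounter c = {3, 2, 0};     // a write touching W=3 chunks on an m=2 pool
+    for (int i = 0; i < 3; ++i)
+      c.on_prepare_complete();    // three shards report their prepare committed
+    assert(c.ackable());          // safe to send the RADOS ACK now
+    return 0;
+  }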
+
+Inconsistent object_info_t versions
+===================================
+
+A natural consequence of only writing the blocks which actually
+changed is that we don't want to update the object_info_t of the
+objects which didn't. I actually think it would pose a problem to do
+so: pg ghobject namespaces are generally large, and unless the osd is
+seeing a bunch of overwrites on a small set of objects, I'd expect
+each write to be far enough apart in the backing ghobject_t->data
+mapping to each constitute a random metadata update. Thus, we have to
+accept that not every shard will have the current version in its
+object_info_t. We can't even bound how old the version on a
+particular shard will happen to be. In particular, the primary does
+not necessarily have the current version. One could argue that the
+parity shards would always have the current version, but not every
+code necessarily has designated parity shards which see every write
+(certainly LRC, iirc shec, and even with a more pedestrian code, it
+might be desirable to rotate the shards based on object hash). Even
+if you chose to designate a shard as witnessing all writes, the pg
+might be degraded with that particular shard missing. This is a bit
+tricky: currently, reads and writes implicitly return the most recent
+version of the object written. On reads, we'd have to read K shards
+to answer that question. We can get around that by adding a "don't
+tell me the current version" flag. Writes are more problematic: we
+need an object_info from the most recent write in order to form the
+new object_info and log_entry.
+
+A truly terrifying option would be to eliminate version and
+prior_version entirely from the object_info_t. There are a few
+specific purposes it serves:
+
+#. On OSD startup, we prime the missing set by scanning backwards
+ from last_update to last_complete comparing the stored object's
+ object_info_t to the version of most recent log entry.
+#. During backfill, we compare versions between primary and target
+   to avoid some pushes. We use it elsewhere as well.
+#. While pushing and pulling objects, we verify the version.
+#. We return it on reads and writes and allow the librados user to
+   assert it atomically on writes to allow the user to deal with write
+   races (used extensively by rbd).
+
+Case (3) isn't actually essential, just convenient. Oh well. (4)
+is more annoying. Writes are easy since we know the version. Reads
+are tricky because we may not need to read from all of the replicas.
+Simplest solution is to add a flag to rados operations to just not
+return the user version on read. We can also just not support the
+user version assert on ec for now (I think? Only user is rgw bucket
+indices iirc, and those will always be on replicated because they use
+omap).
+
+We can avoid (1) by maintaining the missing set explicitly. It's
+already possible for there to be a missing object without a
+corresponding log entry (Consider the case where the most recent write
+is to an object which has not been updated in weeks. If that write
+becomes divergent, the written object needs to be marked missing based
+on the prior_version which is not in the log.) The PGLog already has
+a way of handling those edge cases (see divergent_priors). We'd
+simply expand that to contain the entire missing set and maintain it
+atomically with the log and the objects. This isn't really an
+unreasonable option: the additional keys would be fewer than the
+existing log keys + divergent_priors and aren't updated in the fast
+write path anyway.
+
+The second case is a bit trickier. It's really an optimization for
+the case where a pg became not in the acting set long enough for the
+logs to no longer overlap but not long enough for the PG to have
+healed and removed the old copy. Unfortunately, this describes the
+case where a node was taken down for maintenance with noout set. It's
+probably not acceptable to re-backfill the whole OSD in such a case,
+so we need to be able to quickly determine whether a particular shard
+is up to date given a valid acting set of other shards.
+
+Let ordinary writes which do not change the object size not touch the
+object_info at all. That means that the object_info version won't
+match the pg log entry version. Include in the pg_log_entry_t the
+current object_info version as well as which shards participated (as
+mentioned above). In addition to the object_info_t attr, record on
+each shard s a vector recording for each other shard s' the most
+recent write which spanned both s and s'. Operationally, we maintain
+an attr on each shard containing that vector. A write touching S
+updates the version stamp entry for each shard in S on each shard in
+S's attribute (and leaves the rest alone). If we have a valid acting
+set during backfill, we must have a witness of every write which
+completed -- so taking the max of each entry over all of the acting
+set shards must give us the current version for each shard. During
+recovery, we set the attribute on the recovery target to that max
+vector (Question: with LRC, we may not need to touch much of the
+acting set to recover a particular shard -- can we just use the max of
+the shards we used to recover, or do we need to grab the version
+vector from the rest of the acting set as well? I'm not sure, not a
+big deal anyway, I think).
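+
+A sketch of the per-shard version stamp vector described above, with
+the types simplified (a plain integer stands in for eversion_t):
+
+::
+
+  #include <algorithm>
+  #include <cstdint>
+  #include <set>
+  #include <vector>
+
+  typedef uint64_t eversion_t;           // simplified stand-in version stamp
+
+  // Per-shard attribute: vec[s'] = most recent write that spanned this shard
+  // and shard s'.
+  typedef std::vector<eversion_t> version_vec_t;
+
+  // A write at version v touching the shards in S updates, on every shard
+  // in S, the stamp for every shard in S (and leaves the rest alone).
+  void record_write(version_vec_t *attr, const std::set<unsigned> &S,
+                    eversion_t v) {
+    for (unsigned s : S)
+      (*attr)[s] = std::max((*attr)[s], v);
+  }
+
+  // With a valid acting set, some member witnessed every completed write, so
+  // the element-wise max over the acting set's vectors is authoritative; this
+  // is what gets stamped onto a recovery target. Assumes a non-empty input.
+  version_vec_t authoritative(const std::vector<version_vec_t> &acting_vectors) {
+    version_vec_t out(acting_vectors.front().size(), 0);
+    for (const version_vec_t &vec : acting_vectors)
+      for (size_t i = 0; i < out.size(); ++i)
+        out[i] = std::max(out[i], vec[i]);
+    return out;
+  }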
+
+The above lets us perform blind writes without knowing the current
+object version (log entry version, that is) while still allowing us to
+avoid backfilling up to date objects. The only catch is that our
+backfill scans will have to scan all replicas, not just the primary
+and the backfill targets.
+
+It would be worth adding into scrub the ability to check the
+consistency of the gathered version vectors -- probably by just
+taking 3 random valid subsets and verifying that they generate
+the same authoritative version vector.
+
+Implementation Strategy
+=======================
+
+It goes without saying that it would be unwise to attempt to do all of
+this in one massive PR. It's also not a good idea to merge code which
+isn't being tested. To that end, it's worth thinking a bit about
+which bits can be tested on their own (perhaps with a bit of temporary
+scaffolding).
+
+We can implement the overwrite friendly checksumming scheme easily
+enough with the current implementation. We'll want to enable it on a
+per-pool basis (probably using a flag which we'll later repurpose for
+actual overwrite support). We can enable it in some of the ec
+thrashing tests in the suite. We can also add a simple test
+validating the behavior of turning it on for an existing ec pool
+(later, we'll want to be able to convert append-only ec pools to
+overwrite ec pools, so that test will simply be expanded as we go).
+The flag should be gated by the experimental feature flag since we
+won't want to support this as a valid configuration -- testing only.
+We need to upgrade append only ones in place during deep scrub.
+
+Similarly, we can implement the unstable extent cache with the current
+implementation, it even lets us cut out the readable ack the replicas
+send to the primary after the commit which lets it release the lock.
+Same deal, implement, gate with experimental flag, add to some of the
+automated tests. I don't really see a reason not to use the same flag
+as above.
+
+We can certainly implement the move-range primitive with unit tests
+before there are any users. Adding coverage to the existing
+objectstore tests would suffice here.
+
+Explicit missing set can be implemented now, same deal as above --
+might as well even use the same feature bit.
+
+The TPC protocol outlined above can actually be implemented for an
+append-only EC pool. Same deal as above, can even use the same feature bit.
+
+The RADOS flag to suppress the read op user version return can be
+implemented immediately. Mostly just needs unit tests.
+
+The version vector problem is an interesting one. For append only EC
+pools, it would be pointless since all writes increase the size and
+therefore update the object_info. We could do it for replicated pools
+though. It's a bit silly since all "shards" see all writes, but it
+would still let us implement and partially test the augmented backfill
+code as well as the extra pg log entry fields -- this depends on the
+explicit pg log entry branch having already merged. It's not entirely
+clear to me that this one is worth doing separately. It's enough code
+that I'd really prefer to get it done independently, but it's also a
+fair amount of scaffolding that will be later discarded.
+
+PGLog entries need to be able to record the participants and log
+comparison needs to be modified to extend logs with entries they
+wouldn't have witnessed. This logic should be abstracted behind
+PGLog so it can be unit tested -- that would let us test it somewhat
+before the actual ec overwrites code merges.
+
+Whatever needs to happen to the ec plugin interface can probably be
+done independently of the rest of this (pending resolution of
+questions below).
+
+The actual nuts and bolts of performing the ec overwrite it seems to
+me can't be productively tested (and therefore implemented) until the
+above are complete, so best to get all of the supporting code in
+first.
+
+Open Questions
+==============
+
+Is there a code we should be using that would let us compute a parity
+delta without rereading and reencoding the full stripe? If so, is it
+the kind of thing we need to design for now, or can it be reasonably
+put off?
+
+What needs to happen to the EC plugin interface?
diff --git a/src/ceph/doc/dev/osd_internals/index.rst b/src/ceph/doc/dev/osd_internals/index.rst
new file mode 100644
index 0000000..7e82914
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/index.rst
@@ -0,0 +1,10 @@
+==============================
+OSD developer documentation
+==============================
+
+.. rubric:: Contents
+
+.. toctree::
+ :glob:
+
+ *
diff --git a/src/ceph/doc/dev/osd_internals/last_epoch_started.rst b/src/ceph/doc/dev/osd_internals/last_epoch_started.rst
new file mode 100644
index 0000000..9978bd3
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/last_epoch_started.rst
@@ -0,0 +1,60 @@
+======================
+last_epoch_started
+======================
+
+info.last_epoch_started records an activation epoch e for interval i
+such that all writes committed in i or earlier are reflected in the
+local info/log and no writes after i are reflected in the local
+info/log. Since no committed write is ever divergent, even if we
+get an authoritative log/info with an older info.last_epoch_started,
+we can leave our info.last_epoch_started alone since no writes could
+have committed in any intervening interval (See PG::proc_master_log).
+
+info.history.last_epoch_started records a lower bound on the most
+recent interval in which the pg as a whole went active and accepted
+writes. On a particular osd, it is also an upper bound on the
+activation epoch of intervals in which writes in the local pg log
+occurred (we update it before accepting writes). Because all
+committed writes are committed by all acting set osds, any
+non-divergent writes ensure that history.last_epoch_started was
+recorded by all acting set members in the interval. Once peering has
+queried one osd from each interval back to some seen
+history.last_epoch_started, it follows that no interval after the max
+history.last_epoch_started can have reported writes as committed
+(since we record it before recording client writes in an interval).
+Thus, the minimum last_update across all infos with
+info.last_epoch_started >= MAX(history.last_epoch_started) must be an
+upper bound on writes reported as committed to the client.
+
+We update info.last_epoch_started with the initial activation message,
+but we only update history.last_epoch_started after the new
+info.last_epoch_started is persisted (possibly along with the first
+write). This ensures that we do not require an osd with the most
+recent info.last_epoch_started until all acting set osds have recorded
+it.
+
+In find_best_info, we do include info.last_epoch_started values when
+calculating the max_last_epoch_started_found because we want to avoid
+designating a log entry divergent which in a prior interval would have
+been non-divergent since it might have been used to serve a read. In
+activate(), we use the peer's last_epoch_started value as a bound on
+how far back divergent log entries can be found.
+
+However, in a case like
+
+.. code::
+
+ calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
+ calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
+
+since osd.1 is the only one which recorded info.les=477, while 4 and 0
+(which were the acting set in that interval) did not (4 restarted and 0
+did not get the message in time), the pg is marked incomplete when
+either 4 or 0 would have been valid choices. To avoid this, we do not
+consider info.les for incomplete peers when calculating
+min_last_epoch_started_found. It would not have been in the acting
+set, so we must have another osd from that interval anyway (if
+maybe_went_rw). If that osd does not remember that info.les, then we
+cannot have served reads.
diff --git a/src/ceph/doc/dev/osd_internals/log_based_pg.rst b/src/ceph/doc/dev/osd_internals/log_based_pg.rst
new file mode 100644
index 0000000..8b11012
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/log_based_pg.rst
@@ -0,0 +1,206 @@
+============
+Log Based PG
+============
+
+Background
+==========
+
+Why PrimaryLogPG?
+-----------------
+
+Currently, consistency for all ceph pool types is ensured by primary
+log-based replication. This goes for both erasure-coded and
+replicated pools.
+
+Primary log-based replication
+-----------------------------
+
+Reads must return data written by any write which completed (where the
+client could possibly have received a commit message). There are lots
+of ways to handle this, but ceph's architecture makes it easy for
+everyone at any map epoch to know who the primary is. Thus, the easy
+answer is to route all writes for a particular pg through a single
+ordering primary and then out to the replicas. Though we only
+actually need to serialize writes on a single object (and even then,
+the partial ordering only really needs to provide an ordering between
+writes on overlapping regions), we might as well serialize writes on
+the whole PG since it lets us represent the current state of the PG
+using two numbers: the epoch of the map on the primary in which the
+most recent write started (this is a bit stranger than it might seem
+since map distribution itself is asynchronous -- see Peering and the
+concept of interval changes) and an increasing per-pg version number
+-- this is referred to in the code with type eversion_t and stored as
+pg_info_t::last_update. Furthermore, we maintain a log of "recent"
+operations extending back at least far enough to include any
+*unstable* writes (writes which have been started but not committed)
+and objects which aren't up to date locally (see recovery and
+backfill). In practice, the log will extend much further
+(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not
+clean) because it's handy for quickly performing recovery.
+
+Using this log, as long as we talk to a non-empty subset of the OSDs
+which must have accepted any completed writes from the most recent
+interval in which we accepted writes, we can determine a conservative
+log which must contain any write which has been reported to a client
+as committed. There is some freedom here: we can choose any log entry
+between the oldest head remembered by an element of that set (any
+newer cannot have completed without that log containing it) and the
+newest head remembered (clearly, all writes in the log were started,
+so it's fine for us to remember them) as the new head. This is the
+main point of divergence between replicated pools and ec pools in
+PG/PrimaryLogPG: replicated pools try to choose the newest valid
+option to avoid the client needing to replay those operations and
+instead recover the other copies. EC pools instead try to choose
+the *oldest* option available to them.
+
+The reason for this gets to the heart of the rest of the differences
+in implementation: one copy will not generally be enough to
+reconstruct an ec object. Indeed, there are encodings where some log
+combinations would leave unrecoverable objects (as with a 4+2 encoding
+where 3 of the replicas remember a write, but the other 3 do not -- we
+don't have enough chunks of either version). For this reason, log entries
+representing *unstable* writes (writes not yet committed to the
+client) must be rollbackable using only local information on ec pools.
+Log entries in general may therefore be rollbackable (and in that case,
+via a delayed application or via a set of instructions for rolling
+back an in-place update) or not. Replicated pool log entries are
+never able to be rolled back.
+
+For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
+osd_types.h:pg_log_entry_t, and peering in general.
+
+ReplicatedBackend/ECBackend unification strategy
+================================================
+
+PGBackend
+---------
+
+So, the fundamental difference between replication and erasure coding
+is that replication can do destructive updates while erasure coding
+cannot. It would be really annoying if we needed to have two entire
+implementations of PrimaryLogPG, one for each of the two, if there
+are really only a few fundamental differences:
+
+#. How reads work -- async only, requires remote reads for ec
+#. How writes work -- either restricted to append, or must write aside and do a
+ tpc
+#. Whether we choose the oldest or newest possible head entry during peering
+#. A bit of extra information in the log entry to enable rollback
+
+and so many similarities
+
+#. All of the stats and metadata for objects
+#. The high level locking rules for mixing client IO with recovery and scrub
+#. The high level locking rules for mixing reads and writes without exposing
+ uncommitted state (which might be rolled back or forgotten later)
+#. The process, metadata, and protocol needed to determine the set of osds
+   which participated in the most recent interval in which we accepted writes
+#. etc.
+
+Instead, we choose a few abstractions (and a few kludges) to paper over the differences:
+
+#. PGBackend
+#. PGTransaction
+#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
+#. Various bits of the write pipeline disallow some operations based on pool
+ type -- like omap operations, class operation reads, and writes which are
+ not aligned appends (officially, so far) for ec
+#. Misc other kludges here and there
+
+PGBackend and PGTransaction enable abstraction of differences 1, 2,
+and the addition of 4 as needed to the log entries.
+
+The replicated implementation is in ReplicatedBackend.h/cc and doesn't
+require much explanation, I think. More detail on the ECBackend can be
+found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.
+
+PGBackend Interface Explanation
+===============================
+
+Note: this is from a design document from before the original firefly
+and is probably out of date w.r.t. some of the method names.
+
+Readable vs Degraded
+--------------------
+
+For a replicated pool, an object is readable iff it is present on
+the primary (at the right version). For an ec pool, we need at least
+M shards present to do a read, and we need it on the primary. For
+this reason, PGBackend needs to include some interfaces for determining
+when recovery is required to serve a read vs a write. This also
+changes the rules for when peering has enough logs to proceed.
+
+Core Changes:
+
+- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
+ | objects to allow the user to make these determinations.
+
+Client Reads
+------------
+
+Reads with the replicated strategy can always be satisfied
+synchronously out of the primary OSD. With an erasure coded strategy,
+the primary will need to request data from some number of replicas in
+order to satisfy a read. PGBackend will therefore need to provide
+separate objects_read_sync and objects_read_async interfaces where
+the former won't be implemented by the ECBackend.
+
+PGBackend interfaces:
+
+- objects_read_sync
+- objects_read_async
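+
+A rough sketch of what those two interfaces amount to; the signatures
+are simplified and invented for illustration, and the real
+declarations are in PGBackend.h:
+
+::
+
+  #include <cstdint>
+  #include <functional>
+  #include <string>
+  #include <vector>
+
+  typedef std::vector<char> bufferlist_t;    // stand-in for ceph::bufferlist
+
+  struct PGBackendReads {
+    // Synchronous read out of the local store; only meaningful for
+    // replicated pools, so an ECBackend would not implement it.
+    virtual int objects_read_sync(const std::string &oid, uint64_t off,
+                                  uint64_t len, bufferlist_t *out) = 0;
+
+    // Asynchronous read: the backend gathers whatever shards it needs
+    // (possibly over the network) and calls back with the reconstructed data.
+    virtual void objects_read_async(
+        const std::string &oid, uint64_t off, uint64_t len,
+        std::function<void(int, bufferlist_t&)> on_complete) = 0;
+
+    virtual ~PGBackendReads() {}
+  };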
+
+Scrub
+-----
+
+We currently have two scrub modes with different default frequencies:
+
+#. [shallow] scrub: compares the set of objects and metadata, but not
+ the contents
+#. deep scrub: compares the set of objects, metadata, and a crc32 of
+ the object contents (including omap)
+
+The primary requests a scrubmap from each replica for a particular
+range of objects. The replica fills out this scrubmap for the range
+of objects including, if the scrub is deep, a crc32 of the contents of
+each object. The primary gathers these scrubmaps from each replica
+and performs a comparison identifying inconsistent objects.
+
+Most of this can work essentially unchanged with erasure coded PG with
+the caveat that the PGBackend implementation must be in charge of
+actually doing the scan.
+
+
+PGBackend interfaces:
+
+- be_*
+
+Recovery
+--------
+
+The logic for recovering an object depends on the backend. With
+the current replicated strategy, we first pull the object replica
+to the primary and then concurrently push it out to the replicas.
+With the erasure coded strategy, we probably want to read the
+minimum number of replica chunks required to reconstruct the object
+and push out the replacement chunks concurrently.
+
+Another difference is that objects in erasure coded pg may be
+unrecoverable without being unfound. The "unfound" concept
+should probably then be renamed to unrecoverable. Also, the
+PGBackend implementation will have to be able to direct the search
+for pg replicas with unrecoverable object chunks and to be able
+to determine whether a particular object is recoverable.
+
+
+Core changes:
+
+- s/unfound/unrecoverable
+
+PGBackend interfaces:
+
+- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
+- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
+- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
+- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
+- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_
diff --git a/src/ceph/doc/dev/osd_internals/map_message_handling.rst b/src/ceph/doc/dev/osd_internals/map_message_handling.rst
new file mode 100644
index 0000000..a5013c2
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/map_message_handling.rst
@@ -0,0 +1,131 @@
+===========================
+Map and PG Message handling
+===========================
+
+Overview
+--------
+The OSD handles routing incoming messages to PGs, creating the PG if necessary
+in some cases.
+
+PG messages generally come in two varieties:
+
+ 1. Peering Messages
+ 2. Ops/SubOps
+
+There are several ways in which a message might be dropped or delayed. It is
+important that the message delaying does not result in a violation of certain
+message ordering requirements on the way to the relevant PG handling logic:
+
+ 1. Ops referring to the same object must not be reordered.
+ 2. Peering messages must not be reordered.
+ 3. Subops must not be reordered.
+
+MOSDMap
+-------
+MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
+OSD must perform several tasks:
+
+ 1. Persist the new maps to the filestore.
+ Several PG operations rely on having access to maps dating back to the last
+ time the PG was clean.
+ 2. Update and persist the superblock.
+ 3. Update OSD state related to the current map.
+ 4. Expose new maps to PG processes via *OSDService*.
+ 5. Remove PGs due to pool removal.
+ 6. Queue dummy events to trigger PG map catchup.
+
+Each PG asynchronously catches up to the currently published map during
+process_peering_events before processing the event. As a result, different
+PGs may have different views as to the "current" map.
+
+One consequence of this design is that messages containing submessages from
+multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
+with the PG's epoch as well as tagging the message as a whole with the OSD's
+current published epoch.
+
+MOSDPGOp/MOSDPGSubOp
+--------------------
+See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
+
+MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
+between OSDs to coordinate most non peering activities including replicating
+MOSDPGOp operations.
+
+OSD::require_same_or_newer_map checks that the current OSDMap is at least
+as new as the map epoch indicated on the message. If not, the message is
+queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
+cannot violate the above conditions since any two messages will be queued
+in order of receipt and if a message is received with epoch e0, a later message
+from the same source must be at epoch at least e0. Note that two PGs from
+the same OSD count for these purposes as different sources for single PG
+messages. That is, messages from different PGs may be reordered.
+
+
+MOSDPGOps follow the following process:
+
+ 1. OSD::handle_op: validates permissions and crush mapping.
+    discards the request if the client is no longer connected and cannot get the reply (see OSD::op_is_discardable)
+ See OSDService::handle_misdirected_op
+ See PG::op_has_sufficient_caps
+ See OSD::require_same_or_newer_map
+ 2. OSD::enqueue_op
+
+MOSDSubOps follow the following process:
+
+ 1. OSD::handle_sub_op checks that sender is an OSD
+ 2. OSD::enqueue_op
+
+OSD::enqueue_op calls PG::queue_op which checks waiting_for_map before calling OpWQ::queue which adds the op to the queue of the PG responsible for handling it.
+
+OSD::dequeue_op is then eventually called, with a lock on the PG. At
+this time, the op is passed to PG::do_request, which checks that:
+
+ 1. the PG map is new enough (PG::must_delay_op)
+ 2. the client requesting the op has enough permissions (PG::op_has_sufficient_caps)
+ 3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill})
+ 4. the PG is active (PG::flushed boolean)
+ 5. the op is a CEPH_MSG_OSD_OP and the PG is in PG_STATE_ACTIVE state and not in PG_STATE_REPLAY
+
+If these conditions are not met, the op is either discarded or queued for later processing. If all conditions are met, the op is processed according to its type:
+
+ 1. CEPH_MSG_OSD_OP is handled by PG::do_op
+ 2. MSG_OSD_SUBOP is handled by PG::do_sub_op
+ 3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply
+ 4. MSG_OSD_PG_SCAN is handled by PG::do_scan
+ 5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill
+
+CEPH_MSG_OSD_OP processing
+--------------------------
+
+PrimaryLogPG::do_op handles a CEPH_MSG_OSD_OP op and will queue it
+
+ 1. in wait_for_all_missing if it is a CEPH_OSD_OP_PGLS for a designated snapid and some object updates are still missing
+ 2. in waiting_for_active if the op may write but the scrubber is working
+ 3. in waiting_for_missing_object if the op requires an object or a snapdir or a specific snap that is still missing
+ 4. in waiting_for_degraded_object if the op may write an object or a snapdir that is degraded, or if another object blocks it ("blocked_by")
+ 5. in waiting_for_backfill_pos if the op requires an object that will be available after the backfill is complete
+ 6. in waiting_for_ack if an ack from another OSD is expected
+ 7. in waiting_for_ondisk if the op is waiting for a write to complete
+
+Peering Messages
+----------------
+See OSD::handle_pg_(notify|info|log|query)
+
+Peering messages are tagged with two epochs:
+
+ 1. epoch_sent: map epoch at which the message was sent
+ 2. query_epoch: map epoch at which the message triggering the message was sent
+
+These are the same in cases where there was no triggering message. We discard
+a peering message if the PG in question has entered a new epoch since the
+message's query_epoch (See PG::old_peering_evt, PG::queue_peering_event). Notifies,
+infos, queries, and logs are all handled as PG::RecoveryMachine events and
+are wrapped by PG::queue_* into PG::CephPeeringEvts, which include the created
+state machine event along with epoch_sent and query_epoch in order to
+generically check PG::old_peering_message upon insertion and removal from the
+queue.
+
+Note, notifies, logs, and infos can trigger the creation of a PG. See
+OSD::get_or_create_pg.
+
+
diff --git a/src/ceph/doc/dev/osd_internals/osd_overview.rst b/src/ceph/doc/dev/osd_internals/osd_overview.rst
new file mode 100644
index 0000000..192ddf8
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/osd_overview.rst
@@ -0,0 +1,106 @@
+===
+OSD
+===
+
+Concepts
+--------
+
+*Messenger*
+ See src/msg/Messenger.h
+
+ Handles sending and receipt of messages on behalf of the OSD. The OSD uses
+ two messengers:
+
+ 1. cluster_messenger - handles traffic to other OSDs, monitors
+ 2. client_messenger - handles client traffic
+
+ This division allows the OSD to be configured with different interfaces for
+ client and cluster traffic.
+
+*Dispatcher*
+ See src/msg/Dispatcher.h
+
+ OSD implements the Dispatcher interface. Of particular note is ms_dispatch,
+ which serves as the entry point for messages received via either the client
+ or cluster messenger. Because there are two messengers, ms_dispatch may be
+ called from at least two threads. The osd_lock is always held during
+ ms_dispatch.
+
+*WorkQueue*
+ See src/common/WorkQueue.h
+
+ The WorkQueue class abstracts the process of queueing independent tasks
+ for asynchronous execution. Each OSD process contains workqueues for
+ distinct tasks:
+
+ 1. OpWQ: handles ops (from clients) and subops (from other OSDs).
+ Runs in the op_tp threadpool.
+ 2. PeeringWQ: handles peering tasks and pg map advancement
+ Runs in the op_tp threadpool.
+ See Peering
+ 3. CommandWQ: handles commands (pg query, etc)
+ Runs in the command_tp threadpool.
+ 4. RecoveryWQ: handles recovery tasks.
+ Runs in the recovery_tp threadpool.
+ 5. SnapTrimWQ: handles snap trimming
+ Runs in the disk_tp threadpool.
+ See SnapTrimmer
+ 6. ScrubWQ: handles primary scrub path
+ Runs in the disk_tp threadpool.
+ See Scrub
+ 7. ScrubFinalizeWQ: handles primary scrub finalize
+ Runs in the disk_tp threadpool.
+ See Scrub
+ 8. RepScrubWQ: handles replica scrub path
+ Runs in the disk_tp threadpool
+ See Scrub
+ 9. RemoveWQ: Asynchronously removes old pg directories
+ Runs in the disk_tp threadpool
+ See PGRemoval
+
+*ThreadPool*
+ See src/common/WorkQueue.h
+ See also above.
+
+ There are 4 OSD threadpools:
+
+ 1. op_tp: handles ops and subops
+ 2. recovery_tp: handles recovery tasks
+ 3. disk_tp: handles disk intensive tasks
+ 4. command_tp: handles commands
+
+*OSDMap*
+ See src/osd/OSDMap.h
+
+ The crush algorithm takes two inputs: a picture of the cluster
+ with status information about which nodes are up/down and in/out,
+ and the pgid to place. The former is encapsulated by the OSDMap.
+ Maps are numbered by *epoch* (epoch_t). These maps are passed around
+ within the OSD as std::tr1::shared_ptr<const OSDMap>.
+
+ See MapHandling
+
+*PG*
+ See src/osd/PG.* src/osd/PrimaryLogPG.*
+
+ Objects in rados are hashed into *PGs* and *PGs* are placed via crush onto
+ OSDs. The PG structure is responsible for handling requests pertaining to
+ a particular *PG* as well as for maintaining relevant metadata and controlling
+ recovery.
+
+*OSDService*
+ See src/osd/OSD.cc OSDService
+
+ The OSDService acts as a broker between PG threads and OSD state which allows
+ PGs to perform actions using OSD services such as workqueues and messengers.
+ This is still a work in progress. Future cleanups will focus on moving such
+ state entirely from the OSD into the OSDService.
+
+Overview
+--------
+ See src/ceph_osd.cc
+
+ The OSD process represents one leaf device in the crush hierarchy. There
+ might be one OSD process per physical machine, or more than one if, for
+ example, the user configures one OSD instance per disk.
+
diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.rst b/src/ceph/doc/dev/osd_internals/osd_throttles.rst
new file mode 100644
index 0000000..6739bd9
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/osd_throttles.rst
@@ -0,0 +1,93 @@
+=============
+OSD Throttles
+=============
+
+There are three significant throttles in the filestore: wbthrottle,
+op_queue_throttle, and a throttle based on journal usage.
+
+WBThrottle
+----------
+The WBThrottle is defined in src/os/filestore/WBThrottle.[h,cc] and
+included in FileStore as FileStore::wbthrottle. The intention is to
+bound the amount of outstanding IO we need to do to flush the journal.
+At the same time, we don't want to necessarily do it inline in case we
+might be able to combine several IOs on the same object close together
+in time. Thus, in FileStore::_write, we queue the fd for asynchronous
+flushing and block in FileStore::_do_op if we have exceeded any hard
+limits until the background flusher catches up.
+
+The relevant config options are filestore_wbthrottle*. There are
+different defaults for xfs and btrfs. Each set has hard and soft
+limits on bytes (total dirty bytes), ios (total dirty ios), and
+inodes (total dirty fds). The WBThrottle will begin flushing
+when any of these hits the soft limit and will block in throttle()
+while any has exceeded the hard limit.
+
+Tighter soft limits will cause writeback to happen more quickly,
+but may cause the OSD to miss opportunities for write coalescing.
+Tighter hard limits may cause a reduction in latency variance by
+reducing time spent flushing the journal, but may reduce writeback
+parallelism.
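+
+As a rough illustration only (the limit values, dict layout, and helper
+names below are invented for this sketch and are not the actual FileStore
+code), the soft/hard limit behaviour can be thought of as::
+
+    # start background flushing past any soft limit; block writers in
+    # throttle() while any hard limit is exceeded
+    SOFT = {"bytes": 10 << 20, "ios": 500, "fds": 500}
+    HARD = {"bytes": 100 << 20, "ios": 5000, "fds": 5000}
+
+    def should_start_flushing(dirty):
+        # dirty holds the current totals of dirty bytes/ios/fds
+        return any(dirty[k] >= SOFT[k] for k in SOFT)
+
+    def must_block_writer(dirty):
+        return any(dirty[k] >= HARD[k] for k in HARD)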
+
+op_queue_throttle
+-----------------
+The op queue throttle is intended to bound the amount of queued but
+uncompleted work in the filestore by delaying threads calling
+queue_transactions more and more based on how many ops and bytes are
+currently queued. The throttle is taken in queue_transactions and
+released when the op is applied to the filesystem. This period
+includes time spent in the journal queue, time spent writing to the
+journal, time spent in the actual op queue, time spent waiting for the
+wbthrottle to open up (thus, the wbthrottle can push back indirectly
+on the queue_transactions caller), and time spent actually applying
+the op to the filesystem. A BackoffThrottle is used to gradually
+delay the queueing thread after each throttle becomes more than
+filestore_queue_low_threshhold full (a ratio of
+filestore_queue_max_(bytes|ops)). The throttles will block once the
+max value is reached (filestore_queue_max_(bytes|ops)).
+
+The significant config options are:
+filestore_queue_low_threshhold
+filestore_queue_high_threshhold
+filestore_expected_throughput_ops
+filestore_expected_throughput_bytes
+filestore_queue_high_delay_multiple
+filestore_queue_max_delay_multiple
+
+While each throttle is at less than low_threshhold of the max,
+no delay happens. Between low and high, the throttle will
+inject a per-op delay (per op or byte) ramping from 0 at low to
+high_delay_multiple/expected_throughput at high. From high to
+1, the delay will ramp from high_delay_multiple/expected_throughput
+to max_delay_multiple/expected_throughput.
+
+filestore_queue_high_delay_multiple and
+filestore_queue_max_delay_multiple probably do not need to be
+changed.
+
+Setting these properly should help to smooth out op latencies by
+mostly avoiding the hard limit.
+
+See FileStore::throttle_ops and FileStore::throttle_bytes.
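+
+As a hedged sketch of the ramp described above (the interpolation details
+and function signature here are assumptions for illustration, not the
+actual BackoffThrottle code), the injected per-op delay looks roughly
+like::
+
+    def per_op_delay(current, maximum, low, high,
+                     high_multiple, max_multiple, expected_throughput):
+        ratio = float(current) / maximum
+        if ratio < low:
+            return 0.0
+        if ratio < high:
+            # ramp from 0 at `low` to high_multiple/throughput at `high`
+            frac = (ratio - low) / (high - low)
+            return frac * high_multiple / expected_throughput
+        # ramp from high_multiple/throughput at `high` to
+        # max_multiple/throughput at 1.0
+        frac = (ratio - high) / (1.0 - high)
+        return (high_multiple +
+                frac * (max_multiple - high_multiple)) / expected_throughput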
+
+journal usage throttle
+----------------------
+See src/os/filestore/JournalThrottle.h/cc
+
+The intention of the journal usage throttle is to gradually slow
+down queue_transactions callers as the journal fills up in order
+to smooth out hiccups during filestore syncs. JournalThrottle
+wraps a BackoffThrottle and tracks journaled but not flushed
+journal entries so that the throttle can be released when the
+journal is flushed. The configs work very similarly to the
+op_queue_throttle.
+
+The significant config options are:
+journal_throttle_low_threshhold
+journal_throttle_high_threshhold
+filestore_expected_throughput_ops
+filestore_expected_throughput_bytes
+journal_throttle_high_multiple
+journal_throttle_max_multiple
+
+.. literalinclude:: osd_throttles.txt
diff --git a/src/ceph/doc/dev/osd_internals/osd_throttles.txt b/src/ceph/doc/dev/osd_internals/osd_throttles.txt
new file mode 100644
index 0000000..0332377
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/osd_throttles.txt
@@ -0,0 +1,21 @@
+ Messenger throttle (number and size)
+ |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ FileStore op_queue throttle (number and size, includes a soft throttle based on filestore_expected_throughput_(ops|bytes))
+ |--------------------------------------------------------|
+ WBThrottle
+ |---------------------------------------------------------------------------------------------------------|
+ Journal (size, includes a soft throttle based on filestore_expected_throughput_bytes)
+ |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ |----------------------------------------------------------------------------------------------------> flushed ----------------> synced
+ |
+Op: Read Header --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> op_applied -------------------------------------------------------------> Complete
+ | |
+SubOp: --Messenger--> ReadHeader --DispatchQ--> OSD::_dispatch --OpWQ--> PG::do_request --journalq--> Journal --FileStore::OpWQ--> Apply Thread --Finisher--> sub_op_applied -
+ |
+ |-----------------------------> flushed ----------------> synced
+ |------------------------------------------------------------------------------------------|
+ Journal (size)
+ |---------------------------------|
+ WBThrottle
+ |-----------------------------------------------------|
+ FileStore op_queue throttle (number and size)
diff --git a/src/ceph/doc/dev/osd_internals/pg.rst b/src/ceph/doc/dev/osd_internals/pg.rst
new file mode 100644
index 0000000..4055363
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/pg.rst
@@ -0,0 +1,31 @@
+====
+PG
+====
+
+Concepts
+--------
+
+*Peering Interval*
+ See PG::start_peering_interval.
+ See PG::acting_up_affected
+ See PG::RecoveryState::Reset
+
+ A peering interval is a maximal set of contiguous map epochs in which the
+ up and acting sets did not change. PG::RecoveryMachine represents a
+ transition from one interval to another as passing through
+  RecoveryState::Reset. On PG::RecoveryState::AdvMap, PG::acting_up_affected can
+  cause the PG to transition to Reset.
+
+
+Peering Details and Gotchas
+---------------------------
+For an overview of peering, see `Peering <../../peering>`_.
+
+ * PG::flushed defaults to false and is set to false in
+ PG::start_peering_interval. Upon transitioning to PG::RecoveryState::Started
+    we send a transaction through the pg op sequencer which, upon completion,
+ sends a FlushedEvt which sets flushed to true. The primary cannot go
+ active until this happens (See PG::RecoveryState::WaitFlushedPeering).
+ Replicas can go active but cannot serve ops (writes or reads).
+ This is necessary because we cannot read our ondisk state until unstable
+ transactions from the previous interval have cleared.
diff --git a/src/ceph/doc/dev/osd_internals/pg_removal.rst b/src/ceph/doc/dev/osd_internals/pg_removal.rst
new file mode 100644
index 0000000..d968ecc
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/pg_removal.rst
@@ -0,0 +1,56 @@
+==========
+PG Removal
+==========
+
+See OSD::_remove_pg, OSD::RemoveWQ
+
+There are two ways for a pg to be removed from an OSD:
+
+ 1. MOSDPGRemove from the primary
+ 2. OSD::advance_map finds that the pool has been removed
+
+In either case, our general strategy for removing the pg is to
+atomically set the metadata objects (pg->log_oid, pg->biginfo_oid) to
+backfill and asynchronously remove the pg collections. We do not do
+this inline because scanning the collections to remove the objects is
+an expensive operation.
+
+OSDService::deleting_pgs tracks all pgs in the process of being
+deleted. Each DeletingState object in deleting_pgs lives while at
+least one reference to it remains. Each item in RemoveWQ carries a
+reference to the DeletingState for the relevant pg such that
+deleting_pgs.lookup(pgid) will return a null ref only if there are no
+collections currently being deleted for that pg.
+
+The DeletingState for a pg also carries information about the status
+of the current deletion and allows the deletion to be cancelled.
+The possible states are:
+
+ 1. QUEUED: the PG is in the RemoveWQ
+ 2. CLEARING_DIR: the PG's contents are being removed synchronously
+  3. DELETING_DIR: the PG's directories and metadata are being queued for removal
+ 4. DELETED_DIR: the final removal transaction has been queued
+ 5. CANCELED: the deletion has been canceled
+
+In 1 and 2, the deletion can be canceled. Each state transition
+method (and check_canceled) returns false if deletion has been
+canceled and true if the state transition was successful. Similarly,
+try_stop_deletion() returns true if it succeeds in canceling the
+deletion. Additionally, if try_stop_deletion() fails to stop the
+deletion, it will not return until the final removal
+transaction is queued. This ensures that any operations queued after
+that point will be ordered after the pg deletion.
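+
+The cancellation semantics can be pictured with a small sketch (the class
+below is illustrative Python, not the actual C++ DeletingState)::
+
+    QUEUED, CLEARING_DIR, DELETING_DIR, DELETED_DIR, CANCELED = range(5)
+
+    class DeletingStateSketch(object):
+        def __init__(self):
+            self.status = QUEUED
+
+        def start_clearing(self):
+            # state transition methods return False once canceled
+            if self.status == CANCELED:
+                return False
+            self.status = CLEARING_DIR
+            return True
+
+        def try_stop_deletion(self):
+            # cancellation only succeeds before the removal transactions
+            # have been queued
+            if self.status in (QUEUED, CLEARING_DIR):
+                self.status = CANCELED
+                return True
+            return False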
+
+OSD::_create_lock_pg must handle two cases:
+
+ 1. Either there is no DeletingStateRef for the pg, or it failed to cancel
+ 2. We succeeded in canceling the deletion.
+
+In case 1., we proceed as if there were no deletion occurring, except that
+we avoid writing to the PG until the deletion finishes. In case 2., we
+proceed as in case 1., except that we first mark the PG as backfilling.
+
+Similarly, OSD::osr_registry ensures that the OpSequencers for those
+pgs can be reused for a new pg if created before the old one is fully
+removed, ensuring that operations on the new pg are sequenced properly
+with respect to operations on the old one.
diff --git a/src/ceph/doc/dev/osd_internals/pgpool.rst b/src/ceph/doc/dev/osd_internals/pgpool.rst
new file mode 100644
index 0000000..45a252b
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/pgpool.rst
@@ -0,0 +1,22 @@
+==================
+PGPool
+==================
+
+PGPool is a structure used to manage and update the status of removed
+snapshots. It does this by maintaining two fields: cached_removed_snaps (the
+current removed snap set) and newly_removed_snaps (the snaps removed in the
+last epoch). In OSD::load_pgs the osd map is recovered from the pg's file store
+and passed down to OSD::_get_pool, where a PGPool object is initialised with the
+map.
+
+With each new map we receive we call PGPool::update with the new map. In that
+function we build a list of newly removed snaps
+(pg_pool_t::build_removed_snaps) and merge that with our cached_removed_snaps.
+This function includes checks to make sure we only do this update when things
+have changed or there has been a map gap.
+
+When we activate the pg we initialise the snap trim queue from
+cached_removed_snaps and subtract the purged_snaps we have already purged
+leaving us with the list of snaps that need to be trimmed. Trimming is later
+performed asynchronously by the snap_trim_wq.
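+
+Using plain Python sets to stand in for the interval sets (purely
+illustrative; the helper names below are not the real methods), the
+bookkeeping amounts to::
+
+    def pool_update(cached_removed_snaps, newly_removed_snaps):
+        # PGPool::update: fold the newly removed snaps into the cache
+        return cached_removed_snaps | newly_removed_snaps
+
+    def initial_snap_trimq(cached_removed_snaps, purged_snaps):
+        # on activation: trim what was removed but not yet purged
+        return cached_removed_snaps - purged_snaps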
+
diff --git a/src/ceph/doc/dev/osd_internals/recovery_reservation.rst b/src/ceph/doc/dev/osd_internals/recovery_reservation.rst
new file mode 100644
index 0000000..4ab0319
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/recovery_reservation.rst
@@ -0,0 +1,75 @@
+====================
+Recovery Reservation
+====================
+
+Recovery reservation extends and subsumes backfill reservation. The
+reservation system from backfill recovery is used for local and remote
+reservations.
+
+When a PG goes active, first it determines what type of recovery is
+necessary, if any. It may need log-based recovery, backfill recovery,
+both, or neither.
+
+In log-based recovery, the primary first acquires a local reservation
+from the OSDService's local_reserver. Then a MRemoteReservationRequest
+message is sent to each replica in order of OSD number. These requests
+will always be granted (i.e., cannot be rejected), but they may take
+some time to be granted if the remotes have already granted all their
+remote reservation slots.
+
+After all reservations are acquired, log-based recovery proceeds as it
+would without the reservation system.
+
+After log-based recovery completes, the primary releases all remote
+reservations. The local reservation remains held. The primary then
+determines whether backfill is necessary. If it is not necessary, the
+primary releases its local reservation and waits in the Recovered state
+for all OSDs to indicate that they are clean.
+
+If backfill recovery occurs after log-based recovery, the local
+reservation does not need to be reacquired since it is still held from
+before. If it occurs immediately after activation (log-based recovery
+not possible/necessary), the local reservation is acquired according to
+the typical process.
+
+Once the primary has its local reservation, it requests a remote
+reservation from the backfill target. This reservation CAN be rejected,
+for instance if the OSD is too full (backfillfull_ratio osd setting).
+If the reservation is rejected, the primary drops its local
+reservation, waits (osd_backfill_retry_interval), and then retries. It
+will retry indefinitely.
+
+Once the primary has the local and remote reservations, backfill
+proceeds as usual. After backfill completes the remote reservation is
+dropped.
+
+Finally, after backfill (or log-based recovery if backfill was not
+necessary), the primary drops the local reservation and enters the
+Recovered state. Once all the PGs have reported they are clean, the
+primary enters the Clean state and marks itself active+clean.
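+
+The backfill retry loop can be sketched as follows (the reservation API
+below is invented for illustration; the real logic is driven by the PG
+recovery state machine events)::
+
+    import time
+
+    def reserve_for_backfill(local_reserver, backfill_target,
+                             retry_interval):
+        while True:
+            local = local_reserver.acquire()        # local first
+            if backfill_target.request_remote():    # may be rejected
+                return local                        # hold both, backfill
+            local.release()                         # drop, wait, retry
+            time.sleep(retry_interval)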
+
+
+--------------
+Things to Note
+--------------
+
+We always grab the local reservation first, to prevent a circular
+dependency. We grab remote reservations in order of OSD number for the
+same reason.
+
+The recovery reservation state chart controls the PG state as reported
+to the monitor. The state chart can set:
+
+ - recovery_wait: waiting for local/remote reservations
+ - recovering: recovering
+ - recovery_toofull: recovery stopped, OSD(s) above full ratio
+ - backfill_wait: waiting for remote backfill reservations
+ - backfilling: backfilling
+ - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio
+
+
+--------
+See Also
+--------
+
+The Active substate of the automatically generated OSD state diagram.
diff --git a/src/ceph/doc/dev/osd_internals/scrub.rst b/src/ceph/doc/dev/osd_internals/scrub.rst
new file mode 100644
index 0000000..3343b39
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/scrub.rst
@@ -0,0 +1,30 @@
+
+Scrubbing Behavior Table
+========================
+
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Flags | none | noscrub | nodeep_scrub | noscrub/nodeep_scrub |
++=================================================+==========+===========+===============+======================+
+| Periodic tick | S | X | S | X |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Periodic tick after osd_deep_scrub_interval | D | D | S | X |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated scrub | S | S | S | S |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated scrub after osd_deep_scrub_interval | D | D | S | S |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+| Initiated deep scrub | D | D | D | D |
++-------------------------------------------------+----------+-----------+---------------+----------------------+
+
+- X = Do nothing
+- S = Do regular scrub
+- D = Do deep scrub
+
+State variables
+---------------
+
+- Periodic tick state is !must_scrub && !must_deep_scrub && !time_for_deep
+- Periodic tick after osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep
+- Initiated scrub state is must_scrub && !must_deep_scrub && !time_for_deep
+- Initiated scrub after osd_deep_scrub_interval state is must_scrub && !must_deep_scrub && time_for_deep
+- Initiated deep scrub state is must_scrub && must_deep_scrub
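+
+One way to read the table and state variables together (this is an
+interpretation of the table above, not the actual scheduling code) is::
+
+    def scrub_action(must_scrub, must_deep_scrub, time_for_deep,
+                     noscrub, nodeep_scrub):
+        if must_deep_scrub:
+            return "deep-scrub"            # initiated deep scrub: always D
+        if must_scrub:
+            # initiated scrub: deep only if it is due and not forbidden
+            if time_for_deep and not nodeep_scrub:
+                return "deep-scrub"
+            return "scrub"
+        # periodic tick
+        if time_for_deep and not nodeep_scrub:
+            return "deep-scrub"
+        return "nothing" if noscrub else "scrub"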
diff --git a/src/ceph/doc/dev/osd_internals/snaps.rst b/src/ceph/doc/dev/osd_internals/snaps.rst
new file mode 100644
index 0000000..e17378f
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/snaps.rst
@@ -0,0 +1,128 @@
+======
+Snaps
+======
+
+Overview
+--------
+Rados supports two related snapshotting mechanisms:
+
+  1. *pool snaps*: snapshots are implicitly applied to all objects
+ in a pool
+ 2. *self managed snaps*: the user must provide the current *SnapContext*
+ on each write.
+
+These two are mutually exclusive; only one or the other can be used on
+a particular pool.
+
+The *SnapContext* is the set of snapshots currently defined for an object
+as well as the most recent snapshot (the *seq*) requested from the mon for
+sequencing purposes (a *SnapContext* with a newer *seq* is considered to
+be more recent).
+
+The difference between *pool snaps* and *self managed snaps* from the
+OSD's point of view lies in whether the *SnapContext* comes to the OSD
+via the client's MOSDOp or via the most recent OSDMap.
+
+See OSD::make_writeable
+
+Ondisk Structures
+-----------------
+Each object has in the pg collection a *head* object (or *snapdir*, which we
+will come to shortly) and possibly a set of *clone* objects.
+Each hobject_t has a snap field. For the *head* (the only writeable version
+of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
+snap field is set to the *seq* of the *SnapContext* at their creation.
+When the OSD services a write, it first checks whether the most recent
+*clone* is tagged with a snapid prior to the most recent snap represented
+in the *SnapContext*. If so, at least one snapshot has occurred between
+the time of the write and the time of the last clone. Therefore, prior
+to performing the mutation, the OSD creates a new clone for servicing
+reads on snaps between the snapid of the last clone and the most recent
+snapid.
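+
+In other words (a deliberately simplified sketch; see OSD::make_writeable
+for the real check, which also handles several corner cases not shown
+here)::
+
+    def needs_new_clone(snapc_seq, last_clone_snap):
+        # a snapshot newer than the most recent clone means the current
+        # head must be cloned before the write is applied
+        return last_clone_snap < snapc_seq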
+
+The *head* object contains a *SnapSet* encoded in an attribute, which tracks
+
+ 1. The full set of snaps defined for the object
+ 2. The full set of clones which currently exist
+ 3. Overlapping intervals between clones for tracking space usage
+ 4. Clone size
+
+If the *head* is deleted while there are still clones, a *snapdir* object
+is created instead to house the *SnapSet*.
+
+Additionally, the *object_info_t* on each clone includes a vector of snaps
+for which the clone is defined.
+
+Snap Removal
+------------
+To remove a snapshot, a request is made to the *Monitor* cluster to
+add the snapshot id to the list of purged snaps (or to remove it from
+the set of pool snaps in the case of *pool snaps*). In either case,
+the *PG* adds the snap to its *snap_trimq* for trimming.
+
+A clone can be removed when all of its snaps have been removed. In
+order to determine which clones might need to be removed upon snap
+removal, we maintain a mapping from snap to *hobject_t* using the
+*SnapMapper*.
+
+See PrimaryLogPG::SnapTrimmer, SnapMapper
+
+This trimming is performed asynchronously by the snap_trim_wq while the
+pg is clean and not scrubbing.
+
+ #. The next snap in PG::snap_trimq is selected for trimming
+ #. We determine the next object for trimming out of PG::snap_mapper.
+ For each object, we create a log entry and repop updating the
+ object info and the snap set (including adjusting the overlaps).
+ If the object is a clone which no longer belongs to any live snapshots,
+ it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
+ is empty.)
+ #. We also locally update our *SnapMapper* instance with the object's
+ new snaps.
+ #. The log entry containing the modification of the object also
+ contains the new set of snaps, which the replica uses to update
+ its own *SnapMapper* instance.
+ #. The primary shares the info with the replica, which persists
+ the new set of purged_snaps along with the rest of the info.
+
+
+
+Recovery
+--------
+Because the trim operations are implemented using repops and log entries,
+normal pg peering and recovery maintain the snap trimmer operations with
+the caveat that push and removal operations need to update the local
+*SnapMapper* instance. If the purged_snaps update is lost, we merely
+retrim a now empty snap.
+
+SnapMapper
+----------
+*SnapMapper* is implemented on top of map_cacher<string, bufferlist>,
+which provides an interface over a backing store such as the filesystem
+with async transactions. While transactions are incomplete, the map_cacher
+instance buffers unstable keys allowing consistent access without having
+to flush the filestore. *SnapMapper* provides two mappings:
+
+ 1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone
+ object
+ 2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot
+     as one of their snaps
+
+Assumption: there are lots of hobjects and relatively few snaps. The
+first encoding has a stringification of the object as the key and an
+encoding of the set of snaps as a value. The second mapping, because there
+might be many hobjects for a single snap, is stored as a collection of keys
+of the form stringify(snap)_stringify(object) such that stringify(snap)
+is constant length. These keys have a bufferlist encoding
+pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
+object does not involve reading all objects for any snap. Additionally,
+upon construction, the *SnapMapper* is provided with a mask for filtering
+the objects in the single SnapMapper keyspace belonging to that pg.
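+
+A rough Python rendition of that key layout (the exact string format used
+by SnapMapper differs; this only illustrates the constant-length snap
+prefix idea)::
+
+    def snap_to_object_key(snapid, hobject_str):
+        # fixed-width snap prefix keeps all entries for one snap adjacent,
+        # so they can be enumerated with a prefix scan
+        return "%016X_%s" % (snapid, hobject_str)
+
+    # all objects in snap 0x12 share this prefix:
+    prefix = "%016X_" % 0x12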
+
+Split
+-----
+The snapid_t -> hobject_t key entries are arranged such that for any pg,
+up to 8 prefixes need to be checked to determine all hobjects in a particular
+snap for a particular pg. Upon split, the prefixes to check on the parent
+are adjusted such that only the objects remaining in the pg will be visible.
+The children will immediately have the correct mapping.
diff --git a/src/ceph/doc/dev/osd_internals/watch_notify.rst b/src/ceph/doc/dev/osd_internals/watch_notify.rst
new file mode 100644
index 0000000..8c2ce09
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/watch_notify.rst
@@ -0,0 +1,81 @@
+============
+Watch Notify
+============
+
+See librados for the watch/notify interface.
+
+Overview
+--------
+The object_info (See osd/osd_types.h) tracks the set of watchers for
+a particular object persistently in the object_info_t::watchers map.
+In order to track notify progress, we also maintain some ephemeral
+structures associated with the ObjectContext.
+
+Each Watch has an associated Watch object (See osd/Watch.h). The
+ObjectContext for a watched object will have a (strong) reference
+to one Watch object per watch, and each Watch object holds a
+reference to the corresponding ObjectContext. This circular reference
+is deliberate and is broken when the Watch state is discarded on
+a new peering interval or removed upon timeout expiration or an
+unwatch operation.
+
+A watch tracks the associated connection via a strong
+ConnectionRef Watch::conn. The associated connection has a
+WatchConState stashed in the OSD::Session for tracking associated
+Watches in order to be able to notify them upon ms_handle_reset()
+(via WatchConState::reset()).
+
+Each Watch object tracks the set of currently un-acked notifies.
+start_notify() on a Watch object adds a reference to a new in-progress
+Notify to the Watch and either:
+
+* if the Watch is *connected*, sends a Notify message to the client
+* if the Watch is *unconnected*, does nothing.
+
+When the Watch becomes connected (in PrimaryLogPG::do_osd_op_effects),
+Notifies are resent to all remaining tracked Notify objects.
+
+Each Notify object tracks the set of un-notified Watchers via
+calls to complete_watcher(). Once the remaining set is empty or the
+timeout expires (cb, registered in init()) a notify completion
+is sent to the client.
+
+Watch Lifecycle
+---------------
+A watch may be in one of 5 states:
+
+1. Non existent.
+2. On disk, but not registered with an object context.
+3. Connected
+4. Disconnected, callback registered with timer
+5. Disconnected, callback in queue for scrub or is_degraded
+
+Case 2 occurs between when an OSD goes active and the ObjectContext
+for an object with watchers is loaded into memory due to an access.
+During Case 2, no state is registered for the watch. Case 2
+transitions to Case 4 in PrimaryLogPG::populate_obc_watchers() during
+PrimaryLogPG::find_object_context. Case 1 becomes case 3 via
+OSD::do_osd_op_effects due to a watch operation. Case 4,5 become case
+3 in the same way. Case 3 becomes case 4 when the connection resets
+on a watcher's session.
+
+Cases 4&5 can use some explanation. Normally, when a Watch enters Case
+4, a callback is registered with the OSDService::watch_timer to be
+called at timeout expiration. At the time that the callback is
+called, however, the pg might be in a state where it cannot write
+to the object in order to remove the watch (i.e., during a scrub
+or while the object is degraded). In that case, we use
+Watch::get_delayed_cb() to generate another Context for use from
+the callbacks_for_degraded_object and Scrubber::callbacks lists.
+In either case, Watch::unregister_cb() does the right thing
+(SafeTimer::cancel_event() is harmless for contexts not registered
+with the timer).
+
+Notify Lifecycle
+----------------
+The notify timeout is simpler: a timeout callback is registered when
+the notify is init()'d. If all watchers ack notifies before the
+timeout occurs, the timeout is canceled and the client is notified
+of the notify completion. Otherwise, the timeout fires, the Notify
+object pings each Watch via cancel_notify to remove itself, and
+sends the notify completion to the client early.
diff --git a/src/ceph/doc/dev/osd_internals/wbthrottle.rst b/src/ceph/doc/dev/osd_internals/wbthrottle.rst
new file mode 100644
index 0000000..a3ae00d
--- /dev/null
+++ b/src/ceph/doc/dev/osd_internals/wbthrottle.rst
@@ -0,0 +1,28 @@
+==================
+Writeback Throttle
+==================
+
+Previously, the filestore had a problem when handling large numbers of
+small ios. We throttle dirty data implicitly via the journal, but
+a large number of inodes can be dirtied without filling the journal
+resulting in a very long sync time when the sync finally does happen.
+The flusher was not an adequate solution to this problem since it
+forced writeback of small writes too eagerly, killing performance.
+
+WBThrottle tracks unflushed io per hobject_t and ::fsyncs in lru
+order once the start_flusher threshold is exceeded for any of
+dirty bytes, dirty ios, or dirty inodes. While any of these exceed
+the hard_limit, we block on throttle() in _do_op.
+
+See src/os/WBThrottle.h, src/osd/WBThrottle.cc
+
+To track the open FDs through the writeback process, there is now an
+fdcache to cache open fds. lfn_open now returns a cached FDRef which
+implicitly closes the fd once all references have expired.
+
+Filestore syncs have the side effect of flushing all outstanding objects
+in the wbthrottle.
+
+lfn_unlink clears the cached FDRef and wbthrottle entries for the
+unlinked object when the last link is removed and asserts that all
+outstanding FDRefs for that object are dead.
diff --git a/src/ceph/doc/dev/peering.rst b/src/ceph/doc/dev/peering.rst
new file mode 100644
index 0000000..7ee5deb
--- /dev/null
+++ b/src/ceph/doc/dev/peering.rst
@@ -0,0 +1,259 @@
+======================
+Peering
+======================
+
+Concepts
+--------
+
+*Peering*
+ the process of bringing all of the OSDs that store
+ a Placement Group (PG) into agreement about the state
+ of all of the objects (and their metadata) in that PG.
+ Note that agreeing on the state does not mean that
+ they all have the latest contents.
+
+*Acting set*
+ the ordered list of OSDs who are (or were as of some epoch)
+ responsible for a particular PG.
+
+*Up set*
+ the ordered list of OSDs responsible for a particular PG for
+ a particular epoch according to CRUSH. Normally this
+ is the same as the *acting set*, except when the *acting set* has been
+ explicitly overridden via *PG temp* in the OSDMap.
+
+*PG temp*
+ a temporary placement group acting set used while backfilling the
+  primary osd. Let's say the acting set is [0,1,2] and we are
+  active+clean. Something happens and the acting set is now [3,1,2]. osd.3 is
+  empty and can't serve reads although it is the primary. osd.3 will
+  see that and request a *PG temp* of [1,2,3] from the monitors using a
+ MOSDPGTemp message so that osd.1 temporarily becomes the
+ primary. It will select osd.3 as a backfill peer and continue to
+ serve reads and writes while osd.3 is backfilled. When backfilling
+ is complete, *PG temp* is discarded and the acting set changes back
+ to [3,1,2] and osd.3 becomes the primary.
+
+*current interval* or *past interval*
+ a sequence of OSD map epochs during which the *acting set* and *up
+  set* for a particular PG do not change
+
+*primary*
+ the (by convention first) member of the *acting set*,
+  who is responsible for coordinating peering, and is
+ the only OSD that will accept client initiated
+ writes to objects in a placement group.
+
+*replica*
+ a non-primary OSD in the *acting set* for a placement group
+ (and who has been recognized as such and *activated* by the primary).
+
+*stray*
+ an OSD who is not a member of the current *acting set*, but
+ has not yet been told that it can delete its copies of a
+ particular placement group.
+
+*recovery*
+ ensuring that copies of all of the objects in a PG
+ are on all of the OSDs in the *acting set*. Once
+ *peering* has been performed, the primary can start
+ accepting write operations, and *recovery* can proceed
+ in the background.
+
+*PG info*
+  basic metadata about the PG's creation epoch, the version
+  for the most recent write to the PG, *last epoch started*, *last
+  epoch clean*, and the beginning of the *current interval*.  Any
+  inter-OSD communication about PGs includes the *PG info*, such that
+  any OSD that knows a PG exists (or once existed) also has a lower
+  bound on *last epoch clean* or *last epoch started*.
+
+*PG log*
+ a list of recent updates made to objects in a PG.
+ Note that these logs can be truncated after all OSDs
+ in the *acting set* have acknowledged up to a certain
+ point.
+
+*missing set*
+  Each OSD notes update log entries and, if they imply updates to
+ the contents of an object, adds that object to a list of needed
+ updates. This list is called the *missing set* for that <OSD,PG>.
+
+*Authoritative History*
+ a complete, and fully ordered set of operations that, if
+ performed, would bring an OSD's copy of a Placement Group
+ up to date.
+
+*epoch*
+ a (monotonically increasing) OSD map version number
+
+*last epoch started*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group agreed on an
+ *authoritative history*. At this point, *peering* is
+ deemed to have been successful.
+
+*up_thru*
+ before a primary can successfully complete the *peering* process,
+  it must inform a monitor that it is alive through the current
+ OSD map epoch by having the monitor set its *up_thru* in the osd
+ map. This helps peering ignore previous *acting sets* for which
+ peering never completed after certain sequences of failures, such as
+ the second interval below:
+
+ - *acting set* = [A,B]
+ - *acting set* = [A]
+ - *acting set* = [] very shortly after (e.g., simultaneous failure, but staggered detection)
+ - *acting set* = [B] (B restarts, A does not)
+
+*last epoch clean*
+ the last epoch at which all nodes in the *acting set*
+ for a particular placement group were completely
+ up to date (both PG logs and object contents).
+ At this point, *recovery* is deemed to have been
+ completed.
+
+Description of the Peering Process
+----------------------------------
+
+The *Golden Rule* is that no write operation to any PG
+is acknowledged to a client until it has been persisted
+by all members of the *acting set* for that PG. This means
+that if we can communicate with at least one member of
+each *acting set* since the last successful *peering*, someone
+will have a record of every (acknowledged) operation
+since the last successful *peering*.
+This means that it should be possible for the current
+primary to construct and disseminate a new *authoritative history*.
+
+It is also important to appreciate the role of the OSD map
+(list of all known OSDs and their states, as well as some
+information about the placement groups) in the *peering*
+process:
+
+ When OSDs go up or down (or get added or removed)
+    this has the potential to affect the *acting sets*
+ of many placement groups.
+
+ Before a primary successfully completes the *peering*
+ process, the OSD map must reflect that the OSD was alive
+ and well as of the first epoch in the *current interval*.
+
+ Changes can only be made after successful *peering*.
+
+Thus, a new primary can use the latest OSD map along with a recent
+history of past maps to generate a set of *past intervals* to
+determine which OSDs must be consulted before we can successfully
+*peer*. The set of past intervals is bounded by *last epoch started*,
+the most recent *past interval* for which we know *peering* completed.
+The process by which an OSD discovers a PG exists in the first place is
+by exchanging *PG info* messages, so the OSD always has some lower
+bound on *last epoch started*.
+
+The high level process is for the current PG primary to:
+
+  1. get a recent OSD map (to identify the members of all the
+ interesting *acting sets*, and confirm that we are still the
+ primary).
+
+ #. generate a list of *past intervals* since *last epoch started*.
+ Consider the subset of those for which *up_thru* was greater than
+ the first interval epoch by the last interval epoch's OSD map; that is,
+ the subset for which *peering* could have completed before the *acting
+ set* changed to another set of OSDs.
+
+ Successful *peering* will require that we be able to contact at
+ least one OSD from each of *past interval*'s *acting set*.
+
+ #. ask every node in that list for its *PG info*, which includes the most
+ recent write made to the PG, and a value for *last epoch started*. If
+ we learn about a *last epoch started* that is newer than our own, we can
+ prune older *past intervals* and reduce the peer OSDs we need to contact.
+
+ #. if anyone else has (in its PG log) operations that I do not have,
+ instruct them to send me the missing log entries so that the primary's
+     *PG log* is up to date (includes the newest write).
+
+ #. for each member of the current *acting set*:
+
+      a. ask it for copies of all PG log entries since *last epoch started*
+ so that I can verify that they agree with mine (or know what
+ objects I will be telling it to delete).
+
+ If the cluster failed before an operation was persisted by all
+ members of the *acting set*, and the subsequent *peering* did not
+ remember that operation, and a node that did remember that
+ operation later rejoined, its logs would record a different
+ (divergent) history than the *authoritative history* that was
+ reconstructed in the *peering* after the failure.
+
+ Since the *divergent* events were not recorded in other logs
+ from that *acting set*, they were not acknowledged to the client,
+ and there is no harm in discarding them (so that all OSDs agree
+ on the *authoritative history*). But, we will have to instruct
+ any OSD that stores data from a divergent update to delete the
+ affected (and now deemed to be apocryphal) objects.
+
+ #. ask it for its *missing set* (object updates recorded
+ in its PG log, but for which it does not have the new data).
+ This is the list of objects that must be fully replicated
+ before we can accept writes.
+
+ #. at this point, the primary's PG log contains an *authoritative history* of
+ the placement group, and the OSD now has sufficient
+ information to bring any other OSD in the *acting set* up to date.
+
+ #. if the primary's *up_thru* value in the current OSD map is not greater than
+ or equal to the first epoch in the *current interval*, send a request to the
+     monitor to update it, and wait until we receive an updated OSD map that reflects
+ the change.
+
+ #. for each member of the current *acting set*:
+
+ a. send them log updates to bring their PG logs into agreement with
+ my own (*authoritative history*) ... which may involve deciding
+ to delete divergent objects.
+
+ #. await acknowledgment that they have persisted the PG log entries.
+
+ #. at this point all OSDs in the *acting set* agree on all of the meta-data,
+ and would (in any future *peering*) return identical accounts of all
+ updates.
+
+ a. start accepting client write operations (because we have unanimous
+ agreement on the state of the objects into which those updates are
+ being accepted). Note, however, that if a client tries to write to an
+ object it will be promoted to the front of the recovery queue, and the
+     write will be applied after it is fully replicated to the current *acting set*.
+
+ #. update the *last epoch started* value in our local *PG info*, and instruct
+     other *acting set* OSDs to do the same.
+
+ #. start pulling object data updates that other OSDs have, but I do not. We may
+ need to query OSDs from additional *past intervals* prior to *last epoch started*
+ (the last time *peering* completed) and following *last epoch clean* (the last epoch that
+ recovery completed) in order to find copies of all objects.
+
+ #. start pushing object data updates to other OSDs that do not yet have them.
+
+ We push these updates from the primary (rather than having the replicas
+ pull them) because this allows the primary to ensure that a replica has
+ the current contents before sending it an update write. It also makes
+ it possible for a single read (from the primary) to be used to write
+ the data to multiple replicas. If each replica did its own pulls,
+ the data might have to be read multiple times.
+
+  #. once all replicas store copies of all objects (that
+ existed prior to the start of this epoch) we can update *last
+ epoch clean* in the *PG info*, and we can dismiss all of the
+ *stray* replicas, allowing them to delete their copies of objects
+ for which they are no longer in the *acting set*.
+
+ We could not dismiss the *strays* prior to this because it was possible
+ that one of those *strays* might hold the sole surviving copy of an
+ old object (all of whose copies disappeared before they could be
+ replicated on members of the current *acting set*).
+
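+Condensed into pseudocode (a greatly simplified outline of the list above;
+the helper names are placeholders rather than real methods of the PG
+recovery state machine)::
+
+    def peer(pg):
+        pg.get_recent_osdmap()
+        intervals = pg.past_intervals(since=pg.last_epoch_started)
+        infos = [pg.get_info(osd) for osd in pg.osds_to_probe(intervals)]
+        pg.build_authoritative_history(infos)
+        pg.collect_missing_sets()
+        pg.ensure_up_thru()                  # monitor records we are alive
+        pg.push_log_updates_to_acting_set()
+        pg.wait_for_log_acks()
+        pg.update_last_epoch_started()       # peering done; accept writes
+        pg.recover_and_backfill_in_background()
+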
+State Model
+-----------
+
+.. graphviz:: peering_graph.generated.dot
diff --git a/src/ceph/doc/dev/perf.rst b/src/ceph/doc/dev/perf.rst
new file mode 100644
index 0000000..5e42255
--- /dev/null
+++ b/src/ceph/doc/dev/perf.rst
@@ -0,0 +1,39 @@
+Using perf
+==========
+
+Top::
+
+ sudo perf top -p `pidof ceph-osd`
+
+To capture some data with call graphs::
+
+ sudo perf record -p `pidof ceph-osd` -F 99 --call-graph dwarf -- sleep 60
+
+To view by caller (where you can see what each top function calls)::
+
+ sudo perf report --call-graph caller
+
+To view by callee (where you can see who calls each top function)::
+
+ sudo perf report --call-graph callee
+
+:note: If the caller/callee views look the same you may be
+ suffering from a kernel bug; upgrade to 4.8 or later.
+
+Flamegraphs
+-----------
+
+First, get things set up::
+
+ cd ~/src
+ git clone https://github.com/brendangregg/FlameGraph
+
+Run ceph, then record some perf data::
+
+ sudo perf record -p `pidof ceph-osd` -F 99 --call-graph dwarf -- sleep 60
+
+Then generate the flamegraph::
+
+ sudo perf script | ~/src/FlameGraph/stackcollapse-perf.pl > /tmp/folded
+ ~/src/FlameGraph/flamegraph.pl /tmp/folded > /tmp/perf.svg
+ firefox /tmp/perf.svg
diff --git a/src/ceph/doc/dev/perf_counters.rst b/src/ceph/doc/dev/perf_counters.rst
new file mode 100644
index 0000000..2f49f77
--- /dev/null
+++ b/src/ceph/doc/dev/perf_counters.rst
@@ -0,0 +1,198 @@
+===============
+ Perf counters
+===============
+
+The perf counters provide generic internal infrastructure for gauges and counters. The counted values can be either integer or float. There is also an "average" type (normally float) that combines a sum and a count which can be divided to provide an average.
+
+The intention is that this data will be collected and aggregated by a tool like ``collectd`` or ``statsd`` and fed into a tool like ``graphite`` for graphing and analysis. Also, note the :doc:`../mgr/prometheus`.
+
+Access
+------
+
+The perf counter data is accessed via the admin socket. For example::
+
+ ceph daemon osd.0 perf schema
+ ceph daemon osd.0 perf dump
+
+
+Collections
+-----------
+
+The values are grouped into named collections, normally representing a subsystem or an instance of a subsystem. For example, the internal ``throttle`` mechanism reports statistics on how it is throttling, and each instance is named something like::
+
+
+ throttle-msgr_dispatch_throttler-hbserver
+ throttle-msgr_dispatch_throttler-client
+ throttle-filestore_bytes
+ ...
+
+
+Schema
+------
+
+The ``perf schema`` command dumps a json description of which values are available, and what their type is. Each named value has a ``type`` bitfield, with the following bits defined.
+
++------+-------------------------------------+
+| bit | meaning |
++======+=====================================+
+| 1 | floating point value |
++------+-------------------------------------+
+| 2 | unsigned 64-bit integer value |
++------+-------------------------------------+
+| 4    | average (sum + count pair)          |
++------+-------------------------------------+
+| 8 | counter (vs gauge) |
++------+-------------------------------------+
+
+Every value will have either bit 1 or 2 set to indicate the type
+(float or integer).
+
+If bit 8 is set (counter), the value is monotonically increasing and
+the reader may want to subtract off the previously read value to get
+the delta during the previous interval.
+
+If bit 4 is set (average), there will be two values to read, a sum and
+a count. If it is a counter, the average for the previous interval
+would be sum delta (since the previous read) divided by the count
+delta. Alternatively, dividing the values outright would provide the
+lifetime average value. Normally these are used to measure latencies
+(number of requests and a sum of request latencies), and the average
+for the previous interval is what is interesting.
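+
+For example, a collector sampling ``perf dump`` periodically might compute
+the interval average like this (the helper is not part of Ceph; it assumes
+the ``{"avgcount": N, "sum": S}`` layout shown in the dump below)::
+
+    def interval_average(prev, curr):
+        dcount = curr["avgcount"] - prev["avgcount"]
+        dsum = curr["sum"] - prev["sum"]
+        # average latency over the interval between the two samples
+        return float(dsum) / dcount if dcount else 0.0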
+
+Instead of interpreting the bit fields, the ``metric type`` has a
+value of either ``gauge`` or ``counter``, and the ``value type``
+property will be one of ``real``, ``integer``, ``real-integer-pair``
+(for a sum + real count pair), or ``integer-integer-pair`` (for a
+sum + integer count pair).
+
+Here is an example of the schema output::
+
+ {
+ "throttle-bluestore_throttle_bytes": {
+ "val": {
+ "type": 2,
+ "metric_type": "gauge",
+ "value_type": "integer",
+ "description": "Currently available throttle",
+ "nick": ""
+ },
+ "max": {
+ "type": 2,
+ "metric_type": "gauge",
+ "value_type": "integer",
+ "description": "Max value for throttle",
+ "nick": ""
+ },
+ "get_started": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Number of get calls, increased before wait",
+ "nick": ""
+ },
+ "get": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Gets",
+ "nick": ""
+ },
+ "get_sum": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Got data",
+ "nick": ""
+ },
+ "get_or_fail_fail": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Get blocked during get_or_fail",
+ "nick": ""
+ },
+ "get_or_fail_success": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Successful get during get_or_fail",
+ "nick": ""
+ },
+ "take": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Takes",
+ "nick": ""
+ },
+ "take_sum": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Taken data",
+ "nick": ""
+ },
+ "put": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Puts",
+ "nick": ""
+ },
+ "put_sum": {
+ "type": 10,
+ "metric_type": "counter",
+ "value_type": "integer",
+ "description": "Put data",
+ "nick": ""
+ },
+ "wait": {
+ "type": 5,
+ "metric_type": "gauge",
+ "value_type": "real-integer-pair",
+ "description": "Waiting latency",
+ "nick": ""
+ }
+ }
+
+
+Dump
+----
+
+The actual dump is similar to the schema, except that average values are grouped. For example::
+
+ {
+ "throttle-msgr_dispatch_throttler-hbserver" : {
+ "get_or_fail_fail" : 0,
+ "get_sum" : 0,
+ "max" : 104857600,
+ "put" : 0,
+ "val" : 0,
+ "take" : 0,
+ "get_or_fail_success" : 0,
+ "wait" : {
+ "avgcount" : 0,
+ "sum" : 0
+ },
+ "get" : 0,
+ "take_sum" : 0,
+ "put_sum" : 0
+ },
+ "throttle-msgr_dispatch_throttler-client" : {
+ "get_or_fail_fail" : 0,
+ "get_sum" : 82760,
+ "max" : 104857600,
+ "put" : 2637,
+ "val" : 0,
+ "take" : 0,
+ "get_or_fail_success" : 0,
+ "wait" : {
+ "avgcount" : 0,
+ "sum" : 0
+ },
+ "get" : 2637,
+ "take_sum" : 0,
+ "put_sum" : 82760
+ }
+ }
+
diff --git a/src/ceph/doc/dev/perf_histograms.rst b/src/ceph/doc/dev/perf_histograms.rst
new file mode 100644
index 0000000..c277ac2
--- /dev/null
+++ b/src/ceph/doc/dev/perf_histograms.rst
@@ -0,0 +1,677 @@
+=================
+ Perf histograms
+=================
+
+The perf histograms build on the perf counters infrastructure. Histograms are built for a number of counters and simplify gathering data on which groups of counter values occur most often over time.
+Perf histograms are currently unsigned 64-bit integer counters, so they're mostly useful for time and sizes. Data dumped by perf histogram can then be fed into other analysis tools/scripts.
+
+Access
+------
+
+The perf histogram data are accessed via the admin socket. For example::
+
+ ceph daemon osd.0 perf histogram schema
+ ceph daemon osd.0 perf histogram dump
+
+
+Collections
+-----------
+
+The histograms are grouped into named collections, normally representing a subsystem or an instance of a subsystem. For example, the internal ``throttle`` mechanism reports statistics on how it is throttling, and each instance is named something like::
+
+
+ op_r_latency_out_bytes_histogram
+ op_rw_latency_in_bytes_histogram
+ op_rw_latency_out_bytes_histogram
+ ...
+
+
+Schema
+------
+
+The ``perf histogram schema`` command dumps a json description of which values are available, and what their type is. Each named value has a ``type`` bitfield, with the 5th bit always set and the following bits defined.
+
++------+-------------------------------------+
+| bit | meaning |
++======+=====================================+
+| 1 | floating point value |
++------+-------------------------------------+
+| 2 | unsigned 64-bit integer value |
++------+-------------------------------------+
+| 4 | average (sum + count pair) |
++------+-------------------------------------+
+| 8 | counter (vs gauge) |
++------+-------------------------------------+
+
+In other words, histogram of type "18" is a histogram of unsigned 64-bit integer values (16 + 2).
+
+Here is an example of the schema output::
+
+ {
+ "AsyncMessenger::Worker-0": {},
+ "AsyncMessenger::Worker-1": {},
+ "AsyncMessenger::Worker-2": {},
+ "mutex-WBThrottle::lock": {},
+ "objecter": {},
+ "osd": {
+ "op_r_latency_out_bytes_histogram": {
+ "type": 18,
+ "description": "Histogram of operation latency (including queue time) + da ta read",
+ "nick": ""
+ },
+ "op_w_latency_in_bytes_histogram": {
+ "type": 18,
+ "description": "Histogram of operation latency (including queue time) + da ta written",
+ "nick": ""
+ },
+ "op_rw_latency_in_bytes_histogram": {
+ "type": 18,
+ "description": "Histogram of rw operation latency (including queue time) + data written",
+ "nick": ""
+ },
+ "op_rw_latency_out_bytes_histogram": {
+ "type": 18,
+ "description": "Histogram of rw operation latency (including queue time) + data read",
+ "nick": ""
+ }
+ }
+ }
+
+
+Dump
+----
+
+The actual dump is similar to the schema, except that there are actual value groups. For example::
+
+ "osd": {
+ "op_r_latency_out_bytes_histogram": {
+ "axes": [
+ {
+ "name": "Latency (usec)",
+ "min": 0,
+ "quant_size": 100000,
+ "buckets": 32,
+ "scale_type": "log2",
+ "ranges": [
+ {
+ "max": -1
+ },
+ {
+ "min": 0,
+ "max": 99999
+ },
+ {
+ "min": 100000,
+ "max": 199999
+ },
+ {
+ "min": 200000,
+ "max": 399999
+ },
+ {
+ "min": 400000,
+ "max": 799999
+ },
+ {
+ "min": 800000,
+ "max": 1599999
+ },
+ {
+ "min": 1600000,
+ "max": 3199999
+ },
+ {
+ "min": 3200000,
+ "max": 6399999
+ },
+ {
+ "min": 6400000,
+ "max": 12799999
+ },
+ {
+ "min": 12800000,
+ "max": 25599999
+ },
+ {
+ "min": 25600000,
+ "max": 51199999
+ },
+ {
+ "min": 51200000,
+ "max": 102399999
+ },
+ {
+ "min": 102400000,
+ "max": 204799999
+ },
+ {
+ "min": 204800000,
+ "max": 409599999
+ },
+ {
+ "min": 409600000,
+ "max": 819199999
+ },
+ {
+ "min": 819200000,
+ "max": 1638399999
+ },
+ {
+ "min": 1638400000,
+ "max": 3276799999
+ },
+ {
+ "min": 3276800000,
+ "max": 6553599999
+ },
+ {
+ "min": 6553600000,
+ "max": 13107199999
+ },
+ {
+ "min": 13107200000,
+ "max": 26214399999
+ },
+ {
+ "min": 26214400000,
+ "max": 52428799999
+ },
+ {
+ "min": 52428800000,
+ "max": 104857599999
+ },
+ {
+ "min": 104857600000,
+ "max": 209715199999
+ },
+ {
+ "min": 209715200000,
+ "max": 419430399999
+ },
+ {
+ "min": 419430400000,
+ "max": 838860799999
+ },
+ {
+ "min": 838860800000,
+ "max": 1677721599999
+ },
+ {
+ "min": 1677721600000,
+ "max": 3355443199999
+ },
+ {
+ "min": 3355443200000,
+ "max": 6710886399999
+ },
+ {
+ "min": 6710886400000,
+ "max": 13421772799999
+ },
+ {
+ "min": 13421772800000,
+ "max": 26843545599999
+ },
+ {
+ "min": 26843545600000,
+ "max": 53687091199999
+ },
+ {
+ "min": 53687091200000
+ }
+ ]
+ },
+ {
+ "name": "Request size (bytes)",
+ "min": 0,
+ "quant_size": 512,
+ "buckets": 32,
+ "scale_type": "log2",
+ "ranges": [
+ {
+ "max": -1
+ },
+ {
+ "min": 0,
+ "max": 511
+ },
+ {
+ "min": 512,
+ "max": 1023
+ },
+ {
+ "min": 1024,
+ "max": 2047
+ },
+ {
+ "min": 2048,
+ "max": 4095
+ },
+ {
+ "min": 4096,
+ "max": 8191
+ },
+ {
+ "min": 8192,
+ "max": 16383
+ },
+ {
+ "min": 16384,
+ "max": 32767
+ },
+ {
+ "min": 32768,
+ "max": 65535
+ },
+ {
+ "min": 65536,
+ "max": 131071
+ },
+ {
+ "min": 131072,
+ "max": 262143
+ },
+ {
+ "min": 262144,
+ "max": 524287
+ },
+ {
+ "min": 524288,
+ "max": 1048575
+ },
+ {
+ "min": 1048576,
+ "max": 2097151
+ },
+ {
+ "min": 2097152,
+ "max": 4194303
+ },
+ {
+ "min": 4194304,
+ "max": 8388607
+ },
+ {
+ "min": 8388608,
+ "max": 16777215
+ },
+ {
+ "min": 16777216,
+ "max": 33554431
+ },
+ {
+ "min": 33554432,
+ "max": 67108863
+ },
+ {
+ "min": 67108864,
+ "max": 134217727
+ },
+ {
+ "min": 134217728,
+ "max": 268435455
+ },
+ {
+ "min": 268435456,
+ "max": 536870911
+ },
+ {
+ "min": 536870912,
+ "max": 1073741823
+ },
+ {
+ "min": 1073741824,
+ "max": 2147483647
+ },
+ {
+ "min": 2147483648,
+ "max": 4294967295
+ },
+ {
+ "min": 4294967296,
+ "max": 8589934591
+ },
+ {
+ "min": 8589934592,
+ "max": 17179869183
+ },
+ {
+ "min": 17179869184,
+ "max": 34359738367
+ },
+ {
+ "min": 34359738368,
+ "max": 68719476735
+ },
+ {
+ "min": 68719476736,
+ "max": 137438953471
+ },
+ {
+ "min": 137438953472,
+ "max": 274877906943
+ },
+ {
+ "min": 274877906944
+ }
+ ]
+ }
+ ],
+ "values": [
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ],
+ [
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0,
+ 0
+ ]
+ ]
+ }
+ },
+
+This represents the 2D histogram, consisting of 9 history entries with 32 value groups per history entry.
+The "ranges" element gives the value bounds for each value group, "buckets" is the number of value groups,
+"min" is the minimum accepted value, "quant_size" is the quantization unit and "scale_type" is either "log2" (logarithmic
+scale) or "linear" (linear scale).
+You can use the histogram_dump.py tool (see src/tools/histogram_dump.py) for quick visualisation of existing histogram
+data.
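+
+For ad-hoc inspection you can also load the dump directly (this snippet
+assumes the JSON layout shown above, including the ``osd`` section and
+counter name; it is not part of Ceph)::
+
+    import json, subprocess
+
+    raw = subprocess.check_output(
+        ["ceph", "daemon", "osd.0", "perf", "histogram", "dump"])
+    hist = json.loads(raw)["osd"]["op_r_latency_out_bytes_histogram"]
+
+    # print each axis name, its bucket count and scale, then total samples
+    for axis in hist["axes"]:
+        print(axis["name"], axis["buckets"], "buckets,", axis["scale_type"])
+
+    print("total samples:", sum(sum(row) for row in hist["values"]))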
diff --git a/src/ceph/doc/dev/placement-group.rst b/src/ceph/doc/dev/placement-group.rst
new file mode 100644
index 0000000..3c067ea
--- /dev/null
+++ b/src/ceph/doc/dev/placement-group.rst
@@ -0,0 +1,151 @@
+============================
+ PG (Placement Group) notes
+============================
+
+Miscellaneous copy-pastes from emails, when this gets cleaned up it
+should move out of /dev.
+
+Overview
+========
+
+PG = "placement group". When placing data in the cluster, objects are
+mapped into PGs, and those PGs are mapped onto OSDs. We use the
+indirection so that we can group objects, which reduces the amount of
+per-object metadata we need to keep track of and processes we need to
+run (it would be prohibitively expensive to track eg the placement
+history on a per-object basis). Increasing the number of PGs can
+reduce the variance in per-OSD load across your cluster, but each PG
+requires a bit more CPU and memory on the OSDs that are storing it. We
+try and ballpark it at 100 PGs/OSD, although it can vary widely
+without ill effects depending on your cluster. You hit a bug in how we
+calculate the initial PG number from a cluster description.
+
+There are a couple of different categories of PGs; the 6 that exist
+(in the original emailer's ``ceph -s`` output) are "local" PGs which
+are tied to a specific OSD. However, those aren't actually used in a
+standard Ceph configuration.
+
+
+Mapping algorithm (simplified)
+==============================
+
+| > How does the Object->PG mapping look like, do you map more than one object on
+| > one PG, or do you sometimes map an object to more than one PG? How about the
+| > mapping of PGs to OSDs, does one PG belong to exactly one OSD?
+| >
+| > Does one PG represent a fixed amount of storage space?
+
+Many objects map to one PG.
+
+Each object maps to exactly one PG.
+
+One PG maps to a single list of OSDs, where the first one in the list
+is the primary and the rest are replicas.
+
+Many PGs can map to one OSD.
+
+A PG represents nothing but a grouping of objects; you configure the
+number of PGs you want (number of OSDs * 100 is a good starting point),
+and all of your stored objects are pseudo-randomly evenly distributed
+to the PGs. So a PG explicitly does NOT represent a fixed amount of
+storage; it represents 1/pg_num'th of the storage you happen to have
+on your OSDs.
+
+Ignoring the finer points of CRUSH and custom placement, it goes
+something like this in pseudocode::
+
+ locator = object_name
+ obj_hash = hash(locator)
+ pg = obj_hash % num_pg
+    osds_for_pg = crush(pg)  # returns a list of OSDs
+ primary = osds_for_pg[0]
+ replicas = osds_for_pg[1:]
+
+If you want to understand the crush() part in the above, imagine a
+perfectly spherical datacenter in a vacuum ;) that is, if all OSDs
+have weight 1.0, and there is no topology to the data center (all OSDs
+are on the top level), and you use defaults, etc, it simplifies to
+consistent hashing; you can think of it as::
+
+  def crush(pg):
+     all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
+     result = []
+     # size is the number of copies; primary+replicas
+     attempt = 0
+     while len(result) < size:
+       # vary the hash input per attempt so a collision can be retried
+       r = hash((pg, attempt))
+       attempt += 1
+       chosen = all_osds[ r % len(all_osds) ]
+       if chosen in result:
+         # OSD can be picked only once
+         continue
+       result.append(chosen)
+     return result
+
+User-visible PG States
+======================
+
+.. todo:: diagram of states and how they can overlap
+
+*creating*
+ the PG is still being created
+
+*active*
+ requests to the PG will be processed
+
+*clean*
+ all objects in the PG are replicated the correct number of times
+
+*down*
+ a replica with necessary data is down, so the pg is offline
+
+*replay*
+ the PG is waiting for clients to replay operations after an OSD crashed
+
+*splitting*
+ the PG is being split into multiple PGs (not functional as of 2012-02)
+
+*scrubbing*
+ the PG is being checked for inconsistencies
+
+*degraded*
+ some objects in the PG are not replicated enough times yet
+
+*inconsistent*
+ replicas of the PG are not consistent (e.g. objects are
+ the wrong size, objects are missing from one replica *after* recovery
+ finished, etc.)
+
+*peering*
+ the PG is undergoing the :doc:`/dev/peering` process
+
+*repair*
+ the PG is being checked and any inconsistencies found will be repaired (if possible)
+
+*recovering*
+ objects are being migrated/synchronized with replicas
+
+*recovery_wait*
+ the PG is waiting for the local/remote recovery reservations
+
+*backfilling*
+ a special case of recovery, in which the entire contents of
+ the PG are scanned and synchronized, instead of inferring what
+ needs to be transferred from the PG logs of recent operations
+
+*backfill_wait*
+ the PG is waiting in line to start backfill
+
+*backfill_toofull*
+ backfill reservation rejected, OSD too full
+
+*incomplete*
+ a pg is missing a necessary period of history from its
+ log. If you see this state, report a bug, and try to start any
+ failed OSDs that may contain the needed information.
+
+*stale*
+ the PG is in an unknown state - the monitors have not received
+ an update for it since the PG mapping changed.
+
+*remapped*
+ the PG is temporarily mapped to a different set of OSDs from what
+ CRUSH specified
diff --git a/src/ceph/doc/dev/quick_guide.rst b/src/ceph/doc/dev/quick_guide.rst
new file mode 100644
index 0000000..d9c6a53
--- /dev/null
+++ b/src/ceph/doc/dev/quick_guide.rst
@@ -0,0 +1,131 @@
+=================================
+ Developer Guide (Quick)
+=================================
+
+This guide will describe how to build and test Ceph for development.
+
+Development
+-----------
+
+The ``run-make-check.sh`` script will install Ceph dependencies,
+compile everything in debug mode and run a number of tests to verify
+the result behaves as expected.
+
+.. code::
+
+ $ ./run-make-check.sh
+
+
+Running a development deployment
+--------------------------------
+Ceph contains a script called ``vstart.sh`` (see also
+:doc:`/dev/dev_cluster_deployement`) which allows developers to quickly
+test their code using a simple deployment on their development system.
+Once the build finishes successfully, start the Ceph deployment using
+the following commands:
+
+.. code::
+
+ $ cd ceph/build # Assuming this is where you ran cmake
+ $ make vstart
+ $ ../src/vstart.sh -d -n -x
+
+You can also configure ``vstart.sh`` to use only one monitor and one metadata server by using the following:
+
+.. code::
+
+ $ MON=1 MDS=1 ../src/vstart.sh -d -n -x
+
+The system creates two pools on startup: `cephfs_data_a` and `cephfs_metadata_a`. Let's get some stats on
+the current pools:
+
+.. code::
+
+ $ bin/ceph osd pool stats
+ *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
+ pool cephfs_data_a id 1
+ nothing is going on
+
+ pool cephfs_metadata_a id 2
+ nothing is going on
+
+ $ bin/ceph osd pool stats cephfs_data_a
+ *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
+ pool cephfs_data_a id 1
+ nothing is going on
+
+ $ bin/rados df
+ POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
+ cephfs_data_a 0 0 0 0 0 0 0 0 0 0 0
+ cephfs_metadata_a 2246 21 0 63 0 0 0 0 0 42 8192
+
+ total_objects 21
+ total_used 244G
+ total_space 1180G
+
+
+Make a pool and run some benchmarks against it:
+
+.. code::
+
+ $ bin/rados mkpool mypool
+ $ bin/rados -p mypool bench 10 write -b 123
+
+Place a file into the new pool:
+
+.. code::
+
+ $ bin/rados -p mypool put objectone <somefile>
+ $ bin/rados -p mypool put objecttwo <anotherfile>
+
+List the objects in the pool:
+
+.. code::
+
+ $ bin/rados -p mypool ls
+
+Once you are done, type the following to stop the development Ceph deployment:
+
+.. code::
+
+ $ ../src/stop.sh
+
+Resetting your vstart environment
+---------------------------------
+
+The vstart script creates out/ and dev/ directories which contain
+the cluster's state. If you want to quickly reset your environment,
+you might do something like this:
+
+.. code::
+
+ [build]$ ../src/stop.sh
+ [build]$ rm -rf out dev
+ [build]$ MDS=1 MON=1 OSD=3 ../src/vstart.sh -n -d
+
+Running a RadosGW development environment
+-----------------------------------------
+
+Set the ``RGW`` environment variable when running vstart.sh to enable the RadosGW.
+
+.. code::
+
+ $ cd build
+ $ RGW=1 ../src/vstart.sh -d -n -x
+
+You can now use the swift python client to communicate with the RadosGW.
+
+.. code::
+
+ $ swift -A http://localhost:8000/auth -U test:tester -K testing list
+ $ swift -A http://localhost:8000/auth -U test:tester -K testing upload mycontainer ceph
+ $ swift -A http://localhost:8000/auth -U test:tester -K testing list
+
+
+Run unit tests
+--------------
+
+The tests are located in ``src/test``. To run them, type:
+
+.. code::
+
+ $ make check
+
diff --git a/src/ceph/doc/dev/rados-client-protocol.rst b/src/ceph/doc/dev/rados-client-protocol.rst
new file mode 100644
index 0000000..b05e753
--- /dev/null
+++ b/src/ceph/doc/dev/rados-client-protocol.rst
@@ -0,0 +1,117 @@
+RADOS client protocol
+=====================
+
+This is very incomplete, but one must start somewhere.
+
+Basics
+------
+
+Requests are MOSDOp messages. Replies are MOSDOpReply messages.
+
+An object request is targeted at an hobject_t, which includes a pool,
+hash value, object name, placement key (usually empty), and snapid.
+
+The hash value is a 32-bit hash value, normally generated by hashing
+the object name. The hobject_t can be arbitrarily constructed,
+though, with any hash value and name. Note that in the MOSDOp these
+components are spread across several fields and not logically
+assembled in an actual hobject_t member (mainly for historical reasons).
+
+A request can also target a PG. In this case, the *ps* value matches
+a specific PG, the object name is empty, and (hopefully) the ops in
+the request are PG ops.
+
+Either way, the request ultimately targets a PG, either by using the
+explicit pgid or by folding the hash value onto the current number of
+pgs in the pool. The client sends the request to the primary for the
+associated PG.
+
+Each request is assigned a unique tid.
+
+Resends
+-------
+
+If there is a connection drop, the client will resend any outstanding
+requests.
+
+Any time there is a PG mapping change such that the primary changes,
+the client is responsible for resending the request. Note that
+although there may be an interval change from the OSD's perspective
+(triggering PG peering), if the primary doesn't change then the client
+need not resend.
+
+There are a few exceptions to this rule:
+
+ * There is a last_force_op_resend field in the pg_pool_t in the
+ OSDMap. If this changes, then the clients are forced to resend any
+ outstanding requests. (This happens when tiering is adjusted, for
+ example.)
+ * Some requests are such that they are resent on *any* PG interval
+ change, as defined by pg_interval_t's is_new_interval() (the same
+ criteria used by peering in the OSD).
+ * If the PAUSE OSDMap flag is set and unset.
+
+Each time a request is sent to the OSD the *attempt* field is incremented. The
+first time it is 0, the next 1, etc.
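+
+As a rough summary, the resend rules above could be expressed like the
+following sketch. The ``op``, ``old_map``/``new_map`` structures and the
+``primary_of``/``is_new_interval`` helpers are illustrative placeholders,
+not the actual client (Objecter) code::
+
+    def should_resend(op, old_map, new_map):
+        # 1. the primary for the op's PG changed
+        if primary_of(new_map, op.pgid) != primary_of(old_map, op.pgid):
+            return True
+        # 2. the pool forced a resend (e.g. tiering was adjusted)
+        old_pool = old_map.pools[op.pgid.pool]
+        new_pool = new_map.pools[op.pgid.pool]
+        if new_pool.last_force_op_resend != old_pool.last_force_op_resend:
+            return True
+        # 3. some ops are resent on *any* interval change
+        if op.resend_on_interval_change and is_new_interval(old_map, new_map, op.pgid):
+            return True
+        # 4. the PAUSE flag was set and then cleared again
+        if old_map.paused and not new_map.paused:
+            return True
+        return False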
+
+Backoff
+-------
+
+Ordinarily the OSD will simply queue any requests it can't immediately
+process in memory until such time as it can. This can become
+problematic because the OSD limits the total amount of RAM consumed by
+incoming messages: if either of the thresholds for the number of
+messages or the number of bytes is reached, new messages will not be
+read off the network socket, causing backpressure through the network.
+
+In some cases, though, the OSD knows or expects that a PG or object
+will be unavailable for some time and does not want to consume memory
+by queuing requests. In these cases it can send a MOSDBackoff message
+to the client.
+
+A backoff request has four properties:
+
+#. the op code (block, unblock, or ack-block)
+#. *id*, a unique id assigned within this session
+#. hobject_t begin
+#. hobject_t end
+
+There are two types of backoff: a *PG* backoff will plug all requests
+targeting an entire PG at the client, as described by a range of the
+hash/hobject_t space [begin,end), while an *object* backoff will plug
+all requests targeting a single object (begin == end).
+
+When the client receives a *block* backoff message, it is now
+responsible for *not* sending any requests for hobject_ts described by
+the backoff. The backoff remains in effect until the backoff is
+cleared (via an 'unblock' message) or the OSD session is closed. An
+*ack-block* message is sent back to the OSD immediately to acknowledge
+receipt of the backoff.
+
+When an unblock is received, it will reference a specific id that the
+client previously blocked. However, the range described by the unblock
+may be smaller than the original range, as the PG may have split on
+the OSD. The unblock should *only* unblock the range specified in the
+unblock message. Any requests that fall within the unblocked range are
+reexamined and, if no other installed backoff applies, resent.
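+
+A minimal sketch of the client-side bookkeeping, assuming hypothetical
+``send_ack``/``resend`` hooks and a ``pending`` iterable of
+(hobject, request) pairs (this is not the actual Objecter
+implementation)::
+
+    class SessionBackoffs:
+        """Backoffs installed by one OSD session, keyed by backoff id."""
+
+        def __init__(self):
+            self.ranges = {}              # id -> list of (begin, end) still blocked
+
+        def handle_block(self, bid, begin, end, send_ack):
+            self.ranges[bid] = [(begin, end)]
+            send_ack(bid, begin, end)     # ack-block goes back immediately
+
+        def handle_unblock(self, bid, begin, end, pending, resend):
+            # release only the range named in the unblock; the rest of the
+            # original backoff (if the PG has since split) stays installed
+            kept = []
+            for b, e in self.ranges.get(bid, []):
+                if b < begin:
+                    kept.append((b, min(e, begin)))
+                if e > end:
+                    kept.append((max(b, end), e))
+            if kept:
+                self.ranges[bid] = kept
+            else:
+                self.ranges.pop(bid, None)
+            for hobj, request in pending:
+                released = begin <= hobj < end or begin == end == hobj
+                if released and not self.is_blocked(hobj):
+                    resend(request)
+
+        def is_blocked(self, hobj):
+            # a PG backoff covers the half-open range [begin, end); an
+            # object backoff (begin == end) covers exactly that hobject
+            for ranges in self.ranges.values():
+                for b, e in ranges:
+                    if (b == e and hobj == b) or (b <= hobj < e):
+                        return True
+            return False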
+
+On the OSD, backoffs are also tracked across ranges of the hash space, and
+exist in three states:
+
+#. new
+#. acked
+#. deleting
+
+A newly installed backoff is set to *new* and a message is sent to the
+client. When the *ack-block* message is received it is changed to the
+*acked* state. The OSD may process other messages from the client that
+are covered by the backoff in the *new* state, but once the backoff is
+*acked* it should never see a blocked request unless there is a bug.
+
+If the OSD wants to remove a backoff in the *acked* state it can
+simply remove it and notify the client. If the backoff is in the
+*new* state it must move it to the *deleting* state and continue to
+use it to discard client requests until the *ack-block* message is
+received, at which point it can finally be removed. This is necessary to
+preserve the order of operations processed by the OSD.
diff --git a/src/ceph/doc/dev/radosgw/admin/adminops_nonimplemented.rst b/src/ceph/doc/dev/radosgw/admin/adminops_nonimplemented.rst
new file mode 100644
index 0000000..e579bd5
--- /dev/null
+++ b/src/ceph/doc/dev/radosgw/admin/adminops_nonimplemented.rst
@@ -0,0 +1,495 @@
+==================
+ Admin Operations
+==================
+
+An admin API request is made on a URI that starts with the configurable 'admin'
+resource entry point. Authorization for the admin API duplicates the S3 authorization
+mechanism. Some operations require that the user holds special administrative capabilities.
+The response entity type (XML or JSON) may be specified as the 'format' option in the
+request and defaults to JSON if not specified.
+
+Get Object
+==========
+
+Get an existing object. NOTE: Does not require owner to be non-suspended.
+
+Syntax
+~~~~~~
+
+::
+
+ GET /{admin}/bucket?object&format=json HTTP/1.1
+ Host {fqdn}
+
+Request Parameters
+~~~~~~~~~~~~~~~~~~
+
+``bucket``
+
+:Description: The bucket containing the object to be retrieved.
+:Type: String
+:Example: ``foo_bucket``
+:Required: Yes
+
+``object``
+
+:Description: The object to be retrieved.
+:Type: String
+:Example: ``foo.txt``
+:Required: Yes
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+If successful, returns the desired object.
+
+``object``
+
+:Description: The desired object.
+:Type: Object
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+``NoSuchObject``
+
+:Description: Specified object does not exist.
+:Code: 404 Not Found
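+
+Example Request
+~~~~~~~~~~~~~~~
+
+As a rough illustration only (not part of the API definition), such a
+request could be issued from Python with the third-party ``requests``
+and ``requests-aws`` (``awsauth``) packages, which sign requests using
+the S3 scheme. The endpoint, entry point (``admin``), and credentials
+below are placeholders::
+
+    import requests
+    from awsauth import S3Auth      # third-party S3-style request signing
+
+    access_key = 'EXAMPLE_ACCESS_KEY'
+    secret_key = 'EXAMPLE_SECRET_KEY'
+    endpoint = 'localhost:8000'
+
+    r = requests.get(
+        'http://%s/admin/bucket' % endpoint,
+        params={'bucket': 'foo_bucket', 'object': 'foo.txt', 'format': 'json'},
+        auth=S3Auth(access_key, secret_key, endpoint),
+    )
+    print(r.status_code)
+    print(r.content)                # the object body on success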
+
+Head Object
+===========
+
+Verify the existence of an object. If the object exists,
+metadata headers for the object will be returned.
+
+Syntax
+~~~~~~
+
+::
+
+ HEAD /{admin}/bucket?object HTTP/1.1
+ Host {fqdn}
+
+Request Parameters
+~~~~~~~~~~~~~~~~~~
+
+``bucket``
+
+:Description: The bucket containing the object to be retrieved.
+:Type: String
+:Example: ``foo_bucket``
+:Required: Yes
+
+``object``
+
+:Description: The object to be retrieved.
+:Type: String
+:Example: ``foo.txt``
+:Required: Yes
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+None.
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+``NoSuchObject``
+
+:Description: Specified object does not exist.
+:Code: 404 Not Found
+
+Get Zone Info
+=============
+
+Get cluster information.
+
+Syntax
+~~~~~~
+
+::
+
+    GET /{admin}/zone?format=json HTTP/1.1
+    Host {fqdn}
+
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+If successful, returns cluster pool configuration.
+
+``zone``
+
+:Description: Contains current cluster pool configuration.
+:Type: Container
+
+``domain_root``
+
+:Description: Root of all buckets.
+:Type: String
+:Parent: ``zone``
+
+``control_pool``
+
+:Description: Control pool.
+:Type: String
+:Parent: ``zone``
+
+``gc_pool``
+
+:Description: Garbage collection pool.
+:Type: String
+:Parent: ``zone``
+
+``log_pool``
+
+:Description: Log pool.
+:Type: String
+:Parent: ``zone``
+
+``intent_log_pool``
+
+:Description: Intent log pool.
+:Type: String
+:Parent: ``zone``
+
+``usage_log_pool``
+
+:Description: Usage log pool.
+:Type: String
+:Parent: ``zone``
+
+``user_keys_pool``
+
+:Description: User key pool.
+:Type: String
+:Parent: ``zone``
+
+``user_email_pool``
+
+:Description: User email pool.
+:Type: String
+:Parent: ``zone``
+
+``user_swift_pool``
+
+:Description: Pool of swift users.
+:Type: String
+:Parent: ``zone``
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+None.
+
+Example Response
+~~~~~~~~~~~~~~~~
+
+::
+
+ HTTP/1.1 200
+ Content-Type: application/json
+
+ {
+ "domain_root": ".rgw",
+ "control_pool": ".rgw.control",
+ "gc_pool": ".rgw.gc",
+ "log_pool": ".log",
+ "intent_log_pool": ".intent-log",
+ "usage_log_pool": ".usage",
+ "user_keys_pool": ".users",
+ "user_email_pool": ".users.email",
+ "user_swift_pool": ".users.swift",
+ "user_uid_pool ": ".users.uid"
+ }
+
+
+
+Add Placement Pool
+==================
+
+Make a pool available for data placement.
+
+Syntax
+~~~~~~
+
+::
+
+ PUT /{admin}/pool?format=json HTTP/1.1
+ Host {fqdn}
+
+
+Request Parameters
+~~~~~~~~~~~~~~~~~~
+
+``pool``
+
+:Description: The pool to be made available for data placement.
+:Type: String
+:Example: ``foo_pool``
+:Required: Yes
+
+``create``
+
+:Description: Creates the data pool if it does not exist.
+:Type: Boolean
+:Example: False [False]
+:Required: No
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+TBD.
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+TBD.
+
+Remove Placement Pool
+=====================
+
+Make a pool unavailable for data placement.
+
+Syntax
+~~~~~~
+
+::
+
+ DELETE /{admin}/pool?format=json HTTP/1.1
+ Host {fqdn}
+
+
+Request Parameters
+~~~~~~~~~~~~~~~~~~
+
+``pool``
+
+:Description: The existing pool to be made available for data placement.
+:Type: String
+:Example: ``foo_pool``
+:Required: Yes
+
+``destroy``
+
+:Description: Destroys the pool after removing it from the active set.
+:Type: Boolean
+:Example: False [False]
+:Required: No
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+TBD.
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+TBD.
+
+List Available Data Placement Pools
+===================================
+
+List current pools available for data placement.
+
+Syntax
+~~~~~~
+
+::
+
+ GET /{admin}/pool?format=json HTTP/1.1
+ Host {fqdn}
+
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+If successful, returns a list of pools available for data placement.
+
+``pools``
+
+:Description: Contains currently available pools for data placement.
+:Type: Container
+
+
+
+List Expired Garbage Collection Items
+=====================================
+
+List objects scheduled for garbage collection.
+
+Syntax
+~~~~~~
+
+::
+
+ GET /{admin}/garbage?format=json HTTP/1.1
+ Host {fqdn}
+
+Request Parameters
+~~~~~~~~~~~~~~~~~~
+
+None.
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+If expired garbage collection items exist, a list of such objects
+will be returned.
+
+``garbage``
+
+:Description: Expired garbage collection items.
+:Type: Container
+
+``object``
+
+:Description: A container holding garbage collection object information.
+:Type: Container
+:Parent: ``garbage``
+
+``name``
+
+:Description: The name of the object.
+:Type: String
+:Parent: ``object``
+
+``expired``
+
+:Description: The date at which the object expired.
+:Type: String
+:Parent: ``object``
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+TBD.
+
+Manually Process Garbage Collection Items
+==========================================
+
+Manually process objects scheduled for garbage collection.
+
+Syntax
+~~~~~~
+
+::
+
+ DELETE /{admin}/garbage?format=json HTTP/1.1
+ Host {fqdn}
+
+Request Parameters
+~~~~~~~~~~~~~~~~~~
+
+None.
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+If expired garbage collection items exist, a list of removed objects
+will be returned.
+
+``garbage``
+
+:Description: Expired garbage collection items.
+:Type: Container
+
+``object``
+
+:Description: A container holding garbage collection object information.
+:Type: Container
+:Parent: ``garbage``
+
+``name``
+
+:Description: The name of the object.
+:Type: String
+:Parent: ``object``
+
+``expired``
+
+:Description: The date at which the object expired.
+:Type: String
+:Parent: ``object``
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+TBD.
+
+Show Log Objects
+================
+
+Show log objects
+
+Syntax
+~~~~~~
+
+::
+
+ GET /{admin}/log?format=json HTTP/1.1
+ Host {fqdn}
+
+Request Parameters
+~~~~~~~~~~~~~~~~~~
+
+``object``
+
+:Description: The log object to return.
+:Type: String
+:Example: ``2012-10-11-09-4165.2-foo_bucket``
+:Required: No
+
+Response Entities
+~~~~~~~~~~~~~~~~~
+
+If no object is specified, returns the full list of log objects.
+
+``log-objects``
+
+:Description: A list of log objects.
+:Type: Container
+
+``object``
+
+:Description: The name of the log object.
+:Type: String
+
+``log``
+
+:Description: The contents of the log object.
+:Type: Container
+
+Special Error Responses
+~~~~~~~~~~~~~~~~~~~~~~~
+
+None.
+
+Standard Error Responses
+========================
+
+``AccessDenied``
+
+:Description: Access denied.
+:Code: 403 Forbidden
+
+``InternalError``
+
+:Description: Internal server error.
+:Code: 500 Internal Server Error
+
+``NoSuchUser``
+
+:Description: User does not exist.
+:Code: 404 Not Found
+
+``NoSuchBucket``
+
+:Description: Bucket does not exist.
+:Code: 404 Not Found
+
+``NoSuchKey``
+
+:Description: No such access key.
+:Code: 404 Not Found
diff --git a/src/ceph/doc/dev/radosgw/index.rst b/src/ceph/doc/dev/radosgw/index.rst
new file mode 100644
index 0000000..5f77609
--- /dev/null
+++ b/src/ceph/doc/dev/radosgw/index.rst
@@ -0,0 +1,13 @@
+=======================================
+ RADOS Gateway developer documentation
+=======================================
+
+.. rubric:: Contents
+
+.. toctree::
+ :maxdepth: 1
+
+
+ usage
+ Admin Ops Nonimplemented <admin/adminops_nonimplemented>
+ s3_compliance
diff --git a/src/ceph/doc/dev/radosgw/s3_compliance.rst b/src/ceph/doc/dev/radosgw/s3_compliance.rst
new file mode 100644
index 0000000..b6b0d85
--- /dev/null
+++ b/src/ceph/doc/dev/radosgw/s3_compliance.rst
@@ -0,0 +1,304 @@
+===============================
+Rados Gateway S3 API Compliance
+===============================
+
+.. warning::
+ This document is a draft, it might not be accurate
+
+----------------------
+Naming code reference
+----------------------
+
+The following BNF grammar defines how a feature is named in the code for reference purposes::
+
+ name ::= request_type "_" ( header | operation ) ( "_" header_option )?
+
+ request_type ::= "req" | "res"
+
+ header ::= string
+
+ operation ::= method resource
+
+ method ::= "GET" | "PUT" | "POST" | "DELETE" | "OPTIONS" | "HEAD"
+
+ resource ::= string
+
+ header_option ::= string
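+
+For example, under this grammar the ``Content-MD5`` request header would be
+referenced as ``req_Content-MD5`` and the ``ETag`` response header as
+``res_ETag`` (names given purely for illustration).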
+
+----------------------
+Common Request Headers
+----------------------
+
+S3 Documentation reference : http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonRequestHeaders.html
+
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Header | Supported? | Code Links | Tests links |
++======================+============+=========================================================================================================+=============+
+| Authorization | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1962 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L2051 | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Content-Length | Yes | | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Content-Type | Yes | | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Content-MD5 | Yes | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1249 | |
+| | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1306 | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Date | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_auth_s3.cc#L164 | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Expect | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest.cc#L1227 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L802 | |
+| | | https://github.com/ceph/ceph/blob/76040d90f7eb9f9921a3b8dcd0f821ac2cd9c492/src/rgw/rgw_main.cc#L372 | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Host | ? | | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| x-amz-date | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_auth_s3.cc#L169 | |
+| | | should take precedence over DATE as mentioned here -> | |
+| | | http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonRequestHeaders.html | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| x-amz-security-token | No | | |
++----------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+
+-----------------------
+Common Response Headers
+-----------------------
+
+S3 Documentation reference : http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
+
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Header | Supported? | Code Links | Tests links |
++=====================+============+=========================================================================================================+=============+
+| Content-Length | Yes | | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Connection | ? | | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Date | ? | | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| ETag | Yes | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1312 | |
+| | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1436 | |
+| | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L2222 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L118 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L268 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L516 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1336 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1486 | |
+| | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1548 | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Server | No | | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| x-amz-delete-marker | No | | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| x-amz-id-2 | No | | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| x-amz-request-id | Yes | https://github.com/ceph/ceph/commit/b711e3124f8f73c17ebd19b38807a1b77f201e44 | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| x-amz-version-id | No | | |
++---------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+
+-------------------------
+Operations on the Service
+-------------------------
+
+S3 Documentation reference : http://docs.aws.amazon.com/AmazonS3/latest/API/RESTServiceOps.html
+
++------+-----------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Type | Operation | Supported? | Code links | Tests links |
++======+===========+============+=========================================================================================================+=============+
+| GET | Service | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L2094 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1676 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L185 | |
++------+-----------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+
+---------------------
+Operations on Buckets
+---------------------
+
+S3 Documentation reference : http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketOps.html
+
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| Type | Operation | Supported? | Code links | Tests links |
++========+========================+============+============================================================================================================+=============+
+| DELETE | Bucket | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1728 | |
+| | | | https://github.com/ceph/ceph/blob/e91042171939b6bf82a56a1015c5cae792d228ad/src/rgw/rgw_rest_bucket.cc#L250 | |
+| | | | https://github.com/ceph/ceph/blob/e91042171939b6bf82a56a1015c5cae792d228ad/src/rgw/rgw_rest_bucket.cc#L212 | |
+| | | | https://github.com/ceph/ceph/blob/25948319c4d256c4aeb0137eb88947e54d14cc79/src/rgw/rgw_bucket.cc#L856 | |
+| | | | https://github.com/ceph/ceph/blob/25948319c4d256c4aeb0137eb88947e54d14cc79/src/rgw/rgw_bucket.cc#L513 | |
+| | | | https://github.com/ceph/ceph/blob/25948319c4d256c4aeb0137eb88947e54d14cc79/src/rgw/rgw_bucket.cc#L286 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L461 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| DELETE | Bucket cors | ? | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1731 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1916 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| DELETE | Bucket lifecycle | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| DELETE | Bucket policy | ? | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| DELETE | Bucket tagging | ? | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| DELETE | Bucket website | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1676 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L185 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket acl | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1697 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1728 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1344 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket cors | ? | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1698 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1845 | |
+| | | | https://github.com/ceph/ceph/blob/76040d90f7eb9f9921a3b8dcd0f821ac2cd9c492/src/rgw/rgw_main.cc#L345 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket lifecycle | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket location | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket policy | ? | https://github.com/ceph/ceph/blob/e91042171939b6bf82a56a1015c5cae792d228ad/src/rgw/rgw_rest_bucket.cc#L232 | |
+| | | | https://github.com/ceph/ceph/blob/e91042171939b6bf82a56a1015c5cae792d228ad/src/rgw/rgw_rest_bucket.cc#L58 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket logging | ? | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1695 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L287 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket notification | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket tagging | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket Object versions | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket requestPayment | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket versionning | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | Bucket website | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| GET | List Multipart uploads | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1701 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest.cc#L877 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L2355 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L2363 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| HEAD | Bucket | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1713 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1689 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L826 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L834 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1725 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L382 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L437 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L901 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L945 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket acl | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1721 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1354 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1373 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1739 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1753 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket cors | ? | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1723 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1398 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1858 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1866 | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket lifecycle | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket policy | ? | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket logging | ? | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket notification | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket tagging | ? | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket requestPayment | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket versionning | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Bucket website | No | | |
++--------+------------------------+------------+------------------------------------------------------------------------------------------------------------+-------------+
+
+---------------------
+Operations on Objects
+---------------------
+
+S3 Documentation reference : http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html
+
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| Type | Operation | Supported? | Code links | Tests links |
++=========+===========================+============+=========================================================================================================+=============+
+| DELETE | Object | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1796 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1516 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1524 | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| DELETE | Multiple objects | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1739 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1616 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1626 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1641 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1667 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1516 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1524 | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| GET | Object | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1767 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L71 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L397 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L424 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L497 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L562 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L626 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L641 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L706 | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| GET | Object acl | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| GET | Object torrent | No | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| HEAD | Object | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1777 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L71 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L397 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L424 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L497 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L562 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L626 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L641 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L706 | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| OPTIONS | Object | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1814 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1418 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1951 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1968 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1993 | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| POST | Object | Yes | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1742 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L631 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L694 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L700 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L707 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L759 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L771 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L781 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L795 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L929 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1037 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1059 | |
+| | | | https://github.com/ceph/ceph/blob/8a2eb18494005aa968b71f18121da8ebab48e950/src/rgw/rgw_rest_s3.cc#L1134 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1344 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1360 | |
+| | | | https://github.com/ceph/ceph/blob/b139a7cd34b4e203ab164ada7a8fa590b50d8b13/src/rgw/rgw_op.cc#L1365 | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| POST | Object restore | ? | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Object | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Object acl | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Object copy | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Initate multipart upload | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Upload Part | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Upload Part copy | ? | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Complete multipart upload | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | Abort multipart upload | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
+| PUT | List parts | Yes | | |
++---------+---------------------------+------------+---------------------------------------------------------------------------------------------------------+-------------+
diff --git a/src/ceph/doc/dev/radosgw/usage.rst b/src/ceph/doc/dev/radosgw/usage.rst
new file mode 100644
index 0000000..6c856fc
--- /dev/null
+++ b/src/ceph/doc/dev/radosgw/usage.rst
@@ -0,0 +1,84 @@
+============================
+Usage Design Overview
+============================
+
+
+
+
+Testing
+-------
+
+The current usage testing does the following. First, perform these
+operations:
+
+ - Create a few buckets
+ - Remove buckets
+ - Create a bucket
+ - Put object
+ - Remove object
+
+Test:
+
+1. Verify that 'usage show' with delete_obj category isn't empty after no more than 45 seconds (wait to flush)
+2. Check the following
+
+ - 'usage show'
+
+ - does not error out
+ - num of entries > 0
+ - num of summary entries > 0
+ - for every entry in categories check successful_ops > 0
+ - check that correct uid in the user summary
+
+
+ - 'usage show' with specified uid (--uid=<uid>')
+
+ - num of entries > 0
+ - num of summary entries > 0
+ - for every entry in categories check successful_ops > 0
+ - check that correct uid in the user summary
+
+ - 'usage show' with specified uid and specified categories (create_bucket,
+ put_obj, delete_obj, delete_bucket)
+
+ - for each category:
+ - does not error out
+ - num of entries > 0
+ - user in user summary is correct user
+ - length of categories entries under user summary is exactly 1
+ - name of category under user summary is correct name
+ - successful ops for the category > 0
+
+ - 'usage trim' with specified uid
+ - does not error
+ - check following 'usage show' shows complete usage info cleared for user
+
+
+Additional required testing:
+
+ - test multiple users
+
+ Do the same as in (2), with multiple users being set up.
+
+ - test with multiple buckets (> 1000 * factor, e.g., 2000)
+
+   Create multiple buckets and put objects in each. Count the amount of data written and
+   verify that usage reports show the expected numbers (up to a certain delta).
+
+ - verify usage show with a date/time range
+
+   Take timestamps of the beginning and the end of the test. Round the timestamps to the
+   nearest hour (downward from the start of the test, upward from the end of the test). List
+   data starting at end-time and make sure that no data is shown. List data ending at
+   start-time and make sure that no data is shown. List data beginning at start-time and make
+   sure that the correct data is displayed. List data ending at end-time and make sure that
+   the correct data is displayed. List data beginning at start-time and ending at end-time,
+   and make sure that the correct data is displayed.
+
+ - verify usage trim with a date/time range
+
+   Take timestamps of the beginning and the end of the test. Round the timestamps to the
+   nearest hour (downward from the start of the test, upward from the end of the test). Trim
+   data starting at end-time and make sure that no data has been trimmed. Trim data ending at
+   start-time and make sure that no data has been trimmed. Trim data beginning at start-time
+   and ending at end-time, and make sure that all data has been trimmed.
diff --git a/src/ceph/doc/dev/rbd-diff.rst b/src/ceph/doc/dev/rbd-diff.rst
new file mode 100644
index 0000000..083c131
--- /dev/null
+++ b/src/ceph/doc/dev/rbd-diff.rst
@@ -0,0 +1,146 @@
+RBD Incremental Backup
+======================
+
+This is a simple streaming file format for representing a diff between
+two snapshots (or a snapshot and the head) of an RBD image.
+
+Header
+~~~~~~
+
+"rbd diff v1\\n"
+
+Metadata records
+~~~~~~~~~~~~~~~~
+
+Every record has a one byte "tag" that identifies the record type,
+followed by some other data.
+
+Metadata records come in the first part of the image. Order is not
+important, as long as all the metadata records come before the data
+records.
+
+From snap
+---------
+
+- u8: 'f'
+- le32: snap name length
+- snap name
+
+To snap
+-------
+
+- u8: 't'
+- le32: snap name length
+- snap name
+
+Size
+----
+
+- u8: 's'
+- le64: (ending) image size
+
+Data Records
+~~~~~~~~~~~~
+
+These records come in the second part of the sequence.
+
+Updated data
+------------
+
+- u8: 'w'
+- le64: offset
+- le64: length
+- length bytes of actual data
+
+Zero data
+---------
+
+- u8: 'z'
+- le64: offset
+- le64: length
+
+
+Final Record
+~~~~~~~~~~~~
+
+End
+---
+
+- u8: 'e'
+
+
+Header
+~~~~~~
+
+"rbd diff v2\\n"
+
+Metadata records
+~~~~~~~~~~~~~~~~
+
+Every record has a one byte "tag" that identifies the record type,
+followed by length of data, and then some other data.
+
+Metadata records come in the first part of the image. Order is not
+important, as long as all the metadata records come before the data
+records.
+
+In v2, each record is framed as:
+
+- (1 byte) tag
+- (8 bytes) length
+- (n bytes) data
+
+In this way, unrecognized tags can be skipped.
+
+From snap
+---------
+
+- u8: 'f'
+- le64: length of the data that follows (4 + length)
+- le32: snap name length
+- snap name
+
+To snap
+-------
+
+- u8: 't'
+- le64: length of the data that follows (4 + length)
+- le32: snap name length
+- snap name
+
+Size
+----
+
+- u8: 's'
+- le64: length of the data that follows (8)
+- le64: (ending) image size
+
+Data Records
+~~~~~~~~~~~~
+
+These records come in the second part of the sequence.
+
+Updated data
+------------
+
+- u8: 'w'
+- le64: length of the data that follows (8 + 8 + length)
+- le64: offset
+- le64: length
+- length bytes of actual data
+
+Zero data
+---------
+
+- u8: 'z'
+- le64: length of the data that follows (8 + 8)
+- le64: offset
+- le64: length
+
+
+Final Record
+~~~~~~~~~~~~
+
+End
+---
+
+- u8: 'e'
diff --git a/src/ceph/doc/dev/rbd-export.rst b/src/ceph/doc/dev/rbd-export.rst
new file mode 100644
index 0000000..0d34d61
--- /dev/null
+++ b/src/ceph/doc/dev/rbd-export.rst
@@ -0,0 +1,84 @@
+RBD Export & Import
+===================
+
+This is a file format of an RBD image or snapshot. It's a sparse format
+for the full image. There are three recording sections in the file.
+
+(1) Header.
+(2) Metadata.
+(3) Diffs.
+
+Header
+~~~~~~
+
+"rbd image v2\\n"
+
+Metadata records
+~~~~~~~~~~~~~~~~
+
+Every record has a one byte "tag" that identifies the record type,
+followed by length of data, and then some other data.
+
+Metadata records come in the first part of the image. Order is not
+important, as long as all the metadata records come before the data
+records.
+
+In v2, each record is framed as:
+
+- (1 byte) tag
+- (8 bytes) length
+- (n bytes) data
+
+In this way, unrecognized tags can be skipped.
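+
+A minimal sketch of such a reader (illustrative only, not the ``rbd``
+tool's implementation); the ``handlers`` mapping is a made-up convention
+for this example::
+
+    import struct
+
+    def walk_metadata(f, handlers):
+        """Iterate (tag, length, data) records, skipping unknown tags."""
+        while True:
+            tag = f.read(1)
+            if not tag or tag == b'E':      # end record, or end of stream
+                return
+            (length,) = struct.unpack('<Q', f.read(8))
+            data = f.read(length)
+            if tag in handlers:
+                handlers[tag](data)         # e.g. b'O': decode image order
+            # unknown tags are silently skipped thanks to the length field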
+
+Image order
+-----------
+
+- u8: 'O'
+- le64: length of the data that follows (8)
+- le64: image order
+
+Image format
+------------
+
+- u8: 'F'
+- le64: length of the data that follows (8)
+- le64: image format
+
+Image Features
+--------------
+
+- u8: 'T'
+- le64: length of the data that follows (8)
+- le64: image features
+
+Image Stripe unit
+-----------------
+
+- u8: 'U'
+- le64: length of the data that follows (8)
+- le64: image striping unit
+
+Image Stripe count
+------------------
+
+- u8: 'C'
+- le64: length of the data that follows (8)
+- le64: image striping count
+
+Final Record
+~~~~~~~~~~~~
+
+End
+---
+
+- u8: 'E'
+
+
+Diffs records
+~~~~~~~~~~~~~
+
+This section records all snapshots and the HEAD of the image.
+
+- le64: number of diffs
+- Diffs ...
+
+For details, please refer to rbd-diff.rst.
diff --git a/src/ceph/doc/dev/rbd-layering.rst b/src/ceph/doc/dev/rbd-layering.rst
new file mode 100644
index 0000000..e6e224c
--- /dev/null
+++ b/src/ceph/doc/dev/rbd-layering.rst
@@ -0,0 +1,281 @@
+============
+RBD Layering
+============
+
+RBD layering refers to the creation of copy-on-write clones of block
+devices. This allows for fast image creation, for example to clone a
+golden master image of a virtual machine into a new instance. To
+simplify the semantics, you can only create a clone of a snapshot -
+snapshots are always read-only, so the rest of the image is
+unaffected, and there's no possibility of writing to them
+accidentally.
+
+From a user's perspective, a clone is just like any other rbd image.
+You can take snapshots of them, read/write them, resize them, etc.
+There are no restrictions on clones from a user's viewpoint.
+
+Note: below, `child` means an rbd image created by cloning, and
+`parent` means the rbd image snapshot that the child was cloned from.
+
+Command line interface
+----------------------
+
+Before cloning a snapshot, you must mark it as protected, to prevent
+it from being deleted while child images refer to it:
+::
+
+ $ rbd snap protect pool/image@snap
+
+Then you can perform the clone:
+::
+
+ $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1
+
+You can create a clone with different object sizes from the parent:
+::
+
+ $ rbd clone --order 25 pool/parent@snap pool2/child2
+
+To delete the parent, you must first mark it unprotected, which checks
+that there are no children left:
+::
+
+ $ rbd snap unprotect pool/image@snap
+ Cannot unprotect: Still in use by pool2/image2
+ $ rbd children pool/image@snap
+ pool2/child1
+ pool2/child2
+ $ rbd flatten pool2/child1
+ $ rbd rm pool2/child2
+ $ rbd snap rm pool/image@snap
+ Cannot remove a protected snapshot: pool/image@snap
+ $ rbd snap unprotect pool/image@snap
+
+Then the snapshot can be deleted like normal:
+::
+
+ $ rbd snap rm pool/image@snap
+
+Implementation
+--------------
+
+Data Flow
+^^^^^^^^^
+
+In the initial implementation, called 'trivial layering', there will
+be no tracking of which objects exist in a clone. A read that hits a
+non-existent object will attempt to read from the parent snapshot, and
+this will continue recursively until an object exists or an image with
+no parent is found. This is done through the normal read path from
+the parent, so differing object sizes between parents and children
+do not matter.
+
+Before a write to an object is performed, the object is checked for
+existence. If it doesn't exist, a copy-up operation is performed,
+which means reading the relevant range of data from the parent
+snapshot and writing it (plus the original write) to the child
+image. To prevent races with multiple writes trying to copy-up the
+same object, this copy-up operation will include an atomic create. If
+the atomic create fails, the original write is done instead. This
+copy-up operation is implemented as a class method so that extra
+metadata can be stored by it in the future. In trivial layering, the
+copy-up operation copies the entire range needed to the child object
+(that is, the full size of the child object). A future optimization
+could make this copy-up more fine-grained.
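+
+The read and copy-up paths described above can be summarized with the
+following illustrative pseudocode (Python-style; the helper names are
+hypothetical and do not correspond to actual librbd functions, and
+object-size mapping between parent and child is glossed over)::
+
+    def read_object(image, obj_no, off, length):
+        if image.object_exists(obj_no):
+            return image.read(obj_no, off, length)
+        if image.parent is None:
+            return b'\0' * length                      # sparse: treat as zeroes
+        return read_object(image.parent, obj_no, off, length)   # recurse to the parent
+
+    def write_object(child, obj_no, off, data):
+        if child.object_exists(obj_no):
+            child.write(obj_no, off, data)             # normal write
+            return
+        # copy-up: read the full object range from the parent chain, then
+        # atomically create the child object with that data plus the new write
+        parent_data = read_object(child.parent, obj_no, 0, child.object_size)
+        if not child.atomic_create_and_write(obj_no, parent_data, off, data):
+            child.write(obj_no, off, data)             # another writer copied up first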
+
+Another future optimization could be storing a bitmap of which objects
+actually exist in a child. This would obviate the check for existence
+before each write, and let reads go directly to the parent if needed.
+
+These optimizations are discussed in:
+
+http://marc.info/?l=ceph-devel&m=129867273303846
+
+Parent/Child relationships
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Children store a reference to their parent in their header, as a tuple
+of (pool id, image id, snapshot id). This is enough information to
+open the parent and read from it.
+
+In addition to knowing which parent a given image has, we want to be
+able to tell if a protected snapshot still has children. This is
+accomplished with a new per-pool object, `rbd_children`, which maps
+(parent pool id, parent image id, parent snapshot id) to a list of
+child image ids. This is stored in the same pool as the child image
+because the client creating a clone already has read/write access to
+everything in this pool, but may not have write access to the parent's
+pool. This lets a client with read-only access to one pool clone a
+snapshot from that pool into a pool they have full access to. It
+increases the cost of unprotecting an image, since this needs to check
+for children in every pool, but this is a rare operation. It would
+likely only be done before removing old images, which is already much
+more expensive because it involves deleting every data object in the
+image.
+
+Protection
+^^^^^^^^^^
+
+Internally, protection_state is a field in the header object that
+can be in one of three states: "protected", "unprotected", and
+"unprotecting". The first two are set as the result of "rbd
+protect/unprotect". The "unprotecting" state is set while the "rbd
+unprotect" command checks for any child images. Only snapshots in the
+"protected" state may be cloned, so the "unprotecting" state prevents
+a race like the following (a sketch of the unprotect sequence follows
+the list):
+
+1. A: walk through all pools, look for clones, find none
+2. B: create a clone
+3. A: unprotect parent
+4. A: rbd snap rm pool/parent@snap
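+
+A minimal sketch of that unprotect sequence (illustrative Python
+pseudocode, not the actual cls_rbd/librbd code; ``has_children`` and
+``all_pools`` are hypothetical helpers)::
+
+    def unprotect(snap):
+        snap.protection_state = 'unprotecting'     # new clones are refused from here on
+        if any(has_children(pool, snap) for pool in all_pools()):
+            snap.protection_state = 'protected'    # still in use; abort the unprotect
+            raise RuntimeError('Cannot unprotect: still in use')
+        snap.protection_state = 'unprotected'      # snapshot may now be removed
+
+    def clone(snap, child_name):
+        if snap.protection_state != 'protected':   # blocks step 2 of the race above
+            raise RuntimeError('snapshot must be protected')
+        ...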
+
+Resizing
+^^^^^^^^
+
+Resizing an rbd image is like truncating a sparse file. New space is
+treated as zeroes, and shrinking an rbd image deletes the contents
+beyond the old bounds. This means that if you have a 10G image full of
+data, and you resize it down to 5G and then up to 10G again, the last
+5G is treated as zeroes (and any objects that held that data were
+removed when the image was shrunk).
+
+Layering complicates this because the absence of an object no longer
+implies it should be treated as zeroes - if the object is part of a
+clone, it may mean that some data needs to be read from the parent.
+
+To preserve the resizing behavior for clones, we need to keep track of
+which objects could be stored in the parent. We can track this as the
+amount of overlap the child has with the parent, since resizing only
+changes the end of an image. When a child is created, its overlap
+is the size of the parent snapshot. On each subsequent resize, the
+overlap is `min(overlap, new_size)`. That is, shrinking the image
+may shrink the overlap, but increasing the image's size does not
+change the overlap.
+
+Objects that do not exist past the overlap are treated as zeroes.
+Objects that do not exist before that point fall back to reading
+from the parent.
+
+Since this overlap changes over time, we store it as part of the
+metadata for a snapshot as well.
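+
+A small sketch of the overlap rule (illustrative Python, not librbd
+code; the dict fields are hypothetical)::
+
+    def clone(parent_snap_size):
+        return {'size': parent_snap_size, 'overlap': parent_snap_size}
+
+    def resize(child, new_size):
+        child['overlap'] = min(child['overlap'], new_size)  # shrinking reduces the overlap
+        child['size'] = new_size                            # growing never raises it again
+
+    def absent_object_behaviour(child, offset):
+        # past the overlap: zeroes; before it: fall back to the parent
+        return 'zeroes' if offset >= child['overlap'] else 'read from parent'
+
+    img = clone(10 << 30)        # child of a 10G parent snapshot
+    resize(img, 5 << 30)         # overlap shrinks to 5G
+    resize(img, 10 << 30)        # size grows back to 10G, overlap stays 5G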
+
+Renaming
+^^^^^^^^
+
+Currently the rbd header object (that stores all the metadata about an
+image) is named after the name of the image. This makes renaming
+disrupt clients who have the image open (such as children reading from
+a parent). To avoid this, we can name the header object by the
+id of the image, which does not change. That is, the name of the
+header object could be `rbd_header.$id`, where $id is a unique id for
+the image in the pool.
+
+When a client opens an image, all it knows is the name. There is
+already a per-pool `rbd_directory` object that maps image names to
+ids, but if we relied on it to get the id, we could not open any
+images in that pool if that single object was unavailable. To avoid
+this dependency, we can store the id of an image in an object called
+`rbd_id.$image_name`, where $image_name is the name of the image. The
+per-pool `rbd_directory` object is still useful for listing all images
+in a pool, however.
+
+Header changes
+--------------
+
+The header needs a few new fields:
+
+* int64_t parent_pool_id
+* string parent_image_id
+* uint64_t parent_snap_id
+* uint64_t overlap (how much of the image may be referring to the parent)
+
+These are stored in a "parent" key, which is only present if the image
+has a parent.
+
+cls_rbd
+^^^^^^^
+
+Some new methods are needed:
+::
+
+ /***************** methods on the rbd header *********************/
+ /**
+ * Sets the parent and overlap keys.
+ * Fails if any of these keys exist, since the image already
+ * had a parent.
+ */
+ set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)
+
+ /**
+ * returns the parent pool id, image id, snap id, and overlap, or -ENOENT
+ * if parent_pool_id does not exist or is -1
+ */
+ get_parent(uint64_t snapid)
+
+ /**
+ * Removes the parent key
+ */
+ remove_parent() // after all parent data is copied to the child
+
+ /*************** methods on the rbd_children object *****************/
+
+ add_child(uint64_t parent_pool_id, string parent_image_id,
+ uint64_t parent_snap_id, string image_id);
+ remove_child(uint64_t parent_pool_id, string parent_image_id,
+ uint64_t parent_snap_id, string image_id);
+ /**
+ * List ids of a given parent
+ */
+ get_children(uint64_t parent_pool_id, string parent_image_id,
+ uint64_t parent_snap_id, uint64_t max_return,
+ string start);
+ /**
+ * list parent
+ */
+ get_parents(uint64_t max_return, uint64_t start_pool_id,
+ string start_image_id, string start_snap_id);
+
+
+ /************ methods on the rbd_id.$image_name object **************/
+
+ set_id(string id)
+ get_id()
+
+ /************** methods on the rbd_directory object *****************/
+
+ dir_get_id(string name);
+ dir_get_name(string id);
+ dir_list(string start_after, uint64_t max_return);
+ dir_add_image(string name, string id);
+ dir_remove_image(string name, string id);
+ dir_rename_image(string src, string dest, string id);
+
+Two existing methods will change if the image supports
+layering:
+::
+
+ snapshot_add - stores current overlap and has_parent with
+ other snapshot metadata (images that don't have
+ layering enabled aren't affected)
+
+ set_size - will adjust the parent overlap down as needed.
+
+librbd
+^^^^^^
+
+Opening a child image opens its parent (and this will continue
+recursively as needed). This means that an ImageCtx will contain a
+pointer to the parent image context. Differing object sizes won't
+matter, since reading from the parent will go through the parent
+image context.
+
+Discard will need to change for layered images so that it only
+truncates objects, and does not remove them. If we removed objects, we
+could not tell if we needed to read them from the parent.
+
+A new clone method will be added, which takes the same arguments as
+create except size (size of the parent image is used).
+
+Instead of expanding the rbd_info struct, we will break the metadata
+retrieval into several API calls. Right now, the only users of
+rbd_stat() other than 'rbd info' only use it to retrieve image size.
diff --git a/src/ceph/doc/dev/release-process.rst b/src/ceph/doc/dev/release-process.rst
new file mode 100644
index 0000000..f7e853b
--- /dev/null
+++ b/src/ceph/doc/dev/release-process.rst
@@ -0,0 +1,173 @@
+======================
+ Ceph Release Process
+======================
+
+1. Build environment
+====================
+
+There are multiple build environments. Debian-based packages are built via pbuilder for multiple distributions. The build hosts are listed in the ``deb_hosts`` file, and the list of distributions is in ``deb_dist``. All distributions are built on each of the build hosts. Currently there is one 64-bit and one 32-bit build host.
+
+The RPM based packages are built natively, so one distribution per build host. The list of hosts is found in ``rpm_hosts``.
+
+Prior to building, it's necessary to update the pbuilder seed tarballs::
+
+ ./update_all_pbuilders.sh
+
+2. Setup keyring for signing packages
+=====================================
+
+::
+
+ export GNUPGHOME=<path to keyring dir>
+
+ # verify it's accessible
+ gpg --list-keys
+
+The release key should be present::
+
+ pub 4096R/17ED316D 2012-05-20
+ uid Ceph Release Key <sage@newdream.net>
+
+
+3. Set up build area
+====================
+
+Clone the ceph and ceph-build source trees::
+
+ git clone http://github.com/ceph/ceph.git
+ git clone http://github.com/ceph/ceph-build.git
+
+In the ceph source directory, check out the ``next`` branch (for point releases, use the {codename} branch)::
+
+ git checkout next
+
+Checkout the submodules::
+
+ git submodule update --force --init --recursive
+
+4. Update Build version numbers
+================================
+
+Wherever the string ``0.xx`` appears below, substitute the actual ceph release number.
+
+Edit configure.ac and update the version number. Example diff::
+
+ -AC_INIT([ceph], [0.54], [ceph-devel@vger.kernel.org])
+ +AC_INIT([ceph], [0.55], [ceph-devel@vger.kernel.org])
+
+Update the version number in the debian change log::
+
+ DEBEMAIL user@host dch -v 0.xx-1
+
+Commit the changes::
+
+ git commit -a
+
+Tag the release::
+
+ ../ceph-build/tag-release v0.xx
+
+
+5. Create Makefiles
+===================
+
+The actual configure options used to build packages are in the
+``ceph.spec.in`` and ``debian/rules`` files. At this point we just
+need to create a Makefile.::
+
+ ./do_autogen.sh
+
+
+6. Run the release scripts
+==========================
+
+This creates tarballs and copies them, along with other needed files, to
+the build hosts listed in deb_hosts and rpm_hosts, runs a local build
+script, and then rsyncs the results back to the specified release directory.::
+
+ ../ceph-build/do_release.sh /tmp/release
+
+
+7. Create RPM Repo
+==================
+
+Copy the rpms to the destination repo::
+
+ mkdir /tmp/rpm-repo
+ ../ceph-build/push_to_rpm_repo.sh /tmp/release /tmp/rpm-repo 0.xx
+
+Next, add to the repo any additional rpms that are needed, such as leveldb
+and ceph-deploy. See the RPM Backports section.
+
+Finally, sign the rpms and build the repo indexes::
+
+ ../ceph-build/sign_and_index_rpm_repo.sh /tmp/release /tmp/rpm-repo 0.xx
+
+
+8. Create Debian repo
+=====================
+
+The key-id used below is the id of the ceph release key from step 2::
+
+ mkdir /tmp/debian-repo
+ ../ceph-build/gen_reprepro_conf.sh /tmp/debian-repo key-id
+ ../ceph-build/push_to_deb_repo.sh /tmp/release /tmp/debian-repo 0.xx main
+
+
+Next, add any additional debian packages that are needed, such as leveldb and
+ceph-deploy. See the Debian Backports section below.
+
+Debian packages are signed when added to the repo, so no further action is
+needed.
+
+
+9. Push repos to ceph.org
+==========================
+
+For a development release::
+
+ rcp ceph-0.xx.tar.bz2 ceph-0.xx.tar.gz \
+ ceph_site@ceph.com:ceph.com/downloads/.
+ rsync -av /tmp/rpm-repo/0.xx/ ceph_site@ceph.com:ceph.com/rpm-testing
+ rsync -av /tmp/debian-repo/ ceph_site@ceph.com:ceph.com/debian-testing
+
+For a stable release, replace {CODENAME} with the release codename (e.g., ``argonaut`` or ``bobtail``)::
+
+ rcp ceph-0.xx.tar.bz2 \
+ ceph_site@ceph.com:ceph.com/downloads/ceph-0.xx.tar.bz2
+ rcp ceph-0.xx.tar.gz \
+ ceph_site@ceph.com:ceph.com/downloads/ceph-0.xx.tar.gz
+ rsync -av /tmp/rpm-repo/0.xx/ ceph_site@ceph.com:ceph.com/rpm-{CODENAME}
+ rsync -auv /tmp/debian-repo/ ceph_site@ceph.com:ceph.com/debian-{CODENAME}
+
+10. Update Git
+==============
+
+Point release
+-------------
+
+For point releases just push the version number update to the
+branch and the new tag::
+
+ git push origin {codename}
+ git push origin v0.xx
+
+Development and Stable releases
+-------------------------------
+
+For a development release, update tags for ``ceph.git``::
+
+ git push origin v0.xx
+ git push origin HEAD:last
+ git checkout master
+ git merge next
+ git push origin master
+ git push origin HEAD:next
+
+Similarly, for a development release, for both ``teuthology.git`` and ``ceph-qa-suite.git``::
+
+ git checkout master
+ git reset --hard origin/master
+ git branch -f last origin/next
+ git push -f origin last
+ git push -f origin master:next
diff --git a/src/ceph/doc/dev/repo-access.rst b/src/ceph/doc/dev/repo-access.rst
new file mode 100644
index 0000000..0196b0a
--- /dev/null
+++ b/src/ceph/doc/dev/repo-access.rst
@@ -0,0 +1,38 @@
+Notes on Ceph repositories
+==========================
+
+Special branches
+----------------
+
+* ``master``: current tip (integration branch)
+* ``next``: pending release (feature frozen, bugfixes only)
+* ``last``: last/previous release
+* ``dumpling``, ``cuttlefish``, ``bobtail``, ``argonaut``, etc.: stable release branches
+* ``dumpling-next``: backports for stable release, pending testing
+
+Rules
+-----
+
+The source repos are all on github.
+
+* Any branch pushed to ceph.git will kick off builds that will either
+ run unit tests or generate packages for gitbuilder.ceph.com. Try
+ not to generate unnecessary load. For private, unreviewed work,
+ only push to branches named ``wip-*``. This avoids colliding with
+ any special branches.
+
+* Nothing should ever reach a special branch unless it has been
+ reviewed.
+
+* Preferred means of review is via github pull requests to capture any
+ review discussion.
+
+* For multi-patch series, the pull request can be merged via github,
+ and a Reviewed-by: ... line added to the merge commit.
+
+* For single- (or few-) patch merges, it is preferable to add the
+ Reviewed-by: directly to the commit so that it is also visible when
+ the patch is cherry-picked for backports.
+
+* All backports should use ``git cherry-pick -x`` to capture which
+ commit they are cherry-picking from.
diff --git a/src/ceph/doc/dev/sepia.rst b/src/ceph/doc/dev/sepia.rst
new file mode 100644
index 0000000..6f83b2c
--- /dev/null
+++ b/src/ceph/doc/dev/sepia.rst
@@ -0,0 +1,9 @@
+Sepia community test lab
+========================
+
+The Ceph community maintains a test lab that is open to active
+contributors to the Ceph project. Please see the `Sepia repository`_ for more
+information.
+
+.. _Sepia repository: https://github.com/ceph/sepia
+
diff --git a/src/ceph/doc/dev/session_authentication.rst b/src/ceph/doc/dev/session_authentication.rst
new file mode 100644
index 0000000..e8a5059
--- /dev/null
+++ b/src/ceph/doc/dev/session_authentication.rst
@@ -0,0 +1,160 @@
+==============================================
+Session Authentication for the Cephx Protocol
+==============================================
+Peter Reiher
+7/30/12
+
+The original Cephx protocol authenticated the client to the authenticator and set up a session
+key used to authenticate the client to the server it needs to talk to. It did not, however,
+authenticate the ongoing messages between the client and server. Based on the fact that they
+share a secret key, these ongoing session messages can be easily authenticated by using the
+key to sign the messages.
+
+This document describes changes to the code that allow such ongoing session authentication.
+The changes allow for future changes that permit other authentication protocols (and the
+existing null NONE and UNKNOWN protocols) to handle signatures, but the only protocol that
+actually does signatures, at the time of the writing, is the Cephx protocol.
+
+Introduction
+-------------
+
+This code comes into play after the Cephx protocol has completed. At this point, the client and
+server share a secret key. This key will be used for authentication. For other protocols, there
+may or may not be such a key in place, and perhaps the actual procedures used to perform
+signing will be different, so the code is written to be general.
+
+The "session" here is represented by an established pipe. For such pipes, there should be a
+``session_security`` structure attached to the pipe. Whenever a message is to be sent on the
+pipe, code that handles the signature for this kind of session security will be called. On the
+other end of the pipe, code that checks this kind of session security's message signatures will
+be called. Messages that fail the signature check will not be processed further. That implies
+that the sender had better be in agreement with the receiver on the session security being used,
+since otherwise messages will be uniformly dropped between them.
+
+The code is also prepared to handle encryption and decryption of session messages, which would
+add secrecy to the integrity provided by the signatures. No protocol currently implemented
+encrypts the ongoing session messages, though.
+
+For this functionality to work, several steps are required. First, the sender and receiver must have
+a successful run of the cephx protocol to establish a shared key. They must store that key somewhere
+that the pipe can get at later, to permit messages to be signed with it. Sent messages must be
+signed, and received messages must have their signatures checked.
+
+The signature could be computed in a variety of ways, but currently its size is limited to 64 bits.
+A message's signature is placed in its footer, in a field called ``sig``.
+
+The signature code in Cephx can be turned on and off at runtime, using a Ceph boolean option called
+``cephx_sign_messages``. It currently defaults to false, so no messages will be signed. It
+must be changed to true to cause signatures to be calculated and checked.
+
+Storing the Key
+---------------
+
+The key is needed to create signatures on the sending end and check signatures on the receiving end.
+In the future, if asymmetric crypto is an option, it's possible that two keys (a private one for
+this end of the pipe and a public one for the other end) would need to be stored. At this time,
+messages going in both directions will be signed with the same key, so only that key needs to be
+saved.
+
+The key is saved when the pipe is established. On the client side, this happens in ``connect()``,
+which is located in ``msg/Pipe.cc``. The key is obtained from a run of the Cephx protocol,
+which results in a successfully checked authorizer structure. If there is such an authorizer
+available, the code calls ``get_auth_session_handler()`` to create a new authentication session handler
+and stores it in the pipe data structure. On the server side, a similar thing is done in
+``accept()`` after the authorizer provided by the client has been verified.
+
+Once these things are done on either end of the connection, session authentication can start.
+
+These routines (``connect()`` and ``accept()``) are also used to handle situations where a new
+session is being set up. At this stage, no authorizer has been created yet, so there's no key.
+Special cases in the code that calls the signature code skip these calls when the
+``CEPH_AUTH_UNKNOWN`` protocol is in use. This protocol label is on the pre-authorizer
+messages in a session, indicating that negotiation on an authentication protocol is ongoing and
+thus signature is not possible. There will be a reliable authentication operation later in this
+session before anything sensitive should be passed, so this is not a security problem.
+
+Signing Messages
+----------------
+
+Messages are signed in the ``write_message`` call located in ``msg/Pipe.cc``. The actual
+signature process is to encrypt the CRCs for the message using the shared key. Thus, we must
+defer signing until all CRCs have been computed. The header CRC is computed last, so we
+call ``sign_message()`` as soon as we've calculated that CRC.
+
+``sign_message()`` is a virtual function defined in ``auth/AuthSessionHandler.h``. Thus,
+a specific version of it must be written for each authentication protocol supported. Currently,
+only UNKNOWN, NONE and CEPHX are supported. So there is a separate version of ``sign_message()`` in
+``auth/unknown/AuthUnknownSessionHandler.h``, ``auth/none/AuthNoneSessionHandler.h`` and
+``auth/cephx/CephxSessionHandler.cc``. The UNKNOWN and NONE versions simply return 0, indicating
+success.
+
+The CEPHX version is more extensive. It is found in ``auth/cephx/CephxSessionHandler.cc``.
+The first thing done is to determine if the run time option to handle signatures (see above) is on.
+If not, the Cephx version of ``sign_message()`` simply returns success without actually calculating
+a signature or inserting it into the message.
+
+If the run time option is enabled, ``sign_message()`` copies all of the message's CRCs (one from the
+header and three from the footer) into a buffer. It calls ``encode_encrypt()`` on the buffer,
+using the key obtained from the pipe's ``session_security`` structure. 64 bits of the encrypted
+result are put into the message footer's signature field and a footer flag is set to indicate that
+the message was signed. (This flag is a sanity check. It is not regarded as definitive
+evidence that the message was signed. The presence of a ``session_security`` structure at the
+receiving end requires a signature regardless of the value of this flag.) If this all goes well,
+``sign_message()`` returns 0. If there is a problem anywhere along the line and no signature
+was computed, it returns ``SESSION_SIGNATURE_FAILURE``.
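+
+The following Python sketch illustrates the shape of this computation.
+It is not bit-compatible with Ceph: a keyed HMAC stands in for the
+actual ``encode_encrypt()`` cipher, and the function name and arguments
+are only for illustration::
+
+    import hashlib
+    import hmac
+    import struct
+
+    def compute_signature(key, header_crc, front_crc, middle_crc, data_crc):
+        # gather the message's four CRCs into one buffer, as sign_message() does
+        buf = struct.pack('<IIII', header_crc, front_crc, middle_crc, data_crc)
+        digest = hmac.new(key, buf, hashlib.sha256).digest()
+        # keep 64 bits of the result for the footer's sig field
+        (sig,) = struct.unpack('<Q', digest[:8])
+        return sig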
+
+Checking Signatures
+-------------------
+
+The signature is checked by a routine called ``check_message_signature()``. This is also a
+virtual function, defined in ``auth/AuthSessionHandler.h``. So again there are specific versions
+for supported authentication protocols, such as UNKNOWN, NONE and CEPHX. Again, the UNKNOWN and
+NONE versions are stored in ``auth/unknown/AuthUnknownSessionHandler.h`` and
+``auth/none/AuthNoneSessionHandler.h``, respectively, and again they simply return 0, indicating
+success.
+
+The CEPHX version of ``check_message_signature()`` performs a real signature check. This routine
+(stored in ``auth/cephx/CephxSessionHandler.cc``) exits with success if the run time option has
+disabled signatures. Otherwise, it takes the CRCs from the header and footer, encrypts them,
+and compares the result to the signature stored in the footer. Since an earlier routine has checked
+that the CRCs actually match the contents of the message, it is unnecessary to recompute the CRCs
+on the raw data in the message. The encryption is performed with the same ``encode_encrypt()``
+routine used on the sending end, using the key stored in the local ``session_security``
+data structure.
+
+If everything checks out, the CEPHX routine returns 0, indicating success. If there is a
+problem, the routine returns ``SESSION_SIGNATURE_FAILURE``.
+
+Adding New Session Authentication Methods
+-----------------------------------------
+
+For the purpose of session authentication only (not the basic authentication of client and
+server currently performed by the Cephx protocol), in addition to adding a new protocol, that
+protocol must have a ``sign_message()`` routine and a ``check_message_signature()`` routine.
+These routines will take a message pointer as a parameter and return 0 on success. The procedure
+used to sign and check will be specific to the new method, but probably there will be a
+``session_security`` structure attached to the pipe that contains a cryptographic key. This
+structure will be either an ``AuthSessionHandler`` (found in ``auth/AuthSessionHandler.h``)
+or a structure derived from that type.
+
+Adding Encryption to Sessions
+-----------------------------
+
+The existing code is partially, but not fully, set up to allow sessions to have their packets
+encrypted. Part of adding encryption would be similar to adding a new authentication method.
+But one would also need to add calls to the encryption and decryption routines in ``write_message()``
+and ``read_message()``. These calls would probably go near where the current calls for
+authentication are made. You should consider whether you want to replace the existing calls
+with something more general that does whatever the chosen form of session security requires,
+rather than explicitly saying ``sign`` or ``encrypt``.
+
+Session Security Statistics
+---------------------------
+
+The existing Cephx authentication code keeps statistics on how many messages were signed, how
+many message signatures were checked, and how many checks succeeded and failed. It is prepared
+to keep similar statistics on encryption and decryption. These statistics can be accessed through
+the call ``printAuthSessionHandlerStats`` in ``auth/AuthSessionHandler.cc``.
+
+If new authentication or encryption methods are added, they should include code that keeps these
+statistics.
diff --git a/src/ceph/doc/dev/versions.rst b/src/ceph/doc/dev/versions.rst
new file mode 100644
index 0000000..34ed747
--- /dev/null
+++ b/src/ceph/doc/dev/versions.rst
@@ -0,0 +1,42 @@
+==================
+Public OSD Version
+==================
+
+We maintain two versions on disk: an eversion_t pg_log.head and a
+version_t info.user_version. Each object is tagged with both the pg
+version and user_version it was last modified with. The PG version is
+modified by manipulating OpContext::at_version and then persisting it
+to the pg log as transactions, and is incremented in all the places it
+used to be. The user_version is modified by manipulating the new
+OpContext::user_at_version and is also persisted via the pg log
+transactions.
+
+user_at_version is modified only in PrimaryLogPG::prepare_transaction
+when the op was a "user modify" (a non-watch write), and the durable
+user_version is updated according to the following rules (see the
+sketch after the list):
+
+1) set user_at_version to the maximum of ctx->new_obs.oi.user_version+1
+   and info.last_user_version+1.
+2) set user_at_version to the maximum of itself and
+   ctx->at_version.version.
+3) ctx->new_obs.oi.user_version = ctx->user_at_version (to change the
+   object's user_version).
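+
+Expressed as a small Python sketch (illustrative only; the real logic
+lives in PrimaryLogPG::prepare_transaction)::
+
+    def update_user_version(obj_user_version, last_user_version, at_version_version):
+        # rule 1: advance past both the object's and the PG's last user version
+        user_at_version = max(obj_user_version + 1, last_user_version + 1)
+        # rule 2: never fall behind the version component of ctx->at_version
+        user_at_version = max(user_at_version, at_version_version)
+        # rule 3: the object records the new user_version
+        new_obj_user_version = user_at_version
+        return user_at_version, new_obj_user_version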
+
+This set of update semantics means that for traditional pools the
+user_version will be equal to the past reassert_version, while for
+caching pools the object and PG user_version will be able to cross
+pools without making a total mess of things.
+
+In order to support old clients, we keep the old reassert_version but
+rename it to "bad_replay_version"; we fill it in as before: for writes
+it is set to the at_version (and is the proper replay version); for
+watches it is set to our user version; for ENOENT replies it is set to
+the replay version's epoch but the user_version's version. We also now
+fill in the version_t portion of the bad_replay_version on read ops as
+well as write ops, which should be fine for all old clients.
+
+For new clients, we prevent them from reading bad_replay_version and
+add two proper members: user_version and replay_version; user_version
+is filled in on every operation (reads included) while replay_version
+is filled in for writes.
+
+The objclass function get_current_version() now always returns the
+pg->info.last_user_version, which means it is guaranteed to contain
+the version of the last user update in the PG (including on reads!).
diff --git a/src/ceph/doc/dev/wireshark.rst b/src/ceph/doc/dev/wireshark.rst
new file mode 100644
index 0000000..e03b362
--- /dev/null
+++ b/src/ceph/doc/dev/wireshark.rst
@@ -0,0 +1,41 @@
+=====================
+ Wireshark Dissector
+=====================
+
+Wireshark has support for the Ceph protocol and it will be shipped in the 1.12.1
+release.
+
+Using
+=====
+
+To use the Wireshark dissector you must build it from `git`__; the process is
+outlined in great detail in the `Building and Installing`__ section of the
+`Wireshark Users Guide`__.
+
+__ `Wireshark git`_
+__ WSUG_BI_
+__ WSUG_
+
+Developing
+==========
+
+The Ceph dissector lives in `Wireshark git`_ at
+``epan/dissectors/packet-ceph.c``. At the top of that file there are some
+comments explaining how to insert new functionality or to update the encoding
+of existing types.
+
+Before you start hacking on Wireshark code you should look at the
+``doc/README.developer`` and ``doc/README.dissector`` documents as they explain
+the basics of writing dissectors. After reading those two documents you should
+be prepared to work on the Ceph dissector. `The Wireshark
+developers guide`__ also contains a lot of useful information but it is less
+directed and is more useful as a reference than as an introduction.
+
+__ WSDG_
+
+.. _WSUG: https://www.wireshark.org/docs/wsug_html_chunked/
+.. _WSDG: https://www.wireshark.org/docs/wsdg_html_chunked/
+.. _WSUG_BI: https://www.wireshark.org/docs/wsug_html_chunked/ChapterBuildInstall.html
+.. _Wireshark git: https://www.wireshark.org/develop.html
+
+.. vi: textwidth=80 noexpandtab