Diffstat (limited to 'src/ceph/doc/cephfs')
-rw-r--r--  src/ceph/doc/cephfs/administration.rst  198
-rw-r--r--  src/ceph/doc/cephfs/best-practices.rst  88
-rw-r--r--  src/ceph/doc/cephfs/capabilities.rst  111
-rw-r--r--  src/ceph/doc/cephfs/cephfs-journal-tool.rst  238
-rw-r--r--  src/ceph/doc/cephfs/client-auth.rst  102
-rw-r--r--  src/ceph/doc/cephfs/client-config-ref.rst  214
-rw-r--r--  src/ceph/doc/cephfs/createfs.rst  62
-rw-r--r--  src/ceph/doc/cephfs/dirfrags.rst  100
-rw-r--r--  src/ceph/doc/cephfs/disaster-recovery.rst  280
-rw-r--r--  src/ceph/doc/cephfs/eviction.rst  190
-rw-r--r--  src/ceph/doc/cephfs/experimental-features.rst  107
-rw-r--r--  src/ceph/doc/cephfs/file-layouts.rst  215
-rw-r--r--  src/ceph/doc/cephfs/fstab.rst  46
-rw-r--r--  src/ceph/doc/cephfs/full.rst  60
-rw-r--r--  src/ceph/doc/cephfs/fuse.rst  52
-rw-r--r--  src/ceph/doc/cephfs/hadoop.rst  202
-rw-r--r--  src/ceph/doc/cephfs/health-messages.rst  127
-rw-r--r--  src/ceph/doc/cephfs/index.rst  116
-rw-r--r--  src/ceph/doc/cephfs/journaler.rst  41
-rw-r--r--  src/ceph/doc/cephfs/kernel.rst  37
-rw-r--r--  src/ceph/doc/cephfs/mantle.rst  263
-rw-r--r--  src/ceph/doc/cephfs/mds-config-ref.rst  629
-rw-r--r--  src/ceph/doc/cephfs/multimds.rst  147
-rw-r--r--  src/ceph/doc/cephfs/posix.rst  49
-rw-r--r--  src/ceph/doc/cephfs/quota.rst  70
-rw-r--r--  src/ceph/doc/cephfs/standby.rst  222
-rw-r--r--  src/ceph/doc/cephfs/troubleshooting.rst  160
-rw-r--r--  src/ceph/doc/cephfs/upgrading.rst  34
28 files changed, 4160 insertions, 0 deletions
diff --git a/src/ceph/doc/cephfs/administration.rst b/src/ceph/doc/cephfs/administration.rst
new file mode 100644
index 0000000..e9d9195
--- /dev/null
+++ b/src/ceph/doc/cephfs/administration.rst
@@ -0,0 +1,198 @@
+
+CephFS Administrative commands
+==============================
+
+Filesystems
+-----------
+
+These commands operate on the CephFS filesystems in your Ceph cluster.
+Note that by default only one filesystem is permitted: to enable
+creation of multiple filesystems use ``ceph fs flag set enable_multiple true``.
+
+::
+
+ fs new <filesystem name> <metadata pool name> <data pool name>
+
+::
+
+ fs ls
+
+::
+
+ fs rm <filesystem name> [--yes-i-really-mean-it]
+
+::
+
+ fs reset <filesystem name>
+
+::
+
+ fs get <filesystem name>
+
+::
+
+ fs set <filesystem name> <var> <val>
+
+::
+
+ fs add_data_pool <filesystem name> <pool name/id>
+
+::
+
+ fs rm_data_pool <filesystem name> <pool name/id>
+
+
+Settings
+--------
+
+::
+
+ fs set <fs name> max_file_size <size in bytes>
+
+CephFS has a configurable maximum file size, and it's 1TB by default.
+You may wish to set this limit higher if you expect to store large files
+in CephFS. It is a 64-bit field.
+
+Setting ``max_file_size`` to 0 does not disable the limit. It would
+simply limit clients to only creating empty files.
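+
+For example, to raise the limit to 4 TiB on a filesystem named ``cephfs``
+(the filesystem name and the new size are purely illustrative):
+
+::
+
+ fs set cephfs max_file_size 4398046511104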
+
+
+Maximum file sizes and performance
+----------------------------------
+
+CephFS enforces the maximum file size limit at the point of appending to
+files or setting their size. It does not affect how anything is stored.
+
+When a user creates a file of an enormous size (without necessarily
+writing any data to it), some operations (such as deletes) force the MDS
+to perform a large number of operations in order to check whether any of
+the RADOS objects that could exist within the range implied by the file
+size actually exist.
+
+The ``max_file_size`` setting prevents users from creating files that
+appear to be, for example, exabytes in size, which would load the MDS as it
+tries to enumerate the objects during operations like stats or deletes.
+
+
+Daemons
+-------
+
+These commands act on specific mds daemons or ranks.
+
+::
+
+ mds fail <gid/name/role>
+
+Mark an MDS daemon as failed. This is equivalent to what the cluster
+would do if an MDS daemon had failed to send a message to the mon
+for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable
+standby is available, using ``mds fail`` will force a failover to the standby.
+
+If the MDS daemon was in reality still running, then using ``mds fail``
+will cause the daemon to restart. If it was active and a standby was
+available, then the "failed" daemon will return as a standby.
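+
+For example, to fail the daemon currently holding rank 0 of the default
+filesystem (the rank here is illustrative):
+
+::
+
+ mds fail 0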
+
+::
+
+ mds deactivate <role>
+
+Deactivate an MDS, causing it to flush its entire journal to
+backing RADOS objects and close all open client sessions. Deactivating an MDS
+is primarily intended for bringing down a rank after reducing the number of
+active MDS daemons (``max_mds``). Once the rank is deactivated, the MDS daemon
+will rejoin the cluster as a standby.
+
+``<role>`` can take one of three forms:
+
+::
+
+ <fs_name>:<rank>
+ <fs_id>:<rank>
+ <rank>
+
+Use ``mds deactivate`` in conjunction with adjustments to ``max_mds`` to
+shrink an MDS cluster. See :doc:`/cephfs/multimds`.
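+
+For example, a minimal sketch of shrinking from two active daemons to one on
+a filesystem named ``cephfs`` (the name is an assumption):
+
+::
+
+ fs set cephfs max_mds 1
+ mds deactivate cephfs:1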
+
+::
+
+ tell mds.<daemon name>
+
+::
+
+ mds metadata <gid/name/role>
+
+::
+
+ mds repaired <role>
+
+
+Global settings
+---------------
+
+::
+
+ fs dump
+
+::
+
+ fs flag set <flag name> <flag val> [<confirmation string>]
+
+"flag name" must be one of ['enable_multiple']
+
+Some flags require you to confirm your intentions with ``--yes-i-really-mean-it``
+or a similar string that the command will prompt you with. Consider these
+actions carefully before proceeding; the confirmations are placed on especially
+dangerous activities.
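+
+For example, enabling support for multiple filesystems requires an explicit
+confirmation:
+
+::
+
+ fs flag set enable_multiple true --yes-i-really-mean-it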
+
+
+Advanced
+--------
+
+These commands are not required in normal operation, and exist
+for use in exceptional circumstances. Incorrect use of these
+commands may cause serious problems, such as an inaccessible
+filesystem.
+
+::
+
+ mds compat rm_compat
+
+::
+
+ mds compat rm_incompat
+
+::
+
+ mds compat show
+
+::
+
+ mds getmap
+
+::
+
+ mds set_state
+
+::
+
+ mds rmfailed
+
+Legacy
+------
+
+The ``ceph mds set`` command is the deprecated version of ``ceph fs set``,
+from before there was more than one filesystem per cluster. It operates
+on whichever filesystem is marked as the default (see ``ceph fs
+set-default``).
+
+::
+
+ mds stat
+ mds dump # replaced by "fs get"
+ mds stop # replaced by "mds deactivate"
+ mds set_max_mds # replaced by "fs set max_mds"
+ mds set # replaced by "fs set"
+ mds cluster_down # replaced by "fs set cluster_down"
+ mds cluster_up # replaced by "fs set cluster_up"
+ mds newfs # replaced by "fs new"
+ mds add_data_pool # replaced by "fs add_data_pool"
+ mds remove_data_pool # replaced by "fs rm_data_pool"
+
diff --git a/src/ceph/doc/cephfs/best-practices.rst b/src/ceph/doc/cephfs/best-practices.rst
new file mode 100644
index 0000000..79c638e
--- /dev/null
+++ b/src/ceph/doc/cephfs/best-practices.rst
@@ -0,0 +1,88 @@
+
+CephFS best practices
+=====================
+
+This guide provides recommendations for best results when deploying CephFS.
+
+For the actual configuration guide for CephFS, please see the instructions
+at :doc:`/cephfs/index`.
+
+Which Ceph version?
+-------------------
+
+Use at least the Jewel (v10.2.0) release of Ceph. This is the first
+release to include stable CephFS code and fsck/repair tools. Make sure
+you are using the latest point release to get bug fixes.
+
+Note that Ceph releases do not include a kernel: the kernel is versioned
+and released separately. See below for guidance on choosing an
+appropriate kernel version if you are using the kernel client
+for CephFS.
+
+Most stable configuration
+-------------------------
+
+Some features in CephFS are still experimental. See
+:doc:`/cephfs/experimental-features` for guidance on these.
+
+For the best chance of a happy healthy filesystem, use a **single active MDS**
+and **do not use snapshots**. Both of these are the default.
+
+Note that creating multiple MDS daemons is fine, as these will simply be
+used as standbys. However, for best stability you should avoid
+adjusting ``max_mds`` upwards, as this would cause multiple
+daemons to be active at once.
+
+Which client?
+-------------
+
+The fuse client is the easiest way to get up-to-date code, while
+the kernel client will often give better performance.
+
+The clients do not always provide equivalent functionality; for example,
+the fuse client supports client-enforced quotas while the kernel client
+does not.
+
+When encountering bugs or performance issues, it is often instructive to
+try using the other client, in order to find out whether the bug was
+client-specific or not (and then to let the developers know).
+
+Which kernel version?
+~~~~~~~~~~~~~~~~~~~~~
+
+Because the kernel client is distributed as part of the Linux kernel (not
+as part of packaged Ceph releases),
+you will need to consider which kernel version to use on your client nodes.
+Older kernels are known to include buggy Ceph clients, and may not support
+features that more recent Ceph clusters support.
+
+Remember that the "latest" kernel in a stable Linux distribution is likely
+to be years behind the latest upstream Linux kernel where Ceph development
+takes place (including bug fixes).
+
+As a rough guide, as of Ceph 10.x (Jewel), you should be using at least a
+4.x kernel. If you absolutely have to use an older kernel, you should use
+the fuse client instead of the kernel client.
+
+This advice does not apply if you are using a Linux distribution that
+includes CephFS support, as in this case the distributor will be responsible
+for backporting fixes to their stable kernel: check with your vendor.
+
+Reporting issues
+----------------
+
+If you have identified a specific issue, please report it with as much
+information as possible. Especially important information:
+
+* Ceph versions installed on client and server
+* Whether you are using the kernel or fuse client
+* If you are using the kernel client, what kernel version?
+* How many clients are in play, doing what kind of workload?
+* If a system is 'stuck', is that affecting all clients or just one?
+* Any ceph health messages
+* Any backtraces in the ceph logs from crashes
+
+If you are satisfied that you have found a bug, please file it on
+http://tracker.ceph.com. For more general queries please write
+to the ceph-users mailing list.
+
diff --git a/src/ceph/doc/cephfs/capabilities.rst b/src/ceph/doc/cephfs/capabilities.rst
new file mode 100644
index 0000000..7cd2716
--- /dev/null
+++ b/src/ceph/doc/cephfs/capabilities.rst
@@ -0,0 +1,111 @@
+======================
+Capabilities in CephFS
+======================
+When a client wants to operate on an inode, it will query the MDS in various
+ways, which will then grant the client a set of **capabilities**. These
+grant the client permissions to operate on the inode in various ways. One
+of the major differences from other network filesystems (e.g. NFS or SMB) is
+that the capabilities granted are quite granular, and it's possible that
+multiple clients can hold different capabilities on the same inodes.
+
+Types of Capabilities
+---------------------
+There are several "generic" capability bits. These denote what sort of ability
+the capability grants.
+
+::
+
+ /* generic cap bits */
+ #define CEPH_CAP_GSHARED 1 /* client can read (s) */
+ #define CEPH_CAP_GEXCL 2 /* client can read and update (x) */
+ #define CEPH_CAP_GCACHE 4 /* (file) client can cache reads (c) */
+ #define CEPH_CAP_GRD 8 /* (file) client can read (r) */
+ #define CEPH_CAP_GWR 16 /* (file) client can write (w) */
+ #define CEPH_CAP_GBUFFER 32 /* (file) client can buffer writes (b) */
+ #define CEPH_CAP_GWREXTEND 64 /* (file) client can extend EOF (a) */
+ #define CEPH_CAP_GLAZYIO 128 /* (file) client can perform lazy io (l) */
+
+These are then shifted by a particular number of bits. These denote a part of
+the inode's data or metadata on which the capability is being granted:
+
+::
+
+ /* per-lock shift */
+ #define CEPH_CAP_SAUTH 2 /* A */
+ #define CEPH_CAP_SLINK 4 /* L */
+ #define CEPH_CAP_SXATTR 6 /* X */
+ #define CEPH_CAP_SFILE 8 /* F */
+
+Only certain generic cap types are ever granted for some of those "shifts",
+however. In particular, only the FILE shift ever has more than the first two
+bits.
+
+::
+
+ | AUTH | LINK | XATTR | FILE
+    2      4       6      8
+
+From the above, we get a number of constants that are generated by taking
+each bit value and shifting it into the correct position in the word:
+
+::
+
+ #define CEPH_CAP_AUTH_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SAUTH)
+
+These bits can then be or'ed together to make a bitmask denoting a set of
+capabilities.
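+
+As a worked example, plugging in the values defined above (the derived names
+follow the same pattern as ``CEPH_CAP_AUTH_SHARED``):
+
+::
+
+ CEPH_CAP_AUTH_SHARED = CEPH_CAP_GSHARED << CEPH_CAP_SAUTH = 1 << 2 = 4
+ CEPH_CAP_FILE_RD = CEPH_CAP_GRD << CEPH_CAP_SFILE = 8 << 8 = 2048
+ CEPH_CAP_FILE_WR = CEPH_CAP_GWR << CEPH_CAP_SFILE = 16 << 8 = 4096
+ CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR = 2048 | 4096 = 6144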
+
+There is one exception:
+
+::
+
+ #define CEPH_CAP_PIN 1 /* no specific capabilities beyond the pin */
+
+The "pin" just pins the inode into memory, without granting any other caps.
+
+Graphically:
+
+::
+
+ +---+---+---+---+---+---+---+---+
+ | p | _ |As   x |Ls   x |Xs   x |
+ +---+---+---+---+---+---+---+---+
+ |Fs   x   c   r   w   b   a   l |
+ +---+---+---+---+---+---+---+---+
+
+The second bit is currently unused.
+
+Abilities granted by each cap:
+------------------------------
+While that is how capabilities are granted (and communicated), the important
+bit is what they actually allow the client to do:
+
+* PIN: this just pins the inode into memory. This is sufficient to allow the
+ client to get to the inode number, as well as other immutable things like
+ major or minor numbers in a device inode, or symlink contents.
+
+* AUTH: this grants the ability to get to the authentication-related metadata.
+ In particular, the owner, group and mode. Note that doing a full permission
+ check may require getting at ACLs as well, which are stored in xattrs.
+
+* LINK: the link count of the inode
+
+* XATTR: ability to access or manipulate xattrs. Note that since ACLs are
+ stored in xattrs, it's also sometimes necessary to access them when checking
+ permissions.
+
+* FILE: this is the big one. These allow the client to access and manipulate
+ file data. It also covers certain metadata relating to file data -- the
+ size, mtime, atime and ctime, in particular.
+
+Shorthand:
+----------
+Note that the client logging can also present a compact representation of the
+capabilities. For example:
+
+::
+
+ pAsLsXsFs
+
+The 'p' represents the pin. Each capital letter corresponds to the shift
+values, and the lowercase letters after each shift are for the actual
+capabilities granted in each shift.
diff --git a/src/ceph/doc/cephfs/cephfs-journal-tool.rst b/src/ceph/doc/cephfs/cephfs-journal-tool.rst
new file mode 100644
index 0000000..0dd54fb
--- /dev/null
+++ b/src/ceph/doc/cephfs/cephfs-journal-tool.rst
@@ -0,0 +1,238 @@
+
+cephfs-journal-tool
+===================
+
+Purpose
+-------
+
+If a CephFS journal has become damaged, expert intervention may be required
+to restore the filesystem to a working state.
+
+The ``cephfs-journal-tool`` utility provides functionality to aid experts in
+examining, modifying, and extracting data from journals.
+
+.. warning::
+
+ This tool is **dangerous** because it directly modifies internal
+ data structures of the filesystem. Make backups, be careful, and
+ seek expert advice. If you are unsure, do not run this tool.
+
+Syntax
+------
+
+::
+
+ cephfs-journal-tool journal <inspect|import|export|reset>
+ cephfs-journal-tool header <get|set>
+ cephfs-journal-tool event <get|splice|apply> [filter] <list|json|summary|binary>
+
+
+The tool operates in three modes: ``journal``, ``header`` and ``event``,
+meaning the whole journal, the header, and the events within the journal
+respectively.
+
+Journal mode
+------------
+
+This should be your starting point to assess the state of a journal.
+
+* ``inspect`` reports on the health of the journal. This will identify any
+ missing objects or corruption in the stored journal. Note that this does
+ not identify inconsistencies in the events themselves, just that events are
+ present and can be decoded.
+
+* ``import`` and ``export`` read and write binary dumps of the journal
+ in a sparse file format. Pass the filename as the last argument. The
+ export operation may not work reliably for journals which are damaged (missing
+ objects).
+
+* ``reset`` truncates a journal, discarding any information within it.
+
+
+Example: journal inspect
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool journal inspect
+ Overall journal integrity: DAMAGED
+ Objects missing:
+ 0x1
+ Corrupt regions:
+ 0x400000-ffffffffffffffff
+
+Example: Journal import/export
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool journal export myjournal.bin
+ journal is 4194304~80643
+ read 80643 bytes at offset 4194304
+ wrote 80643 bytes at offset 4194304 to myjournal.bin
+ NOTE: this is a _sparse_ file; you can
+ $ tar cSzf myjournal.bin.tgz myjournal.bin
+ to efficiently compress it while preserving sparseness.
+
+ # cephfs-journal-tool journal import myjournal.bin
+ undump myjournal.bin
+ start 4194304 len 80643
+ writing header 200.00000000
+ writing 4194304~80643
+ done.
+
+.. note::
+
+ It is wise to use the ``journal export <backup file>`` command to make a journal backup
+ before any further manipulation.
+
+Header mode
+-----------
+
+* ``get`` outputs the current content of the journal header
+
+* ``set`` modifies an attribute of the header. Allowed attributes are
+ ``trimmed_pos``, ``expire_pos`` and ``write_pos``.
+
+Example: header get/set
+~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool header get
+ { "magic": "ceph fs volume v011",
+ "write_pos": 4274947,
+ "expire_pos": 4194304,
+ "trimmed_pos": 4194303,
+ "layout": { "stripe_unit": 4194304,
+ "stripe_count": 4194304,
+ "object_size": 4194304,
+ "cas_hash": 4194304,
+ "object_stripe_unit": 4194304,
+ "pg_pool": 4194304}}
+
+ # cephfs-journal-tool header set trimmed_pos 4194303
+ Updating trimmed_pos 0x400000 -> 0x3fffff
+ Successfully updated header.
+
+
+Event mode
+----------
+
+Event mode allows detailed examination and manipulation of the contents of the journal. Event
+mode can operate on all events in the journal, or filters may be applied.
+
+The arguments following ``cephfs-journal-tool event`` consist of an action, optional filter
+parameters, and an output mode:
+
+::
+
+ cephfs-journal-tool event <action> [filter] <output>
+
+Actions:
+
+* ``get`` read the events from the log
+* ``splice`` erase events or regions in the journal
+* ``apply`` extract filesystem metadata from events and attempt to apply it to the metadata store.
+
+Filtering:
+
+* ``--range <int begin>..[int end]`` only include events within the range begin (inclusive) to end (exclusive)
+* ``--path <path substring>`` only include events referring to metadata containing the specified string
+* ``--inode <int>`` only include events referring to the specified inode
+* ``--type <type string>`` only include events of this type
+* ``--frag <ino>[.frag id]`` only include events referring to this directory fragment
+* ``--dname <string>`` only include events referring to this named dentry within a directory
+ fragment (may only be used in conjunction with ``--frag``)
+* ``--client <int>`` only include events from this client session ID
+
+Filters may be combined on an AND basis (i.e. only the intersection of events from each filter).
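+
+For example, a sketch combining two of the filters above (the type and path
+values are purely illustrative):
+
+::
+
+ cephfs-journal-tool event get --type UPDATE --path dirbravo list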
+
+Output modes:
+
+* ``binary``: write each event as a binary file, within a folder whose name is controlled by ``--path``
+* ``json``: write all events to a single file, as a JSON serialized list of objects
+* ``summary``: write a human readable summary of the events read to standard out
+* ``list``: write a human readable terse listing of the type of each event, and
+ which file paths the event affects.
+
+
+Example: event mode
+~~~~~~~~~~~~~~~~~~~
+
+::
+
+ # cephfs-journal-tool event get json --path output.json
+ Wrote output to JSON file 'output.json'
+
+ # cephfs-journal-tool event get summary
+ Events by type:
+ NOOP: 2
+ OPEN: 2
+ SESSION: 2
+ SUBTREEMAP: 1
+ UPDATE: 43
+
+ # cephfs-journal-tool event get list
+ 0x400000 SUBTREEMAP: ()
+ 0x400308 SESSION: ()
+ 0x4003de UPDATE: (setattr)
+ /
+ 0x40068b UPDATE: (mkdir)
+ diralpha
+ 0x400d1b UPDATE: (mkdir)
+ diralpha/filealpha1
+ 0x401666 UPDATE: (unlink_local)
+ stray0/10000000001
+ diralpha/filealpha1
+ 0x40228d UPDATE: (unlink_local)
+ diralpha
+ stray0/10000000000
+ 0x402bf9 UPDATE: (scatter_writebehind)
+ stray0
+ 0x403150 UPDATE: (mkdir)
+ dirbravo
+ 0x4037e0 UPDATE: (openc)
+ dirbravo/.filebravo1.swp
+ 0x404032 UPDATE: (openc)
+ dirbravo/.filebravo1.swpx
+
+ # cephfs-journal-tool event get --path /filebravo1 list
+ 0x40785a UPDATE: (openc)
+ dirbravo/filebravo1
+ 0x4103ee UPDATE: (cap update)
+ dirbravo/filebravo1
+
+ # cephfs-journal-tool event splice --range 0x40f754..0x410bf1 summary
+ Events by type:
+ OPEN: 1
+ UPDATE: 2
+
+ # cephfs-journal-tool event apply --range 0x410bf1.. summary
+ Events by type:
+ NOOP: 1
+ SESSION: 1
+ UPDATE: 9
+
+ # cephfs-journal-tool event get --inode=1099511627776 list
+ 0x40068b UPDATE: (mkdir)
+ diralpha
+ 0x400d1b UPDATE: (mkdir)
+ diralpha/filealpha1
+ 0x401666 UPDATE: (unlink_local)
+ stray0/10000000001
+ diralpha/filealpha1
+ 0x40228d UPDATE: (unlink_local)
+ diralpha
+ stray0/10000000000
+
+ # cephfs-journal-tool event get --frag=1099511627776 --dname=filealpha1 list
+ 0x400d1b UPDATE: (mkdir)
+ diralpha/filealpha1
+ 0x401666 UPDATE: (unlink_local)
+ stray0/10000000001
+ diralpha/filealpha1
+
+ # cephfs-journal-tool event get binary --path bin_events
+ Wrote output to binary files in directory 'bin_events'
+
diff --git a/src/ceph/doc/cephfs/client-auth.rst b/src/ceph/doc/cephfs/client-auth.rst
new file mode 100644
index 0000000..fbf694b
--- /dev/null
+++ b/src/ceph/doc/cephfs/client-auth.rst
@@ -0,0 +1,102 @@
+================================
+CephFS Client Capabilities
+================================
+
+Use Ceph authentication capabilities to restrict your filesystem clients
+to the lowest possible level of authority needed.
+
+.. note::
+
+ Path restriction and layout modification restriction are new features
+ in the Jewel release of Ceph.
+
+Path restriction
+================
+
+By default, clients are not restricted in what paths they are allowed to mount.
+Further, when clients mount a subdirectory, e.g., /home/user, the MDS does not
+by default verify that subsequent operations
+are ‘locked’ within that directory.
+
+To restrict clients to only mount and work within a certain directory, use
+path-based MDS authentication capabilities.
+
+Syntax
+------
+
+To grant rw access to the specified directory only, specify that
+directory when creating the key for a client, using the following syntax. ::
+
+ ceph fs authorize *filesystem_name* client.*client_name* /*specified_directory* rw
+
+For example, to restrict client ``foo`` to writing only in the ``bar`` directory of filesystem ``cephfs``, use ::
+
+ ceph fs authorize cephfs client.foo / r /bar rw
+
+To completely restrict the client to the ``bar`` directory, omit the
+root directory ::
+
+ ceph fs authorize cephfs client.foo /bar rw
+
+Note that if a client's read access is restricted to a path, it will only
+be able to mount the filesystem when specifying a readable path in the
+mount command (see below).
+
+
+See `User Management - Add a User to a Keyring`_ for additional details on user management.
+
+To restrict a client to a specified sub-directory only, specify that
+directory when mounting, using the following syntax. ::
+
+ ./ceph-fuse -n client.*client_name* *mount_path* -r *directory_to_be_mounted*
+
+For example, to restrict client ``foo`` to the ``mnt/bar`` directory, we would use ::
+
+ ./ceph-fuse -n client.foo mnt -r /bar
+
+Free space reporting
+--------------------
+
+By default, when a client is mounting a sub-directory, the used space (``df``)
+will be calculated from the quota on that sub-directory, rather than reporting
+the overall amount of space used on the cluster.
+
+If you would like the client to report the overall usage of the filesystem,
+and not just the quota usage on the sub-directory mounted, then set the
+following config option on the client:
+
+::
+
+ client quota df = false
+
+If quotas are not enabled, or no quota is set on the sub-directory mounted,
+then the overall usage of the filesystem will be reported irrespective of
+the value of this setting.
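+
+For instance, a minimal sketch of the ``[client]`` section of ``ceph.conf``
+on the client host:
+
+::
+
+ [client]
+     client quota df = false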
+
+Layout and Quota restriction (the 'p' flag)
+===========================================
+
+To set layouts or quotas, clients require the 'p' flag in addition to 'rw'.
+This restricts all the attributes that are set by special extended attributes
+with a "ceph." prefix, as well as restricting other means of setting
+these fields (such as openc operations with layouts).
+
+For example, in the following snippet client.0 can modify layouts and quotas,
+but client.1 cannot.
+
+::
+
+ client.0
+ key: AQAz7EVWygILFRAAdIcuJ12opU/JKyfFmxhuaw==
+ caps: [mds] allow rwp
+ caps: [mon] allow r
+ caps: [osd] allow rw pool=data
+
+ client.1
+ key: AQAz7EVWygILFRAAdIcuJ12opU/JKyfFmxhuaw==
+ caps: [mds] allow rw
+ caps: [mon] allow r
+ caps: [osd] allow rw pool=data
+
+
+.. _User Management - Add a User to a Keyring: ../../rados/operations/user-management/#add-a-user-to-a-keyring
diff --git a/src/ceph/doc/cephfs/client-config-ref.rst b/src/ceph/doc/cephfs/client-config-ref.rst
new file mode 100644
index 0000000..6a149ac
--- /dev/null
+++ b/src/ceph/doc/cephfs/client-config-ref.rst
@@ -0,0 +1,214 @@
+========================
+ Client Config Reference
+========================
+
+``client acl type``
+
+:Description: Set the ACL type. Currently, the only possible value is ``"posix_acl"`` to enable POSIX ACL, or an empty string. This option only takes effect when ``fuse_default_permissions`` is set to ``false``.
+
+:Type: String
+:Default: ``""`` (no ACL enforcement)
+
+``client cache mid``
+
+:Description: Set client cache midpoint. The midpoint splits the least recently used lists into a hot and warm list.
+:Type: Float
+:Default: ``0.75``
+
+``client_cache_size``
+
+:Description: Set the number of inodes that the client keeps in the metadata cache.
+:Type: Integer
+:Default: ``16384``
+
+``client_caps_release_delay``
+
+:Description: Set the delay between capability releases in seconds. The delay sets how many seconds a client waits to release capabilities that it no longer needs in case the capabilities are needed for another user space operation.
+:Type: Integer
+:Default: ``5`` (seconds)
+
+``client_debug_force_sync_read``
+
+:Description: If set to ``true``, clients read data directly from OSDs instead of using a local page cache.
+:Type: Boolean
+:Default: ``false``
+
+``client_dirsize_rbytes``
+
+:Description: If set to ``true``, use the recursive size of a directory (that is, total of all descendants).
+:Type: Boolean
+:Default: ``true``
+
+``client_max_inline_size``
+
+:Description: Set the maximum size of inlined data stored in a file inode rather than in a separate data object in RADOS. This setting only applies if the ``inline_data`` flag is set on the MDS map.
+:Type: Integer
+:Default: ``4096``
+
+``client_metadata``
+
+:Description: Comma-delimited strings for client metadata sent to each MDS, in addition to the automatically generated version, host name, and other metadata.
+:Type: String
+:Default: ``""`` (no additional metadata)
+
+``client_mount_gid``
+
+:Description: Set the group ID of CephFS mount.
+:Type: Integer
+:Default: ``-1``
+
+``client_mount_timeout``
+
+:Description: Set the timeout for CephFS mount in seconds.
+:Type: Float
+:Default: ``300.0``
+
+``client_mount_uid``
+
+:Description: Set the user ID of CephFS mount.
+:Type: Integer
+:Default: ``-1``
+
+``client_mountpoint``
+
+:Description: Directory to mount on the CephFS file system. An alternative to the ``-r`` option of the ``ceph-fuse`` command.
+:Type: String
+:Default: ``"/"``
+
+``client_oc``
+
+:Description: Enable object caching.
+:Type: Boolean
+:Default: ``true``
+
+``client_oc_max_dirty``
+
+:Description: Set the maximum number of dirty bytes in the object cache.
+:Type: Integer
+:Default: ``104857600`` (100MB)
+
+``client_oc_max_dirty_age``
+
+:Description: Set the maximum age in seconds of dirty data in the object cache before writeback.
+:Type: Float
+:Default: ``5.0`` (seconds)
+
+``client_oc_max_objects``
+
+:Description: Set the maximum number of objects in the object cache.
+:Type: Integer
+:Default: ``1000``
+
+``client_oc_size``
+
+:Description: Set how many bytes of data the client will cache.
+:Type: Integer
+:Default: ``209715200`` (200 MB)
+
+``client_oc_target_dirty``
+
+:Description: Set the target size of dirty data. We recommend keeping this number low.
+:Type: Integer
+:Default: ``8388608`` (8MB)
+
+``client_permissions``
+
+:Description: Check client permissions on all I/O operations.
+:Type: Boolean
+:Default: ``true``
+
+``client_quota``
+
+:Description: Enable client quota checking if set to ``true``.
+:Type: Boolean
+:Default: ``true``
+
+``client_quota_df``
+
+:Description: Report root directory quota for the ``statfs`` operation.
+:Type: Boolean
+:Default: ``true``
+
+``client_readahead_max_bytes``
+
+:Description: Set the maximum number of bytes that the kernel reads ahead for future read operations. Overridden by the ``client_readahead_max_periods`` setting.
+:Type: Integer
+:Default: ``0`` (unlimited)
+
+``client_readahead_max_periods``
+
+:Description: Set the number of file layout periods (object size * number of stripes) that the kernel reads ahead. Overrides the ``client_readahead_max_bytes`` setting.
+:Type: Integer
+:Default: ``4``
+
+``client_readahead_min``
+
+:Description: Set the minimum number of bytes that the kernel reads ahead.
+:Type: Integer
+:Default: ``131072`` (128KB)
+
+``client_reconnect_stale``
+
+:Description: Automatically reconnect stale sessions.
+:Type: Boolean
+:Default: ``false``
+
+``client_snapdir``
+
+:Description: Set the snapshot directory name.
+:Type: String
+:Default: ``".snap"``
+
+``client_tick_interval``
+
+:Description: Set the interval in seconds between capability renewal and other upkeep.
+:Type: Float
+:Default: ``1.0`` (seconds)
+
+``client_use_random_mds``
+
+:Description: Choose random MDS for each request.
+:Type: Boolean
+:Default: ``false``
+
+``fuse_default_permissions``
+
+:Description: When set to ``false``, the ``ceph-fuse`` utility does its own permissions checking instead of relying on the permissions enforcement in FUSE. Set to ``false`` together with the ``client acl type=posix_acl`` option to enable POSIX ACL.
+:Type: Boolean
+:Default: ``true``
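+
+As a hedged sketch, combining this option with ``client acl type`` (described
+above) in the ``[client]`` section of ``ceph.conf`` enables POSIX ACL support
+in ``ceph-fuse``:
+
+::
+
+ [client]
+     fuse default permissions = false
+     client acl type = posix_acl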
+
+Developer Options
+#################
+
+.. important:: These options are internal. They are listed here only to complete the list of options.
+
+``client_debug_getattr_caps``
+
+:Description: Check if the reply from the MDS contains required capabilities.
+:Type: Boolean
+:Default: ``false``
+
+``client_debug_inject_tick_delay``
+
+:Description: Add artificial delay between client ticks.
+:Type: Integer
+:Default: ``0``
+
+``client_inject_fixed_oldest_tid``
+
+:Description:
+:Type: Boolean
+:Default: ``false``
+
+``client_inject_release_failure``
+
+:Description:
+:Type: Boolean
+:Default: ``false``
+
+``client_trace``
+
+:Description: The path to the trace file for all file operations. The output is designed to be used by the Ceph `synthetic client <../../man/8/ceph-syn>`_.
+:Type: String
+:Default: ``""`` (disabled)
+
diff --git a/src/ceph/doc/cephfs/createfs.rst b/src/ceph/doc/cephfs/createfs.rst
new file mode 100644
index 0000000..005ede8
--- /dev/null
+++ b/src/ceph/doc/cephfs/createfs.rst
@@ -0,0 +1,62 @@
+========================
+Create a Ceph filesystem
+========================
+
+.. tip::
+
+ The ``ceph fs new`` command was introduced in Ceph 0.84. Prior to this release,
+ no manual steps were required to create a filesystem, and pools named ``data`` and
+ ``metadata`` existed by default.
+
+ The Ceph command line now includes commands for creating and removing filesystems,
+ but at present only one filesystem may exist at a time.
+
+A Ceph filesystem requires at least two RADOS pools, one for data and one for metadata.
+When configuring these pools, you might consider:
+
+- Using a higher replication level for the metadata pool, as any data
+ loss in this pool can render the whole filesystem inaccessible.
+- Using lower-latency storage such as SSDs for the metadata pool, as this
+ will directly affect the observed latency of filesystem operations
+ on clients.
+
+Refer to :doc:`/rados/operations/pools` to learn more about managing pools. For
+example, to create two pools with default settings for use with a filesystem, you
+might run the following commands:
+
+.. code:: bash
+
+ $ ceph osd pool create cephfs_data <pg_num>
+ $ ceph osd pool create cephfs_metadata <pg_num>
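+
+If you want a higher replication level for the metadata pool, as suggested
+above, you could then adjust it explicitly (the replica count here is only
+illustrative):
+
+.. code:: bash
+
+ $ ceph osd pool set cephfs_metadata size 3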
+
+Once the pools are created, you may enable the filesystem using the ``fs new`` command:
+
+.. code:: bash
+
+ $ ceph fs new <fs_name> <metadata> <data>
+
+For example:
+
+.. code:: bash
+
+ $ ceph fs new cephfs cephfs_metadata cephfs_data
+ $ ceph fs ls
+ name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
+
+Once a filesystem has been created, your MDS(s) will be able to enter
+an *active* state. For example, in a single MDS system:
+
+.. code:: bash
+
+ $ ceph mds stat
+ e5: 1/1/1 up {0=a=up:active}
+
+Once the filesystem is created and the MDS is active, you are ready to mount
+the filesystem. If you have created more than one filesystem, you will
+choose which to use when mounting.
+
+ - `Mount CephFS`_
+ - `Mount CephFS as FUSE`_
+
+.. _Mount CephFS: ../../cephfs/kernel
+.. _Mount CephFS as FUSE: ../../cephfs/fuse
diff --git a/src/ceph/doc/cephfs/dirfrags.rst b/src/ceph/doc/cephfs/dirfrags.rst
new file mode 100644
index 0000000..717553f
--- /dev/null
+++ b/src/ceph/doc/cephfs/dirfrags.rst
@@ -0,0 +1,100 @@
+
+===================================
+Configuring Directory fragmentation
+===================================
+
+In CephFS, directories are *fragmented* when they become very large
+or very busy. This splits up the metadata so that it can be shared
+between multiple MDS daemons, and between multiple objects in the
+metadata pool.
+
+In normal operation, directory fragmentation is invisible to
+users and administrators, and all the configuration settings mentioned
+here should be left at their default values.
+
+While directory fragmentation enables CephFS to handle very large
+numbers of entries in a single directory, application programmers should
+remain conservative about creating very large directories, as they still
+have a resource cost in situations such as a CephFS client listing
+the directory, where all the fragments must be loaded at once.
+
+All directories are initially created as a single fragment. This fragment
+may be *split* to divide up the directory into more fragments, and these
+fragments may be *merged* to reduce the number of fragments in the directory.
+
+Splitting and merging
+=====================
+
+An MDS will only consider doing splits and merges if the ``mds_bal_frag``
+setting is true in the MDS's configuration file, and the ``allow_dirfrags``
+setting is true in the filesystem map (set on the mons). These settings
+are both true by default since the *Luminous* (12.2.x) release of Ceph.
+
+When an MDS identifies a directory fragment to be split, it does not
+do the split immediately. Because splitting interrupts metadata IO,
+a short delay is used to allow short bursts of client IO to complete
+before the split begins. This delay is configured with
+``mds_bal_fragment_interval``, which defaults to 5 seconds.
+
+When the split is done, the directory fragment is broken up into
+a power of two number of new fragments. The number of new
+fragments is given by two to the power ``mds_bal_split_bits``, i.e.
+if ``mds_bal_split_bits`` is 2, then four new fragments will be
+created. The default setting is 3, i.e. splits create 8 new fragments.
+
+The criteria for initiating a split or a merge are described in the
+following sections.
+
+Size thresholds
+===============
+
+A directory fragment is eligible for splitting when its size exceeds
+``mds_bal_split_size`` (default 10000). Ordinarily this split is
+delayed by ``mds_bal_fragment_interval``, but if the fragment size
+exceeds ``mds_bal_fragment_fast_factor`` times the split size,
+the split will happen immediately (holding up any client metadata
+IO on the directory).
+
+``mds_bal_fragment_size_max`` is the hard limit on the size of
+directory fragments. If it is reached, clients will receive
+ENOSPC errors if they try to create files in the fragment. On
+a properly configured system, this limit should never be reached on
+ordinary directories, as they will have split long before. By default,
+this is set to 10 times the split size, giving a dirfrag size limit of
+100000. Increasing this limit may lead to oversized directory fragment
+objects in the metadata pool, which the OSDs may not be able to handle.
+
+A directory fragment is eligible for merging when its size is less
+than ``mds_bal_merge_size``. There is no merge equivalent of the
+"fast splitting" explained above: fast splitting exists to avoid
+creating oversized directory fragments, and there is no equivalent issue
+to avoid when merging. The default merge size is 50.
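+
+Should you ever need to tune these settings (normally you should not), they
+are ordinary MDS options; the following ``[mds]`` section of ``ceph.conf`` is
+a sketch that simply restates the defaults described above:
+
+::
+
+ [mds]
+     mds bal split size = 10000
+     mds bal merge size = 50
+     mds bal split bits = 3
+     mds bal fragment size max = 100000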
+
+Activity thresholds
+===================
+
+In addition to splitting fragments based
+on their size, the MDS may split directory fragments if their
+activity exceeds a threshold.
+
+The MDS maintains separate time-decaying load counters for read and write
+operations on directory fragments. The decaying load counters have an
+exponential decay based on the ``mds_decay_halflife`` setting.
+
+On writes, the write counter is
+incremented, and compared with ``mds_bal_split_wr``, triggering a
+split if the threshold is exceeded. Write operations include metadata IO
+such as renames, unlinks and creations.
+
+The ``mds_bal_split_rd`` threshold is applied based on the read operation
+load counter, which tracks readdir operations.
+
+By default, the read threshold is 25000 and the write threshold is
+10000, i.e. 2.5x as many reads as writes would be required to trigger
+a split.
+
+After fragments are split due to the activity thresholds, they are only
+merged based on the size threshold (``mds_bal_merge_size``), so
+a spike in activity may cause a directory to stay fragmented
+forever unless some entries are unlinked.
+
diff --git a/src/ceph/doc/cephfs/disaster-recovery.rst b/src/ceph/doc/cephfs/disaster-recovery.rst
new file mode 100644
index 0000000..f47bd79
--- /dev/null
+++ b/src/ceph/doc/cephfs/disaster-recovery.rst
@@ -0,0 +1,280 @@
+
+Disaster recovery
+=================
+
+.. danger::
+
+ The notes in this section are aimed at experts, making a best effort
+ to recover what they can from damaged filesystems. These steps
+ have the potential to make things worse as well as better. If you
+ are unsure, do not proceed.
+
+
+Journal export
+--------------
+
+Before attempting dangerous operations, make a copy of the journal like so:
+
+::
+
+ cephfs-journal-tool journal export backup.bin
+
+Note that this command may not always work if the journal is badly corrupted,
+in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
+
+
+Dentry recovery from journal
+----------------------------
+
+If a journal is damaged or for any reason an MDS is incapable of replaying it,
+attempt to recover what file metadata we can like so:
+
+::
+
+ cephfs-journal-tool event recover_dentries summary
+
+This command by default acts on MDS rank 0; pass ``--rank=<n>`` to operate on other ranks.
+
+This command will write any inodes/dentries recoverable from the journal
+into the backing store, if these inodes/dentries are higher-versioned
+than the previous contents of the backing store. If any regions of the journal
+are missing/damaged, they will be skipped.
+
+Note that in addition to writing out dentries and inodes, this command will update
+the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
+are now in use. In simple cases, this will result in an entirely valid backing
+store state.
+
+.. warning::
+
+ The resulting state of the backing store is not guaranteed to be self-consistent,
+ and an online MDS scrub will be required afterwards. The journal contents
+ will not be modified by this command, you should truncate the journal
+ separately after recovering what you can.
+
+Journal truncation
+------------------
+
+If the journal is corrupt or MDSs cannot replay it for any reason, you can
+truncate it like so:
+
+::
+
+ cephfs-journal-tool journal reset
+
+.. warning::
+
+ Resetting the journal *will* lose metadata unless you have extracted
+ it by other means such as ``recover_dentries``. It is likely to leave
+ some orphaned objects in the data pool. It may result in re-allocation
+ of already-written inodes, such that permissions rules could be violated.
+
+MDS table wipes
+---------------
+
+After the journal has been reset, it may no longer be consistent with respect
+to the contents of the MDS tables (InoTable, SessionMap, SnapServer).
+
+To reset the SessionMap (erase all sessions), use:
+
+::
+
+ cephfs-table-tool all reset session
+
+This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
+rank to operate on that rank only.
+
+The session table is the table most likely to need resetting, but if you know you
+also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
+
+MDS map reset
+-------------
+
+Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
+is somewhat recovered, it may be necessary to update the MDS map to reflect
+the contents of the metadata pool. Use the following command to reset the MDS
+map to a single MDS:
+
+::
+
+ ceph fs reset <fs name> --yes-i-really-mean-it
+
+Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored:
+as a result, this may cause data loss.
+
+One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
+key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
+that it would overwrite any existing root inode on disk and orphan any existing files. In
+contrast, the 'reset' command will leave rank 0 in 'active' state such that the next MDS
+daemon to claim the rank will go ahead and use the existing in-RADOS metadata.
+
+Recovery from missing metadata objects
+--------------------------------------
+
+Depending on what objects are missing or corrupt, you may need to
+run various commands to regenerate default versions of the
+objects.
+
+::
+
+ # Session table
+ cephfs-table-tool 0 reset session
+ # SnapServer
+ cephfs-table-tool 0 reset snap
+ # InoTable
+ cephfs-table-tool 0 reset inode
+ # Journal
+ cephfs-journal-tool --rank=0 journal reset
+ # Root inodes ("/" and MDS directory)
+ cephfs-data-scan init
+
+Finally, you can regenerate metadata objects for missing files
+and directories based on the contents of a data pool. This is
+a three-phase process. First, scanning *all* objects to calculate
+size and mtime metadata for inodes. Second, scanning the first
+object from every file to collect this metadata and inject it into
+the metadata pool. Third, checking inode linkages and fixing found
+errors.
+
+::
+
+ cephfs-data-scan scan_extents <data pool>
+ cephfs-data-scan scan_inodes <data pool>
+ cephfs-data-scan scan_links
+
+'scan_extents' and 'scan_inodes' commands may take a *very long* time
+if there are many files or very large files in the data pool.
+
+To accelerate the process, run multiple instances of the tool.
+
+Decide on a number of workers, and pass each worker a number within
+the range 0-(worker_m - 1).
+
+The example below shows how to run 4 workers simultaneously:
+
+::
+
+ # Worker 0
+ cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
+ # Worker 1
+ cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
+ # Worker 2
+ cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
+ # Worker 3
+ cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>
+
+ # Worker 0
+ cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
+ # Worker 1
+ cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
+ # Worker 2
+ cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
+ # Worker 3
+ cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>
+
+It is **important** to ensure that all workers have completed the
+scan_extents phase before any workers enter the scan_inodes phase.
+
+After completing the metadata recovery, you may want to run a cleanup
+operation to delete ancillary data generated during recovery.
+
+::
+
+ cephfs-data-scan cleanup <data pool>
+
+Finding files affected by lost data PGs
+---------------------------------------
+
+Losing a data PG may affect many files. Files are split into many objects,
+so identifying which files are affected by loss of particular PGs requires
+a full scan over all object IDs that may exist within the size of a file.
+This type of scan may be useful for identifying which files require
+restoring from a backup.
+
+.. danger::
+
+ This command does not repair any metadata, so when restoring files in
+ this case you must *remove* the damaged file, and replace it in order
+ to have a fresh inode. Do not overwrite damaged files in place.
+
+If you know that objects have been lost from PGs, use the ``pg_files``
+subcommand to scan for files that may have been damaged as a result:
+
+::
+
+ cephfs-data-scan pg_files <path> <pg id> [<pg id>...]
+
+For example, if you have lost data from PGs 1.4 and 4.5, and you would like
+to know which files under /home/bob might have been damaged:
+
+::
+
+ cephfs-data-scan pg_files /home/bob 1.4 4.5
+
+The output will be a list of paths to potentially damaged files, one
+per line.
+
+Note that this command acts as a normal CephFS client to find all the
+files in the filesystem and read their layouts, so the MDS must be
+up and running.
+
+Using an alternate metadata pool for recovery
+---------------------------------------------
+
+.. warning::
+
+ There has not been extensive testing of this procedure. It should be
+ undertaken with great care.
+
+If an existing filesystem is damaged and inoperative, it is possible to create
+a fresh metadata pool and attempt to reconstruct the filesystem metadata
+into this new pool, leaving the old metadata in place. This could be used to
+make a safer attempt at recovery since the existing metadata pool would not be
+overwritten.
+
+.. caution::
+
+ During this process, multiple metadata pools will contain data referring to
+ the same data pool. Extreme caution must be exercised to avoid changing the
+ data pool contents while this is the case. Once recovery is complete, the
+ damaged metadata pool should be deleted.
+
+To begin this process, first create the fresh metadata pool and initialize
+it with empty file system data structures:
+
+::
+
+ ceph fs flag set enable_multiple true --yes-i-really-mean-it
+ ceph osd pool create recovery <pg-num> replicated <crush-ruleset-name>
+ ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
+ cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
+ ceph fs reset recovery-fs --yes-i-really-mean-it
+ cephfs-table-tool recovery-fs:all reset session
+ cephfs-table-tool recovery-fs:all reset snap
+ cephfs-table-tool recovery-fs:all reset inode
+
+Next, run the recovery toolset using the --alternate-pool argument to output
+results to the alternate pool:
+
+::
+
+ cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original filesystem name> <original data pool name>
+ cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original filesystem name> --force-corrupt --force-init <original data pool name>
+ cephfs-data-scan scan_links --filesystem recovery-fs
+
+If the damaged filesystem contains dirty journal data, it may be recovered next
+with:
+
+::
+
+ cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
+ cephfs-journal-tool --rank recovery-fs:0 journal reset --force
+
+After recovery, some recovered directories will have incorrect statistics.
+Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are set
+to false (the default) to prevent the MDS from checking the statistics, then
+run a forward scrub to repair them. Ensure you have an MDS running and issue:
+
+::
+
+ ceph daemon mds.a scrub_path / recursive repair
diff --git a/src/ceph/doc/cephfs/eviction.rst b/src/ceph/doc/cephfs/eviction.rst
new file mode 100644
index 0000000..f2f6778
--- /dev/null
+++ b/src/ceph/doc/cephfs/eviction.rst
@@ -0,0 +1,190 @@
+
+===============================
+Ceph filesystem client eviction
+===============================
+
+When a filesystem client is unresponsive or otherwise misbehaving, it
+may be necessary to forcibly terminate its access to the filesystem. This
+process is called *eviction*.
+
+Evicting a CephFS client prevents it from communicating further with MDS
+daemons and OSD daemons. If a client was doing buffered IO to the filesystem,
+any un-flushed data will be lost.
+
+Clients may either be evicted automatically (if they fail to communicate
+promptly with the MDS), or manually (by the system administrator).
+
+The client eviction process applies to clients of all kinds: this includes
+FUSE mounts, kernel mounts, nfs-ganesha gateways, and any process using
+libcephfs.
+
+Automatic client eviction
+=========================
+
+There are two situations in which a client may be evicted automatically:
+
+On an active MDS daemon, if a client has not communicated with the MDS for
+over ``mds_session_autoclose`` seconds (300 seconds by default), then it
+will be evicted automatically.
+
+During MDS startup (including on failover), the MDS passes through a
+state called ``reconnect``. During this state, it waits for all the
+clients to connect to the new MDS daemon. If any clients fail to do
+so within the time window (``mds_reconnect_timeout``, 45 seconds by default)
+then they will be evicted.
+
+A warning message is sent to the cluster log if either of these situations
+arises.
+
+Manual client eviction
+======================
+
+Sometimes, the administrator may want to evict a client manually. This
+could happen if a client has died and the administrator does not
+want to wait for its session to time out, or it could happen if
+a client is misbehaving and the administrator does not have access to
+the client node to unmount it.
+
+It is useful to inspect the list of clients first:
+
+::
+
+ ceph tell mds.0 client ls
+
+ [
+ {
+ "id": 4305,
+ "num_leases": 0,
+ "num_caps": 3,
+ "state": "open",
+ "replay_requests": 0,
+ "completed_requests": 0,
+ "reconnecting": false,
+ "inst": "client.4305 172.21.9.34:0/422650892",
+ "client_metadata": {
+ "ceph_sha1": "ae81e49d369875ac8b569ff3e3c456a31b8f3af5",
+ "ceph_version": "ceph version 12.0.0-1934-gae81e49 (ae81e49d369875ac8b569ff3e3c456a31b8f3af5)",
+ "entity_id": "0",
+ "hostname": "senta04",
+ "mount_point": "/tmp/tmpcMpF1b/mnt.0",
+ "pid": "29377",
+ "root": "/"
+ }
+ }
+ ]
+
+
+
+Once you have identified the client you want to evict, you can
+do that using its unique ID, or various other attributes to identify it:
+
+::
+
+ # These all work
+ ceph tell mds.0 client evict id=4305
+ ceph tell mds.0 client evict client_metadata.=4305
+
+
+Advanced: Un-blacklisting a client
+==================================
+
+Ordinarily, a blacklisted client may not reconnect to the servers: it
+must be unmounted and then mounted anew.
+
+However, in some situations it may be useful to permit a client that
+was evicted to attempt to reconnect.
+
+Because CephFS uses the RADOS OSD blacklist to control client eviction,
+CephFS clients can be permitted to reconnect by removing them from
+the blacklist:
+
+::
+
+ ceph osd blacklist ls
+ # ... identify the address of the client ...
+ ceph osd blacklist rm <address>
+
+Doing this may put data integrity at risk if other clients have accessed
+files that the blacklisted client was doing buffered IO to. It is also not
+guaranteed to result in a fully functional client -- the best way to get
+a fully healthy client back after an eviction is to unmount the client
+and do a fresh mount.
+
+If you are trying to reconnect clients in this way, you may also
+find it useful to set ``client_reconnect_stale`` to true in the
+FUSE client, to prompt the client to try to reconnect.
+
+Advanced: Configuring blacklisting
+==================================
+
+If you are experiencing frequent client evictions, due to slow
+client hosts or an unreliable network, and you cannot fix the underlying
+issue, then you may want to ask the MDS to be less strict.
+
+It is possible to respond to slow clients by simply dropping their
+MDS sessions, but permit them to re-open sessions and permit them
+to continue talking to OSDs. To enable this mode, set
+``mds_session_blacklist_on_timeout`` to false on your MDS nodes.
+
+For the equivalent behaviour on manual evictions, set
+``mds_session_blacklist_on_evict`` to false.
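+
+A sketch of what relaxing both behaviours would look like in the ``[mds]``
+section of ``ceph.conf`` (only do this if you accept the trade-offs described
+above):
+
+::
+
+ [mds]
+     mds session blacklist on timeout = false
+     mds session blacklist on evict = false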
+
+Note that if blacklisting is disabled, then evicting a client will
+only have an effect on the MDS you send the command to. On a system
+with multiple active MDS daemons, you would need to send an
+eviction command to each active daemon. When blacklisting is enabled
+(the default), sending an eviction command to just a single
+MDS is sufficient, because the blacklist propagates it to the others.
+
+Advanced options
+================
+
+``mds_blacklist_interval`` - this setting controls how many seconds
+entries will remain in the blacklist for.
+
+
+.. _background_blacklisting_and_osd_epoch_barrier:
+
+Background: Blacklisting and OSD epoch barrier
+==============================================
+
+After a client is blacklisted, it is necessary to make sure that
+other clients and MDS daemons have the latest OSDMap (including
+the blacklist entry) before they try to access any data objects
+that the blacklisted client might have been accessing.
+
+This is ensured using an internal "osdmap epoch barrier" mechanism.
+
+The purpose of the barrier is to ensure that when we hand out any
+capabilities which might allow touching the same RADOS objects, the
+clients we hand out the capabilities to must have a sufficiently recent
+OSD map to not race with cancelled operations (from ENOSPC) or
+blacklisted clients (from evictions).
+
+More specifically, the cases where an epoch barrier is set are:
+
+ * Client eviction (where the client is blacklisted and other clients
+ must wait for a post-blacklist epoch to touch the same objects).
+ * OSD map full flag handling in the client (where the client may
+ cancel some OSD ops from a pre-full epoch, so other clients must
+ wait until the full epoch or later before touching the same objects).
+ * MDS startup, because we don't persist the barrier epoch, so must
+ assume that latest OSD map is always required after a restart.
+
+Note that this is a global value for simplicity. We could maintain this on
+a per-inode basis. But we don't, because:
+
+ * It would be more complicated.
+ * It would use an extra 4 bytes of memory for every inode.
+ * It would not be much more efficient as almost always everyone has the latest.
+ OSD map anyway, in most cases everyone will breeze through this barrier
+ rather than waiting.
+ * This barrier is done in very rare cases, so any benefit from per-inode
+ granularity would only very rarely be seen.
+
+The epoch barrier is transmitted along with all capability messages, and
+instructs the receiver of the message to avoid sending any more RADOS
+operations to OSDs until it has seen this OSD epoch. This mainly applies
+to clients (doing their data writes directly to files), but also applies
+to the MDS because things like file size probing and file deletion are
+done directly from the MDS.
diff --git a/src/ceph/doc/cephfs/experimental-features.rst b/src/ceph/doc/cephfs/experimental-features.rst
new file mode 100644
index 0000000..1dc781a
--- /dev/null
+++ b/src/ceph/doc/cephfs/experimental-features.rst
@@ -0,0 +1,107 @@
+
+Experimental Features
+=====================
+
+CephFS includes a number of experimental features which are not fully stabilized
+or qualified for users to turn on in real deployments. We generally do our best
+to clearly demarcate these and fence them off so they cannot be used by mistake.
+
+Some of these features are closer to being done than others, though. We describe
+each of them with an approximation of how risky they are and briefly describe
+what is required to enable them. Note that doing so will *irrevocably* flag the maps
+in the monitor as having once enabled the feature, in order to aid debugging and
+support processes.
+
+Inline data
+-----------
+By default, all CephFS file data is stored in RADOS objects. The inline data
+feature enables small files (generally <2KB) to be stored in the inode
+and served out of the MDS. This may improve small-file performance but increases
+load on the MDS. It is not sufficiently tested to be supported at this time,
+although failures within it are unlikely to make non-inlined data inaccessible.
+
+Inline data has always been off by default and requires setting
+the "inline_data" flag.
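+
+For example, a sketch of enabling it on a filesystem named ``cephfs``
+(the exact invocation may vary between releases)::
+
+    ceph fs set cephfs inline_data true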
+
+
+
+Mantle: Programmable Metadata Load Balancer
+-------------------------------------------
+
+Mantle is a programmable metadata balancer built into the MDS. The idea is to
+protect the mechanisms for balancing load (migration, replication,
+fragmentation) but stub out the balancing policies using Lua. For details, see
+:doc:`/cephfs/mantle`.
+
+Snapshots
+---------
+Like multiple active MDSes, CephFS is designed from the ground up to support
+snapshotting of arbitrary directories. There are no known bugs at the time of
+writing, but there is insufficient testing to provide stability guarantees and
+every expansion of testing has generally revealed new issues. If you do enable
+snapshots and experience failure, manual intervention will be needed.
+
+Snapshots are known not to work properly with multiple filesystems (below) in
+some cases. Specifically, if you share a pool for multiple FSes and delete
+a snapshot in one FS, expect to lose snapshotted file data in any other FS using
+snapshots. See the :doc:`/dev/cephfs-snapshots` page for more information.
+
+Snapshots are known not to work with multi-MDS filesystems.
+
+Snapshotting was blocked off with the "allow_new_snaps" flag prior to Firefly.
+
+Multiple filesystems within a Ceph cluster
+------------------------------------------
+Code was merged prior to the Jewel release which enables administrators
+to create multiple independent CephFS filesystems within a single Ceph cluster.
+These independent filesystems have their own set of active MDSes, cluster maps,
+and data. But the feature required extensive changes to data structures which
+are not yet fully qualified, and has security implications which are not all
+apparent nor resolved.
+
+There are no known bugs, but any failures which do result from having multiple
+active filesystems in your cluster will require manual intervention and, so far,
+will not have been experienced by anybody else -- knowledgeable help will be
+extremely limited. You also probably do not have the security or isolation
+guarantees you want or think you have upon doing so.
+
+Note that snapshots and multiple filesystems are *not* tested in combination
+and may not work together; see above.
+
+Multiple filesystems were available starting in the Jewel release candidates
+but were protected behind the "enable_multiple" flag before the final release.
+
+
+Previously experimental features
+================================
+
+Directory Fragmentation
+-----------------------
+
+Directory fragmentation was considered experimental prior to the *Luminous*
+release (12.2.x). It is now enabled by default on new filesystems. To enable
+directory fragmentation on filesystems created with older versions of Ceph, set
+the ``allow_dirfrags`` flag on the filesystem:
+
+::
+
+ ceph fs set <filesystem name> allow_dirfrags true
+
+Multiple active metadata servers
+--------------------------------
+
+Prior to the *Luminous* (12.2.x) release, running multiple active metadata
+servers within a single filesystem was considered experimental. Creating
+multiple active metadata servers is now permitted by default on new
+filesystems.
+
+Filesystems created with older versions of Ceph still require explicitly
+enabling multiple active metadata servers as follows:
+
+::
+
+ ceph fs set <filesystem name> allow_multimds true
+
+Note that the default size of the active mds cluster (``max_mds``) is
+still set to 1 initially.
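+
+To actually run more than one active daemon, raise ``max_mds``, for example::
+
+    ceph fs set <filesystem name> max_mds 2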
+
diff --git a/src/ceph/doc/cephfs/file-layouts.rst b/src/ceph/doc/cephfs/file-layouts.rst
new file mode 100644
index 0000000..4124e1e
--- /dev/null
+++ b/src/ceph/doc/cephfs/file-layouts.rst
@@ -0,0 +1,215 @@
+
+File layouts
+============
+
+The layout of a file controls how its contents are mapped to Ceph RADOS objects. You can
+read and write a file's layout using *virtual extended attributes* or xattrs.
+
+The name of the layout xattrs depends on whether a file is a regular file or a directory. Regular
+files' layout xattrs are called ``ceph.file.layout``, whereas directories' layout xattrs are called
+``ceph.dir.layout``. Where subsequent examples refer to ``ceph.file.layout``, substitute ``dir`` as appropriate
+when dealing with directories.
+
+.. tip::
+
+ Your Linux distribution may not ship with commands for manipulating xattrs
+ by default; the required package is usually called ``attr``.
+
+Layout fields
+-------------
+
+pool
+ String, giving ID or name. Which RADOS pool a file's data objects will be stored in.
+
+pool_namespace
+ String. Within the data pool, which RADOS namespace the objects will
+ be written to. Empty by default (i.e. default namespace).
+
+stripe_unit
+ Integer in bytes. The size (in bytes) of a block of data used in the RAID 0 distribution of a file. All stripe units for a file have equal size. The last stripe unit is typically incomplete–i.e. it represents the data at the end of the file as well as unused “space” beyond it up to the end of the fixed stripe unit size.
+
+stripe_count
+ Integer. The number of consecutive stripe units that constitute a RAID 0 “stripe” of file data.
+
+object_size
+ Integer in bytes. File data is chunked into RADOS objects of this size.
+
+.. tip::
+
+ RADOS enforces a configurable limit on object sizes: if you increase CephFS
+ object sizes beyond that limit then writes may not succeed. The OSD
+ setting is ``osd_max_object_size``, which is 128MB by default.
+ Very large RADOS objects may prevent smooth operation of the cluster,
+ so increasing the object size limit past the default is not recommended.
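+
+For example, a sketch of checking the current limit on a running OSD via its
+admin socket (the daemon id ``osd.0`` is only an illustration):
+
+.. code-block:: bash
+
+    ceph daemon osd.0 config get osd_max_object_size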
+
+Reading layouts with ``getfattr``
+---------------------------------
+
+Read the layout information as a single string:
+
+.. code-block:: bash
+
+ $ touch file
+ $ getfattr -n ceph.file.layout file
+ # file: file
+ ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"
+
+Read individual layout fields:
+
+.. code-block:: bash
+
+ $ getfattr -n ceph.file.layout.pool file
+ # file: file
+ ceph.file.layout.pool="cephfs_data"
+ $ getfattr -n ceph.file.layout.stripe_unit file
+ # file: file
+ ceph.file.layout.stripe_unit="4194304"
+ $ getfattr -n ceph.file.layout.stripe_count file
+ # file: file
+ ceph.file.layout.stripe_count="1"
+ $ getfattr -n ceph.file.layout.object_size file
+ # file: file
+ ceph.file.layout.object_size="4194304"
+
+.. note::
+
+ When reading layouts, the pool will usually be indicated by name. However, in
+ rare cases when pools have only just been created, the ID may be output instead.
+
+Directories do not have an explicit layout until it is customized. Attempts to read
+the layout will fail if it has never been modified: this indicates that the layout
+of the next ancestor directory with an explicit layout will be used.
+
+.. code-block:: bash
+
+ $ mkdir dir
+ $ getfattr -n ceph.dir.layout dir
+ dir: ceph.dir.layout: No such attribute
+ $ setfattr -n ceph.dir.layout.stripe_count -v 2 dir
+ $ getfattr -n ceph.dir.layout dir
+ # file: dir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+
+Writing layouts with ``setfattr``
+---------------------------------
+
+Layout fields are modified using ``setfattr``:
+
+.. code-block:: bash
+
+ $ ceph osd lspools
+ 0 rbd,1 cephfs_data,2 cephfs_metadata,
+
+ $ setfattr -n ceph.file.layout.stripe_unit -v 1048576 file2
+ $ setfattr -n ceph.file.layout.stripe_count -v 8 file2
+ $ setfattr -n ceph.file.layout.object_size -v 10485760 file2
+ $ setfattr -n ceph.file.layout.pool -v 1 file2 # Setting pool by ID
+ $ setfattr -n ceph.file.layout.pool -v cephfs_data file2 # Setting pool by name
+
+.. note::
+
+ When the layout fields of a file are modified using ``setfattr``, the file must be empty or an error will occur.
+
+.. code-block:: bash
+
+ # touch an empty file
+ $ touch file1
+ # modify layout field successfully
+ $ setfattr -n ceph.file.layout.stripe_count -v 3 file1
+
+ # write something to file1
+ $ echo "hello world" > file1
+ $ setfattr -n ceph.file.layout.stripe_count -v 4 file1
+ setfattr: file1: Directory not empty
+
+Clearing layouts
+----------------
+
+If you wish to remove an explicit layout from a directory, to revert to
+inheriting the layout of its ancestor, you can do so:
+
+.. code-block:: bash
+
+ setfattr -x ceph.dir.layout mydir
+
+Similarly, if you have set the ``pool_namespace`` attribute and wish
+to modify the layout to use the default namespace instead:
+
+.. code-block:: bash
+
+ # Create a dir and set a namespace on it
+ mkdir mydir
+ setfattr -n ceph.dir.layout.pool_namespace -v foons mydir
+ getfattr -n ceph.dir.layout mydir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data_a pool_namespace=foons"
+
+ # Clear the namespace from the directory's layout
+ setfattr -x ceph.dir.layout.pool_namespace mydir
+ getfattr -n ceph.dir.layout mydir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data_a"
+
+
+Inheritance of layouts
+----------------------
+
+Files inherit the layout of their parent directory at creation time. However, subsequent
+changes to the parent directory's layout do not affect children.
+
+.. code-block:: bash
+
+ $ getfattr -n ceph.dir.layout dir
+ # file: dir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+ # Demonstrate file1 inheriting its parent's layout
+ $ touch dir/file1
+ $ getfattr -n ceph.file.layout dir/file1
+ # file: dir/file1
+ ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+ # Now update the layout of the directory before creating a second file
+ $ setfattr -n ceph.dir.layout.stripe_count -v 4 dir
+ $ touch dir/file2
+
+ # Demonstrate that file1's layout is unchanged
+ $ getfattr -n ceph.file.layout dir/file1
+ # file: dir/file1
+ ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
+
+ # ...while file2 has the parent directory's new layout
+ $ getfattr -n ceph.file.layout dir/file2
+ # file: dir/file2
+ ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 pool=cephfs_data"
+
+
+Files created as descendants of the directory also inherit the layout, if the intermediate
+directories do not have layouts set:
+
+.. code-block:: bash
+
+ $ getfattr -n ceph.dir.layout dir
+ # file: dir
+ ceph.dir.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 pool=cephfs_data"
+ $ mkdir dir/childdir
+ $ getfattr -n ceph.dir.layout dir/childdir
+ dir/childdir: ceph.dir.layout: No such attribute
+ $ touch dir/childdir/grandchild
+ $ getfattr -n ceph.file.layout dir/childdir/grandchild
+ # file: dir/childdir/grandchild
+ ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 pool=cephfs_data"
+
+
+Adding a data pool to the MDS
+---------------------------------
+
+Before you can use a pool with CephFS you have to add it to the Metadata Servers.
+
+.. code-block:: bash
+
+ $ ceph fs add_data_pool cephfs cephfs_data_ssd
+ # Pool should now show up
+ $ ceph fs ls
+ .... data pools: [cephfs_data cephfs_data_ssd ]
+
+Make sure that your cephx keys allow the client to access this new pool.
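+
+For example, an updated client capability covering both data pools might look
+like the following sketch (the client name and pool names are illustrations;
+``ceph fs authorize`` may be more convenient where available):
+
+.. code-block:: bash
+
+    ceph auth caps client.foo \
+        mon 'allow r' \
+        mds 'allow rw' \
+        osd 'allow rw pool=cephfs_data, allow rw pool=cephfs_data_ssd'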
diff --git a/src/ceph/doc/cephfs/fstab.rst b/src/ceph/doc/cephfs/fstab.rst
new file mode 100644
index 0000000..dc38715
--- /dev/null
+++ b/src/ceph/doc/cephfs/fstab.rst
@@ -0,0 +1,46 @@
+==========================================
+ Mount Ceph FS in your File Systems Table
+==========================================
+
+If you mount Ceph FS in your file systems table, the Ceph file system will mount
+automatically on startup.
+
+Kernel Driver
+=============
+
+To mount Ceph FS in your file systems table as a kernel driver, add the
+following to ``/etc/fstab``::
+
+ {ipaddress}:{port}:/ {mount}/{mountpoint} {filesystem-name} [name=username,secret=secretkey|secretfile=/path/to/secretfile],[{mount.options}]
+
+For example::
+
+ 10.10.10.10:6789:/ /mnt/ceph ceph name=admin,secretfile=/etc/ceph/secret.key,noatime,_netdev 0 2
+
+.. important:: The ``name`` and ``secret`` or ``secretfile`` options are
+ mandatory when you have Ceph authentication running.
+
+See `User Management`_ for details.
+
+
+FUSE
+====
+
+To mount Ceph FS in your file systems table as a filesystem in user space, add the
+following to ``/etc/fstab``::
+
+ #DEVICE PATH TYPE OPTIONS
+ none /mnt/ceph fuse.ceph ceph.id={user-ID}[,ceph.conf={path/to/conf.conf}],_netdev,defaults 0 0
+
+For example::
+
+ none /mnt/ceph fuse.ceph ceph.id=myuser,_netdev,defaults 0 0
+ none /mnt/ceph fuse.ceph ceph.id=myuser,ceph.conf=/etc/ceph/foo.conf,_netdev,defaults 0 0
+
+Ensure you use the ID (e.g., ``admin``, not ``client.admin``). You can pass any valid
+``ceph-fuse`` option to the command line this way.
+
+See `User Management`_ for details.
+
+
+.. _User Management: ../../rados/operations/user-management/
diff --git a/src/ceph/doc/cephfs/full.rst b/src/ceph/doc/cephfs/full.rst
new file mode 100644
index 0000000..cc9eb59
--- /dev/null
+++ b/src/ceph/doc/cephfs/full.rst
@@ -0,0 +1,60 @@
+
+Handling a full Ceph filesystem
+===============================
+
+When a RADOS cluster reaches its ``mon_osd_full_ratio`` (default
+95%) capacity, it is marked with the OSD full flag. This flag causes
+most normal RADOS clients to pause all operations until it is resolved
+(for example by adding more capacity to the cluster).
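+
+For example, you can confirm the condition and inspect current utilization
+with::
+
+    ceph health detail
+    ceph df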
+
+The filesystem has some special handling of the full flag, explained below.
+
+Hammer and later
+----------------
+
+Since the hammer release, a full filesystem will lead to ENOSPC
+results from:
+
+ * Data writes on the client
+ * Metadata operations other than deletes and truncates
+
+Because the full condition may not be encountered until
+data is flushed to disk (sometime after a ``write`` call has already
+returned successfully), the ENOSPC error may not be seen until the application
+calls ``fsync`` or ``fclose`` (or equivalent) on the file handle.
+
+Calling ``fsync`` is guaranteed to reliably indicate whether the data
+made it to disk, and will return an error if it doesn't. ``fclose`` will
+only return an error if buffered data happened to be flushed since
+the last write -- a successful ``fclose`` does not guarantee that the
+data made it to disk, and in a full-space situation, buffered data
+may be discarded after an ``fclose`` if no space is available to persist it.
+
+.. warning::
+ If an application appears to be misbehaving on a full filesystem,
+ check that it is performing ``fsync()`` calls as necessary to ensure
+ data is on disk before proceeding.
+
+Data writes may be cancelled by the client if they are in flight at the
+time the OSD full flag is sent. Clients update the ``osd_epoch_barrier``
+when releasing capabilities on files affected by cancelled operations, in
+order to ensure that these cancelled operations do not interfere with
+subsequent access to the data objects by the MDS or other clients. For
+more on the epoch barrier mechanism, see :ref:`background_blacklisting_and_osd_epoch_barrier`.
+
+Legacy (pre-hammer) behavior
+----------------------------
+
+In versions of Ceph earlier than hammer, the MDS would ignore
+the full status of the RADOS cluster, and any data writes from
+clients would stall until the cluster ceased to be full.
+
+There are two dangerous conditions to watch for with this behaviour:
+
+* If a client had pending writes to a file, then it was not possible
+ for the client to release the file to the MDS for deletion: this could
+ lead to difficulty clearing space on a full filesystem.
+* If clients continued to create a large number of empty files, the
+ resulting metadata writes from the MDS could lead to total exhaustion
+ of space on the OSDs such that no further deletions could be performed.
+
diff --git a/src/ceph/doc/cephfs/fuse.rst b/src/ceph/doc/cephfs/fuse.rst
new file mode 100644
index 0000000..d8c6cdf
--- /dev/null
+++ b/src/ceph/doc/cephfs/fuse.rst
@@ -0,0 +1,52 @@
+=========================
+Mount Ceph FS using FUSE
+=========================
+
+Before mounting a Ceph File System in User Space (FUSE), ensure that the client
+host has a copy of the Ceph configuration file and a keyring with CAPS for the
+Ceph metadata server.
+
+#. From your client host, copy the Ceph configuration file from the monitor host
+ to the ``/etc/ceph`` directory. ::
+
+ sudo mkdir -p /etc/ceph
+ sudo scp {user}@{server-machine}:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
+
+#. From your client host, copy the Ceph keyring from the monitor host to
+ the ``/etc/ceph`` directory. ::
+
+ sudo scp {user}@{server-machine}:/etc/ceph/ceph.keyring /etc/ceph/ceph.keyring
+
+#. Ensure that the Ceph configuration file and the keyring have appropriate
+ permissions set on your client machine (e.g., ``chmod 644``).
+
+For additional details on ``cephx`` configuration, see
+`CEPHX Config Reference`_.
+
+To mount the Ceph file system as a FUSE, you may use the ``ceph-fuse`` command.
+For example::
+
+ sudo mkdir /home/username/cephfs
+ sudo ceph-fuse -m 192.168.0.1:6789 /home/username/cephfs
+
+If you have more than one filesystem, specify which one to mount using
+the ``--client_mds_namespace`` command line argument, or add a
+``client_mds_namespace`` setting to your ``ceph.conf``.
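+
+For example (assuming a second filesystem named ``mycephfs2``; adjust the
+monitor address and mount point for your environment)::
+
+    sudo ceph-fuse -m 192.168.0.1:6789 --client_mds_namespace=mycephfs2 /home/username/cephfs2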
+
+See `ceph-fuse`_ for additional details.
+
+To automate mounting ceph-fuse, you may add an entry to the system fstab_.
+Additionally, ``ceph-fuse@.service`` and ``ceph-fuse.target`` systemd units are
+available. As usual, these unit files declare the default dependencies and
+recommended execution context for ``ceph-fuse``. An example ceph-fuse mount on
+``/mnt`` would be::
+
+ sudo systemctl start ceph-fuse@/mnt.service
+
+A persistent mount point can be set up via::
+
+ sudo systemctl enable ceph-fuse@/mnt.service
+
+.. _ceph-fuse: ../../man/8/ceph-fuse/
+.. _fstab: ./fstab
+.. _CEPHX Config Reference: ../../rados/configuration/auth-config-ref
diff --git a/src/ceph/doc/cephfs/hadoop.rst b/src/ceph/doc/cephfs/hadoop.rst
new file mode 100644
index 0000000..76d26f2
--- /dev/null
+++ b/src/ceph/doc/cephfs/hadoop.rst
@@ -0,0 +1,202 @@
+========================
+Using Hadoop with CephFS
+========================
+
+The Ceph file system can be used as a drop-in replacement for the Hadoop File
+System (HDFS). This page describes the installation and configuration process
+of using Ceph with Hadoop.
+
+Dependencies
+============
+
+* CephFS Java Interface
+* Hadoop CephFS Plugin
+
+.. important:: Currently requires Hadoop 1.1.X stable series
+
+Installation
+============
+
+There are three requirements for using CephFS with Hadoop. First, a running
+Ceph installation is required. The details of setting up a Ceph cluster and
+the file system are beyond the scope of this document. Please refer to the
+Ceph documentation for installing Ceph.
+
+The remaining two requirements are a Hadoop installation, and the Ceph file
+system Java packages, including the Java CephFS Hadoop plugin. The high-level
+steps are to add the dependencies to the Hadoop installation ``CLASSPATH``,
+and configure Hadoop to use the Ceph file system.
+
+CephFS Java Packages
+--------------------
+
+* CephFS Hadoop plugin (`hadoop-cephfs.jar <http://ceph.com/download/hadoop-cephfs.jar>`_)
+
+Adding these dependencies to a Hadoop installation will depend on your
+particular deployment. In general the dependencies must be present on each
+node in the system that will be part of the Hadoop cluster, and must be in the
+``CLASSPATH`` searched for by Hadoop. Typical approaches are to place the
+additional ``jar`` files into the ``hadoop/lib`` directory, or to edit the
+``HADOOP_CLASSPATH`` variable in ``hadoop-env.sh``.
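+
+For example, a sketch of extending ``hadoop-env.sh`` (the jar locations are
+illustrations and depend on your installation)::
+
+    export HADOOP_CLASSPATH=/usr/share/java/libcephfs.jar:/usr/lib/hadoop/lib/hadoop-cephfs.jar:$HADOOP_CLASSPATH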
+
+The native Ceph file system client must be installed on each participating
+node in the Hadoop cluster.
+
+Hadoop Configuration
+====================
+
+This section describes the Hadoop configuration options used to control Ceph.
+These options are intended to be set in the Hadoop configuration file
+`conf/core-site.xml`.
+
++---------------------+--------------------------+----------------------------+
+|Property |Value |Notes |
+| | | |
++=====================+==========================+============================+
+|fs.default.name |Ceph URI |ceph://[monaddr:port]/ |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.conf.file |Local path to ceph.conf |/etc/ceph/ceph.conf |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.conf.options |Comma separated list of |opt1=val1,opt2=val2 |
+| |Ceph configuration | |
+| |key/value pairs | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.root.dir |Mount root directory |Default value: / |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.mon.address |Monitor address |host:port |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.auth.id |Ceph user id |Example: admin |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.auth.keyfile |Ceph key file | |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.auth.keyring |Ceph keyring file | |
+| | | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.object.size |Default file object size |Default value (64MB): |
+| |in bytes |67108864 |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.data.pools |List of Ceph data pools |Default value: default Ceph |
+| |for storing file. |pool. |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+|ceph.localize.reads |Allow reading from file |Default value: true |
+| |replica objects | |
+| | | |
+| | | |
++---------------------+--------------------------+----------------------------+
+
+Support For Per-file Custom Replication
+---------------------------------------
+
+The Hadoop file system interface allows users to specify a custom replication
+factor (e.g. 3 copies of each block) when creating a file. However, object
+replication factors in the Ceph file system are controlled on a per-pool
+basis, and by default a Ceph file system will contain only a single
+pre-configured pool. Thus, in order to support per-file replication with
+Hadoop over Ceph, additional storage pools with non-default replication
+factors must be created, and Hadoop must be configured to choose from these
+additional pools.
+
+Additional data pools can be specified using the ``ceph.data.pools``
+configuration option. The value of the option is a comma separated list of
+pool names. The default Ceph pool will be used automatically if this
+configuration option is omitted or the value is empty. For example, the
+following configuration setting will consider the pools ``pool1``, ``pool2``, and
+``pool5`` when selecting a target pool to store a file. ::
+
+ <property>
+ <name>ceph.data.pools</name>
+ <value>pool1,pool2,pool5</value>
+ </property>
+
+Hadoop will not create pools automatically. In order to create a new pool with
+a specific replication factor use the ``ceph osd pool create`` command, and then
+set the ``size`` property on the pool using the ``ceph osd pool set`` command. For
+more information on creating and configuring pools see the `RADOS Pool
+documentation`_.
+
+.. _RADOS Pool documentation: ../../rados/operations/pools
+
+Once a pool has been created and configured, the metadata service must be told
+that the new pool may be used to store file data. A pool is made available
+for storing file system data using the ``ceph fs add_data_pool`` command.
+
+First, create the pool. In this example we create the ``hadoop1`` pool with
+replication factor 1. ::
+
+ ceph osd pool create hadoop1 100
+ ceph osd pool set hadoop1 size 1
+
+Next, determine the pool id. This can be done by examining the output of the
+``ceph osd dump`` command. For example, we can look for the newly created
+``hadoop1`` pool. ::
+
+ ceph osd dump | grep hadoop1
+
+The output should resemble::
+
+ pool 3 'hadoop1' rep size 1 min_size 1 crush_ruleset 0...
+
+where ``3`` is the pool id. Next we will use the pool id reference to register
+the pool as a data pool for storing file system data. ::
+
+ ceph fs add_data_pool cephfs 3
+
+The final step is to configure Hadoop to consider this data pool when
+selecting the target pool for new files. ::
+
+ <property>
+ <name>ceph.data.pools</name>
+ <value>hadoop1</value>
+ </property>
+
+Pool Selection Rules
+~~~~~~~~~~~~~~~~~~~~
+
+The following rules describe how Hadoop chooses a pool given a desired
+replication factor and the set of pools specified using the
+``ceph.data.pools`` configuration option.
+
+1. When no custom pools are specified the default Ceph data pool is used.
+2. A custom pool with the same replication factor as the default Ceph data
+ pool will override the default.
+3. A pool with a replication factor that matches the desired replication will
+ be chosen if it exists.
+4. Otherwise, a pool with at least the desired replication factor will be
+ chosen, or the maximum possible.
+
+Debugging Pool Selection
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Hadoop will produce a log file entry when it cannot determine the replication
+factor of a pool (e.g. it is not configured as a data pool). The log message
+will appear as follows::
+
+ Error looking up replication of pool: <pool name>
+
+Hadoop will also produce a log entry when it wasn't able to select an exact
+match for replication. This log entry will appear as follows::
+
+ selectDataPool path=<path> pool:repl=<name>:<value> wanted=<value>
diff --git a/src/ceph/doc/cephfs/health-messages.rst b/src/ceph/doc/cephfs/health-messages.rst
new file mode 100644
index 0000000..3e68e93
--- /dev/null
+++ b/src/ceph/doc/cephfs/health-messages.rst
@@ -0,0 +1,127 @@
+
+======================
+CephFS health messages
+======================
+
+Cluster health checks
+=====================
+
+The Ceph monitor daemons will generate health messages in response
+to certain states of the filesystem map structure (and the enclosed MDS maps).
+
+Message: mds rank(s) *ranks* have failed
+Description: One or more MDS ranks are not currently assigned to
+an MDS daemon; the cluster will not recover until a suitable replacement
+daemon starts.
+
+Message: mds rank(s) *ranks* are damaged
+Description: One or more MDS ranks has encountered severe damage to
+its stored metadata, and cannot start again until it is repaired.
+
+Message: mds cluster is degraded
+Description: One or more MDS ranks are not currently up and running; clients
+may pause metadata IO until this situation is resolved. This includes
+ranks being failed or damaged, and additionally includes ranks
+which are running on an MDS but have not yet made it to the *active*
+state (e.g. ranks currently in *replay* state).
+
+Message: mds *names* are laggy
+Description: The named MDS daemons have failed to send beacon messages
+to the monitor for at least ``mds_beacon_grace`` (default 15s), while
+they are supposed to send beacon messages every ``mds_beacon_interval``
+(default 4s). The daemons may have crashed. The Ceph monitor will
+automatically replace laggy daemons with standbys if any are available.
+
+Message: insufficient standby daemons available
+Description: One or more file systems are configured to have a certain number
+of standby daemons available (including daemons in standby-replay) but the
+cluster does not have enough standby daemons. The standby daemons not in replay
+count towards any file system (i.e. they may overlap). This warning can be
+configured by setting ``ceph fs set <fs> standby_count_wanted <count>``. Use
+zero for ``count`` to disable.
+
+
+Daemon-reported health checks
+=============================
+
+MDS daemons can identify a variety of unwanted conditions, and
+indicate these to the operator in the output of ``ceph status``.
+These conditions have human readable messages, and additionally
+a unique code starting with MDS_HEALTH which appears in JSON output.
+
+Message: "Behind on trimming..."
+Code: MDS_HEALTH_TRIM
+Description: CephFS maintains a metadata journal that is divided into
+*log segments*. The length of the journal (in number of segments) is controlled
+by the setting ``mds_log_max_segments``, and when the number of segments
+exceeds that setting the MDS starts writing back metadata so that it
+can remove (trim) the oldest segments. If this writeback is happening
+too slowly, or a software bug is preventing trimming, then this health
+message may appear. The threshold for this message to appear is for the
+number of segments to be double ``mds_log_max_segments``.
+
+Message: "Client *name* failing to respond to capability release"
+Code: MDS_HEALTH_CLIENT_LATE_RELEASE, MDS_HEALTH_CLIENT_LATE_RELEASE_MANY
+Description: CephFS clients are issued *capabilities* by the MDS, which
+are like locks. Sometimes, for example when another client needs access,
+the MDS will request clients release their capabilities. If the client
+is unresponsive or buggy, it might fail to do so promptly or fail to do
+so at all. This message appears if a client has taken longer than
+``mds_revoke_cap_timeout`` (default 60s) to comply.
+
+Message: "Client *name* failing to respond to cache pressure"
+Code: MDS_HEALTH_CLIENT_RECALL, MDS_HEALTH_CLIENT_RECALL_MANY
+Description: Clients maintain a metadata cache. Items (such as inodes) in the
+client cache are also pinned in the MDS cache, so when the MDS needs to shrink
+its cache (to stay within ``mds_cache_size`` or ``mds_cache_memory_limit``), it
+sends messages to clients to shrink their caches too. If the client is
+unresponsive or buggy, this can prevent the MDS from properly staying within
+its cache limits and it may eventually run out of memory and crash. This
+message appears if a client has taken more than ``mds_recall_state_timeout``
+(default 60s) to comply.
+
+Message: "Client *name* failing to advance its oldest client/flush tid"
+Code: MDS_HEALTH_CLIENT_OLDEST_TID, MDS_HEALTH_CLIENT_OLDEST_TID_MANY
+Description: The CephFS client-MDS protocol uses a field called the
+*oldest tid* to inform the MDS of which client requests are fully
+complete and may therefore be forgotten about by the MDS. If a buggy
+client is failing to advance this field, then the MDS may be prevented
+from properly cleaning up resources used by client requests. This message
+appears if a client appears to have more than ``max_completed_requests``
+(default 100000) requests that are complete on the MDS side but haven't
+yet been accounted for in the client's *oldest tid* value.
+
+Message: "Metadata damage detected"
+Code: MDS_HEALTH_DAMAGE,
+Description: Corrupt or missing metadata was encountered when reading
+from the metadata pool. This message indicates that the damage was
+sufficiently isolated for the MDS to continue operating, although
+client accesses to the damaged subtree will return IO errors. Use
+the ``damage ls`` admin socket command to get more detail on the damage.
+This message appears as soon as any damage is encountered.
+
+Message: "MDS in read-only mode"
+Code: MDS_HEALTH_READ_ONLY,
+Description: The MDS has gone into readonly mode and will return EROFS
+error codes to client operations that attempt to modify any metadata. The
+MDS will go into readonly mode if it encounters a write error while
+writing to the metadata pool, or if forced to by an administrator using
+the *force_readonly* admin socket command.
+
+Message: "*N* slow requests are blocked"
+Code: MDS_HEALTH_SLOW_REQUEST,
+Description: One or more client requests have not been completed promptly,
+indicating that the MDS is either running very slowly, or that the RADOS
+cluster is not acknowledging journal writes promptly, or that there is a bug.
+Use the ``ops`` admin socket command to list outstanding metadata operations.
+This message appears if any client requests have taken longer than
+``mds_op_complaint_time`` (default 30s).
+
+Message: "Too many inodes in cache"
+Code: MDS_HEALTH_CACHE_OVERSIZED
+Description: The MDS is not succeeding in trimming its cache to comply with the
+limit set by the administrator. If the MDS cache becomes too large, the daemon
+may exhaust available memory and crash. By default, this message appears if
+the actual cache size (in inodes or memory) is at least 50% greater than
+``mds_cache_size`` (default 100000) or ``mds_cache_memory_limit`` (default
+1GB). Modify ``mds_health_cache_threshold`` to set the warning ratio.
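+
+For example, a sketch of raising the warning threshold to 2x the configured
+limit on a running MDS via its admin socket (the daemon name ``mds.a`` is only
+an illustration)::
+
+    ceph daemon mds.a config set mds_health_cache_threshold 2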
diff --git a/src/ceph/doc/cephfs/index.rst b/src/ceph/doc/cephfs/index.rst
new file mode 100644
index 0000000..c63364f
--- /dev/null
+++ b/src/ceph/doc/cephfs/index.rst
@@ -0,0 +1,116 @@
+=================
+ Ceph Filesystem
+=================
+
+The :term:`Ceph Filesystem` (Ceph FS) is a POSIX-compliant filesystem that uses
+a Ceph Storage Cluster to store its data. The Ceph filesystem uses the same Ceph
+Storage Cluster system as Ceph Block Devices, Ceph Object Storage with its S3
+and Swift APIs, or native bindings (librados).
+
+.. note:: If you are evaluating CephFS for the first time, please review
+ the best practices for deployment: :doc:`/cephfs/best-practices`
+
+.. ditaa::
+ +-----------------------+ +------------------------+
+ | | | CephFS FUSE |
+ | | +------------------------+
+ | |
+ | | +------------------------+
+ | CephFS Kernel Object | | CephFS Library |
+ | | +------------------------+
+ | |
+ | | +------------------------+
+ | | | librados |
+ +-----------------------+ +------------------------+
+
+ +---------------+ +---------------+ +---------------+
+ | OSDs | | MDSs | | Monitors |
+ +---------------+ +---------------+ +---------------+
+
+
+Using CephFS
+============
+
+Using the Ceph Filesystem requires at least one :term:`Ceph Metadata Server` in
+your Ceph Storage Cluster.
+
+
+
+.. raw:: html
+
+ <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style>
+ <table cellpadding="10"><colgroup><col width="33%"><col width="33%"><col width="33%"></colgroup><tbody valign="top"><tr><td><h3>Step 1: Metadata Server</h3>
+
+To run the Ceph Filesystem, you must have a running Ceph Storage Cluster with at
+least one :term:`Ceph Metadata Server` running.
+
+
+.. toctree::
+ :maxdepth: 1
+
+ Add/Remove MDS(s) <../../rados/deployment/ceph-deploy-mds>
+ MDS failover and standby configuration <standby>
+ MDS Configuration Settings <mds-config-ref>
+ Client Configuration Settings <client-config-ref>
+ Journaler Configuration <journaler>
+ Manpage ceph-mds <../../man/8/ceph-mds>
+
+.. raw:: html
+
+ </td><td><h3>Step 2: Mount CephFS</h3>
+
+Once you have a healthy Ceph Storage Cluster with at least
+one Ceph Metadata Server, you may create and mount your Ceph Filesystem.
+Ensure that your client has network connectivity and the proper
+authentication keyring.
+
+.. toctree::
+ :maxdepth: 1
+
+ Create CephFS <createfs>
+ Mount CephFS <kernel>
+ Mount CephFS as FUSE <fuse>
+ Mount CephFS in fstab <fstab>
+ Manpage ceph-fuse <../../man/8/ceph-fuse>
+ Manpage mount.ceph <../../man/8/mount.ceph>
+
+
+.. raw:: html
+
+ </td><td><h3>Additional Details</h3>
+
+.. toctree::
+ :maxdepth: 1
+
+ Deployment best practices <best-practices>
+ Administrative commands <administration>
+ POSIX compatibility <posix>
+ Experimental Features <experimental-features>
+ CephFS Quotas <quota>
+ Using Ceph with Hadoop <hadoop>
+ cephfs-journal-tool <cephfs-journal-tool>
+ File layouts <file-layouts>
+ Client eviction <eviction>
+ Handling full filesystems <full>
+ Health messages <health-messages>
+ Troubleshooting <troubleshooting>
+ Disaster recovery <disaster-recovery>
+ Client authentication <client-auth>
+ Upgrading old filesystems <upgrading>
+ Configuring directory fragmentation <dirfrags>
+ Configuring multiple active MDS daemons <multimds>
+
+.. raw:: html
+
+ </td></tr></tbody></table>
+
+For developers
+==============
+
+.. toctree::
+ :maxdepth: 1
+
+ Client's Capabilities <capabilities>
+ libcephfs <../../api/libcephfs-java/>
+ Mantle <mantle>
+
diff --git a/src/ceph/doc/cephfs/journaler.rst b/src/ceph/doc/cephfs/journaler.rst
new file mode 100644
index 0000000..2121532
--- /dev/null
+++ b/src/ceph/doc/cephfs/journaler.rst
@@ -0,0 +1,41 @@
+===========
+ Journaler
+===========
+
+``journaler write head interval``
+
+:Description: How frequently to update the journal head object
+:Type: Integer
+:Required: No
+:Default: ``15``
+
+
+``journaler prefetch periods``
+
+:Description: How many stripe periods to read-ahead on journal replay
+:Type: Integer
+:Required: No
+:Default: ``10``
+
+
+``journaler prezero periods``
+
+:Description: How many stripe periods to zero ahead of write position
+:Type: Integer
+:Required: No
+:Default: ``10``
+
+``journaler batch interval``
+
+:Description: Maximum additional latency in seconds we incur artificially.
+:Type: Double
+:Required: No
+:Default: ``.001``
+
+
+``journaler batch max``
+
+:Description: Maximum bytes we will delay flushing.
+:Type: 64-bit Unsigned Integer
+:Required: No
+:Default: ``0``
diff --git a/src/ceph/doc/cephfs/kernel.rst b/src/ceph/doc/cephfs/kernel.rst
new file mode 100644
index 0000000..eaaacef
--- /dev/null
+++ b/src/ceph/doc/cephfs/kernel.rst
@@ -0,0 +1,37 @@
+======================================
+ Mount Ceph FS with the Kernel Driver
+======================================
+
+To mount the Ceph file system you may use the ``mount`` command if you know the
+monitor host IP address(es), or use the ``mount.ceph`` utility to resolve the
+monitor host name(s) into IP address(es) for you. For example::
+
+ sudo mkdir /mnt/mycephfs
+ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs
+
+To mount the Ceph file system with ``cephx`` authentication enabled, you must
+specify a user name and a secret. ::
+
+ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs -o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
+
+The foregoing usage leaves the secret in the Bash history. A more secure
+approach reads the secret from a file. For example::
+
+ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret
+
+If you have more than one filesystem, specify which one to mount using
+the ``mds_namespace`` option, e.g. ``-o mds_namespace=myfs``.
+
+See `User Management`_ for details on cephx.
+
+To unmount the Ceph file system, you may use the ``umount`` command. For example::
+
+ sudo umount /mnt/mycephfs
+
+.. tip:: Ensure that you are not within the file system directories before
+ executing this command.
+
+See `mount.ceph`_ for details.
+
+.. _mount.ceph: ../../man/8/mount.ceph/
+.. _User Management: ../../rados/operations/user-management/
diff --git a/src/ceph/doc/cephfs/mantle.rst b/src/ceph/doc/cephfs/mantle.rst
new file mode 100644
index 0000000..9be89d6
--- /dev/null
+++ b/src/ceph/doc/cephfs/mantle.rst
@@ -0,0 +1,263 @@
+Mantle
+======
+
+.. warning::
+
+ Mantle is for research and development of metadata balancer algorithms,
+ not for use on production CephFS clusters.
+
+Multiple, active MDSs can migrate directories to balance metadata load. The
+policies for when, where, and how much to migrate are hard-coded into the
+metadata balancing module. Mantle is a programmable metadata balancer built
+into the MDS. The idea is to protect the mechanisms for balancing load
+(migration, replication, fragmentation) but stub out the balancing policies
+using Lua. Mantle is based on [1] but the current implementation does *NOT*
+have the following features from that paper:
+
+1. Balancing API: in the paper, the user fills in when, where, how much, and
+ load calculation policies; currently, Mantle only requires that Lua policies
+ return a table of target loads (e.g., how much load to send to each MDS)
+2. "How much" hook: in the paper, there was a hook that let the user control
+ the fragment selector policy; currently, Mantle does not have this hook
+3. Instantaneous CPU utilization as a metric
+
+[1] Supercomputing '15 Paper:
+http://sc15.supercomputing.org/schedule/event_detail-evid=pap168.html
+
+Quickstart with vstart
+----------------------
+
+.. warning::
+
+ Developing balancers with vstart is difficult because running all daemons
+ and clients on one node can overload the system. Let it run for a while, even
+ though you will likely see a bunch of lost heartbeat and laggy MDS warnings.
+ Most of the time this guide will work but sometimes all MDSs lock up and you
+ cannot actually see them spill. It is much better to run this on a cluster.
+
+As a prerequisite, we assume you have installed `mdtest
+<https://sourceforge.net/projects/mdtest/>`_ or pulled the `Docker image
+<https://hub.docker.com/r/michaelsevilla/mdtest/>`_. We use mdtest because we
+need to generate enough load to get over the MIN_OFFLOAD threshold that is
+arbitrarily set in the balancer. For example, this does not create enough
+metadata load:
+
+::
+
+ while true; do
+ touch "/cephfs/blah-`date`"
+ done
+
+
+Mantle with `vstart.sh`
+~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Start Ceph and tune the logging so we can see migrations happen:
+
+::
+
+ cd build
+ ../src/vstart.sh -n -l
+ for i in a b c; do
+ bin/ceph --admin-daemon out/mds.$i.asok config set debug_ms 0
+ bin/ceph --admin-daemon out/mds.$i.asok config set debug_mds 2
+ bin/ceph --admin-daemon out/mds.$i.asok config set mds_beacon_grace 1500
+ done
+
+
+2. Put the balancer into RADOS:
+
+::
+
+ bin/rados put --pool=cephfs_metadata_a greedyspill.lua ../src/mds/balancers/greedyspill.lua
+
+
+3. Activate Mantle:
+
+::
+
+ bin/ceph fs set cephfs_a max_mds 5
+ bin/ceph fs set cephfs_a balancer greedyspill.lua
+
+
+4. Mount CephFS in another window:
+
+::
+
+ bin/ceph-fuse /cephfs -o allow_other &
+ tail -f out/mds.a.log
+
+
+ Note that if you look at the last MDS (which could be a, b, or c -- it's
+ random), you will see an attempt to index a nil value. This is because the
+ last MDS tries to check the load of its neighbor, which does not exist.
+
+5. Run a simple benchmark. In our case, we use the Docker mdtest image to
+ create load:
+
+::
+
+ for i in 0 1 2; do
+ docker run -d \
+ --name=client$i \
+ -v /cephfs:/cephfs \
+ michaelsevilla/mdtest \
+ -F -C -n 100000 -d "/cephfs/client-test$i"
+ done
+
+
+6. When you are done, you can kill all the clients with:
+
+::
+
+ for i in 0 1 2; do docker rm -f client$i; done
+
+
+Output
+~~~~~~
+
+Looking at the log for the first MDS (could be a, b, or c), we see that
+everyone has no load:
+
+::
+
+ 2016-08-21 06:44:01.763930 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+ 2016-08-21 06:44:01.763966 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+ 2016-08-21 06:44:01.763982 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=1.35 > load=0.0
+ 2016-08-21 06:44:01.764010 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=0.0 hisload=0.0
+ 2016-08-21 06:44:01.764033 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}
+
+
+After the job starts, MDS0 gets about 1953 units of load. The greedy spill
+balancer dictates that half the load goes to your neighbor MDS, so we see that
+Mantle tries to send 1953 load units to MDS1.
+
+::
+
+ 2016-08-21 06:45:21.869994 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=5834.188908912 all.meta_load=1953.3492228857 req_rate=12591.0 queue_len=1075.0 cpu_load_avg=3.05 > load=1953.3492228857
+ 2016-08-21 06:45:21.870017 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
+ 2016-08-21 06:45:21.870027 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=3.05 > load=0.0
+ 2016-08-21 06:45:21.870034 7fd03aaf7700 2 lua.balancer when: migrating! my_load=1953.3492228857 hisload=0.0
+ 2016-08-21 06:45:21.870050 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={0=0,1=976.675,2=0}
+ 2016-08-21 06:45:21.870094 7fd03aaf7700 0 mds.0.bal - exporting [0,0.52287 1.04574] 1030.88 to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
+ 2016-08-21 06:45:21.870151 7fd03aaf7700 0 mds.0.migrator nicely exporting to mds.1 [dir 100000006ab /client-test2/ [2,head] auth pv=33 v=32 cv=32/0 ap=2+3+4 state=1610612802|complete f(v0 m2016-08-21 06:44:20.366935 1=0+1) n(v2 rc2016-08-21 06:44:30.946816 3790=3788+2) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x55d2762fd690]
+
+
+Eventually load moves around:
+
+::
+
+ 2016-08-21 06:47:10.210253 7fd03aaf7700 0 lua.balancer MDS0: < auth.meta_load=415.77414300449 all.meta_load=415.79000078186 req_rate=82813.0 queue_len=0.0 cpu_load_avg=11.97 > load=415.79000078186
+ 2016-08-21 06:47:10.210277 7fd03aaf7700 0 lua.balancer MDS1: < auth.meta_load=228.72023977691 all.meta_load=186.5606496623 req_rate=28580.0 queue_len=0.0 cpu_load_avg=11.97 > load=186.5606496623
+ 2016-08-21 06:47:10.210290 7fd03aaf7700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=1.0 queue_len=0.0 cpu_load_avg=11.97 > load=0.0
+ 2016-08-21 06:47:10.210298 7fd03aaf7700 2 lua.balancer when: not migrating! my_load=415.79000078186 hisload=186.5606496623
+ 2016-08-21 06:47:10.210311 7fd03aaf7700 2 mds.0.bal mantle decided that new targets={}
+
+
+Implementation Details
+----------------------
+
+Most of the implementation is in MDBalancer. Metrics are passed to the balancer
+policies via the Lua stack and a list of loads is returned back to MDBalancer.
+It sits alongside the current balancer implementation and it's enabled with a
+Ceph CLI command ("ceph fs set cephfs balancer mybalancer.lua"). If the Lua policy
+fails (for whatever reason), we fall back to the original metadata load
+balancer. The balancer is stored in the RADOS metadata pool and a string in the
+MDSMap tells the MDSs which balancer to use.
+
+Exposing Metrics to Lua
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Metrics are exposed directly to the Lua code as global variables instead of
+using a well-defined function signature. There is a global "mds" table, where
+each index is an MDS number (e.g., 0) and each value is a dictionary of metrics
+and values. The Lua code can grab metrics using something like this:
+
+::
+
+ mds[0]["queue_len"]
+
+
+This is in contrast to cls-lua in the OSDs, which has well-defined arguments
+(e.g., input/output bufferlists). Exposing the metrics directly makes it easier
+to add new metrics without having to change the API on the Lua side; we want
+the API to grow and shrink as we explore which metrics matter. The downside of
+this approach is that the person programming Lua balancer policies has to look
+at the Ceph source code to see which metrics are exposed. We figure that the
+Mantle developer will be in touch with MDS internals anyway.
+
+The metrics exposed to the Lua policy are the same ones that are already stored
+in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_length,
+cpu_load_avg.
+
+Compile/Execute the Balancer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here we use `lua_pcall` instead of `lua_call` because we want to handle errors
+in the MDBalancer. We do not want the error propagating up the call chain. The
+cls_lua class wants to handle the error itself because it must fail gracefully.
+For Mantle, we don't care if a Lua error crashes our balancer -- in that case,
+we will fall back to the original balancer.
+
+The performance improvement of using `lua_call` over `lua_pcall` would not be
+leveraged here because the balancer is invoked every 10 seconds by default.
+
+Returning Policy Decision to C++
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We force the Lua policy engine to return a table of values, corresponding to
+the amount of load to send to each MDS. These loads are inserted directly into
+the MDBalancer "my_targets" vector. We do not allow the MDS to return a table
+of MDSs and metrics because we want the decision to be completely made on the
+Lua side.
+
+Iterating through tables returned by Lua is done through the stack. In Lua
+jargon: a dummy value is pushed onto the stack and the next iterator replaces
+the top of the stack with a (k, v) pair. After reading each value, pop that
+value but keep the key for the next call to `lua_next`.
+
+Reading from RADOS
+~~~~~~~~~~~~~~~~~~
+
+All MDSs will read balancing code from RADOS when the balancer version changes
+in the MDS Map. The balancer pulls the Lua code from RADOS synchronously. We do
+this with a timeout: if the asynchronous read does not come back within half
+the balancing tick interval the operation is cancelled and a Connection Timeout
+error is returned. By default, the balancing tick interval is 10 seconds, so
+Mantle will use a 5 second timeout. This design allows Mantle to
+immediately return an error if anything RADOS-related goes wrong.
+
+We use this implementation because we do not want to do a blocking OSD read
+from inside the global MDS lock. Doing so would bring down the MDS cluster if
+any of the OSDs are not responsive -- this is tested in the ceph-qa-suite by
+setting all OSDs to down/out and making sure the MDS cluster stays active.
+
+One approach would be to asynchronously fire the read when handling the MDS Map
+and fill in the Lua code in the background. We cannot do this because the MDS
+does not support daemon-local fallbacks and the balancer assumes that all MDSs
+come to the same decision at the same time (e.g., importers, exporters, etc.).
+
+Debugging
+~~~~~~~~~
+
+Logging in a Lua policy will appear in the MDS log. The syntax is the same as
+the cls logging interface:
+
+::
+
+ BAL_LOG(0, "this is a log message")
+
+
+It is implemented by passing a function that wraps the `dout` logging framework
+(`dout_wrapper`) to Lua with the `lua_register()` primitive. The Lua code is
+actually calling the `dout` function in C++.
+
+Warning and Info messages are centralized using the clog/Beacon. Successful
+messages are only sent on version changes by the first MDS to avoid spamming
+the `ceph -w` utility. These messages are used for the integration tests.
+
+Testing
+~~~~~~~
+
+Testing is done with the ceph-qa-suite (tasks.cephfs.test_mantle). We do not
+test invalid balancer logging and loading the actual Lua VM.
diff --git a/src/ceph/doc/cephfs/mds-config-ref.rst b/src/ceph/doc/cephfs/mds-config-ref.rst
new file mode 100644
index 0000000..2fd47ae
--- /dev/null
+++ b/src/ceph/doc/cephfs/mds-config-ref.rst
@@ -0,0 +1,629 @@
+======================
+ MDS Config Reference
+======================
+
+``mon force standby active``
+
+:Description: If ``true`` monitors force standby-replay to be active. Set
+ under ``[mon]`` or ``[global]``.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``mds max file size``
+
+:Description: The maximum allowed file size to set when creating a
+ new file system.
+
+:Type: 64-bit Integer Unsigned
+:Default: ``1ULL << 40``
+
+``mds cache memory limit``
+
+:Description: The memory limit the MDS should enforce for its cache.
+ Administrators should use this instead of ``mds cache size``.
+:Type: 64-bit Integer Unsigned
+:Default: ``1073741824``
+
+``mds cache reservation``
+
+:Description: The cache reservation (memory or inodes) for the MDS cache to maintain.
+ Once the MDS begins dipping into its reservation, it will recall
+ client state until its cache size shrinks to restore the
+ reservation.
+:Type: Float
+:Default: ``0.05``
+
+``mds cache size``
+
+:Description: The number of inodes to cache. A value of 0 indicates an
+ unlimited number. It is recommended to use
+ ``mds_cache_memory_limit`` to limit the amount of memory the MDS
+ cache uses.
+:Type: 32-bit Integer
+:Default: ``0``
+
+``mds cache mid``
+
+:Description: The insertion point for new items in the cache LRU
+ (from the top).
+
+:Type: Float
+:Default: ``0.7``
+
+
+``mds dir commit ratio``
+
+:Description: The fraction of directory that is dirty before Ceph commits using
+ a full update (instead of partial update).
+
+:Type: Float
+:Default: ``0.5``
+
+
+``mds dir max commit size``
+
+:Description: The maximum size of a directory update before Ceph breaks it into
+ smaller transactions (MB).
+
+:Type: 32-bit Integer
+:Default: ``90``
+
+
+``mds decay halflife``
+
+:Description: The half-life of MDS cache temperature.
+:Type: Float
+:Default: ``5``
+
+``mds beacon interval``
+
+:Description: The frequency (in seconds) of beacon messages sent
+ to the monitor.
+
+:Type: Float
+:Default: ``4``
+
+
+``mds beacon grace``
+
+:Description: The interval without beacons before Ceph declares an MDS laggy
+ (and possibly replaces it).
+
+:Type: Float
+:Default: ``15``
+
+
+``mds blacklist interval``
+
+:Description: The blacklist duration for failed MDSs in the OSD map.
+:Type: Float
+:Default: ``24.0*60.0``
+
+
+``mds session timeout``
+
+:Description: The interval (in seconds) of client inactivity before Ceph
+ times out capabilities and leases.
+
+:Type: Float
+:Default: ``60``
+
+
+``mds session autoclose``
+
+:Description: The interval (in seconds) before Ceph closes
+ a laggy client's session.
+
+:Type: Float
+:Default: ``300``
+
+
+``mds reconnect timeout``
+
+:Description: The interval (in seconds) to wait for clients to reconnect
+ during MDS restart.
+
+:Type: Float
+:Default: ``45``
+
+
+``mds tick interval``
+
+:Description: How frequently the MDS performs internal periodic tasks.
+:Type: Float
+:Default: ``5``
+
+
+``mds dirstat min interval``
+
+:Description: The minimum interval (in seconds) to try to avoid propagating
+ recursive stats up the tree.
+
+:Type: Float
+:Default: ``1``
+
+``mds scatter nudge interval``
+
+:Description: How quickly dirstat changes propagate up.
+:Type: Float
+:Default: ``5``
+
+
+``mds client prealloc inos``
+
+:Description: The number of inode numbers to preallocate per client session.
+:Type: 32-bit Integer
+:Default: ``1000``
+
+
+``mds early reply``
+
+:Description: Determines whether the MDS should allow clients to see request
+ results before they commit to the journal.
+
+:Type: Boolean
+:Default: ``true``
+
+
+``mds use tmap``
+
+:Description: Use trivialmap for directory updates.
+:Type: Boolean
+:Default: ``true``
+
+
+``mds default dir hash``
+
+:Description: The function to use for hashing files across directory fragments.
+:Type: 32-bit Integer
+:Default: ``2`` (i.e., rjenkins)
+
+
+``mds log skip corrupt events``
+
+:Description: Determines whether the MDS should try to skip corrupt journal
+ events during journal replay.
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds log max events``
+
+:Description: The maximum events in the journal before we initiate trimming.
+ Set to ``-1`` to disable limits.
+
+:Type: 32-bit Integer
+:Default: ``-1``
+
+
+``mds log max segments``
+
+:Description: The maximum number of segments (objects) in the journal before
+ we initiate trimming. Set to ``-1`` to disable limits.
+
+:Type: 32-bit Integer
+:Default: ``30``
+
+
+``mds log max expiring``
+
+:Description: The maximum number of segments to expire in parallel.
+:Type: 32-bit Integer
+:Default: ``20``
+
+
+``mds log eopen size``
+
+:Description: The maximum number of inodes in an EOpen event.
+:Type: 32-bit Integer
+:Default: ``100``
+
+
+``mds bal sample interval``
+
+:Description: Determines how frequently to sample directory temperature
+ (for fragmentation decisions).
+
+:Type: Float
+:Default: ``3``
+
+
+``mds bal replicate threshold``
+
+:Description: The maximum temperature before Ceph attempts to replicate
+ metadata to other nodes.
+
+:Type: Float
+:Default: ``8000``
+
+
+``mds bal unreplicate threshold``
+
+:Description: The minimum temperature before Ceph stops replicating
+ metadata to other nodes.
+
+:Type: Float
+:Default: ``0``
+
+
+``mds bal frag``
+
+:Description: Determines whether the MDS will fragment directories.
+:Type: Boolean
+:Default: ``false``
+
+
+``mds bal split size``
+
+:Description: The maximum directory size before the MDS will split a directory
+ fragment into smaller bits.
+
+:Type: 32-bit Integer
+:Default: ``10000``
+
+
+``mds bal split rd``
+
+:Description: The maximum directory read temperature before Ceph splits
+ a directory fragment.
+
+:Type: Float
+:Default: ``25000``
+
+
+``mds bal split wr``
+
+:Description: The maximum directory write temperature before Ceph splits
+ a directory fragment.
+
+:Type: Float
+:Default: ``10000``
+
+
+``mds bal split bits``
+
+:Description: The number of bits by which to split a directory fragment.
+:Type: 32-bit Integer
+:Default: ``3``
+
+
+``mds bal merge size``
+
+:Description: The minimum directory size before Ceph tries to merge
+ adjacent directory fragments.
+
+:Type: 32-bit Integer
+:Default: ``50``
+
+
+``mds bal interval``
+
+:Description: The frequency (in seconds) of workload exchanges between MDSs.
+:Type: 32-bit Integer
+:Default: ``10``
+
+
+``mds bal fragment interval``
+
+:Description: The delay (in seconds) between a fragment being eligible for split
+ or merge and executing the fragmentation change.
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+``mds bal fragment fast factor``
+
+:Description: The ratio by which fragments may exceed the split size before
+ a split is executed immediately (skipping the fragment interval).
+:Type: Float
+:Default: ``1.5``
+
+``mds bal fragment size max``
+
+:Description: The maximum size of a fragment before any new entries
+ are rejected with ENOSPC.
+:Type: 32-bit Integer
+:Default: ``100000``
+
+``mds bal idle threshold``
+
+:Description: The minimum temperature before Ceph migrates a subtree
+ back to its parent.
+
+:Type: Float
+:Default: ``0``
+
+
+``mds bal max``
+
+:Description: The number of iterations to run the balancer before Ceph stops.
+ (used for testing purposes only)
+
+:Type: 32-bit Integer
+:Default: ``-1``
+
+
+``mds bal max until``
+
+:Description: The number of seconds to run the balancer before Ceph stops.
+ (used for testing purposes only)
+
+:Type: 32-bit Integer
+:Default: ``-1``
+
+
+``mds bal mode``
+
+:Description: The method for calculating MDS load.
+
+ - ``0`` = Hybrid.
+ - ``1`` = Request rate and latency.
+ - ``2`` = CPU load.
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds bal min rebalance``
+
+:Description: The minimum subtree temperature before Ceph migrates.
+:Type: Float
+:Default: ``0.1``
+
+
+``mds bal min start``
+
+:Description: The minimum subtree temperature before Ceph searches a subtree.
+:Type: Float
+:Default: ``0.2``
+
+
+``mds bal need min``
+
+:Description: The minimum fraction of target subtree size to accept.
+:Type: Float
+:Default: ``0.8``
+
+
+``mds bal need max``
+
+:Description: The maximum fraction of target subtree size to accept.
+:Type: Float
+:Default: ``1.2``
+
+
+``mds bal midchunk``
+
+:Description: Ceph will migrate any subtree that is larger than this fraction
+ of the target subtree size.
+
+:Type: Float
+:Default: ``0.3``
+
+
+``mds bal minchunk``
+
+:Description: Ceph will ignore any subtree that is smaller than this fraction
+ of the target subtree size.
+
+:Type: Float
+:Default: ``0.001``
+
+
+``mds bal target removal min``
+
+:Description: The minimum number of balancer iterations before Ceph removes
+ an old MDS target from the MDS map.
+
+:Type: 32-bit Integer
+:Default: ``5``
+
+
+``mds bal target removal max``
+
+:Description: The maximum number of balancer iterations before Ceph removes
+ an old MDS target from the MDS map.
+
+:Type: 32-bit Integer
+:Default: ``10``
+
+
+``mds replay interval``
+
+:Description: The journal poll interval when in standby-replay mode.
+ ("hot standby")
+
+:Type: Float
+:Default: ``1``
+
+
+``mds shutdown check``
+
+:Description: The interval for polling the cache during MDS shutdown.
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds thrash exports``
+
+:Description: Ceph will randomly export subtrees between nodes (testing only).
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds thrash fragments``
+
+:Description: Ceph will randomly fragment or merge directories.
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds dump cache on map``
+
+:Description: Ceph will dump the MDS cache contents to a file on each MDSMap.
+:Type: Boolean
+:Default: ``false``
+
+
+``mds dump cache after rejoin``
+
+:Description: Ceph will dump MDS cache contents to a file after
+ rejoining the cache (during recovery).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds verify scatter``
+
+:Description: Ceph will assert that various scatter/gather invariants
+ are ``true`` (developers only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug scatterstat``
+
+:Description: Ceph will assert that various recursive stat invariants
+ are ``true`` (for developers only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug frag``
+
+:Description: Ceph will verify directory fragmentation invariants
+ when convenient (developers only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug auth pins``
+
+:Description: Debug auth pin invariants (for developers only).
+:Type: Boolean
+:Default: ``false``
+
+
+``mds debug subtrees``
+
+:Description: Debug subtree invariants (for developers only).
+:Type: Boolean
+:Default: ``false``
+
+
+``mds kill mdstable at``
+
+:Description: Ceph will inject MDS failure in MDSTable code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill export at``
+
+:Description: Ceph will inject MDS failure in the subtree export code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill import at``
+
+:Description: Ceph will inject MDS failure in the subtree import code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill link at``
+
+:Description: Ceph will inject MDS failure in hard link code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds kill rename at``
+
+:Description: Ceph will inject MDS failure in the rename code
+ (for developers only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds wipe sessions``
+
+:Description: Ceph will delete all client sessions on startup
+ (for testing only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds wipe ino prealloc``
+
+:Description: Ceph will delete ino preallocation metadata on startup
+ (for testing only).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds skip ino``
+
+:Description: The number of inode numbers to skip on startup
+ (for testing only).
+
+:Type: 32-bit Integer
+:Default: ``0``
+
+
+``mds standby for name``
+
+:Description: An MDS daemon will act as a standby for another MDS daemon with
+ the name specified in this setting.
+
+:Type: String
+:Default: N/A
+
+
+``mds standby for rank``
+
+:Description: An MDS daemon will act as a standby for an MDS daemon of this rank.
+:Type: 32-bit Integer
+:Default: ``-1``
+
+
+``mds standby replay``
+
+:Description: Determines whether a ``ceph-mds`` daemon should poll and replay
+ the log of an active MDS (hot standby).
+
+:Type: Boolean
+:Default: ``false``
+
+
+``mds min caps per client``
+
+:Description: Set the minimum number of capabilities a client may hold.
+:Type: Integer
+:Default: ``100``
+
+
+``mds max ratio caps per client``
+
+:Description: Set the maximum ratio of current caps that may be recalled during MDS cache pressure.
+:Type: Float
+:Default: ``0.8``
diff --git a/src/ceph/doc/cephfs/multimds.rst b/src/ceph/doc/cephfs/multimds.rst
new file mode 100644
index 0000000..11814c1
--- /dev/null
+++ b/src/ceph/doc/cephfs/multimds.rst
@@ -0,0 +1,147 @@
+
+Configuring multiple active MDS daemons
+---------------------------------------
+
+*Also known as: multi-mds, active-active MDS*
+
+Each CephFS filesystem is configured for a single active MDS daemon
+by default. To scale metadata performance for large scale systems, you
+may enable multiple active MDS daemons, which will share the metadata
+workload with one another.
+
+When should I use multiple active MDS daemons?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You should configure multiple active MDS daemons when your metadata performance
+is bottlenecked on the single MDS that runs by default.
+
+Adding more daemons may not increase performance on all workloads. Typically,
+a single application running on a single client will not benefit from an
+increased number of MDS daemons unless the application is doing a lot of
+metadata operations in parallel.
+
+Workloads that typically benefit from a larger number of active MDS daemons
+are those with many clients, perhaps working on many separate directories.
+
+
+Increasing the MDS active cluster size
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Each CephFS filesystem has a *max_mds* setting, which controls
+how many ranks will be created. The actual number of ranks
+in the filesystem will only be increased if a spare daemon is
+available to take on the new rank. For example, if there is only one MDS
+daemon running and ``max_mds`` is set to two, no second rank will be created.
+
+Set ``max_mds`` to the desired number of ranks. In the following examples
+the "fsmap" line of "ceph status" is shown to illustrate the expected
+result of commands.
+
+::
+
+ # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby
+
+ ceph fs set <fs_name> max_mds 2
+
+ # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
+ # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
+
+The newly created rank (1) will pass through the 'creating' state
+and then enter the 'active' state.
+
+Standby daemons
+~~~~~~~~~~~~~~~
+
+Even with multiple active MDS daemons, a highly available system **still
+requires standby daemons** to take over if any of the servers running
+an active daemon fail.
+
+Consequently, the practical maximum of ``max_mds`` for highly available systems
+is one less than the total number of MDS servers in your system.
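+
+For example, with three MDS daemons and ``max_mds`` set to 2, one daemon
+remains available as a standby. An illustrative (hypothetical) "fsmap" line
+for that arrangement would look like:
+
+::
+
+ # fsmap e20: 2/2/2 up {0=a=up:active,1=b=up:active}, 1 up:standby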
+
+To remain available in the event of multiple server failures, increase the
+number of standby daemons in the system to match the number of server failures
+you wish to withstand.
+
+Decreasing the number of ranks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All ranks, including the rank(s) to be removed, must first be active. This
+means that you must have at least ``max_mds`` MDS daemons available.
+
+First, set ``max_mds`` to a lower number. For example, we might go back to
+having just a single active MDS:
+
+::
+
+ # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
+ ceph fs set <fs_name> max_mds 1
+ # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:active}, 1 up:standby
+
+Note that we still have two active MDSs: the ranks still exist even though
+we have decreased max_mds, because max_mds only restricts creation
+of new ranks.
+
+Next, use the ``ceph mds deactivate <role>`` command to remove the
+unneeded rank:
+
+::
+
+ ceph mds deactivate cephfs_a:1
+ telling mds.1:1 172.21.9.34:6806/837679928 to deactivate
+
+ # fsmap e11: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
+ # fsmap e12: 1/1/1 up {0=a=up:active}, 1 up:standby
+ # fsmap e13: 1/1/1 up {0=a=up:active}, 2 up:standby
+
+See :doc:`/cephfs/administration` for more details on which forms ``<role>``
+can take.
+
+The deactivated rank will first enter the stopping state for a period
+of time while it hands off its share of the metadata to the remaining
+active daemons. This phase can take from seconds to minutes. If the
+MDS appears to be stuck in the stopping state then that should be investigated
+as a possible bug.
+
+If an MDS daemon crashes or is killed while in the 'stopping' state, a
+standby will take over and the rank will go back to 'active'. You can
+try to deactivate it again once it has come back up.
+
+When a daemon finishes stopping, it will respawn itself and go
+back to being a standby.
+
+
+Manually pinning directory trees to a particular rank
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In multiple active metadata server configurations, a balancer runs which works
+to spread metadata load evenly across the cluster. This usually works well
+enough for most users but sometimes it is desirable to override the dynamic
+balancer with explicit mappings of metadata to particular ranks. This can allow
+the administrator or users to evenly spread application load or limit impact of
+users' metadata requests on the entire cluster.
+
+The mechanism provided for this purpose is called an ``export pin``, an
+extended attribute of directories. The name of this extended attribute is
+``ceph.dir.pin``. Users can set this attribute using standard commands:
+
+::
+
+ setfattr -n ceph.dir.pin -v 2 path/to/dir
+
+The value of the extended attribute is the rank to assign the directory subtree
+to. A default value of ``-1`` indicates the directory is not pinned.
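+
+Accordingly, writing ``-1`` back removes an explicit pin; a minimal sketch
+(the path is a placeholder):
+
+::
+
+ setfattr -n ceph.dir.pin -v -1 path/to/dir   # remove the pin again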
+
+A directory's export pin is inherited from its closest parent with a set export
+pin. In this way, setting the export pin on a directory affects all of its
+children. However, the parent's pin can be overridden by setting the child
+directory's export pin. For example:
+
+::
+
+ mkdir -p a/b
+ # "a" and "a/b" both start without an export pin set
+ setfattr -n ceph.dir.pin -v 1 a/
+ # a and b are now pinned to rank 1
+ setfattr -n ceph.dir.pin -v 0 a/b
+ # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1
+
diff --git a/src/ceph/doc/cephfs/posix.rst b/src/ceph/doc/cephfs/posix.rst
new file mode 100644
index 0000000..6a62cb2
--- /dev/null
+++ b/src/ceph/doc/cephfs/posix.rst
@@ -0,0 +1,49 @@
+========================
+ Differences from POSIX
+========================
+
+CephFS aims to adhere to POSIX semantics wherever possible. For
+example, in contrast to many other common network file systems like
+NFS, CephFS maintains strong cache coherency across clients. The goal
+is for processes communicating via the file system to behave the same
+when they are on different hosts as when they are on the same host.
+
+However, there are a few places where CephFS diverges from strict
+POSIX semantics for various reasons:
+
+- If a client is writing to a file and fails, its writes are not
+ necessarily atomic. That is, the client may call write(2) on a file
+ opened with O_SYNC with an 8 MB buffer and then crash and the write
+ may be only partially applied. (Almost all file systems, even local
+ file systems, have this behavior.)
+- In shared simultaneous writer situations, a write that crosses
+ object boundaries is not necessarily atomic. This means that you
+ could have writer A write "aa|aa" and writer B write "bb|bb"
+ simultaneously (where | is the object boundary), and end up with
+ "aa|bb" rather than the proper "aa|aa" or "bb|bb".
+- POSIX includes the telldir(2) and seekdir(2) system calls that allow
+ you to obtain the current directory offset and seek back to it.
+ Because CephFS may refragment directories at any time, it is
+ difficult to return a stable integer offset for a directory. As
+ such, a seekdir to a non-zero offset may often work but is not
+ guaranteed to do so. A seekdir to offset 0 will always work (and is
+ equivalent to rewinddir(2)).
+- Sparse files propagate incorrectly to the stat(2) st_blocks field.
+ Because CephFS does not explicitly track which parts of a file are
+ allocated/written, the st_blocks field is always populated by the
+ file size divided by the block size. This will cause tools like
+ du(1) to overestimate consumed space. (The recursive size field,
+ maintained by CephFS, also includes file "holes" in its count.)
+- When a file is mapped into memory via mmap(2) on multiple hosts,
+ writes are not coherently propagated to other clients' caches. That
+ is, if a page is cached on host A, and then updated on host B, host
+ A's page is not coherently invalidated. (Shared writable mmap
+ appears to be quite rare--we have yet to hear any complaints about this
+ behavior, and implementing cache coherency properly is complex.)
+- CephFS clients present a hidden ``.snap`` directory that is used to
+ access, create, delete, and rename snapshots. Although the virtual
+ directory is excluded from readdir(2), any process that tries to
+ create a file or directory with the same name will get an error
+ code. The name of this hidden directory can be changed at mount
+ time with ``-o snapdirname=.somethingelse`` (Linux) or the config
+ option ``client_snapdir`` (libcephfs, ceph-fuse).
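+
+As a hedged illustration of the Linux mount option (the monitor address,
+credentials and mount point are placeholders, and authentication options are
+omitted)::
+
+ mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,snapdirname=.snapshots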
diff --git a/src/ceph/doc/cephfs/quota.rst b/src/ceph/doc/cephfs/quota.rst
new file mode 100644
index 0000000..aad0e0b
--- /dev/null
+++ b/src/ceph/doc/cephfs/quota.rst
@@ -0,0 +1,70 @@
+Quotas
+======
+
+CephFS allows quotas to be set on any directory in the system. The
+quota can restrict the number of *bytes* or the number of *files*
+stored beneath that point in the directory hierarchy.
+
+Limitations
+-----------
+
+#. *Quotas are cooperative and non-adversarial.* CephFS quotas rely on
+ the cooperation of the client who is mounting the file system to
+ stop writers when a limit is reached. A modified or adversarial
+ client cannot be prevented from writing as much data as it needs.
+ Quotas should not be relied on to prevent filling the system in
+ environments where the clients are fully untrusted.
+
+#. *Quotas are imprecise.* Processes that are writing to the file
+ system will be stopped a short time after the quota limit is
+ reached. They will inevitably be allowed to write some amount of
+ data over the configured limit. How far over the quota they are
+ able to go depends primarily on the amount of time, not the amount
+ of data. Generally speaking, writers will be stopped within tens of
+ seconds of crossing the configured limit.
+
+#. *Quotas are not yet implemented in the kernel client.* Quotas are
+ supported by the userspace client (libcephfs, ceph-fuse) but are
+ not yet implemented in the Linux kernel client.
+
+#. *Quotas must be configured carefully when used with path-based
+ mount restrictions.* The client needs to have access to the
+ directory inode on which quotas are configured in order to enforce
+ them. If the client has restricted access to a specific path
+ (e.g., ``/home/user``) based on the MDS capability, and a quota is
+ configured on an ancestor directory they do not have access to
+ (e.g., ``/home``), the client will not enforce it. When using
+ path-based access restrictions be sure to configure the quota on
+ the directory the client is restricted to (e.g., ``/home/user``)
+ or something nested beneath it.
+
+Configuration
+-------------
+
+Like most other things in CephFS, quotas are configured using virtual
+extended attributes:
+
+ * ``ceph.quota.max_files`` -- file limit
+ * ``ceph.quota.max_bytes`` -- byte limit
+
+If the attributes appear on a directory inode that means a quota is
+configured there. If they are not present then no quota is set on
+that directory (although one may still be configured on a parent directory).
+
+To set a quota::
+
+ setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir # 100 MB
+ setfattr -n ceph.quota.max_files -v 10000 /some/dir # 10,000 files
+
+To view quota settings::
+
+ getfattr -n ceph.quota.max_bytes /some/dir
+ getfattr -n ceph.quota.max_files /some/dir
+
+Note that if the value of the extended attribute is ``0`` that means
+the quota is not set.
+
+To remove a quota::
+
+ setfattr -n ceph.quota.max_bytes -v 0 /some/dir
+ setfattr -n ceph.quota.max_files -v 0 /some/dir
diff --git a/src/ceph/doc/cephfs/standby.rst b/src/ceph/doc/cephfs/standby.rst
new file mode 100644
index 0000000..6cba2b7
--- /dev/null
+++ b/src/ceph/doc/cephfs/standby.rst
@@ -0,0 +1,222 @@
+
+Terminology
+-----------
+
+A Ceph cluster may have zero or more CephFS *filesystems*. CephFS
+filesystems have a human readable name (set in ``fs new``)
+and an integer ID. The ID is called the filesystem cluster ID,
+or *FSCID*.
+
+Each CephFS filesystem has a number of *ranks*, one by default,
+which start at zero. A rank may be thought of as a metadata shard.
+Controlling the number of ranks in a filesystem is described
+in :doc:`/cephfs/multimds`.
+
+Each CephFS ceph-mds process (a *daemon*) initially starts up
+without a rank. It may be assigned one by the monitor cluster.
+A daemon may only hold one rank at a time. Daemons only give up
+a rank when the ceph-mds process stops.
+
+If a rank is not associated with a daemon, the rank is
+considered *failed*. Once a rank is assigned to a daemon,
+the rank is considered *up*.
+
+A daemon has a *name* that is set statically by the administrator
+when the daemon is first configured. Typical configurations
+use the hostname where the daemon runs as the daemon name.
+
+Each time a daemon starts up, it is also assigned a *GID*, which
+is unique to this particular process lifetime of the daemon. The
+GID is an integer.
+
+Referring to MDS daemons
+------------------------
+
+Most of the administrative commands that refer to an MDS daemon
+accept a flexible argument format that may contain a rank, a GID
+or a name.
+
+Where a rank is used, this may optionally be qualified with
+a leading filesystem name or ID. If a daemon is a standby (i.e.
+it is not currently assigned a rank), then it may only be
+referred to by GID or name.
+
+For example, if we had an MDS daemon which was called 'myhost',
+had GID 5446, and was assigned rank 0 in the filesystem 'myfs'
+which had FSCID 3, then any of the following would be suitable
+forms of the 'fail' command:
+
+::
+
+ ceph mds fail 5446 # GID
+ ceph mds fail myhost # Daemon name
+ ceph mds fail 0 # Unqualified rank
+ ceph mds fail 3:0 # FSCID and rank
+ ceph mds fail myfs:0 # Filesystem name and rank
+
+Managing failover
+-----------------
+
+If an MDS daemon stops communicating with the monitor, the monitor will
+wait ``mds_beacon_grace`` seconds (default 15 seconds) before marking
+the daemon as *laggy*.
+
+Each file system may specify a number of standby daemons to be considered
+healthy. This number includes daemons in standby-replay waiting for a rank to
+fail (remember that a standby-replay daemon will not be assigned to take over a
+failure for another rank or a failure in another CephFS file system). The
+pool of standby daemons that are not in standby-replay counts towards any
+file system's count.
+Each file system may set the number of standby daemons wanted using:
+
+::
+
+ ceph fs set <fs name> standby_count_wanted <count>
+
+Setting ``count`` to 0 will disable the health check.
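+
+For example, to ask for one healthy standby for a filesystem named
+``cephfs_a`` (the name here is purely illustrative):
+
+::
+
+ ceph fs set cephfs_a standby_count_wanted 1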
+
+
+Configuring standby daemons
+---------------------------
+
+There are four configuration settings that control how a daemon
+will behave while in standby:
+
+::
+
+ mds_standby_for_name
+ mds_standby_for_rank
+ mds_standby_for_fscid
+ mds_standby_replay
+
+These may be set in the ceph.conf on the host where the MDS daemon
+runs (as opposed to on the monitor). The daemon loads these settings
+when it starts, and sends them to the monitor.
+
+By default, if none of these settings are used, all MDS daemons
+which do not hold a rank will be used as standbys for any rank.
+
+The settings which associate a standby daemon with a particular
+name or rank do not guarantee that the daemon will *only* be used
+for that rank. They mean that when several standbys are available,
+the associated standby daemon will be used. If a rank is failed,
+and a standby is available, it will be used even if it is associated
+with a different rank or named daemon.
+
+mds_standby_replay
+~~~~~~~~~~~~~~~~~~
+
+If this is set to true, then the standby daemon will continuously read
+the metadata journal of an up rank. This will give it
+a warm metadata cache, and speed up the process of failing over
+if the daemon serving the rank fails.
+
+An up rank may only have one standby replay daemon assigned to it. If
+two daemons are both set to be standby replay then one of them
+will arbitrarily win, and the other will become a normal non-replay
+standby.
+
+Once a daemon has entered the standby replay state, it will only be
+used as a standby for the rank that it is following. If another rank
+fails, this standby replay daemon will not be used as a replacement,
+even if no other standbys are available.
+
+*Historical note:* In Ceph versions prior to v10.2.1, this setting was
+always treated as ``true`` (even when set to ``false``) if
+``mds_standby_for_*`` was also set.
+
+mds_standby_for_name
+~~~~~~~~~~~~~~~~~~~~
+
+Set this to make the standby daemon only take over a failed rank
+if the last daemon to hold it matches this name.
+
+mds_standby_for_rank
+~~~~~~~~~~~~~~~~~~~~
+
+Set this to make the standby daemon only take over the specified
+rank. If another rank fails, this daemon will not be used to
+replace it.
+
+Use in conjunction with ``mds_standby_for_fscid`` to be specific
+about which filesystem's rank you are targeting, if you have
+multiple filesystems.
+
+mds_standby_for_fscid
+~~~~~~~~~~~~~~~~~~~~~
+
+If ``mds_standby_for_rank`` is set, this is simply a qualifier to
+say which filesystem's rank is referred to.
+
+If ``mds_standby_for_rank`` is not set, then setting FSCID will
+cause this daemon to target any rank in the specified FSCID. Use
+this if you have a daemon that you want to use for any rank, but
+only within a particular filesystem.
+
+mon_force_standby_active
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+This setting is used on monitor hosts. It defaults to true.
+
+If it is false, then daemons configured with standby_replay=true
+will **only** become active if the rank/name that they have
+been configured to follow fails. On the other hand, if this
+setting is true, then a daemon configured with standby_replay=true
+may be assigned some other rank.
+
+Examples
+--------
+
+These are example ceph.conf snippets. In practice you can either
+copy a ceph.conf with all daemons' configuration to all your servers,
+or you can have a different file on each server that contains just
+that server's daemons' configuration.
+
+Simple pair
+~~~~~~~~~~~
+
+Two MDS daemons 'a' and 'b' acting as a pair, where whichever one is not
+currently assigned a rank will be the standby replay follower
+of the other.
+
+::
+
+ [mds.a]
+ mds standby replay = true
+ mds standby for rank = 0
+
+ [mds.b]
+ mds standby replay = true
+ mds standby for rank = 0
+
+Floating standby
+~~~~~~~~~~~~~~~~
+
+Three MDS daemons 'a', 'b' and 'c', in a filesystem that has
+``max_mds`` set to 2.
+
+::
+
+ # No explicit configuration required: whichever daemon is
+ # not assigned a rank will go into 'standby' and take over
+ # for whichever other daemon fails.
+
+Two MDS clusters
+~~~~~~~~~~~~~~~~
+
+With two filesystems, I have four MDS daemons, and I want two
+to act as a pair for one filesystem and two to act as a pair
+for the other filesystem.
+
+::
+
+ [mds.a]
+ mds standby for fscid = 1
+
+ [mds.b]
+ mds standby for fscid = 1
+
+ [mds.c]
+ mds standby for fscid = 2
+
+ [mds.d]
+ mds standby for fscid = 2
+
diff --git a/src/ceph/doc/cephfs/troubleshooting.rst b/src/ceph/doc/cephfs/troubleshooting.rst
new file mode 100644
index 0000000..4158d32
--- /dev/null
+++ b/src/ceph/doc/cephfs/troubleshooting.rst
@@ -0,0 +1,160 @@
+=================
+ Troubleshooting
+=================
+
+Slow/stuck operations
+=====================
+
+If you are experiencing apparent hung operations, the first task is to identify
+where the problem is occurring: in the client, the MDS, or the network connecting
+them. Start by looking to see if either side has stuck operations
+(:ref:`slow_requests`, below), and narrow it down from there.
+
+RADOS Health
+============
+
+If part of the CephFS metadata or data pools is unavailable and CephFS is not
+responding, it is probably because RADOS itself is unhealthy. Resolve those
+problems first (:doc:`../../rados/troubleshooting/index`).
+
+The MDS
+=======
+
+If an operation is hung inside the MDS, it will eventually show up in ``ceph health``,
+identifying "slow requests are blocked". It may also identify clients as
+"failing to respond" or misbehaving in other ways. If the MDS identifies
+specific clients as misbehaving, you should investigate why they are doing so.
+Generally it will be the result of one of the following:
+
+#. overloading the system (if you have extra RAM, increase the
+ "mds cache size" config from its default 100000; having a larger active
+ file set than your MDS cache is the #1 cause of this; see the example
+ after this list),
+#. running an older (misbehaving) client, or
+#. underlying RADOS issues.
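+
+A minimal sketch of raising the cache size at runtime on the MDS host; the
+value is arbitrary, and the same setting can instead be placed under the
+``[mds]`` section of ceph.conf:
+
+::
+
+ ceph daemon mds.<name> config set mds_cache_size 500000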
+
+Otherwise, you have probably discovered a new bug and should report it to
+the developers!
+
+.. _slow_requests:
+
+Slow requests (MDS)
+-------------------
+You can list current operations via the admin socket by running::
+
+ ceph daemon mds.<name> dump_ops_in_flight
+
+from the MDS host. Identify the stuck commands and examine why they are stuck.
+Usually the last "event" will have been an attempt to gather locks, or sending
+the operation off to the MDS log. If it is waiting on the OSDs, fix them. If
+operations are stuck on a specific inode, you probably have a client holding
+caps which prevent others from using it, either because the client is trying
+to flush out dirty data or because you have encountered a bug in CephFS'
+distributed file lock code (the file "capabilities" ["caps"] system).
+
+If it's a result of a bug in the capabilities code, restarting the MDS
+is likely to resolve the problem.
+
+If there are no slow requests reported on the MDS, and it is not reporting
+that clients are misbehaving, either the client has a problem or its
+requests are not reaching the MDS.
+
+ceph-fuse debugging
+===================
+
+ceph-fuse also supports ``dump_ops_in_flight``. Check whether it has any
+operations in flight, and where they are stuck.
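+
+A hedged sketch of querying it through the client's admin socket (the exact
+socket name under ``/var/run/ceph/`` varies with the client name and PID)::
+
+ ceph daemon /var/run/ceph/ceph-client.admin.asok dump_ops_in_flight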
+
+Debug output
+------------
+
+To get more debugging information from ceph-fuse, try running it in the
+foreground with logging to the console (``-d``), client debugging enabled
+(``--debug-client=20``), and a print for each message sent
+(``--debug-ms=1``).
+
+If you suspect a potential monitor issue, enable monitor debugging as well
+(``--debug-monc=20``).
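+
+Putting these flags together, an example invocation might look like the
+following (the monitor address and mount point are placeholders)::
+
+ ceph-fuse -d -m 192.168.0.1:6789 --debug-client=20 --debug-ms=1 --debug-monc=20 /mnt/cephfs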
+
+
+Kernel mount debugging
+======================
+
+Slow requests
+-------------
+
+Unfortunately the kernel client does not support the admin socket, but it has
+similar (if limited) interfaces if your kernel has debugfs enabled. There
+will be a folder in ``/sys/kernel/debug/ceph/``, and that folder (whose name will
+look something like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``)
+will contain a variety of files that output interesting information when you ``cat``
+them. These files are described below; the most interesting when debugging
+slow requests are probably the ``mdsc`` and ``osdc`` files.
+
+* bdi: BDI info about the Ceph system (blocks dirtied, written, etc.)
+* caps: counts of file "caps" structures in-memory and used
+* client_options: dumps the options provided to the CephFS mount
+* dentry_lru: dumps the CephFS dentries currently in-memory
+* mdsc: dumps current requests to the MDS
+* mdsmap: dumps the current MDSMap epoch and MDSes
+* mds_sessions: dumps the current sessions to MDSes
+* monc: dumps the current maps from the monitor, and any "subscriptions" held
+* monmap: dumps the current monitor map epoch and monitors
+* osdc: dumps the current ops in-flight to OSDs (i.e., file data IO)
+* osdmap: dumps the current OSDMap epoch, pools, and OSDs
+
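+The ``mdsc`` and ``osdc`` files can simply be dumped with ``cat``; a small
+sketch, assuming a single CephFS mount on the host::
+
+ cat /sys/kernel/debug/ceph/*/mdsc   # requests currently pending at the MDS
+ cat /sys/kernel/debug/ceph/*/osdc   # file data IO currently in flight to OSDs
+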
+If there are no stuck requests but you have file IO which is not progressing,
+you might have a...
+
+Disconnected+Remounted FS
+=========================
+Because CephFS has a "consistent cache", if your network connection is
+disrupted for a long enough time, the client will be forcibly
+disconnected from the system. At this point, the kernel client is in
+a bind: it cannot safely write back dirty data, and many applications
+do not handle IO errors correctly on close().
+At the moment, the kernel client will remount the FS, but outstanding filesystem
+IO may or may not be satisfied. In these cases, you may need to reboot your
+client system.
+
+You can identify that you are in this situation if dmesg/kern.log reports something like::
+
+ Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
+ Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
+ Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
+ Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
+ Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
+ Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
+ Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
+ Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0
+
+This is an area of ongoing work to improve the behavior. Kernels will soon
+be reliably issuing error codes to in-progress IO, although your application(s)
+may not deal with them well. In the longer-term, we hope to allow reconnect
+and reclaim of data in cases where it won't violate POSIX semantics (generally,
+data which hasn't been accessed or modified by other clients).
+
+Mounting
+========
+
+Mount 5 Error
+-------------
+
+A mount 5 error typically occurs if an MDS server is laggy or if it crashed.
+Ensure at least one MDS is up and running, and the cluster is ``active +
+healthy``.
+
+Mount 12 Error
+--------------
+
+A mount 12 error with ``cannot allocate memory`` usually occurs if you have a
+version mismatch between the :term:`Ceph Client` version and the :term:`Ceph
+Storage Cluster` version. Check the versions using::
+
+ ceph -v
+
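+This reports the locally installed client version. If the cluster is new
+enough (Luminous or later), the versions of the running daemons can also be
+summarized with::
+
+ ceph versions
+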
+If the Ceph Client is behind the Ceph cluster, try to upgrade it::
+
+ sudo apt-get update && sudo apt-get install ceph-common
+
+You may need to uninstall, autoclean and autoremove ``ceph-common``
+and then reinstall it so that you have the latest version.
+
diff --git a/src/ceph/doc/cephfs/upgrading.rst b/src/ceph/doc/cephfs/upgrading.rst
new file mode 100644
index 0000000..7ee3f09
--- /dev/null
+++ b/src/ceph/doc/cephfs/upgrading.rst
@@ -0,0 +1,34 @@
+
+Upgrading pre-Firefly filesystems past Jewel
+============================================
+
+.. tip::
+
+ This advice only applies to users with filesystems
+ created using versions of Ceph older than *Firefly* (0.80).
+ Users creating new filesystems may disregard this advice.
+
+Pre-firefly versions of Ceph used a now-deprecated format
+for storing CephFS directory objects, called TMAPs. Support
+for reading these in RADOS will be removed after the Jewel
+release of Ceph, so for upgrading CephFS users it is important
+to ensure that any old directory objects have been converted.
+
+After installing Jewel on all your MDS and OSD servers, and restarting
+the services, run the following command:
+
+::
+
+ cephfs-data-scan tmap_upgrade <metadata pool name>
+
+This only needs to be run once, and it is not necessary to
+stop any other services while it runs. The command may take some
+time to execute, as it iterates over all objects in your metadata
+pool. It is safe to continue using your filesystem as normal while
+it executes. If the command aborts for any reason, it is safe
+to simply run it again.
+
+If you are upgrading a pre-Firefly CephFS filesystem to a newer Ceph version
+than Jewel, you must first upgrade to Jewel and run the ``tmap_upgrade``
+command before completing your upgrade to the latest version.
+