7 files changed, 167 insertions, 13 deletions
diff --git a/kernel/Documentation/device-mapper/cache-policies.txt b/kernel/Documentation/device-mapper/cache-policies.txt
index 0d124a971..d9246a32e 100644
--- a/kernel/Documentation/device-mapper/cache-policies.txt
+++ b/kernel/Documentation/device-mapper/cache-policies.txt
@@ -25,10 +25,10 @@ trying to see when the io scheduler has let the ios run.
 Overview of supplied cache replacement policies
 ===============================================
 
-multiqueue
-----------
+multiqueue (mq)
+---------------
 
-This policy is the default.
+This policy has been deprecated in favor of the smq policy (see below).
 
 The multiqueue policy has three sets of 16 queues: one set for entries
 waiting for the cache and another two for those in the cache (a set for
@@ -73,6 +73,67 @@ If you're trying to quickly warm a new cache device you may wish to
 reduce these to encourage promotion.  Remember to switch them back to
 their defaults after the cache fills though.
 
+Stochastic multiqueue (smq)
+---------------------------
+
+This policy is the default.
+
+The stochastic multi-queue (smq) policy addresses some of the problems
+with the multiqueue (mq) policy.
+
+The smq policy (vs mq) offers the promise of less memory utilization,
+improved performance and increased adaptability in the face of changing
+workloads.  SMQ also does not have any cumbersome tuning knobs.
+
+Users may switch from "mq" to "smq" simply by appropriately reloading a
+DM table that is using the cache target.  Doing so will cause all of the
+mq policy's hints to be dropped.  Also, performance of the cache may
+degrade slightly until smq recalculates the origin device's hotspots
+that should be cached.
+
+Memory usage:
+The mq policy uses a lot of memory; 88 bytes per cache block on a 64
+bit machine.
+
+SMQ uses 28bit indexes to implement it's data structures rather than
+pointers.  It avoids storing an explicit hit count for each block.  It
+has a 'hotspot' queue rather than a pre cache which uses a quarter of
+the entries (each hotspot block covers a larger area than a single
+cache block).
+
+All these mean smq uses ~25bytes per cache block.  Still a lot of
+memory, but a substantial improvement nontheless.
+
+Level balancing:
+MQ places entries in different levels of the multiqueue structures
+based on their hit count (~ln(hit count)).  This means the bottom
+levels generally have the most entries, and the top ones have very
+few.  Having unbalanced levels like this reduces the efficacy of the
+multiqueue.
+
+SMQ does not maintain a hit count, instead it swaps hit entries with
+the least recently used entry from the level above.  The over all
+ordering being a side effect of this stochastic process.  With this
+scheme we can decide how many entries occupy each multiqueue level,
+resulting in better promotion/demotion decisions.
+
+Adaptability:
+The MQ policy maintains a hit count for each cache block.  For a
+different block to get promoted to the cache it's hit count has to
+exceed the lowest currently in the cache.  This means it can take a
+long time for the cache to adapt between varying IO patterns.
+Periodically degrading the hit counts could help with this, but I
+haven't found a nice general solution.
+
+SMQ doesn't maintain hit counts, so a lot of this problem just goes
+away.  In addition it tracks performance of the hotspot queue, which
+is used to decide which blocks to promote.  If the hotspot queue is
+performing badly then it starts moving entries more quickly between
+levels.  This lets it adapt to new IO patterns very quickly.
+
+Performance:
+Testing SMQ shows substantially better performance than MQ.
+
 cleaner
 -------
 
diff --git a/kernel/Documentation/device-mapper/cache.txt b/kernel/Documentation/device-mapper/cache.txt
index 68c0f517c..785eab87a 100644
--- a/kernel/Documentation/device-mapper/cache.txt
+++ b/kernel/Documentation/device-mapper/cache.txt
@@ -221,6 +221,7 @@ Status
 <#read hits> <#read misses> <#write hits> <#write misses>
 <#demotions> <#promotions> <#dirty> <#features> <features>*
 <#core args> <core args>* <policy name> <#policy args> <policy args>*
+<cache metadata mode>
 
 metadata block size	 : Fixed block size for each metadata block in
 			     sectors
@@ -251,8 +252,18 @@ core args		 : Key/value pairs for tuning the core
 			     e.g. migration_threshold
 policy name		 : Name of the policy
 #policy args		 : Number of policy arguments to follow (must be even)
-policy args		 : Key/value pairs
-			     e.g. sequential_threshold
+policy args		 : Key/value pairs e.g. sequential_threshold
+cache metadata mode      : ro if read-only, rw if read-write
+	In serious cases where even a read-only mode is deemed unsafe
+	no further I/O will be permitted and the status will just
+	contain the string 'Fail'.  The userspace recovery tools
+	should then be used.
+needs_check		 : 'needs_check' if set, '-' if not set
+	A metadata operation has failed, resulting in the needs_check
+	flag being set in the metadata's superblock.  The metadata
+	device must be deactivated and checked/repaired before the
+	cache can be made fully operational again.  '-' indicates
+	needs_check is not set.
 
 Messages
 --------
diff --git a/kernel/Documentation/device-mapper/delay.txt b/kernel/Documentation/device-mapper/delay.txt
index 15adc5535..a07b5927f 100644
--- a/kernel/Documentation/device-mapper/delay.txt
+++ b/kernel/Documentation/device-mapper/delay.txt
@@ -8,6 +8,7 @@ Parameters:
     <device> <offset> <delay> [<write_device> <write_offset> <write_delay>]
 
 With separate write parameters, the first set is only used for reads.
+Offsets are specified in sectors.
 Delays are specified in milliseconds.
 
 Example scripts
diff --git a/kernel/Documentation/device-mapper/dm-raid.txt b/kernel/Documentation/device-mapper/dm-raid.txt
index ef8ba9fa5..df2d636b6 100644
--- a/kernel/Documentation/device-mapper/dm-raid.txt
+++ b/kernel/Documentation/device-mapper/dm-raid.txt
@@ -209,6 +209,37 @@ include:
 	"repair" - Initiate a repair of the array.
 	"reshape"- Currently unsupported (-EINVAL).
 
+
+Discard Support
+---------------
+The implementation of discard support among hardware vendors varies.
+When a block is discarded, some storage devices will return zeroes when
+the block is read.  These devices set the 'discard_zeroes_data'
+attribute.  Other devices will return random data.  Confusingly, some
+devices that advertise 'discard_zeroes_data' will not reliably return
+zeroes when discarded blocks are read!  Since RAID 4/5/6 uses blocks
+from a number of devices to calculate parity blocks and (for performance
+reasons) relies on 'discard_zeroes_data' being reliable, it is important
+that the devices be consistent.  Blocks may be discarded in the middle
+of a RAID 4/5/6 stripe and if subsequent read results are not
+consistent, the parity blocks may be calculated differently at any time;
+making the parity blocks useless for redundancy.  It is important to
+understand how your hardware behaves with discards if you are going to
+enable discards with RAID 4/5/6.
+
+Since the behavior of storage devices is unreliable in this respect,
+even when reporting 'discard_zeroes_data', by default RAID 4/5/6
+discard support is disabled -- this ensures data integrity at the
+expense of losing some performance.
+
+Storage devices that properly support 'discard_zeroes_data' are
+increasingly whitelisted in the kernel and can thus be trusted.
+
+For trusted devices, the following dm-raid module parameter can be set
+to safely enable discard support for RAID 4/5/6:
+    'devices_handle_discards_safely'
+
+
 Version History
 ---------------
 1.0.0	Initial version.  Support for RAID 4/5/6
@@ -224,3 +255,5 @@ Version History
 	New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
 1.5.1   Add ability to restore transiently failed devices on resume.
 1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
+1.6.0   Add discard support (and devices_handle_discard_safely module param).
+1.7.0   Add support for MD RAID0 mappings.
diff --git a/kernel/Documentation/device-mapper/snapshot.txt b/kernel/Documentation/device-mapper/snapshot.txt
index 0d5bc46dc..ad6949bff 100644
--- a/kernel/Documentation/device-mapper/snapshot.txt
+++ b/kernel/Documentation/device-mapper/snapshot.txt
@@ -41,9 +41,13 @@ useless and be disabled, returning errors.  So it is important to monitor
 the amount of free space and expand the <COW device> before it fills up.
 
 <persistent?> is P (Persistent) or N (Not persistent - will not survive
-after reboot).
-The difference is that for transient snapshots less metadata must be
-saved on disk - they can be kept in memory by the kernel.
+after reboot).  O (Overflow) can be added as a persistent store option
+to allow userspace to advertise its support for seeing "Overflow" in the
+snapshot status.  So supported store types are "P", "PO" and "N".
+
+The difference between persistent and transient is with transient
+snapshots less metadata must be saved on disk - they can be kept in
+memory by the kernel.
 
 
 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
diff --git a/kernel/Documentation/device-mapper/statistics.txt b/kernel/Documentation/device-mapper/statistics.txt
index 2a1673adc..6f5ef944c 100644
--- a/kernel/Documentation/device-mapper/statistics.txt
+++ b/kernel/Documentation/device-mapper/statistics.txt
@@ -13,9 +13,14 @@ the range specified.
 The I/O statistics counters for each step-sized area of a region are
 in the same format as /sys/block/*/stat or /proc/diskstats (see:
 Documentation/iostats.txt).  But two extra counters (12 and 13) are
-provided: total time spent reading and writing in milliseconds.	 All
-these counters may be accessed by sending the @stats_print message to
-the appropriate DM device via dmsetup.
+provided: total time spent reading and writing.  When the histogram
+argument is used, the 14th parameter is reported that represents the
+histogram of latencies.  All these counters may be accessed by sending
+the @stats_print message to the appropriate DM device via dmsetup.
+
+The reported times are in milliseconds and the granularity depends on
+the kernel ticks.  When the option precise_timestamps is used, the
+reported times are in nanoseconds.
 
 Each region has a corresponding unique identifier, which we call a
 region_id, that is assigned when the region is created.	 The region_id
@@ -33,7 +38,9 @@ memory is used by reading
 Messages
 ========
 
-    @stats_create <range> <step> [<program_id> [<aux_data>]]
+    @stats_create <range> <step>
+		[<number_of_optional_arguments> <optional_arguments>...]
+		[<program_id> [<aux_data>]]
 
 	Create a new region and return the region_id.
 
@@ -48,6 +55,29 @@ Messages
 	  "/<number_of_areas>" - the range is subdivided into the specified
 				 number of areas.
 
+	<number_of_optional_arguments>
+	  The number of optional arguments
+
+	<optional_arguments>
+	  The following optional arguments are supported
+	  precise_timestamps - use precise timer with nanosecond resolution
+		instead of the "jiffies" variable.  When this argument is
+		used, the resulting times are in nanoseconds instead of
+		milliseconds.  Precise timestamps are a little bit slower
+		to obtain than jiffies-based timestamps.
+	  histogram:n1,n2,n3,n4,... - collect histogram of latencies.  The
+		numbers n1, n2, etc are times that represent the boundaries
+		of the histogram.  If precise_timestamps is not used, the
+		times are in milliseconds, otherwise they are in
+		nanoseconds.  For each range, the kernel will report the
+		number of requests that completed within this range. For
+		example, if we use "histogram:10,20,30", the kernel will
+		report four numbers a:b:c:d. a is the number of requests
+		that took 0-10 ms to complete, b is the number of requests
+		that took 10-20 ms to complete, c is the number of requests
+		that took 20-30 ms to complete and d is the number of
+		requests that took more than 30 ms to complete.
+
 	<program_id>
 	  An optional parameter.  A name that uniquely identifies
 	  the userspace owner of the range.  This groups ranges together
@@ -55,6 +85,9 @@ Messages
 	  created and ignore those created by others.
 	  The kernel returns this string back in the output of
 	  @stats_list message, but it doesn't use it for anything else.
+	  If we omit the number of optional arguments, program id must not
+	  be a number, otherwise it would be interpreted as the number of
+	  optional arguments.
 
 	<aux_data>
 	  An optional parameter.  A word that provides auxiliary data
@@ -88,6 +121,10 @@ Messages
 
 	Output format:
 	  <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
+	        precise_timestamps histogram:n1,n2,n3,...
+
+	The strings "precise_timestamps" and "histogram" are printed only
+	if they were specified when creating the region.
 
     @stats_print <region_id> [<starting_line> <number_of_lines>]
 
diff --git a/kernel/Documentation/device-mapper/thin-provisioning.txt b/kernel/Documentation/device-mapper/thin-provisioning.txt
index 4f67578b2..1699a55b7 100644
--- a/kernel/Documentation/device-mapper/thin-provisioning.txt
+++ b/kernel/Documentation/device-mapper/thin-provisioning.txt
@@ -296,7 +296,7 @@ ii) Status
 	underlying device.  When this is enabled when loading the table,
 	it can get disabled if the underlying device doesn't support it.
 
-    ro|rw
+    ro|rw|out_of_data_space
 	If the pool encounters certain types of device failures it will
 	drop into a read-only metadata mode in which no changes to
 	the pool metadata (like allocating new blocks) are permitted.
@@ -314,6 +314,13 @@ ii) Status
 	module parameter can be used to change this timeout -- it
 	defaults to 60 seconds but may be disabled using a value of 0.
 
+    needs_check
+	A metadata operation has failed, resulting in the needs_check
+	flag being set in the metadata's superblock.  The metadata
+	device must be deactivated and checked/repaired before the
+	thin-pool can be made fully operational again.  '-' indicates
+	needs_check is not set.
+
 iii) Messages
 
     create_thin <dev id>