field | value | date |
---|---|---|
author | José Pekkarinen <jose.pekkarinen@nokia.com> | 2016-04-11 10:41:07 +0300 |
committer | José Pekkarinen <jose.pekkarinen@nokia.com> | 2016-04-13 08:17:18 +0300 |
commit | e09b41010ba33a20a87472ee821fa407a5b8da36 | |
tree | d10dc367189862e7ca5c592f033dc3726e1df4e3 /kernel/Documentation/device-mapper | |
parent | f93b97fd65072de626c074dbe099a1fff05ce060 | |
These changes are the raw update to linux-4.4.6-rt14. Kernel sources
are taken from kernel.org, and the rt patch from the rt wiki download page.
During the rebase, the following patch collided:
Force tick interrupt and get rid of softirq magic (I70131fb85).
The collisions were removed because the patch's logic was already
present in the source.
Change-Id: I7f57a4081d9deaa0d9ccfc41a6c8daccdee3b769
Signed-off-by: José Pekkarinen <jose.pekkarinen@nokia.com>
Diffstat (limited to 'kernel/Documentation/device-mapper')
-rw-r--r-- | kernel/Documentation/device-mapper/cache-policies.txt | 67 |
-rw-r--r-- | kernel/Documentation/device-mapper/cache.txt | 15 |
-rw-r--r-- | kernel/Documentation/device-mapper/delay.txt | 1 |
-rw-r--r-- | kernel/Documentation/device-mapper/dm-raid.txt | 33 |
-rw-r--r-- | kernel/Documentation/device-mapper/snapshot.txt | 10 |
-rw-r--r-- | kernel/Documentation/device-mapper/statistics.txt | 45 |
-rw-r--r-- | kernel/Documentation/device-mapper/thin-provisioning.txt | 9 |
7 files changed, 167 insertions, 13 deletions
diff --git a/kernel/Documentation/device-mapper/cache-policies.txt b/kernel/Documentation/device-mapper/cache-policies.txt
index 0d124a971..d9246a32e 100644
--- a/kernel/Documentation/device-mapper/cache-policies.txt
+++ b/kernel/Documentation/device-mapper/cache-policies.txt
@@ -25,10 +25,10 @@ trying to see when the io scheduler has let the ios run.
 Overview of supplied cache replacement policies
 ===============================================

-multiqueue
-----------
+multiqueue (mq)
+---------------

-This policy is the default.
+This policy has been deprecated in favor of the smq policy (see below).

 The multiqueue policy has three sets of 16 queues: one set for entries
 waiting for the cache and another two for those in the cache (a set for
@@ -73,6 +73,67 @@ If you're trying to quickly warm a new cache device you may wish to
 reduce these to encourage promotion. Remember to switch them back to
 their defaults after the cache fills though.

+Stochastic multiqueue (smq)
+---------------------------
+
+This policy is the default.
+
+The stochastic multi-queue (smq) policy addresses some of the problems
+with the multiqueue (mq) policy.
+
+The smq policy (vs mq) offers the promise of less memory utilization,
+improved performance and increased adaptability in the face of changing
+workloads. SMQ also does not have any cumbersome tuning knobs.
+
+Users may switch from "mq" to "smq" simply by appropriately reloading a
+DM table that is using the cache target. Doing so will cause all of the
+mq policy's hints to be dropped. Also, performance of the cache may
+degrade slightly until smq recalculates the origin device's hotspots
+that should be cached.
+
+Memory usage:
+The mq policy uses a lot of memory; 88 bytes per cache block on a 64
+bit machine.
+
+SMQ uses 28bit indexes to implement it's data structures rather than
+pointers. It avoids storing an explicit hit count for each block. It
+has a 'hotspot' queue rather than a pre cache which uses a quarter of
+the entries (each hotspot block covers a larger area than a single
+cache block).
+
+All these mean smq uses ~25bytes per cache block. Still a lot of
+memory, but a substantial improvement nontheless.
+
+Level balancing:
+MQ places entries in different levels of the multiqueue structures
+based on their hit count (~ln(hit count)). This means the bottom
+levels generally have the most entries, and the top ones have very
+few. Having unbalanced levels like this reduces the efficacy of the
+multiqueue.
+
+SMQ does not maintain a hit count, instead it swaps hit entries with
+the least recently used entry from the level above. The over all
+ordering being a side effect of this stochastic process. With this
+scheme we can decide how many entries occupy each multiqueue level,
+resulting in better promotion/demotion decisions.
+
+Adaptability:
+The MQ policy maintains a hit count for each cache block. For a
+different block to get promoted to the cache it's hit count has to
+exceed the lowest currently in the cache. This means it can take a
+long time for the cache to adapt between varying IO patterns.
+Periodically degrading the hit counts could help with this, but I
+haven't found a nice general solution.
+
+SMQ doesn't maintain hit counts, so a lot of this problem just goes
+away. In addition it tracks performance of the hotspot queue, which
+is used to decide which blocks to promote. If the hotspot queue is
+performing badly then it starts moving entries more quickly between
+levels. This lets it adapt to new IO patterns very quickly.
+
+Performance:
+Testing SMQ shows substantially better performance than MQ.
+
 cleaner
 -------

diff --git a/kernel/Documentation/device-mapper/cache.txt b/kernel/Documentation/device-mapper/cache.txt
index 68c0f517c..785eab87a 100644
--- a/kernel/Documentation/device-mapper/cache.txt
+++ b/kernel/Documentation/device-mapper/cache.txt
@@ -221,6 +221,7 @@ Status
 <#read hits> <#read misses> <#write hits> <#write misses>
 <#demotions> <#promotions> <#dirty> <#features> <features>*
 <#core args> <core args>* <policy name> <#policy args> <policy args>*
+<cache metadata mode>

 metadata block size : Fixed block size for each metadata block in
                       sectors
@@ -251,8 +252,18 @@ core args : Key/value pairs for tuning the core
                       e.g. migration_threshold
 policy name : Name of the policy
 #policy args : Number of policy arguments to follow (must be even)
-policy args : Key/value pairs
-                      e.g. sequential_threshold
+policy args : Key/value pairs e.g. sequential_threshold
+cache metadata mode : ro if read-only, rw if read-write
+    In serious cases where even a read-only mode is deemed unsafe
+    no further I/O will be permitted and the status will just
+    contain the string 'Fail'. The userspace recovery tools
+    should then be used.
+needs_check : 'needs_check' if set, '-' if not set
+    A metadata operation has failed, resulting in the needs_check
+    flag being set in the metadata's superblock. The metadata
+    device must be deactivated and checked/repaired before the
+    cache can be made fully operational again. '-' indicates
+    needs_check is not set.

 Messages
 --------
diff --git a/kernel/Documentation/device-mapper/delay.txt b/kernel/Documentation/device-mapper/delay.txt
index 15adc5535..a07b5927f 100644
--- a/kernel/Documentation/device-mapper/delay.txt
+++ b/kernel/Documentation/device-mapper/delay.txt
@@ -8,6 +8,7 @@ Parameters:
     <device> <offset> <delay> [<write_device> <write_offset> <write_delay>]

 With separate write parameters, the first set is only used for reads.
+Offsets are specified in sectors.
 Delays are specified in milliseconds.

 Example scripts
diff --git a/kernel/Documentation/device-mapper/dm-raid.txt b/kernel/Documentation/device-mapper/dm-raid.txt
index ef8ba9fa5..df2d636b6 100644
--- a/kernel/Documentation/device-mapper/dm-raid.txt
+++ b/kernel/Documentation/device-mapper/dm-raid.txt
@@ -209,6 +209,37 @@ include:
     "repair" - Initiate a repair of the array.
     "reshape"- Currently unsupported (-EINVAL).

+
+Discard Support
+---------------
+The implementation of discard support among hardware vendors varies.
+When a block is discarded, some storage devices will return zeroes when
+the block is read. These devices set the 'discard_zeroes_data'
+attribute. Other devices will return random data. Confusingly, some
+devices that advertise 'discard_zeroes_data' will not reliably return
+zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks
+from a number of devices to calculate parity blocks and (for performance
+reasons) relies on 'discard_zeroes_data' being reliable, it is important
+that the devices be consistent. Blocks may be discarded in the middle
+of a RAID 4/5/6 stripe and if subsequent read results are not
+consistent, the parity blocks may be calculated differently at any time;
+making the parity blocks useless for redundancy. It is important to
+understand how your hardware behaves with discards if you are going to
+enable discards with RAID 4/5/6.
+
+Since the behavior of storage devices is unreliable in this respect,
+even when reporting 'discard_zeroes_data', by default RAID 4/5/6
+discard support is disabled -- this ensures data integrity at the
+expense of losing some performance.
+
+Storage devices that properly support 'discard_zeroes_data' are
+increasingly whitelisted in the kernel and can thus be trusted.
+
+For trusted devices, the following dm-raid module parameter can be set
+to safely enable discard support for RAID 4/5/6:
+    'devices_handle_discards_safely'
+
+
 Version History
 ---------------
 1.0.0 Initial version. Support for RAID 4/5/6
@@ -224,3 +255,5 @@ Version History
       New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
 1.5.1 Add ability to restore transiently failed devices on resume.
 1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
+1.6.0 Add discard support (and devices_handle_discard_safely module param).
+1.7.0 Add support for MD RAID0 mappings.
diff --git a/kernel/Documentation/device-mapper/snapshot.txt b/kernel/Documentation/device-mapper/snapshot.txt
index 0d5bc46dc..ad6949bff 100644
--- a/kernel/Documentation/device-mapper/snapshot.txt
+++ b/kernel/Documentation/device-mapper/snapshot.txt
@@ -41,9 +41,13 @@ useless and be disabled, returning errors. So it is important to monitor
 the amount of free space and expand the <COW device> before it fills up.

 <persistent?> is P (Persistent) or N (Not persistent - will not survive
-after reboot).
-The difference is that for transient snapshots less metadata must be
-saved on disk - they can be kept in memory by the kernel.
+after reboot). O (Overflow) can be added as a persistent store option
+to allow userspace to advertise its support for seeing "Overflow" in the
+snapshot status. So supported store types are "P", "PO" and "N".
+
+The difference between persistent and transient is with transient
+snapshots less metadata must be saved on disk - they can be kept in
+memory by the kernel.


 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
diff --git a/kernel/Documentation/device-mapper/statistics.txt b/kernel/Documentation/device-mapper/statistics.txt
index 2a1673adc..6f5ef944c 100644
--- a/kernel/Documentation/device-mapper/statistics.txt
+++ b/kernel/Documentation/device-mapper/statistics.txt
@@ -13,9 +13,14 @@ the range specified.
 The I/O statistics counters for each step-sized area of a region are
 in the same format as /sys/block/*/stat or /proc/diskstats (see:
 Documentation/iostats.txt). But two extra counters (12 and 13) are
-provided: total time spent reading and writing in milliseconds. All
-these counters may be accessed by sending the @stats_print message to
-the appropriate DM device via dmsetup.
+provided: total time spent reading and writing. When the histogram
+argument is used, the 14th parameter is reported that represents the
+histogram of latencies. All these counters may be accessed by sending
+the @stats_print message to the appropriate DM device via dmsetup.
+
+The reported times are in milliseconds and the granularity depends on
+the kernel ticks. When the option precise_timestamps is used, the
+reported times are in nanoseconds.

 Each region has a corresponding unique identifier, which we call a
 region_id, that is assigned when the region is created. The region_id
@@ -33,7 +38,9 @@ memory is used by reading
 Messages
 ========

-    @stats_create <range> <step> [<program_id> [<aux_data>]]
+    @stats_create <range> <step>
+        [<number_of_optional_arguments> <optional_arguments>...]
+        [<program_id> [<aux_data>]]

     Create a new region and return the region_id.

@@ -48,6 +55,29 @@ Messages
       "/<number_of_areas>" - the range is subdivided into the specified
                              number of areas.

+    <number_of_optional_arguments>
+      The number of optional arguments
+
+    <optional_arguments>
+      The following optional arguments are supported
+      precise_timestamps - use precise timer with nanosecond resolution
+        instead of the "jiffies" variable. When this argument is
+        used, the resulting times are in nanoseconds instead of
+        milliseconds. Precise timestamps are a little bit slower
+        to obtain than jiffies-based timestamps.
+      histogram:n1,n2,n3,n4,... - collect histogram of latencies. The
+        numbers n1, n2, etc are times that represent the boundaries
+        of the histogram. If precise_timestamps is not used, the
+        times are in milliseconds, otherwise they are in
+        nanoseconds. For each range, the kernel will report the
+        number of requests that completed within this range. For
+        example, if we use "histogram:10,20,30", the kernel will
+        report four numbers a:b:c:d. a is the number of requests
+        that took 0-10 ms to complete, b is the number of requests
+        that took 10-20 ms to complete, c is the number of requests
+        that took 20-30 ms to complete and d is the number of
+        requests that took more than 30 ms to complete.
+
     <program_id>
       An optional parameter. A name that uniquely identifies
       the userspace owner of the range. This groups ranges together
@@ -55,6 +85,9 @@ Messages
       created and ignore those created by others.
       The kernel returns this string back in the output of
       @stats_list message, but it doesn't use it for anything else.
+      If we omit the number of optional arguments, program id must not
+      be a number, otherwise it would be interpreted as the number of
+      optional arguments.

     <aux_data>
       An optional parameter. A word that provides auxiliary data
@@ -88,6 +121,10 @@ Messages

     Output format:
       <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
+            precise_timestamps histogram:n1,n2,n3,...
+
+    The strings "precise_timestamps" and "histogram" are printed only
+    if they were specified when creating the region.

     @stats_print <region_id> [<starting_line> <number_of_lines>]

diff --git a/kernel/Documentation/device-mapper/thin-provisioning.txt b/kernel/Documentation/device-mapper/thin-provisioning.txt
index 4f67578b2..1699a55b7 100644
--- a/kernel/Documentation/device-mapper/thin-provisioning.txt
+++ b/kernel/Documentation/device-mapper/thin-provisioning.txt
@@ -296,7 +296,7 @@ ii) Status
     underlying device. When this is enabled when loading the table,
     it can get disabled if the underlying device doesn't support it.

-    ro|rw
+    ro|rw|out_of_data_space
     If the pool encounters certain types of device failures it will
     drop into a read-only metadata mode in which no changes to
     the pool metadata (like allocating new blocks) are permitted.
@@ -314,6 +314,13 @@ ii) Status
     module parameter can be used to change this timeout -- it defaults
     to 60 seconds but may be disabled using a value of 0.

+    needs_check
+    A metadata operation has failed, resulting in the needs_check
+    flag being set in the metadata's superblock. The metadata
+    device must be deactivated and checked/repaired before the
+    thin-pool can be made fully operational again. '-' indicates
+    needs_check is not set.
+
 iii) Messages

     create_thin <dev id>
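
Note on the statistics.txt hunk: the "histogram:10,20,30" example there describes four counters a:b:c:d. The following is a minimal Python sketch of that bucketing, not the kernel's code. The function names are hypothetical, and it assumes that a latency exactly equal to a boundary is counted in the higher bucket, which the documentation text does not state.

```python
from bisect import bisect_right


def parse_histogram_arg(arg):
    """Parse an argument such as 'histogram:10,20,30' into [10, 20, 30]."""
    prefix = "histogram:"
    if not arg.startswith(prefix):
        raise ValueError("not a histogram argument: %r" % arg)
    bounds = [int(x) for x in arg[len(prefix):].split(",")]
    # Assumption: boundaries must be strictly increasing to be meaningful.
    if any(b <= a for a, b in zip(bounds, bounds[1:])):
        raise ValueError("boundaries must be strictly increasing")
    return bounds


def bucket_counts(latencies, bounds):
    """Count latencies per bucket; n boundaries give n + 1 buckets."""
    counts = [0] * (len(bounds) + 1)
    for t in latencies:
        # Assumption: a latency equal to a boundary goes to the higher bucket.
        counts[bisect_right(bounds, t)] += 1
    return counts


def format_counts(counts):
    """Join the counters in the a:b:c:d form used by the example."""
    return ":".join(str(c) for c in counts)


if __name__ == "__main__":
    bounds = parse_histogram_arg("histogram:10,20,30")
    # Latencies in milliseconds (no precise_timestamps).
    print(format_counts(bucket_counts([3, 10, 15, 25, 31, 100], bounds)))
```

With three boundaries the sketch always yields four counters, matching the a:b:c:d shape in the example; with precise_timestamps the same boundaries would be read as nanoseconds.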