diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2015-06-25 16:34:39 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2015-06-25 16:34:39 -0700 |
commit | 6597ac8a514e2085cf19822a5783345c613312a5 (patch) | |
tree | fcae0569d2159e04258405c15c91551e36f8ee6c /Documentation | |
parent | e4bc13adfd016fc1036838170288b5680d1a98b0 (diff) | |
parent | e262f34741522e0d821642e5449c6eeb512723fc (diff) |
Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM core cleanups:
* blk-mq request-based DM no longer uses any mempools now that
partial completions are no longer handled as part of cloned
requests
- DM raid cleanups and support for MD raid0
- DM cache core advances and a new stochastic-multi-queue (smq) cache
replacement policy
* smq is the new default dm-cache policy
- DM thinp cleanups and much more efficient large discard support
- DM statistics support for request-based DM and nanosecond resolution
timestamps
- Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
dm stats: add support for request-based DM devices
dm stats: collect and report histogram of IO latencies
dm stats: support precise timestamps
dm stats: fix divide by zero if 'number_of_areas' arg is zero
dm cache: switch the "default" cache replacement policy from mq to smq
dm space map metadata: fix occasional leak of a metadata block on resize
dm thin metadata: fix a race when entering fail mode
dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
dm thin: range discard support
dm thin metadata: add dm_thin_remove_range()
dm thin metadata: add dm_thin_find_mapped_range()
dm btree: add dm_btree_remove_leaves()
dm stats: Use kvfree() in dm_kvfree()
dm cache: age and write back cache entries even without active IO
dm cache: prefix all DMERR and DMINFO messages with cache device name
dm cache: add fail io mode and needs_check flag
dm cache: wake the worker thread every time we free a migration object
dm cache: add stochastic-multi-queue (smq) policy
dm cache: boost promotion of blocks that will be overwritten
dm cache: defer whole cells
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/device-mapper/cache-policies.txt | 67 | ||||
-rw-r--r-- | Documentation/device-mapper/cache.txt | 9 | ||||
-rw-r--r-- | Documentation/device-mapper/dm-raid.txt | 2 | ||||
-rw-r--r-- | Documentation/device-mapper/statistics.txt | 41 |
4 files changed, 110 insertions, 9 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt index 0d124a971801..d9246a32e673 100644 --- a/Documentation/device-mapper/cache-policies.txt +++ b/Documentation/device-mapper/cache-policies.txt @@ -25,10 +25,10 @@ trying to see when the io scheduler has let the ios run. Overview of supplied cache replacement policies =============================================== -multiqueue ----------- +multiqueue (mq) +--------------- -This policy is the default. +This policy has been deprecated in favor of the smq policy (see below). The multiqueue policy has three sets of 16 queues: one set for entries waiting for the cache and another two for those in the cache (a set for @@ -73,6 +73,67 @@ If you're trying to quickly warm a new cache device you may wish to reduce these to encourage promotion. Remember to switch them back to their defaults after the cache fills though. +Stochastic multiqueue (smq) +--------------------------- + +This policy is the default. + +The stochastic multi-queue (smq) policy addresses some of the problems +with the multiqueue (mq) policy. + +The smq policy (vs mq) offers the promise of less memory utilization, +improved performance and increased adaptability in the face of changing +workloads. SMQ also does not have any cumbersome tuning knobs. + +Users may switch from "mq" to "smq" simply by appropriately reloading a +DM table that is using the cache target. Doing so will cause all of the +mq policy's hints to be dropped. Also, performance of the cache may +degrade slightly until smq recalculates the origin device's hotspots +that should be cached. + +Memory usage: +The mq policy uses a lot of memory; 88 bytes per cache block on a 64 +bit machine. + +SMQ uses 28bit indexes to implement it's data structures rather than +pointers. It avoids storing an explicit hit count for each block. It +has a 'hotspot' queue rather than a pre cache which uses a quarter of +the entries (each hotspot block covers a larger area than a single +cache block). + +All these mean smq uses ~25bytes per cache block. Still a lot of +memory, but a substantial improvement nontheless. + +Level balancing: +MQ places entries in different levels of the multiqueue structures +based on their hit count (~ln(hit count)). This means the bottom +levels generally have the most entries, and the top ones have very +few. Having unbalanced levels like this reduces the efficacy of the +multiqueue. + +SMQ does not maintain a hit count, instead it swaps hit entries with +the least recently used entry from the level above. The over all +ordering being a side effect of this stochastic process. With this +scheme we can decide how many entries occupy each multiqueue level, +resulting in better promotion/demotion decisions. + +Adaptability: +The MQ policy maintains a hit count for each cache block. For a +different block to get promoted to the cache it's hit count has to +exceed the lowest currently in the cache. This means it can take a +long time for the cache to adapt between varying IO patterns. +Periodically degrading the hit counts could help with this, but I +haven't found a nice general solution. + +SMQ doesn't maintain hit counts, so a lot of this problem just goes +away. In addition it tracks performance of the hotspot queue, which +is used to decide which blocks to promote. If the hotspot queue is +performing badly then it starts moving entries more quickly between +levels. This lets it adapt to new IO patterns very quickly. + +Performance: +Testing SMQ shows substantially better performance than MQ. + cleaner ------- diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt index 68c0f517c60e..82960cffbad3 100644 --- a/Documentation/device-mapper/cache.txt +++ b/Documentation/device-mapper/cache.txt @@ -221,6 +221,7 @@ Status <#read hits> <#read misses> <#write hits> <#write misses> <#demotions> <#promotions> <#dirty> <#features> <features>* <#core args> <core args>* <policy name> <#policy args> <policy args>* +<cache metadata mode> metadata block size : Fixed block size for each metadata block in sectors @@ -251,8 +252,12 @@ core args : Key/value pairs for tuning the core e.g. migration_threshold policy name : Name of the policy #policy args : Number of policy arguments to follow (must be even) -policy args : Key/value pairs - e.g. sequential_threshold +policy args : Key/value pairs e.g. sequential_threshold +cache metadata mode : ro if read-only, rw if read-write + In serious cases where even a read-only mode is deemed unsafe + no further I/O will be permitted and the status will just + contain the string 'Fail'. The userspace recovery tools + should then be used. Messages -------- diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt index ef8ba9fa58c4..cb12af3b51c2 100644 --- a/Documentation/device-mapper/dm-raid.txt +++ b/Documentation/device-mapper/dm-raid.txt @@ -224,3 +224,5 @@ Version History New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt. 1.5.1 Add ability to restore transiently failed devices on resume. 1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check". +1.6.0 Add discard support (and devices_handle_discard_safely module param). +1.7.0 Add support for MD RAID0 mappings. diff --git a/Documentation/device-mapper/statistics.txt b/Documentation/device-mapper/statistics.txt index 2a1673adc200..4919b2dfd1b3 100644 --- a/Documentation/device-mapper/statistics.txt +++ b/Documentation/device-mapper/statistics.txt @@ -13,9 +13,14 @@ the range specified. The I/O statistics counters for each step-sized area of a region are in the same format as /sys/block/*/stat or /proc/diskstats (see: Documentation/iostats.txt). But two extra counters (12 and 13) are -provided: total time spent reading and writing in milliseconds. All -these counters may be accessed by sending the @stats_print message to -the appropriate DM device via dmsetup. +provided: total time spent reading and writing. When the histogram +argument is used, the 14th parameter is reported that represents the +histogram of latencies. All these counters may be accessed by sending +the @stats_print message to the appropriate DM device via dmsetup. + +The reported times are in milliseconds and the granularity depends on +the kernel ticks. When the option precise_timestamps is used, the +reported times are in nanoseconds. Each region has a corresponding unique identifier, which we call a region_id, that is assigned when the region is created. The region_id @@ -33,7 +38,9 @@ memory is used by reading Messages ======== - @stats_create <range> <step> [<program_id> [<aux_data>]] + @stats_create <range> <step> + [<number_of_optional_arguments> <optional_arguments>...] + [<program_id> [<aux_data>]] Create a new region and return the region_id. @@ -48,6 +55,29 @@ Messages "/<number_of_areas>" - the range is subdivided into the specified number of areas. + <number_of_optional_arguments> + The number of optional arguments + + <optional_arguments> + The following optional arguments are supported + precise_timestamps - use precise timer with nanosecond resolution + instead of the "jiffies" variable. When this argument is + used, the resulting times are in nanoseconds instead of + milliseconds. Precise timestamps are a little bit slower + to obtain than jiffies-based timestamps. + histogram:n1,n2,n3,n4,... - collect histogram of latencies. The + numbers n1, n2, etc are times that represent the boundaries + of the histogram. If precise_timestamps is not used, the + times are in milliseconds, otherwise they are in + nanoseconds. For each range, the kernel will report the + number of requests that completed within this range. For + example, if we use "histogram:10,20,30", the kernel will + report four numbers a:b:c:d. a is the number of requests + that took 0-10 ms to complete, b is the number of requests + that took 10-20 ms to complete, c is the number of requests + that took 20-30 ms to complete and d is the number of + requests that took more than 30 ms to complete. + <program_id> An optional parameter. A name that uniquely identifies the userspace owner of the range. This groups ranges together @@ -55,6 +85,9 @@ Messages created and ignore those created by others. The kernel returns this string back in the output of @stats_list message, but it doesn't use it for anything else. + If we omit the number of optional arguments, program id must not + be a number, otherwise it would be interpreted as the number of + optional arguments. <aux_data> An optional parameter. A word that provides auxiliary data |