[dm-devel] [RFC PATCH 1/1] dm: add clone target

Tue Aug 27 14:09:25 UTC 2019

Hello,

This is a kind reminder for this patch set. I'm bumping this thread to
solicit your feedback.

Following the discussion with Heinz, I have provided extensive
benchmarks that show dm-clone's significant performance increase
compared to a dm-snapshot/dm-raid1 stack.

How can we move forward with the review of dm-clone, so it can
eventually be merged upstream?

Looking forward to your feedback,

Nikos

On 7/30/19 1:13 PM, Nikos Tsironis wrote:
> On 7/30/19 12:20 AM, Heinz Mauelshagen wrote:
>> Hi Nikos,
>>
>> thanks for providing these benchmarks which seem to confirm the
>> advantages of clone vs. a snapshot/raid1 stack.
>>
>> Can you please provide 'dmsetup table' for both configurations for
>> completeness?
>>
>> Heinz
>>
> 
> Hi Heinz,
> 
> Yes, of course. The 'dmsetup table' output below is for the 4K
> region/chunk size benchmark. The 'dmsetup table' output for the rest of
> the benchmarks is the same, except for the region/chunk sizes of
> dm-clone and dm-snapshot.
> 
> dm-clone stack (dmsetup table)
> ==============================
> 
> source--vg-origin--lv: 0 629145600 linear 8:16 2048
> dest--vg-meta--lv: 0 65536 linear 259:0 629147648
> clone: 0 629145600 clone 254:1 254:0 254:2 8
> dest--vg-clone--lv: 0 629145600 linear 259:0 2048
> 
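As a reading aid for the dm-clone table above, here is a minimal sketch that splits the clone line into named fields and converts 512-byte sectors to bytes. The argument order (metadata device, destination device, source device, region size) is an assumption inferred by matching the device numbers against the 'dmsetup ls --tree' output quoted further down, and the helper name parse_clone_line is hypothetical.

```python
SECTOR_SIZE = 512  # device-mapper tables count in 512-byte sectors


def parse_clone_line(line):
    """Split one 'dmsetup table' line for a clone target into named fields."""
    name, rest = line.split(":", 1)
    start, length, target, *args = rest.split()
    if target != "clone":
        raise ValueError(f"not a clone target: {target}")
    # Assumed argument order: metadata dev, destination dev, source dev,
    # region size in sectors.
    metadata_dev, dest_dev, source_dev, region_sectors = args[:4]
    return {
        "name": name,
        "start_sector": int(start),
        "length_bytes": int(length) * SECTOR_SIZE,
        "metadata_dev": metadata_dev,
        "dest_dev": dest_dev,
        "source_dev": source_dev,
        "region_bytes": int(region_sectors) * SECTOR_SIZE,
    }


info = parse_clone_line("clone: 0 629145600 clone 254:1 254:0 254:2 8")
print(info["length_bytes"] // 2**30, "GiB device,",
      info["region_bytes"], "byte regions")
```

Under these assumptions the line describes a 300 GiB device with 4096-byte regions, which matches the 4K region size stated for this benchmark.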
> dm-snapshot + dm-raid stack (dmsetup table)
> ===========================================
> 
> mirrorvg-snap-cow: 0 104857600 linear 259:0 629155840
> mirrorvg-raid1--lv_rimage_1: 0 629145600 linear 259:0 10240
> mirrorvg-snap: 0 629145600 snapshot 254:5 254:6 P 8
> mirrorvg-raid1--lv_rimage_0: 0 629145600 linear 8:16 10240
> mirrorvg-raid1--lv-real: 0 629145600 raid raid1 3 0 region_size 1024 2 254:0 254:1 254:2 254:3
> mirrorvg-raid1--lv: 0 629145600 snapshot-origin 254:5
> mirrorvg-raid1--lv_rmeta_1: 0 8192 linear 259:0 2048
> mirrorvg-raid1--lv_rmeta_0: 0 8192 linear 8:16 2048
> 
> Nikos
> 
>> On 7/22/19 10:16 PM, Nikos Tsironis wrote:
>>> On 7/17/19 5:41 PM, Heinz Mauelshagen wrote:
>>>> Hi Nikos,
>>>>
>>>> thanks for elaborating on those details.
>>>>
>>>> Hash table collisions, exception store entry commit overhead,
>>>> SSD cache flush issues etc. are all valid points relative to performance
>>>> and work set footprints in general.
>>>>
>>>> Do you have any performance numbers for your solution vs.
>>>> a snapshot one showing the approach is actually superior
>>>> in real configurations?
>>> Hi Heinz,
>>>
>>> Please see below for detailed benchmark results.
>>>
>>>> I'm asking this particularly in the context of your remark
>>>>
>>>> "A write to a not yet hydrated region will be delayed until the
>>>> corresponding region has been hydrated and the hydration of the region
>>>> starts immediately."
>>>>
>>>> which'll cause a potentially large working set of delayed writes unless
>>>> those cover the whole eventually larger than 4K region.
>>>> How does your 'clone' target perform on such heavy write situations?
>>>>
>>> This situation occurs only when the writes are smaller than the region
>>> size of dm-clone. E.g., if the user sets a region size of 64K and issues
>>> 4K writes.
>>>
>>> In this case, we experience a performance drop due to COW. This is true
>>> _both_ for dm-snapshot and dm-clone and is _unavoidable_.
>>>
>>> But, the common case will be setting a region size equal to the file
>>> system block size, e.g., 4K, and thus avoiding the COW overhead. Note
>>> that LVM snapshots _already_ use 4K as the _default_ chunk size.
>>>
>>> Nevertheless, even for larger region/chunk sizes, dm-clone outperforms
>>> the dm-snapshot based solution, as is evident from the following
>>> performance measurements.
>>>
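The rule described above (only writes smaller than the region size have to wait for hydration) can be sketched as follows. This is a simplified model of the behaviour as stated in this thread, not dm-clone's actual code: the function name, the byte-based arguments, and the set of hydrated region indices are all hypothetical, and the sketch assumes a write avoids the copy only when it covers a whole region.

```python
def regions_needing_hydration(offset, length, region_size, hydrated_regions):
    """Return the regions that must be copied from the source before a write.

    offset, length and region_size are in bytes; hydrated_regions is a set of
    region indices whose data is already valid on the clone device.
    """
    first = offset // region_size
    last = (offset + length - 1) // region_size
    pending = []
    for region in range(first, last + 1):
        if region in hydrated_regions:
            continue
        region_start = region * region_size
        region_end = region_start + region_size
        # A write covering the whole region replaces all of its data, so
        # nothing has to be copied from the source first.
        covers_region = offset <= region_start and offset + length >= region_end
        if not covers_region:
            pending.append(region)
    return pending


KIB = 1024
# 4K write into an unhydrated 64K region: the region is hydrated first (COW).
print(regions_needing_hydration(8 * KIB, 4 * KIB, 64 * KIB, set()))
# 4K write with a 4K region size: the write covers the region, no copy.
print(regions_needing_hydration(8 * KIB, 4 * KIB, 4 * KIB, set()))
```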
>>>> In general, performance and storage footprint test results based on the
>>>> same set of read/write tests including heavy loads with region size
>>>> variations run on 'clone' and 'snapshot' would help your point.
>>>>
>>>> Heinz
>>>>
>>> I used fio to run a series of read and write tests that compare the
>>> performance of dm-clone against your proposed dm-snapshot over dm-raid
>>> solution.
>>>
>>> I used a 375GB spinning disk as the origin device storing the data to be
>>> cloned and a 375GB SSD as the clone device and for storing both
>>> dm-clone's metadata and dm-snapshot's exceptions (COW space).
>>>
>>> dm-clone stack (dmsetup ls --tree)
>>> ==================================
>>>
>>> clone (254:3)
>>>   ├─source--vg-origin--lv (254:2)
>>>   │  └─ (8:16)
>>>   ├─dest--vg-clone--lv (254:0)
>>>   │  └─ (259:0)
>>>   └─dest--vg-meta--lv (254:1)
>>>      └─ (259:0)
>>>
>>> dm-snapshot + dm-raid stack (dmsetup ls --tree)
>>> ===============================================
>>>
>>> mirrorvg-snap (254:7)
>>>   ├─mirrorvg-snap-cow (254:6)
>>>   │  └─ (259:0)
>>>   └─mirrorvg-raid1--lv-real (254:5)
>>>      ├─mirrorvg-raid1--lv_rimage_1 (254:3)
>>>      │  └─ (259:0)
>>>      ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
>>>      │  └─ (259:0)
>>>      ├─mirrorvg-raid1--lv_rimage_0 (254:1)
>>>      │  └─ (8:16)
>>>      └─mirrorvg-raid1--lv_rmeta_0 (254:0)
>>>         └─ (8:16)
>>> mirrorvg-raid1--lv (254:4)
>>>   └─mirrorvg-raid1--lv-real (254:5)
>>>      ├─mirrorvg-raid1--lv_rimage_1 (254:3)
>>>      │  └─ (259:0)
>>>      ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
>>>      │  └─ (259:0)
>>>      ├─mirrorvg-raid1--lv_rimage_0 (254:1)
>>>      │  └─ (8:16)
>>>      └─mirrorvg-raid1--lv_rmeta_0 (254:0)
>>>         └─ (8:16)
>>>
>>> fio configuration
>>> =================
>>>
>>> 1. Random Read/Write latency benchmark
>>>
>>>    ioengine=psync, bs=4K, numjobs=1, direct=1, timeout=90, time_based=1,
>>>    rw=randwrite/randread
>>>
>>> 2. Random Read/Write IOPS benchmark
>>>
>>>    ioengine=libaio, bs=4K, numjobs=1, direct=1, iodepth=32, timeout=90,
>>>    time_based=1, rw=randwrite/randread
>>>
>>> 3. Sequential Read/Write Bandwidth
>>>
>>>    ioengine=libaio, bs=256K, numjobs=1, direct=1, iodepth=32, timeout=90,
>>>    time_based=1, rw=write/read
>>>
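For anyone wanting to reproduce these runs, the three configurations above can be written out as fio job sections. The sketch below only prints job-file text and does not run fio. The option names and values are copied verbatim from the list above; the section names and the target path /dev/mapper/clone are assumptions, since the thread does not say how the jobs were named or which device node fio was pointed at.

```python
# Options shared by all three benchmarks, as listed above.
COMMON = {"numjobs": 1, "direct": 1, "timeout": 90, "time_based": 1}

# Per-benchmark options, as listed above.
BENCHMARKS = {
    "latency": {"ioengine": "psync", "bs": "4K"},
    "iops": {"ioengine": "libaio", "bs": "4K", "iodepth": 32},
    "bandwidth": {"ioengine": "libaio", "bs": "256K", "iodepth": 32},
}


def render_job(name, rw, filename):
    """Render one fio job section as text (hypothetical helper)."""
    options = {**BENCHMARKS[name], **COMMON, "rw": rw, "filename": filename}
    lines = [f"[{name}-{rw}]"]
    lines += [f"{key}={value}" for key, value in options.items()]
    return "\n".join(lines)


# /dev/mapper/clone is an assumed path; writing to a real device is destructive.
print(render_job("latency", "randwrite", "/dev/mapper/clone"))
```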
>>> Baseline
>>> ========
>>>
>>> As a reference, the benchmark results for the raw devices:
>>>
>>> +--------+--------------------+-----------------+--------------+
>>> | device | rand-write latency | rand-write IOPS | seq-write BW |
>>> +--------+--------------------+-----------------+--------------+
>>> |  HDD   |      701 usec      |       1425      |   120 MB/s   |
>>> |  SSD   |     72.6 usec      |      64490      |   390 MB/s   |
>>> +--------+--------------------+-----------------+--------------+
>>>
>>> +--------+-------------------+----------------+-------------+
>>> | device | rand-read latency | rand-read IOPS | seq-read BW |
>>> +--------+-------------------+----------------+-------------+
>>> |  HDD   |      1.4 msec     |      712       |   120 MB/s  |
>>> |  SSD   |      122 usec     |     150920     |   701 MB/s  |
>>> +--------+-------------------+----------------+-------------+
>>>
>>> dm-clone vs dm-snapshot+dm-raid
>>> ===============================
>>>
>>> Latency benchmark
>>> -----------------
>>>
>>> 1. Random write latency
>>>
>>> +-------------------+-----------+-------------+
>>> | region/chunk size |  dm-clone | dm-snapshot |
>>> +-------------------+-----------+-------------+
>>> |        4 KB       | 75.7 usec |   6.8 msec  |
>>> |        8 KB       |  1.9 msec |  17.7 msec  |
>>> |       16 KB       |  2.1 msec |  15.8 msec  |
>>> |       32 KB       |  2.2 msec |  33.6 msec  |
>>> |       64 KB       |  2.6 msec |  31.2 msec  |
>>> |       128 KB      |  3.8 msec |  35.7 msec  |
>>> +-------------------+-----------+-------------+
>>>
>>> * dm-snapshot+dm-raid has 7.5 to 90 times _more_ write latency than
>>>    dm-clone.
>>>
>>> * For the common case of a 4 KB region/chunk size, dm-clone has minimal
>>>    overhead over the SSD device.
>>>
>>> * Even for region/chunk sizes greater than 4 KB, dm-clone's overhead is
>>>    minimal compared to dm-snapshot+dm-raid.
>>>
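The "7.5 to 90 times" range quoted above follows directly from the random write latency table. A short check, with the table values transcribed into microseconds:

```python
# (region/chunk size, dm-clone latency, dm-snapshot latency), in microseconds
WRITE_LATENCY_USEC = [
    ("4 KB", 75.7, 6800.0),
    ("8 KB", 1900.0, 17700.0),
    ("16 KB", 2100.0, 15800.0),
    ("32 KB", 2200.0, 33600.0),
    ("64 KB", 2600.0, 31200.0),
    ("128 KB", 3800.0, 35700.0),
]

ratios = {size: snap / clone for size, clone, snap in WRITE_LATENCY_USEC}
for size, ratio in ratios.items():
    print(f"{size:>6}: dm-snapshot+dm-raid latency is {ratio:.1f}x dm-clone's")

print(f"range: {min(ratios.values()):.1f}x to {max(ratios.values()):.1f}x")
```

The smallest ratio comes from the 16 KB row and the largest from the 4 KB row. The same calculation applies to the other tables in this section.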
>>> 2. Random read latency
>>>
>>> +-------------------+----------+-------------+
>>> | region/chunk size | dm-clone | dm-snapshot |
>>> +-------------------+----------+-------------+
>>> |        4 KB       | 1.5 msec |  10.7 msec  |
>>> |        8 KB       | 1.5 msec |   9.7 msec  |
>>> |       16 KB       | 1.5 msec |  11.9 msec  |
>>> |       32 KB       | 1.5 msec |  28.6 msec  |
>>> |       64 KB       | 1.5 msec |  27.5 msec  |
>>> |       128 KB      | 1.5 msec |  27.3 msec  |
>>> +-------------------+----------+-------------+
>>>
>>> * dm-snapshot+dm-raid has 6.5 to 19 times _more_ read latency than
>>>    dm-clone.
>>>
>>> * For all region/chunk sizes dm-clone has minimal overhead over the HDD
>>>    device.
>>>
>>> IOPS benchmark
>>> --------------
>>>
>>> 1. Random write IOPS
>>>
>>> +-------------------+----------+-------------+
>>> | region/chunk size | dm-clone | dm-snapshot |
>>> +-------------------+----------+-------------+
>>> |        4 KB       |  62347   |     3758    |
>>> |        8 KB       |   696    |     388     |
>>> |       16 KB       |   667    |     217     |
>>> |       32 KB       |   614    |     207     |
>>> |       64 KB       |   531    |     186     |
>>> |       128 KB      |   417    |     159     |
>>> +-------------------+----------+-------------+
>>>
>>> * dm-clone achieves 1.8 to 16.6 times _more_ IOPS than
>>>    dm-snapshot+dm-raid.
>>>
>>> * For the common case of a 4 KB region/chunk size, dm-clone has minimal
>>>    overhead over the SSD device.
>>>
>>> * Even for region/chunk sizes greater than 4 KB, dm-clone achieves
>>>    significantly more IOPS than dm-snapshot+dm-raid.
>>>
>>> 2. Random read IOPS
>>>
>>> +-------------------+----------+-------------+
>>> | region/chunk size | dm-clone | dm-snapshot |
>>> +-------------------+----------+-------------+
>>> |        4 KB       |   767    |     680     |
>>> |        8 KB       |   714    |     677     |
>>> |       16 KB       |   715    |     338     |
>>> |       32 KB       |   717    |     338     |
>>> |       64 KB       |   720    |     338     |
>>> |       128 KB      |   724    |     339     |
>>> +-------------------+----------+-------------+
>>>
>>> * dm-clone achieves 1.1 to 2.1 times _more_ IOPS than
>>>    dm-snapshot+dm-raid.
>>>
>>> Bandwidth benchmark
>>> -------------------
>>>
>>> 1. Sequential write BW
>>>
>>> +-------------------+------------+-------------+
>>> | region/chunk size |  dm-clone  | dm-snapshot |
>>> +-------------------+------------+-------------+
>>> |        4 KB       | 389.4 MB/s |  135.3 MB/s |
>>> |        8 KB       | 390.5 MB/s |  231.7 MB/s |
>>> |       16 KB       | 390.5 MB/s |  213.1 MB/s |
>>> |       32 KB       | 390.4 MB/s |  214.0 MB/s |
>>> |       64 KB       | 390.3 MB/s |  214.0 MB/s |
>>> |       128 KB      | 390.5 MB/s |  211.3 MB/s |
>>> +-------------------+------------+-------------+
>>>
>>> * dm-clone achieves 1.7 to 2.9 times more write BW than
>>>    dm-snapshot+dm-raid.
>>>
>>> * For all region/chunk sizes dm-clone achieves the same write BW as the
>>>    SSD device.
>>>
>>> 2. Sequential read BW
>>>
>>> +-------------------+------------+-------------+
>>> | region/chunk size |  dm-clone  | dm-snapshot |
>>> +-------------------+------------+-------------+
>>> |        4 KB       | 442.8 MB/s |  217.3 MB/s |
>>> |        8 KB       | 443.8 MB/s |  288.8 MB/s |
>>> |       16 KB       | 443.8 MB/s |  275.3 MB/s |
>>> |       32 KB       | 443.8 MB/s |  276.1 MB/s |
>>> |       64 KB       | 443.6 MB/s |  276.1 MB/s |
>>> |       128 KB      | 443.6 MB/s |  275.2 MB/s |
>>> +-------------------+------------+-------------+
>>>
>>> * dm-clone achieves 1.5 to 2 times more read BW than
>>>    dm-snapshot+dm-raid.
>>>
>>> Metadata/Storage overhead
>>> =========================
>>>
>>> dm-clone had a _maximum_ metadata overhead of around 20 MB for all
>>> benchmarks. As dm-clone doesn't require any extra COW space for
>>> temporarily storing the written data (writes just go directly to the
>>> clone device) this is the _only_ storage overhead incurred by dm-clone,
>>> irrespective of the amount of the written data.
>>>
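For a sense of scale, here is a back-of-the-envelope estimate of how large a one-bit-per-region bitmap would be for the 629145600-sector device used in these benchmarks. It assumes exactly one bit per region and nothing else. The on-disk metadata layout is not described in this thread, so this is not meant to account for the full 20 MB figure, only to show that it is the right order of magnitude for the smallest region size.

```python
SECTOR_SIZE = 512


def bitmap_bytes(device_sectors, region_sectors):
    """Bytes needed for one bit per region (assumed layout, for estimation)."""
    regions = -(-device_sectors // region_sectors)  # ceiling division
    return -(-regions // 8)


DEVICE_SECTORS = 629145600  # from the 'dmsetup table' output in this thread

for region_kib in (4, 8, 16, 32, 64, 128):
    region_sectors = region_kib * 1024 // SECTOR_SIZE
    size = bitmap_bytes(DEVICE_SECTORS, region_sectors)
    print(f"{region_kib:>3} KB regions: {size / 2**20:.2f} MiB per bitmap")
```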
>>> On the other hand, the COW space utilization of dm-snapshot, for the
>>> bandwidth benchmarks, varied from 11.95 GB to 20.41 GB, depending on the
>>> amount of written data.
>>>
>>> I want to emphasize that after the cloning/syncing is complete we have
>>> to merge this multi-gigabyte COW space back to the clone/destination
>>> device. This will cause _further_ performance degradation, which is
>>> _not_ reflected in the above performance measurements, but _will_ be
>>> present in real workloads, if the dm-snapshot based solution is used.
>>>
>>>
>>> To summarize, dm-clone performs _significantly_ better than a
>>> dm-snapshot based solution, in all aspects (latency, IOPS, BW), and with
>>> a _fraction_ of the storage/metadata overhead.
>>>
>>> If you have any more questions, I would be more than happy to discuss
>>> them with you.
>>>
>>> Thanks,
>>> Nikos
>>>
>>>> On 7/10/19 8:45 PM, Nikos Tsironis wrote:
>>>>> On 7/10/19 12:28 AM, Heinz Mauelshagen wrote:
>>>>>> Hi Nikos,
>>>>>>
>>>>>> what is the crucial factor your target offers vs. resynchronizing such a
>>>>>> latency distinct 2-legged mirror with a read-write snapshot (local, fast
>>>>>> exception store) on top, tearing the mirror down keeping the local leg
>>>>>> once fully in sync and merging the snapshot back into it?
>>>>>>
>>>>>> Heinz
>>>>>>
>>>>> Hi Heinz,
>>>>>
>>>>> The most significant benefits of dm-clone over the solution you propose
>>>>> are significantly better performance, no need for extra COW space, no
>>>>> need to merge back a snapshot, and the ability to skip syncing the
>>>>> unused space of a file system.
>>>>>
>>>>> 1. In order to ensure snapshot consistency, dm-snapshot needs to
>>>>>      commit a completed exception, before signaling the completion of the
>>>>>      write that triggered it to upper layers.
>>>>>
>>>>>      The persistent exception store commits exceptions every time a
>>>>>      metadata area is filled or when there are no more exceptions
>>>>>      in-flight. For a 4K chunk size we have 256 exceptions per metadata
>>>>>      area, so the best case scenario is one commit per 256 writes. Here I
>>>>>      assume a write with size equal to the chunk size of dm-snapshot,
>>>>>      e.g., 4K, so there is no COW overhead, and that we write to new
>>>>>      chunks, so we need to allocate new exceptions.
>>>>>
>>>>>      Part of committing the metadata is flushing the cache of the
>>>>>      underlying device, if there is one. We have seen SSDs which can
>>>>>      sustain hundreds of thousands of random write IOPS, but they take up
>>>>>      to 8ms to flush their cache. In such a case, flushing the SSD cache
>>>>>      every few writes significantly degrades performance.
>>>>>
>>>>>      Moreover, dm-snapshot forces exceptions to complete in the order they
>>>>>      were allocated, to avoid snapshot space leak on crash (commit
>>>>>      230c83afdd9cd). This inserts further latency in exception completions
>>>>>      and thus user write completions.
>>>>>
>>>>>      On the other hand, when cloning a device we don't need to be so
>>>>>      strict and can rely on committing the metadata every time a FLUSH or
>>>>>      FUA bio is written, or periodically, like dm-thin and dm-cache do.
>>>>>
>>>>>      dm-clone does exactly that. When a region/chunk is cloned or
>>>>>      over-written by a write, we just set a bit in the relevant in-core
>>>>>      bitmap. The metadata are committed once every second or when we
>>>>>      receive a FLUSH or FUA bio.
>>>>>
>>>>>      This improves performance significantly and results in increased IOPS
>>>>>      and reduced latency, especially in cases where flushing the disk
>>>>>      cache is very expensive.
>>>>>
>>>>> 2. For large devices, e.g. multi-terabyte disks, resynchronizing the
>>>>>      local leg can take a lot of time. If the application running over the
>>>>>      local device is write-heavy, dm-snapshot will end up allocating a
>>>>>      large number of exceptions. This increases the number of hash table
>>>>>      collisions and thus increases the time we need to do a hash table
>>>>>      lookup.
>>>>>
>>>>>      dm-snapshot needs to look up the exception hash tables in order to
>>>>>      service an I/O, so this increases latency and degrades performance.
>>>>>
>>>>>      On the other hand, dm-clone just tests a bit to see if a region
>>>>>      is cloned or not and decides what to do based on that test.
>>>>>
>>>>> 3. With dm-clone there is no need to reserve extra COW space for
>>>>>      temporarily storing the written data, while the clone device is
>>>>>      syncing. Nor would one need to worry about monitoring and expanding
>>>>>      the COW device to prevent it from filling up.
>>>>>
>>>>> 4. With dm-clone there is no need to merge back potentially several
>>>>>      gigabytes once cloning/syncing completes. We also avoid the relevant
>>>>>      performance degradation incurred by the merging process. Writes just
>>>>>      go directly to the clone device.
>>>>>
>>>>> 5. dm-clone implements support for discards, so it can skip
>>>>>      cloning/syncing the relevant regions. In the case of a large block
>>>>>      device which contains a filesystem with empty space, e.g. a 2TB
>>>>>      device containing 500GB of useful data in a filesystem, this can
>>>>>      significantly reduce the time needed to sync/clone.
>>>>>
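Points 1 and 2 above (a single bit test per region, with metadata committed once per second or on FLUSH/FUA) can be illustrated with a toy model. This is a sketch of the policy as described in this thread, not the dm-clone implementation: the class and method names are hypothetical, the "commit" only increments a counter, and the one-second interval is taken from the description above.

```python
import time


class RegionTracker:
    """Toy model: one bit per region, metadata committed lazily."""

    def __init__(self, nr_regions, commit_interval=1.0, clock=time.monotonic):
        self.bitmap = bytearray((nr_regions + 7) // 8)
        self.commit_interval = commit_interval
        self.clock = clock
        self.last_commit = clock()
        self.dirty = False
        self.commits = 0

    def is_cloned(self, region):
        # Servicing an I/O only needs this single bit test.
        return bool(self.bitmap[region // 8] & (1 << (region % 8)))

    def mark_cloned(self, region):
        # Called when a region is hydrated or fully overwritten by a write.
        self.bitmap[region // 8] |= 1 << (region % 8)
        self.dirty = True

    def on_bio(self, flush_or_fua=False):
        # Commit on FLUSH/FUA, or when the periodic interval has elapsed.
        due = self.clock() - self.last_commit >= self.commit_interval
        if self.dirty and (flush_or_fua or due):
            self.commit()

    def commit(self):
        # A real target would write the bitmap to the metadata device here.
        self.commits += 1
        self.dirty = False
        self.last_commit = self.clock()


tracker = RegionTracker(nr_regions=1024)
for region in range(256):
    tracker.mark_cloned(region)
    tracker.on_bio()
tracker.on_bio(flush_or_fua=True)
print("commits:", tracker.commits, "region 0 cloned:", tracker.is_cloned(0))
```

In this model, 256 back-to-back region updates cost at most one commit per elapsed second plus one for the explicit flush, regardless of how many regions changed in between.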
>>>>> This was a rather long email, but I hope it makes the significant
>>>>> benefits of dm-clone over dm-snapshot, and our rationale for
>>>>> implementing a new target, clearer.
>>>>>
>>>>> I would be more than happy to continue the conversation and focus on any
>>>>> other questions you may have.
>>>>>
>>>>> Thanks,
>>>>> Nikos
>>> --
>>> dm-devel mailing list
>>> dm-devel at redhat.com
>>> https://www.redhat.com/mailman/listinfo/dm-devel
>>