[dm-devel] [RFC PATCH 1/1] dm: add clone target
Nikos Tsironis
ntsironis at arrikto.com
Tue Jul 30 10:13:24 UTC 2019
On 7/30/19 12:20 AM, Heinz Mauelshagen wrote:
> Hi Nikos,
>
> thanks for providing these benchmarks which seem to confirm the
> advantages of clone vs. a snapshot/raid1 stack.
>
> Can you please provide 'dmsetup table' for both configurations for
> completeness?
>
> Heinz
>
Hi Heinz,
Yes, of course. The 'dmsetup table' output below is for the 4K
region/chunk size benchmark. The output for the other benchmarks is
the same, except for the region/chunk sizes of dm-clone and
dm-snapshot.
dm-clone stack (dmsetup table)
==============================
source--vg-origin--lv: 0 629145600 linear 8:16 2048
dest--vg-meta--lv: 0 65536 linear 259:0 629147648
clone: 0 629145600 clone 254:1 254:0 254:2 8
dest--vg-clone--lv: 0 629145600 linear 259:0 2048
dm-snapshot + dm-raid stack (dmsetup table)
===========================================
mirrorvg-snap-cow: 0 104857600 linear 259:0 629155840
mirrorvg-raid1--lv_rimage_1: 0 629145600 linear 259:0 10240
mirrorvg-snap: 0 629145600 snapshot 254:5 254:6 P 8
mirrorvg-raid1--lv_rimage_0: 0 629145600 linear 8:16 10240
mirrorvg-raid1--lv-real: 0 629145600 raid raid1 3 0 region_size 1024 2 254:0 254:1 254:2 254:3
mirrorvg-raid1--lv: 0 629145600 snapshot-origin 254:5
mirrorvg-raid1--lv_rmeta_1: 0 8192 linear 259:0 2048
mirrorvg-raid1--lv_rmeta_0: 0 8192 linear 8:16 2048
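For readers less familiar with the table format: each line reads `<name>: <start> <length> <target> <args...>`, with start and length counted in 512-byte sectors. Below is a minimal Python sketch for decoding the clone line. It assumes the clone target's arguments are ordered metadata device, destination device, source device, region size (this matches the device numbers in the tree output quoted further down); the helper name `parse_clone_line` is made up for illustration.

```python
SECTOR_BYTES = 512  # device-mapper tables are expressed in 512-byte sectors


def parse_clone_line(line):
    """Split one 'dmsetup table' line for a clone target into named fields."""
    # Split on the first ':' only; device numbers such as 254:1 contain ':' too.
    name, rest = line.split(":", 1)
    fields = rest.split()
    start, length, target = int(fields[0]), int(fields[1]), fields[2]
    if target != "clone":
        raise ValueError(f"not a clone target: {target}")
    # Assumed argument order: metadata, destination, source, region size.
    metadata_dev, dest_dev, source_dev, region_sectors = fields[3:7]
    return {
        "name": name,
        "start_sector": start,
        "size_bytes": length * SECTOR_BYTES,
        "metadata_dev": metadata_dev,
        "dest_dev": dest_dev,
        "source_dev": source_dev,
        "region_bytes": int(region_sectors) * SECTOR_BYTES,
        "regions": length // int(region_sectors),
    }


info = parse_clone_line("clone: 0 629145600 clone 254:1 254:0 254:2 8")
print(info["size_bytes"] // 2**30, "GiB,",
      info["region_bytes"], "byte regions,",
      info["regions"], "regions")
```

For the clone line above this gives a 300 GiB device split into 78643200 regions of 4096 bytes, i.e. the 4K region size used in this benchmark run.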
Nikos
> On 7/22/19 10:16 PM, Nikos Tsironis wrote:
>> On 7/17/19 5:41 PM, Heinz Mauelshagen wrote:
>>> Hi Nikos,
>>>
>>> thanks for elaborating on those details.
>>>
>>> Hash table collisions, exception store entry commit overhead,
>>> SSD cache flush issues etc. are all valid points relative to performance
>>> and work set footprints in general.
>>>
>>> Do you have any performance numbers for your solution vs.
>>> a snapshot one showing the approach is actually superior in
>>> real configurations?
>> Hi Heinz,
>>
>> Please see below for detailed benchmark results.
>>
>>> I'm asking this particularly in the context of your remark
>>>
>>> "A write to a not yet hydrated region will be delayed until the
>>> corresponding
>>> region has been hydrated and the hydration of the region starts
>>> immediately."
>>>
>>> which'll cause a potentially large working set of delayed writes unless
>>> those
>>> cover the whole eventually larger than 4K region.
>>> How does your 'clone' target perform on such heavy write situations?
>>>
>> This situation occurs only when the writes are smaller than the region
>> size of dm-clone. E.g., if the user sets a region size of 64K and issues
>> 4K writes.
>>
>> In this case, we experience a performance drop due to COW. This is true
>> _both_ for dm-snapshot and dm-clone and is _unavoidable_.
>>
>> But, the common case will be setting a region size equal to the file
>> system block size, e.g., 4K, and thus avoiding the COW overhead. Note
>> that LVM snapshots _already_ use 4K as the _default_ chunk size.
>>
>> Nevertheless, even for larger region/chunk sizes, dm-clone outperforms
>> the dm-snapshot based solution, as is evident by the following
>> performance measurements.
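The two cases described above can be sketched as follows: a write that covers whole regions can go straight to the clone device, while a write that covers only part of a not yet hydrated region has to wait for that region to be copied first. This is a toy Python model under those assumptions, using byte offsets and a hypothetical helper name; it is not the kernel code.

```python
def regions_needing_copy(offset, length, region_size, hydrated):
    """Return the regions that must be copied before this write can proceed.

    offset, length and region_size are in bytes (length must be > 0);
    hydrated is the set of region indices that were already copied.
    """
    end = offset + length
    first = offset // region_size
    last = (end - 1) // region_size
    pending = []
    for region in range(first, last + 1):
        region_start = region * region_size
        region_end = region_start + region_size
        # A region that the write overwrites completely needs no copy.
        fully_covered = offset <= region_start and end >= region_end
        if not fully_covered and region not in hydrated:
            pending.append(region)
    return pending


KB = 1024
print(regions_needing_copy(0, 4 * KB, 4 * KB, set()))    # 4K write, 4K regions
print(regions_needing_copy(0, 4 * KB, 64 * KB, set()))   # 4K write, 64K regions
print(regions_needing_copy(0, 4 * KB, 64 * KB, {0}))     # same, region 0 hydrated
```

With a 4K region size the 4K write needs no copy at all; with a 64K region size the same write has to wait for region 0 unless it has already been hydrated.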
>>
>>> In general, performance and storage footprint test results based on the
>>> same set
>>> of read/write tests including heavy loads with region size variations
>>> run on 'clone'
>>> and 'snapshot' would help your point.
>>>
>>> Heinz
>>>
>> I used fio to run a series of read and write tests that compare the
>> performance of dm-clone against your proposed dm-snapshot over dm-raid
>> solution.
>>
>> I used a 375GB spinning disk as the origin device storing the data to be
>> cloned and a 375GB SSD as the clone device and for storing both
>> dm-clone's metadata and dm-snapshot's exceptions (COW space).
>>
>> dm-clone stack (dmsetup ls --tree)
>> ==================================
>>
>> clone (254:3)
>>  ├─source--vg-origin--lv (254:2)
>>  │  └─ (8:16)
>>  ├─dest--vg-clone--lv (254:0)
>>  │  └─ (259:0)
>>  └─dest--vg-meta--lv (254:1)
>>     └─ (259:0)
>>
>> dm-snapshot + dm-raid stack (dmsetup ls --tree)
>> ===============================================
>>
>> mirrorvg-snap (254:7)
>>  ├─mirrorvg-snap-cow (254:6)
>>  │  └─ (259:0)
>>  └─mirrorvg-raid1--lv-real (254:5)
>>     ├─mirrorvg-raid1--lv_rimage_1 (254:3)
>>     │  └─ (259:0)
>>     ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
>>     │  └─ (259:0)
>>     ├─mirrorvg-raid1--lv_rimage_0 (254:1)
>>     │  └─ (8:16)
>>     └─mirrorvg-raid1--lv_rmeta_0 (254:0)
>>        └─ (8:16)
>> mirrorvg-raid1--lv (254:4)
>>  └─mirrorvg-raid1--lv-real (254:5)
>>     ├─mirrorvg-raid1--lv_rimage_1 (254:3)
>>     │  └─ (259:0)
>>     ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
>>     │  └─ (259:0)
>>     ├─mirrorvg-raid1--lv_rimage_0 (254:1)
>>     │  └─ (8:16)
>>     └─mirrorvg-raid1--lv_rmeta_0 (254:0)
>>        └─ (8:16)
>>
>> fio configuration
>> =================
>>
>> 1. Random Read/Write latency benchmark
>>
>> ioengine=psync, bs=4K, numjobs=1, direct=1, timeout=90, time_based=1,
>> rw=randwrite/randread
>>
>> 2. Random Read/Write IOPS benchmark
>>
>> ioengine=libaio, bs=4K, numjobs=1, direct=1, iodepth=32, timeout=90,
>> time_based=1, rw=randwrite/randread
>>
>> 3. Sequential Read/Write Bandwidth
>>
>> ioengine=libaio, bs=256K, numjobs=1, direct=1, iodepth=32, timeout=90,
>> time_based=1, rw=write/read
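To make the three configurations easier to reproduce, here is a small Python sketch that turns them into fio command lines. The option names and values are copied verbatim from the descriptions above; the target path `/dev/mapper/clone`, the job names and the helper name `fio_command` are assumptions made for illustration. The script only prints the command, it does not run it.

```python
import shlex

# Option sets copied from the three benchmark descriptions above.
BENCHMARKS = {
    "latency": {"ioengine": "psync", "bs": "4K", "numjobs": 1, "direct": 1,
                "timeout": 90, "time_based": 1},
    "iops": {"ioengine": "libaio", "bs": "4K", "numjobs": 1, "direct": 1,
             "iodepth": 32, "timeout": 90, "time_based": 1},
    "bandwidth": {"ioengine": "libaio", "bs": "256K", "numjobs": 1, "direct": 1,
                  "iodepth": 32, "timeout": 90, "time_based": 1},
}


def fio_command(benchmark, rw, filename):
    """Build the fio argument list for one benchmark and I/O direction."""
    opts = dict(BENCHMARKS[benchmark])
    opts.update(rw=rw, filename=filename, name=f"{benchmark}-{rw}")
    return ["fio"] + [f"--{key}={value}" for key, value in opts.items()]


# Assumed device path; writing to it destroys whatever data it holds.
print(shlex.join(fio_command("latency", "randwrite", "/dev/mapper/clone")))
```

Swapping `"latency"` for `"iops"` or `"bandwidth"`, and `rw` for `randread`, `write` or `read`, yields the remaining combinations.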
>>
>> Baseline
>> ========
>>
>> As a reference, the benchmark results for the raw devices:
>>
>> +--------+--------------------+-----------------+--------------+
>> | device | rand-write latency | rand-write IOPS | seq-write BW |
>> +--------+--------------------+-----------------+--------------+
>> | HDD    | 701 usec           | 1425            | 120 MB/s     |
>> | SSD    | 72.6 usec          | 64490           | 390 MB/s     |
>> +--------+--------------------+-----------------+--------------+
>>
>> +--------+-------------------+----------------+-------------+
>> | device | rand-read latency | rand-read IOPS | seq-read BW |
>> +--------+-------------------+----------------+-------------+
>> | HDD    | 1.4 msec          | 712            | 120 MB/s    |
>> | SSD    | 122 usec          | 150920         | 701 MB/s    |
>> +--------+-------------------+----------------+-------------+
>>
>> dm-clone vs dm-snapshot+dm-raid
>> ===============================
>>
>> Latency benchmark
>> -----------------
>>
>> 1. Random write latency
>>
>> +-------------------+-----------+-------------+
>> | region/chunk size | dm-clone  | dm-snapshot |
>> +-------------------+-----------+-------------+
>> | 4 KB              | 75.7 usec | 6.8 msec    |
>> | 8 KB              | 1.9 msec  | 17.7 msec   |
>> | 16 KB             | 2.1 msec  | 15.8 msec   |
>> | 32 KB             | 2.2 msec  | 33.6 msec   |
>> | 64 KB             | 2.6 msec  | 31.2 msec   |
>> | 128 KB            | 3.8 msec  | 35.7 msec   |
>> +-------------------+-----------+-------------+
>>
>> * dm-snapshot+dm-raid has 7.5 to 90 times _more_ write latency than
>> dm-clone.
>>
>> * For the common case of a 4 KB region/chunk size, dm-clone has minimal
>> overhead over the SSD device.
>>
>> * Even for region/chunk sizes greater than 4KB dm-clone's overhead is
>> minimal compared to dm-snapshot+dm-raid.
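The range quoted in the first bullet can be re-derived from the table. A small Python check, with the table values copied in by hand and a helper name chosen for illustration:

```python
def to_usec(value):
    """Convert a table cell such as '6.8 msec' to microseconds."""
    number, unit = value.split()
    return float(number) * {"usec": 1.0, "msec": 1000.0}[unit]


# (dm-clone, dm-snapshot) random write latency per region/chunk size.
WRITE_LATENCY = {
    "4 KB": ("75.7 usec", "6.8 msec"),
    "8 KB": ("1.9 msec", "17.7 msec"),
    "16 KB": ("2.1 msec", "15.8 msec"),
    "32 KB": ("2.2 msec", "33.6 msec"),
    "64 KB": ("2.6 msec", "31.2 msec"),
    "128 KB": ("3.8 msec", "35.7 msec"),
}

ratios = {
    size: to_usec(snapshot) / to_usec(clone)
    for size, (clone, snapshot) in WRITE_LATENCY.items()
}
for size, ratio in ratios.items():
    print(f"{size:>6}: dm-snapshot latency is {ratio:.1f}x dm-clone")
```

The smallest ratio is at 16 KB (about 7.5x) and the largest at 4 KB (about 90x), which matches the stated range.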
>>
>> 2. Random read latency
>>
>> +-------------------+----------+-------------+
>> | region/chunk size | dm-clone | dm-snapshot |
>> +-------------------+----------+-------------+
>> | 4 KB              | 1.5 msec | 10.7 msec   |
>> | 8 KB              | 1.5 msec | 9.7 msec    |
>> | 16 KB             | 1.5 msec | 11.9 msec   |
>> | 32 KB             | 1.5 msec | 28.6 msec   |
>> | 64 KB             | 1.5 msec | 27.5 msec   |
>> | 128 KB            | 1.5 msec | 27.3 msec   |
>> +-------------------+----------+-------------+
>>
>> * dm-snapshot+dm-raid has 6.5 to 19 times _more_ read latency than
>> dm-clone.
>>
>> * For all region/chunk sizes dm-clone has minimal overhead over the HDD
>> device.
>>
>> IOPS benchmark
>> --------------
>>
>> 1. Random write IOPS
>>
>> +-------------------+----------+-------------+
>> | region/chunk size | dm-clone | dm-snapshot |
>> +-------------------+----------+-------------+
>> | 4 KB              | 62347    | 3758        |
>> | 8 KB              | 696      | 388         |
>> | 16 KB             | 667      | 217         |
>> | 32 KB             | 614      | 207         |
>> | 64 KB             | 531      | 186         |
>> | 128 KB            | 417      | 159         |
>> +-------------------+----------+-------------+
>>
>> * dm-clone achieves 1.8 to 16.6 times _more_ IOPS than
>> dm-snapshot+dm-raid.
>>
>> * For the common case of a 4 KB region/chunk size, dm-clone has minimal
>> overhead over the SSD device.
>>
>> * Even for region/chunk sizes greater than 4KB dm-clone achieves
>> significantly more IOPS than dm-snapshot+dm-raid.
>>
>> 2. Random read IOPS
>>
>> +-------------------+----------+-------------+
>> | region/chunk size | dm-clone | dm-snapshot |
>> +-------------------+----------+-------------+
>> | 4 KB              | 767      | 680         |
>> | 8 KB              | 714      | 677         |
>> | 16 KB             | 715      | 338         |
>> | 32 KB             | 717      | 338         |
>> | 64 KB             | 720      | 338         |
>> | 128 KB            | 724      | 339         |
>> +-------------------+----------+-------------+
>>
>> * dm-clone achieves 1.1 to 2.1 times _more_ IOPS than
>> dm-snapshot+dm-raid.
>>
>> Bandwidth benchmark
>> -------------------
>>
>> 1. Sequential write BW
>>
>> +-------------------+------------+-------------+
>> | region/chunk size | dm-clone   | dm-snapshot |
>> +-------------------+------------+-------------+
>> | 4 KB              | 389.4 MB/s | 135.3 MB/s  |
>> | 8 KB              | 390.5 MB/s | 231.7 MB/s  |
>> | 16 KB             | 390.5 MB/s | 213.1 MB/s  |
>> | 32 KB             | 390.4 MB/s | 214.0 MB/s  |
>> | 64 KB             | 390.3 MB/s | 214.0 MB/s  |
>> | 128 KB            | 390.5 MB/s | 211.3 MB/s  |
>> +-------------------+------------+-------------+
>>
>> * dm-clone achieves 1.7 to 2.9 times more write BW than
>> dm-snapshot+dm-raid.
>>
>> * For all region/chunk sizes dm-clone achieves the same write BW as the
>> SSD device.
>>
>> 2. Sequential read BW
>>
>> +-------------------+------------+-------------+
>> | region/chunk size | dm-clone   | dm-snapshot |
>> +-------------------+------------+-------------+
>> | 4 KB              | 442.8 MB/s | 217.3 MB/s  |
>> | 8 KB              | 443.8 MB/s | 288.8 MB/s  |
>> | 16 KB             | 443.8 MB/s | 275.3 MB/s  |
>> | 32 KB             | 443.8 MB/s | 276.1 MB/s  |
>> | 64 KB             | 443.6 MB/s | 276.1 MB/s  |
>> | 128 KB            | 443.6 MB/s | 275.2 MB/s  |
>> +-------------------+------------+-------------+
>>
>> * dm-clone achieves 1.5 to 2 times more read BW than
>> dm-snapshot+dm-raid.
>>
>> Metadata/Storage overhead
>> =========================
>>
>> dm-clone had a _maximum_ metadata overhead of around 20 MB for all
>> benchmarks. As dm-clone doesn't require any extra COW space for
>> temporarily storing the written data (writes just go directly to the
>> clone device) this is the _only_ storage overhead incurred by dm-clone,
>> irrespective of the amount of the written data.
>>
>> On the other hand, the COW space utilization of dm-snapshot, for the
>> bandwidth benchmarks, varied from 11.95 GB to 20.41 GB, depending on the
>> amount of written data.
>>
>> I want to emphasize that after the cloning/syncing is complete we have
>> to merge this multi-gigabyte COW space back to the clone/destination
>> device. This will cause _further_ performance degradation, which is
>> _not_ reflected in the above performance measurements, but _will_ be
>> present in real workloads, if the dm-snapshot based solution is used.
>>
>>
>> To summarize, dm-clone performs _significantly_ better than a
>> dm-snapshot based solution, in all respects (latency, IOPS, BW), and with
>> a _fraction_ of the storage/metadata overhead.
>>
>> If you have any more questions, I would be more than happy to discuss
>> them with you.
>>
>> Thanks,
>> Nikos
>>
>>> On 7/10/19 8:45 PM, Nikos Tsironis wrote:
>>>> On 7/10/19 12:28 AM, Heinz Mauelshagen wrote:
>>>>> Hi Nikos,
>>>>>
>>>>> what is the crucial factor your target offers vs. resynchronizing such a
>>>>> latency distinct
>>>>> 2-legged mirror with a read-write snapshot (local, fast exception store)
>>>>> on top, tearing the
>>>>> mirror down keeping the local leg once fully in sync and merging the
>>>>> snapshot back into it?
>>>>>
>>>>> Heinz
>>>>>
>>>> Hi Heinz,
>>>>
>>>> The most significant benefits of dm-clone over the solution you propose
>>>> are significantly better performance, no need for extra COW space, no
>>>> need to merge back a snapshot, and the ability to skip syncing the
>>>> unused space of a file system.
>>>>
>>>> 1. In order to ensure snapshot consistency, dm-snapshot needs to
>>>> commit a completed exception, before signaling the completion of the
>>>> write that triggered it to upper layers.
>>>>
>>>> The persistent exception store commits exceptions every time a
>>>> metadata area is filled or when there are no more exceptions
>>>> in-flight. For a 4K chunk size we have 256 exceptions per metadata
>>>> area, so the best case scenario is one commit per 256 writes. Here I
>>>> assume a write with size equal to the chunk size of dm-snapshot,
>>>> e.g., 4K, so there is no COW overhead, and that we write to new
>>>> chunks, so we need to allocate new exceptions.
>>>>
>>>> Part of committing the metadata is flushing the cache of the
>>>> underlying device, if there is one. We have seen SSDs which can
>>>> sustain hundreds of thousands of random write IOPS, but they take up
>>>> to 8ms to flush their cache. In such a case, flushing the SSD cache
>>>> every few writes significantly degrades performance.
>>>>
>>>> Moreover, dm-snapshot forces exceptions to complete in the order they
>>>> were allocated, to avoid snapshot space leak on crash (commit
>>>> 230c83afdd9cd). This inserts further latency in exception completions
>>>> and thus user write completions.
>>>>
>>>> On the other hand, when cloning a device we don't need to be so
>>>> strict and can rely on committing the metadata every time a FLUSH or
>>>> FUA bio is written, or periodically, like dm-thin and dm-cache do.
>>>>
>>>> dm-clone does exactly that. When a region/chunk is cloned or
>>>> over-written by a write, we just set a bit in the relevant in-core
>>>> bitmap. The metadata are committed once every second or when we
>>>> receive a FLUSH or FUA bio.
>>>>
>>>> This improves performance significantly and results in increased IOPS
>>>> and reduced latency, especially in cases where flushing the disk
>>>> cache is very expensive.
>>>>
>>>> 2. For large devices, e.g. multi terabyte disks, resynchronizing the
>>>> local leg can take a lot of time. If the application running over the
>>>> local device is write-heavy, dm-snapshot will end up allocating a
>>>> large number of exceptions. This increases the number of hash table
>>>> collisions and thus increases the time we need to do a hash table
>>>> lookup.
>>>>
>>>> dm-snapshot needs to look up the exception hash tables in order to
>>>> service an I/O, so this increases latency and degrades performance.
>>>>
>>>> On the other hand, dm-clone is just testing a bit to see if a region
>>>> is cloned or not and decides what to do based on that test.
>>>>
>>>> 3. With dm-clone there is no need to reserve extra COW space for
>>>> temporarily storing the written data, while the clone device is
>>>> syncing. Nor would one need to worry about monitoring and expanding
>>>> the COW device to prevent it from filling up.
>>>>
>>>> 4. With dm-clone there is no need to merge back potentially several
>>>> gigabytes once cloning/syncing completes. We also avoid the relevant
>>>> performance degradation incurred by the merging process. Writes just
>>>> go directly to the clone device.
>>>>
>>>> 5. dm-clone implements support for discards, so it can skip
>>>> cloning/syncing the relevant regions. In the case of a large block
>>>> device which contains a filesystem with empty space, e.g. a 2TB
>>>> device containing 500GB of useful data in a filesystem, this can
>>>> significantly reduce the time needed to sync/clone.
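The commit arithmetic in item 1 can be made concrete with a short Python sketch. It assumes each stored exception occupies 16 bytes (two 64-bit chunk numbers), which is consistent with the 256 exceptions per 4K metadata area mentioned above, and it models a cache flush as a fixed cost spread evenly over the writes between two commits. The function names and the 4-writes-per-commit case are illustrative choices, not figures from the measurements.

```python
DISK_EXCEPTION_BYTES = 16  # assumption: two 64-bit chunk numbers per exception


def exceptions_per_area(chunk_bytes):
    """Number of exceptions that fit in one metadata area of one chunk."""
    return chunk_bytes // DISK_EXCEPTION_BYTES


def amortized_flush_usec(flush_msec, writes_per_commit):
    """Average flush cost added to each write, in microseconds."""
    return flush_msec * 1000.0 / writes_per_commit


print(exceptions_per_area(4096))                  # 4K chunk size
print(amortized_flush_usec(8, 256), "usec")       # best case: one commit per 256 writes
print(amortized_flush_usec(8, 4), "usec")         # a commit every 4 writes
```

With an 8 ms flush, the best case of one commit per 256 writes adds roughly 31 usec to each write on average, while a commit every 4 writes adds 2 msec, which is far more than the SSD's own write latency.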
>>>>
>>>> This was a rather long email, but I hope it makes clearer the
>>>> significant benefits of dm-clone over dm-snapshot, and our rationale
>>>> behind the decision to implement a new target.
>>>>
>>>> I would be more than happy to continue the conversation and focus on any
>>>> other questions you may have.
>>>>
>>>> Thanks,
>>>> Nikos