[dm-devel] [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target

Vasily Tarasov tarasov at vasily.name
Fri Jan 23 16:27:39 UTC 2015


Hi Vivek,

Thanks for reading our paper! Please find the answers to the issues
you raised inline.

> Hi,
>
> I have quickly browsed through the paper above and have some very
> basic questions.
>
> - What real-life workload is really going to benefit from this? Do you
>   have any numbers for that?
>
>   I see one example of storing multiple Linux trees in tar format, and for
>   the sequential write case performance has almost halved with the CBT
>   backend. And we had a dedup ratio of 1.88 (for the perfect case).
>
>   INRAM numbers I think really don't count because it is not practical to
>   keep all metadata in RAM. And the case of keeping all data in NVRAM is
>   still a little futuristic.
>
>   So this sounds like too huge a performance penalty to me to be really
>   useful on real-life workloads?

Dm-dedup is designed so that different metadata backends can be
implemented easily. We first implemented the Copy-on-Write (COW)
backend because device-mapper already provides a COW-based persistent
metadata library. That library was specifically designed for various
device-mapper targets to store metadata reliably in a common way.
Using the COW library allows us to reuse well-tested code that is
already in the kernel instead of increasing the size of our
submission.
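
To give a rough idea of the split between the core and a backend, here
is a minimal sketch of what such a pluggable metadata interface can
look like. The names and signatures below are illustrative only, not
the exact ones from the patches:

#include <stdint.h>

struct dedup_kvs;                       /* opaque key-value store handle */

/*
 * Each backend (CBT, INRAM, ...) fills in this ops table; the core
 * dedup logic calls through it and never needs to know how the two
 * key-value stores (hash->PBN and LBN->PBN) are persisted.
 */
struct dedup_metadata_ops {
        int  (*init)(void *init_param, struct dedup_kvs **hash_pbn,
                     struct dedup_kvs **lbn_pbn);
        void (*exit)(void);

        int  (*kvs_lookup)(struct dedup_kvs *kvs, const void *key,
                           uint32_t ksize, void *val, uint32_t vsize);
        int  (*kvs_insert)(struct dedup_kvs *kvs, const void *key,
                           uint32_t ksize, const void *val, uint32_t vsize);
        int  (*kvs_delete)(struct dedup_kvs *kvs, const void *key,
                           uint32_t ksize);

        int  (*alloc_data_block)(uint64_t *pbn); /* next free physical block */
        int  (*inc_refcount)(uint64_t pbn);
        int  (*dec_refcount)(uint64_t pbn);
        int  (*flush)(void);                     /* commit metadata to disk */
};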

You're right, however, that the COW B-tree exhibits relatively high
I/O overhead, which might not be acceptable in some environments. For
such environments, new backends with higher performance will be added
in the future. As examples, we present the DTB and INRAM backends in
the paper. The INRAM backend is so simple that we even include it in
the submitted patches. We envision it being used in cases similar to
Intel's pmfs (persistent memory file system). Persistent memory is not
that futuristic anymore, IMHO :)

Talking about workloads: many workloads have uneven performance
profiles, so CBT's cache can absorb peaks and then flush metadata
during the lower-load phases. In many cases the deduplication ratio is
also higher, e.g., for file systems that store hundreds of VM disk
images, backups, etc. So we believe that for many situations the CBT
backend is practical.

>
> - Why did you implement an inline deduplication as opposed to out-of-line
>   deduplication? Section 2 (Timeliness) in the paper just mentions
>   out-of-line dedup but does not go into detail about why you chose
>   an in-line one.
>
>   I am wondering whether it would not make sense to first implement
>   out-of-line dedup and punt a lot of the cost to a worker thread (which
>   kicks in only when the storage is idle). That way, even if one doesn't
>   get a high dedup ratio for a workload, inserting a dedup target in the
>   stack will be less painful from a performance point of view.

Both in-line and off-line deduplication approaches have their own
pluses and minuses. Among the minuses of the off-line approach are
that it requires extra space to buffer non-deduplicated writes and
that it re-reads data from disk when deduplication happens (i.e., more
I/O is used). It also complicates space-usage accounting: a user might
run out of space even though the deduplication process would later
discover many duplicate blocks.

Our final goal is to support both approaches, but for this code
submission we wanted to limit the amount of new code. In-line
deduplication is the core part, around which we can implement off-line
dedup by adding an extra thread that reuses the same logic as the
in-line path.
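
Purely as an illustration of what we mean by reuse (the names below
are hypothetical, not taken from the patches), the in-line path and a
future off-line worker would both funnel into the same per-block
routine:

#include <stdint.h>

#define BLOCK_SIZE 4096
#define HASH_SIZE  32                 /* e.g., a SHA-256 digest */

struct dedup_ctx;                     /* stands in for the target's state */

/* Shared core: given a block's hash, either map the LBN to an existing
 * physical block or allocate a new block and record the hash. */
int dedup_block(struct dedup_ctx *ctx, uint64_t lbn,
                const uint8_t hash[HASH_SIZE], const void *data);

/* In-line path: called for every incoming write bio. */
int inline_write(struct dedup_ctx *ctx, uint64_t lbn,
                 const uint8_t hash[HASH_SIZE], const void *data)
{
        return dedup_block(ctx, lbn, hash, data);
}

/* Off-line worker: later re-reads blocks that were written without
 * deduplication and pushes them through the very same core routine. */
int offline_pass_block(struct dedup_ctx *ctx, uint64_t lbn,
                       int (*read_block)(struct dedup_ctx *, uint64_t, void *),
                       void (*hash_block)(const void *, uint8_t *))
{
        uint8_t data[BLOCK_SIZE];
        uint8_t hash[HASH_SIZE];
        int r = read_block(ctx, lbn, data);   /* the extra read is the
                                                 off-line cost noted above */
        if (r)
                return r;
        hash_block(data, hash);
        return dedup_block(ctx, lbn, hash, data);
}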

>
> - You mentioned that a random workload will become sequential with dedup.
>   That will be true only if there is a single writer, isn't it? Have
>   you run your tests with multiple writers doing random writes, and did
>   you get the same kind of improvements?
>
>   Also, on the flip side, a sequential file will become random if multiple
>   writers are overwriting their sequential files (as you always allocate
>   a new block upon overwrite), and that will hit performance.


Even with multiple random writers, the workload at the data-device
level becomes sequential. The reason is that we allocate blocks on the
data device in the order in which requests are inserted into the I/O
queue, no matter which process inserted them.
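
In code terms, the allocator boils down to something like the sketch
below (simplified and illustrative, not the actual patch code):

#include <stdint.h>

struct data_allocator {
        uint64_t next_free_pbn;   /* next unused physical block number */
        uint64_t total_blocks;    /* size of the data device in blocks */
};

/*
 * Returns the PBN for a new (non-duplicate) chunk, or -1 if the data
 * device is full.  Because allocation order equals request arrival
 * order, consecutive new chunks land on consecutive physical blocks,
 * no matter which process or which logical offset they came from.
 */
static int64_t alloc_next_block(struct data_allocator *a)
{
        if (a->next_free_pbn >= a->total_blocks)
                return -1;
        return (int64_t)a->next_free_pbn++;
}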

You're right, however, that as with any log-structured file system,
sequential allocation of data blocks in Dm-dedup leads to
fragmentation. Blocks that belong to the same file, for example, might
not be close together if multiple writers wrote them at different
times. Moreover, such fragmentation is a general problem with any
deduplication system: if an identical chunk belongs to two (or more)
files, then the layout can be sequential for at most one of those
files (and possibly for none of them).

In the future, defragmentation mechanisms can be implemented to
mitigate this effect.

>
> - What is 4KB chunking? Is it the same as saying that the block size will
>   be 4KB? If yes, I am concerned that this might turn out to be a performance
>   bottleneck.

Yes, "chunk" is the conventional name for the unit of deduplication.
Dm-dedup users can configure the chunk size according to their
workload and performance requirements. Larger chunks generally mean
less metadata and more sequential allocation, but a lower
deduplication ratio.
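
As a back-of-the-envelope illustration of the metadata side of this
trade-off (the per-entry size below is an assumption for illustration,
not the exact on-disk format):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        const uint64_t data_size  = 1ULL << 40;          /* 1 TiB data device */
        const uint64_t entry_size = 32 + 8 + 4;          /* hash + PBN + refcount (assumed) */
        const uint64_t chunk_sizes[] = { 4096, 65536 };  /* 4 KiB vs. 64 KiB chunks */

        for (int i = 0; i < 2; i++) {
                uint64_t chunks = data_size / chunk_sizes[i];
                printf("%6llu-byte chunks: %llu entries, ~%llu MiB of hash index\n",
                       (unsigned long long)chunk_sizes[i],
                       (unsigned long long)chunks,
                       (unsigned long long)(chunks * entry_size >> 20));
        }
        return 0;
}

With these assumed numbers, going from 4 KiB to 64 KiB chunks shrinks
the hash index by roughly 16x, at the price of missing any duplicates
smaller than 64 KiB.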

Vasily



