[dm-devel] [announce] thin-provisioning-tools v1.0.0-rc1
Eric Wheeler
dm-devel at lists.ewheeler.net
Fri Mar 3 21:21:13 UTC 2023
On Thu, 2 Mar 2023, Joe Thornber wrote:
> Hi Eric,
>
> On Wed, Mar 1, 2023 at 10:26 PM Eric Wheeler <dm-devel at lists.ewheeler.net> wrote:
>
> Hurrah! I've been looking forward to this for a long time...
>
>
> ...So if you have any commentary on the future of dm-thin with respect
> to metadata range support, or dm-thin performance in general, I would
> be very curious about your roadmap and plans.
>
>
> The plan over the next few months is roughly:
>
> - Get people using the new Rust tools. They are _so_ much faster than
> the old C++ ones. [available now]
> - Push upstream a set of patches I've been working on to boost thin
> concurrency performance. These are nearing completion and are
> available here for those who are interested:
> https://github.com/jthornber/linux/tree/2023-02-28-thin-concurrency-7.
> These are making a huge difference to performance in my testing, e.g.,
> fio with 16 jobs running concurrently gets several times the throughput.
> [Upstream in the next month hopefully]
It would be nice to get people testing the new improvements: do you think
they can make it into the 6.3 merge window that is currently open? (A rough
fio invocation that should show the difference is sketched below.)
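For anyone else who wants to try the branch, something along these lines is
the kind of workload I'd use to reproduce the concurrency win (device path
and job parameters are my guesses, not Joe's actual test):

    # random 4k writes, 16 concurrent jobs, against a thin volume
    # (destroys data on the volume, of course)
    fio --name=thin-concurrency --filename=/dev/mapper/vg-thinvol \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --iodepth=32 --numjobs=16 --time_based --runtime=60 \
        --group_reporting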
> - Change thinp metadata to store ranges rather than individual mappings.
> This will reduce the amount of space the metadata consumes, and have
> the knock-on effect of boosting performance slightly (less metadata
> means faster lookups). However I consider this a half-way house, in
> that I'm only going to change the metadata and not start using ranges
> within the core target (I'm not moving away from fixed block sizes).
> [Next 3 months]
Good idea. If I remember right, thin_dump's XML output already expresses
contiguous runs as ranges; an example is below.
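Here's what that looks like today (field values invented for illustration;
run against an inactive pool, or use -m with a metadata snapshot on a live
one):

    # thin_dump /dev/mapper/vg-pool_tmeta
    <superblock uuid="" time="1" transaction="2" data_block_size="128"
                nr_data_blocks="262144">
      <device dev_id="1" mapped_blocks="1040" transaction="0"
              creation_time="0" snap_time="1">
        <range_mapping origin_begin="0" data_begin="0" length="1024" time="0"/>
        <single_mapping origin_block="1024" data_block="2048" time="1"/>
      </device>
    </superblock>

So the on-disk btree storing one entry per block is the remaining piece.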
> I don't envisage significant changes to dm-thin or dm-cache after this.
Seems reasonable.
> Longer term I think we're nearing a crunch point where we drastically
> change how we do things. Since I wrote device-mapper in 2001, the speed
> of devices has increased so much that I think dm is no longer doing a
> good job:
>
> - The layering approach introduces inefficiencies with each layer.
> Sure it may only be a 5% hit to add another linear mapping into the
> stack. But those 5%'s add up.
> - dm targets only see individual bios rather than the whole request
> queue. This prevents a lot of really useful optimisations. Think how
> much smarter dm-cache and dm-thin could be if they could look at the
> whole queue.
> - The targets are getting too complicated. I think dm-thin is around 8k
> lines of code, though it shares most of that with dm-cache. I
> understand the dedup target from the vdo guys weighs in at 64k lines.
> Kernel development is fantastically expensive (or slow, depending on
> how you want to look at it). I did a lot of development work on
> thinp v2, and it was looking a lot like a filesystem shoe-horned into
> the block layer. I can see why bcache turned into bcachefs.
Did thinp v2 get dropped, or just turn into the patchset above?
> - Code within the block layer is memory constrained. We can't make
> arbitrarily sized allocations within targets; instead we have to use
> mempools of fixed-size objects (frowned upon these days), or declare
> up front how much memory we need to service a bio (forcing us to
> assume the worst case).
> This stuff isn't hard, just tedious and makes coding sophisticated targets pretty joyless.
>
> So my plan going forwards is to keep the fast path of these targets in
> kernel (e.g., a write to a provisioned, unsnapshotted region), but take
> the slow paths out to userland.
Seems reasonable.
> I think io_uring and ublk have shown us that this is viable. That way
> a snapshot copy-on-write or a dm-cache data migration, which are very
> slow operations, can be done with ordinary userland code.
It would be nice to minimize CoW latency somehow if going through userspace
increases it by a notable amount. CoW on spinning disks is definitely
slow, but NVMe drives are pretty quick to copy a 64k chunk.
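On viability: the ublksrv demo targets already run an entire block device
from a userspace daemon, e.g. (flags from memory, see
https://github.com/ming1/ubdsrv for the current syntax):

    ublk add -t loop -f /tmp/backing.img   # creates /dev/ublkb0, I/O served in userspace
    ublk list                              # show running ublk devices
    ublk del -n 0                          # tear down device 0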
> For the fast paths, layering will be removed by having userland give
> the kernel instructions to execute for specific regions of the virtual
> device (i.e., remap to here).
Maybe you just answered my question of latency?
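If I follow, this generalizes what dm tables already do: userland loads
per-region rules and the kernel just executes the remap. For comparison,
a two-segment table as dmsetup loads it today (device numbers made up):

    # dmsetup table vg-stacked
    0 2097152 linear 8:16 0
    2097152 2097152 linear 8:32 0

The difference, as I read it, is that the new scheme would update such rules
at runtime, per region, without a full table reload.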
> The kernel driver will have nothing specific to thin/cache etc. I'm not
> sure how many of the current dm-targets would fit into this model, but
> I'm sure thin provisioning, caching, linear, and stripe can.
To be clear, linear and stripe would stay in the kernel?
-Eric
>
> - Joe
>
> Thanks again for all your great work on this.
>
> -Eric
>
> > [note: _data_ sharing was always maintained, this is purely about metadata space usage]
> >
> > # thin_metadata_pack/unpack
> >
> > These are a couple of new tools intended for support work. They compress
> > thin metadata, typically to a tenth of the size (much better than you'd
> > get with generic compressors). This makes it easier to pass damaged
> > metadata around for inspection.
> >
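(For anyone who hasn't tried them, usage is simple -- the device and file
names here are just examples:)

    thin_metadata_pack   -i /dev/mapper/vg-pool_tmeta -o pool.pack
    thin_metadata_unpack -i pool.pack -o pool-tmeta.bin
    # then inspect pool-tmeta.bin with thin_check / thin_dump as usual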
> > # blk-archive
> >
> > The blk-archive tools were initially part of this thin-provisioning-tools
> > package, but have now been split off into their own project:
> >
> > https://github.com/jthornber/blk-archive
> >
> > They allow efficient archiving of thin devices (data deduplication
> > and compression), which will be of interest to those of you who are
> > holding large numbers of snapshots in thin pools as a poor man's backup.
> >
> > In particular:
> >
> > - Thin snapshots can be used to archive live data.
> > - It avoids reading unprovisioned areas of thin devices.
> > - It can calculate deltas between thin devices to minimise
> > how much data is read and deduped (incremental backups).
> > - Restoring to a thin device tries to maximise data sharing
> > within the thin pool (a big win if you're restoring snapshots).
> >
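From my memory of the README, the workflow is roughly as follows; this is a
sketch, so check the repo above for the current subcommands and flags:

    blk-archive create -a /backup/archive                  # one-time archive setup
    blk-archive pack   -a /backup/archive /dev/mapper/vg-thinsnap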