[dm-devel] Potential enhancements to dm-thin v2

Zdenek Kabelac zkabelac at redhat.com
Mon Apr 11 20:16:43 UTC 2022


On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>> On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
>>
>> Your proposal actually breaks this sequence and would move things to the
>> state of 'guessing which state we are in now' (and IMHO presents much more
>> risk than the hypothetical problem with suspend from user-space - which is
>> only a problem if you are using the suspended device as 'swap' or 'rootfs' -
>> so there are very easy ways to orchestrate your LVs to avoid such problems).
> The intent is less “guess what states we are now” and more “It looks
> like dm-thin already has the data structures needed to store some
> per-thin metadata, and that could make writing a simple userspace volume
> manager FAR FAR easier”.  It appears to me that the only change needed


I won't spend hours explaining all the details - but running just the suspend
alone may result in many different problems, of which running the thin-pool
out of data space is one of the easiest to hit.

Basically each step must be designed with a 'power-off' happening during the
operation in mind. For each step you need to know what the recovery step looks
like and how the lvm2 & kernel metadata could/would match together. Combining
many steps together into a single 'kernel' call just increases the already
large range of errors.  So in many cases we simply favour keeping operations
more 'low-level-atomic', even at a slightly higher performance price (as said -
we've never seen the creation of a snapshot to be a 'msec'-critical operation -
the 'suspend' with its implicit flush & fsfreeze may itself be a far more
expensive operation).
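
To illustrate what keeps each step small and recoverable: at the kernel
level a thin snapshot is roughly a suspend of the origin, one message to
the pool, and a resume. A minimal sketch (device names and dev_ids below
are purely illustrative - lvm2 additionally commits its own metadata
around these calls, which is where the power-off recovery points live):

  # suspend flushes outstanding I/O so the snapshot is crash-consistent
  dmsetup suspend /dev/mapper/vg-origin
  # ask the pool to snapshot thin dev_id 1 into new dev_id 2
  dmsetup message /dev/mapper/vg-pool-tpool 0 "create_snap 2 1"
  dmsetup resume /dev/mapper/vg-origin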



>> But IMHO creation and removal of thousands of devices in a very short period
>> of time rather suggests there is something sub-optimal in your original
>> software design, as I'm really having a hard time imagining why you would
>> need this?
> There very well could be (suggestions for improvement welcome).
>
>> If you wish to operate lots of devices - simply keep them created and ready
>> - and eventually blkdiscard them for the next device reuse.
> That would work for volatile volumes, but those are only about 1/3 of
> the volumes in a Qubes OS system.  The other 2/3 are writable snapshots.
> Also, Qubes OS has found blkdiscard on thins to be a performance
> problem.  It used to lock up entire pools until Qubes OS moved to doing
> the blkdiscard in chunks.

Always make sure you use a recent Linux kernel.

blkdiscard should not differ too much from lvremove - also experiment with how
'lvchange --discards passdown|nopassdown poolLV' behaves.
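
As a side note on the 'blkdiscard in chunks' workaround: recent util-linux
blkdiscard can do the chunking itself via its --step option - a small
illustration (the LV path and step size are just examples):

  # discard the whole LV in 1GiB iterations instead of one huge request
  blkdiscard --step 1G /dev/vg/volatile_lv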


>> I'm also unsure where any special need to instantiate that many snapshots
>> would arise from - if there is some valid & logical purpose - lvm2 could
>> maybe have an extended user-space API to create multiple snapshots at once
>> (i.e. create 10 snapshots with name-%d of a single thinLV)
> This would be amazing, and Qubes OS should be able to use it.  That
> said, Qubes OS would prefer to be able to choose the name of each volume
> separately.  Could there be a more general batching operation?  Just
> supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
> ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
> --activate=y’ as well would be even better.

There is a kind of 'hidden' plan inside the command-line processing to allow
'grouped' processing, e.g.:

lvcreate --snapshot  --name lv1  --snapshot --name lv2 vg/origin

However, there is currently no manpower to proceed further on this part, as we
have other parts of the code that need enhancements.

But we may put this on our TODO list...
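
Until such grouped processing exists, the obvious workaround is a plain
shell loop, one lvcreate per snapshot (names are illustrative) - note this
still pays the per-command metadata commit and udev sync cost that
grouping is meant to avoid:

  for i in $(seq 1 10); do
      lvcreate --snapshot --name "name-$i" vg/origin
  done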

>> Not to mention that operating that many thin volumes from a single thin-pool
>> is also nothing close to the high performance goal you try to reach...
> Would you mind explaining?  My understanding, and the basis of
> essentially all my feature requests in this area, was that virtually all
> of the cost of LVM is the userspace metadata operations, udev syncing,
> and device scanning.  I have been assuming that the kernel does not have
> performance problems with large numbers of thin volumes.


The main idea behind the comment is: as disk usage increases, the manipulation
of thin-pool metadata and the associated locking will soon start to become a
considerable performance problem.

So while it's easy to have 1000 UNUSED thinLVs active from a single thin-pool,
the situation is dramatically different when those LVs are under heavy load.
There you should keep the number of active thinLVs down to the low tens,
especially if you are performance oriented.  Lighter usage, less provisioning,
and especially a bigger block size all improve the situation.
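
For completeness, the block (chunk) size is chosen when the thin-pool is
created and cannot be changed afterwards - a minimal sketch with
illustrative names and sizes:

  # bigger chunks mean fewer mappings and less metadata churn, at the cost
  # of coarser provisioning and sharing granularity
  lvcreate --type thin-pool -L 500G --chunksize 1m -n pool vg
  lvcreate --thin -V 20G -n thin1 vg/pool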


>
> Right now, my machine has 334 active thin volumes, split between one
> pool on an NVMe drive and one on a spinning hard drive.  The pool on an
> NVMe drive has 312 active thin volumes, of which I believe 64 are in use.
> Are these numbers high enough to cause significant performance
> penalties for dm-thin v1, and would they cause problems for dm-thin v2?

There are no numbers for v2 yet.

For v1 - 64 thins might eventually experience some congestion under heavy load
(compared with a 'native' raw spindle).

> How much of a performance win can I expect from only activating the
> subset of volumes I actually use?


I can only advise benchmarking with a good approximation of your expected
workload.

In some cases it may turn out that your workload is not too sensitive to the
various locking limitations.
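
For example, something like fio run against a few of the active thinLVs in
parallel, with the block size and queue depth of your real workload, gives
a reasonable first approximation (paths and numbers are purely
illustrative, and the job is destructive to the target LVs):

  # random writes force provisioning and so exercise the pool metadata path
  fio --name=thinbench --filename=/dev/vg/thin1:/dev/vg/thin2 \
      --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
      --direct=1 --runtime=60 --time_based --group_reporting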


Regards


Zdenek



