[dm-devel] Potential enhancements to dm-thin v2

Zdenek Kabelac zkabelac at redhat.com
Tue Apr 12 09:32:09 UTC 2022


On 12. 04. 22 at 0:30, Demi Marie Obenour wrote:
> On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
>> On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
>>> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>>>> On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
>>>>
>>>> Your proposal actually breaks this sequence and would move things to the
>>>> state of 'guess which state we are in now'. (And IMHO it presents much more
>>>> risk than the hypothetical problem with suspend from user-space - which is
>>>> only a problem if you are using the suspended device as 'swap' or 'rootfs' -
>>>> so there are very easy ways to orchestrate your LVs to avoid such problems.)
>>> The intent is less “guess what states we are now” and more “It looks
>>> like dm-thin already has the data structures needed to store some
>>> per-thin metadata, and that could make writing a simple userspace volume
>>> manager FAR FAR easier”.  It appears to me that the only change needed
>>
>> I will not spend hours explaining all the details - but running just the
>> suspend alone may result in many different problems, where something like
>> running the thin-pool out of data space is one of the easiest to hit.
>>
>> Basically each step must be designed with a 'power-off' happening during the
>> operation in mind. For each step you need to know what the recovery step
>> looks like and how the lvm2 & kernel metadata could/would match together.
> That is absolutely the case, and is in fact the reason I proposed this
> change to begin with.  By having dm-thin store a small amount of
> userspace-provided metadata for each thin volume, and by providing an
> API to enumerate the thin volumes in a pool, I can store all of the
> metadata I need in the thin pool itself.  This is much simpler than
> having to store metadata outside of the pool.

Hi

Here is actually the fundamental problem with your proposal: our design is
based on a careful split between user-space and kernel over 'who is the
owner/holder of the information'. Your proposal unfortunately does not fit the
model where lvm2 is the authoritative owner of the info about devices. Note
that we also tried the model where the info is held within the target - our
mdraid dm wrapper - but it has caused more trouble compared with the very
clear thin logic. So from the lvm2 position we have no plans to change this
proven model.

What you are asking for is for the kernel module to do all the work: lvm2
would obtain its info from the kernel metadata, and eventually you would be
able to command everything through the ioctl() interface, letting the
complexity sit completely in the kernel. But as explained, our design is
heading in the opposite direction: what can be done in user space stays in
user space, and the kernel does the necessary minimum, which can then be
developed and traced much more easily.
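For illustration only - this is roughly what the 'everything driven through the
kernel interface' path already looks like today when commanding dm-thin
directly with dmsetup (device names and IDs below are made-up placeholders,
and real code would also have to handle transaction IDs and error recovery):

   # create a new thin device with id 1 inside an existing pool
   dmsetup message /dev/mapper/pool 0 "create_thin 1"
   # snapshot device 1 as device 2 (device 1 must be quiesced first)
   dmsetup message /dev/mapper/pool 0 "create_snap 2 1"
   # activate thin device 2 as a 10GiB block device (size in 512-byte sectors)
   dmsetup create thin2 --table "0 20971520 thin /dev/mapper/pool 2"
   # delete a thin device that is no longer needed
   dmsetup message /dev/mapper/pool 0 "delete 1"

Everything beyond the numeric device IDs - names, ownership, which ID belongs
to which volume - is exactly the metadata lvm2 keeps in user space.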

>> Combining many steps together into a single 'kernel' call just increases the
>> already large range of errors.  So in many cases we simply favour keeping
>> operations more 'low-level-atomic', even at a slightly higher performance
>> price (as said - we've never seen creation of a snapshot as an
>> 'msec'-critical operation, as the 'suspend' with its implicit flush &
>> fsfreeze might itself be a far more expensive operation).
> Qubes OS should never be snapshotting an in-use volume of any kind.
> Right now, there is one case where it does so, but that is a bug, and I
> am working on fixing it.  A future API might support snapshotting an
> in-use volume, but that would likely require a way to tell the VM to
> freeze its own filesystem.


Yeah - you have a very unusual use case. In fact the lvm2 goal is usually to
support as many things as we can while devices are in use, so the user does
not need to take them offline - which surely complicates everything a lot.
Also there has basically never been any user demand to operate on offline
devices in a very quick way - so admittedly that is not a focus area of
development.
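For the in-use case the usual sequence is freeze, snapshot, unfreeze - a
minimal sketch, assuming a filesystem mounted at /mnt/data on vg/vm-root (both
names are just examples; in the Qubes case the freeze would have to happen
inside the guest instead):

   fsfreeze --freeze /mnt/data               # quiesce the filesystem
   lvcreate -s vg/vm-root -n vm-root-snap    # take the thin snapshot
   fsfreeze --unfreeze /mnt/data             # let writes continue

When lvm2 itself suspends a mounted origin it already performs the implicit
flush & freeze mentioned above.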

>>>> But IMHO creation and removal of thousands of devices in a very short
>>>> period of time rather suggests there is something sub-optimal in your
>>>> original software design, as I'm really having a hard time imagining why
>>>> you would need this?
>>> There very well could be (suggestions for improvement welcome).
>>>
>>>> If you wish to operate lots of devices - simply keep them created and ready
>>>> - and eventually blkdiscard them for the next device reuse.
>>> That would work for volatile volumes, but those are only about 1/3 of
>>> the volumes in a Qubes OS system.  The other 2/3 are writable snapshots.
>>> Also, Qubes OS has found blkdiscard on thins to be a performance
>>> problem.  It used to lock up entire pools until Qubes OS moved to doing
>>> the blkdiscard in chunks.
>> Always make sure you use recent Linux kernels.
> Should the 5.16 series be recent enough?
>
>> Blkdiscard should not differ from lvremove too much - also experiment with
>> how 'lvchange --discards  passdown|nopassdown poolLV' works.
> I believe this was with passdown on, which is the default in Qubes OS.
> The bug was tracked down by Jinoh Kang in
> https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
> and found to be due to dm-thin deleting B-tree nodes one at a time,
> causing large amounts of time to be wasted on btree rebalancing and node
> locking.
>
>>>> I'm also unsure where any special need to instantiate that many snapshots
>>>> would arise from - but if there is some valid & logical purpose, lvm2 could
>>>> maybe grow an extended user-space API to create multiple snapshots at once
>>>> (i.e. create 10 snapshots with name-%d of a single thinLV)
>>> This would be amazing, and Qubes OS should be able to use it.  That
>>> said, Qubes OS would prefer to be able to choose the name of each volume
>>> separately.  Could there be a more general batching operation?  Just
>>> supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
>>> ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
>>> --activate=y’ as well would be even better.
>> There is a kind of 'hidden' plan inside the command-line processing to allow
>> 'grouped' processing.
>>
>> lvcreate --snapshot  --name lv1  --snapshot --name lv2 vg/origin
>>
>> However there is currently no manpower to proceed further with this part, as
>> we have other parts of the code needing enhancements.
>>
>> But we may put this on our TODO plans...
> That would be great, thanks!

Although the main reason to support this kind of API was the request to
support an atomic snapshot of multiple LVs at once - so far it has not been a
high priority.
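In the meantime the only option is one lvcreate call per snapshot - a minimal
sketch of the current workaround (VG and LV names below are just examples):

   # the grouped 'lvcreate --snapshot --name lv1 --snapshot --name lv2
   # vg/origin' form is not implemented yet, so loop in user space instead:
   for name in vm1 vm2 vm3; do
       lvcreate -s vg/template-root -n "root-$name"
   done

Each call still pays the full metadata-commit and udev-sync cost, which is why
a batched API would help workloads like yours.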


>>>> Not to mention that operating that many thin volumes from a single
>>>> thin-pool is also nothing close to the high-performance goal you are
>>>> trying to reach...
>>> Would you mind explaining?  My understanding, and the basis of
>>> essentially all my feature requests in this area, was that virtually all
>>> of the cost of LVM is the userspace metadata operations, udev syncing,
>>> and device scanning.  I have been assuming that the kernel does not have
>>> performance problems with large numbers of thin volumes.
>>
>> The main idea behind the comment is that as disk usage increases, the
>> manipulation of thin-pool metadata and the locking will soon start to become
>> a considerable performance problem.
>>
>> So while it's easy to have 1000 thinLVs from a single thin-pool active while
>> they are UNUSED, the situation is dramatically different when those LVs are
>> under heavy load.  There you should keep the number of active thinLVs in the
>> low tens, especially if you are performance oriented.  Lighter usage, less
>> provisioning and especially a bigger block size - improve
> I can try to modify the storage pool so that LVs are not activated by
> default.  That said, Qubes OS will always be provisioning-heavy.  With
> the notable exception of volatile volumes, qubesd always snapshots a


You definitely should keep active ONLY the LVs you need to have active -
keeping 'unused' LVs active impacts many other kernel areas and consumes
system resources.
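A minimal sketch of what that looks like (VG/LV names are just examples; note
that thin snapshots carry the activation-skip flag by default, hence the -K):

   lvchange -an vg/vm1-root-snap        # deactivate while the VM is stopped
   lvchange -ay -K vg/vm1-root-snap     # activate only right before use

lvcreate also accepts -an to create a volume without activating it at all.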


> volume at startup and then provides the snapshot to the VM.  After
> shutdown, the original volume is renamed to be a backup, and the
> snapshot gets the name of the original volume.  Bigger block sizes would
> substantially increase write amplification, as turning off zeroing is
> not an option for security reasons.

For 'snapshot'-heavy loads, smaller chunks are usually better - but that comes
with a price.


> Is this just a workload that dm-thin is ill-suited for?  Qubes OS does
> support storing VM images on either BTRFS or XFS files, and it could be
> that this is a better plan going forward.


Not knowing the details - but as mentioned, 'zeroing' is not needed for
'filesystem' security - a modern filesystem will never let you read unwritten
data, as it keeps its own map of written data - but of course if a user has
root access to the device, with 'dd' they could read some 'unwritten' data on
that device...
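For reference, this is the knob that trade-off is about (pool and VG names are
just examples):

   lvcreate -L 100G -Zn --thinpool tpool vg   # create a thin-pool with zeroing off
   lvchange -Zy vg/tpool                      # re-enable zeroing of new chunks

With zeroing disabled, a freshly provisioned chunk is only protected by the
filesystem's own map of written data, not at the raw block-device level -
which is exactly the distinction above.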


>
>>> How much of a performance win can I expect from only activating the
>>> subset of volumes I actually use?
>>
>> I can only advise benchmarking with some good approximation of your expected
>> workload.
> That’s already on the Qubes OS team’s (very long) to-do list.
>
I'd prioritize this - to get the best balance for performance.  I.e. slightly
bigger chunks could give you much better numbers if your 'snapshot' workload
is focused on small 'areas', so you know exactly where the focus should go
(too many cooks spoil the broth)...

So even a jump from 64k -> 256K can be significant.
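A quick way to compare, if you have spare space in the VG (pool and volume
names are just examples):

   lvcreate -L 50G --thinpool bench64  --chunksize 64K  vg
   lvcreate -L 50G --thinpool bench256 --chunksize 256K vg
   lvcreate -V 20G -T vg/bench64  -n test64
   lvcreate -V 20G -T vg/bench256 -n test256

Then snapshot and exercise each test LV with a workload close to your real one
and compare the numbers.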


Regards


Zdenek



