[dm-devel] Potential enhancements to dm-thin v2

Demi Marie Obenour demi at invisiblethingslab.com
Tue Apr 12 11:58:29 UTC 2022


On 4/12/22 05:32, Zdenek Kabelac wrote:
> On 12. 04. 22 at 0:30, Demi Marie Obenour wrote:
>> On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
>>> On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
>>>> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>>>>> On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
>>>>>
>>>>> Your proposal actually breaks this sequence and would move things to the
>>>>> state of 'guess at which states we are now'. (And IMHO it presents much more
>>>>> risk than the hypothetical problem with suspend from user-space - which is only a
>>>>> problem if you are using the suspended device as 'swap' or 'rootfs' - so there
>>>>> are very easy ways to orchestrate your LVs to avoid such problems.)
>>>> The intent is less “guess what states we are now” and more “It looks
>>>> like dm-thin already has the data structures needed to store some
>>>> per-thin metadata, and that could make writing a simple userspace volume
>>>> manager FAR FAR easier”.  It appears to me that the only change needed
>>>
>>> I will not spend hours explaining all the details - but running just the
>>> suspend alone may result in many different problems, where things like
>>> running the thin-pool out of data space are among the easiest.
>>>
>>> Basically each step must be designed with a 'power-off' happening during the
>>> operation in mind. For each step you need to know what the recovery step looks
>>> like and how the lvm2 & kernel metadata could/would match together.
>> That is absolutely the case, and is in fact the reason I proposed this
>> change to begin with.  By having dm-thin store a small amount of
>> userspace-provided metadata for each thin volume, and by providing an
>> API to enumerate the thin volumes in a pool, I can store all of the
>> metadata I need in the thin pool itself.  This is much simpler than
>> having to store metadata outside of the pool.
> 
> Hi
> 
> Here is actually the fundamental problem with your proposal - our design was
> about a careful split between user-space and kernel over 'who is the owner/holder
> of the information' - your proposal unfortunately does not fit the model where lvm2
> is the authoritative owner of info about devices - note - we also tried the
> 'model' where the info is held within the target - our mdraid dm wrapper - but it
> had more troubles compared with the very clear thin logic.  So from the lvm2
> position - we do not have any plans to change this proven model.

This does not surprise me.  lvm2 already has the infrastructure to
store its own metadata and update it in a crash-safe way, so having the
kernel be able to store additional metadata would be of no benefit to
lvm2.  The intended use-case for this feature is dm-thin-specific tools,
which do not already have such infrastructure.

> What you are asking for is that the 'kernel' module does all the work - and
> lvm2 would be obtaining info from the kernel metadata - and eventually you
> would be able to command everything with the ioctl() interface, letting the
> complexity sit completely in the kernel - but as explained our design is heading
> in the opposite direction - what can be done in user-space stays in user-space
> and the kernel does the necessary minimum, which can then be developed and
> traced much more easily.

Not at all.  I just want userspace to be able to stash some data in
each thin and retrieve it later.  The complex logic would still remain
in userspace.  That’s why I dropped the “lookup thin by blob”
functionality: it requires a new data structure in the kernel, and
userspace can achieve almost the same effect with a cache.  Qubes OS
has a persistent daemon that has exclusive ownership of the storage,
so there are no cache invalidation problems.  The existing thin
pool already has a btree that could store the blob, so no new data
structures are required on the kernel side.
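
To make this concrete, here is a rough sketch of the kind of interface I
have in mind.  None of these pool messages exist in dm-thin today; the
message names, the dev_id, the hex-encoded blob, and the pool device name
are made up purely for illustration:

    # Hypothetical: attach a small opaque blob to the thin device with dev_id 5
    dmsetup message /dev/mapper/vg-pool-tpool 0 "set_userdata 5 7175626573"

    # Hypothetical: read the blob back, e.g. while rebuilding state after a crash
    dmsetup message /dev/mapper/vg-pool-tpool 0 "get_userdata 5"

    # Hypothetical: enumerate the dev_ids (and their blobs) present in the pool
    dmsetup message /dev/mapper/vg-pool-tpool 0 "list_devices"

Only 'dmsetup message' itself is real here; the point is just that the pool
stores and returns the blob, while all naming and recovery policy stays in
the user-space daemon.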

>>> Combining many steps together into a single 'kernel' call just increases the
>>> already large range of errors.  So in many cases we simply favour keeping
>>> operations more 'low-level-atomic', even at a slightly higher performance price
>>> (as said - we've never seen the creation of a snapshot to be a 'msec'-critical
>>> operation - as the 'suspend' with its implicit flush & fsfreeze might itself
>>> be a far more expensive operation).
>> Qubes OS should never be snapshotting an in-use volume of any kind.
>> Right now, there is one case where it does so, but that is a bug, and I
>> am working on fixing it.  A future API might support snapshotting an
>> in-use volume, but that would likely require a way to tell the VM to
>> freeze its own filesystem.
> 
> 
> Yeah - you have a very unusual use case - in fact the lvm2 goal is usually to
> support as many things as we can while devices are in use, so the user does not
> need to take them offline - which surely complicates everything a lot - also
> there was basically never any user demand to operate on offline devices in a
> very quick way - so admittedly it's not a focus area of development.

I wouldn’t exactly say my use case is *that* unusual.  My
understanding is that Docker has the same one, and they too resorted to
bypassing lvm2 and using raw dm-thin.  Docker does have the advantage
that the filesystem is assumed to be trusted, so they do not need eager
zeroing.  Stratis also uses dm-thin, and it, too, avoids using LVM.

>>>>> But IMHO the creation and removal of thousands of devices in a very short
>>>>> period of time rather suggests there is something sub-optimal in your original
>>>>> software design, as I'm really having a hard time imagining why you would need
>>>>> this?
>>>> There very well could be (suggestions for improvement welcome).
>>>>
>>>>> If you wish to operate lots of devices - simply keep them created and ready
>>>>> - and eventually blkdiscard them for the next device reuse.
>>>> That would work for volatile volumes, but those are only about 1/3 of
>>>> the volumes in a Qubes OS system.  The other 2/3 are writable snapshots.
>>>> Also, Qubes OS has found blkdiscard on thins to be a performance
>>>> problem.  It used to lock up entire pools until Qubes OS moved to doing
>>>> the blkdiscard in chunks.
>>> Always make sure you use recent Linux kernels.
>> Should the 5.16 series be recent enough?
>>
>>> Blkdiscard should not differ from lvremove too much - also experiment with how
>>> 'lvchange --discards passdown|nopassdown poolLV' works.
>> I believe this was with passdown on, which is the default in Qubes OS.
>> The bug was tracked down by Jinoh Kang in
>> https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
>> and found to be due to dm-thin deleting B-tree nodes one at a time,
>> causing large amounts of time to be wasted on btree rebalancing and node
>> locking.
>>
>>>>> I'm also unsure where any special need to instantiate that many snapshots
>>>>> would arise from - if there is some valid & logical purpose - lvm2 could maybe
>>>>> have an extended user-space API to create multiple snapshots at once (so
>>>>> e.g. create 10 snapshots with name-%d of a single thinLV)
>>>> This would be amazing, and Qubes OS should be able to use it.  That
>>>> said, Qubes OS would prefer to be able to choose the name of each volume
>>>> separately.  Could there be a more general batching operation?  Just
>>>> supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
>>>> ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
>>>> --activate=y’ as well would be even better.
>>> There is a kind of 'hidden' plan inside the command-line processing to allow
>>> 'grouped' processing.
>>>
>>> lvcreate --snapshot  --name lv1  --snapshot --name lv2 vg/origin
>>>
>>> However there is currently no manpower to proceed further on this part, as
>>> we have other parts of the code needing enhancements.
>>>
>>> But we may put this on our TODO plans...
>> That would be great, thanks!
> 
> Although the main reason to support this kind of API was the request to
> support atomic snapshots of multiple LVs at once - so far it is not a high
> priority.

Still, it will be useful if it becomes available.
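
For comparison, what my tooling has to do today is roughly one lvcreate
invocation per snapshot (the VG and LV names below are just placeholders):

    # Current approach (sketch): one command - and one round of user-space
    # metadata work, udev syncing and scanning - per snapshot
    for vm in work personal untrusted; do
        lvm lvcreate --snapshot --name "vm-${vm}-root-snap" "qubes_dom0/vm-${vm}-root"
    done

A grouped form like the one quoted above would collapse that into a single
invocation, which is where most of the cost we observe actually is.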

>>>>> Not to mention that operating that many thin volumes from a single thin-pool
>>>>> is also nothing close to the high-performance goal you are trying to reach...
>>>> Would you mind explaining?  My understanding, and the basis of
>>>> essentially all my feature requests in this area, was that virtually all
>>>> of the cost of LVM is the userspace metadata operations, udev syncing,
>>>> and device scanning.  I have been assuming that the kernel does not have
>>>> performance problems with large numbers of thin volumes.
>>>
>>> The main idea behind the comment is - when there is increased disk usage -
>>> the manipulation of thin-pool metadata and locking will soon start to be a
>>> considerable performance problem.
>>>
>>> So while it's easy to have 1000 thinLVs from a single thin-pool active but
>>> UNUSED, the situation is dramatically different when those LVs are under some
>>> heavy load.  There you should keep the number of active thinLVs in the low
>>> tens, especially if you are performance oriented.  Lighter usage, less
>>> provisioning, and especially a bigger block size improve
>> I can try to modify the storage pool so that LVs are not activated by
>> default.  That said, Qubes OS will always be provisioning-heavy.  With
>> the notable exception of volatile volumes, qubesd always snapshots a
> 
> 
> You should definitely keep active ONLY the LVs you need to have active -
> keeping 'unused' LVs active impacts many other kernel areas and consumes
> system resources.

Thank you.  There has already been work done in that direction and I
will see if I can finish it now that R4.1 is out.
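
For the curious, the direction is roughly the following (a sketch only; the
LV name is a placeholder and the exact flags we end up using may differ):

    # Deactivate a backup LV and mark it so normal activation skips it
    lvm lvchange --activate n qubes_dom0/vm-work-root-back
    lvm lvchange --setactivationskip y qubes_dom0/vm-work-root-back

    # Later, activate it explicitly, overriding the skip flag, only when needed
    lvm lvchange --activate y --ignoreactivationskip qubes_dom0/vm-work-root-back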

>> volume at startup and then provides the snapshot to the VM.  After
>> shutdown, the original volume is renamed to be a backup, and the
>> snapshot gets the name of the original volume.  Bigger block sizes would
>> substantially increase write amplification, as turning off zeroing is
>> not an option for security reasons.
> 
> For 'snapshot'-heavy loads - smaller chunks are usually better - but that comes
> with a price.
> 
> 
>> Is this just a workload that dm-thin is ill-suited for?  Qubes OS does
>> support storing VM images on either BTRFS or XFS files, and it could be
>> that this is a better plan going forward.
> 
> 
> Not knowing the details - but as mentioned, 'zeroing' is not needed for
> 'filesystem' security - a modern filesystem will never let you read unwritten
> data, as it keeps its own map of written data - but of course if the user
> has root access to the device, they could read some 'unwritten' data on
> that device with 'dd'...

Qubes OS aims to defend against an attacker with kernel privilege
in a VM, so zeroing is indeed necessary.  The only way to avoid the
overhead would be for dm-thin’s chunk size to match the block size
of the paravirtualized disks, so that there are only full-chunk writes.
The reflink storage pool does not have this problem because it works on
4K blocks natively.
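
For reference, both properties in question are fixed when the pool is created
and can be inspected afterwards; a sketch (the VG name, pool name and sizes
are placeholders, not our real configuration):

    # Inspect the pool's chunk size and whether zeroing of new chunks is enabled
    lvm lvs -o lv_name,chunksize,zero qubes_dom0/vm-pool

    # Both are chosen at pool creation time, e.g.:
    lvm lvcreate --type thin-pool --chunksize 64k --zero y \
        --size 100G --name vm-pool qubes_dom0

With zeroing enabled, the first partial write to a chunk costs a write of the
whole chunk, which is why the write amplification we see scales with the
chunk size.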

>>>> How much of a performance win can I expect from only activating the
>>>> subset of volumes I actually use?
>>>
>>> I can only advice benchmark with some good approximation of your expected
>>> workload.
>> That’s already on the Qubes OS team’s (very long) to-do list.
>>
> I'd prioritize this - to get the best balance for performance - i.e. slightly
> bigger chunks could give you much better numbers - if your 'snapshot' workload
> is focused on small 'areas', so you know exactly where the focus should go
> (too many cooks spoil the broth)...
> 
> So even a jump from 64k -> 256k can be significant.

I will try to do a proper benchmark at some point.
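
When I do, it will probably look something like the following (a rough
sketch only; the device name and fio parameters are placeholders to be
tuned, not a finished benchmark plan):

    # Run the same random-write job against a freshly snapshotted thinLV on
    # pools created with different --chunksize values (e.g. 64k vs 256k)
    fio --name=snapshot-cow --filename=/dev/qubes_dom0/vm-test-root-snap \
        --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
        --direct=1 --runtime=60 --time_based

Random writes against a fresh snapshot should expose the copy-on-write and
zeroing costs most clearly.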

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab