[dm-devel] Potential enhancements to dm-thin v2
Zdenek Kabelac
zkabelac at redhat.com
Tue Apr 12 09:32:09 UTC 2022
On 12. 04. 22 at 0:30, Demi Marie Obenour wrote:
> On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
>> On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
>>> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>>>> On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
>>>>
>>>> Your proposal actually breaks this sequence and would move things to a
>>>> state of 'guess which state we are in now'. (And IMHO it presents much
>>>> more risk than the hypothetical problem with suspend from user-space -
>>>> which is only a problem if you are using the suspended device as 'swap'
>>>> or 'rootfs' - so there are very easy ways to orchestrate your LVs to
>>>> avoid such problems.)
>>> The intent is less “guess what states we are now” and more “It looks
>>> like dm-thin already has the data structures needed to store some
>>> per-thin metadata, and that could make writing a simple userspace volume
>>> manager FAR FAR easier”. It appears to me that the only change needed
>>
>> I cannot spend hours explaining all the details - but running just the
>> suspend alone may result in many different problems, of which running the
>> thin-pool out of data space is one of the easiest to hit.
>>
>> Basically each step must be designed assuming a 'power-off' can happen
>> during the operation. For each step you need to know what the recovery
>> step looks like and how the lvm2 & kernel metadata could/would match
>> together.
> That is absolutely the case, and is in fact the reason I proposed this
> change to begin with. By having dm-thin store a small amount of
> userspace-provided metadata for each thin volume, and by providing an
> API to enumerate the thin volumes in a pool, I can store all of the
> metadata I need in the thin pool itself. This is much simpler than
> having to store metadata outside of the pool.
Hi
Here is actually the fundamental problem with your proposal: our design is
based on a careful split between user-space and kernel as to 'who is the
owner/holder of the information'. Your proposal unfortunately does not fit
the model where lvm2 is the authoritative owner of info about devices.
Note - we also tried the 'model' where the info is held within the target
- our mdraid dm wrapper - but it has more trouble compared with the very
clear thin logic. So from the lvm2 position, we do not have any plans to
change this proven model.
What you are asking for is that the 'kernel' module does all the work -
lvm2 would obtain its info from the kernel metadata - and eventually you
would be able to command everything through the ioctl() interface, letting
the complexity sit entirely in the kernel. But as explained, our design is
heading in the opposite direction: what can be done in user-space stays in
user-space, and the kernel does the necessary minimum, which can then be
developed and traced much more easily.
>> Combining many steps together into a single 'kernel' call just increases
>> the already large range of errors. So in many cases we simply prefer to
>> keep operations more 'low-level-atomic', even at a slightly higher
>> performance price (as said - we've never seen creation of a snapshot as a
>> 'msec'-critical operation - the 'suspend' with its implicit flush &
>> fsfreeze might itself be a far more expensive operation).
> Qubes OS should never be snapshotting an in-use volume of any kind.
> Right now, there is one case where it does so, but that is a bug, and I
> am working on fixing it. A future API might support snapshotting to an
> in-use volume, but that would likely require a way to tell the VM to
> freeze its own filesystem.
Yeah - you have a very unusual use case. In fact, the lvm2 goal is usually
to support as much as we can while devices are in use, so the user does
not need to take them offline - which surely complicates everything a lot.
Also, there was basically never any user demand to operate on offline
devices in a very quick way - so admittedly this is not a focus area of
development.
>>>> But IMHO creation and removal of thousands of devices in a very short
>>>> period of time rather suggests there is something sub-optimal in your
>>>> original software design, as I'm really having a hard time imagining
>>>> why you would need this?
>>> There very well could be (suggestions for improvement welcome).
>>>
>>>> If you wish to operate lots of devices - simply keep them created and
>>>> ready - and eventually blkdiscard them for the next device reuse.
>>> That would work for volatile volumes, but those are only about 1/3 of
>>> the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
>>> Also, Qubes OS has found blkdiscard on thins to be a performance
>>> problem. It used to lock up entire pools until Qubes OS moved to doing
>>> the blkdiscard in chunks.
>> Always make sure you use recent Linux kernels.
> Should the 5.16 series be recent enough?
>
>> Blkdiscard should not differ from lvremove too much - also experiment
>> with how 'lvchange --discards passdown|nopassdown poolLV' works.
> I believe this was with passdown on, which is the default in Qubes OS.
> The bug was tracked down by Jinoh Kang in
> https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
> and found to be due to dm-thin deleting B-tree nodes one at a time,
> causing large amounts of time to be wasted on btree rebalancing and node
> locking.
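The chunked-discard workaround described above can be sketched as a pure offset calculation. This is a hypothetical helper, not the actual Qubes OS code, and the 2 GiB default chunk size is an arbitrary assumption:

```python
def discard_chunks(dev_size, chunk_size=2 << 30):
    """Yield (offset, length) pairs covering a device of dev_size
    bytes in fixed-size pieces, so that each discard request (e.g.
    `blkdiscard --offset O --length L`) stays short and the
    thin-pool is not held busy by one huge discard."""
    offset = 0
    while offset < dev_size:
        length = min(chunk_size, dev_size - offset)
        yield offset, length
        offset += length
```

Issuing one short discard per chunk bounds how long any single request occupies the pool, at the cost of more ioctl round-trips.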
>
>>>> I'm also unsure where any special need to instantiate that many
>>>> snapshots would arise - if there is some valid & logical purpose, lvm2
>>>> could grow an extended user-space API to create multiple snapshots at
>>>> once (i.e. create 10 snapshots named name-%d of a single thinLV).
>>> This would be amazing, and Qubes OS should be able to use it. That
>>> said, Qubes OS would prefer to be able to choose the name of each volume
>>> separately. Could there be a more general batching operation? Just
>>> supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
>>> ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
>>> --activate=y’ as well would be even better.
>> There is a kind of 'hidden' plan inside the command-line processing to
>> allow 'grouped' processing.
>>
>> lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
>>
>> However there is currently no manpower to proceed further with this
>> part, as other parts of the code need enhancements.
>>
>> But we may put this on our TODO plans...
> That would be great, thanks!
Although the main reason to support this kind of API was a request to
support an atomic snapshot of multiple LVs at once - so far it is not a
high priority.
>>>> Not to mention that operating that many thin volumes from a single
>>>> thin-pool is also nothing close to the high performance goal you are
>>>> trying to reach...
>>> Would you mind explaining? My understanding, and the basis of
>>> essentially all my feature requests in this area, was that virtually all
>>> of the cost of LVM is the userspace metadata operations, udev syncing,
>>> and device scanning. I have been assuming that the kernel does not have
>>> performance problems with large numbers of thin volumes.
>>
>> The main idea behind the comment is: as disk usage increases, the
>> manipulation of thin-pool metadata and its locking will soon start to be
>> a considerable performance problem.
>>
>> So while it's easy to have 1000 thinLVs active from a single thin-pool
>> while they are UNUSED, the situation is dramatically different when those
>> LVs are under some heavy use load. There you should keep the number of
>> active thinLVs in the low tens, especially if you are performance
>> oriented. Lighter usage, less provisioning, and especially a bigger
>> block size all improve the situation.
> I can try to modify the storage pool so that LVs are not activated by
> default. That said, Qubes OS will always be provisioning-heavy. With
> the notable exception of volatile volumes, qubesd always snapshots a
You definitely should keep active ONLY the LVs you need to have active -
keeping 'unused' LVs active impacts many other kernel areas and consumes
system resources.
> volume at startup and then provides the snapshot to the VM. After
> shutdown, the original volume is renamed to be a backup, and the
> snapshot gets the name of the original volume. Bigger block sizes would
> substantially increase write amplification, as turning off zeroing is
> not an option for security reasons.
For 'snapshot'-heavy loads, smaller chunks are usually better - but that
comes with a price.
> Is this just a workload that dm-thin is ill-suited for? Qubes OS does
> support storing VM images on either BTRFS or XFS files, and it could be
> that this is a better plan going forward.
Not knowing the details - but as mentioned, 'zeroing' is not needed for
'filesystem' security - a modern filesystem will never let you read
unwritten data, as it keeps its own map of written data - but of course a
user with root access to the device could read some 'unwritten' data from
it with 'dd'...
>
>>> How much of a performance win can I expect from only activating the
>>> subset of volumes I actually use?
>>
>> I can only advise benchmarking with some good approximation of your
>> expected workload.
> That’s already on the Qubes OS team’s (very long) to-do list.
>
I'd prioritize this - to get the best balance for performance. I.e.
slightly bigger chunks could give you much better numbers if your
'snapshot' workload is focused on small 'areas', so you know exactly where
the focus should go (too many cooks spoil the broth)...
So even a jump from 64k -> 256k can be significant.
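As a back-of-the-envelope illustration of why chunk size matters when zeroing is enabled - a rough worst-case model, not how dm-thin accounts for I/O internally:

```python
def worst_case_write_amplification(chunk_size_kib, write_size_kib):
    # When a small write lands in a not-yet-provisioned chunk and
    # zeroing is on, the whole chunk must be written (zeroed) to
    # service it - so device-level writes per application write
    # approach chunk_size / write_size in the worst case.
    return chunk_size_kib / write_size_kib

# 4 KiB random writes, each provisioning a fresh chunk:
print(worst_case_write_amplification(64, 4))    # 64k chunks  -> 16.0
print(worst_case_write_amplification(256, 4))   # 256k chunks -> 64.0
```

Roughly the same cost shape applies to the first write into a chunk shared with a snapshot, since the copy-on-write also operates on whole chunks - which is why snapshot-heavy, small-write workloads pull in the opposite direction from the metadata-overhead argument for larger chunks.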
Regards
Zdenek