[dm-devel] Potential enhancements to dm-thin v2

Mon Apr 11 22:30:53 UTC 2022

On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
> Dne 11. 04. 22 v 19:22 Demi Marie Obenour napsal(a):
> > On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
> > > Dne 11. 04. 22 v 0:03 Demi Marie Obenour napsal(a):
> > > 
> > > Your proposal actually breaks this sequence and would move things to the
> > > state of  'guess at which states we are now'. (and IMHO presents much more
> > > risk than virtual problem with suspend from user-space - which is only a
> > > problem if you are using suspended device as 'swap' and 'rootfs' - so there
> > > are very easy ways how to orchestrate your LVs to avoid such problems).
> > The intent is less “guess what states we are now” and more “It looks
> > like dm-thin already has the data structures needed to store some
> > per-thin metadata, and that could make writing a simple userspace volume
> > manager FAR FAR easier”.  It appears to me that the only change needed
> 
> 
> I do not spend hours explaining all the details - but running just the
> suspend alone may result in many differnt problem where the things like
> running thin-pool out-of-data space is one of the easiest.
> 
> Basically each step must be designed with  'power-off' happen during the
> operation. For each step you need to know how the recovery step looks like
> and how the lvm2 & kernel metadata c/would match together.

That is absolutely the case, and is in fact the reason I proposed this
change to begin with.  By having dm-thin store a small amount of
userspace-provided metadata for each thin volume, and by providing an
API to enumerate the thin volumes in a pool, I can store all of the
metadata I need in the thin pool itself.  This is much simpler than
having to store metadata outside of the pool.

> Combining many
> steps together into a single 'kernel' call just increases already large
> range of errors.  So in many case we simply do favour to keep operation more
> 'low-level-atomic' even at slight higher performance price (as said - we've
> never seen a creation of snapshot to be 'msec' critical operation - as  the 
> 'suspend' with implicit flush & fsfreeze itself might be far more expensive
> operation.

Qubes OS should never be snapshotting an in-use volume of any kind.
Right now, there is one case where it does so, but that is a bug, and I
am working on fixing it.  A future API might support snapshotting to an
in-use volume, but that would likely require a way to tell the VM to
freeze its own filesystem.

> > > But IMHO creation and removal of thousands of devices in very short period
> > > of time rather suggest there is something sub-optimal in your original
> > > software design as I'm really having hard time imagining why would you need
> > > this ?
> > There very well could be (suggestions for improvement welcome).
> > 
> > > If you wish to operate lots of devices - keep them simply created and ready
> > > - and eventually blkdiscard them for next device reuse.
> > That would work for volatile volumes, but those are only about 1/3 of
> > the volumes in a Qubes OS system.  The other 2/3 are writable snapshots.
> > Also, Qubes OS has found blkdiscard on thins to be a performance
> > problem.  It used to lock up entire pools until Qubes OS moved to doing
> > the blkdiscard in chunks.
> 
> Always make sure you use recent Linux kernels.

Should the 5.16 series be recent enough?

> Blkdiscard should not differ from lvremove too much  - also experiment how
> the  'lvchange --discards  passdown|nopassdown poolLV' works.

I believe this was with passdown on, which is the default in Qubes OS.
The bug was tracked down by Jinoh Kang in
https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
and found to be due to dm-thin deleting B-tree nodes one at a time,
causing large amounts of time to be wasted on btree rebalancing and node
locking.

> > > I'm also unsure from where would arise any special need to instantiate  that
> > > many snapshots -  if there is some valid & logical purpose -   lvm2 can have
> > > extended user space API to create multiple snapshots at once maybe (so
> > > i.e.    create  10 snapshots   with      name-%d  of a single thinLV)
> > This would be amazing, and Qubes OS should be able to use it.  That
> > said, Qubes OS would prefer to be able to choose the name of each volume
> > separately.  Could there be a more general batching operation?  Just
> > supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
> > ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
> > --activate=y’ as well would be even better.
> 
> There is kind of 'hidden' plan inside command line processing to allow
> 'grouped'  processing.
> 
> lvcreate --snapshot  --name lv1  --snapshot --name lv2 vg/origin
> 
> However there is currently no man power to proceed further on this part as
> we have other parts of code needed enhancements.
> 
> But we may put this on our TODO plans...

That would be great, thanks!

> > > Not to mentioning operating that many thin volumes from a single thin-pool
> > > is also nothing close to high performance goal you try to reach...
> > Would you mind explaining?  My understanding, and the basis of
> > essentially all my feature requests in this area, was that virtually all
> > of the cost of LVM is the userspace metadata operations, udev syncing,
> > and device scanning.  I have been assuming that the kernel does not have
> > performance problems with large numbers of thin volumes.
> 
> 
> The main idea behind the comment is -  when there is increased disk usage -
> the manipulation with thin-pool metadata and locking will soon start to be a
> considerable performance problem.
> 
> So while it's easy to have active  1000 thinLVs from a single thin-pool that
> are UNUSED, situation is dramatically different when there LVs would be in
> some heavy use load.  There you should keep the active thinLV at low number
> of  tens  LVs, especially if you are performance oriented.  The lighter
> usage and less provisioning and especially bigger block size - improve

I can try to modify the storage pool so that LVs are not activated by
default.  That said, Qubes OS will always be provisioning-heavy.  With
the notable exception of volatile volumes, qubesd always snapshots a
volume at startup and then provides the snapshot to the VM.  After
shutdown, the original volume is renamed to be a backup, and the
snapshot gets the name of the original volume.  Bigger block sizes would
substantially increase write amplification, as turning off zeroing is
not an option for security reasons.

Is this just a workload that dm-thin is ill-suited for?  Qubes OS does
support storing VM images on either BTRFS or XFS files, and it could be
that this is a better plan going forward.

> > Right now, my machine has 334 active thin volumes, split between one
> > pool on an NVMe drive and one on a spinning hard drive.  The pool on an
> > NVMe drive has 312 active thin volumes, of which I believe 64 are in use.
> > Are these numbers high enough to cause significant performance
> > penalties for dm-thin v1, and would they cause problems for dm-thin v2?
> 
> There are not yet any numbers for v2
> 
> For v1 - 64 thins might eventually experience some congestion for heavy load
> (compared with 'native' raw spindle).
> 
> > How much of a performance win can I expect from only activating the
> > subset of volumes I actually use?
> 
> 
> I can only advice benchmark with some good approximation of your expected
> workload.

That’s already on the Qubes OS team’s (very long) to-do list.

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20220411/0fcae85a/attachment.sig>