[libvirt RFC] add API for parallel Saves (not for committing)

Daniel P. Berrangé berrange at redhat.com
Thu Apr 28 08:47:43 UTC 2022


On Mon, Apr 25, 2022 at 03:25:43PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange at redhat.com) wrote:
> > On Mon, Apr 25, 2022 at 01:33:41PM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrangé (berrange at redhat.com) wrote:
> > > > On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Daniel P. Berrangé (berrange at redhat.com) wrote:
> > > > > > I think we need to introduce an explicit 'file:' protocol
> > > > > > for the migrate command, that is backed by the blockdev APIs
> > > > > > so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> > > > > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > > > > is a stream or a regular file, so it can choose between the
> > > > > > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > > > > > auto-detect with fstat()).  If we do this, then multifd
> > > > > > doesn't end up needing multiple save files on disk, all
> > > > > > the threads can be directly writing to the same file, just
> > > > > > at the relevant offsets on disk to match the RAM page
> > > > > > location.
> > > > > 
> > > > > Hmm so what I'm not sure of is whether it makes sense to use the normal
> > > > > migration flow/code for this or not; and you're suggesting a few
> > > > > possibly contradictory things.
> > > > > 
> > > > > Adding a file: protocol would be pretty easy (whether it went via
> > > > > the blockdev layer or not); getting it to be more efficient is the
> > > > > tricky part, because we've got loads of levels of stream abstraction in
> > > > > the RAM save code:
> > > > >     QEMUFile->channel->OS
> > > > > but then if you want to enforce alignment you somehow have to make that
> > > > > go all the way down.
> > > > 
> > > > The QIOChannel stuff doesn't add buffering, so I wasn't worried
> > > > about alignment there.
> > > > 
> > > > QEMUFile has optional buffering which would mess with alignment,
> > > > but we could turn that off potentially for the RAM transfer, if
> > > > using multifd.
> > > 
> > > The problem isn't whether they add buffering or not; the problem is
> > > you now need a mechanism to request alignment.
> > > 
> > > > I'm confident, though, that the performance on the QEMU side
> > > > could exceed what's viable with libvirt's iohelper today, as we
> > > > would definitely be eliminating one copy and many context switches.
> > > 
> > > Yes but you get that just from adding a simple file: (or fd:) mode
> > > without trying to do anything clever with alignment or rewriting the
> > > same offset.
> > 
> > I don't think so, as libvirt supports O_DIRECT today to avoid
> > trashing the host cache when saving VMs. So to be able to
> > offload libvirt's work to QEMU, O_DIRECT is a pre-requisite.
> 
> I guess you could O_DIRECT it from a buffer in QemuFile or the channel.
> 
> > So we do need the alignment support at the very least. Rewriting
> > at the same offset isn't mandatory, but I think it'd make multifd
> > saner if trying to have all threads work on the same file.
> 
> Thinking on the fly, you'd need some non-trivial changes:
> 
>   a) A section entry in the format to say 'align to ... n bytes'
>     (easyish)

Yep

>   b) A way to allocate a location in the file to a RAMBlock
>     [ We already have a bitmap address, so that might do, but
>     you need to make it interact with the existing file, so it might
>     be easier to do the allocation and record it ]

IIUC, the migration protocol first serializes all RAM, and then serializes
the VMstate for devices.  When libvirt creates a save image for a VM it
writes its own header + XML dump and then appends the migration stream.

So we get a layout of

   +------------------+
   | libvirt header   |
   +------------------+
   | libvirt XML      |
   | ...              |
   +------------------+
   | migration stream |
   | ...              |
   +------------------+

The 'migration stream' is opaque as far as libvirt is concerned,
but we happen to know that from QEMU's POV it expands to

   +------------------+
   | RAM stream       |
   | ...              |
   +------------------+
   | vmstate          |
   | ...              |
   +------------------+

Where 'RAM stream' is a sequence of RAM block headers and RAM block
contents, one record per page.

In the suggestion above, my desire is to achieve this layout


   +------------------+
   | libvirt header   |
   +------------------+
   | libvirt XML      |
   | ...              |
   +------------------+
   | RAM              |
   | ...              |
   +------------------+
   | vmstate          |
   | ...              |
   +------------------+

The key difference being 'RAM' instead of 'RAM stream'. Libvirt
would have to tell QEMU what offset in the file it is permitted
to start at - say 16 MB.  'RAM' would be a 1:1 mapping of the
guest RAM, simply offset by that 16 MB.

IOW, I'm assuming that a 4 GB RAM VM would have its RAM written
starting from offset 16 MB and ending at 4 GB + 16 MB.

I'm thinking stuff like virtio-mem / RAM hotplug makes life
harder though, as there can be many distinct blocks of RAM
contributing to the guest memory map, and we need to be able to
declare an ordering of them for mapping to the file.



>   c) A way to say to the layers below it while writing RAM that it
>     needs to go in a given location.

Yes, QEMUFile would need some new APIs accepting offsets, and we
would need a way to report whether a given impl supports random
access vs streaming only.

>   d) A clean way for a..c only to happen in this case.



>   e) Hmm ram size changes/hotplug/virtio-mem

This is "easy" from libvirt's POV, since none of those things can be
invoked during save/restore/migrate.

Hard(er) for QEMU since it doesn't place that restriction on users in the
general case.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

