[libvirt RFCv8 00/27] multifd save restore prototype

Daniel P. Berrangé berrange at redhat.com
Wed May 11 11:57:54 UTC 2022


On Wed, May 11, 2022 at 01:47:13PM +0200, Claudio Fontana wrote:
> On 5/11/22 10:27 AM, Christophe Marie Francois Dupont de Dinechin wrote:
> > 
> > 
> >> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange at redhat.com> wrote:
> >>
> >> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >>> This is v8 of the multifd save prototype, which fixes a few bugs,
> >>> adds a few more code splits, and records the number of channels
> >>> as well as the compression algorithm, so the restore command is
> >>> more user-friendly.
> >>>
> >>> It is now possible to just say:
> >>>
> >>> virsh save mydomain /mnt/saves/mysave --parallel
> >>>
> >>> virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things work with the default of 2 channels, no compression.
> >>>
> >>> It is also possible to say of course:
> >>>
> >>> virsh save mydomain /mnt/saves/mysave --parallel
> >>>      --parallel-connections 16 --parallel-compression zstd
> >>>
> >>> virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things also work fine, due to channels and compression
> >>> being stored in the main save file.
> >>
> >> For the sake of people following along, the above commands will
> >> result in the creation of multiple files
> >>
> >>  /mnt/saves/mysave
> >>  /mnt/saves/mysave.0
> >>  /mnt/saves/mysave.1
> >>  ....
> >>  /mnt/saves/mysave.n
> >>
> >> Where 'n' is the number of threads used.
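
For illustration, a minimal sketch of how those per-channel file names
could be derived from the main save path (channel_path() is a
hypothetical helper, not existing libvirt code):

    #include <stdio.h>

    /* Map /mnt/saves/mysave to /mnt/saves/mysave.0, .1, ..., .n,
     * one extra file per multifd channel, as listed above. */
    static void channel_path(char *buf, size_t buflen,
                             const char *savepath, unsigned int channel)
    {
        snprintf(buf, buflen, "%s.%u", savepath, channel);
    }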
> >>
> >> Overall I'm not very happy with the approach of doing any of this
> >> on the libvirt side.
> >>
> >> Backing up, we know that QEMU can directly save to disk faster than
> >> libvirt can. We mitigated a lot of that overhead with previous patches
> >> to increase the pipe buffer size, but some still remains due to the
> >> extra copies inherent in handing this off to libvirt.
> >>
> >> Using multifd on the libvirt side, IIUC, gets us better performance
> >> than QEMU can manage when doing a non-multifd write to file directly,
> >> but we still have the extra copies in there due to the hand-off
> >> to libvirt. If QEMU were directly capable of writing to
> >> disk with multifd, it should beat us again.
> >>
> >> As a result of how we integrate with QEMU multifd, we're taking the
> >> approach of saving the state across multiple files, because it is
> >> easier than trying to get multiple threads writing to the same file.
> >> It could be solved by using file range locking on the save file.
> >> eg a thread can reserve say 500 MB of space, fill it up, and then
> >> reserve another 500 MB, etc, etc. It is a bit tedious though and
> >> won't align nicely. eg a 1 GB huge page would be 1 GB + a few
> >> bytes of QEMU RAM save state header.
> 
> 
> I am not familiar enough to know whether this approach would work with multifd without breaking
> the existing format; maybe David could answer this.
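
As a sketch of the range-reservation idea above, here is one way multiple
threads could share a single save file, using a shared atomic offset
counter in place of file range locks (CHUNK_SIZE, next_offset and
write_reserved() are illustrative names, not libvirt or QEMU code):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <unistd.h>

    #define CHUNK_SIZE (500ULL << 20)       /* 500 MB per reservation */

    static _Atomic uint64_t next_offset;    /* shared by all writer threads */

    /* Write one buffer (at most CHUNK_SIZE bytes) into a freshly
     * reserved range of the file; reservations never overlap because
     * they come from the atomic counter. Error handling elided. */
    static void write_reserved(int fd, const void *buf, size_t len)
    {
        uint64_t off = atomic_fetch_add(&next_offset, CHUNK_SIZE);
        size_t done = 0;

        while (done < len) {
            ssize_t n = pwrite(fd, (const char *)buf + done,
                               len - done, off + done);
            if (n < 0)
                return;
            done += (size_t)n;
        }
    }

The tedious part mentioned above is exactly that such fixed-size
reservations do not pack variable-sized state (e.g. a huge page plus
its stream header) without holes or misalignment.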
> 
> > 
> > First, I do not understand why you would write things that are
> > not page-aligned to start with? (As an aside, I don’t know
> > how any dirty tracking would work if you do not keep things
> > page-aligned).
> 
> Yes, alignment is one issue I encountered, and one that in my view would _still_ need to be solved
> _whatever_ we put inside QEMU in the future, as it also breaks any attempt to be more
> efficient (using alternative APIs to read/write etc),
> 
> and is the reason why the iohelper is still needed in my patchset for the main file, causing one extra copy for the main channel.
> 
> The libvirt header (metadata, domain XML, etc.) that wraps the QEMU VM ends at an arbitrary address, e.g.:
> 
> 00000000: 4c69 6276 6972 7451 656d 7564 5361 7665  LibvirtQemudSave
> 00000010: 0300 0000 5b13 0100 0100 0000 0000 0000  ....[...........
> 00000020: 3613 0000 0200 0000 0000 0000 0000 0000  6...............
> 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000050: 0000 0000 0000 0000 0000 0000 3c64 6f6d  ............<dom
> 00000060: 6169 6e20 7479 7065 3d27 6b76 6d27 3e0a  ain type='kvm'>.
> 
> 
> 
> 000113a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000113b0: 0000 0000 0000 0051 4556 4d00 0000 0307  .......QEVM.....
> 000113c0: 0000 000d 7063 2d69 3434 3066 782d 362e  ....pc-i440fx-6.
> 000113d0: 3201 0000 0003 0372 616d 0000 0000 0000  2......ram......
> 000113e0: 0004 0000 0008 c00c 2004 0670 632e 7261  ........ ..pc.ra
> 000113f0: 6d00 0000 08c0 0000 0014 2f72 6f6d 4065  m........./rom@e
> 00011400: 7463 2f61 6370 692f 7461 626c 6573 0000  tc/acpi/tables..
> 00011410: 0000 0002 0000 0770 632e 6269 6f73 0000  .......pc.bios..
> 00011420: 0000 0004 0000 1f30 3030 303a 3030 3a30  .......0000:00:0
> 00011430: 322e 302f 7669 7274 696f 2d6e 6574 2d70  2.0/virtio-net-p
> 00011440: 6369 2e72 6f6d 0000 0000 0004 0000 0670  ci.rom.........p
> 00011450: 632e 726f 6d00 0000 0000 0200 0015 2f72  c.rom........./r
> 00011460: 6f6d 4065 7463 2f74 6162 6c65 2d6c 6f61  om@etc/table-loa
> 00011470: 6465 7200 0000 0000 0010 0012 2f72 6f6d  der........./rom
> 00011480: 4065 7463 2f61 6370 692f 7273 6470 0000  @etc/acpi/rsdp..
> 00011490: 0000 0000 1000 0000 0000 0000 0010 7e00  ..............~.
> 000114a0: 0000 0302 0000 0003 0000 0000 0000 2002  .............. .
> 000114b0: 0670 632e 7261 6d00 0000 0000 0000 3022  .pc.ram.......0"
> 
> 
> in my view, at a minimum we have to start by adding enough padding before the start of
> the QEMU VM (the QEVM magic) so that it begins at a page-aligned address.
> 
> I would add one patch to this effect to my prototype, as I think this should not be very controversial.

We already add padding before the QEMU migration stream begins, but
we're just doing a fixed 64 KB. The intent was to allow us to edit
the embedded XML. We could easily round this up to a sensible
boundary if needed.
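
As a sketch of that rounding (align_to_page() is a hypothetical helper,
not existing libvirt code), assuming the target is the host page size;
the same computation applies with a 1 GB alignment for huge-page-backed
guests:

    #include <stdint.h>
    #include <unistd.h>

    /* Round the end of the libvirt header + embedded XML up to the
     * next page boundary so the QEMU stream ("QEVM") starts
     * page-aligned; with 4 KiB pages, offset 0x113b7 becomes 0x12000. */
    static uint64_t align_to_page(uint64_t header_end)
    {
        uint64_t pagesize = (uint64_t)getpagesize();

        return (header_end + pagesize - 1) & ~(pagesize - 1);
    }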

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

