[libvirt RFCv8 00/27] multifd save restore prototype

Claudio Fontana cfontana at suse.de
Wed May 11 07:26:10 UTC 2022


Hi Daniel,

thanks for looking at this,

On 5/10/22 8:38 PM, Daniel P. Berrangé wrote:
> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>> This is v8 of the multifd save prototype, which fixes a few bugs,
>> adds a few more code splits, and records the number of channels
>> as well as the compression algorithm, so the restore command is
>> more user-friendly.
>>
>> It is now possible to just say:
>>
>> virsh save mydomain /mnt/saves/mysave --parallel
>>
>> virsh restore /mnt/saves/mysave --parallel
>>
>> and things work with the default of 2 channels, no compression.
>>
>> It is also possible to say of course:
>>
>> virsh save mydomain /mnt/saves/mysave --parallel
>>       --parallel-connections 16 --parallel-compression zstd
>>
>> virsh restore /mnt/saves/mysave --parallel
>>
>> and things also work fine, because the number of channels and the
>> compression algorithm are stored in the main save file.
> 
> For the sake of people following along, the above commands will
> result in creation of multiple files
> 
>   /mnt/saves/mysave
>   /mnt/saves/mysave.0

Just a minor correction: there is no .0 file; the extra files start at .1.

>   /mnt/saves/mysave.1
>   ....
>   /mnt/saves/mysave.n
> 
> Where 'n' is the number of threads used.
> 
> Overall I'm not very happy with the approach of doing any of this
> on the libvirt side.


OK, I understand your concern.

> 
> Backing up, we know that QEMU can directly save to disk faster than
> libvirt can. We mitigated a lot of that overhead with previous patches
> to increase the pipe buffer size, but some still remains due to the
> extra copies inherent in handing this off to libvirt.

Right;
still, the performance we get is insufficient for the use case we are trying to address,
even without libvirt in the picture.

With parallel save plus compression, however, we can make the numbers add up,
and for parallel save using multifd the overhead of libvirt is negligible.
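
(As an aside, the pipe buffer mitigation Daniel mentions above boils down to
something like the following; this is only a sketch assuming Linux's
F_SETPIPE_SZ fcntl, and the helper name is made up for illustration.)

#define _GNU_SOURCE            /* for F_SETPIPE_SZ on Linux */
#include <fcntl.h>
#include <stdio.h>

/* Sketch: enlarge the kernel pipe buffer on the save data path to reduce
 * the number of read/write round trips between QEMU and the helper. */
static int enlarge_pipe_buffer(int pipe_fd, int desired_size)
{
    /* F_SETPIPE_SZ returns the size actually set (the kernel may round it
     * up) or -1 on error; unprivileged callers are capped by
     * /proc/sys/fs/pipe-max-size. */
    int actual = fcntl(pipe_fd, F_SETPIPE_SZ, desired_size);
    if (actual < 0) {
        perror("fcntl(F_SETPIPE_SZ)");
        return -1;
    }
    fprintf(stderr, "pipe buffer is now %d bytes\n", actual);
    return 0;
}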

> 
> Using multifd on the libvirt side, IIUC, gets us better performance
> than QEMU can manage if doing a non-multifd write to file directly,
> but we still have the extra copies in there due to the hand-off
> to libvirt. If QEMU were directly capable of writing to
> disk with multifd, it should beat us again.

Hmm, I am thinking about this point, and at first glance I don't think it is 100% accurate.

If we do a parallel save with multifd as in this series, the overhead of libvirt compared with
doing it with QEMU only (skipping libvirt) is almost non-existent in my view: it is limited to
the single iohelper for the main channel (which is the smallest of the transfers), and maybe
even that could be removed.

This is because even without libvirt in the picture we are still migrating to a socket, and
something needs to transfer the data from that socket to a file. At that point I think both
libvirt and a custom-made script are in the same position.
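
Just to make that concrete: whatever sits between the migration socket and the file, be it
libvirt's iohelper or a hand-rolled tool, ends up doing essentially the copy loop below
(a minimal sketch; the function name and the 1 MiB buffer size are arbitrary choices of mine).

#include <errno.h>
#include <unistd.h>

/* Sketch of the unavoidable socket-to-file relay: read the migration
 * stream from the socket QEMU writes to and append it to the save file.
 * This copy exists with or without libvirt in the picture. */
static int relay_socket_to_file(int sock_fd, int file_fd)
{
    static char buf[1024 * 1024];          /* arbitrary 1 MiB buffer */
    ssize_t got;

    while ((got = read(sock_fd, buf, sizeof(buf))) != 0) {
        if (got < 0) {
            if (errno == EINTR)
                continue;                  /* retry interrupted reads */
            return -1;
        }
        ssize_t done = 0;
        while (done < got) {               /* handle short writes */
            ssize_t put = write(file_fd, buf + done, (size_t)(got - done));
            if (put < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += put;
        }
    }
    return 0;                              /* EOF: QEMU closed the stream */
}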

> 
> As a result of how we integrate with QEMU multifd, we're taking the
> approach of saving the state across multiple files, because it is
> easier than trying to get multiple threads writing to the same file.
> It could be solved by using file range locking on the save file,
> e.g. a thread can reserve say 500 MB of space, fill it up, and then
> reserve another 500 MB, etc. It is a bit tedious though and
> won't align nicely: e.g. a 1 GB huge page would be 1 GB + a few
> bytes of QEMU RAM save state header.
> 
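(Just to spell out that idea for illustration: each writer thread could atomically claim the
next fixed-size chunk of the shared save file and fill it with pwrite(), so no file locking
is strictly needed. The 500 MB chunk size and all names below are placeholders of mine, not
code from this series or from QEMU.)

#include <stdatomic.h>
#include <unistd.h>

#define CHUNK_SIZE (500ULL * 1024 * 1024)        /* "say 500 MB" */

static _Atomic unsigned long long next_offset;   /* shared by all writer threads */

/* Reserve the next chunk; returns the file offset this thread now owns. */
static unsigned long long reserve_chunk(void)
{
    return atomic_fetch_add(&next_offset, CHUNK_SIZE);
}

/* Write one block of migration data inside the reserved chunk.
 * 'used' is how much of the chunk this thread has already filled. */
static int write_block(int save_fd, unsigned long long chunk_off,
                       unsigned long long used, const void *buf, size_t len)
{
    if (used + len > CHUNK_SIZE)
        return -1;       /* caller must reserve a fresh chunk first */
    return pwrite(save_fd, buf, len,
                  (off_t)(chunk_off + used)) == (ssize_t)len ? 0 : -1;
}
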
> The other downside of multiple files is that it complicates life
> for both libvirt and apps using libvirt. They need to be aware of
> multiple files and move them around together. This is not as simple
> as it might sound. For example, IIRC OpenStack would upload a save
> image state into a glance bucket for later use. Well, now it needs
> multiple distinct buckets and has to keep track of them all. It also
> means we're forced to use the same concurrency level when restoring,
> which is not necessarily desirable if the host environment is different
> when restoring, i.e. the original host might have had 8 CPUs, but the
> new host might only have 4 available, or vice versa.
> 
> 
> I know it is appealing to do something on the libvirt side, because
> it is quicker than getting an enhancement into new QEMU release. We
> have been down this route before with the migration support in libvirt
> in the past though, when we introduced the tunnelled live migration
> in order to workaround QEMU's inability to do TLS encryption. I very
> much regret that we ever did this, because tunnelled migration was
> inherently limited, so for example failed to work with multifd,
> and failed to work with NBD based disk migration. In the end I did
> what I should have done at the beginning and just added TLS support
> to QEMU, making tunnelled migration obsolete, except we still have
> to carry the code around in libvirt indefinitely due to apps using
> it.
> 
> So I'm very concerned about not having history repeat itself and
> give us a long-term burden for a solution that turns out to be an
> evolutionary dead end.
> 
> I like the idea of parallel saving, but I really think we need to
> implement this directly in QEMU, not libvirt. As previously
> mentioned, I think QEMU needs to get a 'file' migration protocol,
> along with the ability to directly map RAM segments into fixed
> positions in the file. The benefits are many:
> 
>  - It will save & restore faster because we're eliminating data
>    copies that libvirt imposes via the iohelper
>  
>  - It is simple for libvirt & mgmt apps as we still only
>    have one file to manage
> 
>  - It is space efficient because if a guest dirties a
>    memory page, we just overwrite the existing contents
>    at the fixed location in the file, instead of appending
>    new contents to the file
> 
>  - It will restore faster too because we only restore
>    each memory page once, due to always overwriting the
>    file in-place when the guest dirtied a page during save
> 
>  - It can save and restore with differing numbers of threads,
>    and can even dynamically change the number of threads
>    in the middle of the save/restore operation
> 
> As David G has pointed out, the impl is not trivial on the QEMU
> side, but from what I understand of the migration code, it is
> certainly viable. Most importantly I think it puts us in a
> better position for long term feature enhancements later by
> taking the middle man (libvirt) out of the equation, letting
> QEMU directly know what medium it is saving/restoring to/from.
> 
> 
> With regards,
> Daniel
> 

It is probably possible to do this in QEMU, but with extensive changes, possibly to the
migration stream format itself, in order to get a more block-friendly, parallel migration
stream to a file.
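
Just to make that last point concrete: such a block-friendly layout could, for example, give
every guest RAM page a fixed position in the file, so that a page dirtied again during the
save is simply overwritten in place, and restore reads each page exactly once. The structure
and names below are purely hypothetical, sketched by me, and not an existing QEMU format.

#include <unistd.h>

/* Hypothetical fixed-layout save file: each RAM block owns a region at a
 * fixed base offset, and page N of that block always lives at
 * base + N * page_size. Any number of threads can write concurrently,
 * since the offsets never clash and need no shared file pointer. */
struct ram_block_map {
    unsigned long long file_base;    /* fixed start of this block's region */
    size_t page_size;
};

static off_t page_file_offset(const struct ram_block_map *map,
                              unsigned long long page_index)
{
    return (off_t)(map->file_base + page_index * map->page_size);
}

static int save_page(int save_fd, const struct ram_block_map *map,
                     unsigned long long page_index, const void *page)
{
    /* pwrite() is positional and thread-safe: re-dirtied pages simply
     * overwrite their slot instead of appending new copies to the file. */
    ssize_t n = pwrite(save_fd, page, map->page_size,
                       page_file_offset(map, page_index));
    return n == (ssize_t)map->page_size ? 0 : -1;
}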

Thanks,

Claudio
