[PATCH] failover: allow to pause the VM during the migration

Laurent Vivier lvivier at redhat.com
Fri Oct 1 06:48:06 UTC 2021


On 30/09/2021 22:17, Laine Stump wrote:
> On 9/30/21 1:09 PM, Laurent Vivier wrote:
>> To save a snapshot of a VM to a file, we used to follow
>> these steps:
>>
>> 1- stop the VM:
>>     (qemu) stop
>>
>> 2- migrate the VM to a file:
>>     (qemu) migrate "exec:cat > snapshot"
>>
>> 3- resume the VM:
>>     (qemu) cont
>>
>> After that we can restore the snapshot with:
>>    qemu-system-x86_64 ... -incoming "exec:cat snapshot"
>>    (qemu) cont
> 
> This is the basics of what libvirt does for a snapshot, and steps 1+2 are what it does for 
> a "managedsave" (where it saves the snapshot to disk and then terminates the qemu process, 
> for later re-animation).
> 
> In those cases, it seems like this new parameter could work for us - instead of explicitly 
> pausing the guest prior to migrating it to disk, we would set this new parameter to on, 
> then directly migrate-to-disk (relying on qemu to do the pause). Care will need to be 
> taken to assure that error recovery behaves the same though.

In case of error, the VM is restarted, as is done for a standard migration. I can 
change that if you need.

Another point: the VM state sent in the migration stream is "paused", which means that the 
machine needs to be resumed after the stream is loaded (from the file, or on the destination 
in the case of a real migration). This can also be changed to "running", so that the machine 
is resumed automatically at the end of the file loading (or real migration).
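
To make the comparison concrete, here is a rough sketch (not part of the patch) of the two 
sequences expressed as QMP commands. It assumes the new parameter is exposed through 
migrate-set-parameters under the same "pause-vm" name as in HMP; the helper name 
snapshot_commands is only for illustration.

```python
import json


def snapshot_commands(path, pause_vm=True):
    """Return, in order, the QMP commands for a snapshot-to-file.

    pause_vm=True  : rely on the proposed "pause-vm" parameter (assumed name).
    pause_vm=False : historical sequence with an explicit "stop" first.
    """
    cmds = []
    if pause_vm:
        # QEMU pauses the VM itself, after the failover card is unplugged.
        cmds.append({"execute": "migrate-set-parameters",
                     "arguments": {"pause-vm": True}})
    else:
        # Explicit pause; this is what breaks when failover is configured.
        cmds.append({"execute": "stop"})
    cmds.append({"execute": "migrate",
                 "arguments": {"uri": "exec:cat > " + path}})
    # To be sent only once the migration has completed.
    cmds.append({"execute": "cont"})
    return cmds


for cmd in snapshot_commands("snapshot"):
    print(json.dumps(cmd))
```

With pause_vm=False the first command becomes a plain "stop", which is the only difference a 
management layer would see in the command sequence.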

> There are a couple of cases when libvirt apparently *doesn't* pause the guest during the 
> migrate-to-disk, both having to do with saving a coredump of the guest. Since I really 
> have no idea of how common/important that is (or even if my assessment of the code is 
> correct), I'm Cc'ing this patch to libvir-list to make sure it catches the attention of 
> someone who knows the answers and implications.

It's an interesting point that I need to test and think about: in the case of a coredump, I 
guess the machine has crashed and doesn't respond to the unplug request, so the failover 
unplug cannot be done. For the moment, the migration will hang until it is canceled. That can 
be annoying if we want to debug the cause of the crash...
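
Until that is sorted out, a management layer could guard against the hang with a timeout. 
The sketch below is only an illustration under assumptions: that a hung migration stays in 
the "wait-unplug" status reported by query-migrate, and that a 30 second limit is acceptable 
(both the limit and the helper name next_command are hypothetical).

```python
def next_command(status, seconds_in_status, timeout=30.0):
    """Pick the next QMP command while watching a migration.

    status            : "status" field from the last query-migrate reply
    seconds_in_status : how long the migration has reported that status
    timeout           : hypothetical limit before giving up on the unplug
    """
    if status == "wait-unplug" and seconds_in_status >= timeout:
        # The guest did not answer the unplug request: cancel the migration.
        return {"execute": "migrate_cancel"}
    # Otherwise keep polling.
    return {"execute": "query-migrate"}


print(next_command("wait-unplug", 5.0))
print(next_command("wait-unplug", 45.0))
```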

> 
>> But when failover is configured, it doesn't work anymore.
>>
>> As the failover needs to ask the guest OS to unplug the card
>> the machine cannot be paused.
>>
>> This patch introduces a new migration parameter, "pause-vm", that
>> asks the migration to pause the VM during the migration startup
>> phase after the card is unplugged.
>>
>> Once the migration is done, we only need to resume the VM with
>> "cont" and the card is plugged back:
>>
>> 1- set the parameter:
>>     (qemu) migrate_set_parameter pause-vm on
>>
>> 2- migrate the VM to a file:
>>     (qemu) migrate "exec:cat > snapshot"
>>
>>     The primary failover card (VFIO) is unplugged and the VM is paused.
>>
>> 3- resume the VM:
>>     (qemu) cont
>>
>>     The VM restarts and the primary failover card is plugged back
>>
>> The VM state sent in the migration stream is "paused", which means that
>> when the snapshot is loaded or if the stream is sent to a destination
>> QEMU, the VM needs to be resumed manually.
>>
>> Signed-off-by: Laurent Vivier <lvivier at redhat.com>
>> ---
>>   qapi/migration.json            | 20 +++++++++++++++---
>>   include/hw/virtio/virtio-net.h |  1 +
>>   hw/net/virtio-net.c            | 33 ++++++++++++++++++++++++++++++
>>   migration/migration.c          | 37 +++++++++++++++++++++++++++++++++-
>>   monitor/hmp-cmds.c             |  8 ++++++++
>>   5 files changed, 95 insertions(+), 4 deletions(-)
>>
...

Thanks,
Laurent



