[libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration

Wed May 11 10:54:24 UTC 2022

On Wed, May 11, 2022 at 12:42:08PM +0200, Peter Krempa wrote:
> On Wed, May 11, 2022 at 12:26:52 +0200, Jiri Denemark wrote:
> > On Wed, May 11, 2022 at 10:48:10 +0200, Peter Krempa wrote:
> > > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > > There's no need to artificially pause a domain when post-copy fails. The
> > > > virtual CPUs may continue running, only the guest tasks that decide to
> > > > read a page which has not been migrated yet will get blocked.
> > > 
> > > IMO not pausing the VM is a policy decision (same way as pausing it was
> > > though) and should be user-configurable at migration start.
> > > 
> > > I can see that users might want to prevent a half-broken VM from
> > > executing until it gets attention needed to fix it, even when it's safe
> > > from a "theoretical" standpoint.
> > 
> > It depends on how much was already migrated. In practice the guest may
> > easily stop running anyway :-)
> 
> Well, I'd consider that behaviour to be very bad actually, but given the
> caveats below ...
> 
> > So yeah, it was a needless policy
> > decision which is being removed now. But the important reason behind it,
> > which I should have mentioned in the commit message, is the difference
> > between libvirt and QEMU migration state. When libvirt connection breaks
> > (between daemons for p2p migration or between a client and daemons) we
> > consider migration as broken from the API point of view and return
> > failure. However, the migration may still be running just fine if the
> > connection between QEMU processes remains working. And since we're in
> > post-copy phase, the migration can even finish just fine without
> > libvirt. So a half-broken VM may magically become a fully working
> > migrated VM after our migration API reported a failure. Keeping the
> > domain running makes this situation easier to handle :-)
> 
> I see. Additionally if e.g. libvirtd isn't running at all (but that ties
> to the "connection broken" scenario) we wouldn't even pause it.
> 
> So the caveats were in fact always there, albeit less probable.

There are two very different scenarios here though.

Strictly from QEMU's point of view, migration only fails if there's
a problem with QEMU's migration connection. This scenario can
impact the guest, because processes get selectively blocked,
which can ultimately lead to application timeouts and errors.
If there's a failure at the QEMU level we want the guest to be
paused, so that interaction between apps in the guest is not
impacted.

On the libvirt side, if our own libvirtd connection fails,
this does not impact the guest or QEMU's migration connection
(I'll ignore tunnelled migration here). So there is no problem with
the guest continuing to execute, and even completing the migration
from QEMU's POV.

For robustness we want a way for QEMU to autonomously pause the
guest when post-copy fails, so that this pausing happens even
if libvirt's connection has also failed concurrently.
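
To make the distinction between the two scenarios concrete, here is a
rough sketch in Python. It is purely illustrative: the names Failure
and should_pause_guest are made up for this mail and are not libvirt
or QEMU APIs. It only encodes the policy described above, namely that
the guest should be paused when QEMU's own migration connection breaks,
regardless of what has happened to libvirt's connection.

```python
from enum import Enum, auto


class Failure(Enum):
    """Hypothetical labels for what broke during the post-copy phase."""
    LIBVIRT_CONNECTION = auto()  # daemon<->daemon or client<->daemon link
    QEMU_MIGRATION = auto()      # QEMU<->QEMU migration connection


def should_pause_guest(failures):
    """Return True if the guest should be paused on the destination.

    Only a failure of QEMU's migration connection can block guest
    tasks on pages that have not been migrated yet, so that is the
    only condition that triggers a pause. A broken libvirt connection
    on its own leaves the guest and the migration unaffected.
    """
    return Failure.QEMU_MIGRATION in failures


if __name__ == "__main__":
    # libvirt connection lost, QEMU migration still healthy.
    print(should_pause_guest({Failure.LIBVIRT_CONNECTION}))
    # QEMU migration connection lost while libvirt is also unreachable.
    print(should_pause_guest({Failure.LIBVIRT_CONNECTION,
                              Failure.QEMU_MIGRATION}))
```

The second case is why the decision would have to live in QEMU rather
than in libvirt: libvirt may not be around to act on it.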

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|