[libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration

Wed May 11 11:26:54 UTC 2022

On Wed, May 11, 2022 at 01:03:43PM +0200, Peter Krempa wrote:
> On Wed, May 11, 2022 at 11:39:29 +0100, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
> > > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > > There's no need to artificially pause a domain when post-copy fails. The
> > > > virtual CPUs may continue running, only the guest tasks that decide to
> > > > read a page which has not been migrated yet will get blocked.
> > > 
> > > IMO not pausing the VM is a policy decision (same way as pausing it was
> > > though) and should be user-configurable at migration start.
> > > 
> > > I can see that users might want to prevent a half-broken VM from
> > > executing until it gets attention needed to fix it, even when it's safe
> > > from a "theoretical" standpoint.
> > 
> > It isn't even safe from a theoretical standpoint though.
> > 
> > Consider 2 processes in a guest that are communicating with each
> > other. 1 gets blocked on a page read due to broken post copy, but
> > we leave the guest running.  The other process sees no progress
> > from the blocked process and/or hits a timeout and throws an
> > error. As a result the guest application workload ends up
> > completely dead, even if we later recover the postcopy
> > migration.
> 
> IMO you have to deal with this scenario in a reduced scope anyways when
> opting into using post-copy.
> 
> Each page transfer is vastly slower than the comparable access into
> memory, so if the 'timeout' portion is implied to be on the same order
> of magnitude of memory access latency then your software is going to have
> a very bad time when being migrated in post-copy mode. If the link gets
> congested ... then it's even worse.

Those are likely very different orders of magnitude though. A "slow"
page access in post-copy is $LOW seconds. A blocked process due to
a broken post-copy connection is potentially $HIGH minutes long if
the infra takes a long time to fix.

A page access taking seconds rather than microseconds isn't
going to trip up many app level timeouts IMHO.

A process blocked for many minutes is highly likely to trigger
app level timeouts.
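
To make the comparison concrete, here is a rough toy sketch of a peer
waiting on a stalled process with an app level timeout. All names and
numbers are hypothetical and scaled down from seconds/minutes so it runs
quickly; it only models the timing, not QEMU or libvirt behaviour.

```python
import threading
import time


def peer_outcome(stall_s, timeout_s):
    """Report what a waiting peer sees when its partner stalls for stall_s.

    The partner stands in for a guest process blocked on a page that has
    not been migrated yet; the peer gives up after timeout_s.
    """
    done = threading.Event()

    def blocked_process():
        time.sleep(stall_s)  # stand-in for the blocked page access
        done.set()           # progress finally becomes visible to the peer

    threading.Thread(target=blocked_process, daemon=True).start()
    # Event.wait() returns True if the event was set before the timeout.
    return "ok" if done.wait(timeout_s) else "timeout"


# Assumed, scaled-down values (not taken from any real deployment).
APP_TIMEOUT = 0.5      # app level timeout
SLOW_PAGE = 0.05       # "slow" post-copy page access
BROKEN_LINK = 1.5      # stall while the post-copy connection is broken

if __name__ == "__main__":
    print("slow page access:", peer_outcome(SLOW_PAGE, APP_TIMEOUT))
    print("broken post-copy:", peer_outcome(BROKEN_LINK, APP_TIMEOUT))
```

The stall only matters once it exceeds whatever timeout the application
happens to use, which is the difference between the two cases above.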

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|