[libvirt PATCH v2 81/81] RFC: qemu: Keep vCPUs paused while migration is in postcopy-paused

Peter Xu peterx at redhat.com
Mon Jun 6 14:33:02 UTC 2022


[copy Dave, for real]

On Mon, Jun 06, 2022 at 10:32:03AM -0400, Peter Xu wrote:
> [copy Dave]
> 
> On Mon, Jun 06, 2022 at 12:29:39PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jun 01, 2022 at 02:50:21PM +0200, Jiri Denemark wrote:
> > > QEMU keeps guest CPUs running even in postcopy-paused migration state so
> > > that processes that already have all memory pages they need migrated to
> > > the destination can keep running. However, this behavior might bring
> > > unexpected delays in interprocess communication, as some processes will
> > > be stopped until the migration is recovered and their memory pages are
> > > migrated. So let's make sure all guest CPUs are paused while postcopy
> > > migration is paused.
> > > ---
> > > 
> > > Notes:
> > >     Version 2:
> > >     - new patch
> > > 
> > >     - this patch does not currently work as QEMU cannot handle the
> > >       "stop" QMP command while in postcopy-paused state... the monitor
> > >       just hangs (see
> > >       https://gitlab.com/qemu-project/qemu/-/issues/1052 ; a rough
> > >       reproduction sketch follows at the end of these notes)
> > >     - an ideal solution to the QEMU bug would be for QEMU itself to
> > >       pause the CPUs for us and just notify us about it via QMP events
> > >     - but Peter Xu thinks this behavior is actually worse than keeping
> > >       vCPUs running
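> > > 
> > >     A rough reproduction sketch of the hang (not part of the patch; the
> > >     QMP socket path and the polling loop are made up for illustration).
> > >     It talks to the QEMU instance that owns the guest CPUs (i.e. the
> > >     destination after switchover), waits for postcopy-paused via
> > >     query-migrate, and then issues "stop":
> > > 
> > >         import json, socket, time
> > > 
> > >         s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
> > >         s.connect("/tmp/dst-qmp.sock")      # hypothetical QMP socket
> > >         f = s.makefile("rw")
> > > 
> > >         def send(cmd, **args):
> > >             msg = {"execute": cmd}
> > >             if args:
> > >                 msg["arguments"] = args
> > >             f.write(json.dumps(msg) + "\n")
> > >             f.flush()
> > > 
> > >         def recv():
> > >             # skip asynchronous QMP events, return the next reply
> > >             while True:
> > >                 msg = json.loads(f.readline())
> > >                 if "event" not in msg:
> > >                     return msg
> > > 
> > >         json.loads(f.readline())            # QMP greeting
> > >         send("qmp_capabilities")
> > >         recv()                              # {"return": {}}
> > > 
> > >         # wait until the broken migration stream is reported
> > >         while True:
> > >             send("query-migrate")
> > >             if recv().get("return", {}).get("status") == "postcopy-paused":
> > >                 break
> > >             time.sleep(1)
> > > 
> > >         send("stop")
> > >         # with the bug from issue 1052 this never returns: the monitor
> > >         # hangs instead of replying and emitting the STOP event
> > >         print(recv())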
> > 
> > I'd like to know what the rationale is here?
> 
> I think the wording here is definitely stronger than what I meant. :-)
> 
> My understanding was that stopping the VM may or may not help the guest,
> depending on the guest behavior at the point of migration failure.  And if
> we're not 100% sure of that, doing nothing is the best we have, as
> explicitly stopping the VM is something extra we do, and it's not part of
> the requirements for either postcopy itself or the recovery routine.
> 
> Some examples below.
> 
> 1) If many of the guest threads are doing CPU-intensive work, and if the
> needed pageset is already migrated, then stopping the vcpu threads means
> they could have been running during this "downtime" but we forced them not
> to.  Actually, if the postcopy didn't pause immediately right after the
> switchover, we could very possibly have migrated the workload pages already
> if the working set is not very large.
> 
> 2) If we're reaching the end of the postcopy phase when it pauses, most of
> the pages could have been migrated already.  So maybe only a few threads,
> or even none, will be stopped due to remote page faults.
> 
> 3) Think about KVM async page fault: that's a feature that lets the guest
> yield the faulting thread when there's a page fault.  It means that even
> if some of the page-faulted threads get stuck for a long time due to
> postcopy pausing, the guest is "smart" enough to know it'll take a long
> time (a userfaultfd fault is a major fault, and as long as KVM's gup won't
> get the page we put the page fault into the async PF queue), so the guest
> vcpu can explicitly schedule() away from the faulted context and run some
> other threads that may not need to be blocked.
> 
> What I wanted to say is I don't know whether assuming "stopping the VM will
> be better than not doing so" will always be true here.  If it's case by
> case, I feel like the better way is to do nothing special.
> 
> > 
> > We've got a long history knowing the behaviour and impact when
> > pausing a VM as a whole. Of course some apps may have timeouts
> > that are hit if the paused time was too long, but overall this
> > scenario is not that different from a bare metal machine doing
> > suspend-to-ram. Application impact is limited & predictable and
> > generally well understood.
> 
> My other question is: even if we stopped the VM, won't many of those
> timeout()s trigger anyway right after we resume it?  I think I asked a
> similar question to Jiri, and the answer at that time was that we might
> not have called the timeout() function yet; however, I think that's not
> persuasive enough, as timeout() is the function that should take the
> major part of the time, so at least we're not sure whether we'll already
> be in it.
> 
> My understanding is that a VM can work properly after a migration because
> the guest timekeeping will gradually sync up with the real-world time, so
> if a major downtime is triggered we can hardly keep it from affecting the
> guest.  What we can do is: if we know a piece of software runs in a VM
> context, we should be robust about timeouts (and that's at least what I do
> in programs even on bare metal, because I'd assume the program may be run
> on an extremely busy host).
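> 
> As a tiny sketch of what "robust about timeouts" could mean in application
> code (the names and numbers here are made up; it's just the retry idea,
> nothing libvirt/QEMU specific):
> 
>     import time
> 
>     def robust_call(op, attempts=5, per_try_timeout=30.0):
>         # Retry instead of failing hard on the first expired timeout, so a
>         # single long stall (busy host, paused VM, migration downtime) only
>         # costs a retry rather than an application-level failure.
>         for i in range(attempts):
>             start = time.monotonic()
>             try:
>                 return op(timeout=per_try_timeout)
>             except TimeoutError:
>                 waited = time.monotonic() - start
>                 # a grossly overshot wait hints at a stalled/paused VM
>                 # rather than a genuinely dead peer, so try again
>                 print(f"attempt {i}: timed out after {waited:.1f}s")
>         raise TimeoutError(f"still failing after {attempts} attempts")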
> 
> But I could be all wrong on that, because I don't know enough about the
> whole rationale for the importance of stopping the VM in the past.
> 
> > 
> > I don't think we can say the same about the behaviour & impact
> > on the guest OS if we selectively block execution of random
> > CPUs.  An OS where a certain physical CPU simply stops executing
> > is not a normal scenario that any application or OS is designed
> > to expect. I think the chance of the guest OS or application
> > breaking in a non-recoverable way is high. IOW, we might perform
> > post-copy recovery and all might look well from host POV, but
> > the guest OS/app is none the less broken.
> > 
> > The overriding goal for migration has to be to minimize the
> > danger to the guest OS and its applications, and I think that's
> > only viable if either the guest OS is running all CPUs or no
> > CPUs.
> 
> I agree.
> 
> > 
> > The length of the outage for a CPU when the post-copy transport is
> > broken is potentially orders of magnitude larger than the temporary
> > blockage while fetching a memory page asynchronously. The latter
> > is obviously not good for real-time sensitive apps, but most apps
> > and OSes will cope with CPUs being stalled for hundreds of milliseconds.
> > That isn't the case if CPUs get stalled for minutes, or even hours,
> > at a time due to a broken network link needing admin recovery work
> > in the host infra.
> 
> So let me also look at the issue of vm_stop hanging: no matter whether
> we'd like an explicit vm_stop, that hang had better be avoided from
> libvirt's POV.
> 
> Ideally it could be avoided, but I need to look into it.  I think it may
> be that vm_stop was waiting for the other vcpus to exit to userspace, but
> those didn't really come alive after the SIG_IPI was sent to them (in
> reality that's SIGUSR1; and I'm pretty sure all vcpu threads can handle
> SIGKILL.. so maybe I need to figure out where they got blocked in the
> kernel).
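> 
> For reference, a quick way to see where those vcpu threads are sitting in
> the kernel is to dump their stacks from procfs.  A rough sketch (the pid
> is made up, and reading the stack file needs root):
> 
>     import glob, os
> 
>     qemu_pid = 12345                          # pid of the QEMU process
>     for task in glob.glob(f"/proc/{qemu_pid}/task/*"):
>         tid = os.path.basename(task)
>         comm = open(f"{task}/comm").read().strip()
>         if not comm.startswith("CPU "):       # vcpu threads are "CPU <n>/KVM"
>             continue
>         state = next(l.strip() for l in open(f"{task}/status")
>                      if l.startswith("State:"))
>         try:
>             stack = open(f"{task}/stack").read()
>         except PermissionError:
>             stack = "  <need root to read the kernel stack>\n"
>         print(f"{comm} (tid {tid}): {state}\n{stack}")
> 
> That should at least show whether they are stuck in a userfaultfd wait or
> somewhere else in the kernel.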
> 
> I'll update either here or in the bug that Jiri opened when I get more
> clues out of it.
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu


