[libvirt] [RFC 0/2] Fix detection of slow guest shutdown

Christian Ehrhardt christian.ehrhardt at canonical.com
Mon Aug 6 05:20:10 UTC 2018


On Fri, Aug 3, 2018 at 6:39 PM Alex Williamson <alex.williamson at redhat.com>
wrote:

> On Fri,  3 Aug 2018 08:29:39 +0200
> Christian Ehrhardt <christian.ehrhardt at canonical.com> wrote:
>
> > Hi,
> > I was recently looking into a case which essentially looked like this:
> >   1. virsh shutdown guest
> >   2. after <1 second the qemu process was gone from /proc/
> >   3. but libvirt spun in virProcessKillPainfully because the process
> >      was still reachable via signals
> >   4. virProcessKillPainfully eventually fails after 15 seconds and the
> >      guest stays in "in shutdown" state forever
> >
> > This is not one of the common cases I've found that make
> > virProcessKillPainfully break:
> > - bad I/O, e.g. NFS, getting qemu stuck
> > - CPU overload stalling things to death
> > - qemu not being reaped (by init)
> > All of the above would have the process still available in /proc/<pid>
> > as Zombie or in uninterruptible sleep, but that is not true in my case.
> >
> > It turned out that the case was dependent on the amount of hostdev
> > resources passed to the guest. Debugging showed that with 8, and more
> > clearly with 16, GPUs passed it took ~18 seconds from SIGTERM to
> > "no longer reachable with signal 0".
> > I haven't conducted many more tests and stayed with the 16 GPU case,
> > but I'm rather sure more devices might make it take even longer.
>
> If it's dependent on device assignment, then it's probably either
> related to unmapping DMA or resetting devices.  The former should scale
> with the size of the VM, not the number of devices attached.  The
> latter could increase with each device.  Typically with physical GPUs
> we don't have a function level reset mechanism so we need to do a
> secondary bus reset on the upstream bridge to reset the device, which
> requires a 1s delay to let the bus settle after reset.  So if we're
> gated by these sorts of resets, your scaling doesn't sound
> unreasonable,


So the scaling makes sense: ~16*1s plus a bit of general cleanup time
matches the ~18 seconds I see.
Thanks for that explanation!
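
Just to make that arithmetic concrete, here is a rough, purely
hypothetical sketch (not part of these patches; the helper name and the
constants are made up) of scaling the shutdown grace period with the
number of assigned hostdevs instead of the fixed 15 seconds that
virProcessKillPainfully allows today:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch, not from the RFC: scale the shutdown grace
 * period with the number of assigned hostdevs, assuming roughly one
 * ~1s secondary bus reset settle time per passed-through device plus
 * some base teardown slack. */
static unsigned int
hypotheticalShutdownTimeout(size_t nhostdevs)
{
    const unsigned int base = 5;    /* general teardown slack, assumed value */
    const unsigned int per_dev = 1; /* ~1s bus settle per assigned device */

    return base + per_dev * (unsigned int)nhostdevs;
}

int main(void)
{
    /* 16 assigned GPUs -> 21s with these assumed constants,
     * which is above the ~18s I measured. */
    printf("timeout for 16 hostdevs: %us\n", hypotheticalShutdownTimeout(16));
    return 0;
}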


> though I'm not sure how these factor into the process
> state you're seeing.


Yeah, I'd have expected to still see it in some form, like a zombie,
but it really is gone.



> I'd also be surprised if you have a system that
> can host 16 physical GPUs, so maybe this is a vGPU example?


16 physical GPUs it is :-)
See https://www.nvidia.com/en-us/data-center/dgx-2/


> Any mdev
> device should provide a reset callback for roughly the equivalent of a
> function level reset.  Implementation of such a reset would be vendor
> specific.


Since these are not classic mdev devices [1][2], but 16 physical GPUs,
the reset callback suggestion would not apply here, right?
In that case I wonder what the libvirt community thinks of the proposed
general "PID is gone, so we can assume it is dead" approach?

An alternative would be to understand on the kernel side why the PID is
gone "too early" and fix that so it stays around until fully cleaned up.
But even then, on the libvirt side, we would still need the extended
timeout values.
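
To illustrate the idea (a rough sketch only, not the actual patches; the
function name is made up), a liveness check that treats a vanished
/proc/<pid> entry as "the process is gone" could look roughly like this:

#include <stdio.h>
#include <stdbool.h>
#include <signal.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of the "PID is gone means we can assume it is dead" idea.
 * In the case described above, kill(pid, 0) still succeeds while the
 * kernel tears down the passed-through devices even though /proc/<pid>
 * has already disappeared, so the /proc check is used as an extra hint. */
static bool
hypotheticalProcessIsGone(pid_t pid)
{
    char path[64];

    if (kill(pid, 0) < 0 && errno == ESRCH)
        return true;   /* classic "no such process" answer */

    snprintf(path, sizeof(path), "/proc/%lld", (long long)pid);
    if (access(path, F_OK) < 0 && errno == ENOENT)
        return true;   /* still signalable, but already gone from /proc */

    return false;
}

int main(void)
{
    /* Demo: our own PID is obviously still alive. */
    printf("self gone? %d\n", hypotheticalProcessIsGone(getpid()));
    return 0;
}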

[1]: https://libvirt.org/drvnodedev.html#MDEV
[2]: https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt


> Thanks,
> Alex


-- 
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd