<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Fri, Aug 3, 2018 at 6:39 PM Alex Williamson <<a href="mailto:alex.williamson@redhat.com">alex.williamson@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, 3 Aug 2018 08:29:39 +0200<br> Christian Ehrhardt <<a href="mailto:christian.ehrhardt@canonical.com" target="_blank">christian.ehrhardt@canonical.com</a>> wrote:<br> <br> > Hi,<br> > I was recently looking into a case which essentially looked like this:<br> > 1. virsh shutdown guest<br> > 2. after <1 second the qemu process was gone from /proc/<br> > 3. but libvirt spun in virProcessKillPainfully because the process<br> > was still reachable via signals<br> > 4. virProcessKillPainfully eventually fails after 15 seconds and the<br> > guest stays in "in shutdown" state forever<br> > <br> > This is not one of the common cases I've found for<br> > virProcessKillPainfully to break:<br> > - bad I/O e.g. NFS gets qemu stuck<br> > - CPU overload stalls things to death<br> > - qemu not being reaped (by init)<br> > All of the above would have the process still available in /proc/<pid><br> > as Zombie or in uninterruptible sleep, but that is not true in my case.<br> > <br> > It turned out that the case was dependent on the amount of hostdev resources<br> > passed to the guest. Debugging showed that with 8 and more likely 16 GPUs<br> > passed it took ~18 seconds from SIGTERM to "no more be reachable with signal 0".<br> > I haven't conducted much more tests but stayed on the 16 GPU case, but<br> > I'm rather sure more devices might make it take even longer.<br> <br> If it's dependent on device assignment, then it's probably either<br> related to unmapping DMA or resetting devices. The former should scale<br> with the size of the VM, not the number of devices attached. The<br> latter could increase with each device. 
Typically with physical GPUs<br> we don't have a function level reset mechanism so we need to do a<br> secondary bus reset on the upstream bridge to reset the device, this<br> requires a 1s delay to let the bus settle after reset. So if we're<br> gated by these sorts of resets, your scaling doesn't sound<br> unreasonable, </blockquote><div><br></div><div><div style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">So the scaling makes sense: ~16 * 1s for the resets plus a little baseline cleanup time matches the ~18 seconds I see.</div><div style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">Thanks for that explanation!</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">though I'm not sure how these factor into the process<br> state you're seeing.</blockquote><div><br></div><div>Yeah, I would have expected to still see it in some form, such as a zombie.</div><div>But it really is gone.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I'd also be surprised if you have a system that<br> can host 16 physical GPUs, so maybe this is a vGPU example? </blockquote><div><br></div><div>It really is 16 physical GPUs :-)</div><div>See <a href="https://www.nvidia.com/en-us/data-center/dgx-2/">https://www.nvidia.com/en-us/data-center/dgx-2/</a><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Any mdev<br> device should provide a reset callback for roughly the equivalent of a<br> function level reset. Implementation of such a reset would be vendor<br> specific. 
</blockquote><div><br></div><div>Since this is not a classic mdev [1][2] but just 16 physical GPUs, the callback suggestion would not make sense, right?</div><div>In that case, I wonder what the libvirt community thinks of the proposed general "PID is gone means we can assume it is dead" approach.</div><div><br></div><div>An alternative would be to understand on the kernel side why the PID disappears "too early" and fix that so it stays visible until the process is fully cleaned up.</div><div>But even then, on the libvirt side we would still need the extended timeout values.</div><div><br></div><div>[1]: <a href="https://libvirt.org/drvnodedev.html#MDEV">https://libvirt.org/drvnodedev.html#MDEV</a></div><div>[2]: <a href="https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt">https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt</a></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> Thanks, </blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> Alex<br> </blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><span style="color:rgb(136,136,136);font-size:12.8px">Christian Ehrhardt</span><div style="color:rgb(136,136,136);font-size:12.8px">Software Engineer, Ubuntu Server</div><div style="color:rgb(136,136,136);font-size:12.8px">Canonical Ltd</div></div></div></div></div></div>