[libvirt] [RFC 0/2] Fix detection of slow guest shutdown

Fri Aug 3 06:29:39 UTC 2018

Hi,
I was recently looking into a case which essentially looked like this:
  1. virsh shutdown guest
  2. after <1 second the qemu process was gone from /proc/
  3. but libvirt spun in virProcessKillPainfully because the process
     was still reachable via signals
  4. virProcessKillPainfully eventually failed after 15 seconds and the
     guest stayed in the "in shutdown" state forever

This is not one of the common cases I've found for
virProcessKillPainfully to break:
- bad I/O e.g. NFS gets qemu stuck
- CPU overload stalls things to death
- qemu not being reaped (by init)
All of the above would leave the process still visible in /proc/<pid>
as a zombie or in uninterruptible sleep, but that is not true in my case.

It turned out that the behaviour depended on the number of hostdev resources
passed to the guest. Debugging showed that with 8, and more likely with 16,
GPUs passed through it took ~18 seconds from SIGTERM until the process was
"no longer reachable with signal 0".
I haven't run many more tests and stayed with the 16 GPU case, but
I'm rather sure that more devices would make it take even longer.

Discussion with a few kernel folks revealed that the kill(2) man page
on signal 0 has to be taken very literally: "check for the existence of a
process ID" - you can read this as "the PID exists, but the process is no more".
I'm unsure why the kernel would take that much time to clean up, as I
thought removing /proc/<PID> would be almost the last step in the
cleanup of a task.
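
To illustrate the literal reading of signal 0, here is a minimal sketch
(not part of the patches; pid_accepts_signal0 and probe_exited_child are
hypothetical names made up for this example). It uses the well-known
zombie case, where the process has exited but its PID is still allocated.
Note that a zombie is still visible in /proc/<pid>, unlike the case
described in this mail, so this only shows the semantics of the check,
not the slow cleanup itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not libvirt code: true while the kernel still
 * accepts signal 0 for pid, i.e. while the PID is still allocated. */
static bool pid_accepts_signal0(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;
    /* EPERM: the PID exists but we lack permission; only ESRCH means gone. */
    return errno != ESRCH;
}

/* Fork a child that exits at once and probe it before and after reaping.
 * Returns 0 on success, -1 if fork or one of the wait calls failed. */
static int probe_exited_child(bool *before_reap, bool *after_reap)
{
    siginfo_t info;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);

    /* WNOWAIT: block until the child has exited, but leave it unreaped. */
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) < 0)
        return -1;
    *before_reap = pid_accepts_signal0(child);

    /* Reap it; from here on signal 0 is expected to fail with ESRCH. */
    if (waitpid(child, NULL, 0) < 0)
        return -1;
    *after_reap = pid_accepts_signal0(child);
    return 0;
}
```

The child has already exited when the first probe runs, yet signal 0
still succeeds; only after the PID is released does the probe fail.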

patch 2:
As far as I could find there is no much better way to check than
signal 0, but it isn't reliable if the kernel can still accept the
signal for quite some time even with the pid gone from all other
interfaces that I could find. So I wanted to suggest a fallback in
virProcessKillPainfully that considers the absence of /proc/<pid> as
a valid "the process is gone" as well, on top of the ESRCH of signal 0.
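
A rough sketch of that fallback idea, assuming a Linux host (this is not
the code from the patch; proc_entry_missing and process_is_gone are
hypothetical names, and the extra check that /proc is mounted at all is
my own assumption for the example):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: true if /proc/<pid> does not exist. */
static bool proc_entry_missing(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%lld", (long long)pid);
    return access(path, F_OK) < 0 && errno == ENOENT;
}

/* Hypothetical helper: treat the process as gone if signal 0 reports
 * ESRCH or, as a fallback, if its /proc/<pid> entry has disappeared. */
static bool process_is_gone(pid_t pid)
{
    if (kill(pid, 0) < 0 && errno == ESRCH)
        return true;

    /* Assumption of this sketch: only trust a missing /proc/<pid>
     * when procfs is actually mounted. */
    if (access("/proc/self", F_OK) != 0)
        return false;

    return proc_entry_missing(pid);
}
```

The signal 0 check stays first, so behaviour only changes in the case
where the kernel still accepts the signal but /proc/<pid> is already gone.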

We could also use the open FDs we have, e.g. to the qemu monitor, to see
if the remote end is dead, but that didn't seem more readable/reliable
to me, and knowledge of the FDs would have to cross quite some code.

But maybe someone else here has insight into what exactly takes the
time in the kernel that I see, and that might lead us to totally
different solutions (therefore RFC).

patch 1:
Finally, after working through this code for a while I got the feeling
that if we are in a bad/non-responsive case and upgrade to SIGKILL after
10 seconds, we should give it some more time to take effect. We reach
this in stressful cases only anyway, and only if force is set, so
waiting a bit more helps to resolve some of the other cases that I found
on the mailing list about virProcessKillPainfully being stuck.
If someone has a particular interest in the 15 seconds we had before,
let's add a VIR_WARN at 15 seconds if that would be better, but overall
wait a bit more.
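
For readers not familiar with the flow, the escalation can be sketched
roughly like this. This is a simplified illustration, not the libvirt
implementation: kill_with_escalation and spawn_sleeper are hypothetical
names, and the return values and poll parameters are choices made for
this sketch only.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical sketch: send SIGTERM, poll with signal 0, and if force is
 * set upgrade to SIGKILL after term_polls polls (term_polls must be >= 1).
 * Returns 0 if the pid vanished after SIGTERM, 1 if it took SIGKILL,
 * -1 on timeout or on an error other than ESRCH. */
static int kill_with_escalation(pid_t pid, bool force,
                                int term_polls, int kill_polls, long poll_ms)
{
    const struct timespec delay = { poll_ms / 1000,
                                    (poll_ms % 1000) * 1000000L };
    int total = term_polls + (force ? kill_polls : 0);
    bool killed = false;

    for (int i = 0; i < total; i++) {
        int signum = 0;                 /* default: existence check only */

        if (i == 0)
            signum = SIGTERM;
        else if (i == term_polls)
            signum = SIGKILL;           /* only reached if force is set */

        if (kill(pid, signum) < 0)
            return errno == ESRCH ? (killed ? 1 : 0) : -1;
        if (signum == SIGKILL)
            killed = true;

        nanosleep(&delay, NULL);
    }
    return -1;
}

/* Demo helper: fork a child that sleeps forever, optionally ignoring
 * SIGTERM. SIGCHLD is ignored so the child is reaped automatically
 * and never lingers as a zombie. */
static pid_t spawn_sleeper(bool ignore_term)
{
    struct sigaction ign, old;
    pid_t pid;

    memset(&ign, 0, sizeof(ign));
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);

    sigaction(SIGCHLD, &ign, NULL);
    if (ignore_term)
        sigaction(SIGTERM, &ign, &old);   /* inherited by the child */

    pid = fork();
    if (pid == 0) {
        for (;;)
            pause();
    }

    if (ignore_term)
        sigaction(SIGTERM, &old, NULL);   /* restore in the parent */
    return pid;
}
```

Assuming a 200 ms poll interval (an assumption of this sketch),
kill_with_escalation(pid, true, 50, 150, 200) would correspond to
SIGKILL after 10 seconds followed by up to 30 more seconds of waiting.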

P.S. After a short discussion with Daniel on IRC I'm also adding Alex explicitly
for passthrough experience.

P.P.S.: For now this really is only meant as an RFC to kick off the discussion.
I lost access to the system on which I could easily trigger this case before I
could complete a final verification. But the case is interesting enough
to start the discussion now.

Christian Ehrhardt (2):
  process: wait longer 5->30s on hard shutdown
  process: accept the lack of /proc/<pid> as valid process removal

 src/util/virprocess.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

-- 
2.17.1