Squelch 'eof from qemu monitor' error on normal VM shutdown

Jim Fehlig jfehlig at suse.com
Fri Oct 8 02:45:06 UTC 2021


On 10/4/21 08:55, Michal Prívozník wrote:
> On 9/30/21 7:15 PM, Jim Fehlig wrote:
>> On 9/29/21 21:29, Jim Fehlig wrote:
>>> Hi All,
>>>
>>> Likely Christian received a bug report that motivated commit
>>> aeda1b8c56, which was later reverted by Michal with commit 72adaf2f10.
>>> In the past, I recall being asked about "internal error: End of file
>>> from qemu monitor" on normal VM shutdown and gave a hand wavy response
>>> using some of Michal's words from the revert commit message.
>>>
>>> I recently received a bug report (sorry, but no public link) from a
>>> concerned user about this error and wondered if there is some way to
>>> improve it? I went down some dead ends before circling back to
>>> Christian's patch. When rebased to latest master, I cannot reproduce
>>> the hangs reported by Michal [1]. Perhaps Nikolay's series to resolve
>>> hangs/crashes of libvirtd [2] has now made Christian's patch viable?
>>
>> Hmm, Nikolay's series improves thread management at daemon shutdown and
>> doesn't touch VM shutdown logic. But there has been some behavior change
>> from the time aeda1b8c56 was committed (3.4.0 dev cycle) to current git
>> master. At least I don't see libvirtd hanging when running Michal's test
>> on master + rebased aeda1b8c56.

After reworking the tests a bit, I still don't see libvirtd hanging, but I do 
see VMs getting stuck in the "in shutdown" state. Attaching gdb shows the 
following thread blocked waiting for a response from the monitor that will 
never come, since EOF has already occurred on the socket:

Thread 21 (Thread 0x7fdb557fa700 (LWP 9110) "rpc-worker"):
#0  0x00007fdbc922a4dc in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007fdbccee2310 in virCondWait (c=0x7fdba403cd40, m=0x7fdba403cd18) at ../src/util/virthread.c:156
#2  0x00007fdb87150a8b in qemuMonitorSend (mon=0x7fdba403cd00, msg=0x7fdb557f95b0) at ../src/qemu/qemu_monitor.c:964
#3  0x00007fdb8715fbf1 in qemuMonitorJSONCommandWithFd (mon=0x7fdba403cd00, cmd=0x7fdb4015bae0, scm_fd=-1, reply=0x7fdb557f9678) at ../src/qemu/qemu_monitor_json.c:327
#4  0x00007fdb8715fda0 in qemuMonitorJSONCommand (mon=0x7fdba403cd00, cmd=0x7fdb4015bae0, reply=0x7fdb557f9678) at ../src/qemu/qemu_monitor_json.c:352
#5  0x00007fdb87174b71 in qemuMonitorJSONGetIOThreads (mon=0x7fdba403cd00, iothreads=0x7fdb557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_monitor_json.c:7838
#6  0x00007fdb8715d059 in qemuMonitorGetIOThreads (mon=0x7fdba403cd00, iothreads=0x7fdb557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_monitor.c:4083
#7  0x00007fdb870e8ae3 in qemuDomainGetIOThreadsMon (driver=0x7fdb6c06a4f0, vm=0x7fdb6c063940, iothreads=0x7fdb557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_driver.c:4941
#8  0x00007fdb871129bf in qemuDomainGetStatsIOThread (driver=0x7fdb6c06a4f0, dom=0x7fdb6c063940, params=0x7fdb401c0cc0, privflags=1) at ../src/qemu/qemu_driver.c:18292
#9  0x00007fdb871130ee in qemuDomainGetStats (conn=0x7fdb9c006760, dom=0x7fdb6c063940, stats=1023, record=0x7fdb557f98d0, flags=1) at ../src/qemu/qemu_driver.c:18504
#10 0x00007fdb87113526 in qemuConnectGetAllDomainStats (conn=0x7fdb9c006760, doms=0x0, ndoms=0, stats=1023, retStats=0x7fdb557f9990, flags=0) at ../src/qemu/qemu_driver.c:18598
#11 0x00007fdbcd163e4e in virConnectGetAllDomainStats (conn=0x7fdb9c006760, stats=0, retStats=0x7fdb557f9990, flags=0) at ../src/libvirt-domain.c:11975
...
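
To make the hang mechanism explicit: the send path waits on the monitor's 
condition variable until the IO handler either finishes the reply or records 
an error. Below is a minimal, self-contained sketch of that interaction 
(simplified types and invented names, not the actual libvirt code), showing 
why nothing ever wakes the sender if the IO handler sees EOF without 
recording an error:

#include <pthread.h>
#include <stdbool.h>

struct monitor {
    pthread_mutex_t lock;
    pthread_cond_t notify;
    int last_error;     /* 0 == no error recorded by the IO handler */
    bool msg_finished;  /* set by the IO handler when a reply arrives */
};

/* Called by API worker threads such as the stats collector above.
 * (Queueing the message and the actual socket write are elided.) */
int monitor_send(struct monitor *mon)
{
    int ret;

    pthread_mutex_lock(&mon->lock);

    /* If the IO handler already recorded an error (EOF, hangup, ...),
     * fail right away instead of queueing a message that can never be
     * answered.  This is what keeps other threads off the dead socket. */
    if (mon->last_error != 0) {
        pthread_mutex_unlock(&mon->lock);
        return -1;
    }

    /* Wait until the IO handler either marks the reply finished or
     * records an error and broadcasts the condition.  If EOF happens
     * and no error is recorded, neither ever occurs and this thread
     * sleeps forever -- the hang in the backtrace above. */
    while (!mon->msg_finished && mon->last_error == 0)
        pthread_cond_wait(&mon->notify, &mon->lock);

    ret = (mon->msg_finished && mon->last_error == 0) ? 0 : -1;
    pthread_mutex_unlock(&mon->lock);
    return ret;
}

/* Called by the per-VM monitor IO thread when the socket hits EOF. */
void monitor_io_eof(struct monitor *mon, bool record_error)
{
    pthread_mutex_lock(&mon->lock);
    if (record_error)
        mon->last_error = 1;               /* wakes and fails any sender */
    pthread_cond_broadcast(&mon->notify);  /* without a recorded error the
                                              waiter just goes back to sleep */
    pthread_mutex_unlock(&mon->lock);
}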

So indeed, reporting the error when processing monitor IO is needed to prevent 
other threads from subsequently writing to the socket. One idea I had was to 
postpone reporting the error until someone tries to write to the socket, 
although not reporting an error when it is encountered seems distasteful. I've 
been testing such a hack (attached), and along with squelching the error 
message, I no longer see VMs stuck in the "in shutdown" state after 32 
iterations of the 
test. A simple rebase of aeda1b8c56 on current git master never survived more 
than a dozen iterations. I'll let the test continue to run.
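
To be concrete about the "postpone the error" idea, here is a toy sketch 
(invented names and simplified types; this is only an illustration of the 
approach, not the attached patch): the IO handler merely remembers that EOF 
happened, and the error is raised lazily the next time a thread tries to use 
the monitor.

#include <pthread.h>
#include <stdbool.h>

struct monitor {
    pthread_mutex_t lock;
    pthread_cond_t notify;
    bool eof_seen;      /* EOF noticed by the IO handler, not yet reported */
    int last_error;     /* 0 == no error reported yet */
    bool msg_finished;
};

/* IO handler: quietly note the EOF instead of reporting/logging
 * "End of file from qemu monitor" at this point. */
void monitor_io_eof(struct monitor *mon)
{
    pthread_mutex_lock(&mon->lock);
    mon->eof_seen = true;
    pthread_cond_broadcast(&mon->notify);  /* wake any pending sender */
    pthread_mutex_unlock(&mon->lock);
}

/* Send path: only when somebody actually tries to talk to the monitor
 * is the remembered EOF turned into a reported error. */
int monitor_send(struct monitor *mon)
{
    int ret = -1;

    pthread_mutex_lock(&mon->lock);

    if (mon->eof_seen || mon->last_error != 0) {
        mon->last_error = 1;  /* report the EOF error here instead */
        goto out;
    }

    /* Wait for a reply, an EOF, or an error from the IO handler. */
    while (!mon->msg_finished && !mon->eof_seen && mon->last_error == 0)
        pthread_cond_wait(&mon->notify, &mon->lock);

    if (mon->msg_finished && !mon->eof_seen && mon->last_error == 0)
        ret = 0;
    else
        mon->last_error = 1;  /* e.g. EOF arrived mid-command */

 out:
    pthread_mutex_unlock(&mon->lock);
    return ret;
}

With something along these lines, a normal shutdown where nothing touches the 
monitor after EOF never logs the error, while a caller racing with shutdown 
still gets a failure rather than hanging in the wait loop.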

For reference, also attached are the libvirtd-hang-test.sh hack I use to 
continually start/stop VMs and a tweaked version of your stats.py test.

> That's sort of expected - back in that era we had one event loop for
> everything. Thus things like virGetLastErrorCode() which use thread
> local variables did not work as expected and thus we needed to save the
> error in mon->lastError. But now we have one additional thread per each
> QEMU just for handling monitor IO, thus virGetLastErrorCode() works now
> and we could lift some restrictions. Therefore I think we can revisit
> Christian's patch again. I too see some concerned users who think it's a
> true error.

Since Christian's patch won't work as-is, any other ideas beyond what I 
described above?

Regards,
Jim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: libvirtd-hang-test.sh
Type: application/x-shellscript
Size: 2200 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20211007/e22fb50f/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stats.py
Type: text/x-python
Size: 321 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20211007/e22fb50f/attachment-0001.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-qemu-Do-not-report-eof-when-processing-monitor-IO.patch
Type: text/x-patch
Size: 2979 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20211007/e22fb50f/attachment-0003.bin>

