[libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination

Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
Fri Dec 22 08:06:08 UTC 2017



On 21.12.2017 15:57, John Ferlan wrote:
> [...]
> 
>>>
>>> So short story made really long, I think the best course of action will
>>> be to add this patch and reorder the Unref()'s (adminProgram thru srv,
>>> but not dmn). It seems to resolve these corner cases, but I'm also open
>>> to other suggestions. Still need to think about it some more too before
>>> posting any patches.
>>>
>>>
>> Hi.
>>
>> I haven't grasped the whole picture yet, but I've managed to find out what
>> triggered the crash. It is not 2f3054c22, where you reordered the unrefs, but
>> 1fd1b766105, which moves event unregistering from netserver client closing to
>> netserver client disposing. Before 1fd1b766105 we didn't have the crash
>> because clients simply did not get disposed.
> 
> Oh yeah, that one....  But considering Erik's most recent response in
> this overall thread vis-a-vis the separation of "close" vs. "dispose"
> and the timing of each w/r/t Unref and Free, I think having the call to
> remoteClientFreePrivateCallbacks in remoteClientCloseFunc is perhaps
> better than in remoteClientFreeFunc.
> 
>>  
>> As to this patch fixing the crash, I think it is a coincidence. I want to
>> dispose of the netservers early to join the RPC threads, and it turns out
>> that disposing also closes the clients, which fixes the problem.
>>
>> Nikolay
>>
> 
> With Cedric's patch in place, the virt-manager client issue is fixed. So
> that's goodness.
> 
> If I then add the sleep (or usleep) into qemuConnectGetAllDomainStats as
> noted in what started this all, then I can either get libvirtd to crash
> dereferencing a NULL driver pointer or (my favorite) hang with two
> threads stuck waiting:
> 
> (gdb) t a a bt
> 
> Thread 5 (Thread 0x7fffe535b700 (LWP 15568)):
> #0  0x00007ffff3dc909d in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007ffff3dc1e23 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #2  0x00007ffff7299a15 in virMutexLock (m=<optimized out>)
>     at util/virthread.c:89
> #3  0x00007fffc760621e in qemuDriverLock (driver=0x7fffbc190510)
>     at qemu/qemu_conf.c:100
> #4  virQEMUDriverGetConfig (driver=driver@entry=0x7fffbc190510)
>     at qemu/qemu_conf.c:1002
> #5  0x00007fffc75dfa89 in qemuDomainObjBeginJobInternal (
>     driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
>     job=job@entry=QEMU_JOB_QUERY, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE)
>     at qemu/qemu_domain.c:4690
> #6  0x00007fffc75e3b2b in qemuDomainObjBeginJob (
>     driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
>     job=job@entry=QEMU_JOB_QUERY) at qemu/qemu_domain.c:4842
> #7  0x00007fffc764f744 in qemuConnectGetAllDomainStats (conn=0x7fffb80009a0,
>     doms=<optimized out>, ndoms=<optimized out>, stats=<optimized out>,
>     retStats=0x7fffe535aaf0, flags=<optimized out>)
>     at qemu/qemu_driver.c:20219
> #8  0x00007ffff736430a in virDomainListGetStats (doms=0x7fffa8000950,
>     stats=0, retStats=retStats@entry=0x7fffe535aaf0, flags=0)
>     at libvirt-domain.c:11595
> #9  0x000055555557948d in remoteDispatchConnectGetAllDomainStats (
>     server=<optimized out>, msg=<optimized out>, ret=0x7fffa80008e0,
>     args=0x7fffa80008c0, rerr=0x7fffe535abf0, client=<optimized out>)
>     at remote.c:6538
> #10 remoteDispatchConnectGetAllDomainStatsHelper (server=<optimized out>,
>     client=<optimized out>, msg=<optimized out>, rerr=0x7fffe535abf0,
>     args=0x7fffa80008c0, ret=0x7fffa80008e0) at remote_dispatch.h:615
> #11 0x00007ffff73bf59c in virNetServerProgramDispatchCall (msg=0x55555586cdd0,
>     client=0x55555586bea0, server=0x55555582ed90, prog=0x555555869190)
>     at rpc/virnetserverprogram.c:437
> #12 virNetServerProgramDispatch (prog=0x555555869190,
>     server=server@entry=0x55555582ed90, client=0x55555586bea0,
>     msg=0x55555586cdd0) at rpc/virnetserverprogram.c:307
> #13 0x00005555555a9318 in virNetServerProcessMsg (msg=<optimized out>,
>     prog=<optimized out>, client=<optimized out>, srv=0x55555582ed90)
>     at rpc/virnetserver.c:148
> #14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55555582ed90)
>     at rpc/virnetserver.c:169
> #15 0x00007ffff729a521 in virThreadPoolWorker (
>     opaque=opaque@entry=0x55555583aa40) at util/virthreadpool.c:167
> #16 0x00007ffff7299898 in virThreadHelper (data=<optimized out>)
>     at util/virthread.c:206
> #17 0x00007ffff3dbf36d in start_thread () from /lib64/libpthread.so.0
> #18 0x00007ffff3af3e1f in clone () from /lib64/libc.so.6
> 
> Thread 1 (Thread 0x7ffff7ef9d80 (LWP 15561)):
> #0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
>     from /lib64/libpthread.so.0
> #1  0x00007ffff7299af6 in virCondWait (c=<optimized out>, m=<optimized out>)
>     at util/virthread.c:154
> #2  0x00007ffff729a760 in virThreadPoolFree (pool=<optimized out>)
>     at util/virthreadpool.c:290
> #3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
>     at rpc/virnetserver.c:767
> #4  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #5  0x00007ffff724f069 in virHashFree (table=<optimized out>)
>     at util/virhash.c:318
> #6  0x00007ffff73b8295 in virNetDaemonDispose (obj=0x55555582eb10)
>     at rpc/virnetdaemon.c:105
> #7  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #8  0x000055555556f2eb in main (argc=<optimized out>, argv=<optimized out>)
>     at libvirtd.c:1524
> (gdb)
> 
> 
> Of course this could be a red herring because sleep/usleep and the
> condition handling nature of these jobs could be interfering with one
> another.
> 
> Still adding the "virHashRemoveAll(dmn->servers);" into
> virNetDaemonClose doesn't help the situation as I can still either crash
> randomly or hang, so I'm less convinced this would really fix anything.
> It does change the "nature" of the hung thread stack trace though, as
> the second thread is now:

virHashRemoveAll is not enough now. Due to the unref reordering, the last
ref to @srv is unrefed after virStateCleanup. So we need to
virObjectUnref(srv|srvAdm) before virStateCleanup. Or we can call
virThreadPoolFree from virNetServerClose (as in the first version of the
patch, and as Erik suggests) instead of virHashRemoveAll.

> 
> Thread 1 (Thread 0x7ffff7ef9d80 (LWP 20159)):
> #0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
>     from /lib64/libpthread.so.0
> #1  0x00007ffff7299b06 in virCondWait (c=<optimized out>, m=<optimized out>)
>     at util/virthread.c:154
> #2  0x00007ffff729a770 in virThreadPoolFree (pool=<optimized out>)
>     at util/virthreadpool.c:290
> #3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
>     at rpc/virnetserver.c:767
> #4  0x00007ffff727924b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #5  0x000055555556f2e3 in main (argc=<optimized out>, argv=<optimized out>)
>     at libvirtd.c:1523
> (gdb)
> 
> 
> So we still haven't found the "root cause", but I think Erik is on to
> something in the other part of this thread. I'll go there.
> 
> 
> John
> 



