[libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

Wed Oct 13 13:11:25 UTC 2010

On Thu, Oct 07, 2010 at 09:58:28AM -0400, Stefan Berger wrote:
>  On 10/07/2010 09:06 AM, Soren Hansen wrote:
> >I had trouble applying the patch (I think maybe Thunderbird may have
> >fiddled with the formatting :( ), but after doing it manually, it works
> >excellently. Thanks!
> >
> Great. I will prepare a V3.
> 
> I am also shooting a kill -SIGHUP at libvirt once in a while to see what 
> happens (while creating / destroying 2 VMs and modifying their filters). 
> Most of the time all goes well, but occasionally things do get stuck. I 
> get the following debugging output from libvirt and attaching gdb to 
> libvirt I see the following stack traces. Maybe Daniel can interpret 
> this... To me it looks like some of the conditions need to be 'tickled'...
> 
> 
> 09:47:25.000: error : qemuAutostartDomain:822 : Failed to start job on VM 'dummy-vm1': Timed out during operation: cannot acquire state change lock
> 
> (gdb) thr ap all bt
> 
> Thread 9 (Thread 0x7f49bf592710 (LWP 17464)):
> #0  0x000000327680b729 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/libpthread.so.0
> #1  0x0000000000435312 in virCondWaitUntil (c=<value optimized out>,
>     m=<value optimized out>, whenms=<value optimized out>)
>     at util/threads-pthread.c:115
> #2  0x000000000043d0ab in qemuDomainObjBeginJobWithDriver (driver=0x1f9c010,
>     obj=0x7f49a00011b0) at qemu/qemu_driver.c:409
> #3  0x0000000000458abf in qemuAutostartDomain (payload=<value optimized out>,
>     name=<value optimized out>, opaque=0x7f49bf591320)
>     at qemu/qemu_driver.c:818
> #4  0x00007f49c040ab6a in virHashForEach (table=0x1f9be20,
>     iter=0x458a90 <qemuAutostartDomain>, data=0x7f49bf591320)
>     at util/hash.c:495
> #5  0x000000000043cdac in qemudAutostartConfigs (driver=0x1f9c010)
>     at qemu/qemu_driver.c:855
> #6  0x000000000043ce2a in qemudReload () at qemu/qemu_driver.c:2003
> #7  0x00007f49c0450a3e in virStateReload () at libvirt.c:1017
> #8  0x00000000004189e1 in qemudDispatchSignalEvent (
>     watch=<value optimized out>, fd=<value optimized out>,
>     events=<value optimized out>, opaque=0x1f6f830) at libvirtd.c:388
> #9  0x00000000004186a9 in virEventDispatchHandles () at event.c:479
> #10 virEventRunOnce () at event.c:608
> #11 0x000000000041a346 in qemudOneLoop () at libvirtd.c:2217
> #12 0x000000000041a613 in qemudRunLoop (opaque=0x1f6f830) at libvirtd.c:2326
> #13 0x0000003276807761 in start_thread () from /lib64/libpthread.so.0
> #14 0x00000032760e14ed in clone () from /lib64/libc.so.6

This thread shows the problem. Guests must not be started directly
from the event loop thread, because startup requires waiting
for I/O events. This thread is sitting on the condition
variable waiting for an I/O event to complete, but because
it's doing this from the event loop thread, the event loop
isn't running, so the condition will never be signalled.
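
To make the failure mode concrete, here is a minimal sketch in Python
rather than libvirt's C. Everything in it (EventLoop, start_guest, the
timeouts) is a hypothetical stand-in and not libvirt code. It only shows
that a callback which blocks inside the loop thread waiting for an event
that the same loop must deliver will time out, while the same wait done
from a worker thread succeeds.

```python
import queue
import threading


class EventLoop:
    """Toy single-threaded event loop (hypothetical, not libvirt's)."""

    def __init__(self):
        self._events = queue.Queue()

    def post(self, callback):
        # Queue a callback to be run by the loop thread.
        self._events.put(callback)

    def run_until_idle(self, idle_timeout=0.5):
        # Dispatch callbacks one at a time until none arrive for a while.
        while True:
            try:
                callback = self._events.get(timeout=idle_timeout)
            except queue.Empty:
                return
            callback()


def start_guest(loop, timeout):
    """Pretend guest startup: wait for an 'I/O event' the loop delivers."""
    io_done = threading.Event()
    loop.post(io_done.set)        # only the loop can signal completion
    return io_done.wait(timeout)  # True if signalled, False on timeout


def run_from_loop_thread():
    # start_guest runs inside a loop callback, so the loop cannot
    # dispatch io_done.set until the wait has already given up.
    loop = EventLoop()
    result = []
    loop.post(lambda: result.append(start_guest(loop, timeout=0.2)))
    loop.run_until_idle()
    return result[0]


def run_from_worker_thread():
    # The loop callback only spawns a worker; the loop stays free to
    # dispatch the completion event while the worker waits.
    loop = EventLoop()
    result = []
    worker = threading.Thread(
        target=lambda: result.append(start_guest(loop, timeout=2.0)))
    loop.post(worker.start)
    loop.run_until_idle()
    worker.join()
    return result[0]
```

In this sketch run_from_loop_thread() returns False (the wait times out,
much like the "Timed out during operation" error above), and
run_from_worker_thread() returns True.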

This is completely unrelated to the other problems discussed
in this thread & I'm surprised we've not seen it before now!

When you send SIGHUP to libvirt, this triggers a reload of the
guest domain configs. For some reason we also have this SIGHUP
re-triggering autostart. IMHO this is a very big mistake. If
a guest is marked as autostart, I don't think an admin would
expect it to be started when just sending SIGHUP. I think we
should fix it so that autostart is only ever done at daemon
startup, not SIGHUP. This would avoid the entire problem code
path here.
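
As a rough illustration of the proposed behaviour, again in Python with
purely hypothetical names (ToyDaemon, startup, reload) and not the actual
libvirt functions: the reload path re-reads the configs but never calls
autostart, so only daemon startup can launch guests.

```python
class ToyDaemon:
    """Hypothetical daemon model: guest name -> autostart flag."""

    def __init__(self, configs):
        self.configs = dict(configs)
        self.started = []

    def load_configs(self, configs):
        # Replace the in-memory configs with the freshly read ones.
        self.configs = dict(configs)

    def autostart(self):
        # Start every guest flagged autostart that is not yet running.
        for name, flag in sorted(self.configs.items()):
            if flag and name not in self.started:
                self.started.append(name)

    def startup(self):
        # Daemon startup is the only place autostart is triggered.
        self.autostart()

    def reload(self, configs):
        # What a SIGHUP handler would call: reload configs only.
        self.load_configs(configs)
```

With this split, a guest newly marked autostart in a reloaded config is
picked up at the next daemon start, not at the moment of the SIGHUP.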

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|