[libvirt] deadlock in remoteDispatchDomainUndefine vs daemonStreamHandleAbort

Christian Ehrhardt christian.ehrhardt at canonical.com
Tue Apr 2 08:45:59 UTC 2019


On Mon, Apr 1, 2019 at 4:35 PM Michal Privoznik <mprivozn at redhat.com> wrote:
>
> On 4/1/19 4:25 PM, Christian Ehrhardt wrote:
> > Hi,
> > I happened to analyze a bug report [1] I got from a friend, and for
> > quite a while it was rather elusive. But I have now finally got it
> > reproducible [2] enough to share it with the community.
> >
> > The TL;DR of what I see is:
> > - an automation using python-libvirt gets a SIGINT
> > - its cleanup runs destroy and then undefine
> > - the guest closes FDs due to the SIGINT and/or the destroy, which
> >   triggers daemonStreamHandleAbort
> > - those two paths fight over the same lock
> >
> > There I get libvirtd into a deadlock which ends up with all threads
> > dead [4] and two of them fighting [3] (details) in particular.
> >
> > The two related stacks, summarized, look like:
> >
> > daemonStreamHandleWrite (failing to write)
> >   -> daemonStreamHandleAbort (closing things and cleaning up)
> >      -> ... virChrdevFDStreamCloseCb
> >          virMutexLock(&priv->devs->lock);
> >
> > # there is code meant to avoid such issues that emits "Unable to close"
> > # when a lock is held, but even with debug enabled the log doesn't
> > # show it triggering
> >
> > #10 seems triggered via an "undefine" call
> >    remoteDispatchDomainUndefine
> >    ... -> virChrdevFree
> >       ... -> virFDStreamSetInternalCloseCb
> >          -> virObjectLock(virFDStreamDataPtr fdst)
> >            -> virMutexLock(&obj->lock);
> >    # closing all streams of a guest (requiring the same locks)
> >
> > While that already feels quite close, I struggle to see where exactly
> > we'd want to fix it.
> > But now that I finally have a repro script [2], I hope that someone
> > else here might be able to help me with that.
> >
> > After all it is a race - on my s390x system it usually triggers in
> > fewer than 5 tries, while on x86 I have needed up to 18 runs of the
> > test to hit the hang. Depending on your system config it might be
> > easier or harder to reproduce.
> >
> > FYI, we initially hit this with libvirt 4.0, but libvirt 5.0 behaved
> > just the same. I haven't built 5.1 or a recent master, but none of
> > the commits since 5.0 mention an issue that seems related. OTOH I'm
> > willing and able to build and try suggestions if anyone comes up with
> > ideas.
> >
> > [1]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096
> > [2]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/+attachment/5251655/+files/test4.py
> > [3]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/3
> > [4]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/17
> >
>
> You may want to look at d63c82df8b11b583dec8e72dfb216d8c14783876
> (contained in 5.1.0) because this smells like the issue you're facing.

Thanks Michal,
I agree that this appears to be similar.
But unfortunately 5.0 plus the full 9-patch series leading into
d63c82df still triggers the deadlock that we found.
So it seems to be a new issue :-/

As I said before - any further suggestions (on commits to test and/or
how to resolve with new changes) are welcome.
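
Purely for illustration (not a claim about where libvirt should fix
this), the textbook cure for such an inversion is a single global
acquisition order, so no path ever holds one lock while waiting for the
other "backwards". Sketched in Python with hypothetical stand-in locks:

```python
# Generic illustration (not libvirt code): enforce one acquisition
# order, devs_lock before stream_lock, in every code path. With a
# total order on locks, the circular wait needed for deadlock
# cannot form.
import threading

stream_lock = threading.Lock()   # hypothetical stand-in for the stream lock
devs_lock = threading.Lock()     # hypothetical stand-in for the devs lock

def locked_in_order(fn):
    # Both paths take devs_lock first, then stream_lock.
    def wrapper(*args, **kwargs):
        with devs_lock:
            with stream_lock:
                return fn(*args, **kwargs)
    return wrapper

@locked_in_order
def stream_abort():
    return "abort done"

@locked_in_order
def undefine():
    return "undefine done"

results = []
threads = [threading.Thread(target=lambda: results.append(stream_abort())),
           threading.Thread(target=lambda: results.append(undefine()))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # -> ['abort done', 'undefine done'], no deadlock
```

The other common option would be trylock-with-backoff on the second
lock, which is roughly what the "Unable to close" guard quoted above
seems intended to do, though in my traces it never fires.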
Thanks in advance!

> Michal



-- 
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd
