[libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination

Nikolay Shirokovskiy nshirokovskiy at virtuozzo.com
Thu Dec 21 09:13:47 UTC 2017

On 21.12.2017 00:52, John Ferlan wrote:
> [...]
>>> Now you've lost me. What are the back traces? and now does one
>>> reasonably reproduce?  Are you trying to advocate here for [2] to be
>>> reviewed/accepted?
>> Sorry for that. Unfortunately back trace was only emailed in the first version
>> of series [3]. The reproducer is there too.
>> I'm not here for [2]. It addresses a different issue. However I want to 
>> make some progress on that series too.
>> [3] https://www.redhat.com/archives/libvir-list/2017-September/msg01006.html
> Ahh, sigh...  I should have been more "proactive" in ensuring that I had
> gone back through the above test when two of the patches I had posted
> related to the changes in this code were not ACK'd. Those two patches are:
> https://www.redhat.com/archives/libvir-list/2017-November/msg00296.html
> and
> https://www.redhat.com/archives/libvir-list/2017-November/msg00297.html
> What those did was perform the Unref from libvirtd as soon as the
> AddServer or AddProgram added to the respective hash table (and of
> course incr the refcnt). The result is the Unref for each was done
> *prior* to the virNetDaemonClose and prior to the virStateCleanup which
> solves the late run after Close when Dispose runs.
> What jiggled my memory on this was today there was an OFTC IRC posting
> regarding a virt-manager crash (see pastebin from cbosdonnat at
> http://paste.opensuse.org/44522138) as a result of incorrect ordering of
> daemon unref and running virStateCleanup.  A copy/paste of the
> subsequent conversation is:
> <cbosdonnat> I'm seeing a crash here when libvirtd is stopping and a
> client (namely virt-manager) calls
> networkConnectNetworkEventDeregisterAny().... is it OK to just goto
> cleanup if the driver is NULL in that function?
> <danpb> cbosdonnat: you mean the driver is being torn down in
> virStateCleanup ?  and an API is still allowed to be run ?
> <cbosdonnat> danpb, looks like that indeed
> <cbosdonnat> where would the API call be rejected otherwise?
> <danpb> that seems a much more general problem - we should not call
> virStateCleanup until we've killed off all the API worker threads
> <danpb> i kind of thought we already did that
> <danpb> is it perhaps that we're killing off active clients and as part
> of that we're cleaning up events they have registered ?
> <cbosdonnat> danpb, here is the backtrace of the crash:
> http://paste.opensuse.org/44522138
> <cbosdonnat> so seems like that indeed
> <danpb> oh yeah, we're killing off clients
> <danpb> we need to explicitly be able to kill off clients before
> shutting down drivers
> <danpb> surprised we've not seen this before to be honest
> <danpb> i wonder if some refactoring  has introduced this problem recently
> <cbosdonnat> are those done in separate threads?
> <danpb> i can't rememeber to be honest
> <cbosdonnat> I'll chase the order of these then
> <cbosdonnat> danpb, the state cleanup happens before the daemon unref...
> looks like we'll need to purge the daemon before that
> <cbosdonnat> danpb, would that be completely silly to move the client
> closing code to virNetServerClose()?
> <danpb> that seems pretty sensible to me actually
> <danpb> logically they feel related actions
> <cbosdonnat> we can keep the unref of them where it is though if it
> makes more sense
> <cbosdonnat> yep
> So yes, the "refactoring" that caused this is commit id '2f3054c22'.
> As a test, I altered the code to perform the all the Unref's before the
> virStateCleanup and that "resolved" the virt-manager crash; however,
> that would seem to conflict with commit id '85c3a1820' which noted that
> performing the last Unref(dmn) prior to virStateCleanup would cause
> issues for daemonInhibitCallback.
> So I moved the Unref(dmn) after virStateCleanup... That of course led to
> the virt-manager issue.  So I figured I'd give adding this patch in and
> that then fixed the virt-manager crash (obviously as we'd be Unref'ing
> all those servers we added).
> So after all that if I add that sleep into domstats as shown in the [3]
> link from above - I don't get a crash, but it also seems to cause the
> SIGTERM to be ignored at least the first time through. Running a second
> SIGTERM does the proper kill.
> So short story made really long, I think the best course of action will
> be to add this patch and reorder the Unref()'s (adminProgram thru srv,
> but not dmn). It seems to resolve these corner cases, but I'm also open
> to other suggestions. Still need to think about it some more too before
> posting any patches.

I'm not grasp the whole picture yet but I've managed to find out what
triggered the crash. It is not 2f3054c22 where you reordered unrefs but
1fd1b766105 which moves events unregistering from netserver client closing to
netservec client disposing. Before 1fd1b766105 we don't have crash
because clients simply do not get disposed.
As to fixing the crash with this patch I thinks its is coincidence. I want
do dispose netservers early to join rpc threads and it turns out that
disposing also closing clients too and this fixes the problem.


More information about the libvir-list mailing list