Races / crashes in shutdown of libvirtd daemon

Mon Apr 27 15:54:08 UTC 2020

We got a new BZ filed about a libvirtd crash in shutdown

  https://bugzilla.redhat.com/show_bug.cgi?id=1828207

We can see from the stack trace that the "interface" driver is in
the middle of servicing an RPC call for virConnectListAllInterfaces()

Meanwhile the libvirtd daemon is doing virObjectUnref(dmn) on the
virNetDaemonPtr object.

The fact that it is doing this unref, means that it must have already
call virStateCleanup(), given the code sequence:

    /* Run event loop. */
    virNetDaemonRun(dmn);

    ret = 0;

    virHookCall(VIR_HOOK_DRIVER_DAEMON, "-", VIR_HOOK_DAEMON_OP_SHUTDOWN,
                0, "shutdown", NULL, NULL);

 cleanup:
    /* Keep cleanup order in inverse order of startup */
    virNetDaemonClose(dmn);

    virNetlinkEventServiceStopAll();

    if (driversInitialized) {
        /* NB: Possible issue with timing window between driversInitialized
         * setting if virNetlinkEventServerStart fails */
        driversInitialized = false;
        virStateCleanup();
    }

    virObjectUnref(adminProgram);
    virObjectUnref(srvAdm);
    virObjectUnref(qemuProgram);
    virObjectUnref(lxcProgram);
    virObjectUnref(remoteProgram);
    virObjectUnref(srv);
    virObjectUnref(dmn);

Unless I'm missing something non-obvious, this cleanup code path is
inherantly broken & racy.   When virNetDaemonRun() returns the RPC
worker threads are all still active. They are all liable to still
be executing RPC calls, which means any of the drivers may be in
use. So calling  virStateCleanup() is an inherantly dangerous
thing to do. There is the further complication that once we have
exitted the main loop we may prevent the RPC calls from ever
completing, as they may be waiting on an event to be dispatched.

I know we're had various patch proposals in the past to improve the
robustness of shutdown cleanup but I can't remember the outcome of the
reviews. Hopefully people involved in those threads can jump in here...

IMHO the key problem here is the virNetDeamonRun() method which just
looks at the "quit" flag and immediately returns if it is set.
This needs to be changed so that when it sees quit == true, it takes
the following actions

  1. Call virNetDaemonClose() to drop all RPC clients and thus prevent
     new RPC calls arriving
  2. Flush any RPC calls which are queued but not yet assigned to a
     worker thread
  3. Tell worker threads to exit after finishing their current job
  4. Wait for all worker threads to exit
  5. Now virNetDaemonRun may return

At this point we can call virStateCleanup and the various other
things, as we know no drivers are still active in RPC calls.

Having said that, there could be background threads in the the
drivers which are doing work that uses the event loop thread.

So we probably need a virStateClose() method that we call from
virNetDaemonRun, *after* all worker threads are gone, which would
cleanup any background threads while the event loop is still
running.

The issue is that step 4 above ("Wait for all worker threads to exit")
may take too long, or indeed never complete.  To deal with this, it
will need a timeout. In the remote_daemon.c cleanup code path, if
there are still worker threads present, then we need to skip all
cleanup and simply call _exit(0) to terminate the process with no
attempt at cleanup, since it would be unsafe to try anything else.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|