[libvirt] [PATCH v2 7/9] rpc: Alter virNetDaemonQuit processing

Mon Jul 16 09:18:59 UTC 2018

On Thu, Jul 12, 2018 at 03:34:00PM -0400, John Ferlan wrote:
> 
> 
> On 07/09/2018 03:11 AM, Peter Krempa wrote:
> > On Sat, Jul 07, 2018 at 08:11:05 -0400, John Ferlan wrote:
> >> When virNetDaemonQuit is called from libvirtd's shutdown
> >> handler (daemonShutdownHandler) we need to perform the quit
> >> in multiple steps. The first part is to "request" the quit
> >> and notify the NetServer's of the impending quit which causes
> >> the NetServers to inform their workers that a quit was requested.
> >>
> >> Still because we cannot guarantee a quit will happen or it's
> >> possible there's no workers pending, use a virNetDaemonQuitTimer
> >> to not only break the event loop but keep track of how long we're
> >> waiting and we've waited too long, force an ungraceful exit so
> >> that we don't hang waiting forever or cause some sort of SEGV
> >> because something is still pending and we Unref things.
> >>
> >> Signed-off-by: John Ferlan <jferlan at redhat.com>
> >> ---
> >>  src/libvirt_remote.syms    |  1 +
> >>  src/remote/remote_daemon.c |  1 +
> >>  src/rpc/virnetdaemon.c     | 68 +++++++++++++++++++++++++++++++++++++-
> >>  src/rpc/virnetdaemon.h     |  4 +++
> >>  4 files changed, 73 insertions(+), 1 deletion(-)
> > 
> > [...]
> > 
> >> @@ -855,10 +904,27 @@ virNetDaemonRun(virNetDaemonPtr dmn)
> >>          virObjectLock(dmn);
> >>  
> >>          virHashForEach(dmn->servers, daemonServerProcessClients, NULL);
> >> +
> >> +        /* HACK: Add a dummy timeout to break event loop */
> >> +        if (dmn->quitRequested && quitTimer == -1)
> >> +            quitTimer = virEventAddTimeout(500, virNetDaemonQuitTimer,
> >> +                                           &quitCount, NULL);
> >> +
> >> +        if (dmn->quitRequested && daemonServerWorkersDone(dmn)) {
> >> +            dmn->quit = true;
> >> +        } else {
> >> +            /* Firing every 1/2 second and quitTimeout in seconds, force
> >> +             * an exit when there are still worker threads running and we
> >> +             * have waited long enough */
> >> +            if (quitCount > dmn->quitTimeout * 2)
> >> +                _exit(EXIT_FAILURE);
> > 
> > If you have a legitimate long-running job which would finish eventually
> > and e.g. write an updated status XML this will break things. I'm not
> > persuaded that this is a systematic solution to some API getting stuck.
> > 
> > The commit message also does not help persuading me that this is a good
> > idea.
> > 
> > NACK
> > 
> 
> I understand the sentiment and this hasn't been a "top of the list" item
> for me to think about, but to a degree the purpose of re-posting the
> series was to try to come to a consensus over some solution. A naked
> NACK almost makes it appear as if you would prefer to not do anything in
> this situation, letting nature takes it's course (so to speak).
> 
> Do you have any thoughts or suggestions related to what processing you
> believe is appropriate if someone issues a SIG{INT|QUIT|TERM} to a
> daemon (libvirtd|lockd|logd}? That is what to do if we reach the
> cleanup: label in main() of src/remote/remote_daemon.c and
> virNetDaemonClose() gets run? Right now IIRC it's either going to be a
> SEGV or possible long wait for some worker thread to complete. The
> latter is the don't apply this example to cause a wait in block stats
> fetch. Naturally the long term wait only matters up through the point
> while someone is patient before issuing a SIGKILL.
> 
> Do you favor waiting forever for some worker thread to finish? To some
> degree if some signal is sent to libvirtd, then I would think the
> expectation is to not wait for some indeterminate time. And most
> definitely trying to avoid some sort of defunct state resolvable by a
> host reboot. Knowing exactly what a worker thread is doing may help, but
> that'd take a bit (;-)) of code to trace the stack. Perhaps providing a
> list of client/workers still active or even some indication that we're
> waiting on some worker rather than no feedback? How much is "too much"?
> 
> Another option that was proposed, but didn't really gain any support was
> introduction of a stateShutdown in virStateDriver that would be run
> during libvirtd's cleanup: processing prior to virNetDaemonClose. That
> would be before virStateCleanup. For qemu, the priv->mon and priv->agent
> would be closed. That would seemingly generate an error and cause the
> worker to exit thus theoretically not needing any forced quit of the
> thread "if" the reason was some sort of hang as a result of
> monitor/agent processing. Doesn't mean there wouldn't be some other
> issue causing a hang. Refreshing my memory on this - there was some
> discussion or investigation related to using virStateStop, but I since
> that's gated by "WITH_DBUS" being defined, it wasn't clear if/how it
> could be used (daemonStop is only called/setup if daemonRunStateInit has
> set it up).

I think the key thing to bear in mind is what is the benefit of what we
are trying todo. ie why do we need todo a clean shutdown instead of just
immediately exiting. The kernel will clean up all resources associated
with the process when it exits. So we only need care about a clean shutdown
if there is something needing cleanup that the kernel won't handle, or if
we are trying to debug with valgrind & similar tools.

I tend to think we put way too much time into worrying about getting perfect
clean shutdown, for little real world benefit.

IMHO, we should make a moderate effort to shutdown cleanly, but if there's
something stuck after 15 seconds, we should just uncoditionally call exit().

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|