[libvirt] [PATCH] processSerialChangedEvent: Close agent monitor early

Wed May 6 16:33:51 UTC 2015

On Wed, May 06, 2015 at 10:27:50AM -0600, Eric Blake wrote:
> On 05/06/2015 02:46 AM, Michal Privoznik wrote:
> > https://bugzilla.redhat.com/show_bug.cgi?id=890648
> > 
> > So, imagine you've issued an API that involves guest agent. For
> > instance, you want to query guest's IP addresses. So the API acquires
> > QUERY_JOB, locks the guest agent and issues the agemt command.
> 
> s/agemt/agent/
> 
> > However, for some reason, guest agent replies to initial ping
> > correctly, but then crashes tragically while executing real command
> > (in this case guest-network-get-interfaces). Since initial ping went
> > well, libvirt thinks guest agent is accessible and awaits reply to the
> > real command. But it will never come. What will is a monitor event.
> 
> The monitor event is telling you that your side of the socket is still
> valid, but the guest's side of the agenet is no longer connected.
> 
> > Our handler (processSerialChangedEvent) will try to acquire
> > MODIFY_JOB, which will fail obviously because the other thread that's
> > executing the API already holds a job. So the event handler exits
> > early, and the QUERY_JOB is never released nor ended.
> > 
> > I don't understand why, but qemu should in fact instead of sending a
> > event on the monitor close the socket.
> 
> Reads awkwardly; I think you were trying to say:
> 
> qemu should in fact be closing the socket instead of sending an event
> 
> But that doesn't make sense technically.  The fact that the agent is no
> longer open in the guest is no reason to close the monitor socket, and
> closing the agent socket is inappropriate in case the guest restarts the
> guest agent and reopens the port.  Thus, listening to events on the
> monitor socket about whether the agent socket is connected or dangling
> is sufficient for us to know if the agent will be responsive or not.
> 
> > But leave that for a different
> > discussion. Since qemu is not doing that, we must work around it and
> > close and unregister the agent's FD from the even loop ourselves. This
> > will wake up all callers stuck in waiting for agent's reply.
> 
> Why do we have to kill the agent's fd?  If the guest restarts the agent,
> how will we reconnect to it if we have killed our agent fd?
> 
> > Hopefully, it will make them finish the job too and even handler can
> > acquire its job later.
> 
> I agree that we want to detect agent death events as a reason to mark
> the agent unresponsive, in such a way that any other job pending on the
> agent then gracefully aborts, but I'm not sure that this is the right
> solution.

Agreed, I think it makes more sense to just mark agentError = true
when the agent is not running. Or have an bool agentRunning flag
to let us report clearer error message.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|