[libvirt] [PATCH] processSerialChangedEvent: Close agent monitor early

Thu May 7 09:07:13 UTC 2015

On 06.05.2015 18:33, Daniel P. Berrange wrote:
> On Wed, May 06, 2015 at 10:27:50AM -0600, Eric Blake wrote:
>> On 05/06/2015 02:46 AM, Michal Privoznik wrote:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=890648
>>>
>>> So, imagine you've issued an API that involves guest agent. For
>>> instance, you want to query guest's IP addresses. So the API acquires
>>> QUERY_JOB, locks the guest agent and issues the agemt command.
>>
>> s/agemt/agent/
>>
>>> However, for some reason, guest agent replies to initial ping
>>> correctly, but then crashes tragically while executing real command
>>> (in this case guest-network-get-interfaces). Since initial ping went
>>> well, libvirt thinks guest agent is accessible and awaits reply to the
>>> real command. But it will never come. What will is a monitor event.
>>
>> The monitor event is telling you that your side of the socket is still
>> valid, but the guest's side of the agenet is no longer connected.
>>
>>> Our handler (processSerialChangedEvent) will try to acquire
>>> MODIFY_JOB, which will fail obviously because the other thread that's
>>> executing the API already holds a job. So the event handler exits
>>> early, and the QUERY_JOB is never released nor ended.
>>>
>>> I don't understand why, but qemu should in fact instead of sending a
>>> event on the monitor close the socket.
>>
>> Reads awkwardly; I think you were trying to say:
>>
>> qemu should in fact be closing the socket instead of sending an event
>>
>> But that doesn't make sense technically.  The fact that the agent is no
>> longer open in the guest is no reason to close the monitor socket, and
>> closing the agent socket is inappropriate in case the guest restarts the
>> guest agent and reopens the port.  Thus, listening to events on the
>> monitor socket about whether the agent socket is connected or dangling
>> is sufficient for us to know if the agent will be responsive or not.
>>
>>> But leave that for a different
>>> discussion. Since qemu is not doing that, we must work around it and
>>> close and unregister the agent's FD from the even loop ourselves. This
>>> will wake up all callers stuck in waiting for agent's reply.
>>
>> Why do we have to kill the agent's fd?  If the guest restarts the agent,
>> how will we reconnect to it if we have killed our agent fd?
>>
>>> Hopefully, it will make them finish the job too and even handler can
>>> acquire its job later.
>>
>> I agree that we want to detect agent death events as a reason to mark
>> the agent unresponsive, in such a way that any other job pending on the
>> agent then gracefully aborts, but I'm not sure that this is the right
>> solution.
> 
> Agreed, I think it makes more sense to just mark agentError = true
> when the agent is not running. Or have an bool agentRunning flag
> to let us report clearer error message.

While this will currently work, it's not future proof. We must close the
agent socket unconditionally even without acquiring a modify job.
Because what if we have an API, that will acquire a job (be it
query/modify/... whatever), then do some agent commands, and even if
they fail it recovers from error and finish the job? But finishing the
job may take so long, that event handler will timeout on its BeginJob()
and we effectively do nothing. Therefore I think calling
qemuAgentClose() is right thing to do. Moreover, our code should be able
to deal with a FD being closed anytime.

Okay, I can work in the flag to produce more sensible error message, but
I think it needs to be combined with my original patch.

Michal