[libvirt] [PATCH 01/10] qemu: agent: unset error flag in EOF handler

Wed Oct 26 19:48:13 UTC 2016

On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
> Usually on domain shutdown event comes first from qemu
> and the flag is unset in processSerialChangedEvent. However
> there is a race between this function and qemuProcessHandleAgentEOF
> because the former is executed in a thread pool and the latter
> executed synchronously. qemuProcessHandleAgentEOF do not unset
> the flag thus if it was triggered before it remains set after domain
> shutdown/start and thus qemu agent becomes unavailable.

What error occurred that would be avoided by clearing the agentError
flag unconditionally? Considering that agentError being set is primarily
used by "new connection" request processing (e.g. callers to
qemuDomainAgentAvailable) in order to ascertain whether they should
actually attempt the connection or fail because of some condition that
would eventually lead to failure.  I would think EOF from qemu_monitor
for the qemu_agent would be one of those conditions *before* the domain
state is not RUNNING or the priv->agent has been fully removed.

I think what I'm missing is what sequence of events led to this
condition? How was it reproduced.

> ---
>  src/qemu/qemu_process.c | 1 +
>  1 file changed, 1 insertion(+)
> 

So I understand "eventually" your goal is to remove agentError in patch
6, but you need to lead down the path to get there. So let's see if I
can follow along...

The issue you chased is related to domain shutdown processing "order".

The purpose up to this point in time for 'agentError' is to ensure
callers of qemuDomainAgentAvailable() will fail before trying to use the
agent. The qemuDomainAgentAvailable checks domain active, agentError,
and whether the priv->agent is available.

So the contention is during shutdown the EOF event firing at about the
same time as the processSerialChangedEvent has caused an issue, but it's
not 100% clear just from the description the sequence of events and the
issue. So let's consider why these things happen:

1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if
the guest is shutting down

2. The 'processSerialChangedEvent' seems to occur for multiple reasons -
startup, shutdown, some sort of channel failure (unix socket or pty). If
I'm reading things correctly, this path is only involved when VSERPORT
is being used.

So, is this a case of guest startup where the initial agent
connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is
this purely a shutdown ordering thing?

> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index 31c8453..a581f96 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>  
>      qemuAgentClose(agent);
>      priv->agent = NULL;

Because this is cleared, then qemuDomainAgentAvailable will fail because
the agent is not present instead of failing with there was some error.
That would seem to make the following clearing "safe" since we know
we'll have a failure any way, but...

> +    priv->agentError = false;

How does this alone solve the problem you saw.  AFAICT it just changes
*where* the error occurs and the error message displayed.

Conversely, if for some reason for either

1. qemuConnectAgent failed and is being "flagged" in
  1a. processSerialChangedEvent (for initial CONNECTED failure when
QEMU_CAPS_VSERPORT_CHANGE)
  1b. qemuMigrationFinish (for migration target)
  1c. qemuProcessReconnect (for libvirtd restart)
  1d. qemuProcessLaunch (initial domain startup before
QEMU_CAPS_VSERPORT_CHANGE)
  1e. qemuProcessAttach (for qemu-attach to existing qemu process)

or

2. qemuProcessHandleAgentError - some sort of error and we're stopping
any further agent communication.

then you're potentially clearing agentError and could conceivably cause
"extra" debugging steps to be taken only to figure out the reason it was
because of some error rather than the "next" connection failing and
indicating the failure was because of some previous error (it's a
startup race for some other thread desiring to use the agent).

>  
>      virObjectUnlock(vm);
>      return;
> 

So given all that - I'm not 100% clear how this really fixes the issue
you say.  Although, I have to wonder if the bug you're chasing is
perhaps that qemuConnectAgent doesn't clear agentError at the right time
as patch 3 attempts to fix.

After reading through patch 4 - I think Patch 4 and a slightly modified
patch 3 will fix what you're seeing, but I'd only know for sure given
more details about what was seen.

John