[libvirt] [PATCH 04/10] qemu: agent: handle agent connection errors in one place

Fri Oct 28 14:59:44 UTC 2016

On 28.10.2016 14:42, John Ferlan wrote:
> 
> 
> On 10/28/2016 03:37 AM, Nikolay Shirokovskiy wrote:
>>
>>
>> On 27.10.2016 18:51, John Ferlan wrote:
>>>
>>>
>>> On 10/27/2016 07:34 AM, Nikolay Shirokovskiy wrote:
>>>>
>>>>
>>>> On 26.10.2016 22:57, John Ferlan wrote:
>>>>>
>>>>> There's no commit message...
>>>>
>>>> This is simple refactoring.
>>>>
>>>
>>> Not really...
>>>
>>> A simple refactor would have kept the -2 logic.
>>
>> Ohh, my bad english. I mean just(mere) refactoring. Not simple.
>> However as a refactoring step this patch is not complex from my POV.
>>
> 
> Commit messages don't have to be a long story, but they shouldn't be
> empty. If you're relying on purely the 60 character commit message to
> convey what's been done, then the changes would have to be similarly short.
> 
> Part of your refactor altered logic and that's something that needs to
> be mentioned.
> 
>>>
>>> What's been done here is to let qemuConnectAgent make the decision on
>>> "hard" or "soft" error and only use < 0 as failure
>>>
>>>>>
>>>>> You're altering qemuConnectAgent to return -1 on the only error and
>>>>> perform the VIR_WARN plus agentError = true on other "soft" failures.
>>>>>
>>>>> Not exactly nothing going on!
>>>>
>>>> This is what already present in code, I've just moved soft failures
>>>> in qemuConnectAgent.
>>>>
>>>>>
>>>>> On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
>>>>>> ---
>>>>>>  src/qemu/qemu_driver.c    |  8 +------
>>>>>>  src/qemu/qemu_migration.c | 13 ++---------
>>>>>>  src/qemu/qemu_process.c   | 58 ++++++++++++-----------------------------------
>>>>>>  3 files changed, 17 insertions(+), 62 deletions(-)
>>>>>>
>>>>>
>>>>> This idea seems to have merit too - something that I've thought now that
>>>>> I've read the first 3 patches...
>>>>>
>>>>> I think you should have started with this patch, it probably would have
>>>>> made the others easier to think about. In fact, I'm curious if with just
>>>>> this change and the suggestion in patch 3 for clearing agentError
>>>>> whether you can reproduced the bug you're trying to fix/resolve without
>>>>> any of the other patches.
>>>>>
>>>>>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>>>>>> index 12ddbc0..edff973 100644
>>>>>> --- a/src/qemu/qemu_driver.c
>>>>>> +++ b/src/qemu/qemu_driver.c
>>>>>> @@ -4397,7 +4397,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
>>>>>>      virObjectEventPtr event = NULL;
>>>>>>      virDomainDeviceDef dev;
>>>>>>      qemuDomainObjPrivatePtr priv = vm->privateData;
>>>>>> -    int rc;
>>>>>>  
>>>>>>      if (connected)
>>>>>>          newstate = VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED;
>>>>>> @@ -4450,13 +4449,8 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
>>>>>>  
>>>>>>      if (STREQ_NULLABLE(dev.data.chr->target.name, "org.qemu.guest_agent.0")) {
>>>>>>          if (newstate == VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) {
>>>>>> -            if (!priv->agent) {
>>>>>> -                if ((rc = qemuConnectAgent(driver, vm)) == -2)
>>>>>> +            if (!priv->agent && qemuConnectAgent(driver, vm) < 0)
>>>>>
>>>>> FWIW:
>>>>> qemuConnectAgent returns 0 "if (priv->agent)", so there's one less check
>>>>> here too.  Could just be "if (qemuConnectAgent(driver, vm) < 0)"
>>>>
>>>> It depends on patch 3. If we clear the error early in qemuConnectAgent then
>>>> we'd better check before calling this function.
>>>>
>>>
>>> Right my mind was already there or more succinctly said - I was trying
>>> to consider this as the first patch as I think it'll make the reasoning
>>> a bit more clear.
>>
>> Agree.
>>
>>>
>>> Since the "issue" has been described now as a pure shutdown and restart
>>> after error problem without needing to consider the migration,
>>> reconnect, and attach separately - it's easier to focus on.
>>>
>>> So my feeling now after reading and thinking a bit about everything so
>>> far is that this patch could go in (mostly) as is.  That way there are
>>> then 2 places to set agentError=true and two places to set
>>> agentError=false (discluding original allocation and reconnect paths).
>>>
>>> I don't think we can "combine" the Error and EOF paths. They're
>>> separate. What happens if some day someone uses the VSERPORT logic to
>>
>> Right now there is no value in separation AFAIU but a lot of code that is
>> difficult to follow. 
>>
> 
> EOF and Error are different. That's why there's two different callbacks.
> The agent implementation has 'chosen' to just make Error essentially
> fatal for all future attempts while the guest is running. I seem to
> recall recent discussion regarding why there's a separate Error handler
> and the thought that some sort of error recover could be done, but I
> cannot find it in the last 3 months of archives on a simple title
> search. It could have been a conversation at KVM Forum too.
> 
> Could someone come along and alter the current reality - perhaps.  Your
> "choice" is remove a bunch of code that someone would then have to add
> back later. I'm erring on the side of caution - having separate Error
> and EOF does work - what doesn't work right is the 'agentError' logic.
> But we can fix that.

The point is that we can drop error flag altogether if we make explicit
the current state. And current state is that we don't need to differentiate
EOF and Error.

> 
> As for difficult to follow - that's perhaps my other hot button issue -
> it's not self documenting and that's frustrating. Some will tout usage
> of git log history to understand how something's supposed to work. I
> land more on the side of - if you learn something about code you're
> changing, then leave entrails in the code to hopefully help the next
> person reading the code and not the git log's. It's a delicate balance
> for my penchant to be too verbose vs. providing sufficient information.
> I'll sometimes leave a comment somewhere that means nothing to someone
> else, but causes me to remember why I did something (like the pensieve
> in the Harry Potter movies ;-)).

It is true. Some cases are not obvious from code only. But in this case
we have just pieces of code that add nothing but complexity AFAIU. And 
if the strategy is keeping the code even if it has no value then complexity
will be grow further and further... If there is value - then let's
make it explicit or if nothing can be done in this attempt at least
document the code. But after looking for a while at this code I realized
- it can be simplier.

> 
>>> somehow restart an agent after an error, after a close, and before a
>>> shutdown. IOW: Some kind of error recovery logic.
>>
>> Dont's see how having separate EOF and Error will help in this case.
>>
> 
> Check the callers of qemuDomainAgentAvailable and imagine a day where we
> can take an error and do something smarter...  If there's multiple
> "things" an Agent can do and we disable them all just because of an
> error that's kind of frustrating. So maybe someone comes along and
> classifies the calls into groups (state, vcpu, FS, time, interface,
> security). Then let's say we have some sort of error from a vcpu call -
> does that necessarily mean we should disable getting/setting time? Do we
> know enough about the error. The vmagent is not something I focus on, so
> I don't have the answer to whether we could make something recoverable.
> I also know there's a Linux agent and Windows agent - so

Error in io loop makes agent non-functional, no matter what command
is. It's just a case of breaked channel, communication is impossible
until new attempt to start communication.

So I'm still don't understand why we need eof/error.

> 
> In any case, I  don't think we should disable error processing just
> because it's difficult to follow...

No no, the point was no value *and* difficult to follow. Difficult
to follow alone can not be argument ))

Ok. After giving it more thought I think I'm not right on this.
We *do* want to know whether it is EOF or Error. For example
on domain shutdown we can receive EOF on channel and it i'd
want to report it as 'not connected' rather then io error.

I'll make new series.

> 
> As an aside - another "interesting" part of the VSERPORT processing...
> qemuAgentConnect will avoid running opening the channel if VSERPORT
> capability is set *and* config->state != CONNECTED. However, the
> VSERPORT processing code in processSerialChangedEvent will only work if
> the target.name is "org.qemu.guest_agent.0"... Sure that's our default,
> but nonetheless interesting about two different checks.
> 
>>>
>>> In order to handle the issue where agentError is set, but we cannot
>>> start again (e.g. the have error, do shutdown, do start path) - add an
>>> unconditional priv->agentError = false before in qemuConnectAgent before
>>> the virSecurityManagerSetDaemonSocketLabel call.  The other agentError =
>>> false settings can then be removed.
>>
>> Can not agree. For modern qemu agent will be in error state on start up
>> until serial port appeared. Should be in 'not connected' state.
>>
> 
> It's semantics... the error someone will get until VSERPORT runs is
> agent unavailable due to error"... Once the serial port magic happens,
> it could then be running, error, or not connected.  But this would only
> be for the case where guest running, vmagent error, guest shutdown, and
> guest start (until serial port is connect).

And this will be quite misleading in logs. Agent error is major error
condition AFAIK somethings is wrong with channel to qemu agent. qemu
or guest driver need to be fixed...

> 
> FWIW: One way around that is in the modern (processSerialChangedEvent)
> code rather than clearing agentError only if priv->agent was set, you
> could also unconditionally set it after/before. IOW: DISCONNECTED was
> received.  A possible downside to that option is it's not

I think it's fragile. Serial disconnected and getting error condition in channel 
are in race.

> architecturally clear to me what conditions other than grep-ing thru the
> JSON return data in qemuMonitorJSONExtractChardevInfo what could cause
> DISCONNECTED. Does it only occur at shutdown?  Or could it occur when
> say some network/socket error causes a disconnect. In which case, a
> running guest/agent could have an error, get a disconnect, but clear
> agentError 'unexpectedly'... Does the agent logic at some point in time
> send a second CONNECTED if for some reason a network is restored.
> Again, I'm not that familiar with the inner workings there, so I err on
> the side of caution and not unconditionally clear.
> 
> John
> 
>>>
>>> So we're left with 2 places to set error and one place to clear. I'll
>>> continue to consider this, bug figured I'd get out a response to all
>>> this sooner than later.
>>>
>>> I'm less convinced mucking with combining EOF and Error is a good idea
>>> in patches 7-10, but I haven't thought too much about it either.
>>>
>>> John
>>>>>
>>>>>>                      goto endjob;
>>>>>
>>>>> The indention for this is off (remove leading 4 spaces)
>>>>>
>>>>>> -
>>>>>> -                if (rc < 0)
>>>>>> -                    priv->agentError = true;
>>>>>> -            }
>>>>>>          } else {
>>>>>>              if (priv->agent) {
>>>>>>                  qemuAgentClose(priv->agent);
>>>>>> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
>>>>>> index e2ca330..0a02236 100644
>>>>>> --- a/src/qemu/qemu_migration.c
>>>>>> +++ b/src/qemu/qemu_migration.c
>>>>>> @@ -6171,7 +6171,6 @@ qemuMigrationFinish(virQEMUDriverPtr driver,
>>>>>>      unsigned short port;
>>>>>>      unsigned long long timeReceived = 0;
>>>>>>      virObjectEventPtr event;
>>>>>> -    int rc;
>>>>>>      qemuDomainJobInfoPtr jobInfo = NULL;
>>>>>>      bool inPostCopy = false;
>>>>>>      bool doKill = true;
>>>>>> @@ -6244,16 +6243,8 @@ qemuMigrationFinish(virQEMUDriverPtr driver,
>>>>>>                                        QEMU_ASYNC_JOB_MIGRATION_IN) < 0)
>>>>>>          goto endjob;
>>>>>>  
>>>>>> -    if ((rc = qemuConnectAgent(driver, vm)) < 0) {
>>>>>> -        if (rc == -2)
>>>>>> -            goto endjob;
>>>>>> -
>>>>>> -        VIR_WARN("Cannot connect to QEMU guest agent for %s",
>>>>>> -                 vm->def->name);
>>>>>> -        virResetLastError();
>>>>>> -        priv->agentError = true;
>>>>>> -    }
>>>>>> -
>>>>>> +    if (qemuConnectAgent(driver, vm) < 0)
>>>>>> +        goto endjob;
>>>>>>  
>>>>>>      if (flags & VIR_MIGRATE_PERSIST_DEST) {
>>>>>>          if (qemuMigrationPersist(driver, vm, mig, !v3proto) < 0) {
>>>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>>>> index 42f7f84..d7c9ce3 100644
>>>>>> --- a/src/qemu/qemu_process.c
>>>>>> +++ b/src/qemu/qemu_process.c
>>>>>> @@ -206,7 +206,6 @@ int
>>>>>>  qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
>>>>>>  {
>>>>>>      qemuDomainObjPrivatePtr priv = vm->privateData;
>>>>>> -    int ret = -1;
>>>>>>      qemuAgentPtr agent = NULL;
>>>>>>      virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
>>>>>>  
>>>>>> @@ -252,8 +251,7 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
>>>>>>          qemuAgentClose(agent);
>>>>>>          virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>>>>>>                         _("guest crashed while connecting to the guest agent"));
>>>>>> -        ret = -2;
>>>>>> -        goto cleanup;
>>>>>> +        return -1;
>>>>>>      }
>>>>>>  
>>>>>>      if (virSecurityManagerClearSocketLabel(driver->securityManager,
>>>>>> @@ -264,18 +262,16 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
>>>>>>          goto cleanup;
>>>>>>      }
>>>>>>  
>>>>>> -
>>>>>>      priv->agent = agent;
>>>>>>  
>>>>>> -    if (priv->agent == NULL) {
>>>>>> -        VIR_INFO("Failed to connect agent for %s", vm->def->name);
>>>>>
>>>>> Interesting a "marker" of sorts to know the virAgentOpen "failed" is
>>>>> that we'd get an Informational "Failed to connect..." followed shortly
>>>>> thereafter by a Warning "Cannot connect..." (depending of course on
>>>>> one's message display severity level).
>>>>>
>>>>> I think if you "restore" this without the goto cleanup (below) and of
>>>>> course the removal of the {}, then at least message parity is
>>>>> achieved...  I suppose it could move up into the "if (agent == NULL)"
>>>>> too, but then it could be given even though an Error is reported.
>>>>>
>>>>> It's not that important, but it does leave some breadcrumbs.
>>>>
>>>> Agree. Even though I'm still wondering how useful this message is as
>>>> this patch is intended to be refactoring let's keep this message.
>>>>
>>>>>
>>>>> Again I'd like to see the breadcrumbs with just this patch applied to
>>>>> reproduced what you're chasing in patches 1-4...
>>>>>
>>>>
>>>> Sorry, not possible. The situation I've described in patch 1 is based on analysis. However
>>>> after applying the patches 1-3 the bug is not occured in our test system anymoree.
>>>>
>>>> Nikolay
>>>>