[libvirt] [PATCH] qemu_agent: fix deadlock in qemuProcessHandleAgentEOF

John Ferlan jferlan at redhat.com
Fri Nov 20 22:08:55 UTC 2015



On 10/27/2015 05:45 PM, John Ferlan wrote:
> 
> 
> On 10/02/2015 08:17 AM, John Ferlan wrote:
>>
>>
>> On 09/26/2015 08:18 AM, Wang Yufei wrote:
>>> We shutdown a VM A by qemu agent,meanwhile an agent EOF
>>> of VM A happened, there's a chance that deadlock occurred:
>>>
>>> qemuProcessHandleAgentEOF in main thread
>>> A)  priv->agent = NULL; //A happened before B
>>>
>>>     //deadlock when we get agent lock which's held by worker thread
>>>     qemuAgentClose(agent);
>>>
>>> qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread
>>> B)  hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
>>>
>>>     if (hasRefs)
>>>         virObjectUnlock(priv->agent); //agent lock will not be released here
>>>
>>> So I close agent first, then set priv->agent NULL to fix the deadlock.
>>>
>>> Signed-off-by: Wang Yufei <james.wangyufei at huawei.com>
>>> Reviewed-by: Ren Guannan <renguannan at huawei.com>
>>> ---
>>>  src/qemu/qemu_process.c | 3 +--
>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>
>> Interesting - this is the exact opposite of commit id '1020a504' from
>> Michal over 3 years ago.
>>
>> However, a bit of digging into the claim from the commit message drove
>> me to commit id '362d04779c' which removes the domain object lock that
>> was the basis of the initial patch.
>>
>> While I'm not an expert in the vmagent code, I do note that the
>> HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed,
>> while the ExitAgent code doesn't necessarily (or directly) check whether
>> the priv->agent is still valid (IOW: that nothing else has removed it
>> already like the EOF).
>>
>> So, while I don't discount that the patch works - I'm wondering whether
>> the smarts/logic should be built into ExitAgent to check for
>> !priv->agent (and or something else) that would indicate we're already
>> on the path of shutdown.
>>
>> John
>>
>>
> 
> Ironically after spinning a few cycles and generating another patch, it
> seems my initial instincts were correct with respect to commit
> 362d04779c removing the reason/cause for 1020a504 and thus moving the
> qemuAgentClose is the correct option.
> 
> Of course I'm not sure what caused me to doubt my first thoughts and
> start down the path of trying different ways to resolve this. I think I
> somehow got it stuck in my head that AgentDestroy still was taking out a
> lock. Oh well, I did learn something at least with respect to how this
> code works - so that part is good.  I can also answer my own question
> with respect whether ExitAgent needs to check priv->agent and/or
> priv->beingDestroyed.
> 
> With this patch, the EOF code will take out the vm-lock, then attempt to
> take out the agent-lock, but will be 'stuck' waiting for the AgentExit
> code to run. Once ExitAgent is called, it will remove it's reference and
> unlock. I don't believe there's a way for the !hasRefs to be true, so I
> suppose it could be removed...
> 
> I'll hold off on pushing to allow pkrempa to make any comments.
> 
> John
> 
> 

Hmm.. Seems I got busy doing a couple of other things and forgot about
this... So, I'd also like to add some "history" to the commit message
(just in case).  I have the following queued up - does it make sense?
And of course, have I missed some other corner/edge case?

John

This essentially reverts commit id '1020a504'. It's also of note that
commit id '362d0477' notes a possible/rare deadlock similar to what was
seen in the monitor in commit id '25f582e3'. However, it seems
interceding changes including commit id 'd960d06f' should remove the
deadlock issue.

    With this change, if EOF is called:

        Get VM lock
        Check if !priv->agent || priv->beingDestroyed, then unlock VM
        Call qemuAgentClose
        Unlock VM

    When qemuAgentClose is called
        Get Agent lock
        If Agent->fd open, close it
        Unlock Agent
        Unref Agent

    qemuDomainObjEnterAgent
        Enter with VM lock
        Get Agent lock
        Increase Agent refcnt
        Unlock VM

    After running agent command, calling qemuDomainObjExitAgent
        Enter with Agent lock
        Unref Agent
        If not last reference, unlock Agent
        Get VM lock

If we were in the middle of an EnterAgent, call Agent command, and
ExitAgent sequence and the EOF code is triggered, then the EOF code can
get the VM lock, make it's checks against !priv->agent ||
priv->beingDestroyed, and call qemuAgentClose. The CloseAgent would wait
to get agent lock. The other thread then will eventually call ExitAgent,
release the Agent lock and unref the Agent. Once ExitAgent releases the
Agent lock, AgentClose will get the Agent Agent lock, close the fd,
unlock the agent, and unref the agent. The final unref would cause
deletion of the agent.

>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>> index f2586a1..8c9622e 100644
>>> --- a/src/qemu/qemu_process.c
>>> +++ b/src/qemu/qemu_process.c
>>> @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>>>          goto unlock;
>>>      }
>>>  
>>> +    qemuAgentClose(agent);
>>>      priv->agent = NULL;
>>>  
>>>      virObjectUnlock(vm);
>>> -
>>> -    qemuAgentClose(agent);
>>>      return;
>>>  
>>>   unlock:
>>>
>>
>> --
>> libvir-list mailing list
>> libvir-list at redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list
>>
> 
> --
> libvir-list mailing list
> libvir-list at redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
> 




More information about the libvir-list mailing list