[libvirt] [PATCH] qemu_agent: fix deadlock in qemuProcessHandleAgentEOF

Wang Yufei james.wangyufei at huawei.com
Sat Oct 24 02:17:10 UTC 2015


On 2015/10/23 23:32, John Ferlan wrote:

> 
> 
> On 10/13/2015 02:36 AM, Wang Yufei wrote:
>> On 2015/10/2 20:17, John Ferlan wrote:
>>
>>>
>>>
>>> On 09/26/2015 08:18 AM, Wang Yufei wrote:
>>>> We shut down a VM A via the qemu agent; meanwhile an agent EOF for VM A
>>>> arrives, and there is a chance that a deadlock occurs:
>>>>
>>>> qemuProcessHandleAgentEOF in main thread
>>>> A)  priv->agent = NULL; //A happened before B
>>>>
>>>>     //deadlock when we try to get the agent lock, which is held by the worker thread
>>>>     qemuAgentClose(agent);
>>>>
>>>> qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread
>>>> B)  hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
>>>>
>>>>     if (hasRefs)
>>>>         virObjectUnlock(priv->agent); //agent lock will not be released here
>>>>
>>>> So I close the agent first, then set priv->agent to NULL, to fix the deadlock.
>>>>
>>>> Signed-off-by: Wang Yufei <james.wangyufei at huawei.com>
>>>> Reviewed-by: Ren Guannan <renguannan at huawei.com>
>>>> ---
>>>>  src/qemu/qemu_process.c | 3 +--
>>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>>
>>>
>>> Interesting - this is the exact opposite of commit id '1020a504' from
>>> Michal over 3 years ago.
>>>
>>> However, a bit of digging into the claim from the commit message drove
>>> me to commit id '362d04779c' which removes the domain object lock that
>>> was the basis of the initial patch.
>>>
>>> While I'm not an expert in the vmagent code, I do note that the
>>> HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed,
>>> while the ExitAgent code doesn't necessarily (or directly) check whether
>>> the priv->agent is still valid (IOW: that nothing else has removed it
>>> already like the EOF).
>>>
>>> So, while I don't discount that the patch works - I'm wondering whether
>>> the smarts/logic should be built into ExitAgent to check for
>>> !priv->agent (and/or something else) that would indicate we're already
>>> on the path of shutdown.
>>>
>>> John
>>>
>>>
>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>> index f2586a1..8c9622e 100644
>>>> --- a/src/qemu/qemu_process.c
>>>> +++ b/src/qemu/qemu_process.c
>>>> @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>>>>          goto unlock;
>>>>      }
>>>>  
>>>> +    qemuAgentClose(agent);
>>>>      priv->agent = NULL;
>>>>  
>>>>      virObjectUnlock(vm);
>>>> -
>>>> -    qemuAgentClose(agent);
>>>>      return;
>>>>  
>>>>   unlock:
>>>>
>>>
>>> .
>>>
>>
>> Thank you for your reply.
>>
>> At first, we should consider the right logic. In my opinion, we should call
>> qemuAgentClose first, and then set priv->agent to NULL, just like the logic in qemuProcessStop:
>>
>>     if (priv->agent) {
>>         qemuAgentClose(priv->agent);
>>         priv->agent = NULL;
>>         priv->agentError = false;
>>     }
>>
>> Based on that logic, we then consider the right locking, which I have shown in my patch.
>>
>> Should the smarts/logic be built into ExitAgent to check for !priv->agent (and/or
>> something else) that would indicate we're already on the path of shutdown?
>>
>> The answer is yes, we have to, because in ExitAgent we use the agent lock, which
>> may already have been torn down by another thread (e.g. by qemuAgentClose), so we
>> have to check the agent first to make sure it is safe to access.
>>
>>
> 
> It doesn't feel right that moving qemuAgentClose inside the section where the
> vm lock is held will fix the issue. Or does the virCondSignal (e.g.
> pthread_cond_signal or pthread_cond_broadcast) release the mutex that
> would be held by the thread that did the EnterAgent?
> 
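(For reference on that question: the pthread calls named above never release
a mutex by themselves; only pthread_cond_wait() drops the associated mutex
while it waits and re-acquires it before returning.  A plain-pthread sketch,
not libvirt code:)

    pthread_mutex_lock(&lock);
    pthread_cond_signal(&cond);   /* the mutex is still held here */
    pthread_mutex_unlock(&lock);  /* it has to be released explicitly */
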
> So let's go back to your scenario again where A happens before B... As I
> see it, the bug is in B, which accesses and unreferences priv->agent
> *without* first getting the vm lock (as other instances do), and which
> assumes that priv->agent is still valid (or the same) between the time the
> vm was unlocked in EnterAgent and the point we have reached in ExitAgent.
> 
> So the bug would then seem to be in the EnterAgent, run Agent command,
> ExitAgent sequence.  Since we release the vm lock during EnterAgent,
> "anything" that could alter the priv->agent setting while the lock is
> not held needs to be handled.
> 
> Thus perhaps the fix should be (in all callers)
> 
> agent = priv->agent;
> EnterAgent(vm);
> *Agent* Command
> ExitAgent(vm, agent)
> 
> Where ExitAgent then has:
> 
>     virObjectLock(obj);
>     if (priv->agent != agent) {
>         VIR_DEBUG("priv->agent=%p differs from agent=%p",
>                   priv->agent, agent);
>         virObjectUnlock(agent);
>         return;
>     }
> 
>     if ((hasRefs = virObjectUnref(priv->agent)))
>         virObjectUnlock(priv->agent);
> 
>     VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
>               priv->agent, obj, obj->def->name, hasRefs);
> 
>     priv->agentStart = 0;
>     if (!hasRefs)
>         priv->agent = NULL;
> 
> 
> Does this seem reasonable?
> 
> John
> 
> .
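
(For concreteness, the caller shape proposed above might look roughly like
the fragment below; the two-argument qemuDomainObjExitAgent() is
hypothetical at this point in the discussion, and qemuAgentShutdown() just
stands in for whatever agent command the caller runs.)

    qemuAgentPtr agent = priv->agent;   /* remember which agent object we lock */

    qemuDomainObjEnterAgent(vm);
    ret = qemuAgentShutdown(agent, agentFlag);
    qemuDomainObjExitAgent(vm, agent);  /* hypothetical extra argument so that
                                         * ExitAgent can detect a stale
                                         * priv->agent */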


Let's look at the lock ordering if we modify it your way:

Thread B
acquires the agent lock (qemuDomainObjEnterAgent)
acquires the vm lock (qemuDomainObjExitAgent, still holding the agent lock)

Thread A
acquires the vm lock (qemuProcessHandleAgentEOF)
acquires the agent lock (qemuAgentClose, still holding the vm lock)

Boom, deadlock! Am I right? We must always acquire these locks in the same order.
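
(To make that last point concrete, here is a minimal standalone sketch of the
AB/BA inversion, using plain pthread mutexes as stand-ins for the vm and agent
locks; it is illustrative only, not libvirt code.)

    /* Build with: cc -pthread demo.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t vm_lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t agent_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Thread B: agent lock first, then vm lock (the Enter/ExitAgent path). */
    static void *thread_b(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&agent_lock);
        sleep(1);                         /* widen the race window */
        pthread_mutex_lock(&vm_lock);     /* blocks while A holds vm_lock */
        pthread_mutex_unlock(&vm_lock);
        pthread_mutex_unlock(&agent_lock);
        return NULL;
    }

    /* Thread A: vm lock first, then agent lock (the EOF-handler path). */
    static void *thread_a(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&vm_lock);
        sleep(1);
        pthread_mutex_lock(&agent_lock);  /* blocks while B holds agent_lock */
        pthread_mutex_unlock(&agent_lock);
        pthread_mutex_unlock(&vm_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_join(a, NULL);            /* with this ordering it hangs here */
        pthread_join(b, NULL);
        printf("no deadlock this run\n");
        return 0;
    }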






