[libvirt] [PATCH] nwfilter: fix deadlock caused by updating network device and nwfilter

Pavel Hrdina phrdina at redhat.com
Wed Nov 12 09:16:28 UTC 2014


On 11/11/2014 06:40 PM, John Ferlan wrote:
>
>
> On 11/11/2014 11:34 AM, Pavel Hrdina wrote:
>> On 11/11/2014 04:13 PM, John Ferlan wrote:
>>>
>>>
>>> On 11/05/2014 09:02 AM, Pavel Hrdina wrote:
>>>> Commit 6e5c79a1 tried to fix the deadlock between
>>>> nwfilter{Define,Undefine} and starting a guest, but the same deadlock
>>>> also exists when updating/attaching a network device to a domain.
>>>
>>> The referenced commit has a roadmap of sorts of what the locking
>>> problem is - perhaps you could add similar details here for before and
>>> after.
>>
>> In that case I can copy&paste that roadmap, as it's the same and
>> this patch only extends the referenced one.
>>
>>>
>>> While more of a pain logistically - separating out the patches into
>>> single changes may make it easier to list the bad locking order followed
>>> by the order the commit changes it to.
>>>
>>> The ones that are most odd are the places where you take the lock in
>>> the middle of a function.  That seems to go against the original commit,
>>> which takes it right after the flags check. In particular that's
>>> the qemuDomainRestoreFlags and qemuMigrationPrepareAny changes.
>>
>> The order of the nwfilter lock and the flags check could be the same in
>> all functions, that's a good point. My intention was to take the
>> nwfilter read lock right before the vm is locked, because the deadlock
>> is between the vm and nwfilter objects.
>>
>> However I don't quite understand why you would like to split the patch
>> into a series. All the paths lead to the same nwfilter call,
>> "virDomainConfNWFilterInstantiate", and that's the only function callable
>> outside of nwfilter that could cause the deadlock.
>>
>
> If the bad locking order is the same for each, then I suppose it's not
> necessary to copy the commit message and make separate patches. It was
> mostly a suggestion considering you're changing 6 APIs.
>
> Also the order wasn't perfectly clear, especially for the
> Restore/Migration changes where the lock is taken in the middle of
> each function.  Furthermore for those cases, we can get to the cleanup
> without the lock and do an unlock, which perhaps doesn't do anything,
> but it's been a while since I've thought about the guts of thread mutex
> lock/unlock. It may not occur in these paths, but I seem to remember that
> if your thread has the read lock and attempts to acquire the read lock
> again, then the thread code does a refcount++ type operation. The unlock
> would normally refcount-- for each and free the lock when it reaches
> zero. Now in the case where you're jumping to cleanup, you could
> decrement a refcount that was never incremented, which would be
> unexpected. My disclaimer is this was a different OS (hp-ux) in a prior
> job, where a mix of mutex locks and file locks caused all sorts of
> issues...
>
> John
>

Calling unlock without a previous lock is wrong and it should be fixed.
I'll review the patch, move the lock to the right place, and send v2,
thanks.
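
For reference, a minimal standalone sketch of the lock ordering the
patch aims for -- plain pthreads with hypothetical names, not the actual
libvirt wrappers:

/* Bad ordering (the deadlock):
 *   nwfilterDefine:  write-lock filter updates -> lock each vm
 *   device attach:   lock vm -> read-lock filter updates
 *
 * Fixed ordering (this patch): every QEMU driver path that can reach
 * virDomainConfNWFilterInstantiate takes the filter-updates read lock
 * *before* locking the vm object, so both sides agree on the order.
 */
#include <pthread.h>

static pthread_rwlock_t filter_updates = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t vm = PTHREAD_MUTEX_INITIALIZER;

static void attach_device(void)
{
    pthread_rwlock_rdlock(&filter_updates); /* virNWFilterReadLockFilterUpdates() */
    pthread_mutex_lock(&vm);                /* virObjectLock(vm) */

    /* ... instantiate filters, update the device ... */

    pthread_mutex_unlock(&vm);
    pthread_rwlock_unlock(&filter_updates); /* unlock in reverse order */
}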

Pavel

>>>
>>>>
>>>> The deadlock was introduced by removing the global QEMU driver lock:
>>>> nwfilter was counting on that lock to ensure that all driver locks are
>>>> held inside of nwfilter{Define,Undefine}.
>>>>
>>>> This patch extends the usage of virNWFilterReadLockFilterUpdates to
>>>> prevent the deadlock on all possible paths in the QEMU driver. The LXC
>>>> and UML drivers still have a global lock.
>>>>
>>>> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1143780
>>>>
>>>> Signed-off-by: Pavel Hrdina <phrdina at redhat.com>
>>>> ---
>>>>
>>>> This is a temporary fix for the deadlock issue, as I'm planning to
>>>> create global libvirt jobs (similar to QEMU domain jobs) and use them
>>>> for other drivers, for example for nwfilters too.
>>>>
>>>>    src/qemu/qemu_driver.c    | 12 ++++++++++++
>>>>    src/qemu/qemu_migration.c |  3 +++
>>>>    src/qemu/qemu_process.c   |  4 ++++
>>>>    3 files changed, 19 insertions(+)
>>>>
>>>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>>>> index 6acaea8..9e6f505 100644
>>>
>>> [1] Usage around the virQEMUDriverGetConfig() calls isn't consistent. In
>>> particular qemuDomainAttachDeviceFlags() will lock/unlock in a different
>>> order.
>>>
>>>> --- a/src/qemu/qemu_driver.c
>>>> +++ b/src/qemu/qemu_driver.c
>>>> @@ -5937,6 +5937,8 @@ qemuDomainRestoreFlags(virConnectPtr conn,
>>>>            def = tmp;
>>>>        }
>>>>
>>>> +    virNWFilterReadLockFilterUpdates();
>>>> +
>>>
>>> So no other locks are taken up to this point? And no need, perhaps, to
>>> lock earlier to play follow the leader (e.g. the original commit), where
>>> the lock was taken after the various flags checks.
>>>
>>>>        if (!(vm = virDomainObjListAdd(driver->domains, def,
>>>>                                       driver->xmlopt,
>>>>                                       VIR_DOMAIN_OBJ_LIST_ADD_LIVE |
>>>> @@ -5978,6 +5980,7 @@ qemuDomainRestoreFlags(virConnectPtr conn,
>>>>        virFileWrapperFdFree(wrapperFd);
>>>>        if (vm)
>>>>            virObjectUnlock(vm);
>>>> +    virNWFilterUnlockFilterUpdates();
>>>
>>> So this could get called without the lock ever having been taken - is
>>> that correct?  We can get to cleanup without calling the ReadLock -
>>> probably not a good thing, since what happens then is indeterminate
>>> according to the man page.
>>>
>>>>        return ret;
>>>>    }
>>>>
>>>> @@ -7502,6 +7505,8 @@ static int qemuDomainAttachDeviceFlags(virDomainPtr dom, const char *xml,
>>>>
>>>>        affect = flags & (VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG);
>>>>
>>>> +    virNWFilterReadLockFilterUpdates();
>>>> +
>>>
>>> [1] This one is done after the GetConfig
>>>
>>>>        if (!(caps = virQEMUDriverGetCapabilities(driver, false)))
>>>>            goto cleanup;
>>>>
>>>> @@ -7614,6 +7619,7 @@ static int qemuDomainAttachDeviceFlags(virDomainPtr dom, const char *xml,
>>>>            virObjectUnlock(vm);
>>>>        virObjectUnref(caps);
>>>>        virObjectUnref(cfg);
>>>
>>> [1] But the release is done after the cfg unref
>>>
>>>> +    virNWFilterUnlockFilterUpdates();
>>>
>>>>        return ret;
>>>>    }
>>>>
>>>> @@ -7644,6 +7650,8 @@ static int qemuDomainUpdateDeviceFlags(virDomainPtr dom,
>>>>                      VIR_DOMAIN_AFFECT_CONFIG |
>>>>                      VIR_DOMAIN_DEVICE_MODIFY_FORCE, -1);
>>>>
>>>> +    virNWFilterReadLockFilterUpdates();
>>>> +
>>>
>>> [1] This one is done before the GetConfig
>>>
>>>>        cfg = virQEMUDriverGetConfig(driver);
>>>>
>>>>        affect = flags & (VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG);
>>>> @@ -7760,6 +7768,7 @@ static int qemuDomainUpdateDeviceFlags(virDomainPtr dom,
>>>>            virObjectUnlock(vm);
>>>>        virObjectUnref(caps);
>>>>        virObjectUnref(cfg);
>>>> +    virNWFilterUnlockFilterUpdates();
>>>
>>> [1] Release done after cfg unref
>>>
>>>>        return ret;
>>>>    }
>>>>
>>>> @@ -14510,6 +14519,8 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
>>>>         * and use of FORCE can cause multiple transitions.
>>>>         */
>>>>
>>>> +    virNWFilterReadLockFilterUpdates();
>>>> +
>>>
>>> [1] This one is done before GetConfig
>>>
>>>>        if (!(vm = qemuDomObjFromSnapshot(snapshot)))
>>>>            return -1;
>>>>
>>>> @@ -14831,6 +14842,7 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
>>>>            virObjectUnlock(vm);
>>>>        virObjectUnref(caps);
>>>>        virObjectUnref(cfg);
>>>> +    virNWFilterUnlockFilterUpdates();
>>>
>>> [1] Release done after cfg unref
>>>
>>>>
>>>>        return ret;
>>>>    }
>>>> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
>>>> index 94a4cf6..18242ae 100644
>>>> --- a/src/qemu/qemu_migration.c
>>>> +++ b/src/qemu/qemu_migration.c
>>>> @@ -2666,6 +2666,8 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver,
>>>>                goto cleanup;
>>>>        }
>>>>
>>>> +    virNWFilterReadLockFilterUpdates();
>>>> +
>>>
>>> Lots of code before we get here - what's the locking?
>>>
>>>>        if (!(vm = virDomainObjListAdd(driver->domains, *def,
>>>>                                       driver->xmlopt,
>>>>                                       VIR_DOMAIN_OBJ_LIST_ADD_LIVE |
>>>> @@ -2825,6 +2827,7 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver,
>>>>            qemuDomainEventQueue(driver, event);
>>>>        qemuMigrationCookieFree(mig);
>>>>        virObjectUnref(caps);
>>>> +    virNWFilterUnlockFilterUpdates();
>>>
>>> Yet another case where we could reach the cleanup without having ever
>>> obtained the ReadLock
>>>
>>>>        return ret;
>>>>
>>>>     stop:
>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>> index 26d4948..409a672 100644
>>>> --- a/src/qemu/qemu_process.c
>>>> +++ b/src/qemu/qemu_process.c
>>>
>>> This one seems OK - although I think I'd make it a separate patch and of
>>> course explain the lock ordering change.
>>>
>>>
>>> John
>>>> @@ -3422,6 +3422,8 @@ qemuProcessReconnect(void *opaque)
>>>>
>>>>        VIR_FREE(data);
>>>>
>>>> +    virNWFilterReadLockFilterUpdates();
>>>> +
>>>>        virObjectLock(obj);
>>>>
>>>>        cfg = virQEMUDriverGetConfig(driver);
>>>> @@ -3573,6 +3575,7 @@ qemuProcessReconnect(void *opaque)
>>>>
>>>>        virObjectUnref(conn);
>>>>        virObjectUnref(cfg);
>>>> +    virNWFilterUnlockFilterUpdates();
>>>>
>>>>        return;
>>>>
>>>> @@ -3608,6 +3611,7 @@ qemuProcessReconnect(void *opaque)
>>>>        }
>>>>        virObjectUnref(conn);
>>>>        virObjectUnref(cfg);
>>>> +    virNWFilterUnlockFilterUpdates();
>>>>    }
>>>>
>>>>    static int
>>>>
>>

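
To illustrate the unlock-without-lock problem discussed above, a
standalone sketch -- plain pthreads and hypothetical function names
(restore_v1/restore_v2), not libvirt code:

#include <pthread.h>

static pthread_rwlock_t updates = PTHREAD_RWLOCK_INITIALIZER;

/* Broken shape of the v1 patch: a failed check can jump to cleanup
 * before the read lock is ever taken, yet cleanup always unlocks.
 * POSIX says unlocking a rwlock the thread does not hold is undefined
 * behaviour; implementations that refcount recursive read acquisitions
 * may also end up decrementing a count the path never incremented. */
static int restore_v1(int bad_flags)
{
    int ret = -1;

    if (bad_flags)
        goto cleanup;               /* lock not yet taken */

    pthread_rwlock_rdlock(&updates);
    /* ... restore work ... */
    ret = 0;

 cleanup:
    pthread_rwlock_unlock(&updates);  /* BUG on the early-exit path */
    return ret;
}

/* Shape v2 should have: take the lock before any goto cleanup, as the
 * original commit did right after the flags checks. */
static int restore_v2(int bad_flags)
{
    int ret = -1;

    pthread_rwlock_rdlock(&updates);

    if (bad_flags)
        goto cleanup;

    /* ... restore work ... */
    ret = 0;

 cleanup:
    pthread_rwlock_unlock(&updates);
    return ret;
}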


