[libvirt] [PATCH v2 2/2] qemu: completely rework reference counting
Martin Kletzander
mkletzan at redhat.com
Tue Dec 9 13:14:35 UTC 2014
On Mon, Dec 08, 2014 at 07:57:50PM +0100, Peter Krempa wrote:
>On 12/05/14 15:03, Martin Kletzander wrote:
>> There is one problem that causes various errors in the daemon. When
>> domain is waiting for a job, it is unlocked while waiting on the
>> condition. However, if that domain is for example transient and being
>> removed in another API (e.g. cancelling incoming migration), it get's
>> unref'd. If the first call, that was waiting, fails to get the job, it
>> unref's the domain object, and because it was the last reference, it
>> causes clearing of the whole domain object. However, when finishing the
>> call, the domain must be unlocked, but there is no way for the API to
>> know whether it was cleaned or not (unless there is some ugly temporary
>> variable, but let's scratch that).
>>
>> The root cause is that our APIs don't ref the objects they are using and
>> all use the implicit reference that the object has when it is in the
>> domain list. That reference can be removed when the API is waiting for
>> a job. And because each domain doesn't do its ref'ing, it results in
>> the ugly checking of the return value of virObjectUnref() that we have
>> everywhere.
>>
>> This patch changes qemuDomObjFromDomain() to ref the domain (using
>> virDomainObjListFindByUUIDRef()) and adds qemuDomObjEndAPI() which
>> should be the only function in which the return value of
>> virObjectUnref() is checked. This makes all reference counting
>> deterministic and makes the code a bit clearer.
>>
>> Signed-off-by: Martin Kletzander <mkletzan at redhat.com>
>> ---
>> src/qemu/THREADS.txt | 40 ++-
>> src/qemu/qemu_domain.c | 29 +-
>> src/qemu/qemu_domain.h | 12 +-
>> src/qemu/qemu_driver.c | 708 ++++++++++++++++------------------------------
>> src/qemu/qemu_migration.c | 108 +++----
>> src/qemu/qemu_migration.h | 10 +-
>> src/qemu/qemu_process.c | 87 +++---
>> 7 files changed, 359 insertions(+), 635 deletions(-)
>>
>> diff --git a/src/qemu/THREADS.txt b/src/qemu/THREADS.txt
>> index 50a0cf9..53a3415 100644
>> --- a/src/qemu/THREADS.txt
>> +++ b/src/qemu/THREADS.txt
>
>...
>
>> @@ -109,7 +117,6 @@ To lock the virDomainObjPtr
>> To acquire the normal job condition
>>
>> qemuDomainObjBeginJob()
>
>*
>
>> - - Increments ref count on virDomainObjPtr
>> - Waits until the job is compatible with current async job or no
>> async job is running
>> - Waits for job.cond condition 'job.active != 0' using virDomainObjPtr
>> @@ -122,14 +129,12 @@ To acquire the normal job condition
>> qemuDomainObjEndJob()
>> - Sets job.active to 0
>> - Signals on job.cond condition
>> - - Decrements ref count on virDomainObjPtr
>>
>>
>>
>> To acquire the asynchronous job condition
>>
>> qemuDomainObjBeginAsyncJob()
>> - - Increments ref count on virDomainObjPtr
>
>Should we mention that having a ref already is pre-requisite for calling
>Begin*Job?
>
I take it as from this patch onwards it is pre-requisite to have a ref
for doing anything.
>> - Waits until no async job is running
>> - Waits for job.cond condition 'job.active != 0' using virDomainObjPtr
>> mutex
>
>...
>
>> @@ -179,12 +183,10 @@ To acquire the QEMU monitor lock as part of an asynchronous job
>> To keep a domain alive while waiting on a remote command
>>
>> qemuDomainObjEnterRemote()
>
>Same here ?
>
Same here.
Maybe it would be nice to document that the only way how to obtain
virDomainPtr is with its own reference. Would that help?
>> - - Increments ref count on virDomainObjPtr
>> - Releases the virDomainObjPtr lock
>>
>> qemuDomainObjExitRemote()
>> - Acquires the virDomainObjPtr lock
>> - - Decrements ref count on virDomainObjPtr
>>
>>
>> Design patterns
>
>> @@ -195,18 +197,18 @@ Design patterns
>>
>> virDomainObjPtr obj;
>>
>> - obj = virDomainFindByUUID(driver->domains, dom->uuid);
>> + obj = qemuDomObjFromDomain(dom);
>>
>> ...do work...
>>
>> - virDomainObjUnlock(obj);
>> + qemuDomObjEndAPI(obj);
>
>&obj to match the code.
>
I guess I missed some 'git add' after one of the 'sed -s' :)
>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>> index 220304f..d4a6021 100644
>> --- a/src/qemu/qemu_domain.c
>> +++ b/src/qemu/qemu_domain.c
>> @@ -1318,8 +1318,6 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver,
>> priv->jobs_queued++;
>> then = now + QEMU_JOB_WAIT_TIME;
>>
>> - virObjectRef(obj);
>> -
>> retry:
>> if (cfg->maxQueuedJobs &&
>> priv->jobs_queued > cfg->maxQueuedJobs) {
>> @@ -1398,7 +1396,6 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver,
>>
>> cleanup:
>> priv->jobs_queued--;
>> - virObjectUnref(obj);
>> virObjectUnref(cfg);
>> return ret;
>> }
>> @@ -1467,7 +1464,8 @@ qemuDomainObjBeginNestedJob(virQEMUDriverPtr driver,
>> * Returns true if @obj was still referenced, false if it was
>> * disposed of.
>
>Comment is no longer valid, also mentioning that caller should hold a
>reference would be helpful.
>
Yes, this should be removed. About the reference, see above.
>> */
>> -bool qemuDomainObjEndJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
>> +void
>> +qemuDomainObjEndJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
>> {
>> qemuDomainObjPrivatePtr priv = obj->privateData;
>> qemuDomainJob job = priv->job.active;
>
>...
>
>
>> @@ -1541,7 +1535,7 @@ qemuDomainObjEnterMonitorInternal(virQEMUDriverPtr driver,
>> virReportError(VIR_ERR_OPERATION_FAILED, "%s",
>> _("domain is no longer running"));
>> /* Still referenced by the containing async job. */
>
>This comment is no longer valid.
>
Yes, thanks (even though it is referenced by an API that entered the
monitor).
>> - ignore_value(qemuDomainObjEndJob(driver, obj));
>> + qemuDomainObjEndJob(driver, obj);
>> return -1;
>> }
>> } else if (priv->job.asyncOwner == virThreadSelfID()) {
>
>...
>
>> @@ -2795,3 +2786,11 @@ qemuDomainAgentAvailable(qemuDomainObjPrivatePtr priv,
>> }
>> return true;
>> }
>> +
>
>Adding some docs what this exactly does would be cool.
>
Would some short paragraph like the following one, for example,
suffice?
/*
* Stop working with a vm. Unlock the domain, decrements its
* reference counter and sets the parameter to NULL to make sure there
* is no consecutive access to the domain structure.
*/
>> +void
>> +qemuDomObjEndAPI(virDomainObjPtr *vm)
>> +{
>> + virObjectUnlock(*vm);
>> + virObjectUnref(*vm);
>> + *vm = NULL;
>> +}
>
>[...]
>
>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>> index 9152cf5..2e0b7fc 100644
>> --- a/src/qemu/qemu_driver.c
>> +++ b/src/qemu/qemu_driver.c
>> @@ -205,11 +205,11 @@ struct qemuAutostartData {
>> * qemuDomObjFromDomain:
>> * @domain: Domain pointer that has to be looked up
>> *
>> - * This function looks up @domain and returns the appropriate
>> - * virDomainObjPtr.
>> + * This function looks up @domain and returns the appropriate virDomainObjPtr
>> + * that has to be cleaned by calling qemuDomObjEndAPI().
>
>s/cleaned/released/ ?
>
I've got no problem changing it :)
>> *
>> - * Returns the domain object which is locked on success, NULL
>> - * otherwise.
>> + * Returns the domain object with incremented reference counter which is locked
>> + * on success, NULL otherwise.
>> */
>> static virDomainObjPtr
>> qemuDomObjFromDomain(virDomainPtr domain)
>
>[...]
>
>> @@ -1446,7 +1445,7 @@ static virDomainPtr qemuDomainLookupByID(virConnectPtr conn,
>> virDomainObjPtr vm;
>> virDomainPtr dom = NULL;
>>
>> - vm = virDomainObjListFindByID(driver->domains, id);
>> + vm = virDomainObjListFindByID(driver->domains, id);
>
>Unrelated whitespace change ..
>
I haven't felt like this deserves its own commit.
>>
>> if (!vm) {
>> virReportError(VIR_ERR_NO_DOMAIN,
>> @@ -1461,8 +1460,7 @@ static virDomainPtr qemuDomainLookupByID(virConnectPtr conn,
>> if (dom) dom->id = vm->def->id;
>>
>> cleanup:
>> - if (vm)
>> - virObjectUnlock(vm);
>
>and possible crash. If virDomainObjListFindByID fails, the code jumps to
>the cleanup label with "vm == NULL".
>
>Both changes also don't do anything with the reference count and thus
>should be dropped from this patch as a separate cleanup.
>
>Additional changes would be required in case you want to make the API
>more robust by adding a similar "ref then lock" scheme also for the ID
>lookup func. Without that change, this leaves a vector that will allow
>to lock up the domains hash in case a VM is unresponsive.
>
For ByID and ByName it is not that easy to change the code to do ref,
unlock list, lock, because the domain itself is being compared to the
name/id and it has to be locked for that (or at least for the ID it
does).
qemuDomObjEndAPI() needs to just return if the parameter points to
NULL. This wasn't a problem with that "if (unref) unlock" logic.
>> + virObjectUnlock(vm);
>> return dom;
>> }
>>
>> @@ -1473,7 +1471,7 @@ static virDomainPtr qemuDomainLookupByUUID(virConnectPtr conn,
>> virDomainObjPtr vm;
>> virDomainPtr dom = NULL;
>>
>> - vm = virDomainObjListFindByUUID(driver->domains, uuid);
>> + vm = virDomainObjListFindByUUIDRef(driver->domains, uuid);
>>
>> if (!vm) {
>> char uuidstr[VIR_UUID_STRING_BUFLEN];
>> @@ -1490,8 +1488,7 @@ static virDomainPtr qemuDomainLookupByUUID(virConnectPtr conn,
>> if (dom) dom->id = vm->def->id;
>>
>> cleanup:
>> - if (vm)
>> - virObjectUnlock(vm);
>> + qemuDomObjEndAPI(&vm);
>
>Control flow allows to reach this part with "vm == NULL" which would
>induce a crash.
>
>> return dom;
>> }
>>
>> @@ -1517,8 +1514,7 @@ static virDomainPtr qemuDomainLookupByName(virConnectPtr conn,
>> if (dom) dom->id = vm->def->id;
>>
>> cleanup:
>> - if (vm)
>> - virObjectUnlock(vm);
>> + virObjectUnlock(vm);
>
>Aaand the same as above ;)
>
>> return dom;
>> }
>>
>> @@ -1537,8 +1533,7 @@ static int qemuDomainIsActive(virDomainPtr dom)
>> ret = virDomainObjIsActive(obj);
>>
>> cleanup:
>> - if (obj)
>> - virObjectUnlock(obj);
>> + qemuDomObjEndAPI(&obj);
>
>Same as above. if obj is NULL we jump here.
>
>> return ret;
>> }
>>
>> @@ -1556,8 +1551,7 @@ static int qemuDomainIsPersistent(virDomainPtr dom)
>> ret = obj->persistent;
>>
>> cleanup:
>> - if (obj)
>> - virObjectUnlock(obj);
>> + qemuDomObjEndAPI(&obj);
>
>Ditto.
>
>> return ret;
>> }
>>
>> @@ -1575,8 +1569,7 @@ static int qemuDomainIsUpdated(virDomainPtr dom)
>> ret = obj->updated;
>>
>> cleanup:
>> - if (obj)
>> - virObjectUnlock(obj);
>> + qemuDomObjEndAPI(&obj);
>
>Ditto.
>
>> return ret;
>> }
>>
>> @@ -1717,12 +1710,11 @@ static virDomainPtr qemuDomainCreateXML(virConnectPtr conn,
>> VIR_DOMAIN_OBJ_LIST_ADD_CHECK_LIVE,
>> NULL)))
>> goto cleanup;
>> -
>> + virObjectRef(vm);
>> def = NULL;
>>
>> if (qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) < 0) {
>> qemuDomainRemoveInactive(driver, vm);
>> - vm = NULL;
>> goto cleanup;
>> }
>>
>> @@ -1731,9 +1723,8 @@ static virDomainPtr qemuDomainCreateXML(virConnectPtr conn,
>> VIR_NETDEV_VPORT_PROFILE_OP_CREATE,
>> start_flags) < 0) {
>> virDomainAuditStart(vm, "booted", false);
>> - if (qemuDomainObjEndJob(driver, vm))
>> - qemuDomainRemoveInactive(driver, vm);
>> - vm = NULL;
>> + qemuDomainObjEndJob(driver, vm);
>> + qemuDomainRemoveInactive(driver, vm);
>> goto cleanup;
>> }
>>
>> @@ -1756,13 +1747,11 @@ static virDomainPtr qemuDomainCreateXML(virConnectPtr conn,
>> if (dom)
>> dom->id = vm->def->id;
>>
>> - if (!qemuDomainObjEndJob(driver, vm))
>> - vm = NULL;
>> + qemuDomainObjEndJob(driver, vm);
>>
>> cleanup:
>> virDomainDefFree(def);
>> - if (vm)
>> - virObjectUnlock(vm);
>> + qemuDomObjEndAPI(&vm);
>> if (event) {
>> qemuDomainEventQueue(driver, event);
>> if (event2)
>> @@ -1842,12 +1831,10 @@ static int qemuDomainSuspend(virDomainPtr dom)
>> ret = 0;
>>
>> endjob:
>> - if (!qemuDomainObjEndJob(driver, vm))
>> - vm = NULL;
>> + qemuDomainObjEndJob(driver, vm);
>>
>> cleanup:
>> - if (vm)
>> - virObjectUnlock(vm);
>> + qemuDomObjEndAPI(&vm);
>
>
>Aaand same here. I'm starting to think that
>
>void
>qemuDomObjEndAPI(virDomainObjPtr *vm)
>{
> virObjectUnlock(*vm);
> virObjectUnref(*vm);
> *vm = NULL;
>}
>
>should look more like:
>
>void
>qemuDomObjEndAPI(virDomainObjPtr *vm)
>{
> if (!*vm)
> return;
>
> virObjectUnlock(*vm);
> virObjectUnref(*vm);
> *vm = NULL;
>}
>
Exactly what I thought above :)
>I'm stopping to report this issue assuming the tweak above.
>
>>
>> if (event)
>> qemuDomainEventQueue(driver, event);
>
>[...]
>
>> @@ -3333,11 +3282,9 @@ qemuDomainSaveFlags(virDomainPtr dom, const char *path, const char *dxml,
>>
>> ret = qemuDomainSaveInternal(driver, dom, vm, path, compressed,
>> dxml, flags);
>
>This function consumes @vm with it's reference and lock, thus ...
>
>> - vm = NULL;
>
>... you cannot remove this line.
>
I'd rather remove the endAPI call from SaveInternal.
>> cleanup:
>> - if (vm)
>> - virObjectUnlock(vm);
>> + qemuDomObjEndAPI(&vm);
>> virObjectUnref(cfg);
>> return ret;
>> }
>> @@ -3416,15 +3363,13 @@ qemuDomainManagedSave(virDomainPtr dom, unsigned int flags)
>>
>> VIR_INFO("Saving state of domain '%s' to '%s'", vm->def->name, name);
>>
>> - if ((ret = qemuDomainSaveInternal(driver, dom, vm, name, compressed,
>> - NULL, flags)) == 0)
>> + ret = qemuDomainSaveInternal(driver, dom, vm, name,
>> + compressed, NULL, flags);
>
>Same as above. qemuDomainSaveInternal consumes VM ...
>
>> + if (ret == 0)
>> vm->hasManagedSave = true;
>>
>> - vm = NULL;
>
>... so this needs to stay in the code.
>
>> -
>> cleanup:
>> - if (vm)
>> - virObjectUnlock(vm);
>> + qemuDomObjEndAPI(&vm);
>> VIR_FREE(name);
>> virObjectUnref(cfg);
>>
>
>[...]
>
>
>> @@ -6744,7 +6647,7 @@ static virDomainPtr qemuDomainDefineXML(virConnectPtr conn, const char *xml)
>> driver->xmlopt,
>> 0, &oldDef)))
>> goto cleanup;
>> -
>
>Don't remove the empty line for clarity.
>
I have no idea why I did that :D
>> + virObjectRef(vm);
>> def = NULL;
>> if (virDomainHasDiskMirror(vm)) {
>> virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, "%s",
>
>[...]
>
>| Migrations |
>--------------
>
>qemuDomainMigratePerform uses qemuDomObjFromDomain but doesn't unlock
>the object if ACL check fails. I'll post a patch for that. You'll need
>to rebase on top of that.
>
>qemuDomainMigratePerform3Params does unlock on the ACL check, but you
>forgot to tweak the check.
>
Thanks for catching that, I'll gladly rebase that :)
>
>
>> @@ -12638,8 +12491,7 @@ qemuDomainMigrateGetMaxSpeed(virDomainPtr dom,
>> ret = 0;
>>
>> cleanup:
>> - if (vm)
>> - virObjectUnlock(vm);
>> + qemuDomObjEndAPI(&vm);
>> return ret;
>> }
>>
>
>Uff, I'm not able to finish this review, so I'm sending this as a part 1
>and will continue tomorrow.
>
Definitely take a rest. This is terrible patch to review (as it was
for writing it), thanks for doing that.
Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: Digital signature
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20141209/1e72c893/attachment-0001.sig>
More information about the libvir-list
mailing list