[libvirt] [PATCH 3/3] qemu: fix concurrency crash bug in snapshot revert

Pavel Mores pmores at redhat.com
Fri Dec 6 09:11:29 UTC 2019


Although qemuDomainRevertToSnapshot() correctly begins a job at the start,
the job is implicitly ended if qemuProcessStop() is called because it lives
in the QEMU driver's private data that is purged during qemuProcessStop().
This commit restores the job by calling qemuProcessBeginJob() after
qemuProcessStop() invocations.

The first chunk is meant to fix a reported bug (see below).  The bug's
reproducibility on my machine was initially way lower than the reported
50% (more like 5%) but could be boosted to pretty much 100% by having
virt-manager connected, displaying the testing domain in viewer.  With the
fix, the bug cannot be reproduced on my machine under any scenario I could
think of.

The actual crash was due to the thread performing revert which, not
acquiring a job properly, was able to change the QEMU monitor structure
under the thread performing snapshot delete while it was waiting on a
condition variable.

An interesting question is whether this commit takes the right approach
to the fix.  In particular, qemuProcessStop() arguably should not clear
jobs, in which case the right thing to do would be to fix
qemuProcessStop().  However, qemuProcessStop() has about twenty callers,
and while none of them seems vulnerable to the kind of problems caused by
clearing jobs (unlike qemuDomainRevertToSnapshot() who starts QEMU again
right after stopping it), some of them might rely on jobs being cleared
(this is not always easy to check conclusively).  All in all, this commit's
solution seems to be the least bad fix which doesn't require a potentially
risky refactor.

https://bugzilla.redhat.com/show_bug.cgi?id=1610207
Signed-off-by: Pavel Mores <pmores at redhat.com>
---
 src/qemu/qemu_driver.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 18bd0101e7..b3769cc479 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -16637,9 +16637,15 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
                                                      VIR_DOMAIN_EVENT_STOPPED,
                                                      detail);
                     virObjectEventStateQueue(driver->domainEventState, event);
-                    /* Start after stop won't be an async start job, so
-                     * reset to none */
-                    jobType = QEMU_ASYNC_JOB_NONE;
+                    /* We have to begin a new job as the original one (begun
+                     * near the top of this function) was lost during the purge
+                     * of private data in qemuProcessStop() above.
+                     */
+                    if (qemuProcessBeginJob(driver, vm,
+                                            VIR_DOMAIN_JOB_OPERATION_SNAPSHOT_REVERT,
+                                            flags) < 0) {
+                        goto cleanup;
+                    }
                     goto load;
                 }
             }
@@ -16774,6 +16780,11 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
             event = virDomainEventLifecycleNewFromObj(vm,
                                              VIR_DOMAIN_EVENT_STOPPED,
                                              detail);
+            if (qemuProcessBeginJob(driver, vm,
+                                    VIR_DOMAIN_JOB_OPERATION_SNAPSHOT_REVERT,
+                                    flags) < 0) {
+                goto cleanup;
+            }
         }
 
         if (qemuDomainSnapshotRevertInactive(driver, vm, snap) < 0) {
-- 
2.21.0




More information about the libvir-list mailing list