[PATCH] vm : forbid to start a removing vm

Michal Privoznik mprivozn at redhat.com
Thu Mar 18 13:24:50 UTC 2021


On 3/11/21 2:13 AM, Hogan Wang wrote:
> From: Zhuang Shengen <zhuangshengen at huawei.com>
> 
> When a vm is in the migration confirm phase and is started concurrently,
> it ends up out of libvirtd's control.
> 
> Cause Analysis:
> 1. thread1 migrates the vm out.
> 2. thread2 starts the migrating vm.
> 3. thread1 removes the vm from the domain list after the migration succeeds.

I guess I'm puzzled here. Has thread1 merely started the migration, or
finished it too? How can thread2 start the VM during migration? Is that an
offline migration?

To me it seems like there's a problem with which jobs are allowed during
migration, or with their order. Look at qemuMigrationSrcConfirm(), the
function that performs the final step of migration (at the source): it
ends the migration job, then starts a MODIFY job (which potentially
unlocks the domain object), and only then sets vm->removing = true and
removes the domain from the list.

What I am trying to say is that because the vm object can be briefly
unlocked, another thread might lock it successfully and proceed to
BeginJob(), where it unlocks the object again, only for the first thread
to then remove the domain from the list.
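
For what it's worth, the window is easy to model outside of libvirt.
Below is a minimal standalone pthreads sketch of the interleaving I mean -
all names in it are made up for illustration, none of it is libvirt code,
and the unlock/yield in thread1 merely exaggerates the window so the race
fires reliably:

#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

struct fakeDomain {
    pthread_mutex_t lock;   /* the per-domain object lock */
    pthread_cond_t jobCond;
    bool jobActive;         /* stand-in for the qemu job machinery */
    bool running;
    bool onList;
};

/* Like qemuDomainObjBeginJob(): waiting for the current job holder
 * drops the object lock via pthread_cond_wait() - that is the window. */
static void beginJob(struct fakeDomain *dom)
{
    while (dom->jobActive)
        pthread_cond_wait(&dom->jobCond, &dom->lock);
    dom->jobActive = true;
}

static void endJob(struct fakeDomain *dom)
{
    dom->jobActive = false;
    pthread_cond_signal(&dom->jobCond);
}

static void *migrationConfirm(void *opaque)      /* thread1 */
{
    struct fakeDomain *dom = opaque;

    pthread_mutex_lock(&dom->lock);
    endJob(dom);              /* qemuMigrationJobFinish() */
    pthread_mutex_unlock(&dom->lock);

    /* In libvirt the object stays locked here; a waiter queued on the
     * job condition can nevertheless be woken first. */
    sched_yield();

    pthread_mutex_lock(&dom->lock);
    beginJob(dom);            /* the BeginJob() inside RemoveInactiveJob() */
    dom->onList = false;      /* virDomainObjListRemove() */
    printf("thread1: delisted the domain; running=%d\n", dom->running);
    endJob(dom);
    pthread_mutex_unlock(&dom->lock);
    return NULL;
}

static void *createDomain(void *opaque)          /* thread2 */
{
    struct fakeDomain *dom = opaque;

    pthread_mutex_lock(&dom->lock);
    beginJob(dom);            /* waits while the migration job is active,
                               * then can win the job in the window */
    dom->running = true;      /* qemuDomainObjStart() */
    printf("thread2: started the domain; onList=%d\n", dom->onList);
    endJob(dom);
    pthread_mutex_unlock(&dom->lock);
    return NULL;
}

int main(void)
{
    struct fakeDomain dom = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .jobCond = PTHREAD_COND_INITIALIZER,
        .jobActive = true,    /* thread1 still holds the migration job */
        .onList = true,
    };
    pthread_t t1, t2;

    pthread_create(&t2, NULL, createDomain, &dom);
    pthread_create(&t1, NULL, migrationConfirm, &dom);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Depending on who wins the job, thread1 delists a running domain (the
reported symptom) or thread2 starts an already delisted one - either way
the domain escapes libvirtd's control.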

It's not only virDomainCreateWithFlags() that's affected - basically any
other API that grabs a job is too. That said, plenty of them require the
domain to be running and therefore exit early with a "domain not running"
error message.
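
Those early exits are typically just the usual pattern (quoting the
common shape from memory, so take the details with a grain of salt):

     if (virDomainObjCheckActive(vm) < 0)
         goto endjob;

Note that this runs only after the job was already grabbed, so for those
APIs the race is masked rather than prevented.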

I wonder whether something along these lines fixes the problem for you
(compile tested only, but it shows the idea):

diff --git i/src/qemu/qemu_migration.c w/src/qemu/qemu_migration.c
index 79dcb4a15d..4b1bb77155 100644
--- i/src/qemu/qemu_migration.c
+++ w/src/qemu/qemu_migration.c
@@ -3468,6 +3468,7 @@ qemuMigrationSrcConfirm(virQEMUDriverPtr driver,
      qemuMigrationJobPhase phase;
      g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
      int ret = -1;
+    bool haveJob;

      if (!qemuMigrationJobIsActive(vm, QEMU_ASYNC_JOB_MIGRATION_OUT))
          goto cleanup;
@@ -3485,15 +3486,21 @@ qemuMigrationSrcConfirm(virQEMUDriverPtr driver,
                                         cookiein, cookieinlen,
                                         flags, cancelled);

-    qemuMigrationJobFinish(driver, vm);
+    haveJob = qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) >= 0;
+
      if (!virDomainObjIsActive(vm)) {
          if (!cancelled && ret == 0 && flags & VIR_MIGRATE_UNDEFINE_SOURCE) {
              virDomainDeleteConfig(cfg->configDir, cfg->autostartDir, vm);
              vm->persistent = 0;
          }
-        qemuDomainRemoveInactiveJob(driver, vm);
+        qemuDomainRemoveInactive(driver, vm);
      }

+    if (haveJob)
+        qemuDomainObjEndJob(driver, vm);
+
+    qemuMigrationJobFinish(driver, vm);
+
   cleanup:
      virDomainObjEndAPI(&vm);
      return ret;



> 4. thread2 acquires the vm job successfully and starts the vm.
> 5. the vm can no longer be found with the 'virsh list' command. Actually,
>     the started vm does not exist in the domain list.
> 
> Solution:
> Check the vm->removing state before starting the vm.
> 
> 
> Signed-off-by: Zhuang Shengen <zhuangshengen at huawei.com>
> Reviewed-by: Hogan Wang <hogan.wang at huawei.com>
> ---
>   src/qemu/qemu_driver.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index d1a3659774..a5dfea94cb 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -6637,6 +6637,12 @@ qemuDomainCreateWithFlags(virDomainPtr dom, unsigned int flags)
>           goto endjob;
>       }
>   
> +    if (vm->removing) {
> +        virReportError(VIR_ERR_OPERATION_INVALID,
> +                       "%s", _("domain is already removing"));

s/removing/being removed/

> +        goto endjob;
> +    }
> +
>       if (qemuDomainObjStart(dom->conn, driver, vm, flags,
>                              QEMU_ASYNC_JOB_START) < 0)
>           goto endjob;
> 

Michal



