[libvirt] [RFC][PATCH] Fix a deadlock in bi-directional p2p concurrent migration.

Daniel P. Berrange berrange at redhat.com
Thu Jul 15 16:02:50 UTC 2010


On Thu, Jul 15, 2010 at 10:44:46AM -0400, Chris Lalancette wrote:
> If you try to execute two concurrent migrations p2p
> from A->B and B->A, the two libvirtd's will deadlock
> trying to perform the migrations.  The reason for this is
> that in p2p migration, the libvirtd's are responsible for
> making the RPC Prepare, Migrate, and Finish calls.  However,
> they are currently holding the driver lock while doing so,
> which basically guarantees deadlock in this scenario.
> 
> This patch fixes the situation by adding
> qemuDomainObjEnterRemoteWithDriver and
> qemuDomainObjExitRemoteWithDriver helper methods.  The Enter
> take an additional object reference, then drops both the
> domain object lock and the driver lock.  The Exit takes
> both the driver and domain object lock, then drops the
> reference.  Adding calls to these Enter and Exit helpers
> around remote calls in the various migration methods
> seems to fix the problem for me in testing.
> 
> I *believe* that this makes the situation safe, though
> I'm not 100% certain.  The additional domain object reference
> ensures that the domain object won't disappear while this
> operation is happening.  The BeginJob that is called inside
> of qemudDomainMigratePerform ensures that we can't execute a
> second migrate (or shutdown, or save, etc) job while the
> migration is active.  That being said, my understanding of the
> locking is still a bit shaky, so I'd like some additional
> confirmation that this is indeed safe.

The general approach is sound because this guarentees that
the virDomainObjPtr can't disappear due to the extra
reference. What it doesn't guarentee is the state of the VM.
eg if the guest crashes/shuts down, you have to be careful
not to run any further monitor commands.

Since we expect the guest to continue running on the source
host during all these steps, we need to add calls

    if (!virDomainObjIsActive(vm)) {
        qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                        _("guest unexpectedly quit"));
        goto cleanup;
    }

immediately after the qemuDomainObjExitRemoteWithDriver() 
calls, to validate that the guest is still active.

> @@ -10753,9 +10768,11 @@ static int doTunnelMigrate(virDomainPtr dom,
>          /* virStreamNew only fails on OOM, and it reports the error itself */
>          goto cleanup;
>  
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      internalret = dconn->driver->domainMigratePrepareTunnel(dconn, st,
>                                                              flags, dname,
>                                                              resource, dom_xml);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);
>  
>      if (internalret < 0)
>          /* domainMigratePrepareTunnel sets the error for us */


Add is-active check here - can be after this internalret < 0 check.

> @@ -10833,8 +10850,10 @@ cancel:
>  
>  finish:
>      dname = dname ? dname : dom->name;
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      ddomain = dconn->driver->domainMigrateFinish2
>          (dconn, dname, NULL, 0, uri, flags, retval);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);
>  
>  cleanup:
>      if (client_sock != -1)

Don't need an is-active check here, since no further work is done
with the vm object after this.

> @@ -10873,16 +10892,20 @@ static int doNonTunnelMigrate(virDomainPtr dom,
>      virDomainPtr ddomain = NULL;
>      int retval = -1;
>      char *uri_out = NULL;
> +    int rc;
>  
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      /* NB we don't pass 'uri' into this, since that's the libvirtd
>       * URI in this context - so we let dest pick it */
> -    if (dconn->driver->domainMigratePrepare2(dconn,
> -                                             NULL, /* cookie */
> -                                             0, /* cookielen */
> -                                             NULL, /* uri */
> -                                             &uri_out,
> -                                             flags, dname,
> -                                             resource, dom_xml) < 0)
> +    rc = dconn->driver->domainMigratePrepare2(dconn,
> +                                              NULL, /* cookie */
> +                                              0, /* cookielen */
> +                                              NULL, /* uri */
> +                                              &uri_out,
> +                                              flags, dname,
> +                                              resource, dom_xml);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);
> +    if (rc < 0)
>          /* domainMigratePrepare2 sets the error for us */
>          goto cleanup;
>  

Need an is-active check here

> @@ -10899,8 +10922,10 @@ static int doNonTunnelMigrate(virDomainPtr dom,
>  
>  finish:
>      dname = dname ? dname : dom->name;
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      ddomain = dconn->driver->domainMigrateFinish2
>          (dconn, dname, NULL, 0, uri_out, flags, retval);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);

Don't need one here

Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|




More information about the libvir-list mailing list