[libvirt] [RFC][PATCH] Fix a deadlock in bi-directional p2p concurrent migration.

Daniel Veillard veillard at redhat.com
Thu Jul 15 15:44:46 UTC 2010


On Thu, Jul 15, 2010 at 10:44:46AM -0400, Chris Lalancette wrote:
> If you try to execute two concurrent migrations p2p
> from A->B and B->A, the two libvirtd's will deadlock
> trying to perform the migrations.  The reason for this is
> that in p2p migration, the libvirtd's are responsible for
> making the RPC Prepare, Migrate, and Finish calls.  However,
> they are currently holding the driver lock while doing so,
> which basically guarantees deadlock in this scenario.
> 
> This patch fixes the situation by adding
> qemuDomainObjEnterRemoteWithDriver and
> qemuDomainObjExitRemoteWithDriver helper methods.  The Enter
> take an additional object reference, then drops both the
> domain object lock and the driver lock.  The Exit takes
> both the driver and domain object lock, then drops the
> reference.  Adding calls to these Enter and Exit helpers
> around remote calls in the various migration methods
> seems to fix the problem for me in testing.
> 
> I *believe* that this makes the situation safe, though
> I'm not 100% certain.  The additional domain object reference
> ensures that the domain object won't disappear while this
> operation is happening.  The BeginJob that is called inside
> of qemudDomainMigratePerform ensures that we can't execute a
> second migrate (or shutdown, or save, etc) job while the
> migration is active.  That being said, my understanding of the
> locking is still a bit shaky, so I'd like some additional
> confirmation that this is indeed safe.
> 
> Signed-off-by: Chris Lalancette <clalance at redhat.com>
> ---
>  src/qemu/qemu_driver.c |   39 ++++++++++++++++++++++++++++++++-------
>  1 files changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 5f2607c..e81af84 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -531,6 +531,21 @@ static void qemuDomainObjExitMonitorWithDriver(struct qemud_driver *driver, virD
>      }
>  }
>  
> +static void qemuDomainObjEnterRemoteWithDriver(struct qemud_driver *driver,
> +                                               virDomainObjPtr obj)
> +{
> +    virDomainObjRef(obj);
> +    virDomainObjUnlock(obj);
> +    qemuDriverUnlock(driver);
> +}
> +
> +static void qemuDomainObjExitRemoteWithDriver(struct qemud_driver *driver,
> +                                              virDomainObjPtr obj)
> +{
> +    qemuDriverLock(driver);
> +    virDomainObjLock(obj);
> +    virDomainObjUnref(obj);
> +}
>  
>  static int qemuCgroupControllerActive(struct qemud_driver *driver,
>                                        int controller)
> @@ -10753,9 +10768,11 @@ static int doTunnelMigrate(virDomainPtr dom,
>          /* virStreamNew only fails on OOM, and it reports the error itself */
>          goto cleanup;
>  
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      internalret = dconn->driver->domainMigratePrepareTunnel(dconn, st,
>                                                              flags, dname,
>                                                              resource, dom_xml);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);
>  
>      if (internalret < 0)
>          /* domainMigratePrepareTunnel sets the error for us */
> @@ -10833,8 +10850,10 @@ cancel:
>  
>  finish:
>      dname = dname ? dname : dom->name;
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      ddomain = dconn->driver->domainMigrateFinish2
>          (dconn, dname, NULL, 0, uri, flags, retval);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);
>  
>  cleanup:
>      if (client_sock != -1)
> @@ -10873,16 +10892,20 @@ static int doNonTunnelMigrate(virDomainPtr dom,
>      virDomainPtr ddomain = NULL;
>      int retval = -1;
>      char *uri_out = NULL;
> +    int rc;
>  
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      /* NB we don't pass 'uri' into this, since that's the libvirtd
>       * URI in this context - so we let dest pick it */
> -    if (dconn->driver->domainMigratePrepare2(dconn,
> -                                             NULL, /* cookie */
> -                                             0, /* cookielen */
> -                                             NULL, /* uri */
> -                                             &uri_out,
> -                                             flags, dname,
> -                                             resource, dom_xml) < 0)
> +    rc = dconn->driver->domainMigratePrepare2(dconn,
> +                                              NULL, /* cookie */
> +                                              0, /* cookielen */
> +                                              NULL, /* uri */
> +                                              &uri_out,
> +                                              flags, dname,
> +                                              resource, dom_xml);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);
> +    if (rc < 0)
>          /* domainMigratePrepare2 sets the error for us */
>          goto cleanup;
>  
> @@ -10899,8 +10922,10 @@ static int doNonTunnelMigrate(virDomainPtr dom,
>  
>  finish:
>      dname = dname ? dname : dom->name;
> +    qemuDomainObjEnterRemoteWithDriver(driver, vm);
>      ddomain = dconn->driver->domainMigrateFinish2
>          (dconn, dname, NULL, 0, uri_out, flags, retval);
> +    qemuDomainObjExitRemoteWithDriver(driver, vm);
>  
>      if (ddomain)
>          virUnrefDomain(ddomain);

  Explanations and patch both looks fine to me, but I would feel more
confident if Dan reviewed it too, the fine details of low level locking
escape me I'm afraid.

  mini-ACK :-)

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel at veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/




More information about the libvir-list mailing list