[libvirt] [RFC][PATCH] Fix a deadlock in bi-directional p2p concurrent migration.
Daniel Veillard
veillard at redhat.com
Thu Jul 15 15:44:46 UTC 2010
On Thu, Jul 15, 2010 at 10:44:46AM -0400, Chris Lalancette wrote:
> If you try to execute two concurrent migrations p2p
> from A->B and B->A, the two libvirtd's will deadlock
> trying to perform the migrations. The reason for this is
> that in p2p migration, the libvirtd's are responsible for
> making the RPC Prepare, Migrate, and Finish calls. However,
> they are currently holding the driver lock while doing so,
> which basically guarantees deadlock in this scenario.
>
> This patch fixes the situation by adding
> qemuDomainObjEnterRemoteWithDriver and
> qemuDomainObjExitRemoteWithDriver helper methods. The Enter
> take an additional object reference, then drops both the
> domain object lock and the driver lock. The Exit takes
> both the driver and domain object lock, then drops the
> reference. Adding calls to these Enter and Exit helpers
> around remote calls in the various migration methods
> seems to fix the problem for me in testing.
>
> I *believe* that this makes the situation safe, though
> I'm not 100% certain. The additional domain object reference
> ensures that the domain object won't disappear while this
> operation is happening. The BeginJob that is called inside
> of qemudDomainMigratePerform ensures that we can't execute a
> second migrate (or shutdown, or save, etc) job while the
> migration is active. That being said, my understanding of the
> locking is still a bit shaky, so I'd like some additional
> confirmation that this is indeed safe.
>
> Signed-off-by: Chris Lalancette <clalance at redhat.com>
> ---
> src/qemu/qemu_driver.c | 39 ++++++++++++++++++++++++++++++++-------
> 1 files changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 5f2607c..e81af84 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -531,6 +531,21 @@ static void qemuDomainObjExitMonitorWithDriver(struct qemud_driver *driver, virD
> }
> }
>
> +static void qemuDomainObjEnterRemoteWithDriver(struct qemud_driver *driver,
> + virDomainObjPtr obj)
> +{
> + virDomainObjRef(obj);
> + virDomainObjUnlock(obj);
> + qemuDriverUnlock(driver);
> +}
> +
> +static void qemuDomainObjExitRemoteWithDriver(struct qemud_driver *driver,
> + virDomainObjPtr obj)
> +{
> + qemuDriverLock(driver);
> + virDomainObjLock(obj);
> + virDomainObjUnref(obj);
> +}
>
> static int qemuCgroupControllerActive(struct qemud_driver *driver,
> int controller)
> @@ -10753,9 +10768,11 @@ static int doTunnelMigrate(virDomainPtr dom,
> /* virStreamNew only fails on OOM, and it reports the error itself */
> goto cleanup;
>
> + qemuDomainObjEnterRemoteWithDriver(driver, vm);
> internalret = dconn->driver->domainMigratePrepareTunnel(dconn, st,
> flags, dname,
> resource, dom_xml);
> + qemuDomainObjExitRemoteWithDriver(driver, vm);
>
> if (internalret < 0)
> /* domainMigratePrepareTunnel sets the error for us */
> @@ -10833,8 +10850,10 @@ cancel:
>
> finish:
> dname = dname ? dname : dom->name;
> + qemuDomainObjEnterRemoteWithDriver(driver, vm);
> ddomain = dconn->driver->domainMigrateFinish2
> (dconn, dname, NULL, 0, uri, flags, retval);
> + qemuDomainObjExitRemoteWithDriver(driver, vm);
>
> cleanup:
> if (client_sock != -1)
> @@ -10873,16 +10892,20 @@ static int doNonTunnelMigrate(virDomainPtr dom,
> virDomainPtr ddomain = NULL;
> int retval = -1;
> char *uri_out = NULL;
> + int rc;
>
> + qemuDomainObjEnterRemoteWithDriver(driver, vm);
> /* NB we don't pass 'uri' into this, since that's the libvirtd
> * URI in this context - so we let dest pick it */
> - if (dconn->driver->domainMigratePrepare2(dconn,
> - NULL, /* cookie */
> - 0, /* cookielen */
> - NULL, /* uri */
> - &uri_out,
> - flags, dname,
> - resource, dom_xml) < 0)
> + rc = dconn->driver->domainMigratePrepare2(dconn,
> + NULL, /* cookie */
> + 0, /* cookielen */
> + NULL, /* uri */
> + &uri_out,
> + flags, dname,
> + resource, dom_xml);
> + qemuDomainObjExitRemoteWithDriver(driver, vm);
> + if (rc < 0)
> /* domainMigratePrepare2 sets the error for us */
> goto cleanup;
>
> @@ -10899,8 +10922,10 @@ static int doNonTunnelMigrate(virDomainPtr dom,
>
> finish:
> dname = dname ? dname : dom->name;
> + qemuDomainObjEnterRemoteWithDriver(driver, vm);
> ddomain = dconn->driver->domainMigrateFinish2
> (dconn, dname, NULL, 0, uri_out, flags, retval);
> + qemuDomainObjExitRemoteWithDriver(driver, vm);
>
> if (ddomain)
> virUnrefDomain(ddomain);
Explanations and patch both looks fine to me, but I would feel more
confident if Dan reviewed it too, the fine details of low level locking
escape me I'm afraid.
mini-ACK :-)
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel at veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
More information about the libvir-list
mailing list