[libvirt] [PATCH v12] support offline migration

Jiri Denemark jdenemar at redhat.com
Fri Nov 2 14:32:37 UTC 2012


On Fri, Nov 02, 2012 at 09:30:27 +0800, liguang wrote:
> The original migration code was not aware of the offline case,
> so try to support offline migration quietly (without disturbing
> the existing migration flow) by passing the VIR_MIGRATE_OFFLINE
> flag to the migration APIs only when the domain is really
> inactive, so that the migration process is not puzzled by the
> domain being offline and does not exit unexpectedly.
> These changes do not take care of the disk images the domain
> requires, since those can be transferred by other APIs as
> suggested; therefore VIR_MIGRATE_OFFLINE should not be combined
> with VIR_MIGRATE_NON_SHARED_*.
> If you want a persistent migration, you should do
> "virsh migrate --persistent" yourself.
> 
> v12:
> rebased to resolve conflicts with commit 2f3e2c0c434218a3d656c08779cb98b327170e11,
> and took in some messages from Doug Goldstein's patch
> https://www.redhat.com/archives/libvir-list/2012-October/msg00957.html
> 
> Signed-off-by: liguang <lig.fnst at cn.fujitsu.com>
> ---
>  include/libvirt/libvirt.h.in |    1 +
>  src/qemu/qemu_driver.c       |   15 +++++++++
>  src/qemu/qemu_migration.c    |   66 ++++++++++++++++++++++++++++++++++++-----
>  src/qemu/qemu_migration.h    |    3 +-
>  tools/virsh-domain.c         |   10 ++++++
>  5 files changed, 86 insertions(+), 9 deletions(-)
> 
> diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in
> index 2b17cef..a8f641f 100644
> --- a/include/libvirt/libvirt.h.in
> +++ b/include/libvirt/libvirt.h.in
> @@ -1089,6 +1089,7 @@ typedef enum {
>                                                 * whole migration process; this will be used automatically
>                                                 * when supported */
>      VIR_MIGRATE_UNSAFE            = (1 << 9), /* force migration even if it is considered unsafe */
> +    VIR_MIGRATE_OFFLINE           = (1 << 10), /* offline migrate */
>  } virDomainMigrateFlags;
>  
>  /* Domain migration. */
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 267bbf1..f5c05e4 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -9759,6 +9759,20 @@ qemuDomainMigrateBegin3(virDomainPtr domain,
>      }
>  
>      if (!virDomainObjIsActive(vm)) {
> +        if (flags & VIR_MIGRATE_OFFLINE) {
> +            if (flags & (VIR_MIGRATE_NON_SHARED_DISK|
> +                         VIR_MIGRATE_NON_SHARED_INC)) {
> +                virReportError(VIR_ERR_OPERATION_INVALID,
> +                               "%s", _("offline migration cannot handle non-shared storage"));
> +                goto endjob;
> +            }
> +            if (!(flags & VIR_MIGRATE_PERSIST_DEST)) {
> +                virReportError(VIR_ERR_OPERATION_INVALID,
> +                               "%s", _("offline migration must be specified with the persistent flag set"));
> +                goto endjob;
> +            }
> +            goto offline;
> +        }
>          virReportError(VIR_ERR_OPERATION_INVALID,
>                         "%s", _("domain is not running"));
>          goto endjob;
> @@ -9771,6 +9785,7 @@ qemuDomainMigrateBegin3(virDomainPtr domain,
>      if (qemuDomainCheckEjectableMedia(driver, vm, asyncJob) < 0)
>          goto endjob;
>  
> +offline:
>      if (!(xml = qemuMigrationBegin(driver, vm, xmlin, dname,
>                                     cookieout, cookieoutlen,
>                                     flags)))

I think this label is pretty confusing since the code after it is not just for
offline migration. Not to mention that this label is used to just skip calling
qemuDomainCheckEjectableMedia(). I think it would look better as

    bool offline = !!(flags & VIR_MIGRATE_OFFLINE);
    ...
    if (!virDomainObjIsActive(vm)) {
        if (offline) {
            if (flags & (VIR_MIGRATE_NON_SHARED_DISK|
                         VIR_MIGRATE_NON_SHARED_INC)) {
                virReportError(VIR_ERR_OPERATION_INVALID,
                               "%s", _("offline migration cannot handle non-shared storage"));
                goto endjob;
            }
            if (!(flags & VIR_MIGRATE_PERSIST_DEST)) {
                virReportError(VIR_ERR_OPERATION_INVALID,
                               "%s", _("offline migration must be specified with the persistent flag set"));
                goto endjob;
            }
        } else {
            virReportError(VIR_ERR_OPERATION_INVALID,
                           "%s", _("domain is not running"));
            goto endjob;
        }
    }

    /* Check if there is any ejected media.
     * We don't want to require them on the destination.
     */

    if (!offline &&
        qemuDomainCheckEjectableMedia(driver, vm, asyncJob) < 0)
        goto endjob;

    if (!(xml = qemuMigrationBegin(driver, vm, xmlin, dname,
                                   cookieout, cookieoutlen,
                                   flags)))
        goto endjob;


Oh. And this whole code is actually in the wrong place. It should be placed in
qemuMigrationBegin() so that it's also called in case of peer-to-peer
migration. And that also applies to qemuDomainCheckEjectableMedia(), but that
should be moved first, as a separate patch, since it is a preexisting bug not
related to offline migration.
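
Roughly something like this inside qemuMigrationBegin() (just an untested
sketch of the idea; the exact spot within the function may differ):

    /* sketch: early in qemuMigrationBegin(), before the cookie is baked */
    if (!virDomainObjIsActive(vm) && !(flags & VIR_MIGRATE_OFFLINE)) {
        virReportError(VIR_ERR_OPERATION_INVALID,
                       "%s", _("domain is not running"));
        goto cleanup;
    }

    if (flags & VIR_MIGRATE_OFFLINE) {
        if (flags & (VIR_MIGRATE_NON_SHARED_DISK |
                     VIR_MIGRATE_NON_SHARED_INC)) {
            virReportError(VIR_ERR_OPERATION_INVALID, "%s",
                           _("offline migration cannot handle non-shared storage"));
            goto cleanup;
        }
        if (!(flags & VIR_MIGRATE_PERSIST_DEST)) {
            virReportError(VIR_ERR_OPERATION_INVALID, "%s",
                           _("offline migration must be specified with the persistent flag set"));
            goto cleanup;
        }
    }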

> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
> index e0acc20..85fc48b 100644
> --- a/src/qemu/qemu_migration.c
> +++ b/src/qemu/qemu_migration.c
> @@ -72,6 +72,7 @@ enum qemuMigrationCookieFlags {
>      QEMU_MIGRATION_COOKIE_FLAG_LOCKSTATE,
>      QEMU_MIGRATION_COOKIE_FLAG_PERSISTENT,
>      QEMU_MIGRATION_COOKIE_FLAG_NETWORK,
> +    QEMU_MIGRATION_COOKIE_FLAG_OFFLINE,
>  
>      QEMU_MIGRATION_COOKIE_FLAG_LAST
>  };
> @@ -79,13 +80,14 @@ enum qemuMigrationCookieFlags {
>  VIR_ENUM_DECL(qemuMigrationCookieFlag);
>  VIR_ENUM_IMPL(qemuMigrationCookieFlag,
>                QEMU_MIGRATION_COOKIE_FLAG_LAST,
> -              "graphics", "lockstate", "persistent", "network");
> +              "graphics", "lockstate", "persistent", "network", "offline");
>  
>  enum qemuMigrationCookieFeatures {
>      QEMU_MIGRATION_COOKIE_GRAPHICS  = (1 << QEMU_MIGRATION_COOKIE_FLAG_GRAPHICS),
>      QEMU_MIGRATION_COOKIE_LOCKSTATE = (1 << QEMU_MIGRATION_COOKIE_FLAG_LOCKSTATE),
>      QEMU_MIGRATION_COOKIE_PERSISTENT = (1 << QEMU_MIGRATION_COOKIE_FLAG_PERSISTENT),
>      QEMU_MIGRATION_COOKIE_NETWORK = (1 << QEMU_MIGRATION_COOKIE_FLAG_NETWORK),
> +    QEMU_MIGRATION_COOKIE_OFFLINE = (1 << QEMU_MIGRATION_COOKIE_FLAG_OFFLINE),
>  };
>  
>  typedef struct _qemuMigrationCookieGraphics qemuMigrationCookieGraphics;
> @@ -594,6 +596,9 @@ qemuMigrationCookieXMLFormat(struct qemud_driver *driver,
>      if ((mig->flags & QEMU_MIGRATION_COOKIE_NETWORK) && mig->network)
>          qemuMigrationCookieNetworkXMLFormat(buf, mig->network);
>  
> +    if (mig->flags & QEMU_MIGRATION_COOKIE_OFFLINE)
> +        virBufferAsprintf(buf, "  <offline/>\n");
> +
>      virBufferAddLit(buf, "</qemu-migration>\n");
>      return 0;
>  }
> @@ -874,6 +879,11 @@ qemuMigrationCookieXMLParse(qemuMigrationCookiePtr mig,
>          (!(mig->network = qemuMigrationCookieNetworkXMLParse(ctxt))))
>          goto error;
>  
> +    if ((flags & QEMU_MIGRATION_COOKIE_OFFLINE)) {
> +        if (virXPathBoolean("count(./offline) > 0", ctxt))
> +            mig->flags |= QEMU_MIGRATION_COOKIE_OFFLINE;
> +    }
> +
>      return 0;
>  
>  error:
> @@ -938,6 +948,12 @@ qemuMigrationBakeCookie(qemuMigrationCookiePtr mig,
>          return -1;
>      }
>  
> +    if (flags & QEMU_MIGRATION_COOKIE_OFFLINE) {
> +        mig->flags |= QEMU_MIGRATION_COOKIE_OFFLINE;
> +    }
> +
> +
> +

Too much space here; one empty line is enough :-)

>      if (!(*cookieout = qemuMigrationCookieXMLFormatStr(driver, mig)))
>          return -1;
>  
> @@ -1443,6 +1459,13 @@ char *qemuMigrationBegin(struct qemud_driver *driver,
>                                  QEMU_MIGRATION_COOKIE_LOCKSTATE) < 0)
>          goto cleanup;
>  
> +    if (flags & VIR_MIGRATE_OFFLINE) {
> +        if (qemuMigrationBakeCookie(mig, driver, vm,
> +                                    cookieout, cookieoutlen,
> +                                    QEMU_MIGRATION_COOKIE_OFFLINE) < 0)
> +            goto cleanup;
> +    }
> +
>      if (xmlin) {
>          if (!(def = virDomainDefParseString(driver->caps, xmlin,
>                                              QEMU_EXPECTED_VIRT_TYPES,
> @@ -1607,6 +1630,15 @@ qemuMigrationPrepareAny(struct qemud_driver *driver,
>          goto endjob;
>      }
>  
> +    if (!(mig = qemuMigrationEatCookie(driver, vm, cookiein, cookieinlen,
> +                                       QEMU_MIGRATION_COOKIE_OFFLINE)))
> +        return ret;
> +

Oops, the return statement will leave a big mess uncleaned. You need to goto
endjob instead.
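
I.e. just (sketch):

    if (!(mig = qemuMigrationEatCookie(driver, vm, cookiein, cookieinlen,
                                       QEMU_MIGRATION_COOKIE_OFFLINE)))
        goto endjob;   /* not "return ret"; let the job and vm be cleaned up */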

> +    if (mig->flags & QEMU_MIGRATION_COOKIE_OFFLINE) {
> +        ret = 0;
> +        goto cleanup;
> +    }
> +

This goto cleanup will result in the warning reported by Daniel. The Prepare
phase needs to call qemuMigrationJobContinue() on successful return. That is,
you need to do "goto done;" and add a new "done:" label just above the
following code (around line 1690 on current HEAD):

/* add "done:" here */
     /* We keep the job active across API calls until the finish() call.
      * This prevents any other APIs being invoked while incoming
      * migration is taking place.
      */
     if (!qemuMigrationJobContinue(vm)) {
         vm = NULL;
         virReportError(VIR_ERR_OPERATION_FAILED,
                        "%s", _("domain disappeared"));
         goto cleanup;
     }
 
     ret = 0;
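
With that in place, the offline branch above would then be just (sketch):

    if (mig->flags & QEMU_MIGRATION_COOKIE_OFFLINE)
        goto done;   /* skip starting qemu but keep the job active for Finish */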


>      /* Start the QEMU daemon, with the same command-line arguments plus
>       * -incoming $migrateFrom
>       */
> @@ -2150,6 +2182,9 @@ qemuMigrationRun(struct qemud_driver *driver,
>          return -1;
>      }
>  
> +    if (flags & VIR_MIGRATE_OFFLINE)
> +        return 0;
> +
>      if (!(mig = qemuMigrationEatCookie(driver, vm, cookiein, cookieinlen,
>                                         QEMU_MIGRATION_COOKIE_GRAPHICS)))
>          goto cleanup;

This check seems to be too deep. The whole point of it is to skip transferring
migration data (since there's no qemu running anyway), so why do all the
preparation work for it? I think the Perform phase should be completely skipped
for offline migrations. Or do we actually need to do something during the
Perform phase? It seems we don't, since it is completely skipped during
peer-to-peer migration (not that it would actually work with this patch).
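
For non-p2p migration that probably means skipping the Perform step on the
client side; something like this in virDomainMigrateVersion3() in libvirt.c
(hypothetical sketch, assuming its finish label can be reused and ignoring
the cookie/cancelled bookkeeping that would also be needed):

    /* sketch: nothing to perform for an inactive domain,
     * jump straight to the Finish phase */
    if (flags & VIR_MIGRATE_OFFLINE)
        goto finish;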

> @@ -2665,7 +2700,12 @@ static int doPeer2PeerMigrate3(struct qemud_driver *driver,
>               uri, &uri_out, flags, dname, resource, dom_xml);
>          qemuDomainObjExitRemoteWithDriver(driver, vm);
>      }
> +
>      VIR_FREE(dom_xml);
> +
> +    if (flags & VIR_MIGRATE_OFFLINE)
> +        goto cleanup;
> +
>      if (ret == -1)
>          goto cleanup;
>  

This will skip not only the Perform but also the Finish phase in peer-to-peer
migration. You want to jump to the finish label and do that *after* the check
for ret == -1.
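
Something like (sketch, reusing the existing finish label):

    VIR_FREE(dom_xml);
    if (ret == -1)
        goto cleanup;

    if (flags & VIR_MIGRATE_OFFLINE)
        goto finish;   /* no Perform phase needed, go straight to Finish */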

> @@ -2771,7 +2811,7 @@ finish:
>                   vm->def->name);
>  
>   cleanup:
> -    if (ddomain) {
> +    if (ddomain || (flags & VIR_MIGRATE_OFFLINE)) {
>          virObjectUnref(ddomain);
>          ret = 0;
>      } else {

This is weird. It means that we don't care whether Finish succeeded or not
during offline migration, even though Finish is where the domain configuration
gets stored on disk.

> @@ -2848,7 +2888,7 @@ static int doPeer2PeerMigrate(struct qemud_driver *driver,
>      }
>  
>      /* domain may have been stopped while we were talking to remote daemon */
> -    if (!virDomainObjIsActive(vm)) {
> +    if (!virDomainObjIsActive(vm) && !(flags & VIR_MIGRATE_OFFLINE)) {
>          virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>                         _("guest unexpectedly quit"));
>          goto cleanup;
> @@ -2911,7 +2951,7 @@ qemuMigrationPerformJob(struct qemud_driver *driver,
>      if (qemuMigrationJobStart(driver, vm, QEMU_ASYNC_JOB_MIGRATION_OUT) < 0)
>          goto cleanup;
>  
> -    if (!virDomainObjIsActive(vm)) {
> +    if (!virDomainObjIsActive(vm) && !(flags & VIR_MIGRATE_OFFLINE)) {
>          virReportError(VIR_ERR_OPERATION_INVALID,
>                         "%s", _("domain is not running"));
>          goto endjob;
> @@ -3236,6 +3276,8 @@ qemuMigrationFinish(struct qemud_driver *driver,
>       */
>      if (retcode == 0) {
>          if (!virDomainObjIsActive(vm)) {
> +            if (flags & VIR_MIGRATE_OFFLINE)
> +                goto offline;
>              virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>                             _("guest unexpectedly quit"));
>              goto endjob;
> @@ -3255,6 +3297,7 @@ qemuMigrationFinish(struct qemud_driver *driver,
>              if (qemuDomainMigrateOPDRelocate(driver, vm, mig) < 0)
>                  VIR_WARN("unable to provide network data for relocation");
>  
> +    offline:
>          if (flags & VIR_MIGRATE_PERSIST_DEST) {
>              virDomainDefPtr vmdef;
>              if (vm->persistent)

Again, a pretty confusing use of the offline label.

> @@ -3302,7 +3345,7 @@ qemuMigrationFinish(struct qemud_driver *driver,
>              event = NULL;
>          }
>  
> -        if (!(flags & VIR_MIGRATE_PAUSED)) {
> +        if (!(flags & VIR_MIGRATE_PAUSED) && !(flags & VIR_MIGRATE_OFFLINE)) {
>              /* run 'cont' on the destination, which allows migration on qemu
>               * >= 0.10.6 to work properly.  This isn't strictly necessary on
>               * older qemu's, but it also doesn't hurt anything there
> @@ -3351,9 +3394,11 @@ qemuMigrationFinish(struct qemud_driver *driver,
>                                               VIR_DOMAIN_EVENT_SUSPENDED,
>                                               VIR_DOMAIN_EVENT_SUSPENDED_PAUSED);
>          }
> -        if (virDomainSaveStatus(driver->caps, driver->stateDir, vm) < 0) {
> -            VIR_WARN("Failed to save status on vm %s", vm->def->name);
> -            goto endjob;
> +        if (virDomainObjIsActive(vm)) {
> +            if (virDomainSaveStatus(driver->caps, driver->stateDir, vm) < 0) {
> +                VIR_WARN("Failed to save status on vm %s", vm->def->name);
> +                goto endjob;
> +            }
>          }
>  
>          /* Guest is successfully running, so cancel previous auto destroy */
> @@ -3373,6 +3418,7 @@ qemuMigrationFinish(struct qemud_driver *driver,
>  endjob:
>      if (qemuMigrationJobFinish(driver, vm) == 0) {
>          vm = NULL;
> +    } else if (flags & VIR_MIGRATE_OFFLINE) {
>      } else if (!vm->persistent && !virDomainObjIsActive(vm)) {
>          qemuDomainRemoveInactive(driver, vm);
>          vm = NULL;

This is wrong since vm->persistent will always be true for offline
migration. Unless something failed early, in which case we actually want to
remove the domain. The current code should work without any modification.

> @@ -3420,6 +3466,9 @@ int qemuMigrationConfirm(struct qemud_driver *driver,
>      if (!(mig = qemuMigrationEatCookie(driver, vm, cookiein, cookieinlen, 0)))
>          return -1;
>  
> +    if (flags & VIR_MIGRATE_OFFLINE)
> +        goto offline;
> +
>      /* Did the migration go as planned?  If yes, kill off the
>       * domain object, but if no, resume CPUs
>       */
> @@ -3455,6 +3504,7 @@ int qemuMigrationConfirm(struct qemud_driver *driver,
>          }
>      }
>  
> +offline:
>      qemuMigrationCookieFree(mig);
>      rv = 0;
>  

Confusing label again, I'd use "done" or something similar.

We're almost there; just some unusual code paths (other than successful non-p2p
migration) need fixing. Moreover, I think we should explicitly fail v2
migration with the OFFLINE flag to avoid having to touch that path as well. And
since you need to make changes to doPeer2PeerMigrate3, you should do similar
changes to the migration APIs in libvirt.c (since they are very similar), which
should let you avoid changing some other parts.
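
For the v2 rejection an explicit check in the v2 entry points would be
enough; e.g. (hypothetical placement in qemuDomainMigratePerform(), and the
exact error code/message is up to you):

    if (flags & VIR_MIGRATE_OFFLINE) {
        virReportError(VIR_ERR_OPERATION_INVALID, "%s",
                       _("offline migration is not supported "
                         "with migration protocol v2"));
        goto cleanup;   /* or whatever the local error path is */
    }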

Jirka