[libvirt] [PATCH] Migration: Preserve the failed job in case migration job is terminated beyond the perform phase.

Prerna saxenap.ltc at gmail.com
Mon Jan 29 10:26:29 UTC 2018


Hi Jirka,

On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar at redhat.com> wrote:

> On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> > In case of non-p2p migration, in case libvirt client gets disconnected from source libvirt
> > after PERFORM phase is over, the daemon just resets the current migration job.
> > However, the VM could be left paused on both source and destination in such case. In case
> > the client reconnects and queries migration status, the job has been blanked out from source libvirt,
> > and this reconnected client has no clear way of figuring out if an unclean migration had previously
> > been aborted.
>
> The virDomainGetState API should return VIR_DOMAIN_PAUSED with
> VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?
>
>
I understand that a client application should poll the source libvirtd for
the completion status of the migration job using virDomainGetJobStats().
However, as you explained above, the cleanup callbacks clear the job info, so
a client would additionally have to poll virDomainGetState().
Would it not be cleaner to have a single API reflect the source of truth?
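
To illustrate the two-step check a reconnecting client currently needs, here
is a minimal sketch. The enums and classify_migration() are hypothetical
stand-ins, not the real libvirt definitions; in an actual client the inputs
would come from virDomainGetJobStats() and virDomainGetState().

```c
#include <assert.h>

/* Hypothetical stand-ins for the libvirt values discussed in this thread;
 * these are NOT the real libvirt headers. */
enum job_type { JOB_NONE, JOB_UNBOUNDED, JOB_COMPLETED, JOB_FAILED };
enum dom_state { DOM_RUNNING, DOM_PAUSED, DOM_SHUTOFF };
enum pause_reason { PAUSE_UNKNOWN, PAUSE_USER, PAUSE_MIGRATION };

enum verdict {
    MIG_IN_PROGRESS,    /* job still active */
    MIG_DONE,           /* job reported completion */
    MIG_FAILED,         /* job reported failure */
    MIG_INDETERMINATE,  /* job info gone, domain paused for migration */
    MIG_NOT_MIGRATING   /* nothing suggests a migration */
};

/* Combine the job type (first query) with the domain state and reason
 * (second query) into one answer for the client. */
enum verdict
classify_migration(enum job_type job, enum dom_state state,
                   enum pause_reason reason)
{
    assert(job >= JOB_NONE && job <= JOB_FAILED);

    switch (job) {
    case JOB_UNBOUNDED:
        return MIG_IN_PROGRESS;
    case JOB_COMPLETED:
        return MIG_DONE;
    case JOB_FAILED:
        return MIG_FAILED;
    case JOB_NONE:
        break;
    }

    /* The job was blanked out: only the domain state is left as a hint. */
    if (state == DOM_PAUSED && reason == PAUSE_MIGRATION)
        return MIG_INDETERMINATE;

    return MIG_NOT_MIGRATING;
}
```

The JOB_NONE branch is the case under discussion: the job data alone cannot
tell an interrupted migration apart from no migration at all, so the client
has to fall back to the paused state and its reason.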


> > This patch calls out a "potentially" incomplete migration as a failed
> > job, so that a client may
>
> As you say it's potentially incomplete, so marking it as failed is not
> completely correct. It's a split brain when the source cannot
> distinguish whether the migration was successful or not.
>

Agreed, it might have run to completion too, as we observed in some cases.
Do you think marking the job status as "UNKNOWN" would be a better
articulation of the current state?


>
> > diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> > index e8e0313..7c60d17 100644
> > --- a/src/qemu/qemu_domain.c
> > +++ b/src/qemu/qemu_domain.c
> > @@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> >      qemuDomainObjSaveJob(driver, obj);
> >  }
> >
> > +
> > +void
> > +qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> > +{
> > +    qemuDomainObjPrivatePtr priv = obj->privateData;
> > +    VIR_FREE(priv->job.completed);
> > +    if (VIR_ALLOC(priv->job.completed) == 0) {
> > +        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
> > +        priv->job.completed = priv->job.current;
>
> This will just leak the memory allocated for priv->job.completed by
> overwriting the pointer to the one from priv->job.current, ...
>
> > +    } else {
> > +        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
> > +    }
> > +    qemuDomainObjResetAsyncJob(priv);
>
> which will point to a freed memory after this call.
>

Agree, I will fix this.
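
One possible shape of the fix, as a minimal sketch with plain malloc/free and
hypothetical stand-in types (struct job_info, struct job_state, job_reset(),
job_fail()) rather than the real qemuDomainObjPrivate structures: copy the
current job by value into a freshly allocated record, so that "completed"
never aliases memory that the reset frees.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the job bookkeeping; not the libvirt types. */
enum job_type { JOB_NONE, JOB_UNBOUNDED, JOB_FAILED };

struct job_info {
    enum job_type type;
    unsigned long long data_processed;
};

struct job_state {
    struct job_info *current;    /* job in flight, freed on reset */
    struct job_info *completed;  /* last finished job, kept for queries */
};

/* Mimics the reset step, which the review notes frees the current job. */
void
job_reset(struct job_state *job)
{
    free(job->current);
    job->current = NULL;
}

/* Record the in-flight job as failed, then reset.
 * Returns 0 on success, -1 if there was no job or allocation failed. */
int
job_fail(struct job_state *job)
{
    struct job_info *copy;

    assert(job != NULL);

    if (!job->current)
        return -1;

    copy = malloc(sizeof(*copy));
    if (!copy) {
        job_reset(job);
        return -1;
    }

    /* Copy by value: no pointer is shared with job->current. */
    *copy = *job->current;
    copy->type = JOB_FAILED;

    /* Release the previous record only once the new one is ready. */
    free(job->completed);
    job->completed = copy;

    job_reset(job);
    return 0;
}
```

Because the copy is made before the reset, "completed" stays valid afterwards,
and the old record is freed exactly once, which avoids both the leak and the
dangling pointer noted in the review.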


>
> > +    qemuDomainObjEndJob(driver, obj);
>
> And while this is almost certainly (I didn't really check though) not
> something you should call from a close callback, you don't save the
> changes to the status XML so once libvirtd restarts, it will think the
> domain is still being migrated.
>

I will save the same to the status XML.
I am suggesting that strengthening the job data would be additionally
useful. If the daemon has not restarted, the job information can still give
us a fairly accurate status of the migration. Please let me know if you think
this is not useful; I will be happy to learn the rationale.

Regards,
Prerna