[libvirt] [PATCH 00/19] Rollback migration when libvirtd restarts

Wen Congyang wency at cn.fujitsu.com
Thu Jul 28 07:26:02 UTC 2011

At 07/28/2011 05:41 AM, Eric Blake wrote:
> On 07/07/2011 05:34 PM, Jiri Denemark wrote:
>> This series is also available at
>> https://gitorious.org/~jirka/libvirt/jirka-staging/commits/migration-recovery
>> The series does several things:
>> - persists current job and its phase in status xml
>> - allows safe monitor commands to be run during migration/save/dump jobs
>> - implements recovery when libvirtd is restarted while a job is active
>> - consolidates some code and fixes bugs I found when working in the area
> git bisect is pointing to this series as the cause of a regression in
> 'virsh managedsave dom' triggering libvirtd core dumps if some other
> process is actively making queries on domain at the same time
> (virt-manager is a great process for fitting that bill).  I'm trying to
> further narrow down which patch introduced the regression, and see if I
> can plug the race (probably a case of not checking whether the monitor
> still exists when getting the condition for an asynchronous job, since
> the whole point of virsh [managed]save is that the domain will go away
> when the save completes, but that it is time-consuming enough that we
> want to query domain state in the meantime).

I can reproduce this bug.

> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff06d0700 (LWP 11419)]
> 0x00000000004b9ad8 in qemuMonitorSend (mon=0x7fffe815c060,
> msg=0x7ffff06cf380)
>     at qemu/qemu_monitor.c:801
> 801        while (!mon->msg->finished) {

The reason is that mon->msg is NULL.
I added some debug code and found that we send a monitor command while
the previous command has not finished, and then libvirtd crashes.
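
To make the failure easier to picture, here is a minimal single-threaded replay of one
interleaving that would leave mon->msg NULL for a waiter. It assumes the monitor keeps a
single pending-message slot and that a sender clears the slot when its own command
completes; that mechanism is my assumption, and every toy_* name is hypothetical, not
libvirt code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the libvirt structures. */
struct toy_msg { bool finished; };
struct toy_mon { struct toy_msg *msg; /* single in-flight slot (assumed) */ };

/* A sender installs its message without checking for an existing owner. */
void toy_send_begin(struct toy_mon *mon, struct toy_msg *msg)
{
    assert(mon != NULL && msg != NULL);
    mon->msg = msg;
}

/* The reply handler marks whatever message is in the slot as finished. */
void toy_reply_arrived(struct toy_mon *mon)
{
    if (mon->msg)
        mon->msg->finished = true;
}

/* A sender whose command completed clears the slot on its way out. */
void toy_send_end(struct toy_mon *mon)
{
    mon->msg = NULL;
}

/* Replays: A sends, B sends before A has finished, B completes and leaves.
 * Returns true if A would then evaluate mon->msg->finished on a NULL slot. */
bool toy_overlap_leaves_waiter_with_null(void)
{
    struct toy_mon mon = { NULL };
    struct toy_msg a = { false }, b = { false };

    toy_send_begin(&mon, &a);   /* thread A: async job polls status */
    toy_send_begin(&mon, &b);   /* thread B: query overwrites the slot */
    toy_reply_arrived(&mon);    /* the reply is credited to B */
    toy_send_end(&mon);         /* B leaves and clears the slot */

    return mon.msg == NULL;     /* A's loop condition would dereference this */
}
```

In this model the second sender never waits for the first, which is the situation the
debug output showed.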

After reading the code, I think something is wrong in the function containing this check:
    if (priv->job.active == QEMU_JOB_NONE && priv->job.asyncJob) {
        if (qemuDomainObjBeginNestedJob(driver, obj) < 0)
We can run a query job while an asyncJob is running. When we query the migration's
status, priv->job.active is not QEMU_JOB_NONE, so we do not wait for the query job
to finish, and we send a monitor command while the previous command is still pending. It's very
When we run an async job, we cannot tell from priv->job.active's value alone whether
the job is a nested async job.
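
A small model of the quoted condition shows the gap. The toy_* types below are
hypothetical stand-ins for the job state (asyncJob is reduced to a bool here), so treat
this as a sketch of the decision only, not of the real data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the job state; not the libvirt definitions. */
enum toy_job { TOY_JOB_NONE = 0, TOY_JOB_QUERY };

struct toy_job_state {
    enum toy_job active;  /* normal job currently running on the domain */
    bool asyncJob;        /* true while a save/dump/migration job runs */
};

/* Same shape as the quoted check: a nested job is begun only when no
 * normal job is active and an async job exists. */
bool toy_begins_nested_job(const struct toy_job_state *job)
{
    assert(job != NULL);
    return job->active == TOY_JOB_NONE && job->asyncJob;
}
```

With { TOY_JOB_NONE, true } the function returns true and the async job would serialize
through a nested job. With { TOY_JOB_QUERY, true }, which is the state while another
thread's query is running, it returns false, so the async job's thread would go straight
to the monitor without waiting.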

I think we should introduce four functions for async nested jobs:

The caller of qemuDomainObjEnterMonitorInternal() should pass a bool value that tells
it whether the job is an async nested job.
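
One possible reading of this, as a sketch only: the decision is driven by the flag the
caller passes rather than by guessing from the active job. All sketch_* names are
hypothetical, and the real function would take the driver and domain object instead of
this reduced state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced job state; not the libvirt definitions. */
enum sketch_job { SKETCH_JOB_NONE = 0, SKETCH_JOB_QUERY };

struct sketch_job_state {
    enum sketch_job active;  /* normal job currently running, if any */
};

enum sketch_action {
    SKETCH_ENTER_NOW,        /* caller already owns a normal job */
    SKETCH_BEGIN_NESTED,     /* no job active: take the nested job, then enter */
    SKETCH_WAIT_THEN_NESTED  /* another job is active: wait for it first */
};

/* Decide what to do before talking to the monitor. The caller states
 * explicitly whether it is running as part of an async job. */
enum sketch_action sketch_enter_monitor(const struct sketch_job_state *job,
                                        bool asyncNested)
{
    assert(job != NULL);
    if (!asyncNested)
        return SKETCH_ENTER_NOW;
    if (job->active != SKETCH_JOB_NONE)
        return SKETCH_WAIT_THEN_NESTED;
    return SKETCH_BEGIN_NESTED;
}
```

The point of the flag is the middle case: an async job that finds a query job active is
told to wait instead of being mistaken for the owner of that query job.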

Wen Congyang.

> (gdb) bt
> #0  0x00000000004b9ad8 in qemuMonitorSend (mon=0x7fffe815c060,
>     msg=0x7ffff06cf380) at qemu/qemu_monitor.c:801
> #1  0x00000000004c77ae in qemuMonitorJSONCommandWithFd (mon=0x7fffe815c060,
>     cmd=0x7fffd8000940, scm_fd=-1, reply=0x7ffff06cf480)
>     at qemu/qemu_monitor_json.c:225
> #2  0x00000000004c78e5 in qemuMonitorJSONCommand (mon=0x7fffe815c060,
>     cmd=0x7fffd8000940, reply=0x7ffff06cf480) at
> qemu/qemu_monitor_json.c:254
> #3  0x00000000004cc19c in qemuMonitorJSONGetMigrationStatus (
>     mon=0x7fffe815c060, status=0x7ffff06cf580, transferred=0x7ffff06cf570,
>     remaining=0x7ffff06cf568, total=0x7ffff06cf560)
>     at qemu/qemu_monitor_json.c:1920
> #4  0x00000000004bc1b3 in qemuMonitorGetMigrationStatus
> (mon=0x7fffe815c060,
>     status=0x7ffff06cf580, transferred=0x7ffff06cf570,
>     remaining=0x7ffff06cf568, total=0x7ffff06cf560) at
> qemu/qemu_monitor.c:1532
> #5  0x00000000004b201b in qemuMigrationUpdateJobStatus
> (driver=0x7fffe80089f0,
>     vm=0x7fffe8015cd0, job=0x5427b6 "domain save job")
>     at qemu/qemu_migration.c:765
> #6  0x00000000004b2383 in qemuMigrationWaitForCompletion (
>     driver=0x7fffe80089f0, vm=0x7fffe8015cd0) at qemu/qemu_migration.c:846
> #7  0x00000000004b7806 in qemuMigrationToFile (driver=0x7fffe80089f0,
>     vm=0x7fffe8015cd0, fd=27, offset=4096,
>     path=0x7fffd8000990 "/var/lib/libvirt/qemu/save/fedora_12.save",
>     compressor=0x0, is_reg=true, bypassSecurityDriver=true)
>     at qemu/qemu_migration.c:2766
> #8  0x000000000046a90d in qemuDomainSaveInternal (driver=0x7fffe80089f0,
>     dom=0x7fffd8000ad0, vm=0x7fffe8015cd0,
>     path=0x7fffd8000990 "/var/lib/libvirt/qemu/save/fedora_12.save",
>     compressed=0, bypass_cache=false) at qemu/qemu_driver.c:2386

