[libvirt] [PATCH 00/19] Rollback migration when libvirtd restarts

Wen Congyang wency at cn.fujitsu.com
Thu Jul 28 08:23:11 UTC 2011


At 07/28/2011 03:26 PM, Wen Congyang wrote:
> At 07/28/2011 05:41 AM, Eric Blake wrote:
>> On 07/07/2011 05:34 PM, Jiri Denemark wrote:
>>> This series is also available at
>>> https://gitorious.org/~jirka/libvirt/jirka-staging/commits/migration-recovery
>>>
>>>
>>> The series does several things:
>>> - persists current job and its phase in status xml
>>> - allows safe monitor commands to be run during migration/save/dump jobs
>>> - implements recovery when libvirtd is restarted while a job is active
>>> - consolidates some code and fixes bugs I found when working in the area
>>
>> git bisect is pointing to this series as the cause of a regression in
>> 'virsh managedsave dom' triggering libvirtd core dumps if some other
>> process is actively making queries on domain at the same time
>> (virt-manager is a great process for fitting that bill).  I'm trying to
>> further narrow down which patch introduced the regression, and see if I
>> can plug the race (probably a case of not checking whether the monitor
>> still exists when getting the condition for an asynchronous job, since
>> the whole point of virsh [managed]save is that the domain will go away
>> when the save completes, but that it is time-consuming enough that we
>> want to query domain state in the meantime).
> 
> I can reproduce this bug.
> 
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0x7ffff06d0700 (LWP 11419)]
>> 0x00000000004b9ad8 in qemuMonitorSend (mon=0x7fffe815c060,
>> msg=0x7ffff06cf380)
>>     at qemu/qemu_monitor.c:801
>> 801        while (!mon->msg->finished) {
> 
> The reason is that mon->msg is NULL.
> I added some debug code and found that we send a monitor command while
> the previous command has not finished yet, and then libvirtd crashes.
> 
> After reading the code, I think something is wrong in the function
> qemuDomainObjEnterMonitorInternal():
>     if (priv->job.active == QEMU_JOB_NONE && priv->job.asyncJob) {
>         if (qemuDomainObjBeginNestedJob(driver, obj) < 0)
> A query job may run while an async job is active. When the async job then
> queries the migration status, priv->job.active is not QEMU_JOB_NONE (the
> query job owns it), so the nested job is skipped and we do not wait for
> the query job to finish. As a result we send a monitor command while the
> previous command is still outstanding, which is very dangerous.
> In other words, while an async job is running we cannot tell from
> priv->job.active alone whether a monitor call is a nested call made by
> the async job itself.
> 
> I think we should introduce four functions for async nested jobs:

Some functions (for example qemuProcessStopCPUs) can be called by both sync jobs and
async jobs, so when we enter them we do not know which type of job the caller is
running.

We allow a sync job and an async job to run at the same time, which means the monitor
commands issued by the two jobs can be interleaved in any order.

Another way to fix this bug: if we try to send a monitor command while the previous
command has not finished, wait for the previous command to finish first.
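
Roughly, what I have in mind is the toy model below. It is not libvirt code: the
mon->msg / mon->lock / mon->notify names only mirror what the backtrace shows, and
the real change would presumably live in qemuMonitorSend(); everything else is
invented for illustration.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct message {
    bool finished;              /* set by the reply handler */
};

struct monitor {
    pthread_mutex_t lock;       /* protects msg */
    pthread_cond_t notify;      /* signalled whenever msg changes state */
    struct message *msg;        /* command currently on the wire, or NULL */
};

/* Send one command and wait for its reply.  The first loop is the new part:
 * if another job already has a command in flight, wait for it to be answered
 * instead of overwriting mon->msg (which is what crashes today). */
static int monitor_send(struct monitor *mon, struct message *msg)
{
    pthread_mutex_lock(&mon->lock);

    while (mon->msg != NULL)                /* proposed: serialize senders */
        pthread_cond_wait(&mon->notify, &mon->lock);

    mon->msg = msg;                         /* hand the command to the I/O loop */

    while (!mon->msg->finished)             /* existing behaviour: wait for the reply */
        pthread_cond_wait(&mon->notify, &mon->lock);

    mon->msg = NULL;                        /* let the next sender in */
    pthread_cond_broadcast(&mon->notify);
    pthread_mutex_unlock(&mon->lock);
    return 0;
}

/* Called by the I/O loop when the reply for the current command arrives. */
static void monitor_reply_received(struct monitor *mon)
{
    pthread_mutex_lock(&mon->lock);
    if (mon->msg != NULL)
        mon->msg->finished = true;
    pthread_cond_broadcast(&mon->notify);
    pthread_mutex_unlock(&mon->lock);
}

Waiting (instead of reporting an error) would keep all existing callers unchanged.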

> qemuDomainObjAsyncEnterMonitor()
> qemuDomainObjAsyncEnterMonitorWithDriver()
> qemuDomainObjAsyncExitMonitor()
> qemuDomainObjAsyncExitMonitorWithDriver()
> 
> The caller of qemuDomainObjEnterMonitorInternal() should pass a bool value to
> tell it whether the job is an async nested job.
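
To make that concrete, here is a toy model of the proposed flag. Only the QEMU_JOB_*
idea, the nested job, and the wrapper names above come from the real code and the
proposal; every other name is invented for illustration and none of this is actual
libvirt code.

#include <pthread.h>
#include <stdbool.h>

enum job_type { JOB_NONE, JOB_QUERY, JOB_ASYNC_NESTED };

struct job_state {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    enum job_type active;       /* sync job currently owning the monitor */
};

/* Wait until no sync job owns the monitor, then own it as a nested job.
 * This is what serializes the async job's monitor commands against a
 * concurrently running query job. */
static int begin_nested_job(struct job_state *job)
{
    pthread_mutex_lock(&job->lock);
    while (job->active != JOB_NONE)
        pthread_cond_wait(&job->cond, &job->lock);
    job->active = JOB_ASYNC_NESTED;
    pthread_mutex_unlock(&job->lock);
    return 0;
}

/* Release the nested job (the Async exit wrapper would call this). */
static void end_nested_job(struct job_state *job)
{
    pthread_mutex_lock(&job->lock);
    job->active = JOB_NONE;
    pthread_cond_broadcast(&job->cond);
    pthread_mutex_unlock(&job->lock);
}

/* The extra parameter removes the guessing: async-job callers say so
 * explicitly instead of being inferred from job->active, which is exactly
 * the inference that goes wrong when a query job is active at the same time. */
static int enter_monitor_internal(struct job_state *job, bool async_nested)
{
    if (async_nested) {
        /* The caller is the async job itself (e.g. the migration loop):
         * take a nested job so its monitor command waits for any running
         * query job instead of overlapping it. */
        return begin_nested_job(job);
    }
    /* Sync callers already acquired job->active before calling us. */
    return 0;
}

/* async_enter_monitor() plays the role of the qemuDomainObjAsyncEnterMonitor()
 * proposed above; enter_monitor() is the plain entry point for sync callers. */
static int async_enter_monitor(struct job_state *job)
{
    return enter_monitor_internal(job, true);
}

static int enter_monitor(struct job_state *job)
{
    return enter_monitor_internal(job, false);
}

With the explicit flag, the async job's monitor command always queues behind a
running query job instead of racing with it.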
> 
> Thanks
> Wen Congyang.
> 
>> (gdb) bt
>> #0  0x00000000004b9ad8 in qemuMonitorSend (mon=0x7fffe815c060,
>>     msg=0x7ffff06cf380) at qemu/qemu_monitor.c:801
>> #1  0x00000000004c77ae in qemuMonitorJSONCommandWithFd (mon=0x7fffe815c060,
>>     cmd=0x7fffd8000940, scm_fd=-1, reply=0x7ffff06cf480)
>>     at qemu/qemu_monitor_json.c:225
>> #2  0x00000000004c78e5 in qemuMonitorJSONCommand (mon=0x7fffe815c060,
>>     cmd=0x7fffd8000940, reply=0x7ffff06cf480) at
>> qemu/qemu_monitor_json.c:254
>> #3  0x00000000004cc19c in qemuMonitorJSONGetMigrationStatus (
>>     mon=0x7fffe815c060, status=0x7ffff06cf580, transferred=0x7ffff06cf570,
>>     remaining=0x7ffff06cf568, total=0x7ffff06cf560)
>>     at qemu/qemu_monitor_json.c:1920
>> #4  0x00000000004bc1b3 in qemuMonitorGetMigrationStatus
>> (mon=0x7fffe815c060,
>>     status=0x7ffff06cf580, transferred=0x7ffff06cf570,
>>     remaining=0x7ffff06cf568, total=0x7ffff06cf560) at
>> qemu/qemu_monitor.c:1532
>> #5  0x00000000004b201b in qemuMigrationUpdateJobStatus
>> (driver=0x7fffe80089f0,
>>     vm=0x7fffe8015cd0, job=0x5427b6 "domain save job")
>>     at qemu/qemu_migration.c:765
>> #6  0x00000000004b2383 in qemuMigrationWaitForCompletion (
>>     driver=0x7fffe80089f0, vm=0x7fffe8015cd0) at qemu/qemu_migration.c:846
>> #7  0x00000000004b7806 in qemuMigrationToFile (driver=0x7fffe80089f0,
>>     vm=0x7fffe8015cd0, fd=27, offset=4096,
>>     path=0x7fffd8000990 "/var/lib/libvirt/qemu/save/fedora_12.save",
>>     compressor=0x0, is_reg=true, bypassSecurityDriver=true)
>>     at qemu/qemu_migration.c:2766
>> #8  0x000000000046a90d in qemuDomainSaveInternal (driver=0x7fffe80089f0,
>>     dom=0x7fffd8000ad0, vm=0x7fffe8015cd0,
>>     path=0x7fffd8000990 "/var/lib/libvirt/qemu/save/fedora_12.save",
>>     compressed=0, bypass_cache=false) at qemu/qemu_driver.c:2386
>>
>>
> 
> --
> libvir-list mailing list
> libvir-list at redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
> 



