[libvirt] [PATCHv3 0/6] Fix memory corruption/crash in the connection close callback

Peter Krempa pkrempa at redhat.com
Mon Apr 8 13:04:24 UTC 2013


[Re-sending: the original mail apparently did not reach the list.]

On 04/08/13 14:06, Peter Krempa wrote:
 > On 04/08/13 13:55, Viktor Mihajlovski wrote:
 >> I fear we're not through this yet. Today I had a segfault doing a migration
 >> using virsh migrate --verbose --live $guest qemu+ssh://$host/system.
 >> This is with Friday's git HEAD.
 >> The migration took very long (but succeeded except for the libvirt
 >> crash) so there still seems to be a race lingering in the object
 >> reference counting exposed by the --verbose option (getjobinfo?).
 >>
 >> (gdb) bt
 >> #0  qemuDomainGetJobInfo (dom=<optimized out>, info=0x3fffaaaaa70) at
 >> qemu/qemu_driver.c:10166
 >> #1  0x000003fffd4bbe68 in virDomainGetJobInfo (domain=0x3ffe4002660,
 >> info=0x3fffaaaaa70) at libvirt.c:17440
 >> #2  0x000002aace36b528 in remoteDispatchDomainGetJobInfo
 >> (server=<optimized out>, msg=<optimized out>, ret=0x3ffe40029d0,
 >>      args=0x3ffe40026a0, rerr=0x3fffaaaac20, client=<optimized out>)
 >> at remote_dispatch.h:2069
 >> #3  remoteDispatchDomainGetJobInfoHelper (server=<optimized out>,
 >> client=<optimized out>, msg=<optimized out>,
 >>      rerr=0x3fffaaaac20, args=0x3ffe40026a0, ret=0x3ffe40029d0) at
 >> remote_dispatch.h:2045
 >> #4  0x000003fffd500384 in virNetServerProgramDispatchCall
 >> (msg=0x2ab035dd800, client=0x2ab035df5d0, server=0x2ab035ca370,
 >>      prog=0x2ab035cf210) at rpc/virnetserverprogram.c:439
 >> #5  virNetServerProgramDispatch (prog=0x2ab035cf210,
 >> server=0x2ab035ca370, client=0x2ab035df5d0, msg=0x2ab035dd800)
 >>      at rpc/virnetserverprogram.c:305
 >> #6  0x000003fffd4fad3c in virNetServerProcessMsg (msg=<optimized out>,
 >> prog=<optimized out>, client=<optimized out>,
 >>      srv=0x2ab035ca370) at rpc/virnetserver.c:162
 >> #7  virNetServerHandleJob (jobOpaque=<optimized out>,
 >> opaque=0x2ab035ca370) at rpc/virnetserver.c:183
 >> #8  0x000003fffd42a91c in virThreadPoolWorker
 >> (opaque=opaque@entry=0x2ab035a9e60) at util/virthreadpool.c:144
 >> #9  0x000003fffd42a236 in virThreadHelper (data=<optimized out>) at
 >> util/virthreadpthread.c:161
 >> #10 0x000003fffcdee412 in start_thread () from /lib64/libpthread.so.0
 >> #11 0x000003fffcd30056 in thread_start () from /lib64/libc.so.6
 >>
 >> (gdb) l
 >> 10161        if (!(vm = qemuDomObjFromDomain(dom)))
 >> 10162            goto cleanup;
 >> 10163
 >> 10164        priv = vm->privateData;
 >> 10165
 >> 10166        if (virDomainObjIsActive(vm)) {
 >> 10167            if (priv->job.asyncJob && !priv->job.dump_memory_only) {
 >> 10168                memcpy(info, &priv->job.info, sizeof(*info));
 >> 10169
 >> 10170                /* Refresh elapsed time again just to ensure it
 >>
 >>
 >> (gdb) print *vm
 >> $1 = {parent = {parent = {magic = 3735928559, refs = 0, klass =
 >> 0xdeadbeef}, lock = {lock = {__data = {__lock = 0,
 >>            __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins
 >> = 0, __list = {__prev = 0x0, __next = 0x0}},
 >>          __size = '\000' <repeats 39 times>, __align = 0}}}, pid = 0,
 >> state = {state = 0, reason = 0}, autostart = 0,
 >>    persistent = 0, updated = 0, def = 0x0, newDef = 0x0, snapshots =
 >> 0x0, current_snapshot = 0x0, hasManagedSave = false,
 >>    privateData = 0x0, privateDataFreeFunc = 0x0, taint = 0}
 >>
 >> I am currently blocked with other work but if anyone has a theory that
 >> I should verify let me know...
 >>
 >
 > Aiee, perhaps a race between a thread freeing a domain object (and the
 > private data) and another thread that happened to acquire the domain
 > object pointer before it was freed? Let me verify if that is possible.

Ufff. The domain objects in the qemu driver don't use reference counting
to track their lifecycle. Thus it's (theoretically) possible for one
thread to acquire the lock of a domain object while another thread
happens to free that same object.
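
To illustrate the idea, here is a minimal sketch of how a reference taken
during lookup would keep the object alive until the API call is done with
it. This is not libvirt code and not necessarily the fix I'll end up with:
demo_obj, demo_list and all the functions below are hypothetical stand-ins
(a single slot instead of the hash table), and the two "threads" are run
one after the other so the outcome is deterministic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a domain object (NOT libvirt's virDomainObj). */
typedef struct {
    pthread_mutex_t lock;  /* protects the payload below */
    atomic_int refs;       /* number of owners; the object is freed at zero */
    int state;             /* payload */
} demo_obj;

/* Hypothetical stand-in for the domain list: one slot instead of a hash. */
typedef struct {
    pthread_mutex_t lock;
    demo_obj *slot;
} demo_list;

static demo_obj *demo_obj_new(int state)
{
    demo_obj *obj = calloc(1, sizeof(*obj));
    if (!obj)
        return NULL;
    pthread_mutex_init(&obj->lock, NULL);
    atomic_init(&obj->refs, 1);  /* the creator owns the first reference */
    obj->state = state;
    return obj;
}

/* Drop one reference; returns true if this call freed the object. */
static bool demo_obj_unref(demo_obj *obj)
{
    if (atomic_fetch_sub(&obj->refs, 1) != 1)
        return false;
    pthread_mutex_destroy(&obj->lock);
    free(obj);
    return true;
}

static void demo_list_init(demo_list *list)
{
    pthread_mutex_init(&list->lock, NULL);
    list->slot = NULL;
}

/* The list takes over the caller's reference. */
static void demo_list_add(demo_list *list, demo_obj *obj)
{
    pthread_mutex_lock(&list->lock);
    list->slot = obj;
    pthread_mutex_unlock(&list->lock);
}

/* Look up the object and return it referenced and locked, or NULL. */
static demo_obj *demo_list_find(demo_list *list)
{
    pthread_mutex_lock(&list->lock);
    demo_obj *obj = list->slot;
    if (obj)
        atomic_fetch_add(&obj->refs, 1);  /* taken while the list lock pins obj */
    pthread_mutex_unlock(&list->lock);
    if (obj)
        pthread_mutex_lock(&obj->lock);
    return obj;
}

/* Unlink the object and drop the list's reference; true if that freed it. */
static bool demo_list_remove(demo_list *list)
{
    pthread_mutex_lock(&list->lock);
    demo_obj *obj = list->slot;
    list->slot = NULL;
    pthread_mutex_unlock(&list->lock);
    return obj ? demo_obj_unref(obj) : false;
}

/* Finish an API call: unlock, then drop the reference from demo_list_find. */
static bool demo_obj_end_api(demo_obj *obj)
{
    pthread_mutex_unlock(&obj->lock);
    return demo_obj_unref(obj);
}

/* The lookup wins the race, then the removal runs before the lookup is done. */
static int demo_scenario(void)
{
    demo_list list;
    demo_list_init(&list);
    demo_list_add(&list, demo_obj_new(42));

    demo_obj *vm = demo_list_find(&list);        /* the "dominfo" side */
    if (!vm)
        return -1;
    bool freed_early = demo_list_remove(&list);  /* the "undefine" side */
    int state = vm->state;                       /* our reference keeps vm alive */
    bool freed_late = demo_obj_end_api(vm);      /* last reference: freed here */

    pthread_mutex_destroy(&list.lock);
    return (!freed_early && freed_late) ? state : -1;
}
```

The important bit is that the reference is taken while the list lock is
still held, so the removal can only drop the list's own reference and the
memory goes away when the last user calls demo_obj_end_api().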

I have a reproducer for this issue:

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index f50a964..90896cb 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -2222,6 +2222,8 @@ void virDomainObjListRemove(virDomainObjListPtr doms,
      virUUIDFormat(dom->def->uuid, uuidstr);
      virObjectUnlock(dom);

+    sleep(2);
+
      virObjectLock(doms);
      virHashRemoveEntry(doms->objs, uuidstr);
      virObjectUnlock(doms);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 997d7c3..f1aeab7 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -2300,6 +2300,8 @@ static int qemuDomainGetInfo(virDomainPtr dom,
      if (!(vm = qemuDomObjFromDomain(dom)))
          goto cleanup;

+    sleep(5);
+
      info->state = virDomainObjGetState(vm, NULL);

      if (!virDomainObjIsActive(vm)) {


With that patch applied, a bash one-liner triggers the issue:

virsh undefine domain & sleep .1; virsh dominfo domain

The daemon crashes afterwards. I'll try to come up with a fix.

Peter




