[libvirt] [Qemu-devel] migration: qemu-coroutine-lock.c:141: qemu_co_mutex_unlock: Assertion `mutex->locked == 1' failed

Wed Sep 17 15:17:20 UTC 2014

[adding libvirt list]

On 09/17/2014 09:04 AM, Stefan Hajnoczi wrote:
> On Wed, Sep 17, 2014 at 10:25 AM, Paolo Bonzini <pbonzini at redhat.com> wrote:
>> On 17/09/2014 11:06, Stefan Hajnoczi wrote:
>>> I think the fundamental problem here is that the mirror block job
>>> on the source host does not synchronize with live migration.
>>>
>>> Remember the mirror block job iterates on the dirty bitmap
>>> whenever it feels like.
>>>
>>> There is no guarantee that the mirror block job has quiesced before
>>> migration handover takes place, right?
>>
>> Libvirt does that.  Migration is started only once storage mirroring
>> is out of the bulk phase, and the handover looks like:
>>
>> 1) migration completes
>>
>> 2) because the source VM is stopped, the disk has quiesced on the source
> 
> But the mirror block job might still be writing out dirty blocks.
> 
>> 3) libvirt sends block-job-complete
> 
> No, it sends block-job-cancel after the source QEMU's migration has
> completed.  See the qemuMigrationCancelDriveMirror() call in
> src/qemu/qemu_migration.c:qemuMigrationRun().
> 
>> 4) libvirt receives BLOCK_JOB_COMPLETED.  The disk has now quiesced on
>> the destination as well.
> 
> I don't see where this happens in the libvirt source code.  Libvirt
> doesn't care about block job events for drive-mirror during migration.
> 
> And that's why there could still be I/O going on (since
> block-job-cancel is asynchronous).
> 
>> 5) the VM is started on the destination
>>
>> 6) the NBD server is stopped on the destination and the source VM is quit.
>>
>> It is actually a feature that storage migration is completed
>> asynchronously with respect to RAM migration.  The problem is that
>> qcow2_invalidate_cache happens between (3) and (5), and it doesn't
>> like the concurrent I/O received by the NBD server.
> 
> I agree that qcow2_invalidate_cache() (and any other invalidate cache
> implementations) need to allow concurrent I/O requests.
> 
> Either I'm misreading the libvirt code or libvirt is not actually
> ensuring that the block job on the source has cancelled/completed
> before the guest is resumed on the destination.  So I think there is
> still a bug, maybe Eric can verify this?

You may indeed be correct that libvirt is not waiting long enough for
the block job to be gone on the source before resuming on the
destination.  I didn't write that particular code, so I'm cc'ing the
libvirt list, but I can try to look into it, since it's related to code
I've recently touched while getting libvirt to support active layer
block commit.
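
To make the concern concrete, here is a minimal sketch of what "waiting
for the block job to be gone" could look like.  It is not libvirt's code.
FakeMonitor is a hypothetical stand-in for a QMP connection to the source
QEMU, and cancel_mirror_and_wait() and the device name are made up for
illustration.  Only the QMP command names block-job-cancel and
query-block-jobs are real.  The sketch assumes that polling
query-block-jobs until the job is no longer listed is an acceptable way
to detect that the mirror has stopped writing.

```python
import time


class FakeMonitor:
    """Hypothetical stand-in for a QMP connection to the source QEMU.

    block-job-cancel only requests cancellation; the job stays visible in
    query-block-jobs for a few more polls, imitating in-flight mirror I/O.
    """

    def __init__(self, device, polls_until_gone=3):
        self.jobs = {device: {"device": device, "type": "mirror"}}
        self.polls_left = {device: None}  # None: no cancel requested yet
        self.polls_until_gone = polls_until_gone

    def command(self, name, **args):
        if name == "block-job-cancel":
            # Returns immediately, before the job has actually stopped.
            if args["device"] in self.jobs:
                self.polls_left[args["device"]] = self.polls_until_gone
            return {}
        if name == "query-block-jobs":
            for dev in list(self.jobs):
                left = self.polls_left[dev]
                if left is not None:
                    if left == 0:
                        del self.jobs[dev]  # the job has finally gone away
                    else:
                        self.polls_left[dev] = left - 1
            return list(self.jobs.values())
        raise ValueError("unsupported command: " + name)


def cancel_mirror_and_wait(mon, device, timeout=5.0, interval=0.01):
    """Request cancellation, then poll until the job is no longer listed.

    Returns True once the job is gone, False if the timeout expires first.
    """
    mon.command("block-job-cancel", device=device)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        jobs = mon.command("query-block-jobs")
        if all(job["device"] != device for job in jobs):
            return True  # no mirror job left to write to the destination
        time.sleep(interval)
    return False
```

In this model the caller would resume the guest on the destination only
after cancel_mirror_and_wait() returns True.  Waiting for the
corresponding block job event instead of polling would serve the same
purpose.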

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
