[libvirt] Tunneled migration failures

Gary Hook gary.hook at nimboxx.com
Thu Oct 16 23:22:12 UTC 2014


So here's the thing:

I have an environment running without tunneling. I can move a modestly
sized (3.2GB disk space, 2GB RAM) VM back and forth between two systems
all day long, using peer-peer migration, with these flags set:

VIR_MIGRATE_UNDEFINE_SOURCE
VIR_MIGRATE_PEER2PEER
VIR_MIGRATE_PERSIST_DEST
VIR_MIGRATE_NON_SHARED_INC

The migration is invoked through libvirt_domain_migrate_to_uri2() in
libvirt-php, via Apache. We are not using shared storage, but as stated, the
function invocations and XML appear to be correct insofar as non-tunneled
migrations are successful.

As soon as I enable tunneling by adding VIR_MIGRATE_TUNNELLED to the
above flags, the migration fails with this debug message:

qemuMigrationUpdateJobStatus:1788 : operation failed: migration job:
unexpectedly failed


Not very helpful. This is with no other changes to the invoking code.
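
To make the difference between the two cases concrete, here is a minimal
sketch of the two flag masks. The numeric values are what I understand the
virDomainMigrateFlags enum in libvirt's public headers to define; treat them
as an assumption and check them against your installed headers. The helper
name migrate_flags() is made up for illustration and is not part of
libvirt or libvirt-php.

```python
# Bit values as I understand them from libvirt's virDomainMigrateFlags enum.
# Assumption: verify against the headers of the libvirt version in use.
VIR_MIGRATE_PEER2PEER = 1 << 1
VIR_MIGRATE_TUNNELLED = 1 << 2
VIR_MIGRATE_PERSIST_DEST = 1 << 3
VIR_MIGRATE_UNDEFINE_SOURCE = 1 << 4
VIR_MIGRATE_NON_SHARED_INC = 1 << 7


def migrate_flags(tunnelled=False):
    """Hypothetical helper: build the flag mask described in this post."""
    flags = (VIR_MIGRATE_UNDEFINE_SOURCE
             | VIR_MIGRATE_PEER2PEER
             | VIR_MIGRATE_PERSIST_DEST
             | VIR_MIGRATE_NON_SHARED_INC)
    if tunnelled:
        # The only change between the working and the failing invocation.
        flags |= VIR_MIGRATE_TUNNELLED
    return flags


working = migrate_flags()
failing = migrate_flags(tunnelled=True)

# The XOR of the two masks isolates the single bit that differs.
print(hex(working), hex(failing), hex(working ^ failing))
```

The two masks differ only in the tunnelling bit, which is why I believe the
rest of the invocation is sound.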

I have been _trying_ to track down the source of the failure, and in so
doing I have increased the arbitrary timeout value in qemu_domain.c from
30 to 100 seconds. The copy of the disk is successful; nabbing a copy and
booting it in an alternate VM works just fine, so I have every reason to
believe that the disk is being migrated properly.

That said, what I need is to understand how to trace the monitoring thread
over on the destination, to understand where and why it is returning an
error to the source host. The debug output is silent on finding an error
and sending it back to the source system.  I've been instrumenting the
calls in src/qemu, but haven't located the right code.

A pointer to where I should be looking would be helpful. I can provide
much more detail from tracing (that I have had to add to the code) on the
source side, but the destination side log entries get to virStreamFinish()
and then the thread exits while the sender gets an error back, ostensibly
from another thread. The migration gets cancelled for no obvious reason,
from no obvious source.


Here's the kicker: this exact same scenario using a smaller VM (1.5GB
disk, 2GB RAM) succeeds.


The issue appears to be related to VM size and timeout values?

Again, if a developer could provide a pointer to how I might trace the
monitoring thread on the destination side, that'd be great.

Thanks in advance.

Gary




