[libvirt] qemu-kvm spending time in do_info_migrate() during virDomainSave(); 50ms polling too fast?

Mon May 3 17:17:44 UTC 2010

...or, more precisely, in ram_bytes_remaining() as called by do_info_migrate().
Oddly, this is much more pronounced when <emulator> points at a shim that
prepends -no-kvm-irqchip to the invoked command line.
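
A shim of this kind can be sketched in a few lines. This is only an
illustration, not the exact script in use: the path in REAL_EMULATOR and the
helper names build_argv() and main() are assumptions made for the example.

```python
import os
import sys

# Assumed location of the real emulator binary; adjust for the local install.
REAL_EMULATOR = "/usr/bin/qemu-kvm"
# Flags to prepend to whatever libvirt passes on the command line.
EXTRA_ARGS = ["-no-kvm-irqchip"]


def build_argv(argv, real=REAL_EMULATOR, extra=EXTRA_ARGS):
    """Return the argument vector for the real emulator.

    argv is the shim's own sys.argv. argv[0] (the shim path) is replaced by
    the real binary, and the extra flags go in front of libvirt's arguments.
    """
    return [real] + list(extra) + list(argv[1:])


def main():
    new_argv = build_argv(sys.argv)
    # Replace the shim process with the real emulator, so the PID that
    # libvirt tracks ends up being the emulator's own.
    os.execv(new_argv[0], new_argv)
```

The script would end with a call to main(), and the <emulator> element in the
domain XML would name the shim instead of the real binary.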

This VM was meant to be paused for the save (assuming my software was doing its
job correctly), so we shouldn't be spending time running the guest CPU and
rewriting blocks that have already been written once.
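
One way to rule out the "guest was not actually paused" case is to check the
domain state immediately before the save. The sketch below is hypothetical
(save_paused() is not part of libvirt); it assumes an object that behaves like
a libvirt.virDomain from the Python bindings, using only its info(), suspend()
and save() methods.

```python
# Numeric value of libvirt's VIR_DOMAIN_PAUSED state, copied here so the
# sketch does not need the libvirt module to be importable.
VIR_DOMAIN_PAUSED = 3


def save_paused(dom, path):
    """Suspend dom if it is not already paused, then save it to path.

    dom is expected to behave like a libvirt.virDomain object.
    Returns True if the domain had to be suspended here, which would mean
    the calling software had not paused it as intended.
    """
    state = dom.info()[0]  # first element of info() is the domain state
    suspended_here = state != VIR_DOMAIN_PAUSED
    if suspended_here:
        dom.suspend()
    dom.save(path)
    return suspended_here
```

With the real bindings, dom would come from something like
libvirt.open("qemu:///system").lookupByName(name); a True return value would
point at the management layer rather than at qemu-kvm.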

I'm seeing much more CPU time spent inside qemu-kvm than in the exec'd lzop
process that compresses and writes the data stream. Attaching gdb and taking a
few stack traces to sample where the time was going suggested that qemu-kvm was
mostly busy responding to requests from the monitor.
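
The stack-trace sampling described above can be automated roughly as follows.
This is a sketch rather than the exact procedure used: it assumes gdb is on the
PATH and that backtrace frames have the usual "#N  [address in] function (...)"
layout. The helper names are made up for the example.

```python
import collections
import re
import subprocess

# Matches the start of a gdb backtrace frame, with or without an address:
#   "#0  0x... in ram_bytes_remaining () at ..."
#   "#1  do_info_migrate (mon=...) at ..."
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(")


def sample_backtrace(pid):
    """Attach gdb to pid once, in batch mode, and return the backtrace text."""
    cmd = ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def innermost_functions(backtrace_text):
    """Return the function name of every frame #0 found in the text."""
    names = []
    for line in backtrace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and m.group(1) == "0":
            names.append(m.group(2))
    return names


def tally(samples):
    """Count how often each function was innermost across all samples."""
    counts = collections.Counter()
    for text in samples:
        counts.update(innermost_functions(text))
    return counts
```

Calling sample_backtrace() a few dozen times, a fraction of a second apart, and
passing the results to tally() gives a crude profile. Every thread contributes
its own frame #0 to each sample, so idle threads show up in the counts too.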

The question then -- is the 50ms poll in qemuDomainWaitForMigrationComplete 
(called from qemudDomainSave) perhaps too frequent?
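
To put a number on it: a fixed 50ms sleep allows up to 20 "info migrate"
queries per second. Purely as an illustration of an alternative, and not as a
description of what libvirt does, here is a polling schedule that backs off.
The function names and the 2x/1000ms parameters are arbitrary choices for the
sketch; all times are in integer milliseconds.

```python
def poll_intervals(initial=50, factor=2, maximum=1000):
    """Yield sleep intervals in ms: start at initial, multiply by factor
    after each poll, and never exceed maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def fixed(delay):
    """Yield the same interval forever (a fixed-sleep polling loop)."""
    while True:
        yield delay


def polls_within(duration, intervals):
    """Count how many polls fit into duration ms of sleeping."""
    elapsed = 0
    count = 0
    for delay in intervals:
        if elapsed + delay > duration:
            break
        elapsed += delay
        count += 1
    return count
```

Over a 10-second save, polls_within(10000, fixed(50)) gives 200 polls, while
polls_within(10000, poll_intervals()) gives 13. The cost is that completion can
be noticed up to one second late once the interval has reached its cap.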

Thanks!

---

Below is an exchange from IRC:

<nDuff> How often should libvirt be calling "info migrate" during a 
virDomainSave (of a qemu domain)?
* nDuff is seeing his qemu-kvm spending the bulk of its time inside 
ram_bytes_remaining() under do_info_migrate().
<DV> nDuff: I doubt libvirt is doing this on its own, something else is asking 
for the information I would assume
<nDuff> DV, I'm not running virt-manager or such; the only management layer on 
top is locally developed, and it only has a single thread that's blocked waiting 
for the dom.save() call [this is using the Python bindings] to complete.
<DV> interesting
<DV> nDuff -> send this to the list, someone needs to look at it, at least raise 
the problem, maybe we didn't expect that to be so costly
<DV> nDuff: maybe open a bugzilla
<nDuff> it might be that it's not _usually_ so costly except that I'm hitting a 
qemu/kvm bug; it only started expressing itself when I added -no-kvm-irqchip to 
the commandline via a shim
<DV> nDuff: I could see how trying to extract this too often could stall the 
migration process
<nDuff> ...but yes, I'll post to the list.
<DV> nDuff: maybe it's related to the capability to force migration end when the 
full flush will be shorter than a user defined limit