[libvirt] cont command failing via JSON monitor on restore

Laine Stump laine at laine.org
Thu Jan 13 04:29:27 UTC 2011


On 01/12/2011 05:13 PM, Jim Fehlig wrote:
> libvirt 0.8.7
> qemu 0.13
>
> I'm looking into a problem with qemu save/restore via JSON monitor.  On
> restore, the vm is left in a paused state with following error returned
> for 'cont' command
>
> An incoming migration is expected before this command can be executed
>
> I was trying to debug the issue in gdb, but stepping through the code
> introduces enough delay between qemudStartVMDaemon() and doStartCPUs()
> that the latter succeeds.  Any suggestions on how to determine when it
> is safe to call doStartCPUs() to prevent the above error?  I don't see
> this issue with the text monitor btw.

I'm pretty sure this is related to a bug I reported on qemu-devel last 
April:

    http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg00635.html

(be sure to read my own followup if you want a correct description of 
the circumstances). In that case libvirt was using the text monitor, and 
there was a race condition between qemudStartVMDaemon (which executes 
qemu with '-S -incoming') and doStartCPUs() (which issues a 'cont' 
command to the qemu monitor). The result would be that sometimes the 
'cont' would be received and processed by qemu before the incoming 
migration had started, meaning that qemu would be executing garbage 
memory instead of the saved/restored image of the guest.

The solution to this was posted to upstream qemu in July:

   http://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01574.html

and I believe it is in qemu 0.13. That patch adds a check to the 'cont' 
command so that, if '-incoming' was specified on the command line, 'cont' 
will execute only after a migration has successfully completed, and will 
otherwise return an error.

Actually, thinking about this "fix", it seems that it isn't really a 
solution, because instead of the guest starting up in an indeterminate 
state, doStartCPUs() will just fail (as you've seen) making the entire 
guest startup fail.

You can almost surely make it work properly by putting a 250 msec 
delay between those two function calls in libvirt. It would be nice if 
it could be totally fixed in qemu, though, so that libvirt didn't need 
such a hack :-(
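
To illustrate the kind of workaround I mean, here is a rough sketch in 
Python (not libvirt's C, and not a proposed patch). It swaps the fixed 
delay for a bounded retry of 'cont', which is a variant of the same hack. 
FakeMonitor and start_cpus_with_retry are made-up names for illustration 
only; FakeMonitor just imitates the patched qemu behaviour described 
above and is not a real qemu or libvirt interface. The timeout and 
interval values are arbitrary assumptions.

```python
import time


class FakeMonitor:
    """Hypothetical stand-in for a qemu monitor started with '-S -incoming'.

    Not a real qemu or libvirt API; it only imitates the patched 'cont'
    behaviour: refuse to run until the incoming migration has completed.
    """

    def __init__(self, migration_done_at):
        # Monotonic timestamp at which the simulated migration finishes.
        self.migration_done_at = migration_done_at

    def cont(self):
        if time.monotonic() < self.migration_done_at:
            raise RuntimeError(
                "An incoming migration is expected before this command "
                "can be executed"
            )
        return "running"


def start_cpus_with_retry(monitor, timeout=2.0, interval=0.05):
    """Retry 'cont' until it succeeds or the timeout expires.

    Hypothetical helper: a bounded retry in place of a single fixed
    delay between starting qemu and issuing 'cont'.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return monitor.cont()
        except RuntimeError:
            if time.monotonic() >= deadline:
                raise  # give up; the caller sees the original error
            time.sleep(interval)


if __name__ == "__main__":
    # Simulated migration completes about 250 msec after "startup".
    monitor = FakeMonitor(time.monotonic() + 0.25)
    print(start_cpus_with_retry(monitor))
```

This is still a hack: it only hides the race from the caller, and a real 
version would have to distinguish this particular error from other 
'cont' failures rather than retrying on everything.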

(I had unfortunately lost track of the bug by the time the patch was 
posted - it had been there for so long I'd just gotten used to manually 
pausing/unpausing any guest I wanted to save on the one machine that 
displays the problem. Too bad I got so used to living with it, as I'd 
have otherwise been forced to try it out. This machine is running F13, 
which is still at qemu-0.12.5, which doesn't have the patch.)



