[libvirt] cont command failing via JSON monitor on restore

Jim Fehlig jfehlig at novell.com
Thu Jan 13 21:29:51 UTC 2011


Daniel P. Berrange wrote:
> On Wed, Jan 12, 2011 at 11:29:27PM -0500, Laine Stump wrote:
>   
>> On 01/12/2011 05:13 PM, Jim Fehlig wrote:
>>     
>>> libvirt 0.8.7
>>> qemu 0.13
>>>
>>> I'm looking into a problem with qemu save/restore via JSON monitor.  On
>>> restore, the vm is left in a paused state with following error returned
>>> for 'cont' command
>>>
>>> An incoming migration is expected before this command can be executed
>>>
>>> I was trying to debug the issue in gdb, but stepping through the code
>>> introduces enough delay between qemudStartVMDaemon() and doStartCPUs()
>>> that the latter succeeds.  Any suggestions on how to determine when it
>>> is safe to call doStartCPUs() to prevent the above error?  I don't see
>>> this issue with the text monitor btw.
>>>       
>> I'm pretty sure this is related to a bug I reported on qemu-devel
>> last April:
>>
>>    http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg00635.html
>>
>> (be sure to read my own followup if you want a correct description
>> of the circumstances). In this case libvirt was using the text
>> monitor, and there was a race condition between qemudStartVMDaemon
>> (which executes qemu with '-S -incoming') and doStartCPUs() (which
>> issues a 'cont' command to the qemu monitor). The result would be
>> that sometimes the 'cont' would be received and processed by qemu
>> before the incoming migration had started, meaning that qemu would
>> be executing garbage memory instead of the saved/restored image of
>> the guest.
>>
>> The solution to this was posted to upstream qemu in July:
>>
>>   http://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01574.html
>>
>> and I believe is in qemu 0.13. That patch adds a check to the 'cont'
>> command so that if '-incoming' was specified on the commandline,
>> 'cont' will only execute after a migration has successfully
>> completed, but will otherwise return an error.
>>
>> Actually, thinking about this "fix", it seems that it isn't really a
>> solution, because instead of the guest starting up in an
>> indeterminate state, doStartCPUs() will just fail (as you've seen)
>> making the entire guest startup fail.
>>     
>
> Yep, it wasn't really intended as a fix. It was intended to make
> the error scenario clearly detectable, which has succeeded as per
> Jim's report. The fact that QMP returned an error in this way,
> means we can now reliably detect, usleep(1000), and then retry
> the 'cont' again at which point we should succeed.
>   

Something like the attached patch?  I'm not quite sure about the retry
bounds (currently 1 second).  In my testing, it always succeeds on the
second attempt - even with large memory vms.


Regards,
Jim

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-qemu-Retry-JSON-monitor-cont-cmd-on-MigrationExpecte.patch
Type: text/x-diff
Size: 1884 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20110113/65d1bb13/attachment-0001.bin>


More information about the libvir-list mailing list