[libvirt] cont command failing via JSON monitor on restore

Jim Fehlig jfehlig at novell.com
Tue Jan 18 18:15:07 UTC 2011


Laine Stump wrote:
> On 01/14/2011 04:26 PM, Jim Fehlig wrote:
>> Laine Stump wrote:
>>> On 01/13/2011 04:29 PM, Jim Fehlig wrote:
>>>> Daniel P. Berrange wrote:
>>>>>>   Yep, it wasn't really intended as a fix. It was intended to make
>>>>>>   the error scenario clearly detectable, which has succeeded as per
>>>>>>   Jim's report. The fact that QMP returned an error in this way,
>>>>>>   means we can now reliably detect, usleep(1000), and then retry
>>>>>>   the 'cont' again at which point we should succeed.
>>>>>>
>>>> Something like the attached patch?  I'm not quite sure about the retry
>>>> bounds (currently 1 second).  In my testing, it always succeeds on the
>>>> second attempt - even with large memory vms.
>>> In my tests, 250ms was more than enough, so I'm guessing it's okay,
>>> although there's probably no hurt in making it a bit larger - it's not
>>> like this is something that repeats all day every day :-)
>>>
>>> Would it be possible for you to add the same thing into the text
>>> monitor version of the code, so both fixes would travel together as
>>> the patch gets cherry-picked around?
>> I can, but have not seen this issue with the text monitor.  And the
>> error reporting is not so fine grained correct?
>
> Truthfully I haven't spent any time with the JSON code, and only
> enough with the text monitor to (locally) put in a rough hack for the
> same problem - I just delayed 250ms unconditionally.
>
> Maybe it is safest for your patch to just have what you're able to
> test for.

Agreed.  What error do you get back from the text monitor when cont cmd
fails?  I'd be interested in the following output from
$root/src/qemu/qemu_monitor_text.c:qemuMonitorCommandWithHandler()

    VIR_DEBUG("Receive command reply ret=%d errno=%d %d bytes '%s'",
              ret, msg.lastErrno, msg.rxLength, msg.rxBuffer);

>
> Of course my system isn't setup to test exactly this fix either - the
> machine that displays the problem when resuming is still running
> qemu-kvm-0.12.5.
>
>
>>>> - if (ret == 0)
>>>> -        ret = qemuMonitorJSONCheckError(cmd, reply);
>>>> +        if (ret != 0)
>>>> +            break;
>>>> +
>>>> +        /* If no error, we're done */
>>>> +        if ((ret = qemuMonitorJSONCheckError(cmd, reply)) == 0)
>>>> +            break;
>>>> +
>>>> +        /* If error class is not MigrationExpected, we're done.
>>>> +         * Otherwise try 'cont' cmd again */
>>>> +        if (!qemuMonitorJSONHasError(reply, "MigrationExpected"))
>>>> +            break;
>>>> +
>>>> +        virJSONValueFree(reply);
>>>> +    } while ((++i<= timeout)&&   (usleep(250000)<=0));
>>>>
>>>>        virJSONValueFree(cmd);
>>>>        virJSONValueFree(reply);
>>> Doesn't this end up doing a double-free of reply if it times out?
>>> virJSONValueFree doesn't update the pointer that's free'd like
>>> VIR_FREE does (it can't, since it's a function call rather than a
>>> macro).
>> virJSONValueFree() calls VIR_FREE() on the value passed to it, so reply
>> should be set to NULL when virJSONValueFree() returns.
>
>
> Yes, but it's pass by value, not reference, so VIR_FREE() is NULLing
> the copy of the pointer that was passed as an argument, not the
> original pointer itself.

Err, right.  I'll fix this in v2 - but need to understand the text
monitor error first.

Thanks,
Jim




More information about the libvir-list mailing list