[libvirt] [PATCH v2] Close the source fd if the destination qemu exits during tunnelled migration
Shivaprasad bhat
shivaprasadbhat at gmail.com
Tue Sep 22 11:21:19 UTC 2015
On Mon, Sep 21, 2015 at 8:04 PM, John Ferlan <jferlan at redhat.com> wrote:
>
>
> On 09/21/2015 05:09 AM, Shivaprasad bhat wrote:
>> Thanks John for the comments.
>>
>>
>> On Fri, Sep 18, 2015 at 10:34 PM, John Ferlan <jferlan at redhat.com> wrote:
>>>
>>>
>>> On 09/14/2015 10:44 AM, Shivaprasad G Bhat wrote:
>>>> Tunnelled migration can hang if the destination qemu exits despite all the
>>>> ABI checks. This happens whenever the destination qemu exits before the
>>>> complete transfer is noticed by source qemu. The savevm state checks at
>>>> runtime can fail at destination and cause qemu to error out.
>>>> The source qemu cant notice it as the EPIPE is not propogated to it.
>>>> The qemuMigrationIOFunc() notices the stream being broken from virStreamSend()
>>>> and it cleans up the stream alone. The qemuMigrationWaitForCompletion() would
>>>> never get to 100% transfer completion.
>>>> The qemuMigrationWaitForCompletion() never breaks out as well since
>>>> the ssh connection to destination is healthy, and the source qemu also thinks
>>>> the migration is ongoing as the Fd to which it transfers, is never
>>>> closed or broken. So, the migration will hang forever. Even Ctrl-C on the
>>>> virsh migrate wouldn't be honoured. Close the source side FD when there is
>>>> an error in the stream. That way, the source qemu updates itself and
>>>> qemuMigrationWaitForCompletion() notices the failure.
>>>>
>>>> Close the FD for all kinds of errors to be sure. The error message is not
>>>> copied for EPIPE so that the destination error is copied instead later.
>>>>
>>>> Note:
>>>> Reproducible with repeated migrations between Power hosts running in different
>>>> subcores-per-core modes.
>>>>
>>>> Changes from v1 -> v2:
>>>> VIR_FORCE_CLOSE() was called twice for this use case which would log
>>>> unneccessary warnings. So, move the fd close to qemuMigrationIOFunc
>>>> so that there are no unnecessary duplicate attempts.(Would this trigger
>>>> a Coverity error? I don't have a setup to check.)
>>>>
>>>> Signed-off-by: Shivaprasad G Bhat <sbhat at linux.vnet.ibm.com>
>>>> ---
>>>> src/qemu/qemu_migration.c | 8 ++++++--
>>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
>>>> index ff89ab5..9602fb2 100644
>>>> --- a/src/qemu/qemu_migration.c
>>>> +++ b/src/qemu/qemu_migration.c
>>>> @@ -4012,6 +4012,7 @@ static void qemuMigrationIOFunc(void *arg)
>>>> if (virStreamFinish(data->st) < 0)
>>>> goto error;
>>>>
>>>> + VIR_FORCE_CLOSE(data->sock);
>>>> VIR_FREE(buffer);
>>>>
>>>> return;
>>>> @@ -4029,7 +4030,11 @@ static void qemuMigrationIOFunc(void *arg)
>>>> }
>>>>
>>>> error:
>>>> - virCopyLastError(&data->err);
>>>> + /* Let the source qemu know that the transfer cant continue anymore.
>>>> + * Don't copy the error for EPIPE as destination has the actual error. */
>>>> + VIR_FORCE_CLOSE(data->sock);
>>>> + if (!virLastErrorIsSystemErrno(EPIPE))
>>>> + virCopyLastError(&data->err);
>>>> virResetLastError();
>>>> VIR_FREE(buffer);
>>>> }
>>>> @@ -4397,7 +4402,6 @@ qemuMigrationRun(virQEMUDriverPtr driver,
>>>> if (iothread && qemuMigrationStopTunnel(iothread, ret < 0) < 0)
>>>> ret = -1;
>>>> }
>>>> - VIR_FORCE_CLOSE(fd);
>>>
>>> ^^^
>>>
>>> This causes Coverity to claim a RESOURCE_LEAK
>>>
>>> Feels like this was a mistake edit...
>>>
>>
>> The VIR_FORCE_CLOSE() inside qemuMigrationIOFunc() alone is sufficient.
>> Having this again here would lead to Warning in the logs. I too thought coverity
>> would complain. Is there a way to force ignore a coverity warning?
>>
>
> Typically a marker of sorts, such as
>
> /* coverity[leaked_handle] <some extra comment explaining why> */
>
> Although I'm not sure that's the best way to handle this either.
>
> The problem I see though is this is an "error path" issue and when
> perhaps it's safe/required to close fd/io->sock/data->sock.
>
> Your commit comment notes that the issue is seen on a fairly specific
> event (virStreamSend failure) for a specific type of migration.
I believe the failure can be seen for all types of migration with savestate
mismatch.
> As I
> read the code, that failure jumps to error (as does virStreamFinish). So
> now you have a fairly specific set of instances which perhaps would
> cause qemuMigrationWaitForCompletion to need to fail. The fix you have
> proposed is to close the data->sock (io->sock, fd). However, your
> proposal is a larger hammer. I assume closure of data->sock causes
> WaitForCompletion to fail (perhaps differently) and that's why you chose it.
>
> Going back to the beginning of qemuMigrationRun, setting the 'fd' and
> using StartTunnel seems to rely on MIGRATION_FWD_DIRECT, but if a tunnel
> is created an (ugh) 'iothread' variable is NON-NULL. What if 'iothread'
> were passed to qemuMigrationWaitForCompletion? Then again passed to
> qemuMigrationCompleted which would check if iothread non-NULL and for
> some new flag that could be created in _qemuMigrationIOThread, being
> true. If it were true, then that would cause failure so that
> WaitForCompletion would return error status. The only way the flag is
> set to true is in qemuMigrationIOFunc when the code jumps to error.
> (e.g. add bool stream_abort and data->stream_abort = true in "error:",
> then check iothread && iothread->stream_abort, force error).
>
I tried this. Until the fd is closed, the 'info migrate' wouldn't
return from the
qemu. So, the migration hangs. Checking the stream status before/after
fetching the job status would still leave a race. So, having it in a different
thread (qemuMigrationIOFunc) here seems inevitable to me.
#1 0x00003fff83ff593c in virCondWait (c=<optimized out>, m=<optimized out>)
at util/virthread.c:154
#2 0x00003fff76235544 in qemuMonitorSend (mon=0x3fff54003970,
msg=<optimized out>) at qemu/qemu_monitor.c:1035
#3 0x00003fff7624fb30 in qemuMonitorJSONCommandWithFd (mon=0x3fff54003970,
cmd=0x3fff5c001420, scm_fd=-1, reply=0x3fff79fbd388)
at qemu/qemu_monitor_json.c:293
#4 0x00003fff76254b90 in qemuMonitorJSONCommand (reply=0x3fff79fbd388,
cmd=0x3fff5c001420, mon=0x3fff54003970) at qemu/qemu_monitor_json.c:323
#5 qemuMonitorJSONGetMigrationStatus (mon=0x3fff54003970,
status=0x3fff79fbd538) at qemu/qemu_monitor_json.c:2620
#6 0x00003fff7623a664 in qemuMonitorGetMigrationStatus (mon=<optimized out>,
status=<optimized out>) at qemu/qemu_monitor.c:2134
#7 0x00003fff76228e6c in qemuMigrationFetchJobStatus (driver=0x3fff6c118d80,
vm=0x3fff6c27faf0, asyncJob=<optimized out>, jobInfo=0x3fff79fbd4f0)
at qemu/qemu_migration.c:2528
#8 0x00003fff76228fb4 in qemuMigrationUpdateJobStatus (driver=0x3fff6c118d80,
vm=0x3fff6c27faf0, asyncJob=QEMU_ASYNC_JOB_MIGRATION_OUT)
at qemu/qemu_migration.c:2565
#9 0x00003fff76229200 in qemuMigrationCheckJobStatus (driver=0x3fff6c118d80,
vm=0x3fff6c27faf0, asyncJob=QEMU_ASYNC_JOB_MIGRATION_OUT)
---Type <return> to continue, or q <return> to quit---cont
at qemu/qemu_migration.c:2585
#10 0x00003fff76229408 in qemuMigrationCompleted (iothread=<optimized out>,
> The only question then in my mind then is would this also be something
> that should be done for the virStreamFinish failure path which also
> jumps to error?
>
Yes, I too am not sure if its appropriate for virStreamFinish failure
case. But, given
this an error path, I feel we can safely go ahead and close the
data->sock.
> Doing this means you shouldn't need the VIR_FILE_CLOSE(data->sock) for
> either the normal or error path. Does this seem to be a reasonable
> approach and solve the issue you're facing?
>
> John
>> Thanks and Regards,
>> Shivaprasad
>>
>>> The rest of the patch looks reasonable; however, I'm far from the expert
>>> in these matters.
>>>
>>> John
>>>>
>>>> if (priv->job.completed) {
>>>> qemuDomainJobInfoUpdateTime(priv->job.completed);
>>>>
>>>> --
>>>> libvir-list mailing list
>>>> libvir-list at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/libvir-list
>>>>
More information about the libvir-list
mailing list