[Libguestfs] Performance regression in -o rhv-upload questions

Richard W.M. Jones rjones at redhat.com
Tue Feb 8 15:14:18 UTC 2022


On Tue, Feb 08, 2022 at 04:44:33PM +0200, Nir Soffer wrote:
> This happens because virt-v2v try to finalize the transfer *before* closing the
> connections to imageio server.
> 
> Current imageio release mark the a ticket as canceled, but will not remove it
> if the ticket has open connections from clients. If the clients are
> idle, the connection
> closes after 60 seconds, so when engine tries again to remove the ticket, the
> operation succeeds.

I posted this patch:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00113.html

It's possibly not quite right.  We now just kill nbdkit and
immediately go on to finalization, so there might still be a race
(since killing nbdkit only starts the process of nbdkit shutting down,
closing the Python plugin etc).  I don't know if this is a concern or
not, but the patch does avoid the 60s timeout in my limited testing.

> Upstream version improved this flow to wait only for ongoing
> requests. If there are no ongoing requests the ticket is removed
> immediately, ignoring the open connections. The connections will be
> closed when the client closes the connection on when reading from
> the socket times out.

"Ongoing" here means that an HTTP request is being processed?

> You can test upstream version from here:
> https://github.com/oVirt/ovirt-imageio/actions/runs/1811058094
> 
> But if you want to be compatible with released imageio version, you
> must close the connections to imageio *before* finalizing the transfers.
> 
> We already discussed this issue last year, when we handled the bug
> when vddk blocks for many minutes during block status and imageio
> connection is droped, and you opened a bug for this.
> 
> > It's not a completely like-for-like comparison because the rhv-upload
> > backend changed a lot between these versions.  In particular if I was
> > going to pick some suspicious change it would be my refactoring here
> > which was supposed to be neutral but maybe isn't:
> >
> >   https://github.com/libguestfs/virt-v2v/commit/143a22860216b94d3a81706193088d50c03fc35c
> >
> > Unfortunately this commit is almost impossible to revert because of
> > deep restructuring of surrounding code.
> 
> The refactoring is not the issue, the issue is that we do not terminate
> the output nbdkit before finalizing.
>
> For some output we cannot do this since we want to query nbdkit for
> more info (e.g block status) but for rhv upload we don't have anything
> to do with nbkdit instance and we must close the connection before
> we finalize, so we should close the output right after we finish the copy.

It's not really desirable to kill nbdkit like I did in my patch.  We
do really want to open up virt-v2v to allow other processes to query
those sockets during the transfer.

An alternative to the patch I posted might be to have a way to request
that nbdkit disconnects its HTTP pool.

I will also note that HTTP is supposed to be stateless.  An open TCP
connection should not have meaning.

> > Another idea:
> >
> > Old virt-v2v uses qemu-img convert which does not flush by default.
> > New virt-v2v uses nbdcopy with the --flush option, so it will call
> > imageio PATCH /path ... "op":"flush" at the end.  However removing
> > nbdcopy --flush didn't help very much (only a couple of seconds off).
> 
> Calling flush is right, we cannot remove it. Although there is bug in nbdkit,
> and if flushes during close() flow even if you did not send a flush.
> 
> > Do you have any other ideas?
> >
> > What exactly does imageio do during the finalizing_success state?
> 
> The flow is this:
> - engine send request to vdsm to delete the ticket
> - vdsm connects to imageio control socket and send DELETE /tickets/{ticket-id}
> - imageio mark the ticket as canceled, so no new request can succeed
>   and no new connection to attach to the ticket
> - imageio mark all ongoing requests as canceled, so if the request is in
>   read/write loop, it will abort on the the next read/write complete
> - upstream: imageio waits until all ongoing operations complete
> - release: imageio waits until all connections are closed

This may answer my HTTP stateless point above.

> - if the ticket could not be removed in 60 seconds, imageio returns 409 Conflict
> - imageio returns 200 OK
> - engine retries the delete if it failed
> - when engine succeeds and finish other finalization tasks, the image
> transfer will
>   be reported as finished.
> 
> I would expect the flow to time out on virt-v2v side, since we use 60 seconds
> timeout, and when removing a ticket times out, it takes 60 seconds, so you
> should not succeed to finalize in 61 seconds.  There is a bug in ovirt
> engine that cause this flow to succeed.

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




More information about the Libguestfs mailing list