[Libguestfs] [PATCH libnbd v2 3/4] api: Implement concurrent writer.

Wed Jul 3 13:41:56 UTC 2019

On 6/4/19 6:16 AM, Richard W.M. Jones wrote:
> There are several races / deadlocks which I've thought about.  Let's
> see if I can remember them all ...

More thoughts on this, now that we've solved some other issues in the
meantime.

> 
> (1) This I experienced: nbd_aio_get_fd deadlocks if there are
> concurrent synchronous APIs going on.  A typical case is where you set
> up the concurrent writer thread before connecting, and then call a
> synchronous connect function such as connect_tcp.  The synchronous
> function grabs h->lock, then writes something, which eventually
> invokes the writer thread which calls nbd_aio_get_fd and deadlocks on
> h->lock.
> 
> -> Probably the writer thread should be forbidden from using
> nbd_handle.

Still true (although we may decide to allow calling nbd_ functions that
don't require the lock, with proper documentation of which functions
that includes).  Is it reasonable to make aio_get_fd lockless?

> 
> (2) The writer thread calls nbd_aio_get_fd, but the fd returned might
> might be closed before we use it, resulting in either EBADF or worse
> using another fd that happens to be opened around the same time.
> 
> -> I think the solution to this would be to allow the writer callback
> to signal that the socket is about to be closed (eg. add an extra flag
> parameter to the callback), which would kill the writer thread.

The extra flag is still a good idea. Also, would it help to dup() the
fd, so that even if the reader thread ends up closing the fd, the writer
thread still has the dup'd copy open?  But it also emphasizes the need
for an nbd_aio_notify_error - if the reader thread encounters an error,
we need to inform the writer thread to quit polling; conversely, if the
writer thread encounters an error, it needs to affect the state machine
so that the reader thread will quit polling.

> 
> (3) nbd_concurrent_writer_error could lose errors.  This might happen
> if the socket is closed normally without writing anything, which would
> never check h->writer_error.

We may need some more states in the state machine to perform graceful
shutdown. As it is, I've already added h->disconnect_request; that may
figure into deciding whether the server going away was early or as a
result of our request.

> 
> (4) nbd_concurrent_writer_error possibly deadlocks too since it needs
> to grab h->lock.  Basically the same as (1).

Unless we make it lockless, to merely set a flag that later attempts to
call nbd_aio_notify_read or nbd_aio_FOO then notice early enough in
their advancement of the state machine.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20190703/7f8f74bf/attachment.sig>