[Libguestfs] [nbdkit PATCH v2] nbd: Avoid stuck poll() in nbdplug_close_handle()

Eric Blake eblake at redhat.com
Sat Mar 28 20:57:43 UTC 2020


We have been seeing sporadic hangs on test-nbd-tls-psk.sh and
test-nbd-tls.sh (on my machine, running those two in a loop with
commits 0a76cae4 and 09e34ba2 reverted would fail within 100
attempts), where even though the client to the 'nbdkit nbd' process
has cleanly exited, things are stalled in .close where nbd is trying
to pthread_join() the reader thread, while the reader thread is itself
blocked on a poll() that will never make additional progress.  In
hindsight, the problem is obvious: we used the wrong libnbd API
(synchronous instead of async), which often results in:

nbd .close             nbd reader            server
nbd_shutdown           poll
 - send NBD_CMD_DISC
                                             receive NBD_CMD_DISC
                                             shutdown(SHUT_WR)
                       wake up on EOF
                       nbd_aio_notify_read
                         - move state to CLOSED
                       nbd_aio_is_dead -> true
                       break loop
 - poll
pthread_join
close fds
nbd_close
                                             see EOF from client

but sometimes things happen differently:

nbd .close             nbd reader            server
nbd_shutdown           poll
 - send NBD_CMD_DISC
 - poll
                                             receive NBD_CMD_DISC
                                             shutdown(SHUT_WR)
 - wake up on EOF
 - move state to CLOSED
pthread_join
                       stuck

As can be seen, having two threads compete on poll() on the same fd is
unwise.  Worse, if the server uses a means that will cause us to see
EOF even though the socket is still open (such as shutdown(SHUT_WR) on
plaintext, or gnutls_bye() with TLS), but waits for us to close the
socket first, we end up with two processes deadlocked on each other.
(nbdkit as server has used gnutls_bye since commit bd9a52e0; qemu
however does not use it.)

Other commits such as 14ba4154, c70616f8, and 430f8141 show that
getting the shutdown sequence race-free has not been trivial.

Fixes: ab7760fc
Signed-off-by: Eric Blake <eblake at redhat.com>
---

 plugins/nbd/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c
index ba1e7188..0ea2a4a2 100644
--- a/plugins/nbd/nbd.c
+++ b/plugins/nbd/nbd.c
@@ -532,7 +532,7 @@ nbdplug_open (int readonly)
 static void
 nbdplug_close_handle (struct handle *h)
 {
-  if (nbd_shutdown (h->nbd, 0) == -1)
+  if (nbd_aio_disconnect (h->nbd, 0) == -1)
     nbdkit_debug ("failed to clean up handle: %s", nbd_get_error ());
   if ((errno = pthread_join (h->reader, NULL)))
     nbdkit_debug ("failed to join reader thread: %m");
-- 
2.26.0.rc2




More information about the Libguestfs mailing list