[Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?

Eric Blake <eblake@redhat.com>
Thu Mar 19 16:09:29 UTC 2020


On 3/19/20 7:13 AM, Richard W.M. Jones wrote:
> [Dropping devel, adding libguestfs]
> 
> This can be reproduced on x86-64 so I can reproduce it locally.  It
> only appears to happen when the tests are run under rpmbuild, not when
> I run them as ‘make check’, but I'm unclear why this is.
> 
> As Eric described earlier, the test runs two copies of nbdkit and a
> client, connected like this:
> 
>    qemu-img info ===> nbdkit nbd ===> nbdkit example1
>       [3]                [2]             [1]
> 
> These are started in order [1], [2] then [3].  When the client
> (process [3]) completes it exits and then the test harness kills
> processes [1] and [2] in that order.
> 
> The stack trace of [2] at the hang is:
> 
> Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
> #0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
> #1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
> #2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
> #3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

This is nbdkit trying to poll() in order to know when to wake up the 
libnbd engine.
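
For anyone reading along without nbd.c in front of them, the reader
loop is roughly this shape (condensed by me, not a verbatim copy; the
nbd_aio_* calls are real libnbd API, "selfpipe_rd" is my name for the
plugin's wakeup pipe; needs <poll.h> and <libnbd.h>):

  while (!nbd_aio_is_dead (nbd) && !nbd_aio_is_closed (nbd)) {
    struct pollfd fds[2] = {
      { .fd = nbd_aio_get_fd (nbd), .events = 0 },
      { .fd = selfpipe_rd, .events = POLLIN },   /* wakeup pipe */
    };
    unsigned dir = nbd_aio_get_direction (nbd);
    if (dir & LIBNBD_AIO_DIRECTION_READ)  fds[0].events |= POLLIN;
    if (dir & LIBNBD_AIO_DIRECTION_WRITE) fds[0].events |= POLLOUT;

    if (poll (fds, 2, -1) == -1)   /* <-- nbd.c:323, where thread 3 sits */
      break;

    if (fds[0].revents & POLLIN)
      nbd_aio_notify_read (nbd);
    else if (fds[0].revents & POLLOUT)
      nbd_aio_notify_write (nbd);
  }

So the thread only leaves the poll() if the server socket becomes
ready or something pokes the self-pipe.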

> Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
> #0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
> #1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538
> #2  0x00005583f90caee0 in backend_close (b=<optimized out>) at backend.c:247
> #3  0x00005583f90cdbf1 in free_connection (conn=0x5584020df890) at connections.c:359
> #4  handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
> #5  0x00005583f90d63e8 in start_thread (datav=0x5584020bf1b0) at sockets.c:356
> #6  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
> #7  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

And this is nbdkit waiting for the poll thread to complete.
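
The close path, paraphrased from nbd.c (exact calls may differ between
versions; struct handle is the plugin's own type; needs <pthread.h>,
<errno.h> and <libnbd.h>):

  static void
  nbdplug_close_handle (struct handle *h)
  {
    if (nbd_shutdown (h->nbd, 0) == -1)            /* request NBD_CMD_DISC */
      nbdkit_debug ("shutdown: %s", nbd_get_error ());
    if ((errno = pthread_join (h->reader, NULL)))  /* <-- nbd.c:538, stuck */
      nbdkit_debug ("failed to join reader thread: %m");
    nbd_close (h->nbd);
    free (h);
  }

The pthread_join() can only return if the reader loop above wakes up
and exits, which is exactly what is not happening here.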

> Thread 1 (Thread 0x7fabc002ca40 (LWP 3955770)):
> #0  0x00007fabc06a3b02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00005583f90d6dbf in accept_incoming_connections (socks=0x5584020bf010, nr_socks=1) at sockets.c:501
> #2  0x00005583f90ca441 in start_serving () at main.c:936
> #3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

nbdkit itself listening for a second client.

> 
> The stack trace of [1] at the hang is:
> 
> Thread 2 (Thread 0x7f30ac822700 (LWP 3955798)):
> #0  0x00007f30b32a384c in recv () from /lib64/libpthread.so.0
> #1  0x00007f30b3342c8f in _gnutls_stream_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:346
> #2  _gnutls_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:426
> #3  _gnutls_io_read_buffered (session=session@entry=0x55c5cf561620, total=5, recv_type=recv_type@entry=4294967295, ms=0x7f30ac821a2c) at buffers.c:582
> #4  0x00007f30b3338f3f in recv_headers (ms=<optimized out>, record=0x7f30ac821a80, htype=4294967295, type=GNUTLS_ALERT, record_params=0x55c5cf565f60, session=0x55c5cf561620) at record.c:1172
> #5  _gnutls_recv_in_buffers (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, htype=htype@entry=4294967295, ms=<optimized out>, ms@entry=0) at record.c:1307
> #6  0x00007f30b333b555 in _gnutls_recv_int (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, data=data@entry=0x0, data_size=data_size@entry=0, seq=seq@entry=0x0, ms=0) at record.c:1773
> #7  0x00007f30b333b703 in gnutls_bye (session=session@entry=0x55c5cf561620, how=how@entry=GNUTLS_SHUT_RDWR) at record.c:312
> #8  0x000055c5c57af171 in crypto_close () at crypto.c:407
> #9  0x000055c5c57aea58 in free_connection (conn=0x55c5cf560500) at connections.c:339
> #10 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230

nbdkit trying to use TLS to perform a clean shutdown.
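
The key detail is the "how" argument: with GNUTLS_SHUT_RDWR,
gnutls_bye() sends our close_notify alert and then blocks reading the
peer's close_notify, which is the recv() in frame #0.  Boiled down
(needs <gnutls/gnutls.h>):

  /* crypto_close, simplified: */
  gnutls_bye (session, GNUTLS_SHUT_RDWR);  /* crypto.c:407 - sends our
                                              close_notify, then waits
                                              to read the peer's */

GNUTLS_SHUT_WR would have sent ours without waiting for theirs.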

> #11 0x000055c5c57b73e8 in start_thread (datav=0x55c5cf541550) at sockets.c:356
> #12 0x00007f30b3299472 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007f30b31c8063 in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x7f30b2c28a40 (LWP 3955740)):
> #0  0x00007f30b329fb02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x000055c5c57b7dbf in accept_incoming_connections (socks=0x55c5cf5413b0, nr_socks=1) at sockets.c:501
> #2  0x000055c5c57ab441 in start_serving () at main.c:936
> #3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

As in process [2], this is nbdkit's main thread listening for another 
client.

> 
> It seems like process [2] is hanging in pthread_join, waiting on
> thread 3 (the reader loop) to finish.  The reader loop is stuck on
> poll.
> 
> Meanwhile process [1] is trying to do a clean TLS disconnect.  This
> involves waiting on process [2] to write something, which it is never
> going to do.

Mostly correct, except that [2] _could_ write something if we were to 
improve libnbd's TLS shutdown to not be so awkward.
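
That is, something along these lines on the libnbd side (a speculative
sketch, not a tested patch - the teardown order is my guess):

  /* When libnbd drops a TLS connection, send our close_notify first
   * so that the server's gnutls_bye (..., GNUTLS_SHUT_RDWR) has
   * something to read; GNUTLS_SHUT_WR does not wait for the peer's
   * reply, so this side cannot block. */
  gnutls_bye (session, GNUTLS_SHUT_WR);
  gnutls_deinit (session);
  close (fd);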

> 
> I don't think this is actually anything to do with libnbd not cleanly
> shutting down TLS connections.  If libnbd had closed the socket
> abruptly then process [1] wouldn't have to wait.  I think this might
> be a bug in the nbd plugin.

Possible, but I'm still leaning towards it being a problem that libnbd 
is not handling TLS very well, which makes it hard for the nbd plugin to 
know that it can quit poll()ing.

Determining which of the two (or both) needs patching is trickier, 
though.  I'll spend more cycles on it today, at any rate.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
