[Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?

Richard W.M. Jones rjones at redhat.com
Thu Mar 19 12:13:42 UTC 2020


[Dropping devel, adding libguestfs]

This can be reproduced on x86-64, so I can debug it locally.  It
only appears to happen when the tests are run under rpmbuild, not when
I run them with ‘make check’, but I'm unclear why that is.

As Eric described earlier, the test runs two copies of nbdkit and a
client, connected like this:

  qemu-img info ===> nbdkit nbd ===> nbdkit example1
     [3]                [2]             [1]

These are started in order [1], [2], then [3].  When the client
(process [3]) completes, it exits, and the test harness then kills
processes [1] and [2], in that order.

The stack trace of [2] at the hang is:

Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
#0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
#1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
#2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
#3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
#0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
#1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538
#2  0x00005583f90caee0 in backend_close (b=<optimized out>) at backend.c:247
#3  0x00005583f90cdbf1 in free_connection (conn=0x5584020df890) at connections.c:359
#4  handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
#5  0x00005583f90d63e8 in start_thread (datav=0x5584020bf1b0) at sockets.c:356
#6  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fabc05cc063 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fabc002ca40 (LWP 3955770)):
#0  0x00007fabc06a3b02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00005583f90d6dbf in accept_incoming_connections (socks=0x5584020bf010, nr_socks=1) at sockets.c:501
#2  0x00005583f90ca441 in start_serving () at main.c:936
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

The stack trace of [1] at the hang is:

Thread 2 (Thread 0x7f30ac822700 (LWP 3955798)):
#0  0x00007f30b32a384c in recv () from /lib64/libpthread.so.0
#1  0x00007f30b3342c8f in _gnutls_stream_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:346
#2  _gnutls_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:426
#3  _gnutls_io_read_buffered (session=session@entry=0x55c5cf561620, total=5, recv_type=recv_type@entry=4294967295, ms=0x7f30ac821a2c) at buffers.c:582
#4  0x00007f30b3338f3f in recv_headers (ms=<optimized out>, record=0x7f30ac821a80, htype=4294967295, type=GNUTLS_ALERT, record_params=0x55c5cf565f60, session=0x55c5cf561620) at record.c:1172
#5  _gnutls_recv_in_buffers (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, htype=htype@entry=4294967295, ms=<optimized out>, ms@entry=0) at record.c:1307
#6  0x00007f30b333b555 in _gnutls_recv_int (session=session@entry=0x55c5cf561620, type=type@entry=GNUTLS_ALERT, data=data@entry=0x0, data_size=data_size@entry=0, seq=seq@entry=0x0, ms=0) at record.c:1773
#7  0x00007f30b333b703 in gnutls_bye (session=session@entry=0x55c5cf561620, how=how@entry=GNUTLS_SHUT_RDWR) at record.c:312
#8  0x000055c5c57af171 in crypto_close () at crypto.c:407
#9  0x000055c5c57aea58 in free_connection (conn=0x55c5cf560500) at connections.c:339
#10 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>) at connections.c:230
#11 0x000055c5c57b73e8 in start_thread (datav=0x55c5cf541550) at sockets.c:356
#12 0x00007f30b3299472 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f30b31c8063 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f30b2c28a40 (LWP 3955740)):
#0  0x00007f30b329fb02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055c5c57b7dbf in accept_incoming_connections (socks=0x55c5cf5413b0, nr_socks=1) at sockets.c:501
#2  0x000055c5c57ab441 in start_serving () at main.c:936
#3  main (argc=<optimized out>, argv=<optimized out>) at main.c:702

It seems like process [2] is hanging in pthread_join, waiting on
thread 3 (the reader loop) to finish.  But the reader loop is stuck in
poll with an infinite timeout (__timeout=-1), so it will never return
unless something makes one of its fds ready.

Meanwhile process [1] is trying to do a clean TLS disconnect.
gnutls_bye (GNUTLS_SHUT_RDWR) waits for the peer's close_notify alert,
i.e. it waits on process [2] to write something, which it is never
going to do.

I don't think this is actually anything to do with libnbd not cleanly
shutting down TLS connections.  If libnbd had closed the socket
abruptly then process [1] wouldn't have to wait.  I think this might
be a bug in the nbd plugin.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/



