[Libguestfs] Anyone seen build hangs (esp armv7, s390x) in Fedora?

Eric Blake eblake at redhat.com
Thu Mar 19 15:56:08 UTC 2020


[replying here, as I seem to have been dropped from cc on the subthread 
at 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/ELUEHAA7X7YKU5DFIOBS3UQ5AXQYJWLY/ 
- maybe I should subscribe to devel@ instead of seeing this second-hand...
hmm - I can't even post to devel@ without subscribing, so now just 
sending this to libguestfs]

[adding libguestfs - now that devel@ has helped point to a bug in nbdkit 
itself]

On 3/18/20 4:49 AM, Richard W.M. Jones wrote:
> On Wed, Mar 18, 2020 at 09:38:52AM +0000, Peter Robinson wrote:
>>> This might be a bug in the package itself, but has anyone seen builds
>>> hanging in weird places, in Rawhide, especially on armv7 and s390x?
>>>
>>> This package build has hung 3 times in the same place, once on armv7
>>> and twice on s390x:
>>>
>>> https://koji.fedoraproject.org/koji/taskinfo?taskID=42570766
>>>
>>> It's hard to explain how it could hang at that place in the build
>>> unless something fundamental is broken like make.
>>
>> Well make 4.3 did land recently (March 12th) in rawhide so that's
>> entirely possible.
> 
> Yes, Eric Blake pointed this out to me too.  However I don't really
> want to blame make unless others have seen similar hangs.  It could
> easily be a new bug in the package itself.
> 
> If anyone has access to that builder, it might be interesting to get a
> process listing, or strace of whatever process is hanging.

Dan Horak added:

> it's a deadlock in the tests, not in make. Reproduced with "fedpkg local" in a cycle.
> sharkcz 1649225 0.0 0.0 222288 3904 pts/5 S+ 06:24 0:00 /bin/sh -e /var/tmp/rpm-tmp.RXcMRr
> sharkcz 1649230 0.0 0.0 10372 3248 pts/5 S+ 06:24 0:00 make -j4 check
> sharkcz 1658088 0.0 0.0 251236 3400 pts/5 Sl+ 06:25 0:00 /home/sharkcz/nbdkit/nbdkit-1.19.3/server/nbdkit -v -P test-nbd-tls-psk.pid1 -U /tmp/tmp.7e7Gv5MPmZ --tls=require --tls-psk=keys.psk -- /home/sharkcz/nbdkit/nbdkit-1.19.3/plugins/example1/.libs/nbdkit-example1-plugin.so
> sharkcz 1658091 0.0 0.1 192944 4464 pts/5 Sl+ 06:25 0:00 /home/sharkcz/nbdkit/nbdkit-1.19.3/server/nbdkit -v -P test-nbd-tls-psk.pid2 -U /tmp/tmp.yp61yXx09y --tls=off -- /home/sharkcz/nbdkit/nbdkit-1.19.3/plugins/nbd/.libs/nbdkit-nbd-plugin.so tls=require tls-psk=keys.psk tls-username=qemu socket=/tmp/tmp.7e7Gv5MPmZ
> the 2 nbdkit processes are stuck in the futex() syscall 

Reconstructing the state from those command lines, we have a TLS test
that involves 3 processes:

client <=> nbdkit nbd <=> nbdkit example1

It looks like this particular test was checking a plain-text client
connecting to nbdkit nbd, which in turn was connecting as a TLS client
to nbdkit example1.  I also know that 'nbdkit nbd' uses libnbd to
support TLS, and that we have not fully implemented clean TLS teardown
in libnbd.  So it could be that the nbd side has told the example1 side
that it will be shutting down soon but, due to unclean TLS library
usage, is missing a poll() wakeup and never realizes that no further
response will come from the example1 side; meanwhile the example1 side
is doing blocking I/O, waiting for the nbd side to close the socket.
The overall test that spawned both nbdkit processes in the background
(tests/test-nbd-tls-psk.sh) has completed, though, stranding those two
hung processes without their original parent while still letting 'make
check' report testsuite success.

As to why make is hanging, that is beyond me.  Maybe something new in 
make 4.3 is detecting that we have stranded indirect processes, and is 
waiting for them to complete?

Ideally, we need to fix libnbd TLS support to do cleaner shutdown.

Pragmatically, nbdkit's tests/functions.sh start_nbdkit() function right 
now tries only a single:

     cleanup_fn kill "$(cat "$pidfile")"

without waiting to see if it actually worked.  We could probably turn 
that into a more robust kill_nbdkit() function that first tries the 
graceful SIGTERM, waits a few seconds to confirm whether the process 
actually died, and follows up with a harder SIGKILL as needed 
(preferably failing a test whenever SIGTERM was insufficient).  It may 
not solve the bug in libnbd TLS shutdown, but would at least prevent 
stuck processes.
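
A minimal sketch of what such a kill_nbdkit() helper might look like.
Only the name comes from the proposal above; the optional timeout
argument, its 5 second default, the polling via 'kill -0', and the
"return 1 when SIGKILL was needed" convention are all assumptions for
illustration, not existing tests/functions.sh code:

```shell
# Hypothetical helper: kill_nbdkit PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed.
kill_nbdkit ()
{
    pid=$1
    timeout=${2:-5}    # assumed default: wait up to 5 seconds

    # Try the graceful SIGTERM first.  If the signal cannot be
    # delivered, the process is most likely already gone.
    kill "$pid" 2>/dev/null || return 0

    # Poll once per second; 'kill -0' only checks that the pid exists.
    i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # Still alive after the timeout: escalate, and report failure so
    # that the caller can fail the test.
    if kill -0 "$pid" 2>/dev/null; then
        echo "kill_nbdkit: pid $pid ignored SIGTERM, sending SIGKILL" >&2
        kill -KILL "$pid" 2>/dev/null
        return 1
    fi
    return 0
}
```

It could then presumably be registered the same way as the current
one-liner, e.g. cleanup_fn kill_nbdkit "$(cat "$pidfile")".  Whether a
non-zero return from a cleanup hook actually fails the test depends on
how cleanup_fn runs its hooks, which is not checked here.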

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



