[Libguestfs] [libnbd PATCH v2] lib/errors.c: Fix assert fail in exit path in multi-threaded code
Richard W.M. Jones
rjones at redhat.com
Wed Mar 8 22:55:01 UTC 2023
On Wed, Mar 08, 2023 at 04:29:03PM -0600, Eric Blake wrote:
> When a highly multi-threaded program such as nbdcopy encounters an
> error, there is a race condition in the library which can cause an
> assertion failure and thus a core dump:
>
> (1) An error occurs on one of the threads. nbdcopy calls exit(3).
>
> (2) In lib/errors.c, the destructor calls pthread_key_delete.
>
> (3) Another thread which is still running also encounters an error,
> and inside libnbd the library calls set_error().
>
> (4) The call to set_error() calls pthread_getspecific. This is
> undefined behavior per POSIX, since the key has already been destroyed
> in step (2), but glibc handles it by returning NULL with an EINVAL
> error (POSIX recommends, but can't mandate, that course of action for
> technical reasons). In any case, the error message is lost, and any
> subsequent call to nbd_get_error() will return NULL.
>
> (5) We enter the %DEAD state, where there is an assertion that
> nbd_get_error() is not NULL. This assertion is supposed to be for
> checking that the code called set_error() before entering the %DEAD
> state. Although we did call set_error(), the error message was lost
> at step (4) and nbd_get_error() will always return NULL. This
> assertion failure causes a crash.
>
> There aren't any good ways to fix this. We have proven that the
> assertions are too tight (it IS possible for one thread's first use of
> libnbd API, which causes the allocation of a thread-local error
> context, to occur after another thread has already triggered the
> destructor via exit(3)), so remove those. Meanwhile, since POSIX says
> any use of a destroyed key is undefined, add some atomic bookkeeping
> such that we track a reference count of how many threads have
> allocated an error context and subsequently had the per-thread data
> destructor called. Only call pthread_key_delete() if all threads had
> the chance to exit; leaking small amounts of memory is preferable to
> undefined behavior.
>
> Affected tests that now produce the new leak message:
> - copy/copy-nbd-error.sh (known issue - we probably want to improve
> nbdcopy to gracefully issue NBD_CMD_DISC on error, rather than
> outright calling exit(), to work around the assertion failure in
> nbdkit 1.32.5)
> - various fuse/* (not sure if we can address that; cleaning up FUSE is
> tricky)
>
> This reverts a small part of commit 831cbf7ee14c ("generator: Allow
> DEAD state actions to run").
>
> Thanks: Richard W.M. Jones <rjones at redhat.com>
> Signed-off-by: Eric Blake <eblake at redhat.com>
...
> @@ -87,16 +92,23 @@ free_errors_key (void *vp)
> static struct last_error *
> allocate_last_error_on_demand (void)
> {
> - struct last_error *last_error = pthread_getspecific (errors_key);
> + struct last_error *last_error = NULL;
>
> - if (!last_error) {
> - last_error = calloc (1, sizeof *last_error);
> - if (last_error) {
> - int err = pthread_setspecific (errors_key, last_error);
> - if (err != 0) {
> - /* This is not supposed to happen (XXX). */
> - fprintf (stderr, "%s: %s: %s\n", "libnbd", "pthread_setspecific",
> - strerror (err));
> + if (key_use_count) {
> + last_error = pthread_getspecific (errors_key);
> + if (!last_error) {
> + last_error = calloc (1, sizeof *last_error);
> + if (last_error) {
> + int err = pthread_setspecific (errors_key, last_error);
> + key_use_count++;
> + if (err != 0) {
> + /* This is not supposed to happen (XXX). */
> + fprintf (stderr, "%s: %s: %s\n", "libnbd", "pthread_setspecific",
> + strerror (err));
> + free (last_error);
> + last_error = NULL;
> + key_use_count--;
> + }
> }
> }
> }
I suspect this may not be completely race condition free unless
there's a lock. Otherwise key_use_count could be > 0 when we enter
this code and then another thread could run the destructor at the same
time.
However I don't want to add locking around these functions since
(especially the one setting context) is highly contended and this
could kill performance.
I think the worst that might happen is we would get EINVAL from
pthread_setspecific, print a message, but at least we won't crash
because the assert has been removed.
So soft ACK, but I'm still wondering if there's a better way to do this.
If simply never calling pthread_key_destroy a good idea? The worst is
it leaks slots in glibc's thread-local storage array.
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org
More information about the Libguestfs
mailing list