[Libguestfs] [PATCH libnbd] lib/errors.c: Fix assert fail in exit path in multi-threaded code

Wed Mar 8 22:03:32 UTC 2023

On Wed, Mar 08, 2023 at 03:22:29PM -0600, Eric Blake wrote:
> On Wed, Mar 08, 2023 at 08:29:41PM +0000, Richard W.M. Jones wrote:
> > When a highly multi-threaded program such as nbdcopy encounters an
> > error, there is a race condition in the library which can cause an
> > assertion failure and thus a core dump:
> > 
> > (1) An error occurs on one of the threads.  nbdcopy calls exit(3).
> > 
> > (2) In lib/errors.c, the destructor calls pthread_key_delete.
> >
> 
> Note - POSIX says that calling pthread_key_delete() can leak memory.
> If the reason we are calling pthread_key_delete() is because libnbd
> was dlopen'd and is now being dlclose'd, and an app repeatedly reloads
> libnbd, causes an error, and then unloads libnbd, this leak could grow
> to be rather sizeable.  But that is a rather unlikely scenario, and
> safely coordinating cleanup is hard when we consider both normal
> thread exits (where the key destructor runs if the key is not yet
> deleted) and module unloads (where pthread_key_delete skips key
> destructors).
> 
> > (3) Another thread which is still running also encounters an error,
> > and inside libnbd the library calls set_error().
> > 
> > (4) The call to set_error() calls pthread_getspecific which returns
> > NULL (since the key has already been destroyed in step (2)), and this
> 
> This is undefined behavior per POSIX.  It is not safe to call
> pthread_getspecific() on a destroyed key (even though glibc lets us
> get away with it by giving us EINVAL).
> 
> > causes us to call pthread_setspecific which returns EINVAL because
> > glibc is able to detect invalid use of a pthread_key_t after it has
> > been destroyed.  In any case, the error message is lost, and any
> > subsequent call to nbd_get_error() will return NULL.
> 
> Indeed - once we call pthread_key_delete() in step 2, ANY further use
> of the key is going to cause undefined behavior (hopefully EINVAL
> failures, but worse behavior is also possible).
> 
> One possible idea: have a non-thread-local global that starts life
> false, but which errors_free() atomically sets to true prior to
> calling pthread_key_delete().  All other functions which reference
> errors_key check that the global is still false before using
> pthread_{get,set}specific.
> 
> Another idea: have a counter of the number of threads that have set a
> thread-local error but not yet exited.  Increment the counter any time
> allocate_last_error_on_demand() says a new thread is triggering an
> error, and decrement the counter any time free_errors_key() calls the
> destructor at the end of a thread.  Only when all threads with an
> allocated error have exited will the counter reach zero, so only the
> last thread to clean up needs to call pthread_key_delete().  The
> counter might not reach zero before the process exits, but that's okay
> (if the reason the module is being unloaded is because the process is
> going away, cleanup is less important), although I don't know if
> valgrind would complain.  In fact, POSIX suggests:
> 
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_key_delete.html
> "For an application to know that it is safe to delete a key, it has to
> know that all the threads that might potentially ever use the key do
> not attempt to use it again. For example, it could know this if all
> the client threads have called a cleanup procedure declaring that they
> are through with the module that is being shut down, perhaps by
> setting a reference count to zero."

IMHO the suggested ways to solve this problem are all complex enough
that I would not have much confidence in them even once implemented.

The use of pthread thread-data destructors is inherantly dangerous
when combined with dlclose, to the extent that the best course of
action is to simply forbid this scenario.

Libvirt hit this problem with pthread data destructors and dlclose()
many years ago and we merely blocked the functional problem by linking
libvirt with '-z nodelete'...

  commit 8e44e5593eb9b89fbc0b54fde15f130707a0d81e
  Author: Daniel P. Berrangé <berrange at redhat.com>
  Date:   Thu Sep 1 17:57:06 2011 +0100

    Prevent crash from dlclose() of libvirt.so

    When libvirt calls virInitialize it creates a thread local
    for the virErrorPtr storage, and registers a callback to
    cleanup memory when a thread exits. When libvirt is dlclose()d
    or otherwise made non-resident, the callback function is
    removed from memory, but the thread local may still exist
    and if a thread later exists, it will invoke the callback
    and SEGV. There may also be other thread locals with callbacks
    pointing to libvirt code, so it is in general never safe to
    unload libvirt.so from memory once initialized.

    To allow dlclose() to succeed, but keep libvirt.so resident
    in memory, link with '-z nodelete'. This issue was first
    found with the libvirt CIM provider, but can potentially
    hit many of the dynamic language bindings which all ultimately
    involve dlopen() in some way, either on libvirt.so itself,
    or on the glue code for the binding which in turns links
    to libvirt

Yes, this prevents you from unloading the library from memory so you
potentially have permanent memory usage, but unless the library is
really huge, I have a hard time believing that is a problem that
matters in practice. I doubt anyone has even noticed that libvirt
doesn't get removed from memory with dlcose()...

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|