[dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()

Martin Wilck mwilck at suse.com
Tue Mar 2 15:29:40 UTC 2021


On Tue, 2021-03-02 at 20:44 +0800, lixiaokeng wrote:
> 
> > Note that unlike all other threads, TUR threads are _detached_
> > threads.
> > multipathd tries to cancel them, but it has no way to verify that
> > they
> > actually stopped. It may be just a normal observation that you
> > can't
> > see the messages when a TUR thread terminates, in particular if the
> > program is exiting and might have already closed the stderr file
> > descriptor.
> > 
> > 
> > If you look at the crashed processes with gdb, the thread IDs
> > should
> > give you some clue which stack belongs to which thread. The TUR
> > threads
> > will have higher thread IDs than the others because they are
> > started
> > later.
> > 
> 
> 
> ??
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f3f669e071d in ?? ()
> [Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))]
> (gdb) i thread
>   Id   Target Id                           Frame
> * 1    Thread 0x7f3f65873700 (LWP 1645593) 0x00007f3f669e071d in ??
> ()
>   2    Thread 0x7f3f6611a000 (LWP 1645066) 0x00007f3f669fede7 in
> munmap () at ../sysdeps/unix/syscall-template.S:78
>   3    Thread 0x7f3f6609d700 (LWP 1645095) syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> (gdb) bt
> #0  0x00007f3f669e071d in ?? ()
> #1  0x0000000000000000 in ?? ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))]
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> 78      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
> (gdb) bt
> #0  0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> #1  0x00007f3f669fb77d in _dl_unmap_segments
> (l=l at entry=0x557cb432ba10) at ./dl-unmap-segments.h:32
> ...
> #10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66 
> //cleanup_checkers() is finished.
> #11 0x0000557cb26db794 in child (param=<optimized out>) at
> main.c:2932
> #12 0x0000557cb26d44d3 in main (argc=<optimized out>,
> argv=0x7ffc98d47948) at main.c:3150
> 
> 
> UNWIND
> 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
> 58        if (*(unsigned char *)(pc+0) == 0x48
> [Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))]
> (gdb) i thread
>   Id   Target Id                                     Frame
> * 1    Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting)
> x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
>   2    Thread 0x7fa9b383e000 (LWP 1284366)          
> 0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
>   3    Thread 0x7fa9b37c1700 (LWP 1284374)           syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   4    Thread 0x7fa9b2f73700 (LWP 1285077)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   5    Thread 0x7fa9b2f61700 (LWP 1285076)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   6    Thread 0x7fa9b2f4f700 (LWP 1285079)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
>   7    Thread 0x7fa9b2fa9700 (LWP 1285080)          
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))]
> #0  0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> 27        return SYSCALL_CANCEL (close, fd);
> (gdb) bt
> #0  0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> #1  0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111
> #2  0x00005606f03087a2 in child (param=<optimized out>) at
> main.c:2934
> #3  0x00005606f03014d3 in main (argc=<optimized out>,
> argv=0x7ffdb782ab38) at main.c:3150
> 
> 
> The LWP of ?? and UNWIND is much larger than thread 2(main).
> 
> I add print_func like:
> 
> @@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const
> char *msg)
>         pthread_mutex_unlock(&ct->lock);
>  }
> 
> +static void lxk10 (void)
> +{
> +       condlog(2, "lxk exit tur_thread");
> +}
>  static void *tur_thread(void *ctx)
>  {
>         struct tur_checker_context *ct = ctx;
> @@ -235,6 +239,8 @@ static void *tur_thread(void *ctx)
>         char devt[32];
> 
>         /* This thread can be canceled, so setup clean up */
> +       condlog(2, "lxk start tur_thread");
> +       pthread_cleanup_push(lxk10, NULL);
>         tur_thread_cleanup_push(ct);
> 
> When there are four devices, core log:
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit
> (signal)
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 360014057a1353ec1bdd4dfcad19db6db: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdg that holds hwe of
> 360014057a1353ec1bdd4dfcad19db6db
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdj that holds hwe of
> 36001405faf8a6c2920840ed8ba73b9ee
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 3
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdi that holds hwe of
> 36001405044c0f50ba3c4e5b9b57e4de4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 2
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405e0cbb950907b4a51af1a002ed: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdh that holds hwe of
> 36001405e0cbb950907b4a51af1a002ed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 1
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> ueventloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uxlsnrloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uevqloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> wait_dmevents
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> checkerloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio
> checker refcount 6
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur
> checker  //checker_put


So we do not see "unloading tur checker". Like you said, that suggests
that the crash occurs between libcheck_free() and the thread exiting.
I suggest you put a message in tur.c:libcheck_free (), AFTER the call
to cleanup_context(), printing the values of "running" and "holders".

Anyway:

	holders = uatomic_sub_return(&ct->holders, 1);
	if (!holders)
		cleanup_context(ct);

Whatever mistakes we have made, only one actor can have seen 
holders == 0, and have called cleanup_context().

The stacks you have shown indicate that the instruction pointers were
broken. That would suggest something similar as dicussed in the ML
thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
astray checker threads"). Your logs show "tur checker refcount 1", so
the next call to checker_put would have unloaded the DSO. 

Please try commenting out the dlclose() call in free_checker_class(),
and see if it makes a difference.

Martin






More information about the dm-devel mailing list