[dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()
Martin Wilck
mwilck at suse.com
Tue Mar 2 15:29:40 UTC 2021
On Tue, 2021-03-02 at 20:44 +0800, lixiaokeng wrote:
>
> > Note that unlike all other threads, TUR threads are _detached_
> > threads.
> > multipathd tries to cancel them, but it has no way to verify that
> > they
> > actually stopped. It may be just a normal observation that you
> > can't
> > see the messages when a TUR thread terminates, in particular if the
> > program is exiting and might have already closed the stderr file
> > descriptor.
> >
> >
> > If you look at the crashed processes with gdb, the thread IDs
> > should
> > give you some clue which stack belongs to which thread. The TUR
> > threads
> > will have higher thread IDs than the others because they are
> > started
> > later.
> >
>
>
> ??
>
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x00007f3f669e071d in ?? ()
> [Current thread is 1 (Thread 0x7f3f65873700 (LWP 1645593))]
> (gdb) i thread
> Id Target Id Frame
> * 1 Thread 0x7f3f65873700 (LWP 1645593) 0x00007f3f669e071d in ??
> ()
> 2 Thread 0x7f3f6611a000 (LWP 1645066) 0x00007f3f669fede7 in
> munmap () at ../sysdeps/unix/syscall-template.S:78
> 3 Thread 0x7f3f6609d700 (LWP 1645095) syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> (gdb) bt
> #0 0x00007f3f669e071d in ?? ()
> #1 0x0000000000000000 in ?? ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f3f6611a000 (LWP 1645066))]
> #0 0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> 78 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
> (gdb) bt
> #0 0x00007f3f669fede7 in munmap () at ../sysdeps/unix/syscall-
> template.S:78
> #1 0x00007f3f669fb77d in _dl_unmap_segments
> (l=l at entry=0x557cb432ba10) at ./dl-unmap-segments.h:32
> ...
> #10 0x00007f3f669b44ed in cleanup_prio () at prio.c:66
> //cleanup_checkers() is finished.
> #11 0x0000557cb26db794 in child (param=<optimized out>) at
> main.c:2932
> #12 0x0000557cb26d44d3 in main (argc=<optimized out>,
> argv=0x7ffc98d47948) at main.c:3150
>
>
> UNWIND
>
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `/sbin/multipathd -d -s'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
> 58 if (*(unsigned char *)(pc+0) == 0x48
> [Current thread is 1 (Thread 0x7fa9b2f58700 (LWP 1285074))]
> (gdb) i thread
> Id Target Id Frame
> * 1 Thread 0x7fa9b2f58700 (LWP 1285074) (Exiting)
> x86_64_fallback_frame_state (fs=0x7fa9b2f576d0,
> context=0x7fa9b2f57980) at ./md-unwind-support.h:58
> 2 Thread 0x7fa9b383e000 (LWP 1284366)
> 0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> 3 Thread 0x7fa9b37c1700 (LWP 1284374) syscall () at
> ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> 4 Thread 0x7fa9b2f73700 (LWP 1285077)
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
> 5 Thread 0x7fa9b2f61700 (LWP 1285076)
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
> 6 Thread 0x7fa9b2f4f700 (LWP 1285079)
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
> 7 Thread 0x7fa9b2fa9700 (LWP 1285080)
> 0x00007fa9b3e06507 in ioctl () at ../sysdeps/unix/syscall-
> template.S:78
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa9b383e000 (LWP 1284366))]
> #0 0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> 27 return SYSCALL_CANCEL (close, fd);
> (gdb) bt
> #0 0x00007fa9b403e127 in __close (fd=5) at
> ../sysdeps/unix/sysv/linux/close.c:27
> #1 0x00005606f030f95b in cleanup_dmevent_waiter () at dmevents.c:111
> #2 0x00005606f03087a2 in child (param=<optimized out>) at
> main.c:2934
> #3 0x00005606f03014d3 in main (argc=<optimized out>,
> argv=0x7ffdb782ab38) at main.c:3150
>
>
> The LWP of ?? and UNWIND is much larger than thread 2(main).
>
> I add print_func like:
>
> @@ -228,6 +228,10 @@ static void copy_msg_to_tcc(void *ct_p, const
> char *msg)
> pthread_mutex_unlock(&ct->lock);
> }
>
> +static void lxk10 (void)
> +{
> + condlog(2, "lxk exit tur_thread");
> +}
> static void *tur_thread(void *ctx)
> {
> struct tur_checker_context *ct = ctx;
> @@ -235,6 +239,8 @@ static void *tur_thread(void *ctx)
> char devt[32];
>
> /* This thread can be canceled, so setup clean up */
> + condlog(2, "lxk start tur_thread");
> + pthread_cleanup_push(lxk10, NULL);
> tur_thread_cleanup_push(ct);
>
> When there are four devices, core log:
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: exit
> (signal)
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sda:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdf:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sde:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdd:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdc:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdb:
> unusable path
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk start
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> tur_thread
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 360014057a1353ec1bdd4dfcad19db6db: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdg: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdg that holds hwe of
> 360014057a1353ec1bdd4dfcad19db6db
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405faf8a6c2920840ed8ba73b9ee: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdj: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdj that holds hwe of
> 36001405faf8a6c2920840ed8ba73b9ee
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 3
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405044c0f50ba3c4e5b9b57e4de4: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdi: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdi that holds hwe of
> 36001405044c0f50ba3c4e5b9b57e4de4
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 2
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]:
> 36001405e0cbb950907b4a51af1a002ed: remove multipath map
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: sdh: orphan
> path, map flushed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: BUG:
> orphaning path sdh that holds hwe of
> 36001405e0cbb950907b4a51af1a002ed
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: tur checker
> refcount 1
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> ueventloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uxlsnrloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> uevqloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> wait_dmevents
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk exit
> checkerloop
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio
> checker refcount 6
> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur
> checker //checker_put
So we do not see "unloading tur checker". Like you said, that suggests
that the crash occurs between libcheck_free() and the thread exiting.
I suggest you put a message in tur.c:libcheck_free (), AFTER the call
to cleanup_context(), printing the values of "running" and "holders".
Anyway:
holders = uatomic_sub_return(&ct->holders, 1);
if (!holders)
cleanup_context(ct);
Whatever mistakes we have made, only one actor can have seen
holders == 0, and have called cleanup_context().
The stacks you have shown indicate that the instruction pointers were
broken. That would suggest something similar as dicussed in the ML
thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
astray checker threads"). Your logs show "tur checker refcount 1", so
the next call to checker_put would have unloaded the DSO.
Please try commenting out the dlclose() call in free_checker_class(),
and see if it makes a difference.
Martin
More information about the dm-devel
mailing list