[dm-devel] [PATCH] multipathd: avoid crash in uevent_cleanup()

Martin Wilck mwilck at suse.com
Mon Feb 8 11:03:05 UTC 2021


On Mon, 2021-02-08 at 18:49 +0800, lixiaokeng wrote:
> 
> 
> On 2021/2/8 17:50, Martin Wilck wrote:
> > On Mon, 2021-02-08 at 15:41 +0800, lixiaokeng wrote:
> > > 
> > > Hi Martin,
> > > 
> > > There is a _cleanup_ in device_new_from_nulstr. If uevent_thr
> > > exits inside device_new_from_nulstr while some keys have not
> > > yet been appended to the sd_device, the _cleanup_ will be
> > > called, which leads to multipathd crashing with that stack.
> > > 
> > > When I used your advice,
> > > 
> > > 
> > > On 2021/1/26 16:34, Martin Wilck wrote:
> > > >     int oldstate;
> > > > 
> > > >     pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
> > > > 
> > > >     udev_monitor_receive_device(...)
> > > > 
> > > >     pthread_setcancelstate(oldstate, NULL);
> > > >     pthread_testcancel();
> > > 
> > > this coredump no longer seems to appear (after several hours
> > > with test scripts).
> > 
> > Thanks for your continued hard work on this, but I can't follow
> > you. In
> > this post:
> > 
> > https://listman.redhat.com/archives/dm-devel/2021-January/msg00396.html
> > 
> > you said that this advice did _not_ help. Please clarify.
> > 
> 
> Hi Martin,
> At that time, I did not know how the crash occurred in the systemd
> interface.
> There were still some crashes involving pthread_testcancel(), for
> example:
> #0  0x0000ffffb6118f4c in aarch64_fallback_frame_state
> (context=0xffffb523f200, context=0xffffb523f200, fs=0xffffb523e700)
> at ./md-unwind-support.h:74
> #1  uw_frame_state_for (context=context@entry=0xffffb523f200, 
> fs=fs@entry=0xffffb523e700) at ../../../libgcc/unwind-dw2.c:1257
> #2  0x0000ffffb6119ef4 in _Unwind_ForcedUnwind_Phase2
> (exc=exc@entry=0xffffb52403b0, context=context@entry=0xffffb523f200)
> at ../../../libgcc/unwind.inc:155
> #3  0x0000ffffb611a284 in _Unwind_ForcedUnwind (exc=0xffffb52403b0, 
> stop=stop@entry=0xffffb64846c0 <unwind_stop>,
> stop_argument=0xffffb523f630) at ../../../libgcc/unwind.inc:207
> #4  0x0000ffffb6484860 in __GI___pthread_unwind (buf=<optimized out>)
> at unwind.c:121
> #5  0x0000ffffb6482d08 in __do_cancel () at pthreadP.h:304
> #6  __GI___pthread_testcancel () at pthread_testcancel.c:26
> #7  0x0000ffffb5c528e8 in ?? ()
> 

I still don't fully understand. Above you said "this coredump doesn't
seem to appear any more". Am I understanding correctly that you
observed *other* core dumps instead?

The uw_frame_state_for() stack looks healthy (I learned that just
recently from one of our experts in this area). Most probably the
actual crash occurred in another thread in this case. It would be
interesting to look at a core dump.

The point of my suggestion was not the pthread_testcancel(), but the
blocking of thread cancellation during udev_monitor_receive_device().

> I thought these crashes might be related to the crash in the systemd
> interface.
> 
> However, after analyzing the coredump and discussing it with the
> community, I now think these may be independent issues. So I tested
> again. The ?? and _Unwind_XXX crashes still occur, but there is no
> crash in udev_monitor_receive_device any more.

The "best" solution would probably be to disallow cancellation in
general, and only call pthread_testcancel() at specific points in the
code where we might block (and where we know that being cancelled is
safe). That would not only make multipathd less prone to crashes; it
would also let us remove hundreds of ugly
pthread_cleanup_push()/pop() calls from our code.

Finding all these points would be a challenge though, and if we miss
some, we risk hanging on exit again, which is also bad, and was only
recently improved.

Regards
Martin
