[dm-devel] v3.15 dm-mpath regression: cable pull test causes I/O hang

Mike Snitzer snitzer at redhat.com
Thu Jul 3 14:05:16 UTC 2014


On Thu, Jul 03 2014 at  9:56am -0400,
Bart Van Assche <bvanassche at acm.org> wrote:

> On 07/03/14 00:02, Mike Snitzer wrote:
> > On Fri, Jun 27 2014 at  9:33am -0400,
> > Mike Snitzer <snitzer at redhat.com> wrote:
> > 
> >> On Fri, Jun 27 2014 at  9:02am -0400,
> >> Bart Van Assche <bvanassche at acm.org> wrote:
> >>
> >>> Hello,
> >>>
> >>> While running a cable pull simulation test with dm_multipath on top of
> >>> the SRP initiator driver I noticed that after a few iterations I/O locks
> >>> up instead of dm_multipath processing the path failure properly (see also
> >>> below for a call trace). At least kernel versions 3.15 and 3.16-rc2 are
> >>> vulnerable. This issue does not occur with kernel 3.14. I have tried to
> >>> bisect this but gave up when I noticed that I/O locked up completely with
> >>> a kernel built from git commit ID e809917735ebf1b9a56c24e877ce0d320baee2ec
> >>> (dm mpath: push back requests instead of queueing). But with the bisect I
> >>> have been able to narrow down this issue to one of the patches in "Merge
> >>> tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/
> >>> device-mapper/linux-dm". Does anyone have a suggestion how to analyze this
> >>> further or how to fix this?
> > 
> > I still don't have a _known_ fix for your issue, but I reviewed commit
> > e809917735ebf1b9a56c24e877ce0d320baee2ec more closely and identified what
> > looks to be a regression in the logic for multipath_busy: it now calls
> > !pg_ready() instead of directly checking pg_init_in_progress.  I think
> > this is needed (Hannes, what do you think?):
> > 
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 3f6fd9d..561ead6 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -373,7 +373,7 @@ static int __must_push_back(struct multipath *m)
> >  		 dm_noflush_suspending(m->ti)));
> >  }
> >  
> > -#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required)
> > +#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required && !(m)->pg_init_in_progress)
> >  
> >  /*
> >   * Map cloned requests
> 
> Hello Mike,
> 
> Sorry but even with this patch applied and additionally with commit IDs
> 86d56134f1b6 ("kobject: Make support for uevent_helper optional") and
> bcccff93af35 ("kobject: don't block for each kobject_uevent") reverted
> my multipath test still hangs after a few iterations. I also reran the
> same test with kernel 3.14.3 and it is still running after 30 iterations.

OK, thanks for testing though!  I still think the patch is needed.
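
To spell out the case I have in mind, here is a minimal userspace sketch
(not kernel code).  The struct is a hypothetical stand-in for struct
multipath that carries only the three fields named in the diff, with
field types assumed for illustration, and the two helper names are made
up; they just mirror the macro before and after the patch:

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model, NOT the kernel's struct multipath.
 * Only the three fields referenced by the pg_ready() macro are kept;
 * their types here are an assumption made for illustration.
 */
struct mpath_model {
	bool queue_io;
	bool pg_init_required;
	unsigned pg_init_in_progress;
};

/* Mirrors pg_ready() as it stands before the patch above. */
static bool pg_ready_before(const struct mpath_model *m)
{
	return !m->queue_io && !m->pg_init_required;
}

/* Mirrors pg_ready() with the proposed extra check applied. */
static bool pg_ready_after(const struct mpath_model *m)
{
	return !m->queue_io && !m->pg_init_required &&
	       !m->pg_init_in_progress;
}
```

With queue_io and pg_init_required both clear and pg_init_in_progress
non-zero, pg_ready_before() returns true while pg_ready_after() returns
false.  That is the state a caller testing !pg_ready() would no longer
treat as "not ready" without the patch.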

You are using queue_if_no_path; do you see hangs due to paths not being
restored after the "cable" is restored?  Any errors in the multipathd
userspace logging?  Or abnormal errors in the kernel?  Basically I'm
looking for some other clue besides the hung task timeout spew.

How easy would it be to replicate your testbed?  Is it uniquely FIO hw
dependent?  How are you simulating the cable pull tests?

I'd love to set up a testbed that would enable me to chase this more
interactively rather than punting to you for testing.

Hannes, do you have a testbed for heavy cable pull testing?  Are you
able to replicate these hangs?



