[dm-devel] v3.15 dm-mpath regression: cable pull test causes I/O hang

Hannes Reinecke hare at suse.de
Thu Jul 3 14:15:48 UTC 2014


On 07/03/2014 04:05 PM, Mike Snitzer wrote:
> On Thu, Jul 03 2014 at  9:56am -0400,
> Bart Van Assche <bvanassche at acm.org> wrote:
>
>> On 07/03/14 00:02, Mike Snitzer wrote:
>>> On Fri, Jun 27 2014 at  9:33am -0400,
>>> Mike Snitzer <snitzer at redhat.com> wrote:
>>>
>>>> On Fri, Jun 27 2014 at  9:02am -0400,
>>>> Bart Van Assche <bvanassche at acm.org> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> While running a cable pull simulation test with dm_multipath on top of
>>>>> the SRP initiator driver I noticed that after a few iterations I/O locks
>>>>> up instead of dm_multipath processing the path failure properly (see also
>>>>> below for a call trace). At least kernel versions 3.15 and 3.16-rc2 are
>>>>> vulnerable. This issue does not occur with kernel 3.14. I have tried to
>>>>> bisect this but gave up when I noticed that I/O locked up completely with
>>>>> a kernel built from git commit ID e809917735ebf1b9a56c24e877ce0d320baee2ec
>>>>> (dm mpath: push back requests instead of queueing). But with the bisect I
>>>>> have been able to narrow down this issue to one of the patches in "Merge
>>>>> tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/
>>>>> device-mapper/linux-dm". Does anyone have a suggestion for how to analyze
>>>>> this further or how to fix it?
>>>
>>> I still don't have a _known_ fix for your issue, but I reviewed commit
>>> e809917735ebf1b9a56c24e877ce0d320baee2ec more closely and identified
>>> what looks to be a regression in the logic of multipath_busy: it now
>>> calls !pg_ready() instead of directly checking pg_init_in_progress.  I
>>> think this is needed (Hannes, what do you think?):
>>>
>>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
>>> index 3f6fd9d..561ead6 100644
>>> --- a/drivers/md/dm-mpath.c
>>> +++ b/drivers/md/dm-mpath.c
>>> @@ -373,7 +373,7 @@ static int __must_push_back(struct multipath *m)
>>>   		 dm_noflush_suspending(m->ti)));
>>>   }
>>>
>>> -#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required)
>>> +#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required && !(m)->pg_init_in_progress)
>>>
>>>   /*
>>>    * Map cloned requests
>>
>> Hello Mike,
>>
>> Sorry but even with this patch applied and additionally with commit IDs
>> 86d56134f1b6 ("kobject: Make support for uevent_helper optional") and
>> bcccff93af35 ("kobject: don't block for each kobject_uevent") reverted
>> my multipath test still hangs after a few iterations. I also reran the
>> same test with kernel 3.14.3 and it is still running after 30 iterations.
>
> OK, thanks for testing though!  I still think the patch is needed.
>
> You are using queue_if_no_path, do you see hangs due to paths not being
> restored after the "cable" is restored?  Any errors in the multipathd
> userspace logging?  Or abnormal errors in kernel?  Basically I'm looking
> for some other clue besides the hung task timeout spew.
>
> How easy would it be to replicate your testbed?  Is it uniquely FIO hw
> dependent?  How are you simulating the cable pull tests?
>
> I'd love to set up a testbed that would enable me to chase this more
> interactively rather than punting to you for testing.
>
> Hannes, do you have a testbed for heavy cable pull testing?  Are you
> able to replicate these hangs?
>
Yes, I do. But sadly I've been tied up with polishing SLES12
(the release deadline is looming) and for some inexplicable
reason management seems to find releasing a product more important
than working on mainline issues ...
But I hope to find some time soonish (i.e. start of next week) to work
on this; it's the very next thing on my to-do list.
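
For anyone following along, here is a minimal userspace sketch of what
the pg_ready() change quoted above does to the readiness check. It is
not kernel code: struct mpath_flags, pg_ready_old, pg_ready_new and
count_disagreements are made-up names for illustration, and only the
two macro bodies are taken from the diff.

```c
/* Userspace sketch only; struct mpath_flags is a hypothetical stand-in
 * for the three struct multipath fields that pg_ready() reads. */
#include <assert.h>
#include <stdbool.h>

struct mpath_flags {
	bool queue_io;
	bool pg_init_required;
	bool pg_init_in_progress;
};

/* pg_ready() as on the '-' line of the quoted diff. */
#define pg_ready_old(m) (!(m)->queue_io && !(m)->pg_init_required)

/* pg_ready() as on the '+' line of the quoted diff. */
#define pg_ready_new(m) (!(m)->queue_io && !(m)->pg_init_required && \
			 !(m)->pg_init_in_progress)

/* Count the flag combinations (out of 8) where the two macros disagree. */
static int count_disagreements(void)
{
	int n = 0;

	for (int bits = 0; bits < 8; bits++) {
		struct mpath_flags m = {
			.queue_io = bits & 1,
			.pg_init_required = bits & 2,
			.pg_init_in_progress = bits & 4,
		};

		/* The patched macro can only be stricter, never looser. */
		assert(!pg_ready_new(&m) || pg_ready_old(&m));
		if (pg_ready_old(&m) != pg_ready_new(&m))
			n++;
	}
	return n;
}
```

The two definitions differ in exactly one state: queue_io and
pg_init_required both clear while pg_init_in_progress is set. There the
old macro reports the path group as ready and the patched one does not.
Whether that state is what Bart's test actually hits is still open.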

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare at suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)



