[dm-devel] v4.8 dm-mpath

Fri Aug 26 16:35:21 UTC 2016

On Fri, Aug 26 2016 at 11:33am -0400,
Bart Van Assche <bart.vanassche at sandisk.com> wrote:

> On 08/26/2016 07:26 AM, Mike Snitzer wrote:
> >On Thu, Aug 25 2016 at  1:40pm -0400,
> >Bart Van Assche <bart.vanassche at sandisk.com> wrote:
> >>As usual, thanks for the quick feedback. But it seems like I sent my
> >>e-mail too soon: after I had sent my e-mail I ran again into the
> >>truncate_inode_pages_range() hang.
> >
> >I was skeptical your 3 earlier patches (particularly the __dm_destroy to
> >use internal suspend patch) would fix anything you care about in your
> >testing.  __dm_destroy is only used once all references on the DM mpath
> >device are dropped.  When you do your fio + cable pull tests you're just
> >bouncing underlying paths around.  You aren't _ever_ destroying the
> >multipath device.  That is why your __dm_destroy patch seemed off the
> >mark to me.
> 
> Hello Mike,
> 
> In case it wasn't clear, I want to drop the three patches you
> referred to. But I also want to clarify that my tests *do* trigger
> __dm_destroy(). If you have a look at the srp-test scripts then you
> will see that "dmsetup remove" is invoked after each test. What I
> see is that lock_page() and other page cache functions hang
> sporadically around the time the dm device is removed, most likely
> due to I/O that is submitted but never completed. That's why I
> started looking at the scsi-mq/blk-mq device removal code.

We're going round and round with a test that doesn't reflect 99% of the
usage that DM multipath sees.  I think we need to take a step back and
re-evaluate the test in question.

Could well be that there is some problem with outstanding IO racing with
DM multipath device removal.  BUT I'd really appreciate it if you could
make the 'dmsetup remove' phase secondary.  You're welcome to keep the
test you have (with DM device removal mixed with IO), make it
configurable with a flag or whatever, but it strikes me as much more of
a niche concern.  I'm not dismissing the need to make whatever it is
you're doing work, but we're seriously conflating all the variables in play.

Customers aren't removing their multipath devices a lot.  So can we do
this?

test step1:
Let's at least verify that DM multipath fault handling during normal IO
is reliable in the face of cable pulls (be they synthetic pulls or
real).

test step2:
Once the IO completes (after paths are restored) and fio ends _then_
DM multipath devices can be removed.

You'll note that all mptest tests follow this two-step pattern.
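
To make the ordering concrete, here is a minimal sketch of how a test
driver could sequence the two steps, with the removal-during-IO variant
kept behind a flag. This is not code from mptest or srp-test: the
function name build_plan, the phase names, the remove_during_io flag and
the fio job parameters are all hypothetical, and the path fail/restore
phases are left as placeholders because how paths are pulled depends on
the setup. It only builds the ordered plan; it does not run anything.

```python
from typing import List, Tuple

Phase = Tuple[str, List[str]]


def build_plan(mpath_name: str,
               remove_during_io: bool = False,
               runtime_s: int = 60) -> List[Phase]:
    """Return the ordered phases of one fault-injection test run.

    Each phase is (name, argv). An empty argv marks a placeholder whose
    implementation is site-specific (hypothetical hook, not a real tool).
    """
    start_io: Phase = ("start_io", [
        "fio", "--name=mpath-fault",
        f"--filename=/dev/mapper/{mpath_name}",
        "--rw=randrw", "--direct=1", "--ioengine=libaio",
        "--time_based", f"--runtime={runtime_s}",
    ])
    fail_paths: Phase = ("fail_paths", [])        # cable pull, real or synthetic
    restore_paths: Phase = ("restore_paths", [])  # bring the paths back
    wait_for_io: Phase = ("wait_for_io", [])      # wait until fio has exited
    remove_device: Phase = ("remove_device", ["dmsetup", "remove", mpath_name])

    if remove_during_io:
        # Niche variant: remove the device while IO is still outstanding.
        return [start_io, fail_paths, restore_paths, remove_device, wait_for_io]

    # Default: step 1 (IO plus path faults) completes before step 2 (removal).
    return [start_io, fail_paths, restore_paths, wait_for_io, remove_device]


if __name__ == "__main__":
    for name, argv in build_plan("mpatha"):
        print(name, " ".join(argv))
```

With the default flag value, the 'dmsetup remove' phase is always last,
so a hang seen before that point cannot be blamed on device removal.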