[dm-devel] RFC for multipath queue_if_no_path timeout.

Fri Sep 27 16:32:09 UTC 2013

On Fri, 2013-09-27 at 10:06 +0200, Hannes Reinecke wrote:
> On 09/27/2013 08:07 AM, Hannes Reinecke wrote:
> > On 09/27/2013 01:49 AM, Mike Snitzer wrote:
> >> On Thu, Sep 26 2013 at  7:22pm -0400,
> >> Alasdair G Kergon <agk at redhat.com> wrote:
> >>
> >>> On Thu, Sep 26, 2013 at 10:47:13AM -0700, Frank Mayhar wrote:
> >>>> Launching it from ramdisk won't help, particularly, since it still goes
> >>>> through the block layer.  The other stuff won't help if a (potentially
> >>>> unrelated) bug in the daemon happens to be being tickled at the same
> >>>> time, or if some dependency happens to be broken and _that's_ what's
> >>>> preventing the daemon from making progress.
> >>>  
> >>> Then put more effort into debugging your daemon so it doesn't have
> >>> bugs that make it die?  Implement the timeout in a robust independent
> >>> daemon if it's other code there that's unreliable?
> >>>
> >>>> And as far as lvm2 and multipath-tools, yeah, they cope okay in the kind
> >>>> of environments most people have, but that's not the kind of environment
> >>>> (or scale) we have to deal with.
> >>>
> >>> In what way are your requirements so different that a locked-into-memory
> >>> monitoring daemon cannot implement this timeout?
> >>
> >> Frank, I had a look at your patch.  It leaves a lot to be desired; I was
> >> starting to clean it up but ultimately found myself agreeing with
> >> Alasdair's original point: that this policy should be implemented in the
> >> userspace daemon.
> >>
> > _Actually_ there is a way this could be implemented properly:
> > implement a blk_timeout function.
> > 
> > Thing is, every request_queue might have a timeout function
> > implemented, whose goal is to abort requests which are beyond that
> > timeout. E.g. SCSI uses that for the dev_loss_tmo mechanism.
> > 
> > Multipath, being request-based, could easily implement
> > the same mechanism, namely have a blk_timeout function which would
> > just re-arm the timeout in the default case, but abort any queued
> > I/O (after a timeout) if all paths are down.
> > 
> > Hmm. I'll see to drafting up a PoC.
> > 
> And indeed, here it is.
> 
> Completely untested, just to give you an idea of what I was going on
> about. Let's see if I can put this to the test somewhere...
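
To make the re-arm/abort idea quoted above concrete, here is a minimal
userspace sketch in Python. It is a toy model only: it is not kernel code
and not the PoC patch, and every name in it (ToyMultipathQueue, set_paths,
on_timer, the "rearm"/"abort" return values) is hypothetical and chosen
for illustration. It assumes the timer callback is invoked periodically
with the current time in seconds.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyMultipathQueue:
    """Toy model of a queue that holds I/O while no path is usable."""
    timeout: float                          # seconds to tolerate "all paths down"
    paths_up: int = 1                       # number of currently usable paths
    no_path_since: Optional[float] = None   # time the last usable path went away
    queued: List[str] = field(default_factory=list)     # I/O waiting for a path
    completed: List[str] = field(default_factory=list)  # I/O sent down a path
    failed: List[str] = field(default_factory=list)     # I/O aborted by the timer

    def set_paths(self, count: int, now: float) -> None:
        """Record a path state change and flush the queue if a path is back."""
        self.paths_up = count
        if count == 0:
            if self.no_path_since is None:
                self.no_path_since = now
        else:
            self.no_path_since = None
            self.completed.extend(self.queued)
            self.queued.clear()

    def submit(self, io: str, now: float) -> None:
        """Dispatch immediately if a path is usable, otherwise queue."""
        if self.paths_up > 0:
            self.completed.append(io)
        else:
            self.queued.append(io)

    def on_timer(self, now: float) -> str:
        """Timer callback: re-arm by default, abort queued I/O on expiry."""
        if self.paths_up > 0 or self.no_path_since is None:
            return "rearm"                  # default case: nothing to do
        if now - self.no_path_since < self.timeout:
            return "rearm"                  # all paths down, but not for long enough
        self.failed.extend(self.queued)     # all paths down past the timeout
        self.queued.clear()
        return "abort"
```

In this model, I/O submitted while paths_up is 0 sits in queued until
either set_paths() reports a usable path (the I/O is dispatched) or
on_timer() sees that the outage has lasted at least timeout seconds
(the I/O is failed).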

Thanks, Hannes!  I'll grab this and test it today.  I clearly don't know
enough about the block layer, since using blk_timeout never even crossed
my mind.
-- 
Frank Mayhar
310-460-4042