[dm-devel] Re: fastfail operation and retries

Thu Apr 21 21:31:46 UTC 2005

> -----Original Message-----
> From: linux-scsi-owner at vger.kernel.org 
> [mailto:linux-scsi-owner at vger.kernel.org] On Behalf Of Lars 
> Marowsky-Bree
> Sent: Thursday, April 21, 2005 5:19 PM
> To: device-mapper development; Andreas Herrmann
> Cc: Linux SCSI
> Subject: Re: [dm-devel] Re: fastfail operation and retries
> 
> On 2005-04-21T17:02:44, "goggin, edward" <egoggin at emc.com> wrote:
> 
> > Depending on the "queue_if_no_path" feature has the current undesirable
> > side-effect of requiring intervention of the user space multipath components
> > to reinstate at least one of the paths to a useable state in the multipath
> > target driver.  This dependency currently creates the potential for deadlock
> > scenarios since neither the user space multipath components nor the kernel,
> > for that matter, are currently architected to avoid them.
> 
> multipath-tools is, to a certain degree, architected to avoid them. And
> the kernel is meant to be, too - there's bugs and known FIXME's, but
> those are just bugs and we're taking patches gladly ;-)
> 
> > I think for now it may be better to try to avoid having to fail a path if it
> > is possible that an io error is not path related.
> 
> No. Basically every time out error creates a "dunno why" error right now
> - could be the storage system itself, could be the network in between.
>

I was really thinking of the code where the sense key/asc/ascq makes it
into the bio.

> A failover to another path is the obvious remedy; take for example the
> CX series where even if it's not the path, it's the SP, and failing over
> to the other SP will cure the problem.
> 
> If the storage at least rejects the IO with a specific error code, it
> can be worked around by a specific hw handler which doesn't fail the
> path but just causes the IO to be queued and retried; that's a pretty
> simple hardware handler to write.

I agree that we, and likely other storage vendors, could do a better job
here.  But that said, the multipathing code could also avoid failing a path
just because an io error occurred on that path.  Instead, failing a path
could be the sole responsibility of path testing (from user space), which
would reduce the likelihood of media errors being confused with path
connectivity ones.
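
As a rough illustration of that split, here is a minimal sketch in Python.
It is not dm-multipath code.  The sense key values are the standard SCSI
ones; the Action names, the classify_io_error() helper and the policy it
encodes are hypothetical and only meant to show the idea.  Note that no
branch fails the path: that decision is left to the user space path checker.

```python
from enum import Enum
from typing import Optional, Tuple

# Standard SCSI sense key values.
NOT_READY = 0x2
MEDIUM_ERROR = 0x3
ILLEGAL_REQUEST = 0x5
DATA_PROTECT = 0x7


class Action(Enum):
    """Hypothetical outcomes for a failed io; none of them fails the path."""
    FAIL_IO = "fail_io"                    # report the error to the caller
    REQUEUE = "requeue"                    # queue the io and retry it later
    RETRY_OTHER_PATH = "retry_other_path"  # resubmit on a different path


def classify_io_error(sense: Optional[Tuple[int, int, int]]) -> Action:
    """Pick an action from (sense key, asc, ascq), or None if no sense data.

    Assumed policy, for illustration only:
    - no sense data (e.g. a timeout): cause unknown, try another path
    - medium/illegal request/data protect: not path related, fail the io
    - not ready: the device may come back, so queue and retry
    - anything else: try another path
    """
    if sense is None:
        return Action.RETRY_OTHER_PATH
    key, _asc, _ascq = sense
    if key in (MEDIUM_ERROR, ILLEGAL_REQUEST, DATA_PROTECT):
        return Action.FAIL_IO
    if key == NOT_READY:
        return Action.REQUEUE
    return Action.RETRY_OTHER_PATH


if __name__ == "__main__":
    # MEDIUM ERROR, asc/ascq 0x11/0x00 (unrecovered read error)
    print(classify_io_error((MEDIUM_ERROR, 0x11, 0x00)))
    # No sense data at all, as with a timeout
    print(classify_io_error(None))
```

With a policy like this a media error is returned to the caller without
touching path state, while an error with no sense data is simply retried
elsewhere until the path checker has had its say.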

> 
> But quite frankly, storage subsystems which _reject_ all IO for a given
> time are just broken for reliable configurations. What good are they in
> multipath configurations if they fail _all_ paths at the same time? How
> can they even dare claim redundancy? We can build more or less smelly
> kludges around them, but it remains a problem to be fixed at the storage
> subsystem level IMNSHO.

I agree that it's unfortunate that the CLARiion is failing all paths
during NDU, even for a restricted amount of time.  Even so, it must
be dealt with as is.

> 
> 
> Sincerely,
>     Lars Marowsky-Brée <lmb at suse.de>
> 
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business