[dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

James Bottomley James.Bottomley at HansenPartnership.com
Thu Apr 28 17:33:42 UTC 2016


On Thu, 2016-04-28 at 16:19 +0000, Knight, Frederick wrote:
> There are multiple possible situations being intermixed in this
> discussion.  First, I assume you're talking only about random access
> devices (if you try transport-level error recovery on a sequential
> access device - tape or SMR disk - there are lots of additional
> complexities).

Tape figured prominently in the reset discussion.  Resetting beyond the
LUN can have a grave impact on long-running jobs (mostly on tapes).

> Failures can occur at multiple places:
> a) Transport layer failures that the transport layer is able to
> detect quickly;
> b) SCSI device layer failures that the transport layer never even
> knows about.
> 
> For (a) there are two competing goals.  If a port drops off the
> fabric and comes back again, should you be able to just recover and
> continue.  But how long do you wait during that drop?  Some devices
> use this technique to "move" a WWPN from one place to another.  The
> port drops from the fabric, and a short time later, shows up again
> (the WWPN moves from one physical port to a different physical port).
> There are FC driver layer timers that define the length of time
> allowed for this operation.  The goal is fast failover, but not too
> fast - because too fast will break this kind of "transparent
> failover".  This timer also allows for the "OH crap, I pulled the
> wrong cable - put it back in; quick" kind of stupid user bug.

I think we already have this sorted out with the dev loss timeout,
which is implemented in the transport.  It's the grace period we allow
before acting on a path loss.
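As a rough illustration (a toy model, not the kernel's actual FC transport code), the grace-period behaviour can be sketched as a small state machine: a path that drops is only marked failed once dev_loss_tmo expires, so a quick re-plug or a WWPN move that completes within the window never surfaces as a failure to the upper layers.

```python
class Path:
    """Toy model of the dev_loss_tmo grace period.

    Names and structure are illustrative only; the real logic lives
    in the kernel's FC transport class.
    """

    def __init__(self, dev_loss_tmo=30.0):
        self.dev_loss_tmo = dev_loss_tmo  # seconds of grace after a drop
        self.state = "online"
        self._dropped_at = None

    def port_dropped(self, now):
        # Transport saw the port leave the fabric: start the timer,
        # but don't fail the path yet (the port is merely "blocked").
        self.state = "blocked"
        self._dropped_at = now

    def port_returned(self, now):
        # Port came back (cable re-plugged, WWPN moved to another
        # physical port): inside the grace period, recover transparently.
        if self.state == "blocked" and now - self._dropped_at < self.dev_loss_tmo:
            self.state = "online"
            self._dropped_at = None

    def tick(self, now):
        # Called periodically; only once dev_loss_tmo has expired do we
        # declare the path dead and let multipath fail over.
        if self.state == "blocked" and now - self._dropped_at >= self.dev_loss_tmo:
            self.state = "failed"

# Quick re-plug within the window: no failover ever happens.
p = Path(dev_loss_tmo=30.0)
p.port_dropped(now=0.0)
p.port_returned(now=5.0)
p.tick(now=40.0)
print(p.state)  # -> online

# A drop that outlasts the timer: the path is failed.
q = Path(dev_loss_tmo=30.0)
q.port_dropped(now=0.0)
q.tick(now=31.0)
print(q.state)  # -> failed
```

The tension described above is visible in the single knob: a short timeout gives fast failover but breaks transparent WWPN moves; a long one tolerates the "wrong cable" mistake but delays recovery.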

> For (b) the transport never has a failure.  A LUN (or a group of
> LUNs) has an ALUA transition from one set of ports to a different
> set of ports.  Some of the LUNs on the port continue to work just
> fine, but others enter ALUA TRANSITION state so they can "move" to a
> different part of the hardware.  After the move completes, you now
> have different sets of optimized and non-optimized paths (or possibly
> standby, or unavailable).  The transport will never even know this
> happened.  This kind of "failure" is handled by the SCSI layer
> drivers.

OK, so ALUA did come up as well, I just forgot.  Perhaps I should back
off a bit and give the historical reasons why dm became our primary
path failover system.  It's because for the first ~15 years of Linux we
had no separate transport infrastructure in SCSI (and, to be fair, T10
didn't either).  In fact, all SCSI drivers implemented their own
variants of transport stuff.  This meant there was initial pressure to
make the transport failover stuff driver specific and the answer to
that was a resounding "hell no!" so dm (and md) became the de-facto
path failover standard because there was nowhere else to put it.  The
transport infrastructure didn't really become mature until 2006-2007,
well after this decision was made.  However, now that we have the
transport infrastructure, the question of whether we can use it for
path failover isn't unreasonable.  If we abstract it correctly, it
could become a
library usable by all our current transports, so we might only need a
single implementation.

For ALUA specifically (and other weird ALUA-like implementations), the
handling code actually sits in drivers/scsi/device_handler, so it could
also be used by the transport code to make path decisions.  The point
here is that even if we implement path failover at the transport level,
we have more information available to make the failover decision than
the transport alone would strictly know.
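To make the ALUA-informed decision concrete, here is a minimal sketch (hypothetical names, not kernel or multipath-tools code) of path selection over the access states Fred mentions: prefer active/optimized, fall back to active/non-optimized, and treat a TRANSITIONING path as retry-later rather than failed.

```python
# Relative preference for ALUA access states; None means the path is
# not usable right now (but "transitioning" is not a hard failure).
ALUA_PRIORITY = {
    "active/optimized": 0,
    "active/non-optimized": 1,
    "standby": 2,          # would need activation before real use
    "transitioning": None, # LUN is moving; retry later, don't fail
    "unavailable": None,
}

def choose_path(paths):
    """Return the best usable path name, or None if only
    transitioning/unavailable paths remain (caller retries)."""
    usable = [(ALUA_PRIORITY[state], name)
              for name, state in paths.items()
              if ALUA_PRIORITY[state] is not None]
    if not usable:
        return None
    return min(usable)[1]

# Before a transition: sda is the optimized path.
paths = {"sda": "active/optimized", "sdb": "active/non-optimized"}
print(choose_path(paths))  # -> sda

# During a controller move, sda goes TRANSITIONING; the transport sees
# no failure at all, but the SCSI layer quietly shifts I/O to sdb.
paths["sda"] = "transitioning"
print(choose_path(paths))  # -> sdb
```

The key point the sketch illustrates is that none of these state changes are transport events: only a layer that issues REPORT TARGET PORT GROUPS (as the device handler does) can make this call.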

James




