[dm-devel] dm-multipath "shaky SAN detection" is insufficient for intermittent errors.

Erwin van Londen erwin at erwinvanlonden.net
Tue Mar 23 03:17:13 UTC 2021


Hello All,

This topic may have been discussed before, although I have not been able
to find it on this list.

The "shaky SAN" detection method currently seems to be based on the
availability of the remote target ports and how often they
disappear/reappear, as per HBA state changes on that remote target.

What we see in SAN troubleshooting is that the majority of issues are
related to frame corruption and missing frames, which lead to incomplete
FC sequences/exchanges and hence plain IO errors. I've been doing some
tests with an FC jammer/analyser, doing all sorts of weird things: from
changing a SCSI data payload or CRC, and therefore corrupting the frame,
to almost persistently killing off command or status frames of normal
IOs. As long as I leave the TUR alone and that checker keeps getting
correct statuses back, it is left to the FC stack and/or the arrays to
chuck an HBA offline, forcing multipath to halt IOs to that path.

My request is: would it be possible, instead of (or in addition to)
checking for disappearing/reappearing targets, to monitor for actual IO
errors on data transfer, where commands time out or end up in any check
condition? That could then be used either to halt IOs entirely or to
feed the marginal_path_err logic, so that the path is moved into a
holding queue and checked in the background for subsequent errors. The
marginal_path_err_sample_time, marginal_path_err_rate_threshold and
marginal_path_err_recheck_gap_time settings would then determine whether
that path can be used again or is permanently failed.
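
As a rough illustration of the behaviour requested above, the Python
sketch below models a per-path state machine driven by data IO results
rather than by TUR results. It is not multipathd code. The class, state
and function names are hypothetical, the three parameters are only
loosely modelled on the marginal path settings, and counting the
threshold in errors per 1000 IOs is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class PathState(Enum):
    ACTIVE = "active"   # path carries normal IO
    HELD = "held"       # path parked in a holding queue
    FAILED = "failed"   # path permanently failed


@dataclass
class PathMonitor:
    """Hypothetical per-path monitor fed with the result of every data IO."""
    sample_time: float         # length of one sampling window, in seconds
    err_rate_threshold: float  # tolerated errors per 1000 IOs (assumed unit)
    gap_time: float            # seconds a held path waits before re-sampling
    state: PathState = PathState.ACTIVE
    _window_start: float = 0.0
    _held_at: float = 0.0
    _ios: int = 0
    _errors: int = 0

    def record_io(self, now: float, ok: bool) -> PathState:
        """Account for one completed IO (ok=False: timeout or check condition)."""
        if self.state is PathState.FAILED:
            return self.state
        if self.state is PathState.HELD and now - self._held_at < self.gap_time:
            return self.state  # still inside the gap, nothing is sampled
        if self._ios == 0:
            self._window_start = now  # first IO opens a new window
        self._ios += 1
        if not ok:
            self._errors += 1
        if now - self._window_start >= self.sample_time:
            self._evaluate(now)
        return self.state

    def _evaluate(self, now: float) -> None:
        """Close the current window and decide what happens to the path."""
        rate = 1000.0 * self._errors / self._ios
        self._ios = self._errors = 0
        if rate <= self.err_rate_threshold:
            self.state = PathState.ACTIVE   # clean window: (re)instate
        elif self.state is PathState.ACTIVE:
            self.state = PathState.HELD     # first bad window: hold the path
            self._held_at = now
        else:
            self.state = PathState.FAILED   # bad re-sample: fail permanently


def replay(monitor: PathMonitor, start: int, stop: int, fail_every: int) -> PathState:
    """Feed one IO per second; every IO whose timestamp divides by fail_every fails."""
    for t in range(start, stop + 1):
        ok = fail_every == 0 or t % fail_every != 0
        monitor.record_io(float(t), ok)
    return monitor.state
```

With sample_time=10, err_rate_threshold=10 and gap_time=60, a window in
which every second IO fails moves the path to HELD. After the gap, a
clean window reinstates it, while another bad window fails it for good.
A TUR that keeps succeeding never enters into the decision.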

I've been doing SAN troubleshooting for 20 years and the majority of
the problems are related to these intermittent issues of frame
corruption and/or discards. Actual flipping paths, where a target goes
into an offline/online state, are far less common.

Your feedback is appreciated.

Thank you

Erwin van Londen