[dm-devel] [PATCH v7 0/2] multipath-tools: intermittent IO error accounting to improve reliability

Tue Oct 24 01:57:21 UTC 2017

Hi Christophe and All,

This patch set adds a new method of path state checking based on IO error
accounting. This is useful in many scenarios, such as intermittent IO errors
on a path due to intermittent frame drops, intermittent corruptions, network
congestion or a shaky link.

This patch set matters for the following reason (quoted from the discussion
with Muneendra, Brocade):

There are typically two types of SAN network problems that are categorized as
marginal issues. These issues are by nature not permanent: they come and go
over time.
1) Switches in the SAN can have intermittent frame drops or intermittent
   frame corruptions due to a bad optics cable (SFP) or similar wear-and-tear
   port issues. This causes ITL flows that go through the faulty switch/port
   to intermittently experience frame drops.
2) There exist SAN topologies where some switch ports in the fabric become
   the only conduit for many different ITL (host--target--LUN) flows across
   multiple hosts. These single network paths are essentially shared across
   multiple ITL flows. Under these conditions, if the port link bandwidth
   cannot handle the sum of the bandwidth of the shared ITL flows going
   through the single path, we can see intermittent network congestion
   problems. This condition is called network oversubscription. The
   intermittent congestion can delay SCSI exchange completion (an increase
   in I/O latency is observed).

To overcome the above network issues and many more such target issues, there
are frame level retries that are done in HBA device firmware and I/O retries
in the SCSI layer. These retries might succeed for any of the following
reasons:
1) The intermittent switch/port issue is not observed during the retry.
2) The retry I/O is a new SCSI exchange. This SCSI exchange can take an
   alternate SAN path for the ITL flow, if such a SAN path exists.
3) Network congestion disappears momentarily because the net I/O bandwidth
   coming from multiple ITL flows on the single shared network path is
   something the path can handle.

However, in some cases we have seen that I/O retries don't succeed because
the retry I/Os hit a SAN network path that has an intermittent switch/port
issue and/or network congestion.

On the host we thus see configurations with two or more ITL paths sharing the
same target/LUN and going through two or more HBA ports. These HBA ports are
connected through two or more SANs to the same target/LUN.
If the I/O fails at the multipath layer, the ITL path is put into the Failed
state. Because of the marginal nature of the network, the next health check
command sent from the multipath layer might succeed, which puts the ITL path
back into the Active state. You end up seeing the DM path state going through
Active, Failed, Active transitions. This results in an overall reduction in
application I/O throughput and sometimes in application I/O failures (because
of timing constraints). All this can happen because of I/O retries and I/O
requests moving across multiple paths of the DM device. Note that on the host,
all I/O retries on a single path and all I/O movement across multiple paths
slow down the forward progress of new application I/O. The reason is that
these I/O re-queue actions are given higher priority than the newer I/O
requests coming from the application.

The above condition of the ITL path is hence called "marginal".

What we desire is for the DM to deterministically categorize an ITL path as
"marginal" and move all the pending I/Os from the marginal path to an Active
path. This will help in meeting application I/O timing constraints. We also
want the capability to automatically reinstate the marginal path as Active
once the marginal condition in the network is fixed.

Here is a description of the implementation:
1) PATCH 1/2 implements the algorithm, which sends a number of continuous IOs
to a path that suffers two failed events in less than a given time. Those
IOs are sent at a fixed rate of 10 Hz.
2) PATCH 2/2 discards the original algorithm (the san_path_err_XXX feature)
because the sample interval of the path checkers is so coarse that it cannot
see what happens in the middle of the interval. PATCH 1/2 provides a better
method.
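
For readers who want a concrete picture of the pre-check in PATCH 1/2, here
is a minimal sketch of the "two failed events in less than a given time"
test and of the number of test IOs implied by the 10 Hz rate. It is not taken
from the patch: struct path_fail_track, note_path_failure() and
ios_per_sample() are hypothetical names used only for illustration, and the
real code in io_err_stat.c may be organized differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Fixed test-IO rate stated in the cover letter. */
#define SAMPLE_RATE_HZ 10

/* Hypothetical per-path bookkeeping; the field name is illustrative only. */
struct path_fail_track {
	time_t last_fail;	/* time of the previous failure, 0 if none */
};

/*
 * Record a path failure at time `now`. Return true when the previous
 * failure happened less than `double_failed_time` seconds ago, i.e. the
 * path is a candidate for IO error accounting.
 */
static bool note_path_failure(struct path_fail_track *t, time_t now,
			      time_t double_failed_time)
{
	assert(double_failed_time > 0);

	bool suspect = t->last_fail != 0 &&
		       now - t->last_fail < double_failed_time;

	t->last_fail = now;
	return suspect;
}

/* Number of test IOs sent during a window of `sample_time` seconds. */
static long ios_per_sample(long sample_time)
{
	return sample_time * SAMPLE_RATE_HZ;
}
```

With a 60 second window, a first failure at t=100 is only recorded, a second
one at t=130 marks the path as suspect, and a later one at t=300 does not,
because the previous failure is too old.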

Changes from V6:
* fix the warning of unwrapped commit description in patch 1/2 
* add Reviewed-by tag of Muneendra
* add detailed scenario description in the cover letter

Changes from V5:
* rebase on the latest release 0.7.3 

Changes from V4:
* path_io_err_XXX -> marginal_path_err_XXX. (Muneendra)
* add one more parameter named marginal_path_double_failed_time instead
  of the fixed 60 seconds for the pre-checking of a shaky path. (Martin)
* fix for "reschedule checking after %d seconds" log 
* path_io_err_recovery_time -> marginal_path_err_recheck_gap_time.
* put the marginal path into PATH_SHAKY instead of PATH_DELAYED 
* Modify the commit comments to sync with the changes above.

Changes from V3:
* add a patch to discard the san_path_err_XXX feature
* fail the path in the kernel before enqueueing the path for checking
  rather than after knowing the checking result to make it more
  reliable. (Martin)
* use posix_memalign instead of manual alignment for direct IO buffer. (Martin) 
* use PATH_MAX to avoid certain compiler warning when opening file
  rather than FILE_NAME_SIZE. (Martin)
* discard unnecessary sanity check when getting block size (Martin)
* do not return 0 in send_each_async_io if io_starttime of a path is
  not set (Martin)
* Wait 10ms instead of 60 seconds if every path is down. (Martin)
* rename handle_async_io_timeout to poll_async_io_timeout and use a polling
  method because io_getevents does not return 0 if there are both timed-out
  IOs and normal IOs.
* rename hit_io_err_recover_time to hit_io_err_recheck_time
* modify the multipath.conf.5 and commit comments to keep sync with the
  above changes
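
As an illustration of the posix_memalign item above, here is a minimal sketch
of allocating an aligned buffer for direct IO. The helper names
alloc_dio_buf() and is_aligned() are hypothetical and not taken from the
patch. The sketch assumes the block size is a power of two and at least
sizeof(void *), which is what posix_memalign requires of its alignment
argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed buffer of `blk_size` bytes whose address is aligned to
 * `blk_size`, as direct IO typically requires. Returns NULL on failure.
 * The caller releases the buffer with free().
 */
static void *alloc_dio_buf(size_t blk_size)
{
	void *buf = NULL;

	/* posix_memalign needs a power-of-two multiple of sizeof(void *) */
	assert(blk_size >= sizeof(void *) &&
	       (blk_size & (blk_size - 1)) == 0);

	if (posix_memalign(&buf, blk_size, blk_size) != 0)
		return NULL;
	memset(buf, 0, blk_size);
	return buf;
}

/* Check that `p` is aligned to `align` bytes. */
static bool is_aligned(const void *p, size_t align)
{
	return ((uintptr_t)p % align) == 0;
}
```

Compared with over-allocating and rounding the pointer up by hand, this keeps
a single pointer that can be passed straight to free().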

Changes from V2:
* fix unconditional reschedule forever
* use scripts/checkpatch.pl in Linux to clean up informal coding style
* fix "continous" and "internel" typos

Changes from V1:
* send continuous IO instead of a single IO in a sample interval (Martin)
* when the recovery time expires, we reschedule the checking process (Hannes)
* Use the error rate threshold as a permillage instead of an IO number (Martin)
* Use a common io_context for libaio for all paths (Martin)
* Other small fixes (Martin)
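
To make the permillage threshold from the V1 changes concrete, here is a
minimal sketch of how an error rate per thousand IOs could be computed and
compared with a threshold. The function names and the strict "greater than"
comparison are assumptions made for illustration; they are not taken from the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Error rate in permillage (errors per 1000 IOs), rounded down. */
static long err_rate_permille(long io_err_nr, long io_nr)
{
	assert(io_nr > 0 && io_err_nr >= 0 && io_err_nr <= io_nr);
	return io_err_nr * 1000 / io_nr;
}

/*
 * Return true if the measured error rate exceeds `threshold` (also in
 * permillage). Assumption: a rate equal to the threshold is tolerated.
 */
static bool path_is_marginal(long io_err_nr, long io_nr, long threshold)
{
	return err_rate_permille(io_err_nr, io_nr) > threshold;
}
```

For example, 12 failed IOs out of 1200 give a rate of 10 permille, which does
not exceed a threshold of 10, while 25 failed IOs out of 1200 give 20 permille
and do. Expressing the threshold as a ratio keeps it independent of how many
IOs a sample window happens to contain.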

Junxiong Guan (2):
  multipath-tools: intermittent IO error accounting to improve
    reliability
  multipath-tools: discard san_path_err_XXX feature

 libmultipath/Makefile      |   5 +-
 libmultipath/config.c      |   3 -
 libmultipath/config.h      |  21 +-
 libmultipath/configure.c   |   7 +-
 libmultipath/dict.c        |  88 +++---
 libmultipath/io_err_stat.c | 744 +++++++++++++++++++++++++++++++++++++++++++++
 libmultipath/io_err_stat.h |  15 +
 libmultipath/propsel.c     |  70 +++--
 libmultipath/propsel.h     |   7 +-
 libmultipath/structs.h     |  15 +-
 libmultipath/uevent.c      |  32 ++
 libmultipath/uevent.h      |   2 +
 multipath/multipath.conf.5 |  89 ++++--
 multipathd/main.c          | 140 ++++-----
 14 files changed, 1043 insertions(+), 195 deletions(-)
 create mode 100644 libmultipath/io_err_stat.c
 create mode 100644 libmultipath/io_err_stat.h

-- 
2.11.1