[dm-devel] [PATCH] multipath-tools: intermittent IO error accounting to improve reliability

Mon Aug 28 11:13:04 UTC 2017

On Thu, 2017-08-24 at 17:59 +0800, Guan Junxiong wrote:
> Hi, Hannes
>      Thanks for your comments. My reply inline.
> 
> On 2017/8/22 23:37, Hannes Reinecke wrote:
> > - As we now have advanced path selectors the overall consensus is
> > that those selectors _should_ be able to handle these situations;
> > ie for a flaky path the path selector should switch away from it
> > and move the load to other, unaffected paths.
> > Have you checked if the existing path selectors are able to cope
> > with this situation? If not, why not?
> 
> The existing path selectors in kernel space are able to fail_path
> the flaky path when certain IO errors occur. However, only the
> user-space multipathd checkers can detect whether the path is up.
> Therefore, for a path with long-lasting intermittent IO errors (a
> flaky path), the path selectors suffer from taking the path in and
> out _again_ _and_ _again_.
> Even when san_path_err_threshold, san_path_err_forget_rate and
> san_path_err_recovery_time are turned on, the sampling interval of
> the path checkers is so coarse that they don't see what happens in
> the middle of the sampling interval.

I am concerned that we are introducing too many different
regulation algorithms. We have path selectors, path checkers,
san_path_err_XXX, and now path_io_err_XXX as well. We must be certain
that these play together in a well-defined fashion (most importantly,
that one mechanism never activates a path while another is in the
process of tearing it down). We must also avoid causing user
confusion, as multipath configuration is already a daunting task for
many. Your new algorithm should be mutually exclusive with
san_path_err_XXX. Perhaps we should even consider dropping the
san_path_err_XXX options entirely if we choose to adopt your new
approach.
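
To illustrate what "mutually exclusive" could mean at configuration
time, here is a minimal sketch. The struct and function names are
hypothetical and are not taken from multipath-tools; it also assumes
that a value <= 0 means "option not set" and that each mechanism is
only active when all of its options are set.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the options discussed in this
 * thread. This is NOT the real multipath-tools configuration struct.
 * Assumption: a value <= 0 means the option is not set. */
struct flaky_path_conf {
	int san_path_err_threshold;
	int san_path_err_forget_rate;
	int san_path_err_recovery_time;
	int path_io_err_num_threshold;
	int path_io_err_recovery_time;
};

/* Assumption: san_path_err_XXX is active only if all three are set. */
static bool san_path_err_enabled(const struct flaky_path_conf *c)
{
	return c->san_path_err_threshold > 0 &&
	       c->san_path_err_forget_rate > 0 &&
	       c->san_path_err_recovery_time > 0;
}

/* Assumption: path_io_err_XXX is active only if both are set. */
static bool path_io_err_enabled(const struct flaky_path_conf *c)
{
	return c->path_io_err_num_threshold > 0 &&
	       c->path_io_err_recovery_time > 0;
}

/* Returns true if at most one of the two mechanisms is active. */
static bool flaky_conf_is_consistent(const struct flaky_path_conf *c)
{
	return !(san_path_err_enabled(c) && path_io_err_enabled(c));
}
```

A configuration parser could reject (or warn about) any setup for
which such a check fails, so that users never end up with both
mechanisms competing over the same path.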

> > - However flaky path detection is implemented, it will work most
> > efficiently when moving I/O _away_ from the flaky path. But in
> > doing so we don't have a mechanism to figure out if and when the
> > path is useable again (as we're not sending I/O to it, and the TUR
> > or any other path checker might not be affected by the flaky
> > behaviour).
> > So when should we declare a path as 'good' again?
> 
> In this patch, the flaky path is kept out of use for only
> path_io_err_recovery_time seconds, and only if there is more than
> one active path. After path_io_err_recovery_time seconds, the flaky
> path returns to normal handling, which means that when the path
> checker detects that it is up, it is reinstated as a usable path.
> 
> However, how about scheduling the intermittent IO checking process
> again when the path_io_err_recovery_time seconds expire? If the
> number of IO errors is less than path_io_err_num_threshold, we
> declare the path as 'good' again.

That sounds like a reasonable improvement over the original patch.
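
A minimal sketch of that re-check cycle, to make the intended state
transitions explicit. All type and function names are hypothetical;
only path_io_err_recovery_time and path_io_err_num_threshold come from
the patch discussion. Holding the path off again when the re-check
fails is an assumption, since the proposal only states the 'good' case.

```c
/* Hypothetical per-path state for the proposed re-check cycle. */
enum flaky_state {
	FLAKY_NORMAL,      /* path checker may reinstate the path */
	FLAKY_HELD_OFF,    /* kept out of use until the timer expires */
	FLAKY_RECHECKING   /* timer expired, IO errors are counted again */
};

struct flaky_path {
	enum flaky_state state;
	int holdoff_left;  /* seconds left in the hold-off period */
};

/* Start (or restart) the hold-off period for a path judged flaky. */
static void flaky_path_hold_off(struct flaky_path *p,
				int path_io_err_recovery_time)
{
	p->state = FLAKY_HELD_OFF;
	p->holdoff_left = path_io_err_recovery_time;
}

/* Advance time; once the hold-off expires, schedule a re-check. */
static void flaky_path_tick(struct flaky_path *p, int seconds)
{
	if (p->state != FLAKY_HELD_OFF)
		return;
	p->holdoff_left -= seconds;
	if (p->holdoff_left <= 0) {
		p->holdoff_left = 0;
		p->state = FLAKY_RECHECKING;
	}
}

/* Evaluate the re-check: fewer errors than the threshold means the
 * path is 'good' again. Assumption: otherwise it is held off again. */
static void flaky_path_recheck_done(struct flaky_path *p, int io_errors,
				    int path_io_err_num_threshold,
				    int path_io_err_recovery_time)
{
	if (p->state != FLAKY_RECHECKING)
		return;
	if (io_errors < path_io_err_num_threshold)
		p->state = FLAKY_NORMAL;
	else
		flaky_path_hold_off(p, path_io_err_recovery_time);
}
```

With this shape a path only becomes eligible for reinstatement after
it has passed a fresh error count, rather than merely because the
timer ran out.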

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)