[dm-devel] [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability

Martin Wilck mwilck at suse.com
Mon Sep 18 19:51:37 UTC 2017


On Mon, 2017-09-18 at 22:36 +0800, Guan Junxiong wrote:
> Hi Muneendra,
> 
> Thanks for your feedback.  My comments are inline below.
> 
> On 2017/9/18 20:53, Muneendra Kumar M wrote:
> > Hi Guan,
> > This a good effort for detecting the intermittent IO error
> > accounting to improve reliability.
> > Your new algorithm is mutually exclusive with san_path_err_XXX.
> > It resolves the issue you mentioned below:
> > > > Even when san_path_err_threshold, san_path_err_forget_rate and
> > > > san_path_err_recovery_time are turned on,
> > > > the detection sample interval of the path checker is so
> > > > big/coarse that it doesn't see what happens in the middle of
> > > > the sample interval.
> > 
> > But I have some concerns.
> > 
> > Correct me if my understanding of the lines below is wrong:
> > > > On a particular path, when path-failing events occur twice in
> > > > 60 seconds due to IO errors, multipathd will fail the path and
> > > > enqueue it into a queue; each member of this queue is sent a
> > > > series of continuous direct-read asynchronous IOs at a fixed
> > > > sample rate of 10 Hz.
> > 
> > Once we hit the above condition (2 errors in 60 secs), we keep
> > injecting asynchronous IOs at a fixed sample rate of 10 Hz for
> > path_io_err_sample_time.
> > And during this path_io_err_sample_time, if we hit the
> > path_io_err_rate_threshold, then we will not reinstate this path
> > for path_io_err_recovery_time.
> > Is this understanding correct?
> > 
> 
> Partially correct.
> If we hit the above condition (2 errors in 60 secs), we will fail the
> path first, before injecting the asynchronous IOs, so that the test
> is not affected by other IO.
> And after this path_io_err_sample_time:
> (1) if we hit the path_io_err_rate_threshold, the failed path will
> stay unchanged, and then after the path_io_err_recovery_time (which
> is a confusing name, sorry, I will rename it to "recheck"), we will
> schedule this IO error checking process again.
> (2) if we do NOT hit the path_io_err_rate_threshold, the failed path
> will be reinstated by the path-checking thread within a tick (1
> second).
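
To make sure we mean the same thing, here is a minimal sketch of that
flow as I read it (all helper names are invented for illustration;
this is not the actual code from the patch):

    /* Sketch only -- invented helper names, not the actual patch code. */
    struct path;                                     /* opaque here */

    extern unsigned int path_io_err_sample_time;     /* seconds */
    extern double       path_io_err_rate_threshold;  /* failed/total IOs */
    extern unsigned int path_io_err_recovery_time;   /* seconds */

    extern void   fail_path(struct path *pp);
    extern double run_async_read_probes(struct path *pp, int hz,
                                        unsigned int duration);
    extern void   schedule_io_err_recheck(struct path *pp,
                                          unsigned int delay);
    extern void   allow_reinstate(struct path *pp);

    static void io_err_check_path(struct path *pp)
    {
            /* fail the path first, so normal IO doesn't disturb the test */
            fail_path(pp);

            /* direct-read asynchronous probes at 10 Hz for the whole
             * sample window */
            double err_rate = run_async_read_probes(pp, 10,
                                            path_io_err_sample_time);

            if (err_rate >= path_io_err_rate_threshold)
                    /* still flaky: keep the path failed, run this whole
                     * check again after the "recheck" delay */
                    schedule_io_err_recheck(pp, path_io_err_recovery_time);
            else
                    /* clean sample: let the checker thread reinstate
                     * the path on its next tick (1 second) */
                    allow_reinstate(pp);
    }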
> 
> 
> > If the above understanding is correct, then my concern is:
> > 1) On a particular path, if we are seeing continuous errors but not
> > within 60 secs (maybe one every 120 secs), how do we handle this?
> > It is still a shaky link.
> > This is what our customers are pointing out.
> > And if I am not wrong, the new algorithm only comes into play if
> > path-failing events occur twice in 60 seconds.
> > 
> > Then this will not solve the intermittent IO error issue which we
> > are seeing, as data is still going over the shaky path.
> > I think this is the place where we need to pull in
> > san_path_err_forget_rate.
> > 
> 
> Yes. I have thought about using adjustable parameters such as
> san_path_err_pre_check_time and san_path_err_threshold to cover ALL
> the scenarios the user encounters.
> In the fixed example above, san_path_err_pre_check_time is set to 60
> seconds and san_path_err_threshold is set to 2.
> However, if I adopt this, we have 5 parameters
> (san_path_err_pre_check_time and san_path_err_threshold + 3
> path_io_err_XXXs) to support this feature. You know, multipath.conf
> configuration is becoming more and more daunting, as Martin pointed
> out in V1 of this patch.
> 
> But now, maybe it is acceptable for users to use the 5 parameters if
> we set san_path_err_pre_check_time and san_path_err_threshold to
> default values such as 60 seconds and 2, respectively.
> **Martin**, **Muneendra**, how about this compromise? If it is OK, I
> will update it in the next version of the patch.

Hm, that sounds a lot like san_path_err_threshold and
san_path_err_forget_rate, which you were about to remove.
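
Just to spell out what we would be asking users to configure: with
your proposal, a multipath.conf snippet would look roughly like this
(the two defaults are the ones you suggest above; the path_io_err_XXX
values are placeholders, since no defaults for them have been agreed
on in this thread):

    defaults {
            # two failures within this window trigger the IO error check
            san_path_err_pre_check_time     60
            san_path_err_threshold          2
            # sampling phase of the new check -- placeholders only
            path_io_err_sample_time         <seconds>
            path_io_err_rate_threshold      <error ratio>
            path_io_err_recovery_time       <seconds>
    }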

Maybe we can simplify the algorithm by checking paths which fail in a
given time interval after they've been reinstated? That would be one
less additional parameter.
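
A rough sketch of what I mean, with invented names (the single tunable
is the window after reinstatement; see the naming suggestion further
down):

    #include <time.h>

    /* Sketch only: trigger the IO error check when a path fails again
     * soon after it was reinstated -- one tunable instead of two. */
    struct path {
            time_t reinstate_time;          /* 0 = not reinstated yet */
            /* ... */
    };

    extern unsigned int san_path_double_fault_time;  /* seconds */
    extern void enqueue_io_err_check(struct path *pp);

    void on_path_reinstated(struct path *pp)
    {
            pp->reinstate_time = time(NULL);
    }

    void on_path_failed(struct path *pp)
    {
            time_t now = time(NULL);

            if (pp->reinstate_time != 0 &&
                now - pp->reinstate_time <= san_path_double_fault_time)
                    enqueue_io_err_check(pp);  /* double fault: probe it */
    }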

The big question is: how do administrators derive appropriate values
for these parameters for their environment? IIUC the values don't
depend on the storage array, but rather on the environment as a whole;
all kinds of things like switches, cabling, or even network load can
affect the behavior, so multipathd's hwtable will not help us provide
good defaults. Yet we have to assume that a very high percentage of
installations will just use default or vendor-recommended values. Even
if the documentation of the algorithm and its parameters was perfect
(which it currently isn't), most admins won't have a clue how to set
them. AFAICS we don't even have a test procedure to derive the optimal
settings experimentally, so guesswork will be applied, with
questionable odds of success.

IOW: the whole thing is basically useless without good default values.
It would be up to you hardware guys to come up with them.

> san_path_err_forget_rate is hard to understand, shall we use
> san_path_err_pre_check_time?

A 'rate' would be something which is measured in Hz, which is not the
case here. Calling it a 'time' is more accurate. If we go with my
proposal above, we might call it "san_path_double_fault_time".

Regards
Martin

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



