[dm-devel] [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"

Fri Dec 21 11:03:35 UTC 2018

Hi Martin,
I added the san_path_err_XX feature and pushed it upstream.
The feature was driven by Brocade customer feedback.

The link below gives the history; a couple of discussions took place
before we started on this feature.

https://www.redhat.com/archives/dm-devel/2017-January/msg00025.html

Our requirement was simple.
For example, suppose there are two paths on dm-1, say sda and sdb, as below.

 #  multipath -ll
 mpathd (3600110d001ee7f0102050001cc0b6751) dm-1 SANBlaze,VLUN MyLun
 size=8.0M features='0' hwhandler='0' wp=rw
 `-+- policy='round-robin 0' prio=50 status=active
   |- 8:0:1:0  sda 8:48 active ready  running
   `- 9:0:1:0  sdb 8:64 active ready  running

 Suppose I am seeing a lot of errors on sda, so that the sda path keeps
fluctuating from failed state to active state and vice versa.

 The requirement was this: if sda fails (moves from active to failed
state) more than X times in a Y duration, then I want to keep sda in the
failed state for a Z duration.

 During that Z duration, the data should travel only through the sdb path.

 From the configuration point of view

 san_path_err_threshold: The number of times sda has moved from active to
failed (X in the example above).
 san_path_err_forget_rate: The watch window. If the path failures (sda
moving from active to failed) within this time frame exceed the error
threshold, the path is not reinstated (Y in the example above).
 san_path_err_recovery_time: How long the path is kept in the failed state
(Z in the example above).

 A move from active state to failed state (good to bad) counts as 1.

 This means that if a particular path has failed (moved from active to failed
state) san_path_err_threshold times within a san_path_err_forget_rate time
window, the path is placed in the failed state and is not reinstated for
san_path_err_recovery_time.
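
 As a rough illustration of the requirement described above, here is a
minimal Python sketch. It is not the multipath-tools code: the class
ShakyPathTracker and its method names are hypothetical, time is an abstract
tick counter, and the strict "more than X" comparison is an assumption taken
from the wording of the requirement.

```python
from collections import deque


class ShakyPathTracker:
    """Hypothetical sketch: hold a path failed after too many failures.

    threshold     ~ san_path_err_threshold     (X)
    window        ~ san_path_err_forget_rate   (Y, used here as a watch window)
    recovery_time ~ san_path_err_recovery_time (Z)
    """

    def __init__(self, threshold, window, recovery_time):
        self.threshold = threshold
        self.window = window
        self.recovery_time = recovery_time
        self.failures = deque()   # times of recent active -> failed moves
        self.held_until = None    # time until which reinstating is blocked

    def record_failure(self, now):
        """Register one active -> failed transition at time `now`."""
        self.failures.append(now)
        # Forget transitions that have dropped out of the watch window.
        while self.failures and self.failures[0] <= now - self.window:
            self.failures.popleft()
        # Assumption: "more than X times" is a strict comparison.
        if len(self.failures) > self.threshold:
            self.held_until = now + self.recovery_time
            self.failures.clear()

    def may_reinstate(self, now):
        """Return True if the path may go back to active at time `now`."""
        return self.held_until is None or now >= self.held_until


tracker = ShakyPathTracker(threshold=2, window=60, recovery_time=100)
for t in (0, 10, 20):             # three failures within 60 ticks
    tracker.record_failure(t)
print(tracker.may_reinstate(50))  # still inside the recovery time
print(tracker.may_reinstate(120)) # recovery time has elapsed
```

 In this sketch sda would stay failed from tick 20 to tick 120, and I/O
would use only sdb during that time.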

 Coming back to the marginal path implementation: I have rechecked it, and I
completely agree with you that it is difficult to derive a direct formula
relating the two.
The example I gave doesn't hold good.

The two approaches are also mutually exclusive in how they detect a
marginal/shaky path.

 In the san_path_err_XX case we consider the overall number of failures
(san_path_err_threshold), whereas in the marginal case, IMO, we consider
the error rate (marginal_path_err_rate_threshold).
 And you are correct: if we merge the san_path_err_XX and marginal_path_XX
configuration into one set of parameters, this will confuse the user further.

 Since the approaches differ, we need to come up with a way for the user to
choose the algorithm in multipath.conf, similar to the multipaths
configuration in the .conf file.

 Regards,
 Muneendra

-----Original Message-----
From: Martin Wilck [mailto:mwilck at suse.com]
Sent: Friday, December 21, 2018 2:56 AM
To: Muneendra Kumar M <muneendra.kumar at broadcom.com>; Christophe Varoqui
<christophe.varoqui at opensvc.com>; mwilck at suse.com
Cc: M Muneendra Kumar <mmandala at brocade.com>; Guan Junxiong
<guanjunxiong at huawei.com>; Benjamin Marzinski <bmarzins at redhat.com>;
dm-devel at redhat.com; Hannes Reinecke <hare at suse.de>
Subject: Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX
feature"

Hello Muneendra,

On Thu, 2018-12-20 at 16:11 +0530, Muneendra Kumar M wrote:
> Hi Martin,
> I completely agree with you: we cannot derive a direct formula relating
> these two unless we know the IOPS on a particular path.
>
> As the IOPS in both the cases are different during the detection of
> Shaky path.
> In the marginal_path_XX case the IOPS are fixed, i.e. 100 (at a sample
> rate of 10 Hz), whereas in the san_path_xx case the IOPS are not fixed
> (they depend on the application).
>
> But there are lot of ways to derive the IOPS on a particular path if
> we can get that then we can derive the values  like below IMO.
>
> And to calculate these we need to derive error threshold as the
> percentage of IOPS and the percentage should not be less than 1(as
> most of the Brocade SAN customers are using this configuration).
> i.e. san_path_err_threshold and marginal_path_err_rate_threshold need to
> be computed as a percentage of IOPS for a given number of secs (derived
> from san_path_err_forget_rate / marginal_path_err_sample_time).

You make me curious - are Brocade customers using our upstream multipath
code? Do you have insights about if, and how, they apply marginal path
checking in multipath-tools, and what parameter values they are applying?

If yes, it would be very valuable for the community if you could share some
of these insights. So far I'm gathering that you recommend to consider paths
as shaky if they have an error rate of more than 1%.

>
> For example if  1000 IOPS are happening on a particular path and
> making the percentage factor as 1 and sample time as 60 secs the
> configuration will be as below
>
>       san_path_err_threshold     =600 (1 percentage of 60*1000)
>       san_path_err_forget_rate   =60
>       san_path_err_recovery_time 100

Hm, I understand it differently. In the san_path_err model, if you have an
error rate of 1% and the settings above, IMO you will *never* reach the
threshold. The failure count will increase (on average) in 1/100 ticks, but
it will decrease in 1/60 ticks, resulting in a negative first derivative
(more precisely, a stochastic process where the overall trend goes towards
0, not upwards towards the threshold).

In the san_path_err model, the maximum tolerable failure rate is basically
the reciprocal of the san_path_err_forget_rate parameter.

The error threshold has a different effect, acting rather as a "delay"
until the algorithm really considers the path shaky. The closer the failure
rate to the forget rate, the longer it takes. For example, if you have an
error rate of 1/30 (3.3%), the failure count will increase by one every 60
ticks (1/30-1/60 = 1/60), and it will take 60*600 =
36000 (!) ticks, or 10h at best, until the path is considered shaky.
OTOH, with an error rate of 10%, the threshold is reached in 7200 ticks, and
at an error rate of 50%, in 1200s.
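
The arithmetic above can be reproduced with a short Python sketch. It only
encodes the averaged model from this paragraph (the failure count grows at
the error rate and shrinks by one every san_path_err_forget_rate ticks); it
is not a simulation of the multipath-tools code, and the function name
ticks_to_threshold is made up for illustration.

```python
from fractions import Fraction


def ticks_to_threshold(error_rate, threshold, forget_rate):
    """Average number of ticks until the failure count reaches `threshold`.

    error_rate  -- failures per tick, e.g. Fraction(1, 30)
    threshold   -- value of san_path_err_threshold
    forget_rate -- value of san_path_err_forget_rate (ticks per decrement)

    Returns None when the count does not grow on average, i.e. when the
    error rate does not exceed 1/forget_rate.
    """
    net_growth = Fraction(error_rate) - Fraction(1, forget_rate)
    if net_growth <= 0:
        return None
    return threshold / net_growth


for rate in (Fraction(1, 100), Fraction(1, 30), Fraction(1, 10), Fraction(1, 2)):
    ticks = ticks_to_threshold(rate, threshold=600, forget_rate=60)
    label = "never (on average)" if ticks is None else f"about {int(ticks)} ticks"
    print(f"error rate {rate}: {label}")
```

With threshold 600 and forget rate 60 this gives 36000 ticks for an error
rate of 1/30 and 7200 ticks for 1/10; for 1/2 the exact value is 36000/29,
roughly 1241 ticks, which matches the rounded figure above. An error rate of
1% yields no result because it is below 1/60.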

For your scenario, I'd use something like

   san_path_err_threshold 4
   san_path_err_forget_rate 100
   san_path_err_recovery_time 100

At least that's how I understand the algorithm. Am I wrong?

Btw, are you aware that the san_path_err algorithm, at least in the form
that was merged upstream, only counts good->bad transitions?
Especially with high error rates, this is quite different from an overall
error rate (failures / overall I/Os), because several subsequent failures
are only counted as one.
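
To make the difference concrete, here is a small Python sketch that compares
the two ways of counting on a made-up sequence of check results. The helper
count_good_to_bad and the sample data are purely illustrative, and the sketch
assumes the path starts in the good state.

```python
def count_good_to_bad(samples):
    """Count good -> bad transitions in a sequence of check results.

    `samples` is an iterable of booleans, True meaning the check succeeded.
    Assumption: the path is considered good before the first sample.
    """
    transitions = 0
    previous_ok = True
    for ok in samples:
        if previous_ok and not ok:
            transitions += 1
        previous_ok = ok
    return transitions


# Made-up data: two bursts of consecutive failures.
samples = [True, False, False, False, True, False, False, True, True, True]

failed_checks = sum(1 for ok in samples if not ok)
print("failed checks:", failed_checks, "of", len(samples))
print("good->bad transitions:", count_good_to_bad(samples))
```

Here half of the checks fail, but only two good->bad transitions are counted,
because each burst of consecutive failures contributes a single count.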

>
> Now this user is supposed to migrate to marginal_path settings.
> (IOPS in this case is fixed to 100 during the shaky path detection)
>           60 (1 percentage of 60*100)
>       marginal_path_err_sample_time      60
>       marginal_path_err_recheck_gap_time 100
>
>
>
> And in this case  san_path_err_forget_rate  should be same as
> marginal_path_err_sample_time    and
> san_path_err_recovery_time should be same as
> marginal_path_err_recheck_gap_time  .
> The only variable factors are san_path_err_threshold and
> marginal_path_err_rate_threshold, which keep changing based on the number
> of errors as a percentage of IOPS for a given number of secs.
>
> The only extra parameter in the marginal case is
> marginal_path_double_failed_time, which we need to configure for
> suspecting a marginal path.

I don't think these parameters will have the same behavior as the
san_path_err parameters above; see my argument above.

Note that marginal_path_err_sample_time 60 is invalid (the marginal path
code requires at least 120s), and that the error threshold is always given
as a "permillage" (should be set to 10 for 1%).
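
As a small illustration of these two constraints, the following Python
sketch converts a percentage to the permillage value and checks the sample
time. The names percent_to_permille, sample_time_is_valid and
MIN_SAMPLE_TIME are hypothetical helpers, not part of multipath-tools; the
120 s minimum is taken from the statement above.

```python
MIN_SAMPLE_TIME = 120  # seconds; minimum stated above for the marginal path code


def percent_to_permille(percent):
    """Convert an error rate in percent to a permillage (per-thousand) value."""
    return round(percent * 10)


def sample_time_is_valid(seconds):
    """Return True if `seconds` is an acceptable marginal_path_err_sample_time."""
    return seconds >= MIN_SAMPLE_TIME


print(percent_to_permille(1))     # value to use for a 1% error rate
print(sample_time_is_valid(60))   # the sample time proposed in the quoted mail
print(sample_time_is_valid(120))
```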

>
> As we still see some merits in the san_path_XX approach as you
> mentioned earlier and we need both san_path_err_xx and
> marginal_path_err_xx  I am thinking of the below approach so that the
> customers can have the common configuration for both.
> Functionality-wise, san_path_err_forget_rate and
> marginal_path_err_sample_time, san_path_err_recovery_time and
> marginal_path_err_recheck_gap_time, and san_path_err_threshold and
> marginal_path_err_rate_threshold are the same.
>
> So we can have the common configuration name as marginal_path_err_XX
> (parameters) for both approaches and the deriving factor should be
> marginal_path_double_failed_time   .
> If marginal_path_double_failed_time is not defined, go with the
> san_path_err approach; else go with the marginal_path_err approach to
> detect the shaky path.

I'm not sure about that. It's important that users are able to understand
the effect that each parameter has. If we use the same parameter name for
different parameters of different algorithms, even bigger confusion might
arise than we have now.
"san_path_err_recovery_time" and "marginal_path_err_recheck_gap_time"
obviously have very similar effects, but for the other parameters I don't
see 1:1 equivalence.

Best regards,
Martin

--
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux
GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG
Nürnberg)