[dm-devel] [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"

Thu Dec 20 10:41:14 UTC 2018

Hi Martin,
I completely agree with you: we cannot derive a direct formula relating
these two unless we know the IOPS on a particular path.

The IOPS differ between the two cases during the detection of a shaky
path.
In the marginal_path_XX case the IOPS are fixed, i.e. 100 (at a sample rate
of 10 Hz), whereas in the san_path_xx case the IOPS are not fixed (they
depend on the application).

But there are a lot of ways to obtain the IOPS on a particular path, and if
we can get that, then IMO we can derive the values as described below.

To calculate these, we need to express the error threshold as a percentage
of IOPS, and the percentage should not be less than 1 (as most of the
Brocade SAN customers are using this configuration).
I.e. san_path_err_threshold and marginal_path_err_rate_threshold need to
be computed as a percentage of the IOPS over a given number of seconds
(taken from san_path_err_forget_rate / marginal_path_err_sample_time).

For example, if 1000 IOPS are happening on a particular path, then with a
percentage factor of 1 and a sample time of 60 secs the configuration will
be as below:

	san_path_err_threshold     600 (1 percent of 60*1000)
	san_path_err_forget_rate   60
	san_path_err_recovery_time 100

Now this user is supposed to migrate to marginal_path settings.
(IOPS in this case are fixed at 100 during shaky path detection)
	marginal_path_err_rate_threshold   60 (1 percent of 60*100)
	marginal_path_err_sample_time      60
	marginal_path_err_recheck_gap_time 100
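
To make the arithmetic in the two examples explicit, here is a rough Python
sketch. The function name err_threshold and its arguments are only
illustrative (they are not part of multipath-tools), and it assumes the
threshold is simply a whole-number percentage of the I/Os issued in the
window. It mirrors the calculation in this mail only; the units multipathd
actually applies to each option should be checked against multipath.conf(5).

```python
def err_threshold(iops: int, window_secs: int, percent: int = 1) -> int:
    """Errors tolerated in the window: `percent` of all I/Os issued in it.

    Hypothetical helper; assumes a constant I/O rate over the window.
    """
    if percent < 1:
        raise ValueError("percent should not be less than 1")
    return iops * window_secs * percent // 100


# san_path example: 1000 IOPS from the application over 60 secs
print(err_threshold(1000, 60))  # 600
# marginal_path example: fixed 100 IOPS over 60 secs
print(err_threshold(100, 60))   # 60
```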

In this case san_path_err_forget_rate should be the same as
marginal_path_err_sample_time, and san_path_err_recovery_time should be the
same as marginal_path_err_recheck_gap_time.
The only variable factor is san_path_err_threshold versus
marginal_path_err_rate_threshold, which changes based on the number of
errors as a percentage of IOPS over a given number of seconds.
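
Put as a rough Python sketch, the migration described above could look like
the following. The function san_to_marginal, the constant PROBE_IOPS and the
app_iops argument are hypothetical names for illustration, and the sketch
assumes the application IOPS on the path is already known and that the
threshold scales linearly with the I/O rate.

```python
PROBE_IOPS = 100  # fixed I/O rate during marginal path detection, as stated above


def san_to_marginal(san: dict, app_iops: int) -> dict:
    """Translate san_path_err_* values into marginal_path_err_* values.

    The two time values carry over unchanged; only the threshold is
    rescaled from the application I/O rate to the fixed probe rate.
    """
    if app_iops <= 0:
        raise ValueError("app_iops must be positive")
    return {
        "marginal_path_err_sample_time": san["san_path_err_forget_rate"],
        "marginal_path_err_recheck_gap_time": san["san_path_err_recovery_time"],
        "marginal_path_err_rate_threshold":
            san["san_path_err_threshold"] * PROBE_IOPS // app_iops,
    }


san = {
    "san_path_err_threshold": 600,
    "san_path_err_forget_rate": 60,
    "san_path_err_recovery_time": 100,
}
print(san_to_marginal(san, app_iops=1000))
```

marginal_path_double_failed_time has no san_path counterpart, so it is
deliberately left out of the result and still has to be chosen separately.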

The only extra parameter in the marginal case is
marginal_path_double_failed_time, which we need to configure for suspecting
a marginal path.

Since we still see some merits in the san_path_XX approach, as you mentioned
earlier, and we need both san_path_err_xx and marginal_path_err_xx, I am
thinking of the approach below so that customers can have a common
configuration for both.
Functionally, the following pairs are the same:
  san_path_err_forget_rate   and marginal_path_err_sample_time
  san_path_err_recovery_time and marginal_path_err_recheck_gap_time
  san_path_err_threshold     and marginal_path_err_rate_threshold

So we can use the common configuration names marginal_path_err_XX
(parameters) for both approaches, and the deciding factor should be
marginal_path_double_failed_time.
If marginal_path_double_failed_time is not defined, go with the san_path_err
approach; otherwise go with the marginal_path_err approach to detect the
shaky path.
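
The selection rule could be sketched in Python roughly as follows. This is
only an illustration of the proposal, not existing multipath-tools code:
detection_method and the dict-based configuration are hypothetical, and "not
defined" is taken to mean the key is absent or set to None.

```python
def detection_method(conf: dict) -> str:
    """Pick the shaky-path detection approach from a common configuration.

    marginal_path_double_failed_time acts as the switch: undefined means
    the san_path_err approach, defined means the marginal_path_err approach.
    """
    if conf.get("marginal_path_double_failed_time") is None:
        return "san_path_err"
    return "marginal_path_err"


common = {
    "marginal_path_err_sample_time": 60,
    "marginal_path_err_rate_threshold": 60,
    "marginal_path_err_recheck_gap_time": 100,
}
print(detection_method(common))  # san_path_err
print(detection_method({**common, "marginal_path_double_failed_time": 30}))  # marginal_path_err
```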

Regards,
Muneendra.

-----Original Message-----
From: Martin Wilck [mailto:mwilck at suse.com]
Sent: Wednesday, December 19, 2018 5:32 PM
To: Muneendra Kumar M <muneendra.kumar at broadcom.com>; Christophe Varoqui
<christophe.varoqui at opensvc.com>; mwilck+gmail at suse.de
Cc: M Muneendra Kumar <mmandala at brocade.com>; Guan Junxiong
<guanjunxiong at huawei.com>; Benjamin Marzinski <bmarzins at redhat.com>;
dm-devel at redhat.com; Hannes Reinecke <hare at suse.de>
Subject: Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX
feature"

On Wed, 2018-12-19 at 17:02 +0530, Muneendra Kumar M wrote:
> Hi Martin,
> In one of the patch   "[PATCH 00/19] san_path_err & multipath ANA
> support"
>
> you have mentioned that san_path_err_XXX has some merits over
> marginal_path_err_XXX.
>
> Is this understanding correct if so could you please explain the
> scenario in which use case this was better.
>
> I can say Marginal_path_err_xx is superset of san_path_err_xx.

If you think so, please explain how. Imagine a user who has configured

  san_path_err_threshold     X
  san_path_err_forget_rate   Y
  san_path_err_recovery_time Z

Now this user is supposed to migrate to marginal_path settings.

  marginal_path_double_failed_time   A
  marginal_path_err_sample_time      B
  marginal_path_err_rate_threshold   C
  marginal_path_err_recheck_gap_time D

Can you provide a formula to calculate A,B,C,D such that the system behaves
the same way (or "better") than previously with X, Y, Z?

I have pondered this for a while and concluded that I can't.

> If we need both san_path_err_xx , Marginal_path_err_xx then so many
> configurations will really confuse the customers.

True, the many different options are confusing. However, I don't think it
becomes much worse by offering both methods. Both methods aren't easy to
understand by themselves. Once users understand that these two parameter
sets are mutually exclusive, I think they can deal with that.

What we really need is easier set-up of either method (think of 2-3 sets of
reasonable pre-set parameter values for different scenarios).
I believe most admins are so intimidated by the complexity of the parameters
and their interaction that they give up and use delay_xx_checks instead, or
nothing at all.

Unfortunately this is all based on guessing; we at least have no data if
users are trying these parameters and if yes, what they are using.

Martin

--
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)