[dm-devel] [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"

Fri Dec 28 12:19:17 UTC 2018

Hi Martin,
Please find my replies below.

>Hi Muneendra,

> I added the san_path_err_XX feature and pushed it upstream.
> The feature was driven by Brocade customer feedback.
>
> The link below gives the history, including a couple of
> discussions that took place before we started on this feature.
>
>
> https://www.redhat.com/archives/dm-devel/2017-January/msg00025.html

>I'm aware that you authored the feature. I was not aware of the post you
>quoted, thanks for the link. Anyway, you mentioned in that post that the
>interested customers were using RHEL. Have you made them upgrade their
>multipath-tools to recent upstream to use the san_path_err and/or
>marginal_path features?

>>>> I will get back to you with the details.

> Our requirement was simple.
> For example, suppose there are two paths on dm-1, say sda and sdb, as below.
>
>  #  multipath -ll
>  mpathd (3600110d001ee7f0102050001cc0b6751) dm-1 SANBlaze,VLUN MyLun
> size=8.0M features='0' hwhandler='0' wp=rw
>  `-+- policy='round-robin 0' prio=50 status=active
>    |- 8:0:1:0  sda 8:48 active ready  running
>    `- 9:0:1:0  sdb 8:64 active ready  running
>
>  Suppose I am seeing a lot of errors on sda, so that the sda path keeps
> fluctuating between the failed and active states.
>
>  The requirement was this: if sda has failed (moved from the active
> to the failed state) more than X times within a duration Y, then I
> want to keep sda in the failed state for a duration Z.

>Thanks for clarifying what you meant by "is failed". I'd been wondering
>whether it meant "good"->"failed" transitions, as you just confirmed, or the
>overall "failed" state count.

>  And the data should travel only through the sdb path for Z hrs.
>
>
>  From the configuration point of view:
>
>  san_path_err_threshold: The number of times sda has moved from
> active to failed (X in the above example).
>  san_path_err_forget_rate: Watch window. If the number of path
> failures (sda moving from active to failed) within this time frame
> exceeds the error threshold, the path is not reinstated (Y in the
> above example).
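For orientation, the options described above might appear in multipath.conf roughly as follows. The values are purely illustrative and not recommendations. The option name san_path_err_recovery_time for the hold duration (Z) is not spelled out in this mail; it is assumed here from the multipath.conf documentation.

```
defaults {
	# X: active->failed transitions tolerated (illustrative value)
	san_path_err_threshold 5
	# Y: watch window / forget rate, in checker ticks (illustrative value)
	san_path_err_forget_rate 10
	# Z: time to keep the path failed (assumed option name, illustrative value)
	san_path_err_recovery_time 600
}
```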

>The "watch window" analogy fits if you have a stable path (no or only very
>rare failures over extended periods of time) which suddenly starts
>fluctuating. More precisely, a "background" failure rate clearly below
>"san_path_err_forget_rate", alternating with problematic periods in
>which the failure rate is significantly higher than
>"san_path_err_forget_rate". And that is the situation the algorithm was
>made for, right?

>In general, the "time" (in ticks) to reach the threshold is

>  t = T / max(1/R - 1/F, 0)

>Where T is san_path_err_threshold, R is the average time (in ticks) between
>"good"->"failed" transitions of the path, and F is san_path_err_forget_rate
>(aka the time in ticks after which "path_failures" is decremented by 1).

>If R >= F, t is infinite; the "path_failures" count effectively stays 0. If
>R is much smaller than F, t ~ T * R. If R is only a little bit smaller than
>F, t is finite but (possibly much) larger than T * R.
>That's why I sloppily called F the "maximum tolerable failure rate" in my
>previous post.

>>>> Yes.
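The three regimes of the formula quoted above can be checked numerically. The sketch below evaluates t = T / max(1/R - 1/F, 0) and compares it with a toy tick-by-tick model. The function names are hypothetical, and the toy model assumes perfectly regular failures every R ticks and a decrement of "path_failures" every F ticks, applied before a failure on the same tick. It is not the multipathd code.

```python
import math


def ticks_to_threshold(threshold, fail_interval, forget_rate):
    """Closed form from the thread: t = T / max(1/R - 1/F, 0).

    threshold     -- T, san_path_err_threshold
    fail_interval -- R, average ticks between "good"->"failed" transitions
    forget_rate   -- F, ticks after which path_failures is decremented by 1
    """
    net_rate = max(1.0 / fail_interval - 1.0 / forget_rate, 0.0)
    if net_rate == 0.0:
        return math.inf  # R >= F: the count never builds up
    return threshold / net_rate


def simulate_ticks(threshold, fail_interval, forget_rate, max_ticks=100_000):
    """Toy model (assumption): one failure every R ticks, one decrement
    every F ticks. Returns the tick at which the threshold is reached,
    or None if it is not reached within max_ticks."""
    path_failures = 0
    for tick in range(1, max_ticks + 1):
        if tick % forget_rate == 0 and path_failures > 0:
            path_failures -= 1
        if tick % fail_interval == 0:
            path_failures += 1
            if path_failures >= threshold:
                return tick
    return None


# R much smaller than F: t is close to T * R
print(ticks_to_threshold(5, 1, 1000))
# R a little smaller than F: t is finite but much larger than T * R
print(ticks_to_threshold(5, 9, 10))
# R >= F: t is infinite
print(ticks_to_threshold(5, 10, 10))
# Toy model for T=5, R=2, F=10
print(simulate_ticks(5, 2, 10))
```

With T=5, R=2, F=10 the closed form gives about 12.5 ticks, and the toy model reaches the threshold at tick 12. The small difference comes from the discrete ticks and the assumed ordering of decrement and failure.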

......

Regards,
Muneendra.