[dm-devel] [PATCH] multipath-tools: intermittent IO error accounting to improve reliability

Tue Aug 22 15:37:39 UTC 2017

On 08/22/2017 12:07 PM, Guan Junxiong wrote:
> This patch adds a new method of path state checking based on accounting
> IO error. This is useful in many scenarios such as intermittent IO error
> on a path due to network congestion, or a shaky link.
> 
> Three parameters are added for the admin: "path_io_err_sample_time",
> "path_io_err_num_threshold" and "path_io_err_recovery_time".
> If path_io_err_sample_time and path_io_err_recovery_time are set to a
> value greater than 0, when a path fail event occurs due to an IO error,
> multipathd will enqueue this path into a queue, each member of which is
> sent asynchronous direct-read IO at a fixed sample rate of 100 Hz. The
> IO accounting process for a path will last for path_io_err_sample_time.
> If the number of IO errors on a particular path is greater than
> path_io_err_num_threshold, then the path will not be reinstated for
> 
> This helps us place the path in a delayed state if we hit a lot of
> intermittent IO errors on a particular path due to network/target
> issues, isolate such a degraded path, and allow the admin to rectify
> the errors on that path.
> 
> Signed-off-by: Junxiong Guan <guanjunxiong at huawei.com>
> ---
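For reference, the accounting described above can be sketched roughly as
follows. This is a minimal illustration, not code from the patch: the
struct and function names are hypothetical, and it assumes that the
hold-off period in the truncated sentence is path_io_err_recovery_time.

```c
#include <stdbool.h>

/* Fixed probe rate stated in the patch description. */
#define SAMPLE_RATE_HZ 100

/* Hypothetical per-path accounting state; only the three parameter
 * names come from the patch description. */
struct io_err_stat {
	unsigned int sample_time;   /* path_io_err_sample_time, seconds */
	unsigned int num_threshold; /* path_io_err_num_threshold */
	unsigned int recovery_time; /* path_io_err_recovery_time, seconds */
	unsigned int io_sent;       /* probe reads completed so far */
	unsigned int io_errors;     /* probe reads that failed */
};

/* Accounting is active only if both time parameters are non-zero. */
bool accounting_enabled(const struct io_err_stat *s)
{
	return s->sample_time > 0 && s->recovery_time > 0;
}

/* Record one probe result; returns true once the sample window is full. */
bool record_sample(struct io_err_stat *s, bool failed)
{
	s->io_sent++;
	if (failed)
		s->io_errors++;
	return s->io_sent >= s->sample_time * SAMPLE_RATE_HZ;
}

/* Seconds to hold the path out of service (assumed to be the recovery
 * time); 0 means it may be reinstated right away. */
unsigned int reinstate_delay(const struct io_err_stat *s)
{
	if (!accounting_enabled(s))
		return 0;
	return s->io_errors > s->num_threshold ? s->recovery_time : 0;
}
```

With a sample time of 1 second and a threshold of 5, a path that fails
10 of its 100 probe reads would be held out for the recovery time.
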
There have been several attempts at this over the years; if you check
the mail archive for 'flaky patch' you're bound to hit several threads
discussing this.
However, each has foundered on several problems:

- As we now have advanced path selectors the overall consensus is that
those selectors _should_ be able to handle these situations; i.e. for a
flaky path the path selector should switch away from it and move the
load to other, unaffected paths.
Have you checked if the existing path selectors are able to cope with
this situation? If not, why not?
- But even if something like this is implemented, the real problem here
is reliability. Multipath internally only considers two real path
states; useable and unuseable. Consequently the flaky path needs to be
placed in one of these; so with your patch after enough errors
accumulate the flaky path will be placed in an unuseable state
eventually. If a failover event occurs the daemon cannot switch to the
flaky paths, and the system becomes unuseable even though I/O could be
sent via the flaky paths.
- Regardless of how flaky path detection is implemented, it will work most
efficiently when moving I/O _away_ from the flaky path. However, in
doing so we don't have a mechanism to figure out if and when the path is
useable again (as we're not sending I/O to it, and the TUR or any other
path checker might not be affected by the flaky behaviour).
So when should we declare a path as 'good' again?
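
To illustrate the second point, here is a minimal sketch of a two-state
path model. The names are hypothetical and this is not how multipathd or
the kernel path selectors are written; it only shows that once the flaky
path has been forced into the unuseable state, nothing is left to fail
over to when the good path goes away.

```c
#include <stddef.h>

/* Hypothetical two-state model: a path is either useable or not. */
enum path_state { PATH_USEABLE, PATH_UNUSEABLE };

struct path {
	const char *name;
	enum path_state state;
};

/* Return the first useable path, or NULL if none is left. */
const struct path *select_path(const struct path *paths, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (paths[i].state == PATH_USEABLE)
			return &paths[i];
	return NULL;
}
```

With one healthy path and one flaky path already marked unuseable, a
failure of the healthy path makes select_path() return NULL, although
the flaky path could still have carried some I/O.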

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)