[dm-devel] [PATCH] multipath-tools:Prioritizer based on a time-delay algorithm

Yang Feng philip.yang at huawei.com
Wed May 24 01:59:21 UTC 2017


Hello Benjamin,

How about this patch?
Look forward your replies.
Thank you very much.
Best regards!





On 2017/5/19 17:45, Yang Feng wrote:
> Hi Benjamin,
> 
> Thank you very much for your comments.
> Please find my replys and the up-to-date patch.
> Best regards!
> 
>>
>> First, one overall question. We have dynamic path selectors available to
>> deal with paths that are just simply slower that other paths, but can
>> still be used together.  Is there specific hardware or a specific setup
>> where this isn't good enough and we really need to seperate these paths
>> into different pathgroups, but we can't find out deterministically how
>> the groups should be set up?  It just seems like there could be a less
>> hacky solution to this problem, but perhaps there are some situations
>> where this is truly the best option. I'm just wondering what those are.1. In the Storage-Backup environment of HyperCluster,includs one storage array near
> to the host and one remote storage array, and the two storage arrays have the same hardware.
> The same LUN is writed or readed by the two storage arrays.
> However, usually, the average latency of the paths of the remote storage array is much higher than the
> near storage array's.
> apparently, the prioritizer can be a good automatic solution.
> And the current selectors don't solve it, IOs will send to the paths of the remote storage array, IOPS will be influenced unavoidably.
> 2. In the environment of single storage array, the prioritizer can automatically separate the paths who's latency is much higher,
> IOs will not send to this paths.
> But the current selectors don't solve this problem, IOPS will be influenced unavoidably.
> 
>>> +
>>> +/* interval_unit_str and interval_unit_type keep the same assignment sequence */
>>> +static const char interval_unit_str[][MAX_CHAR_SIZE] = {
>>> +    CHAR_USEC, CHAR_MSEC, CHAR_SEC
>>
>> This is a nit, but for constant strings, could you please use "char
>> *var" instead of "char var[]", to be consistent with the rest of the
>> multipath code.
> Thanks, as the following patch.
> 
>>> +    if ((args == NULL) || (interval == NULL)
>>> +        || (consnum == NULL) || (type == NULL))
>>> +        return 0;
>>> +
>>> +    /* int type */
>>> +    if ((size < 1) || (size > MAX_CHAR_SIZE-1))
>>> +        return 0;
>>
>> You should probably have log messages for these error returns.
> Thanks, as the following patch.
> 
>>> +
>>> +    memcpy(source, args, size+1);
>>> +    if (strstr(source, vertica) == NULL)
>>> +        return 0;
>>> +
>>> +    *type = get_interval_type(source, typestr);
>>> +    if (*type == INTERVAL_INVALID)
>>> +    {
>>> +        condlog(0, "delay_interval type is invalid");
>>> +        return 0;
>>> +    }
>>
>> I'm confused here. How do you get to use the default interval. Shouldn't
>> you accept "20s|" and "|30" and as valid inputs that use the defaults
>> for the part they don't specify. 
> OK,the default arguments value is removed. If get inputs failed, return default priority "0".
> As the following patch.
>>
>>> +    tokenbefore = strtok(source, vertica);
>>> +    tokenafter = strtok(NULL, vertica);
>>> +    typestr[1] = '\0';
>>> +    tokenbefore = strtok(tokenbefore, typestr);
>>> +    if ((tokenbefore == NULL) || (tokenafter == NULL))
>>> +        return 0;
>>> +
>>> +    tmp = tokenbefore;
>>> +    while (*tmp != '\0')
>>> +        if (!isdigit(*tmp++))
>>> +        {
>>> +            condlog(0, "delay_interval string include invalid char");
>>> +            return 0;
>>> +        }
>>> +
>>> +    tmp = tokenafter;
>>> +    while (*tmp != '\0')
>>> +        if (!isdigit(*tmp++))
>>> +        {
>>> +            condlog(0, "cons_num string include invalid char");
>>> +            return 0;
>>> +        }
>>> +
>>> +    *interval = atoi(tokenbefore);
>>
>> Why do you keep track of the type and the interval seperately? Can't you
>> just find out the type, and use that to multiply the interval once you
>> read it, and then just use that value, instead of keeping track of two
>> values across multiple functions?
> Thanks, as the following patch.
> 
>>> +
>>> +	if (pp->fd < 0)
>>> +	    return -PRIO_NO_INFORMATION;
>>> +
>>> +    if (get_delay_pref_arg(args, &delay_interval, &cons_num, &type) == 0)
>>> +    {
>>> +        condlog(3, "%s: get delay arg fail", pp->dev);
>>
>> Why use the word "fail" in this message? Not setting prio_args to get
>> the defaults seems like a perfectly valid choice.
> The defaults are not used. Insteadly, return default priority "0". See below.
> 
>>> diff --git a/libmultipath/prioritizers/delayedpath.h b/libmultipath/prioritizers/delayedpath.h
>>> new file mode 100644
>>> index 0000000..d8213e9
>>> --- /dev/null
>>> +++ b/libmultipath/prioritizers/delayedpath.h
>>> @@ -0,0 +1,17 @@
>>> +#ifndef _DELAYEDPATH_H
>>> +#define _DELAYEDPATH_H
>>> +
>>> +#define PRIO_DELAYED_PATH "delayedpath"
>>
>> In order for the rest of the code to refer to this prioritizer, this
>> define should be in prio.h with the other prioritizer names, and as long
>> as delayedpath.c includes prio.h, there's no need to put it in
>> delayedpath.h.
> OK, as the following patch.
>>
>>> +
>>> +#define PRIO_NO_INFORMATION 5
>>
>> The rest of the multipath code only cares if getprio returns a negative
>> number of not. It doesn't check what the specific negative number is.  I
>> realize the the alua prioritizer returns a set of error codes, but they
>> aren't used, or even usable in their present form. If we wanted to have
>> better error reporting, we should set up a universal set of error codes
>> in prio.h, and have all prioritizers use them, instead of having each
>> prioritizer define its own error codes. There's no reason why your
>> prioritizer needs to return this error code instead of -1.
> OK, as the following patch.
>>
>>> +
>>> +#define USEC_PER_SEC      1000000LL
>>> +#define USEC_PER_MSEC     1000LL
>>> +#define USEC_PER_USEC     1LL
>>> +
>>> +static inline long long timeval_to_us(const struct timespec *tv)
>>> +{
>>> +	return ((long long) tv->tv_sec * USEC_PER_SEC) + (tv->tv_nsec >> 10);
>>> +}
>>
>> No other file besides delayedpath.c will likely be including this .h
>> file, so I don't see any purpose for these being defined here.  In fact,
>> I don't see why you can't just have a .c file without a .h file like the
>> majority of prioritizers.  I'm pretty sure that none of the prioritizers
>> really need their own .h file.
> OK, as the following patch.
>>
>>> +#endif
>>> diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
>>> index 5939688..f1e126e 100644
>>> --- a/multipath/multipath.conf.5
>>> +++ b/multipath/multipath.conf.5
>>> @@ -293,6 +293,10 @@ Generate a random priority between 1 and 10.
>>>  Generate the path priority based on the regular expression and the
>>>  priority provided as argument. Requires prio_args keyword.
>>>  .TP
>>> +.I delayedpath
>>> +Generate the path priority based on a time-delay algorithm.
>>> +Requires prio_args keyword.
>>
>> Really it doesn't require prio_args if you want to use the default
>> values, and should probably say so.
> The default args is discarded, as the following patch.
> 
>>> +.I delayed
>>> +Needs a value of the form
>>> +\fI"<delay_interval|cons_num>"\fR
>>> +.RS
>>> +.TP 8
>>> +.I delay_interval
>>> +The interval values of average IO-time-delay between two different neighbour ranks of path priority, used to partition different priority ranks.
>>> +Form: XXs, or XXXus, or XXXms. Unit: Second, or Microsecond, or Millisecond. Valid Values: Integer, s [1, 60], ms [1, 1000), us [1, 1000),
>>> +For example: 10s, or 100us, or 100ms. The default is: 10ms.
>>> +.TP
>>> +.I cons_num
>>> +The number of read IOs sent to the current path continuously, used to calculate the average IO-time-delay. Valid Values: Integer, [3, 1000].
>>> +For example: 30. The default is: 20.
>>> +.RE
>>> +.TP 12
>>
>> Looking at the "weighted" prio_args definition just above your "delayed"
>> definition, the pipe character "|" is being used to say that any of a
>> set of options is allowed.  Your definition has it being a literal
>> character, but it's still inside the angle brackets that usually
>> delineate a variable.  perhaps "<delay_interval>|<io_num>" would be
>> easier to understand, or even "[delayed_interval]|[io_num]" if you can
>> omit these to use the defaults.
> OK, as the following patch.
> The up-to-date patch as follows:
> 
> ---
>>From 58d718fdd34550bd9c4a32c6e9a87099c1e45a9f Mon Sep 17 00:00:00 2001
> From: Yang Feng <philip.yang at huawei.com>
> Date: Fri, 19 May 2017 16:09:07 +0800
> Subject: [PATCH] libmultipath/prioritizers: Prioritizer for device mapper multipath, where the corresponding priority
> values of specific paths are provided by a latency algorithm. And the latency algorithm is dependent on the following
> arguments(latency_interval and io_num). The principle of the algorithm is illustrated as follows:
> 1. By sending a certain number "cons_num" of read IOs to the current path continuously, the IOs' average latency can be calculated.
> 2. According to the average latency of each path and the weight value "latency_interval", the priority "rc" of each path can be provided.
> 
>                    latency_interval   latency_interval   latency_interval       latency_interval
>          	 |------------------|------------------|------------------|...|------------------|
> 		 |  priority rank 1 |  priority rank 2 |  priority rank 3 |...|  priority rank x |
> 		 |------------------|------------------|------------------|...|------------------|
> 				          Priority Rank Partitioning
> ---
>  libmultipath/prioritizers/Makefile       |   6 +-
>  libmultipath/prioritizers/path_latency.c | 271 +++++++++++++++++++++++++++++++
>  multipath/multipath.conf.5               |  18 ++
>  libmultipath/prio.h 			  |   1 +
>  4 files changed, 295 insertions(+), 1 deletion(-)
>  create mode 100644 libmultipath/prioritizers/path_latency.c
> 
> diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile
> index 36b42e4..d2f20f6 100644
> --- a/libmultipath/prioritizers/Makefile
> +++ b/libmultipath/prioritizers/Makefile
> @@ -18,13 +18,17 @@ LIBS = \
>  	libpriorandom.so \
>  	libpriordac.so \
>  	libprioweightedpath.so \
> -	libpriosysfs.so
> +	libpriopath_latency.so \
> +	libpriosysfs.so
> 
>  all: $(LIBS)
> 
>  libprioalua.so: alua.o alua_rtpg.o
>  	$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^
> 
> +libpriopath_latency.so: path_latency.o  ../checkers/libsg.o
> +	$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^ -lm
> +
>  libprio%.so: %.o
>  	$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^
> 
> diff --git a/libmultipath/prioritizers/path_latency.c b/libmultipath/prioritizers/path_latency.c
> new file mode 100644
> index 0000000..a666b6c
> --- /dev/null
> +++ b/libmultipath/prioritizers/path_latency.c
> @@ -0,0 +1,271 @@
> +/*
> + * (C) Copyright HUAWEI Technology Corp. 2017, 2021   All Rights Reserved.
> + *
> + * main.c
> + *
> + * Prioritizer for device mapper multipath, where the corresponding priority
> + * values of specific paths are provided by a latency algorithm. And the
> + * latency algorithm is dependent on arguments.
> + *
> + * The principle of the algorithm as follows:
> + * 1. By sending a certain number "io_num" of read IOs to the current path
> + *    continuously, the IOs' average latency can be calculated.
> + * 2. According to the average latency of each path and the weight value
> + *    "latency_interval", the priority "rc" of each path can be provided.
> + *
> + * Author(s): Yang Feng <philip.yang at huawei.com>
> + *            Zou Ming <zouming.zouming at huawei.com>
> + *
> + * This file is released under the GPL.
> + */
> +#include <stdio.h>
> +#include <math.h>
> +#include <ctype.h>
> +#include <time.h>
> +
> +#include "debug.h"
> +#include "prio.h"
> +#include "structs.h"
> +#include "../checkers/libsg.h"
> +
> +#define THRES_USEC_VALUE        120000000LL    /*unit: us, =120s*/
> +
> +#define MAX_IO_NUM              200
> +#define MIN_IO_NUM              10
> +
> +#define MAX_LATENCY_INTERVAL    60            /*unit: s*/
> +#define MIN_LATENCY_INTERVAL    1             /*unit: us, or ms, or s*/
> +
> +#define DEFAULT_PRIORITY        0
> +
> +#define MAX_CHAR_SIZE           30
> +
> +#define CHAR_USEC               "us"
> +#define CHAR_MSEC               "ms"
> +#define CHAR_SEC                "s"
> +
> +enum interval_type {
> +    INTERVAL_USEC,
> +    INTERVAL_MSEC,
> +    INTERVAL_SEC,
> +    INTERVAL_INVALID
> +};
> +
> +/* interval_unit_str and interval_unit_type keep the same assignment sequence */
> +static const char *interval_unit_str[MAX_CHAR_SIZE] = {
> +    CHAR_USEC, CHAR_MSEC, CHAR_SEC
> +};
> +static const int interval_unit_type[] = {
> +    INTERVAL_USEC, INTERVAL_MSEC, INTERVAL_SEC
> +};
> +
> +#define USEC_PER_SEC      1000000LL
> +#define USEC_PER_MSEC     1000LL
> +#define USEC_PER_USEC     1LL
> +
> +static const int conversion_ratio[] = {
> +    [INTERVAL_USEC]		= USEC_PER_USEC,
> +    [INTERVAL_MSEC]     = USEC_PER_MSEC,
> +    [INTERVAL_SEC]		= USEC_PER_SEC,
> +    [INTERVAL_INVALID]	= 0
> +};
> +
> +static long long path_latency[MAX_IO_NUM];
> +
> +static inline long long timeval_to_us(const struct timespec *tv)
> +{
> +	return ((long long) tv->tv_sec * USEC_PER_SEC) + (tv->tv_nsec >> 10);
> +}
> +
> +static int do_readsector0(int fd, unsigned int timeout)
> +{
> +	unsigned char buf[4096];
> +	unsigned char sbuf[SENSE_BUFF_LEN];
> +	int ret;
> +
> +	ret = sg_read(fd, &buf[0], 4096, &sbuf[0],
> +		      SENSE_BUFF_LEN, timeout);
> +
> +	return ret;
> +}
> +
> +int check_args_valid(int io_num, long long latency_interval, int type)
> +{
> +    if ((io_num < MIN_IO_NUM) || (io_num > MAX_IO_NUM))
> +    {
> +        condlog(0, "args io_num is more than the valid values range");
> +        return 0;
> +    }
> +
> +    /* s:[1, 60], ms:[1, 60000], us:[1, 60000000] */
> +    if ((latency_interval < MIN_LATENCY_INTERVAL) || (latency_interval > (MAX_LATENCY_INTERVAL * USEC_PER_SEC / conversion_ratio[type])))
> +    {
> +        condlog(0, "args latency_interval is more than the valid values range");
> +        return 0;
> +    }
> +
> +    return 1;
> +}
> +
> +static int get_interval_type(char *type)
> +{
> +    int index;
> +
> +    for (index = 0; index < sizeof(interval_unit_str)/MAX_CHAR_SIZE; index++)
> +    {
> +        if (strcmp(type, interval_unit_str[index]) == 0)
> +        {
> +            return interval_unit_type[index];
> +        }
> +    }
> +
> +    return INTERVAL_INVALID;
> +}
> +
> +long long get_conversion_ratio(int type)
> +{
> +    return conversion_ratio[type];
> +}
> +
> +/* In multipath.conf, args form: io_num|latency_interval. For example,
> +*  args is "20|10ms", this function can get 20, 10.
> +*/
> +static int get_interval_and_ionum(char *args,
> +                                        int *ionum,
> +                                        long long *interval)
> +{
> +    char source[MAX_CHAR_SIZE];
> +    char vertica = '|';
> +    char *endstrbefore = NULL;
> +    char *endstrafter = NULL;
> +    int type;
> +    unsigned int size = strlen(args);
> +    long long ratio;
> +
> +    if ((args == NULL) || (ionum == NULL) || (interval == NULL))
> +    {
> +        condlog(0, "args string is NULL");
> +        return 0;
> +    }
> +
> +    if ((size < 1) || (size > MAX_CHAR_SIZE-1))
> +    {
> +        condlog(0, "args string's size is too long");
> +        return 0;
> +    }
> +
> +    memcpy(source, args, size+1);
> +
> +    if (!isdigit(source[0]))
> +    {
> +        condlog(0, "args io_num string's first char is not digit");
> +        return 0;
> +    }
> +
> +    *ionum = (int)strtoul(source, &endstrbefore, 10);
> +    if (endstrbefore[0] != vertica)
> +    {
> +        condlog(0, "segmentation char is invalid");
> +        return 0;
> +    }
> +
> +    if (!isdigit(endstrbefore[1]))
> +    {
> +        condlog(0, "args latency_interval string's first char is not digit");
> +        return 0;
> +    }
> +
> +    *interval = (long long)strtol(&endstrbefore[1], &endstrafter, 10);
> +    type = get_interval_type(endstrafter);
> +    if (type == INTERVAL_INVALID)
> +    {
> +        condlog(0, "args latency_interval type is invalid");
> +        return 0;
> +    }
> +
> +    if (check_args_valid(*ionum, *interval, type) == 0)
> +    {
> +        return 0;
> +    }
> +
> +	ratio = get_conversion_ratio(type);
> +    *interval *= (long long)ratio;
> +
> +    return 1;
> +}
> +
> +long long calc_standard_deviation(long long *path_latency, int size, long long avglatency)
> +{
> +    int index;
> +    long long total = 0;
> +
> +    for (index = 0; index < size; index++)
> +    {
> +        total += (path_latency[index] - avglatency) * (path_latency[index] - avglatency);
> +    }
> +
> +    total /= (size-1);
> +
> +    return (long long)sqrt((double)total);
> +}
> +
> +int getprio (struct path *pp, char *args, unsigned int timeout)
> +{
> +    int rc, temp;
> +    int index = 0;
> +    int io_num;
> +    long long latency_interval;
> +    long long avglatency;
> +    long long standard_deviation;
> +    long long toldelay = 0;
> +    long long before, after;
> +    struct timespec tv;
> +
> +	if (pp->fd < 0)
> +		return -1;
> +
> +    if (get_interval_and_ionum(args, &io_num, &latency_interval) == 0)
> +    {
> +        condlog(0, "%s: get path_latency args fail", pp->dev);
> +        return DEFAULT_PRIORITY;
> +    }
> +
> +    memset(path_latency, 0, sizeof(path_latency));
> +
> +    temp = io_num;
> +    while (temp-- > 0)
> +    {
> +        (void)clock_gettime(CLOCK_MONOTONIC, &tv);
> +        before = timeval_to_us(&tv);		
> +
> +        if (do_readsector0(pp->fd, timeout) == 2)
> +        {
> +            condlog(0, "%s: path down", pp->dev);
> +            return -1;
> +        }
> +
> +        (void)clock_gettime(CLOCK_MONOTONIC, &tv);
> +        after = timeval_to_us(&tv);
> +
> +        path_latency[index] = after - before;
> +        toldelay += path_latency[index++];
> +    }
> +
> +    avglatency = toldelay/(long long)io_num;
> +    condlog(4, "%s: average latency is (%lld)", pp->dev, avglatency);
> +
> +    if (avglatency > THRES_USEC_VALUE)
> +    {
> +        condlog(0, "%s: average latency (%lld) is more than thresold", pp->dev, avglatency);
> +        return DEFAULT_PRIORITY;
> +    }
> +
> +    /* warn the user if the latency_interval set is smaller than (2 * standard deviation), or equal */
> +    standard_deviation = calc_standard_deviation(path_latency, index, avglatency);
> +    if (latency_interval <= (2 * standard_deviation))
> +        condlog(3, "%s: args latency_interval set is smaller than 2 * standard deviation (%lld us), or equal",
> +            pp->dev, standard_deviation);
> +
> +	rc = (int)(THRES_USEC_VALUE - (avglatency/(long long)latency_interval));
> +    return rc;
> +}
> diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
> index 5939688..3dd0d77 100644
> --- a/multipath/multipath.conf.5
> +++ b/multipath/multipath.conf.5
> @@ -293,6 +293,10 @@ Generate a random priority between 1 and 10.
>  Generate the path priority based on the regular expression and the
>  priority provided as argument. Requires prio_args keyword.
>  .TP
> +.I path_latency
> +Generate the path priority based on a latency algorithm.
> +Requires prio_args keyword.
> +.TP
>  .I datacore
>  .\" XXX
>  ???. Requires prio_args keyword.
> @@ -333,6 +337,20 @@ these values can be looked up through sysfs or by running \fImultipathd show pat
>  "%N:%R:%n:%r"\fR. For example: 0x200100e08ba0aea0:0x210100e08ba0aea0:.*:.* , .*:.*:iqn.2009-10.com.redhat.msp.lab.ask-06:.*
>  .RE
>  .TP 12
> +.I path_latency
> +Needs a value of the form
> +\fI"<latency_interval>|<io_num>"\fR
> +.RS
> +.TP 8
> +.I latency_interval
> +The interval values of average latency between two different neighbour ranks of path priority, used to partition different priority ranks.
> +Form: XXs, or XXXus, or XXXms. Unit: Second, or Microsecond, or Millisecond. Valid Values: Integer, s [1, 60], ms [1, 60000], us [1, 60000000],
> +For example: If latency_interval=10ms, the paths will be grouped in priority groups with path latency 0-10ms, 10-20ms, 20-30ms, etc..
> +.TP
> +.I io_num
> +The number of read IOs sent to the current path continuously, used to calculate the average path latency. Valid Values: Integer, [10, 200].
> +.RE
> +.TP 12
>  .I alua
>  If \fIexclusive_pref_bit\fR is set, paths with the \fIpreferred path\fR bit
>  set will always be in their own path group.
> diff --git a/libmultipath/prio.h b/libmultipath/prio.h
> index 0193c52..c97fe39 100644
> --- a/libmultipath/prio.h
> +++ b/libmultipath/prio.h
> @@ -29,6 +29,7 @@ struct path;
>  #define PRIO_RDAC		"rdac"
>  #define PRIO_WEIGHTED_PATH	"weightedpath"
>  #define PRIO_SYSFS		"sysfs"
> +#define PRIO_PATH_LATENCY	"path_latency"
> 
>  /*
>   * Value used to mark the fact prio was not defined
> 




More information about the dm-devel mailing list