[dm-devel] [PATCH] multipath-tools:Prioritizer based on a time-delay algorithm

Yang Feng philip.yang at huawei.com
Mon May 15 10:44:47 UTC 2017


Hello Martin,

Thank you very much for your remarks. I am sorry for late reply, Please find my answer and the updated patch.

> Hello Yang,
> 
> thank you for your work. Please find my remarks below.
> 
> On Mon, 2017-05-08 at 11:58 +0800, Yang Feng wrote:
>> Prioritizer for device mapper multipath, where the corresponding
>> priority
>> values of specific paths are provided by a time-delay algorithm. And
>> the
>> time-delay algorithm is dependent on the following
>> arguments(delay_interval,
>> cons_num).
>> The principle of the algorithm is illustrated as follows:
>> 1. By sending a certain number "cons_num" of read IOs to the current
>> path
>>    continuously, the IOs' average delay can be calculated.
>> 2. According to the average delay of each path and the weight value
>>    "delay_interval", the priority "rc" of each path can be provided.
>>
>>      delay_interval  delay_interval  delay_interval       delay_inter
> 
> How does this algorithm behave under load? Can we be sure that
> priorities don't start to fluctuate wildly because busy paths will
> usually have longer latencies than idle ones?
I have a lot of test under load. When the appropriate value of argument "delay_interval" is set,
this algorithm behave well and can separate the paths who's average delay is more than others.
When add a new path or the path's state change from down to up, getprio() of the prioritizer is triggered, and
the current path is not under IOs.

>>  libmultipath/Makefile                   |   2 +-
>>  libmultipath/checkers/Makefile          |   7 +-
>>  libmultipath/checkers/emc_clariion.c    |   2 +-
>>  libmultipath/checkers/libsg.c           |  94 ------------
>>  libmultipath/checkers/libsg.h           |   9 --
>>  libmultipath/checkers/readsector0.c     |   2 +-
>>  libmultipath/libsg.c                    |  94 ++++++++++++
>>  libmultipath/libsg.h                    |   9 ++
>>  libmultipath/prioritizers/Makefile      |   6 +-
>>  libmultipath/prioritizers/delayedpath.c | 246 
> 
> Why do you have to move libsg for this? It's already used by various
> checkers, why can't your checker do the same? If you really need to do
> it, you should at least separate that part of the patch from the added
> code.
OK, this time, libsg will not be moved.

>> +
>> +#define CHAR_SEC                "SEC"
>> +#define CHAR_MSEC               "MSEC"
>> +#define CHAR_USEC               "USEC"
> 
> I suggest to use "s", "ms", and "us" here instead.
OK, as the following patch.

> If you create an array of "const char*" instead like you did for
> conversion_ratio below, you could implement get_interval_type() more
> elegantly using a loop over that array.
OK, as the following patch.

>> +static int get_interval_type(char *source, char *type)
>> +{  
>> +    /*is USEC*/
>> +    if ((strstr(source, CHAR_USEC) != NULL)
>> +        && (strstr(source, CHAR_USEC)[4] == '_'))
> 
> Please avoid these double strstr() invocation. The compiler may
> optimize it away, but it just looks strange. The following would 
> look better to me, and I find it actually more readable:
> 
>         if (((p = strstr(source, CHAR_USEC)) != NULL) && p[4] == '_')
OK, as the following patch.

>> +static int get_string_from_under(char *args,
>> +                                        char *beforestring,
>> +                                        char *afterstring,
>> +                                        int *type)
> 
> Maybe you could figure out a more descriptive name for this function?
> 
> A comment in the code showing how the string to be parsed typically
> looks like would be helpful for the reader.
OK, as the following patch.

>> +    token = strtok_r(source, under, &saveptr);
>> +    token = strtok(token, char_type);
> 
> I'm pretty sure this is is not what you intended to write. If char_type
> is "usec", this would split the string at the possible delimiters 'u',
> 's', 'e', and 'c' (the 2nd argument of strtok(3) is not a sequence, but
> a 'set' of bytes). It might accidentally work with the input strings
> you are using (in particular because you only look at the first token),
> but nevertheless it's wrong.
OK, as the following patch.

>> +    if ((token == NULL) || (saveptr == NULL))
>> +        return 0;
>> +
>> +    tmp = token;
>> +    while (*tmp != '\0')
>> +        if (!isdigit(*tmp++))
>> +            return 0;
>> +
>> +    tmp = saveptr;
>> +    while (*tmp != '\0')
>> +        if (!isdigit(*tmp++))
>> +            return 0;
>> +
>> +    strncpy(beforestring, token, strlen(token) + 1);
>> +    strncpy(afterstring, saveptr, strlen(saveptr) + 1);
>> +    return 1;
>> +}
> 
> I don't think it's safe to use saveptr the way you do it. The strtok_r
> man page says this parameter is for "internal use". While it makes
> sense to assume that it points to the next token, I'm not sure if
> that's guaranteed. You would be safe by calling 
> 
>     somevar = strtok_r(NULL, under, &saveptr)
> 
> and use "somevar".
OK, as the following patch.

> 
> In general, this whole parsing code is odd. IIUC this parses input
> looking like ([0-9]+)(SEC|MSEC|USEC)_([0-9]+) and sets beforestring,
> type, and afterstring to the regex matches \1, \2, and \3,
> respectively.
> 
> Why don't you start parsing from the beginning of the input, e.g. with
> strtoul(), and look at the rest later?
OK, as the following patch.

>> +
>> +int checkargvalid(int delay_interval, int cons_num, int type)
>> +{
>> +    if (type == INTERVAL_SEC)
>> +    {
>> +        if ((delay_interval < 1) || (delay_interval > 60))
>> +            return 0;
>> +    }
>> +    else if (type != INTERVAL_INVALID)
>> +    {
>> +        if ((delay_interval < 1) || (delay_interval >= 1000))
>> +            return 0;
>> +    }
> 
> You could be more forgiving here. 15000MSEC could be a legal value.
Because this value is more than 1 second, you can use the unit second.

> 
>> +    
>> +    if ((cons_num < 3) || (cons_num > 1000))
>> +        return 0;
>> +
>> +    return 1;
>> +}
>> +
>> +int get_delay_pref_arg(char *args, int *delay_interval, int
>> *cons_num, int *type)
>> +{
>> +    char delayintervalstr[MAX_CHAR_SIZE];
>> +    char consnumstr[MAX_CHAR_SIZE];
>> +
>> +    if (get_string_from_under(args, delayintervalstr, consnumstr,
>> type) == 0)
>> +        return 0;
> 
> It might be good to write the parser so that the consnum part can be
> left out by the user, and assume a reasonable default in that case.
OK, as the following patch.

>> +    while (temp-- > 0)
>> +    {
>> +        (void)gettimeofday(&tv, NULL);
>> +        before = timeval_to_us(&tv);		
>> +
>> +        if (do_readsector0(pp->fd, timeout) == 2)
>> +        {
>> +            condlog(0, "%s: path down", pp->dev);
>> +            return 1;
>> +        }
>> +        
>> +        (void)gettimeofday(&tv, NULL);
> 
> It's better to use clock_gettime(CLOCK_MONOTONIC, ...) here. Then you
> can throw away the delay < 0 check below.
OK, as the following patch.

>> +    toldelay -= min + max;
>> +    avgdelay = toldelay/(long long)(cons_num - 2);
>> +    if (avgdelay > THRES_USEC_VALUE) 
>> +    {           
>> +        condlog(0, "%s: avgdelay is more than thresold", pp->dev);
>> +        return 1;
>> +    }
>> +    
>> +	ratio = get_conversion_ratio(type);
>> +	rc = (int)(THRES_USEC_VALUE - (avgdelay/(((long
>> long)delay_interval) * ratio)));
>> +
>> +    return rc;
>> +}
> 
> Is it reasonable to do these interval calculations synchronously in
> getprio()? cons_num is limited to 1000, so this routine could issue
> 1000 reads on the device before returning. In particular if the device
> is under IO load and the delay is high, execution if this routine could
> be really slow.
> 
> It would make more sense to me to have a separate thread that
> calculates some sort of "running average" for the delay of the
> different paths, and have getprio() just fetch the current value of
> that variable.
> 
> Regards
> Martin
> 
When add a new path or the path's state change from down to up, getprio() of the prioritizer is triggered, and
the current path is not under IOs.
Usually, the 1000 reads will be finish fastly.
And have getprio() need to get the up-to-date average delay of the current path, these interval calculations synchronously
is essential because of unreasonable asynchronously process's executing time D-value, particulayly when add a new path or
the path's state change down to up.

Please find the up-to-date patch below:

---
>From 1a9426dfbad00b5dbefc7020603e40e8896e4869 Mon Sep 17 00:00:00 2001
From: Yang Feng <philip.yang at huawei.com>
Date: Mon, 15 May 2017 18:33:29 +0800
Subject: [PATCH] [dm-devel] [PATCH] multipath-tools:Prioritizer based on a time-delay algorithm
 Prioritizer for device mapper multipath, where the corresponding priority values of specific paths are provided by a time-delay
 algorithm. And the time-delay algorithm is dependent on the following arguments(delay_interval, cons_num). The principle of the
 algorithm is illustrated as follows:
 1. By sending a certain number "cons_num" of read IOs to the current path    continuously, the IOs' average delay can be calculated.
 2. According to the average delay of each path and the weight value "delay_interval", the priority "rc" of each path can be provided.

           delay_interval  delay_interval  delay_interval      delay_interval
	 |---------------|---------------|---------------|   |---------------|
	 |priority rank 1|priority rank 2|priority rank 3|...|priority rank x|
         |---------------|---------------|---------------|   |---------------|
		               Priority Rank Partitioning
---
 libmultipath/prioritizers/Makefile      |   6 +-
 libmultipath/prioritizers/delayedpath.c | 261 ++++++++++++++++++++++++++++++++
 libmultipath/prioritizers/delayedpath.h |  17 +++
 multipath/multipath.conf.5              |  19 +++
 4 files changed, 302 insertions(+), 1 deletion(-)
 create mode 100644 libmultipath/prioritizers/delayedpath.c
 create mode 100644 libmultipath/prioritizers/delayedpath.h

diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile
index 36b42e4..8df5234 100644
--- a/libmultipath/prioritizers/Makefile
+++ b/libmultipath/prioritizers/Makefile
@@ -18,13 +18,17 @@ LIBS = \
 	libpriorandom.so \
 	libpriordac.so \
 	libprioweightedpath.so \
-	libpriosysfs.so
+	libpriodelayedpath.so \
+	libpriosysfs.so

 all: $(LIBS)

 libprioalua.so: alua.o alua_rtpg.o
 	$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^

+libpriodelayedpath.so: delayedpath.o  ../checkers/libsg.o
+	$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^
+
 libprio%.so: %.o
 	$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^

diff --git a/libmultipath/prioritizers/delayedpath.c b/libmultipath/prioritizers/delayedpath.c
new file mode 100644
index 0000000..0490e8d
--- /dev/null
+++ b/libmultipath/prioritizers/delayedpath.c
@@ -0,0 +1,261 @@
+/*
+ * (C) Copyright HUAWEI Technology Corp. 2017, 2021   All Rights Reserved.
+ *
+ * main.c
+ *
+ * Prioritizer for device mapper multipath, where the corresponding priority
+ * values of specific paths are provided by a time-delay algorithm. And the
+ * time-delay algorithm is dependent on arguments.
+ *
+ * The principle of the algorithm as follows:
+ * 1. By sending a certain number "cons_num" of read IOs to the current path
+ *    continuously, the IOs' average delay can be calculated.
+ * 2. According to the average delay of each path and the weight value
+ *    "delay_interval", the priority "rc" of each path can be provided.
+ *
+ * Author(s): Yang Feng <philip.yang at huawei.com>
+ *            Zou Ming <zouming.zouming at huawei.com>
+ *
+ * This file is released under the GPL.
+ */
+#include <stdio.h>
+#include <ctype.h>
+#include <time.h>
+
+#include "debug.h"
+#include "prio.h"
+#include "structs.h"
+#include "../checkers/libsg.h"
+
+#include "delayedpath.h"
+
+#define THRES_USEC_VALUE        300000000LL    /*USEC, 300SEC*/
+#define DEFAULT_DELAY_INTERVAL  10             /*MSEC*/
+#define DEFAULT_CONS_NUM        20
+
+#define MAX_CHAR_SIZE           30
+
+#define CHAR_SEC                "s"
+#define CHAR_MSEC               "ms"
+#define CHAR_USEC               "us"
+
+enum interval_type {
+    INTERVAL_SEC,
+    INTERVAL_MSEC,
+    INTERVAL_USEC,
+    INTERVAL_INVALID
+};
+
+/* interval_unit_str and interval_unit_type keep the same assignment sequence */
+static const char interval_unit_str[][MAX_CHAR_SIZE] = {
+    CHAR_USEC, CHAR_MSEC, CHAR_SEC
+};
+static const int interval_unit_type[] = {
+    INTERVAL_USEC, INTERVAL_MSEC, INTERVAL_SEC
+};
+
+static const int conversion_ratio[] = {
+	[INTERVAL_SEC]		= USEC_PER_SEC,
+	[INTERVAL_MSEC]	        = USEC_PER_MSEC,
+	[INTERVAL_USEC]		= USEC_PER_USEC,
+	[INTERVAL_INVALID]	= 0
+};
+
+
+static int do_readsector0(int fd, unsigned int timeout)
+{
+	unsigned char buf[4096];
+	unsigned char sbuf[SENSE_BUFF_LEN];
+	int ret;
+
+	ret = sg_read(fd, &buf[0], 4096, &sbuf[0],
+		      SENSE_BUFF_LEN, timeout);
+
+	return ret;
+}
+
+static int get_interval_type(char *source, char *type)
+{
+    char *p;
+    int size;
+    int i;
+
+    for (i = 0; i < sizeof(interval_unit_str)/MAX_CHAR_SIZE; i++)
+    {
+        size = strlen(interval_unit_str[i]);
+        p = strstr(source, interval_unit_str[i]);
+        if (p != NULL && p[size] == '|')
+        {
+            memcpy(type, interval_unit_str[i], size+1);
+            return interval_unit_type[i];
+        }
+    }
+
+    return INTERVAL_INVALID;
+}
+
+/* In multipath.conf, args form: delay_interval|cons_num. For example,
+*  args is "10ms|20", this function can get 10, ms, and 20.
+*/
+static int get_digit_and_type(char *args,
+                              int *interval,
+                              int *consnum,
+                              int *type)
+{
+    char typestr[MAX_CHAR_SIZE];
+    char source[MAX_CHAR_SIZE];
+    char vertica[] = "|";
+    char *tokenbefore = NULL;
+    char *tokenafter = NULL;
+    char *tmp = NULL;
+    unsigned int size = strlen(args);
+
+    if ((args == NULL) || (interval == NULL)
+        || (consnum == NULL) || (type == NULL))
+        return 0;
+
+    /* int type */
+    if ((size < 1) || (size > MAX_CHAR_SIZE-1))
+        return 0;
+
+    memcpy(source, args, size+1);
+    if (strstr(source, vertica) == NULL)
+        return 0;
+
+    *type = get_interval_type(source, typestr);
+    if (*type == INTERVAL_INVALID)
+    {
+        condlog(0, "delay_interval type is invalid");
+        return 0;
+    }
+
+    tokenbefore = strtok(source, vertica);
+    tokenafter = strtok(NULL, vertica);
+    typestr[1] = '\0';
+    tokenbefore = strtok(tokenbefore, typestr);
+    if ((tokenbefore == NULL) || (tokenafter == NULL))
+        return 0;
+
+    tmp = tokenbefore;
+    while (*tmp != '\0')
+        if (!isdigit(*tmp++))
+        {
+            condlog(0, "delay_interval string include invalid char");
+            return 0;
+        }
+
+    tmp = tokenafter;
+    while (*tmp != '\0')
+        if (!isdigit(*tmp++))
+        {
+            condlog(0, "cons_num string include invalid char");
+            return 0;
+        }
+
+    *interval = atoi(tokenbefore);
+    *consnum = atoi(tokenafter);
+
+    return 1;
+}
+
+int check_args_valid(int delay_interval, int cons_num, int type)
+{
+    if (type == INTERVAL_SEC)
+    {
+        if ((delay_interval < 1) || (delay_interval > 60))
+        {
+            condlog(0, "delay_interval values is invalid");
+            return 0;
+        }
+    }
+    else if (type != INTERVAL_INVALID)
+    {
+        if ((delay_interval < 1) || (delay_interval >= 1000))
+        {
+            condlog(0, "delay_interval values is invalid");
+            return 0;
+        }
+    }
+
+    if ((cons_num < 3) || (cons_num > 1000))
+    {
+        condlog(0, "cons_num values is invalid");
+        return 0;
+    }
+
+    return 1;
+}
+
+int get_delay_pref_arg(char *args, int *delay_interval, int *cons_num, int *interval_type)
+{
+    if (get_digit_and_type(args, delay_interval, cons_num, interval_type) == 0)
+        return 0;
+
+    if (check_args_valid(*delay_interval, *cons_num, *interval_type) == 0)
+        return 0;
+
+    return 1;
+}
+
+long long get_conversion_ratio(int type)
+{
+    return conversion_ratio[type];
+}
+
+int getprio (struct path *pp, char *args, unsigned int timeout)
+{
+    int rc, delay_interval, cons_num, type, temp;
+    long long delay, avgdelay, ratio;
+    long long min = THRES_USEC_VALUE;
+    long long max = 0;
+    long long toldelay = 0;
+    long long before, after;
+    struct timespec tv;
+
+	if (pp->fd < 0)
+	    return -PRIO_NO_INFORMATION;
+
+    if (get_delay_pref_arg(args, &delay_interval, &cons_num, &type) == 0)
+    {
+        condlog(3, "%s: get delay arg fail", pp->dev);
+        delay_interval = DEFAULT_DELAY_INTERVAL;
+        cons_num = DEFAULT_CONS_NUM;
+        type = INTERVAL_MSEC;
+    }
+
+    temp = cons_num;
+    while (temp-- > 0)
+    {
+        (void)clock_gettime(CLOCK_MONOTONIC, &tv);
+        before = timeval_to_us(&tv);		
+
+        if (do_readsector0(pp->fd, timeout) == 2)
+        {
+            condlog(0, "%s: path down", pp->dev);
+            return -1;
+        }
+
+        (void)clock_gettime(CLOCK_MONOTONIC, &tv);
+        after = timeval_to_us(&tv);
+
+        delay = after - before;
+    	
+        min = (min <= delay) ? min : delay;
+        max = (max >= delay) ? max : delay;
+
+        toldelay += delay;
+    }
+
+    toldelay -= min + max;
+    avgdelay = toldelay/(long long)(cons_num - 2);
+    if (avgdelay > THRES_USEC_VALUE)
+    {
+        condlog(0, "%s: avgdelay is more than thresold", pp->dev);
+        return 1;
+    }
+
+	ratio = get_conversion_ratio(type);
+	rc = (int)(THRES_USEC_VALUE - (avgdelay/(((long long)delay_interval) * ratio)));
+
+    return rc;
+}
diff --git a/libmultipath/prioritizers/delayedpath.h b/libmultipath/prioritizers/delayedpath.h
new file mode 100644
index 0000000..d8213e9
--- /dev/null
+++ b/libmultipath/prioritizers/delayedpath.h
@@ -0,0 +1,17 @@
+#ifndef _DELAYEDPATH_H
+#define _DELAYEDPATH_H
+
+#define PRIO_DELAYED_PATH "delayedpath"
+
+#define PRIO_NO_INFORMATION 5
+
+#define USEC_PER_SEC      1000000LL
+#define USEC_PER_MSEC     1000LL
+#define USEC_PER_USEC     1LL
+
+static inline long long timeval_to_us(const struct timespec *tv)
+{
+	return ((long long) tv->tv_sec * USEC_PER_SEC) + (tv->tv_nsec >> 10);
+}
+
+#endif
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index 5939688..f1e126e 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -293,6 +293,10 @@ Generate a random priority between 1 and 10.
 Generate the path priority based on the regular expression and the
 priority provided as argument. Requires prio_args keyword.
 .TP
+.I delayedpath
+Generate the path priority based on a time-delay algorithm.
+Requires prio_args keyword.
+.TP
 .I datacore
 .\" XXX
 ???. Requires prio_args keyword.
@@ -333,6 +337,21 @@ these values can be looked up through sysfs or by running \fImultipathd show pat
 "%N:%R:%n:%r"\fR. For example: 0x200100e08ba0aea0:0x210100e08ba0aea0:.*:.* , .*:.*:iqn.2009-10.com.redhat.msp.lab.ask-06:.*
 .RE
 .TP 12
+.I delayed
+Needs a value of the form
+\fI"<delay_interval|cons_num>"\fR
+.RS
+.TP 8
+.I delay_interval
+The interval values of average IO-time-delay between two different neighbour ranks of path priority, used to partition different priority ranks.
+Form: XXs, or XXXus, or XXXms. Unit: Second, or Microsecond, or Millisecond. Valid Values: Integer, s [1, 60], ms [1, 1000), us [1, 1000),
+For example: 10s, or 100us, or 100ms. The default is: 10ms.
+.TP
+.I cons_num
+The number of read IOs sent to the current path continuously, used to calculate the average IO-time-delay. Valid Values: Integer, [3, 1000].
+For example: 30. The default is: 20.
+.RE
+.TP 12
 .I alua
 If \fIexclusive_pref_bit\fR is set, paths with the \fIpreferred path\fR bit
 set will always be in their own path group.
-- 










More information about the dm-devel mailing list