<div dir="ltr">This version is now merged.<div>Thanks.</div><div><br></div><div>Incremental enhancements can go from there if needed.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 21, 2017 at 11:40 AM, Yang Feng <span dir="ltr"><<a href="mailto:philip.yang@huawei.com" target="_blank">philip.yang@huawei.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">libmultipath/prioritizers: Prioritizer for device mapper multipath,<br>
where the corresponding priority values of specific paths are provided<br>
by a latency algorithm on a logarithmic scale. The latency algorithm<br>
depends on two arguments, io_num and base_num.<br>
The principle of the algorithm is as follows:<br>
1. By sending "io_num" read IOs to the current path back to back,<br>
the IOs' average latency can be calculated.<br>
2. The max average latency value is 100s, and the min value is 1us. According<br>
to the average latency of each path and the "base_num" of the logarithmic<br>
scale, the priority "rc" of each path is derived.<br>
<br>
For example: If base_num=10, the paths will be grouped in priority groups<br>
with path latency <=1us, (1us, 10us], (10us, 100us], (100us, 1ms], (1ms, 10ms],<br>
(10ms, 100ms], (100ms, 1s], (1s, 10s], (10s, 100s], >100s. As follows:<br>
<br>
<=1us (1us, 10us] (10us, 100us] >100s<br>
|------------------|------------------|------------------|...|------------------|<br>
| priority rank 9 | priority rank 8 | priority rank 7 |...| priority rank 0 |<br>
|------------------|------------------|------------------|...|------------------|<br>
Priority Rank Partitioning<br>
<br>
Signed-off-by: Yang Feng <<a href="mailto:philip.yang@huawei.com">philip.yang@huawei.com</a>><br>
Reviewed-by: Benjamin Marzinski <<a href="mailto:bmarzins@redhat.com">bmarzins@redhat.com</a>><br>
Reviewed-by: Martin Wilck <<a href="mailto:mwilck@suse.com">mwilck@suse.com</a>><br>
Reviewed-by: Xose Vazquez Perez <<a href="mailto:xose.vazquez@gmail.com">xose.vazquez@gmail.com</a>><br>
Reviewed-by: Hannes Reinecke <<a href="mailto:hare@suse.de">hare@suse.de</a>><br>
---<br>
<span class=""> libmultipath/prio.h | 1 +<br>
libmultipath/prioritizers/Makefile | 4 +<br>
libmultipath/prioritizers/path_latency.c | 257 +++++++++++++++++++++++++++++++<br>
multipath/multipath.conf.5 | 20 +++<br>
4 files changed, 282 insertions(+)<br>
create mode 100644 libmultipath/prioritizers/path_latency.c<br>
<br>
</span>diff --git a/libmultipath/prio.h b/libmultipath/prio.h<br>
index 0193c52..c97fe39 100644<br>
--- a/libmultipath/prio.h<br>
+++ b/libmultipath/prio.h<br>
@@ -29,6 +29,7 @@ struct path;<br>
#define PRIO_RDAC "rdac"<br>
#define PRIO_WEIGHTED_PATH "weightedpath"<br>
#define PRIO_SYSFS "sysfs"<br>
+#define PRIO_PATH_LATENCY "path_latency"<br>
<br>
/*<br>
* Value used to mark the fact prio was not defined<br>
diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile<br>
index 36b42e4..ca47cdf 100644<br>
--- a/libmultipath/prioritizers/Makefile<br>
+++ b/libmultipath/prioritizers/Makefile<br>
@@ -18,6 +18,7 @@ LIBS = \<br>
libpriorandom.so \<br>
libpriordac.so \<br>
libprioweightedpath.so \<br>
+ libpriopath_latency.so \<br>
libpriosysfs.so<br>
<br>
all: $(LIBS)<br>
@@ -25,6 +26,9 @@ all: $(LIBS)<br>
libprioalua.so: alua.o alua_rtpg.o<br>
$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^<br>
<br>
+libpriopath_latency.so: path_latency.o ../checkers/libsg.o<br>
+ $(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^ -lm<br>
+<br>
libprio%.so: %.o<br>
$(CC) $(LDFLAGS) $(SHARED_FLAGS) -o $@ $^<br>
<br>
diff --git a/libmultipath/prioritizers/path_latency.c b/libmultipath/prioritizers/path_latency.c<br>
new file mode 100644<br>
index 0000000..046e13b<br>
--- /dev/null<br>
+++ b/libmultipath/prioritizers/path_latency.c<br>
@@ -0,0 +1,257 @@<br>
+/*<br>
+ * (C) Copyright HUAWEI Technology Corp. 2017, All Rights Reserved.<br>
+ *<br>
+ * path_latency.c<br>
+ *<br>
+ * Prioritizer for device mapper multipath, where the corresponding priority<br>
+ * values of specific paths are provided by a latency algorithm. The<br>
+ * latency algorithm depends on two arguments, "io_num" and "base_num".<br>
+ *<br>
+ * The principle of the algorithm is as follows:<br>
+ * 1. By sending a certain number "io_num" of read IOs to the current path<br>
+ * continuously, the IOs' average latency can be calculated.<br>
+ * 2. Max value and min value of average latency are constant. According to<br>
+ * the average latency of each path and the "base_num" of logarithmic<br>
+ * scale, the priority "rc" of each path can be provided.<br>
+ *<br>
+ * Author(s): Yang Feng <<a href="mailto:philip.yang@huawei.com">philip.yang@huawei.com</a>><br>
+ *<br>
+ * This file is released under the GPL version 2, or any later version.<br>
+ *<br>
+ */<br>
+#include <stdio.h><br>
+#include <math.h><br>
+#include <ctype.h><br>
+#include <time.h><br>
+<br>
+#include "debug.h"<br>
+#include "prio.h"<br>
+#include "structs.h"<br>
+#include "../checkers/libsg.h"<br>
+<br>
+#define pp_pl_log(prio, fmt, args...) condlog(prio, "path_latency prio: " fmt, ##args)<br>
+<br>
+#define MAX_IO_NUM 200<br>
+#define MIN_IO_NUM 2<br>
+<br>
+#define MAX_BASE_NUM 10<br>
+#define MIN_BASE_NUM 2<br>
+<br>
+#define MAX_AVG_LATENCY 100000000. /*Unit: us*/<br>
+#define MIN_AVG_LATENCY 1. /*Unit: us*/<br>
+<br>
+#define DEFAULT_PRIORITY 0<br>
+<br>
+#define MAX_CHAR_SIZE 30<br>
+<br>
+#define USEC_PER_SEC 1000000LL<br>
+#define NSEC_PER_USEC 1000LL<br>
+<br>
+static long long path_latency[MAX_IO_NUM];<br>
+<br>
+static inline long long timeval_to_us(const struct timespec *tv)<br>
+{<br>
+ return ((long long) tv->tv_sec * USEC_PER_SEC) + (tv->tv_nsec / NSEC_PER_USEC);<br>
+}<br>
+<br>
+static int do_readsector0(int fd, unsigned int timeout)<br>
+{<br>
+ unsigned char buf[4096];<br>
+ unsigned char sbuf[SENSE_BUFF_LEN];<br>
+ int ret;<br>
+<br>
+ ret = sg_read(fd, &buf[0], 4096, &sbuf[0],<br>
+ SENSE_BUFF_LEN, timeout);<br>
+<br>
+ return ret;<br>
+}<br>
+<br>
+int check_args_valid(int io_num, int base_num)<br>
+{<br>
+ if ((io_num < MIN_IO_NUM) || (io_num > MAX_IO_NUM))<br>
+ {<br>
+ pp_pl_log(0, "args io_num is outside the valid range");<br>
+ return 0;<br>
+ }<br>
+<br>
+ if ((base_num < MIN_BASE_NUM) || (base_num > MAX_BASE_NUM))<br>
+ {<br>
+ pp_pl_log(0, "args base_num is outside the valid range");<br>
+ return 0;<br>
+ }<br>
+<br>
+ return 1;<br>
+}<br>
+<br>
+/* In multipath.conf, args take the form "io_num|base_num". For example,<br>
+ * with args "20|10" this function yields io_num value 20 and<br>
+ * base_num value 10.<br>
+ */<br>
+static int get_ionum_and_basenum(char *args,<br>
+ int *ionum,<br>
+ int *basenum)<br>
+{<br>
+ char source[MAX_CHAR_SIZE];<br>
+ char vertica = '|';<br>
+ char *endstrbefore = NULL;<br>
+ char *endstrafter = NULL;<br>
+ unsigned int size;<br>
+<br>
+ if ((args == NULL) || (ionum == NULL) || (basenum == NULL))<br>
+ {<br>
+ pp_pl_log(0, "args string is NULL");<br>
+ return 0;<br>
+ }<br>
+ size = strlen(args); /* measured only after the NULL check above */<br>
+ if ((size < 1) || (size > MAX_CHAR_SIZE-1))<br>
+ {<br>
+ pp_pl_log(0, "args string's size is too long");<br>
+ return 0;<br>
+ }<br>
+<br>
+ memcpy(source, args, size+1);<br>
+<br>
+ if (!isdigit(source[0]))<br>
+ {<br>
+ pp_pl_log(0, "invalid prio_args format: %s", source);<br>
+ return 0;<br>
+ }<br>
+<br>
+ *ionum = (int)strtoul(source, &endstrbefore, 10);<br>
+ if (endstrbefore[0] != vertica)<br>
+ {<br>
+ pp_pl_log(0, "invalid prio_args format: %s", source);<br>
+ return 0;<br>
+ }<br>
+<br>
+ if (!isdigit(endstrbefore[1]))<br>
+ {<br>
+ pp_pl_log(0, "invalid prio_args format: %s", source);<br>
+ return 0;<br>
+ }<br>
+<br>
+ *basenum = (int)strtol(&endstrbefore[1], &endstrafter, 10);<br>
+ if (check_args_valid(*ionum, *basenum) == 0)<br>
+ {<br>
+ return 0;<br>
+ }<br>
+<br>
+ return 1;<br>
+}<br>
+<br>
+long long calc_standard_deviation(long long *path_latency, int size, long long avglatency)<br>
+{<br>
+ int index;<br>
+ long long total = 0;<br>
+<br>
+ for (index = 0; index < size; index++)<br>
+ {<br>
+ total += (path_latency[index] - avglatency) * (path_latency[index] - avglatency);<br>
+ }<br>
+<br>
+ total /= (size-1);<br>
+<br>
+ return (long long)sqrt((double)total);<br>
+}<br>
+<br>
+int calcPrio(double avglatency, double max_avglatency, double min_avglatency, double base_num)<br>
+{<br>
+ double lavglatency = log(avglatency)/log(base_num);<br>
+ double lmax_avglatency = log(max_avglatency)/log(base_num);<br>
+ double lmin_avglatency = log(min_avglatency)/log(base_num);<br>
+<br>
+ if (lavglatency <= lmin_avglatency)<br>
+ return (int)(lmax_avglatency + 1.);<br>
+<br>
+ if (lavglatency > lmax_avglatency)<br>
+ return 0;<br>
+<br>
+ return (int)(lmax_avglatency - lavglatency + 1.);<br>
+}<br>
+<br>
+/* Calc the latency interval corresponding to the average latency */<br>
+long long calc_latency_interval(double avglatency, double max_avglatency,<br>
+ double min_avglatency, double base_num)<br>
+{<br>
+ double lavglatency = log(avglatency)/log(base_num);<br>
+ double lmax_avglatency = log(max_avglatency)/log(base_num);<br>
+ double lmin_avglatency = log(min_avglatency)/log(base_num);<br>
+<br>
+ if ((lavglatency <= lmin_avglatency)<br>
+ || (lavglatency > lmax_avglatency))<br>
+ return 0;/* Invalid value */<br>
+<br>
+ if ((double)((int)lavglatency) == lavglatency)<br>
+ return (long long)(avglatency - (avglatency / base_num));<br>
+ else<br>
+ return (long long)(pow(base_num, (double)((int)lavglatency + 1))<br>
+ - pow(base_num, (double)((int)lavglatency)));<br>
+}<br>
+<br>
+int getprio (struct path *pp, char *args, unsigned int timeout)<br>
+{<br>
+ int rc, temp;<br>
+ int index = 0;<br>
+ int io_num;<br>
+ int base_num;<br>
+ long long avglatency;<br>
+ long long latency_interval;<br>
+ long long standard_deviation;<br>
+ long long toldelay = 0;<br>
+ long long before, after;<br>
+ struct timespec tv;<br>
+<br>
+ if (pp->fd < 0)<br>
+ return -1;<br>
+<br>
+ if (get_ionum_and_basenum(args, &io_num, &base_num) == 0)<br>
+ {<br>
+ pp_pl_log(0, "%s: failed to get path_latency args", pp->dev);<br>
+ return DEFAULT_PRIORITY;<br>
+ }<br>
+<br>
+ memset(path_latency, 0, sizeof(path_latency));<br>
+<br>
+ temp = io_num;<br>
+ while (temp-- > 0)<br>
+ {<br>
+ (void)clock_gettime(CLOCK_MONOTONIC, &tv);<br>
+ before = timeval_to_us(&tv);<br>
+<br>
+ if (do_readsector0(pp->fd, timeout) == 2)<br>
+ {<br>
+ pp_pl_log(0, "%s: path down", pp->dev);<br>
+ return -1;<br>
+ }<br>
+<br>
+ (void)clock_gettime(CLOCK_MONOTONIC, &tv);<br>
+ after = timeval_to_us(&tv);<br>
+<br>
+ path_latency[index] = after - before;<br>
+ toldelay += path_latency[index++];<br>
+ }<br>
+<br>
+ avglatency = toldelay/(long long)io_num;<br>
+ pp_pl_log(4, "%s: average latency is (%lld us)", pp->dev, avglatency);<br>
+<br>
+ if (avglatency > MAX_AVG_LATENCY)<br>
+ {<br>
+ pp_pl_log(0, "%s: average latency (%lld us) is outside the threshold (%lld us)",<br>
+ pp->dev, avglatency, (long long)MAX_AVG_LATENCY);<br>
+ return DEFAULT_PRIORITY;<br>
+ }<br>
+<br>
+ /* Min average latency and max average latency are constant. The latency_interval<br>
+ for a given avglatency is not constant: it depends on the args base_num.<br>
+ Warn the user if latency_interval is smaller than or equal to (2 * standard_deviation) */<br>
+ standard_deviation = calc_standard_deviation(path_latency, index, avglatency);<br>
+ latency_interval = calc_latency_interval(avglatency, MAX_AVG_LATENCY, MIN_AVG_LATENCY, base_num);<br>
+ if ((latency_interval!= 0)<br>
+ && (latency_interval <= (2 * standard_deviation)))<br>
+ pp_pl_log(3, "%s: latency interval (%lld) according to average latency (%lld us) is smaller than "<br>
+ "or equal to 2 * standard deviation (%lld us), args base_num (%d) needs to be set to a bigger value",<br>
+ pp->dev, latency_interval, avglatency, standard_deviation, base_num);<br>
+<br>
+ rc = calcPrio(avglatency, MAX_AVG_LATENCY, MIN_AVG_LATENCY, base_num);<br>
+ return rc;<br>
+}<br>
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5<br>
index 5939688..8dffac0 100644<br>
--- a/multipath/multipath.conf.5<br>
+++ b/multipath/multipath.conf.5<br>
@@ -293,6 +293,10 @@ Generate a random priority between 1 and 10.<br>
Generate the path priority based on the regular expression and the<br>
priority provided as argument. Requires prio_args keyword.<br>
.TP<br>
+.I path_latency<br>
+Generate the path priority based on a latency algorithm.<br>
+Requires prio_args keyword.<br>
+.TP<br>
.I datacore<br>
.\" XXX<br>
???. Requires prio_args keyword.<br>
@@ -333,6 +337,22 @@ these values can be looked up through sysfs or by running \fImultipathd show pat<br>
"%N:%R:%n:%r"\fR. For example: 0x200100e08ba0aea0:0x210100e08ba0aea0:.*:.* , .*:.*:iqn.2009-10.com.redhat.msp.lab.ask-06:.*<br>
.RE<br>
.TP 12<br>
+.I path_latency<br>
+Needs a value of the form<br>
+\fI"<io_num>|<base_num>"\fR<br>
+.RS<br>
+.TP 8<br>
+.I io_num<br>
+The number of read IOs sent back to back to the current path, used to calculate the average path latency.<br>
+Valid Values: Integer, [2, 200].<br>
+.TP<br>
+.I base_num<br>
+The base of the logarithmic scale, used to partition the priority ranks. Valid Values: Integer,<br>
+[2, 10]. The max average latency value is 100s and the min average latency value is 1us.<br>
+For example: If base_num=10, the paths will be grouped in priority groups with path latency <=1us, (1us, 10us],<br>
+(10us, 100us], (100us, 1ms], (1ms, 10ms], (10ms, 100ms], (100ms, 1s], (1s, 10s], (10s, 100s], >100s.<br>
+.RE<br>
+.TP 12<br>
.I alua<br>
If \fIexclusive_pref_bit\fR is set, paths with the \fIpreferred path\fR bit<br>
set will always be in their own path group.<br>
<span class="HOEnZb"><font color="#888888">--<br>
2.6.4.windows.1<br>
<br>
<br>
</font></span></blockquote></div><br></div>
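For readers checking the priority rank partitioning described in the commit message, here is a minimal sketch of the mapping from an average latency to a rank. It is not the merged code: the names sketch_steps_to_max and sketch_latency_to_prio and the SKETCH_* constants are hypothetical, and the sketch deliberately swaps the patch's floating-point log() arithmetic for repeated integer multiplication so that the bucket edges (1us, 10us, ..., 100s) are exact. The 1us and 100s limits are taken from the patch; base_num is assumed to have been validated to [2, 10] already.

```c
#include <stdio.h>

#define SKETCH_MAX_AVG_LATENCY_US 100000000LL /* 100 s, same limit as the patch */
#define SKETCH_MIN_AVG_LATENCY_US 1LL         /* 1 us, same limit as the patch */

/* Hypothetical helper: how many times latency_us can be multiplied by
 * base_num without exceeding the maximum latency. For a latency inside
 * (min, max] this equals floor(log_base(max / latency)). */
int sketch_steps_to_max(long long latency_us, int base_num)
{
	int steps = 0;

	while (latency_us * base_num <= SKETCH_MAX_AVG_LATENCY_US) {
		latency_us *= base_num;
		steps++;
	}
	return steps;
}

/* Hypothetical stand-in for the rank calculation: a lower average latency
 * yields a higher priority, and anything above the maximum yields 0. */
int sketch_latency_to_prio(long long avg_us, int base_num)
{
	if (base_num < 2)
		return 0; /* assumption: invalid base falls back to priority 0 */

	if (avg_us > SKETCH_MAX_AVG_LATENCY_US)
		return 0;

	/* Everything at or below the minimum shares the top rank. */
	if (avg_us < SKETCH_MIN_AVG_LATENCY_US)
		avg_us = SKETCH_MIN_AVG_LATENCY_US;

	return sketch_steps_to_max(avg_us, base_num) + 1;
}
```

With base_num=10 this reproduces the grouping listed above: 1us or less maps to rank 9, anything in (1us, 10us] maps to rank 8, and anything above 100s maps to rank 0. Integer arithmetic is used here only to keep the example deterministic at the bucket edges; it is not a claim about how the merged prioritizer behaves there.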