[dm-devel] [PATCH 10/16] libmultipath: make pgpolicyfn take a paths vector

Martin Wilck Martin.Wilck at suse.com
Wed Aug 21 10:28:04 UTC 2019


On Tue, 2019-08-20 at 17:55 -0500, Benjamin Marzinski wrote:
> 
> So, I've looked at this, and I'd like to make the case that these
> patches don't put multipath into an inflexible method of dealing with
> path groups. In my mind, there are three issues with path grouping
> that
> my patches deal with
> 
> 1. How the paths are grouped together
> 2. How the pathgroups are ordered
> 3. How the best pathgroup is chosen
> 
> I'd argue that the only issue where my patches really adds some
> significant code is on 1. I think that my choice of groups for 1 is
> correct, and I have a suggestion for making issue 3 go away.
> 
> In regards to the first issue, how the paths are grouped together, my
> patch basically follows two rules:
> 
> 1. Marginal paths shouldn't be grouped with non-marginal paths.
> 
> Obviously, if you want the marginal paths to stay in the same group
> as
> they would normally be in, then you don't want marginal path
> detection
> at all.

Well, this is how multipathd used to behave - the "marginal" property
didn't have any influence on the path grouping. But this kind of
grouping can still be used by setting marginal_pathgroups to "no".
And if someone wants marginal path detection to be active and to affect
path grouping, indeed she doesn't want marginal and non-marginal paths
in one path group.

So yes, I think you are right here.


> >Further, I don't see a scenario where we would like the marginal
> paths to be grouped with non-marginal paths that they wouldn't
> normally
> be grouped with.

Right.

> 2. Marginal paths should be grouped with other marginal paths using
> the
> same rules as for non-marginal paths.
> 
> There are often very good reasons why paths are grouped the way they
> are.  For instance, if the marginal path is passive (I'm not even
> sure
> that we should keep paths in the marginal state if they are
> PATH_GHOST,
> but we currently do), we really don't want to put it in a pathgroup
> with
> active paths. There probably are cases where it is safe to put all,
> or
> some, of the marginal paths together, but it's not easy to know when
> that is, and I don't think there is much benefit from doing the work
> to
> figure it out, and it is always safe to group them like you would if
> they were non-marginal paths.
> 
> I don't see a better way to set up the groups than what my patch
> does,
> so I'm not particularly worried about all the code involved in making
> sure that the groups look like this.

Right. Note that your description of "marginal_pathgroups" in the man
page is a bit misleading, as it talks about "a separate pathgroup",
which could be read as "one separate pathgroup". The paragraph could be
understood such that all marginal paths are lumped together into one
PG, wich is not the case.

You have logically changed the path group vector into a 2-dimensional
Matrix. In the past, similar problems have been solved by simply
scaling the prio values (think ALUA with "exclusive_pref_bit" set), but
that obviously works for grouping only with "group_by_prio".
In fact, your new approach is more powerful, and we might consider
using it in a generalized form in the future.

> As for the second issue, how the pathgroups are ordered, I don't
> think
> my code locks us in at all.  The functions that order the pathgroups
> are
> path_group_prio_update() and sort_pathgroups().  If you wanted to
> make
> multipath just change the priority of marginal pathgroups, you would
> just need to set that priority in path_group_prio_update(), and if
> you
> only wanted to use the priority for sorting them, you would just
> change
> sort_pathgroups() to only sort by priority. If you wanted to change
> pgp->marginal from something binary to something that reflected how
> marginal a path was, and you want to have that to change how a path
> was
> sorted vs other paths with different marginal and priority values,
> you
> could do all of that by simply changing the values set in
> path_group_prio_update() and how those values are sorted in
> sort_pathgroups(). I don't think my code does anything to make this
> less
> flexible.

I guess you are right here, too. (*)

> As for the third issue, how the best pathgroup is chosen, this is
> also a
> case where changing things just involves changing how things are done
> in
> one function, select_path_group(), to match what's done in
> sort_pathgroups(). But since the pathgroups are already in order from
> the best one to use to the worst, the multipath userspace tools could
> just work how the kernel works, and pick the first usable pathgroup.
> This won't always give the same answer that multipath currently does,
> since the current code looks at the number of enabled paths. But the
> only times when it will be different is when there are multiple
> pathgroups with the same priority, and the first one has some failed
> paths.  Since we rarely have multiple pathgroups with the same
> priorty
> except when using the failover policy (group_by_serial and
> group_by_node_name being rarely used), and you will always have one
> path
> per pathgroup with failover, in practice this will almost never
> occur.

I don't see an inconsistency here. sort_pathgroups() also takes the
number of enabled paths into account; it's just not always called when
paths are failed or reinstated. The point is just that in the corner
case you describe, we run a lightweight "switch_group" ioctl rather
than reloading the map with a new PG ordering, which is good. (**) 

> Oh, I did notice a bug in my select_path_group() code. I should be
> calling path_group_prio_update() before checking if the pathgroup
> is marginal or not. I'll fix that.
> 
> So, I'm not against making this all work with priorities if there is
> a
> reason to do so, but doing that will just involve changes in 3
> straightforward functions, or 2 if we decide to simplify
> select_path_group() so it just uses the order from sort_pathgroups()
> as
> a guide.
> 
> If you feel strongly about doing this with priorities, I can add a
> patch
> that changes this. But if it gets us the same results, then I don't
> see
> much benefit. We can always change it laster if we want to change how
> the groups actually end up.

I had a discussion with Hannes, and it went rather the opposite way of
what I'd written in my previous post (consider multi-dimentional
priorities/grouping for the future). He strongly supported your notion
that "marginality" should be the most important grouping criterion /
priority factor, taking precedence over other factors.

So yes, you've mostly convinced me. Please re-post with the fixes you
mentioned, and I'll ack the series. I still have a number of open
issues with how we do prioritizing and grouping in general, but I agree
your series is a step in the right direction.

Cheers, Martin


(*) Note to self: It just occured to me that rather than using sorting,
we could use dm-multipaths's concept of "disabled" or "bypassed" path
groups to represent marginal groups. When switching PGs, the kernel
tries bypassed PGs only if all others have no valid paths, so this fits
your marginal model quite well. But then, we wouldn't have the
flexibility that PG sorting provides, as you argued. Moreover, PGs
can't start out bypassed when a map is reloaded, opening a short race
window where the kernel could try using these PGs. Finally, the kernel
sets the bypassed flag when a SCSI DH returns SCSI_DH_DEV_TEMP_BUSY,
which is used by the EMC and ALUA DHs, and at least by the latter in a 
questionable way (indicating unspecific errors). I find this whole
kernel logic weird - why disable the entire PG if a DH error occurs on
a single path? - but that's a different issue.

(**) In the corner-case-in-corner-case where all paths of this selected
PG (which is now not the first) fail, the kernel might come to a
different fallback conclusion as if we had re-sorted PGs. But I think
this can be ignored).





More information about the dm-devel mailing list