[dm-devel] [RFC PATCH 14/16] multipath.rules: find_multipaths+ignore_wwids logic

Martin Wilck mwilck at suse.com
Tue Jan 30 13:07:57 UTC 2018


On Mon, 2018-01-29 at 16:28 -0600, Benjamin Marzinski wrote:
> On Fri, Jan 26, 2018 at 06:29:04PM +0100, Martin Wilck wrote:
> > On Thu, 2018-01-25 at 07:40 -0600, Benjamin Marzinski wrote:
> > > 
> > > Possibly another option is to check if something else is using
> > > the device and to not claim the device in this case. That will
> > > solve the initramfs case, which is the really bad one.
> > 
> > I agree it's bad, but AFAICS "i" only eliminates a part of the
> > problem.
> 
> I agree "i" isn't a perfect solution.  But adding code to never claim
> a
> device that is currently in use will help things, regardless of any
> other changes we do or whatever mode we are in.  If we are in
> agreement
> that reassign_maps isn't safe, then there's never a good reason for
> multipathd to set up on a path that is already in use. If we don't do
> that, then we should never claim these paths either. 

If a path is in use, multipathd fails to claim it. I've been pondering
whether to simply try "open(device, O_EXCL)" to test for "is in use".
If the open() fails, the problem is to determine whether the device is
permanently in use (e.g. already mounted or claimed by MD) or only
temporarily in use (because some other process is probing it at the
same time as us). Retrying, at least over more than a minimal time
span, is out of the question. Maybe multipathd could try to claim a
single-path map (if should_multipath is true), and only in the case of
failure try the open() call, to test whether the reason for the
failure was that the path was in use.
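
Roughly like this (an untested sketch, not from the patch set; the
helper name is made up):

#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Returns 1 if the device appears to be in use, 0 if we could grab
 * it, -1 on other errors.  Caveat from above: a failed open() can't
 * distinguish a permanent holder (mounted fs, MD, ...) from another
 * process that is merely probing the device at the same moment. */
static int path_in_use(const char *devnode)
{
	int fd = open(devnode, O_RDONLY | O_EXCL);

	if (fd >= 0) {
		close(fd);
		return 0;
	}
	return errno == EBUSY ? 1 : -1;
}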

Btw, I've been wondering what the additional "in use" logic based on
lock_multipath()/flock() is useful for these days. We can either grab
a path (which comes down to open(..., O_EXCL)), or we can't. It seems
odd to have an additional, not fully equivalent, "in use" check
(lock_multipath() success doesn't imply subsequent O_EXCL success!).

IMO we should determine whether we're supposed to grab the path
(should_multipath(), using the best heuristics we can), and if yes, we
should immediately do so without doing other checks first. If that
fails, we might either retry or fall back. Once we have the O_EXCL fd,
we may also (try to) flock() the path to notify other processes using
flock(), but that would be fully optional.
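
In code, the order I have in mind would look something like this
(again just a sketch; claim_path() is a made-up name, and error
handling is omitted):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Grab the path first; advertise via flock() second. */
static int claim_path(const char *devnode)
{
	int fd = open(devnode, O_RDONLY | O_EXCL);

	if (fd < 0)
		return -1;	/* caller may retry or fall back */

	/* Optional courtesy for other flock()-aware processes; the
	 * O_EXCL claim already succeeded, so a failure here is
	 * deliberately ignored. */
	(void)flock(fd, LOCK_EX | LOCK_NB);

	return fd;	/* keep the fd open to hold the claim */
}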

>  
> > > It will still leave the case where multipath will never be able
> > > to create the device for some other reason, but that is almost
> > > always a configuration issue, where for instance multipath
> > > should be blacklisting a whole class of devices that it isn't.
> > 
> > Hannes and I have recently discussed an idea for how to deal with
> > this situation. It's quite high on my todo list. The basic idea is
> > that if multipathd fails to set up a device, it leaves a flag
> > somewhere and retriggers an uevent. When multipath sees the flag,
> > it doesn't claim the device next time.
> 
> How is this different from the patches you just wrote to deal with
> the find_multipaths case? It claims the path device immediately,
> sets a timer, and if multipathd doesn't set itself up within that
> time, it unclaims the path. The only addition you need is for
> multipathd to trigger that change event for devices that it fails to
> set up on, so that you don't have to wait for the timeout. You still
> do need a backup

That's the idea. But we need a flag that multipath -u will see.

> timeout to deal with the case where multipathd doesn't start up
> correctly.

Hm. IMO we have to assume that multipathd will start up. If it doesn't,
all the schemes we've discussed so far would fail, even strict fiN. But
you've just hinted that my "retry timer" idea might actually be a
solution to this problem :-)
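
To make the flag idea a bit more concrete, here's a rough sketch of
what multipathd might do on setup failure (all paths and names below
are hypothetical, and the marker format is undecided):

#include <stdio.h>

/* Hypothetical: drop a "setup failed" marker for multipath -u to
 * find, then retrigger a change uevent by writing to the device's
 * sysfs uevent attribute. */
static int mark_failed_and_retrigger(const char *devname)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/run/multipath/failed/%s", devname);
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	fclose(f);

	snprintf(path, sizeof(path), "/sys/block/%s/uevent", devname);
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	fputs("change", f);
	fclose(f);
	return 0;
}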

> 
> > 
> > If you insist, we'll just let "fin" survive as, say,
> > 'find_multipaths "conservative"'. Or we call "fIn" "greedy" and
> > "fin" "no".
> 
> I'm fine with this. But, assuming you agree with my two comments
> above, I'm not sure why you don't like the five options outlined in
> my last email:
> 
> fin like we have now
> Fin like we have now
> fIn with the ability to go back and unclaim the device
> Fin with the ability to go back and unclaim the device
> xxN where f/F and i/I are ignored

I think we should distinguish between "what we try to claim" and "what
we do if we fail". The ability to go back in the error case is highly
desirable, but orthogonal to the options which describe the "try to"
side. Luckily, "go back and unclaim" needs no option of its own - it
seems obvious that failure recovery shouldn't be disabled (assuming we
get this right).

> * all with the additional check that we don't claim or assemble on
> paths that are currently in use.
> 
> I really don't care how we expose these options to the user. I was
> just pointing out that if we used "find_multipaths", "ignore_wwids",
> and "no_new_devs", where "ignore_wwids" implied the ability to go
> back and unclaim the device, then every combination of the three is
> valid. But whether we use one config option with 5 keywords, or
> three boolean config options, or whatever, is fine by me.

I think one config option is far better, if only for the sake of
documenting it. I hate stuff like "option X does Y if set to Z, but
only if option A is set to B, ...". It's bad for readers and even
worse for writers.

>  
> > We had a case where the customer ran "dracut --add multipath", but
> > forgot to run "systemctl enable multipath.service". Because we use
> > "fIn", multipathd grabbed the device in the initrd, and afterwards
> > the root device's subvolumes couldn't be mounted, because systemd
> > would try to mount the members, which were locked by the multipath
> > device.
> 
> So, I'm still a little fuzzy on what happened here. Are you saying
> that this case was also an instance of 4C.2 after all? If not that,
> was the problem that multipath was assembled on the path device, but
> not having multipathd enabled made "multipath -u" not claim the
> device after the switch-root, making it appear available for others
> to use?

Exactly. systemd would have been able to mount the other file systems
through the mpath device, but the path devices show up earlier during
"coldplug", and once the mount unit is in FAILED state, systemd heads
towards emergency mode.

>  Perhaps in that case, if we fail the check to see if multipathd is
> running/wanted in "multipath -u", we should still check if the path
> device is already multipathed, and if it is, we should claim it
> regardless, because nobody else will be able to use it.

Yes, that might have helped in this failure scenario.
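
For reference, checking "is already multipathed" could be as simple as
looking for a dm-* holder of the path device in sysfs. A sketch only,
under my assumption that a dm holder means a map is on top; this is
not what multipath -u currently does:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if some dm device (e.g. a multipath map) holds the path
 * device, 0 otherwise. */
static int has_dm_holder(const char *devname)
{
	char path[256];
	DIR *dir;
	struct dirent *de;
	int found = 0;

	snprintf(path, sizeof(path), "/sys/block/%s/holders", devname);
	dir = opendir(path);
	if (dir == NULL)
		return 0;
	while ((de = readdir(dir)) != NULL)
		if (strncmp(de->d_name, "dm-", 3) == 0) {
			found = 1;
			break;
		}
	closedir(dir);
	return found;
}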

>  
> > >  As an aside, I am personally very wary about reassign_maps.
> > > Multipath doesn't own the other devices it is reloading. There
> > > is nothing to guarantee that someone else isn't trying to modify
> > > the LVM device at the same time. I don't know of a specific bug
> > > with this (we never turn it on), but it seems very risky to
> > > start changing devices we don't own with no coordination.
> > 
> > I've no experience with reassign_maps. It's tempting to think that
> > it
> > could solve all these nasty issues, but my gut feeling, like yours,
> > says that it might be dangerous. We don't turn it on, either.
> > 
> 
> If neither of us finds this safe, we should deprecate it at the
> least, and probably disable it and print a warning message instead,
> if it is set in multipath.conf.

I'm unsure. I don't have the experience to say whether it works or
not.

@ALL (if anyone except Ben has read this far :-): can you share any
experience with reassign_maps?

> > I agree your approach can't hurt, although I still think it will
> > just make a very unlikely outcome even more unlikely (the 4C.3
> > case described above is obviously very special, and your patch
> > wouldn't avoid it). I hope that our mutual ideas can go together.
> > 
> 
> If we can cut down the cases where we end up waiting, I'm fine with
> dropping my idea. I had some discussions this week, and people do
> recognize that there are problems with the existing device assembly
> code in udev. I'm now pretty confident that sooner or later, we will
> have the ability to find out how a device was labelled in the past,
> even between boots.

I'd be interested to learn more about this. Do you have a reference?

Cheers,
Martin

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



