[dm-devel] [PATCH 0/3] New approach at handling changed WWIDs

Fri Mar 22 14:13:47 UTC 2019

On Fri, Mar 22, 2019 at 12:28:56PM +0100, Martin Wilck wrote:
> On Thu, 2019-03-21 at 17:31 -0500, Benjamin Marzinski wrote:
> > 
> > ideally, we would be able to determine whether or not udev was able
> > to
> > get all the necessary information. It would be nice to be notified if
> > scsi_id failed or udev timed out.
> 
> The latter (udev timeout) won't work without significant changes to
> udev/systemd. Currently udevd simply kills hanging workers with
> SIGKILL, and doesn't bother to send an "incomplete uevent" message to
> monitor listeners.
> 
> What we *might* do in multipathd is listen for kernel uevents _and_
> udev uevents, and figuring out possible problems by relating and
> comparing them. It would be a non-trivial task; we'd have to deal with
> the possibility that kernel events might get lost, or might have
> happened before multipathd was started. During startup, this wouldn't
> help us. Rather, it would enable us to react on events during runtime
> which we currently miss. I see this as a feature enhancement -
> possible, but really hard to get right, and not directly related to the
> problems we are currently dealing with.
> 
> If we look at udev information for a block device (either during device
> probing or uevent processing), and the WWID is not set but some other
> properties (e.g. ID_PATH) are, we can be pretty sure that scsi_id or
> sg_inq have failed during processing of the last uevent for that
> device. If, in this case, we used our fallback action to retrieve the
> WWID, we'd be able to determine if it was a very short-lived problem or
> something more serious.
>  
> > > Whatever we do, we should stop trying to "fix" the path WWID in
> > > disassemble_map(). That's *so* against the separation of concerns
> > > principle. In getuid(), we might check if a path with missing WWID
> > > is
> > > already part of an existing multipath map, and then set the path
> > > WWID
> > > from the map WWID as sort-of a last emergency fallback. But that,
> > > too,
> > > should only be done during startup (assuming that a previous
> > > multipath
> > > or multipathd instance had set up the map correctly, and that udev
> > > information had been "lost" since then), and only after retrying as
> > > described above.
> > 
> > We don't want to remove paths from multipath devices because
> > multipathd
> > started up when the path was missing udev information. The udev
> > properties are trickier, but if we simply have a null WWID, it makes
> > sense to allow it as a last resort if the device otherwise appears to
> > have the same paths as it previously did.
> 
> I agree, but I don't agree with filling in pp->wwid from mpp->wwid.

I'm fine with that.

> >  users can always run
> > 
> > # multipath -f
> > 
> > to remove the device. If it looks like some of the paths are supposed
> > to
> > change on the device, we should quite possibly not include paths with
> > a
> > null WWID, because we don't know what has changed.  But we can do
> > this
> > someplace else than in disassemble_map(). 
> 
> How would you determine that something is "supposed to change"?

Sorry. That was worded badly. For example, if we start up and determine
that mpatha should look just like the existing mpatha, except for a path
that in missing information, it makes sense to assume that path also
belongs. If configure determines that some of the paths that are in the
existing mpatha definitely do not belong there, then it shouldn't assume
the paths that we know nothing about do belong there.

So, if it appears that something has changed in a multipath device,
looking at the paths that we do have information for, we can't make
assumptions about the paths we don't have information for. 

> > 
> > > Note that since by-property blacklisting was introduced in 2013,
> > > significant progress has been made in other areas. We have
> > > blacklisting
> > > by transport now, "find_multipaths", the "failed_wwids" logic that
> > > avoids repeated attempts at setting up maps for busy devices, and
> > > the
> > > INIT_MISSING_UDEV logic to deal with incomplete initialization. The
> > > udev rules have been improved as well. So, doing away with
> > > "required
> > > udev properties" may not be so dangerous, after all. 
> > > 
> > > Thoughts?
> > 
> > Another option would be to do some extra work in reconfigure.  If we
> > held on to the old path, and cleaned up everything but the old udev
> > device and file descriptor, we could be sure that the kernel wouldn't
> > reuse that device major:minor while we were reconfiguring. If we got
> > some paths without their udev information, we would have the old udev
> > information to check against the new config, to see if the device
> > should
> > be removed. Again, this works best if we could determine if we were
> > missing udev information.  Although in this case we could probably
> > just
> > use any path that became blacklisted because of not having the
> > necessary
> > property information.
> 
> IOW, we shouldn't blacklist these paths, which is what I was trying to
> say. Your idea to hold the references until reconfigure() is finished
> sounds clever to me.
> 
> The idea of blacklisting-by-missing-properties is to determine if
> ID_SERIAL for a given WWID is _reliable_. It should be used if we get a
> non-zero ID_SERIAL and at the same time none of the required
> properties, e.g. ID_WWN. IMO, if this is the case, we can be certain
> that scsi_id did *not* fail - after all, it was able to obtain
> ID_SERIAL. OTOH, if neither ID_SERIAL nor ID_WWN is set, failure to
> access the device is likely. Thus the solution here is simple: We
> should apply "blacklisting by missing property" *only* if ID_SERIAL is
> set, but ID_WWN is not.

Sure.

-Ben

> Martin
> 
> -- 
> Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
>