[dm-devel] [PATCH 2/2] multipathd: add recheck_wwid_time option to verify the path wwid

Benjamin Block bblock at linux.ibm.com
Wed Feb 10 18:09:31 UTC 2021


On Tue, Feb 09, 2021 at 10:19:45PM +0000, Martin Wilck wrote:
> On Mon, 2021-02-08 at 23:19 -0600, Benjamin Marzinski wrote:
> > There are cases where the wwid of a path changes due to LUN remapping
> > without triggering a uevent for the changed path. Multipathd has no
> > method for trying to catch these cases, and corruption has resulted
> > because of it.
> > 
> > In order to have a better chance at catching these cases, multipath
> > now has a recheck_wwid_time option, which can either be set to "off"
> > or a number of seconds. If a path has been failed for at least the
> > configured number of seconds, multipathd will recheck its wwid before
> > restoring it, when the path checker sees that it has come back up.
> 
> Can't the WWID change also happen without the path going offline, or
> at least without being offline long enough that multipathd would
> notice?
> 
> > If multipathd notices that a path's wwid has changed, it will remove
> > and re-add the path, just like the existing wwid checking code for
> > change events does. In cases where no uevent occurs, both the udev
> > database entry and sysfs will have the old wwid, so the only way to
> > get a current wwid is to ask the device directly.
> 
> sysfs is wrong too, really? In that case I fear triggering a uevent
> won't fix the situation. You need to force the kernel to rescan the
> device, otherwise udev will fetch the WWID from sysfs again, which
> still has the wrong ID... or what am I missing here?
> 
> > Currently multipath only has code to directly get the wwid for scsi
> > devices, so this option only affects scsi devices. Also, since it's
> > possible that the udev wwid won't match the wwid from get_vpd_sgio(),
> > if multipathd doesn't initially see the two values matching for a
> > device, it will disable this option for that device.
> > 
> > If recheck_wwid_time is not turned off, multipathd will also
> > automatically recheck the wwid whenever an existing path gets an add
> > event, or is manually re-added with cli_add_path().
> > 
> > Co-developed-by: Chongyun Wu <wucy11 at chinatelecom.cn>
> > Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> 
> I am uncertain about this.
> 
> We get one more configuration option that defaults to off and that only
> the truly initiated will understand and use. And even those will not
> know how to set the recheck time. Should it be 1s, 10s, or 100s? We
> already have too many of these options in multipath-tools. We shy away
> from giving users reasonable defaults, with the result that most people
> won't bother.
> 
> I generally don't understand what the UP/DOWN state has to do with
> this. If the WWID can change without any events seen by either the
> kernel or user space, why would the path go down and up again? And even
> if so, why would it matter how long the device remained down?
> 
> But foremost, do we really have to try to deal with configuration
> mistakes as blatant as this? What if a user sets the same WWID for
> different devices, or re-uses the same WWID on different storage
> servers? I already hesitated about the code I added myself for catching
> user errors in the WWIDs file, but this seems even more far-fetched.
> 
> Please convince me.
> 
> This said, I'd like to understand why there are no events in these
> cases. I'd have thought we'd at least get a UNIT ATTENTION (REPORTED
> LUNS DATA HAS CHANGED), which would have caused a uevent. If there was
> no UNIT ATTENTION, I'd blame the storage side. 

Yeah, just for reference, I saw this happening in practice when
something with the LU mapping changed on IBM storage - IIRC I saw it
with capacity changes. You end up in this code in the kernel:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/scsi_error.c?id=92bf22614b21a2706f4993b278017e437f7785b3#n416

And from there you ought to get a uevent for the sdev.

AFAIK the WWID in sysfs might still be wrong, though. The kernel seems to
ignore the UA once it has delivered the uevent.
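
For reference, the recheck rule from the quoted patch description can be
modelled roughly as below. This is only a sketch of the described
behaviour, not the patch code: Path, should_recheck(), on_path_up() and
the read_device_wwid callback are made-up names, and "off" is modelled
as None.

```python
from dataclasses import dataclass
from typing import Callable, Optional

OFF = None  # models recheck_wwid_time = "off" (assumption of this sketch)


@dataclass
class Path:
    udev_wwid: str                         # WWID recorded when the path was added
    failed_since: Optional[float] = None   # time the path failed, None while up
    recheck_enabled: bool = True           # cleared if udev and device WWIDs never matched


def should_recheck(path: Path, now: float, recheck_wwid_time: Optional[float]) -> bool:
    """True if the path was down for at least recheck_wwid_time seconds."""
    if recheck_wwid_time is OFF or not path.recheck_enabled:
        return False
    if path.failed_since is None:
        return False
    return now - path.failed_since >= recheck_wwid_time


def on_path_up(path: Path, now: float, recheck_wwid_time: Optional[float],
               read_device_wwid: Callable[[], str]) -> str:
    """Called when the checker sees the path come back up."""
    if should_recheck(path, now, recheck_wwid_time):
        # Ask the device itself; udev and sysfs may still hold the old WWID.
        if read_device_wwid() != path.udev_wwid:
            return "remove_and_readd"
    path.failed_since = None
    return "reinstate"
```

With a 10 second setting, a path that was down for only 5 seconds is
reinstated without any check even if its WWID changed in the meantime,
which is the gap Martin is asking about.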

> 
> Maybe we need to monitor scsi_device uevents.
> 
> Technical remarks below.
> 
> 

-- 
Best Regards, Benjamin Block  / Linux on IBM Z Kernel Development / IBM Systems
IBM Deutschland Research & Development GmbH    /    https://www.ibm.com/privacy
Vorsitz. AufsR.: Gregor Pillen         /        Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294




