[dm-devel] [QUESTION]: multipath device with wrong path lead to metadata err

Martin Wilck mwilck at suse.com
Tue Jan 26 23:11:20 UTC 2021


On Tue, 2021-01-26 at 19:14 +0800, lixiaokeng wrote:
> 
> > > Hi,
> > >   Unfortunately the verify_path() called before *and* after domap()
> > > in coalesce_paths can't solve this problem. I think it is another way
> > > to lead multipath with wrong path, but now I can't find the way from
> > > log.
> > 
> > Can you provide multipathd -v3 logs, and kernel logs? Maybe I'll see
> > something.

This is not a -v3 log, right? We can't see much of what multipathd is
doing. Anyway, I understand now that verify_paths() won't help. It
looks only for paths that have been removed (i.e. don't exist any more
in sysfs) since the last path detection. But when the error occurs, it
seems that sdf has been removed *and re-added*, so the check whether
the path still exists succeeds. The uevents were also missed because
the uevent handler didn't get the lock.
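
To illustrate why an existence check cannot catch this case, here is a
minimal sketch (not multipathd code) that contrasts "does sdf still exist
in sysfs" with "is sdf still the same SCSI device". It assumes the usual
sysfs layout, where /sys/block/<dev>/device is a symlink whose last
component is the H:C:T:L address. The function names and the sysfs_root
parameter are made up for illustration.

```python
import os

def path_exists(devname, sysfs_root="/sys"):
    """Existence-only check: true as long as *some* device owns the name."""
    return os.path.isdir(os.path.join(sysfs_root, "block", devname))

def scsi_address(devname, sysfs_root="/sys"):
    """Return the H:C:T:L the name currently resolves to, or None."""
    link = os.path.join(sysfs_root, "block", devname, "device")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))

def path_unchanged(devname, remembered_hctl, sysfs_root="/sys"):
    """Identity check: the name must still map to the remembered H:C:T:L."""
    return scsi_address(devname, sysfs_root) == remembered_hctl
```

With the sequence from this thread, path_exists("sdf") would stay true
across the logout/login, while path_unchanged("sdf", "4:0:0:2") would turn
false once sdf belongs to 2:0:0:0.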


> 
> (1)multipath -r: The sdf is found as a path of
> 36001405b7679bd96b094bccbf971bc90
> (iscsi node is 4:0:0:2)
> 
> (2)iscsi logout: The sdf is removed in iscsi in system time
> [1202538.467014].
> 
> (3)iscsi login: The sdf appears in iscsi in system time
> [1202538.825745].
> It is a path of 3600140584e11eb1818c4afab12c17800 (iscsi node
> 2:0:0:0)
> 
> Here I have a doubt. When I stop in domap using gdb and iscsi log out/in,
> the sdf will not be used again because the disk refcount is not zero. I
> add a print if the disk refcount is zero in put_disk_and_module (for
> example lxk ref put after: name sdi; count 0), but there is not this
> print about sdf.

Yes, this is a very good point, and it's indeed strange. multipathd
should have opened a file descriptor to /dev/sdf in pathinfo(), and as
long as that file is open, the use count shouldn't drop to 0, the disk
devices (block device and scsi_disk device) shouldn't be released, and
the major/minor number shouldn't be reused. Unless I'm missing
something essential, that is.

> Jan 25 12:37:48 client1 kernel: [1202538.467014] sd 4:0:0:2: [sdf] Synchronizing SCSI cache
> Jan 25 12:37:48 client1 kernel: [1202538.568195] scsi 4:0:0:2: alua: Detached
> Jan 25 12:37:48 client1 kernel: [1202538.630507] sd 2:0:0:0: [sdf] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)

Less than 0.1s between the disappearance of 4:0:0:2 as sdf and reappearance
of 2:0:0:0, without any sign of multipathd having noticed this change,
is indeed quite strange.
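
For reference, the gaps can be computed directly from the three kernel
lines quoted above. This is only a small sketch; the helper name
kernel_time is made up, and Decimal is used so that the microsecond
timestamps are subtracted exactly.

```python
import re
from decimal import Decimal

# Matches the "[seconds.microseconds]" kernel timestamp, not "[sdf]".
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")

def kernel_time(line):
    """Extract the kernel timestamp of a log line as a Decimal, or None."""
    m = STAMP.search(line)
    return Decimal(m.group(1)) if m else None

LOG = [
    "Jan 25 12:37:48 client1 kernel: [1202538.467014] sd 4:0:0:2: [sdf] Synchronizing SCSI cache",
    "Jan 25 12:37:48 client1 kernel: [1202538.568195] scsi 4:0:0:2: alua: Detached",
    "Jan 25 12:37:48 client1 kernel: [1202538.630507] sd 2:0:0:0: [sdf] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)",
]

sync, detached, reappeared = (kernel_time(line) for line in LOG)
print("detach -> new sdf:    ", reappeared - detached)  # 0.062312
print("sync cache -> new sdf:", reappeared - sync)      # 0.163493
```

So the "less than 0.1s" holds when measured from the alua "Detached"
message; measured from the cache synchronization it is about 0.16s.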

So we can only conclude that (unless there's a kernel refcounting bug,
which I doubt) either orphan_path()->uninitialize_path() had been
called (closing the fd), or that opening the sd device had failed in
the first place (in which case the path WWID should have been nulled in
pathinfo()). In both cases it makes little sense that the path should
still be part of a struct multipath.

Please increase the log level of the "Couldn't open device node"
message in pathinfo(), and see whether the corresponding errors are
logged.

Can you verify in the debugger if multipathd still has the fd to the
disk device open?
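
If attaching a debugger is inconvenient, the same information is visible
in /proc/<pid>/fd. The following is a rough sketch of such a check, to be
run as root against the multipathd PID; the function names are made up
for illustration. It matches on the link target, including the
" (deleted)" suffix that procfs appends when the node has been unlinked.

```python
import os

def open_fds(pid):
    """Map fd number -> link target for a process, from /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed while we were looking
    return result

def fds_for_node(pid, node):
    """Return the sorted fd numbers whose target is the given device node."""
    wanted = {node, node + " (deleted)"}
    return sorted(fd for fd, target in open_fds(pid).items()
                  if target in wanted)
```

An empty result from fds_for_node(pid, "/dev/sdf") would suggest that
multipathd holds no fd on the node. Note that a match by name alone does
not tell whether the fd refers to the old or to the new sdf.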

Perhaps you could trace scsi_disk_release() in the kernel?

Martin





