[dm-devel] Antw: [EXT] Re: RFC: one more time: SCSI device identification

Tue May 4 07:32:25 UTC 2021

On 4/27/21 12:52 PM, Ulrich Windl wrote:
>>>> Hannes Reinecke <hare at suse.de> wrote on 27.04.2021 at 10:21 in message
> <2a6903e4-ff2b-67d5-e772-6971db8448fb at suse.de>:
>> On 4/27/21 10:10 AM, Martin Wilck wrote:
>>> On Tue, 2021-04-27 at 13:48 +1000, Erwin van Londen wrote:
>>>>>
>>>>> Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
>>>>> afaics.
>>>>>
>>>> In my view the WWID should never change.
>>>
>>> In an ideal world, perhaps not. But in the dm-multipath realm, we know
>>> that WWID changes can happen with certain storage arrays. See
>>> https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html
>>> and follow-ups, for example.
>>>
>> And it's actually something which might happen quite easily.
>> The storage array can unmap a LUN, delete it, create a new one, and map
>> that one into the same LUN number as the old one.
>> If we didn't do I/O during that interval, upon the next I/O we will be
>> getting the dreaded 'Power-On/Reset' sense code.
>> _And nothing else_, due to the arcane rules for sense code generation in
>> SAM.
>> But we end up with a completely different device.
>>
>> The only way out of it is to do a rescan for every POR sense code, and
>> disable the device e.g. via DID_NO_CONNECT whenever we find that the
>> identification has changed. We already have a copy of the original VPD
>> page 0x83 at hand, so that should be reasonably easy.
> 
> I don't know the depth of the SCSI or FC protocol, but storage systems
> typically signal such events, maybe either via some unit attention or some FC
> event. Older kernels logged that there was a change, but a manual SCSI bus scan
> is needed, while newer kernels find new devices "automagically" for some
> products. The HP EVA 6000 series worked that way, a 3PAR StoreServ 8000 series
> also seems to work that way, but not Pure Storage X70 R3. For the latter you
> need something like a FC LIP to make the kernel detect the new devices (LUNs).
> I'm unsure where the problem is, but in principle the kernel can be
> notified...
> 
My point was that while there _is_ a unit attention with the sense code 
'INQUIRY DATA CHANGED' (and that indeed will generate a kernel message), 
it might be obscured by a subsequent unit attention with the sense code 
'Power-On/Reset'; per the SCSI spec, the latter might cause the previous
ones to _not_ be sent.
So from that reasoning we will need to rescan the device upon 
'Power-On/Reset'.
But 'Power-On/Reset' is a sense code which we also get during initial 
device scan, so the problem is that we will be triggering a rescan while 
_doing_ a rescan, and as such it would need some really careful testing.
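
A minimal sketch of that logic, written as a Python model rather than kernel C; it is not the SCSI midlayer code. The names `ScsiDeviceModel`, `read_vpd83`, `on_sense` and `revalidate` are hypothetical. The numeric values are the standard ones: sense key 0x6 (UNIT ATTENTION), ASC 0x29 (power on/reset) and ASC/ASCQ 0x3F/0x03 (inquiry data changed). It shows the two parts that need care: comparing a freshly read page 0x83 against the cached copy, and not starting a rescan while a scan is already running.

```python
from dataclasses import dataclass, field
from typing import Callable, List

UNIT_ATTENTION = 0x06                  # sense key
ASC_POWER_ON_RESET = 0x29              # additional sense code
INQUIRY_DATA_CHANGED = (0x3F, 0x03)    # (ASC, ASCQ)


@dataclass
class ScsiDeviceModel:
    """Hypothetical stand-in for a SCSI device; not a kernel structure."""
    read_vpd83: Callable[[], bytes]    # assumed helper returning raw VPD page 0x83
    cached_vpd83: bytes = b""
    scanning: bool = False
    offline: bool = False              # models failing I/O, e.g. with DID_NO_CONNECT
    log: List[str] = field(default_factory=list)

    def initial_scan(self) -> None:
        self.scanning = True
        try:
            # The first command after attach usually reports Power-On/Reset.
            self.on_sense(UNIT_ATTENTION, ASC_POWER_ON_RESET, 0x00)
            self.cached_vpd83 = self.read_vpd83()
        finally:
            self.scanning = False

    def on_sense(self, key: int, asc: int, ascq: int) -> None:
        if key != UNIT_ATTENTION:
            return
        is_por = asc == ASC_POWER_ON_RESET
        is_changed = (asc, ascq) == INQUIRY_DATA_CHANGED
        if not (is_por or is_changed):
            return
        if self.scanning:
            # Guard against triggering a rescan from within a scan.
            self.log.append("unit attention during scan: no nested rescan")
            return
        self.revalidate()

    def revalidate(self) -> None:
        self.scanning = True
        try:
            current = self.read_vpd83()
        finally:
            self.scanning = False
        if current != self.cached_vpd83:
            self.offline = True
            self.log.append("identification changed: device disabled")
        else:
            self.log.append("identification unchanged")
```

The `scanning` flag is the simplest possible guard; real code would also have to deal with locking and with unit attentions arriving on several paths at once.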

As for the PureStorage behaviour: The problem with changing the LUN 
mapping on the array is that we might not _have_ a device to send
unit attentions to.
If the array already exports LUNs to some other hosts, it doesn't need 
to re-initialize the FC port when starting to export LUNs to _this_ 
host. And as _this_ host doesn't have a LUN on which unit attentions can 
be sent, _and_ the FC port is already registered, there are no events 
whatsoever which would cause the host to initiate a rescan.
To resolve that, the array would need to induce e.g. an RSCN, but that will
only be triggered if an FC port is (re-)registered.
Which is what HPE arrays do: they initiate a link bounce on the attached
ports, which will cause the attached hosts to initiate a rescan.
Of course, _all_ hosts will need to rescan (thereby causing an
interruption even on unrelated hosts), which is why this is not done by
all vendors.
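
When no such event reaches the host, the rescan has to be started from the host side. A minimal sketch, assuming the usual Linux sysfs layout: `/sys/class/scsi_host/hostN/scan` accepts `- - -` as a wildcard for channel, target and LUN, and `/sys/class/fc_host/hostN/issue_lip` requests a LIP on that FC host. The function names and the `sysfs_root` parameter are hypothetical, and writing to the real attributes requires root.

```python
from pathlib import Path
from typing import List


def rescan_scsi_hosts(sysfs_root: str = "/sys") -> List[str]:
    """Write the wildcard '- - -' to every SCSI host's scan attribute."""
    scanned = []
    hosts = Path(sysfs_root, "class", "scsi_host").glob("host*")
    for host in sorted(hosts):
        scan = host / "scan"
        if scan.exists():
            scan.write_text("- - -\n")
            scanned.append(host.name)
    return scanned


def issue_fc_lip(host: str, sysfs_root: str = "/sys") -> bool:
    """Request a LIP on one FC host; returns False if it is not an FC host."""
    lip = Path(sysfs_root, "class", "fc_host", host, "issue_lip")
    if not lip.exists():
        return False
    lip.write_text("1\n")
    return True
```

A LIP disturbs the link, so a plain scan is usually tried first and `issue_fc_lip()` is kept for the case where the scan finds nothing new.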

>>
>> I had a rather lengthy discussion with Fred Knight @ NetApp about
>> Power-On/Reset handling, what with him complaining that we don't handle
>> it correctly. So this really is something we should be looking into,
>> even independently of multipathing.
>>
>> But actually I like the idea from Martin Petersen to expose the parsed
>> VPD identifiers to sysfs; that would allow us to drop sg_inq completely
>> from the udev rules.
> 
> Talking of VPDs: Somewhere in the last 12 years (within SLES 11) there was a
> kernel change regarding trailing blanks in VPD data. That change blew up
> several configurations being unable to re-recognize the devices. In one case
> the software even had bound a license to a specific device with serial number,
> and that software found "new" devices while missing the "old" ones...
> 
That's probably just for VPD page 0x80. Page 0x83 has pretty strict 
rules on how the entries are formatted, so chopping off trailing blanks 
is not easily done there.
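
To illustrate the difference, a minimal sketch of parsing both pages, assuming the layouts from SPC: page 0x80 is a 4-byte header followed by an ASCII serial number, while page 0x83 is a 4-byte header followed by designation descriptors, each with its own 4-byte header carrying code set, designator type and length. The function names are made up for illustration, and this is not how the kernel or sg_inq implement it.

```python
import struct
from typing import List, Tuple


def parse_vpd80(page: bytes, strip: bool = True) -> str:
    """Return the unit serial number from VPD page 0x80."""
    if len(page) < 4 or page[1] != 0x80:
        raise ValueError("not VPD page 0x80")
    length = page[3]
    serial = page[4:4 + length].decode("ascii", errors="replace")
    # Whether trailing blanks are kept changes the resulting identifier.
    return serial.rstrip(" ") if strip else serial


def parse_vpd83(page: bytes) -> List[Tuple[int, int, bytes]]:
    """Return (code_set, designator_type, designator) for each descriptor."""
    if len(page) < 4 or page[1] != 0x83:
        raise ValueError("not VPD page 0x83")
    (page_len,) = struct.unpack_from(">H", page, 2)
    end = min(4 + page_len, len(page))
    descriptors = []
    off = 4
    while off + 4 <= end:
        code_set = page[off] & 0x0F        # 1 = binary, 2 = ASCII, 3 = UTF-8
        desig_type = page[off + 1] & 0x0F  # e.g. 3 = NAA
        dlen = page[off + 3]               # explicit length per descriptor
        descriptors.append((code_set, desig_type, page[off + 4:off + 4 + dlen]))
        off += 4 + dlen
    return descriptors
```

For a page 0x80 carrying the serial `ABC123` padded with two blanks, `parse_vpd80(page, strip=False)` keeps the padding and `parse_vpd80(page)` drops it, which is exactly the kind of change that makes a device look "new" to software keyed on the exact string. A binary NAA designator in page 0x83 has its length in the descriptor header, so there is nothing to trim.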

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer