[dm-devel] ALUA - rescan device capacity on zero sized block devices

Tue Apr 14 08:14:47 UTC 2015

On 04/14/2015 09:20 AM, Thomas Wouters wrote:
> ----- On Apr 13, 2015, at 7:44 PM, Bart Van Assche bart.vanassche at sandisk.com wrote:
>> On 04/13/15 17:32, Thomas Wouters wrote:
>>> We're performing some tests with open-iscsi and multipath on two 3par
>>> servers and their peer persistence feature.
>>> 3par is a commercial storage solution that uses ALUA to allow failover.
>>> We have two connections from each 3par server to a linux server.
>>>
>>> Every 3par server has two network controllers, so on our linux server we
>>> initiate 4 iscsi connections.
>>> Multipath detects that two of these connections are active paths (both
>>> to the same 3par device, that is active at that point) and two are ghost
>>> paths, to the passive 3par device.
>>>
>>> At this moment we have four block devices, the active paths show the
>>> actual device size and the standby paths show the devices as zero sized:
>>>
>>> # multipath -ll
>>> 360002ac000000000000000420001510c dm-3 3PARdata,VV
>>> size=100G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
>>> |-+- policy='round-robin 0' prio=130 status=active
>>> | |- 48:0:0:123 sdc 8:32 active ready running
>>> | `- 50:0:0:123 sdb 8:16 active ready running
>>> `-+- policy='round-robin 0' prio=1 status=enabled
>>>    |- 49:0:0:123 sdd 8:48 active ghost running
>>>    `- 51:0:0:123 sde 8:64 active ghost running
>>>
>>> # cat /sys/block/sdb/size
>>> 209715200
>>> # cat /sys/block/sdc/size
>>> 209715200
>>> # cat /sys/block/sdd/size
>>> 0
>>> # cat /sys/block/sde/size
>>> 0
>>>
>>> As soon as we perform a switchover on the 3par systems, multipath
>>> detects the priority changes and switches paths but the new active paths
>>> fail.
>>> We believe this is because 3par doesn't allow us to read the capacity of
>>> the disk on a standby path - and we have proof of this in the logs:
>>>
>>> Apr 13 15:05:12 deb-3par-test kernel: [   40.079736] sd 5:0:0:0: [sdc]
>>> READ CAPACITY failed
>>>
>>> Unfortunately, once we perform the switchover on 3par, the capacity of
>>> those old ghost paths, now active paths, is not re-read.  The multipath
>>> device is therefore reduced to a size of 0 and the filesystem becomes
>>> unavailable.
>>>
>>> If we only log in on the two active paths without starting multipath,
>>> perform a switchover, then log in on the two new active paths and start
>>> multipath, we have four block devices with a non-zero size and we can
>>> perform switchovers at will without any issues.
>>>
>>> We've found some older discussions describing these issues on the scsi
>>> target-devel and dm-devel mailing lists:
>>> - http://permalink.gmane.org/gmane.linux.scsi.target.devel/6531
>>> - https://www.redhat.com/archives/dm-devel/2014-July/msg00156.html
>>>
>>> As far as we can conclude after reading these messages, disallowing
>>> READ CAPACITY on ghost paths is correct behavior.  However, once
>>> the path becomes active, we do need a reread of the capacity in order
>>> for the path to be functional...
>>>
>>> We've created a workaround for our issue but we're not sure we're going
>>> in the right direction.
>>>
>>> diff --git a/multipathd/main.c b/multipathd/main.c
>>> index f876258..ff32681 100644
>>> --- a/multipathd/main.c
>>> +++ b/multipathd/main.c
>>> @@ -1235,6 +1235,11 @@ check_path (struct vectors * vecs, struct path * pp)
>>>
>>> pp->chkrstate = newstate;
>>> if (newstate != pp->state) {
>>> +
>>> + if (newstate == PATH_UP && pp->size != pp->mpp->size ) {
>>> + sysfs_attr_set_value(pp->udev, "device/rescan", "1\n",2);
>>> + }
>>> +
>>> int oldstate = pp->state;
>>> pp->state = newstate;
>>
>> Hello Thomas,
>>
>> The above patch will trigger a rescan after every failover and failback.
>> I'm afraid that will slow down failover and failback, especially if the
>> number of LUNs is large. I would appreciate it if the capacity would be
>> reexamined only if it is not yet known.
>>
>> Thanks,
>>
>> Bart.
> 
> Hi Bart,
> 
> I realize this is not the best way to handle the situation.
> This patch was never meant to be applied as is; it is more of an illustration of how
> we look at the issue.
> 
> If we resize a LUN on the storage servers, the new size can't be read on standby paths.
> Does this mean that if a failover occurs for any reason we could end up with a corrupt block device?
> 
> Is there a better way to rescan the capacity? Using sysfs_attr_set_value() like this
> doesn't look clean to me.
> 
> Would it make sense to make this a configurable setting which is used for systems
> that don't allow READ CAPACITY on standby paths?
> 
Finally someone tripped over it.

We noticed some time ago that the current multipath-tools
implementation does not follow SPC-3, as it relies on a valid
capacity when assembling paths.

I've done a patch for this (cf. commit
cb4d86aac16bedaa22fdda8ee14130dfa2a98563), but it is only
a partial solution, as it only allows paths with a zero
capacity to be added to existing multipath maps.
If we come across a _new_ path with zero capacity it'll still be
skipped. Also the current 'rescan' functionality within the kernel
is very rudimentary, and would not allow a full device update.
I.e., if a device has a capacity of zero, we currently do not
automatically update it upon switch-over.
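
To make the "re-read the capacity only when it is not yet known" idea
from earlier in the thread concrete, here is a minimal userspace sketch.
It is not multipath-tools code: the function names are hypothetical, and
it assumes the caller passes the paths of the block device's 'size'
attribute and the SCSI device's 'rescan' attribute in sysfs. A rescan is
requested only when the reported size is zero, so paths with a known
capacity are left alone on failover and failback.

```c
#include <assert.h>
#include <stdio.h>

/* Read a sysfs 'size' attribute (capacity in 512-byte sectors).
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
int read_capacity_sectors(const char *size_attr, unsigned long long *sectors)
{
    FILE *f;
    int n;

    assert(size_attr && sectors);
    f = fopen(size_attr, "r");
    if (!f)
        return -1;
    n = fscanf(f, "%llu", sectors);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Hypothetical helper: write "1" to the 'rescan' attribute, but only
 * when the current capacity is zero (i.e. not yet known).
 * Returns 1 if a rescan was requested, 0 if none was needed,
 * -1 on error. */
int rescan_if_capacity_unknown(const char *size_attr, const char *rescan_attr)
{
    unsigned long long sectors;
    FILE *f;
    int ok;

    assert(rescan_attr);
    if (read_capacity_sectors(size_attr, &sectors) != 0)
        return -1;
    if (sectors != 0)
        return 0;               /* capacity already known, skip rescan */
    f = fopen(rescan_attr, "w");
    if (!f)
        return -1;
    ok = fputs("1\n", f) >= 0;
    if (fclose(f) != 0)         /* the write may only fail at flush time */
        ok = 0;
    return ok ? 1 : -1;
}
```

For the ghost path in the report this would be called (as root) along
the lines of rescan_if_capacity_unknown("/sys/block/sdd/size",
"/sys/block/sdd/device/rescan") once the path checker sees the path
come up. It does nothing for the LUN resize case, where the old size is
non-zero but stale.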

What we can try is to listen for uevents signalling a device capacity
change or an ALUA state change, and retry READ CAPACITY on those
events. That would tie in nicely with the proposed device rescan
mechanism discussed at LSF.
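
As a rough illustration of the uevent side, the sketch below only
decides whether a received uevent should trigger a capacity re-read.
It assumes the usual kernel uevent payload layout (a sequence of
NUL-terminated KEY=VALUE strings) and that the SCSI layer reports unit
attentions as SDEV_UA=CAPACITY_DATA_HAS_CHANGED and
SDEV_UA=ASYMMETRIC_ACCESS_STATE_CHANGED; those values are quoted from
memory and should be checked against the target kernel. The function
names are hypothetical, and receiving the event (netlink or libudev) is
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up 'key' in a uevent payload of NUL-terminated "KEY=VALUE"
 * strings. Returns a pointer to the value, or NULL if the key is
 * absent or the payload is not properly terminated. */
const char *uevent_get(const char *buf, size_t len, const char *key)
{
    size_t klen, pos = 0;

    assert(buf && key);
    klen = strlen(key);
    while (pos < len) {
        const char *entry = buf + pos;
        const char *end = memchr(entry, '\0', len - pos);
        size_t elen;

        if (!end)
            break;              /* unterminated trailing entry */
        elen = (size_t)(end - entry);
        if (elen > klen && entry[klen] == '=' &&
            strncmp(entry, key, klen) == 0)
            return entry + klen + 1;
        pos += elen + 1;
    }
    return NULL;
}

/* Returns 1 if the event is a "change" event carrying one of the
 * (assumed) unit attention values after which READ CAPACITY should
 * be retried, 0 otherwise. */
int uevent_wants_capacity_reread(const char *buf, size_t len)
{
    const char *action = uevent_get(buf, len, "ACTION");
    const char *ua = uevent_get(buf, len, "SDEV_UA");

    if (!action || !ua || strcmp(action, "change") != 0)
        return 0;
    return strcmp(ua, "CAPACITY_DATA_HAS_CHANGED") == 0 ||
           strcmp(ua, "ASYMMETRIC_ACCESS_STATE_CHANGED") == 0;
}
```

Whether the target actually raises these unit attentions on a
switch-over is a separate question; if it does not, the event never
arrives and the path checker would still have to drive the re-read.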

Hmm.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)