[dm-devel] [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign

Thu Jan 14 07:25:52 UTC 2016

On 01/13/2016 06:52 PM, Benjamin Marzinski wrote:
> On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to present my ideas for a multipath
>> redesign.
>>
>> The overall idea is to break up the centralized multipath handling in
>> device-mapper (and multipath-tools) and delegate to the appropriate
>> sub-systems.
>>
>> Individually the plan is:
>> a) use the 'wwid' sysfs attribute to detect multipath devices;
>>     this removes the need for the current 'path_id' functionality
>>     in multipath-tools
>
> If all the devices that we support advertise their WWID through sysfs,
> I'm all for this. Not needing to worry about callouts or udev sounds
> great.
>
As of now, multipath-tools pretty much requires VPD page 0x83 to be 
implemented. So that's not a big issue. Plus I would leave the old 
infrastructure in place, as there are vendors which do provide their 
own path_id mechanism.
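
For illustration only, point (a) could look roughly like the sketch below.
It assumes each SCSI block device exposes its identifier as device/wwid
under /sys/block; the function names are made up for this example and are
not part of multipath-tools.

```python
import os
from collections import defaultdict


def group_paths_by_wwid(sysfs_block="/sys/block"):
    """Map each WWID to the sorted list of block devices reporting it."""
    groups = defaultdict(list)
    for dev in sorted(os.listdir(sysfs_block)):
        wwid_file = os.path.join(sysfs_block, dev, "device", "wwid")
        try:
            with open(wwid_file) as f:
                wwid = f.read().strip()
        except OSError:
            # Device has no wwid attribute; skip it.
            continue
        if wwid:
            groups[wwid].append(dev)
    return dict(groups)


def multipath_candidates(groups):
    """Keep only the WWIDs that were seen on more than one path."""
    return {wwid: devs for wwid, devs in groups.items() if len(devs) > 1}
```

Calling multipath_candidates(group_paths_by_wwid()) on a live system would
list the WWIDs reachable via more than one device node. Devices without the
attribute are simply ignored here, which is where the old path_id
infrastructure would still be needed.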

>> b) leverage topology information from scsi_dh_alua (which we will
>>     have once my ALUA handler update is in) to detect the multipath
>>     topology. This removes the need for a 'prio' infrastructure
>>     in multipath-tools
>
> What about devices that don't use alua? Or users who want to be able to
> pick a specific path to prefer? While I definitely prefer simple, we
> can't drop real functionality to get there. Have you posted your
> scsi_dh_alua update somewhere?
>
Yep. Check the linux-scsi mailing list.

> I've recently had requests from users to
> 1. make a path with the TPGS pref bit set be in its own path group with
> the highest priority
Isn't that always the case?
Paths with TPGS pref bit set will have a different priority than 
those without the pref bit, and they should always have the highest 
priority.
I would rather consider this an error in the prioritizer ...
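
A minimal sketch of the expected behaviour, for illustration: paths are
grouped by priority, and a set pref bit always lifts a path above every path
without it. The numeric values and names below are invented for this example
and are not taken from the multipath-tools prioritizer.

```python
from itertools import groupby

# Hypothetical priority values, chosen only so that the pref bonus
# outweighs any difference between access states.
STATE_PRIO = {"optimized": 50, "non-optimized": 10, "standby": 1}
PREF_BONUS = 80


def path_priority(state, pref):
    """Priority of one path from its ALUA access state and pref bit."""
    return STATE_PRIO.get(state, 0) + (PREF_BONUS if pref else 0)


def group_by_prio(paths):
    """Group (name, state, pref) tuples into path groups, highest first.

    Returns a list of (priority, [path names]) pairs.
    """
    def prio(path):
        return path_priority(path[1], path[2])

    # Sort by descending priority, then by name for a stable result.
    ranked = sorted(paths, key=lambda p: (-prio(p), p[0]))
    return [(value, [p[0] for p in members])
            for value, members in groupby(ranked, key=prio)]
```

With one optimized path that has the pref bit set and two that do not, the
preferred path ends up alone in the first group, which is the outcome
requested in item 1.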

> 2. make the weighted prioritizer use persistent information to make its
> choice, so it's actually useful. This is to deal with the need to prefer a
> specific path in a non-alua setup.
>
Yeah, I had a similar request. And we should distinguish between the
individual transports, as paths might be coming in via different
protocols/transports.

> Some of the complexity with priorities is there out of necessity.
>
Agree.

>> c) implement block or scsi events whenever a remote port becomes
>>     unavailable. This removes the need for the 'path_checker'
>>     functionality in multipath-tools.
>
> I'm not convinced that we will be able to find out when paths come back
> online in all cases without some sort of actual polling. Again, I'd love
> this to be simpler, but asking all the types of storage we plan to
> support to notify us when they are up and down may not be realistic.
>
Currently we have three main transports: FC, iSCSI, and SAS.
FC has reliable path events via RSCN, as this is also what the 
drivers rely on internally (hello, zfcp :-)
If _that_ doesn't work we're in a deep hole anyway, cf. the
eh_deadline mechanism we had to implement.
iSCSI has the NOP mechanism, which in effect is polling on the iSCSI 
level. That would provide equivalent information; unfortunately not 
every target supports that.
But even without NOP, iSCSI has its own error recovery logic, which will
kick in whenever an error is detected. So we might as well hook into
that and use it to send events.
And for SAS we have far better control over the attached fabric,
so it should be possible to get reliable events there, too.

That only leaves the non-transport drivers like virtio or the 
various RAID-like cards, which indeed might not be able to provide 
us with events.

So I would propose to make that optional; if events are supported
(which could be figured out via sysfs) we should use them and not
insist on polling, but fall back to the original methods if we
don't have them.
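
The fallback could be sketched as below. This is only an illustration: the
attribute name "path_events" is hypothetical (no such sysfs file exists
today), as are the function names.

```python
import os

# Hypothetical sysfs attribute a driver would set to "1" if it can
# deliver path-up/path-down events. Not an existing kernel interface.
EVENT_ATTR = "path_events"


def monitoring_mode(dev, sysfs_block="/sys/block"):
    """Return 'events' if the device advertises events, else 'polling'."""
    attr = os.path.join(sysfs_block, dev, "device", EVENT_ATTR)
    try:
        with open(attr) as f:
            supported = f.read().strip() == "1"
    except OSError:
        # Attribute missing: the driver cannot send events.
        supported = False
    return "events" if supported else "polling"


def split_by_mode(devs, sysfs_block="/sys/block"):
    """Partition device names into event-driven and polled sets."""
    modes = {"events": [], "polling": []}
    for dev in devs:
        modes[monitoring_mode(dev, sysfs_block)].append(dev)
    return modes
```

Only the devices in the "polling" list would still need a path checker;
everything else would be handled purely by events.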

>> d) leverage these events to handle path-up/path-down events
>>     in-kernel
>
> If polling is necessary, I'd rather it be done in userspace. Personally,
> I think the checker code is probably the least objectionable part of the
> multipath-tools (It's getting all the device information to set up the
> devices in the first place and coordinating with uevents that's really
> ugly, IMHO).
>
And this is where I do disagree.
The checker code is causing massive lock congestion on large-scale
systems as there is precisely _one_ checker thread, having to check
all devices serially. If paths go down on a large system we get a
flood of udev events, which we cannot handle in time because the
checkerloop holds the lock while trying to check all those paths.

So being able to do away with the checkerloop is a major improvement 
there.
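
To put rough numbers on the serial checker argument, here is a
back-of-the-envelope sketch. It assumes every failed path blocks the single
checker for a full timeout; the per-check cost and the timeout value are
assumptions picked for the example, not measurements.

```python
def serial_sweep_seconds(healthy, failed, ok_cost=0.001, timeout=30.0):
    """Worst-case duration of one serial pass over all paths.

    healthy:  number of paths that answer promptly
    failed:   number of paths whose check runs into the timeout
    ok_cost:  assumed seconds per successful check
    timeout:  assumed seconds until a check on a dead path gives up
    """
    return healthy * ok_cost + failed * timeout
```

Under these assumptions, serial_sweep_seconds(1000, 1000) is a little over
30000 seconds for a single pass, during which pending udev events would
have to wait for the lock.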

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)