[dm-devel] LSF: Multipathing and path checking question

Hannes Reinecke hare at suse.de
Mon Apr 20 07:59:33 UTC 2009


Hi Mike,

Mike Christie wrote:
> Hannes Reinecke wrote:
>>
>> FC Transport already maintains an attribute for the path state, and even
>> sends netlink events if and when this attribute changes. For iSCSI I have
> 
> Are you referring to fc_host_post_event? Is this the same thing we talked
> about last year, where you wanted events? Is this in multipath tools now
> or just in the SLES ones?
> 
Yep, that's the thing.
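
For reference, userspace can subscribe to these events on the
NETLINK_SCSITRANSPORT socket. A rough sketch (constant values copied from
the scsi_netlink headers for illustration only; error handling trimmed):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef NETLINK_SCSITRANSPORT
#define NETLINK_SCSITRANSPORT	18		/* from <linux/netlink.h> */
#endif
#ifndef SCSI_NL_GRP_FC_EVENTS
#define SCSI_NL_GRP_FC_EVENTS	(1 << 2)	/* from <scsi/scsi_netlink.h> */
#endif

int main(void)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = SCSI_NL_GRP_FC_EVENTS,
	};
	char buf[4096];
	int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_SCSITRANSPORT);

	if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return 1;
	for (;;) {
		/* payload is a struct fc_nl_event carrying host_no and
		 * event_code (FCH_EVT_LINKUP, FCH_EVT_LINKDOWN, ...) */
		ssize_t len = recv(fd, buf, sizeof(buf), 0);

		if (len <= 0)
			break;
		printf("FC transport event, %zd bytes\n", len);
	}
	close(fd);
	return 0;
}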

> For something like FCH_EVT_LINKDOWN, are you going to fail the path at
> that time or when would the multipath path be marked failed?
> 
This is just a notification that the path has gone down. Fast fail /
dev_loss_tmo still apply, ie the path won't get switched at that point.

> 
> 
>> to defer to your superior knowledge; of course it would be easiest if
>> iSCSI could send out the very same message FC does.
> 
> We can do something like fc_host_event_code for iscsi.
> 
Oh, that'll be grand.

> Question on what you are needing:
> 
> Do you mean you want to make fc_host_event_code more generic (there are
> some FC specific ones like lip_reset)? Put them in scsi-ml and send from
> a new netlink group that just sends these events?
> 
> Or do you just want something similar from iscsi? iscsi will hook into
> the iscsi netlink code using the scsi_netlink.c and then send a
> ISCSIH_EVT_LINKUP, ISCSIH_EVT_LINKDOWN, etc.
> 
Well, actually, I don't mind either way. It's just that if we were to go with
the proposal we'd have to fix up all transports to present the path state
to userspace; preferably with both netlink events and sysfs attributes.

The actual implementation might well be transport-specific.
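
For FC the rport state is already visible in sysfs, so a checker could poll
it without issuing any I/O. A sketch; the rport instance name is made up
here and would have to be resolved from the sdev's sysfs ancestry in real
code:

#include <stdio.h>
#include <string.h>

/* Read e.g. /sys/class/fc_remote_ports/rport-0:0-1/port_state,
 * which reports "Online", "Blocked", "Not Present", ... */
static int fc_rport_state(const char *rport, char *state, int len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/fc_remote_ports/%s/port_state", rport);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(state, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	state[strcspn(state, "\n")] = '\0';
	return 0;
}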

> What do the FCH_EVT_PORT_* ones mean?
> 
FC stuff, methinks. James S. would know better.

> 
> 
>>
>> The idea was to modify the state machine so that fast_io_fail_tmo is
>> made mandatory, which transitions the sdev into an intermediate
>> state 'DISABLED' and sends out a netlink message.
> 
> 
> Above when you said, "No, I already do this for FC (should be checking
> the replacement_timeout, too ...)", did you mean that you have multipath
> tools always setting fast io fail now?
> 
Yes, quite so. Look at
git://git.kernel.org/pub/scm/linux/kernel/git/hare/multipath-tools
branch sles11
for details.

> For iscsi the replacement_timeout is always set already. If you are
> going to add some code to the multipath tools so multipath sets this, I
> can make iscsi allow the replacement_timeout to be set from sysfs like
> is done for FC's fast io fail.
> 
Oh, that would be awesome. Currently I think we have a mismatch / race
condition between iSCSI and multipathing, where ERL in iSCSI actually
counteracts multipathing. But I'll be investigating that one shortly.
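
Once both knobs are writable, setting them from the multipath tools boils
down to a plain sysfs write. A sketch; the FC attribute exists today, while
the iSCSI name 'recovery_tmo' is just my assumption of how
replacement_timeout would be exposed:

#include <stdio.h>

static int sysfs_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%s\n", val) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

/* e.g.:
 * sysfs_write("/sys/class/fc_remote_ports/rport-0:0-1/fast_io_fail_tmo", "5");
 * sysfs_write("/sys/class/iscsi_session/session1/recovery_tmo", "5");
 */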

> 
> 
>>
>> sdev state:   RUNNING <-> BLOCKED <-> DISABLED -> CANCEL
>> mpath state:  path up <-> <stall> <-> path down -> remove from map
>>
>> This will allow us to switch paths early, ie when the path moves into
>> the 'DISABLED' state. But the path structures themselves are still alive,
>> so when a path comes back between 'DISABLED' and 'CANCEL' we won't
>> have an issue reconnecting it. And we could even allow dev_loss_tmo to
>> be set to infinity, thereby simulating the 'old' behaviour.
>>
>> However, this proposal didn't go through.
> 
> You got my hopes up for a solution in the long explanation, then you
> destroyed them :)
> 
Yes, same here. I really thought this to be a sensible proposal, but
then the discussion veered off into queue_if_no_path handling.
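
For the record, the mapping implied by the proposal, written out as code.
The 'DISABLED' state and all names here are from my proposal above; none
of this is in mainline:

/* Proposed sdev states and the multipath action on each transition. */
enum sdev_path_state {
	SDEV_PATH_RUNNING,	/* path up: I/O flows normally */
	SDEV_PATH_BLOCKED,	/* stall: I/O held until fast_io_fail_tmo */
	SDEV_PATH_DISABLED,	/* proposed: fail I/O, keep path structures */
	SDEV_PATH_CANCEL,	/* dev_loss_tmo fired: tear down */
};

static const char *mpath_action(enum sdev_path_state s)
{
	switch (s) {
	case SDEV_PATH_RUNNING:  return "path up";
	case SDEV_PATH_BLOCKED:  return "stall I/O";
	case SDEV_PATH_DISABLED: return "fail path, switch early";
	case SDEV_PATH_CANCEL:   return "remove path from map";
	}
	return "unknown";
}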

> 
> Was the reason people did not like this because of the scsi device
> lifetime issue?
> 
> 
> I think we still want someone to set the fast io fail tmo for users when
> multipath is being used, because we want IO out of the queues and
> drivers and sent to the multipath layer before dev_loss_tmo fires if
> dev_loss_tmo is still going to be a lot longer. fast io fail tmo is
> usually less than 5 or 10 seconds, while for dev_loss_tmo it seems we
> still have users setting that to minutes.
> 
Exactly. The point here is that with the current implementation we basically
_cannot_ return 'path down' anymore, as the path is either blocked (during
which time all I/O is stalled) or failed completely (ie in state 'CANCEL').
That is a bit of a detriment, and we actually run into quite some contention
when the path is removed, as we have to kill all I/O, fail over paths, remove
stale paths, update device-mapper tables, etc.

If we decoupled this by having the midlayer always return 'DID_TRANSPORT_DISRUPTED'
once fast_io_fail_tmo fires, we would be able to kill all I/O and switch paths
gracefully. Path removal and the device-mapper table update would then be done
later on, when dev_loss_tmo triggers.
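
In midlayer terms the change would roughly amount to the following
completion logic. This is a sketch of the _proposed_ behaviour, not of what
the kernel does today (which returns DID_TRANSPORT_FAILFAST after fast io
fail):

enum proposed_disposition {
	IO_STALL,	/* sdev BLOCKED: hold the request */
	IO_DISRUPTED,	/* fast_io_fail_tmo fired: retryable path error */
	IO_DEAD,	/* dev_loss_tmo fired: hard failure */
};

static enum proposed_disposition
complete_on_dead_path(int fast_io_fail_fired, int dev_loss_fired)
{
	if (dev_loss_fired)
		return IO_DEAD;		/* remove path, update dm tables */
	if (fast_io_fail_fired)
		return IO_DISRUPTED;	/* DID_TRANSPORT_DISRUPTED:
					 * multipath switches paths */
	return IO_STALL;		/* still inside the blocked window */
}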


> 
> Can't the transport layers just send two events?
> 1. On the initial link down when the port/session is blocked.
> 2. When the fast io fail tmos fire.
> 
Yes, that would be a good start.

> Today, instead of #2, the Red Hat multipath tools guy and I were talking
> about doing a probe with SG_IO. For example we would send down a path
> tester IO and then wait for it to be failed with DID_TRANSPORT_FAILFAST.
> 
No, this is exactly what you cannot do. SG_IO will be stalled while the
sdev is BLOCKED and will only return a result _after_ the sdev transitions
_out_ of the BLOCKED state.
Translated to FC this means that whenever dev_loss_tmo is _active_ (!),
no I/O will be sent out, nor will any I/O result be returned to userland.
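
To make that concrete: a TUR-style checker like the sketch below is exactly
what stalls. The SG_IO timeout is a command timeout; while the sdev is
BLOCKED the command is never even dispatched, so the ioctl simply hangs
until the sdev leaves the BLOCKED state:

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

static int tur_check(int fd)
{
	unsigned char cdb[6] = { 0x00, 0, 0, 0, 0, 0 };	/* TEST UNIT READY */
	unsigned char sense[32];
	struct sg_io_hdr hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxfer_direction = SG_DXFER_NONE;
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 2000;	/* ms; does not help while BLOCKED */

	if (ioctl(fd, SG_IO, &hdr) < 0)
		return -1;
	/* host_status would show DID_TRANSPORT_FAILFAST once
	 * fast_io_fail_tmo has fired */
	return hdr.host_status ? -1 : 0;
}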

Hence using SG_IO as a path checker is a bad idea here; hence my proposal.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare at suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)



