[dm-devel] RE: [PATCH 0/4] scsi_dh: Make scsi_dh_activate asynchronous

Wed Oct 7 23:08:40 UTC 2009

Hi Hannes,
I have tested the patch you sent. Failover works fine.

But we are seeing problems during failback: it causes continuous mode-select thrashing (ping-pong).

The reason is that the handler does not know whether a mode select is issued for failover or for failback. Every mode select moves all the LUNs, regardless of whether they are on their preferred path or not. At the next polling interval, multipathd finds that some LUNs are not on their preferred path and initiates another failback. The result is a continuous ping-pong. I have explained this in EXAMPLE 1 below.

For failback to work properly, we need selective LUN-level failover.

There is also a cluster scenario in which controller-level failover can lead to thrashing. Please see EXAMPLE 2 below.

We have been testing LUN-level failover with device mapper for a while now. It is working well for us; the only problem is slow failover with big configurations (failover was taking about 12 minutes with 234 LUNs). LSI and IBM (Chandra) have been working on asynchronous behavior for the past 3-4 months. I have tested all the patches Chandra has posted and we have seen very good results (failover takes only 1 minute with 234 LUNs).

Also, these patches give other handlers the opportunity to move to asynchronous behavior if they wish to. We need your help (and that of the Linux community) to review the patches and move forward on this issue.

Thanks
Babu Moger
LSI Corporation

The following are two examples where we could see mode-select thrashing.

EXAMPLE 1: (mode select thrashing with 2 LUNs on a single host)
===============================================================
Let's take a very simple example.

I have 2 LUNs on my host. The host sees both controllers, with one path to each controller.

LUN 0 is owned by controller A and its preferred owner is A.
LUN 1 is owned by controller B and its preferred owner is B.

Here is the multipath -ll output:

mpath237 (3600a0b80000f519c0000cc8a48fc7d0b) dm-4 LSI,INF-01-00
[size=2.0G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]   
 \_ 1:0:0:0 sde 8:64  [active][ready] (controller A)
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:0 sdi 8:128 [active][ghost] (controller B)

mpath180 (3600a0b80000f519c0000cc9048fc7d7b) dm-5 LSI,INF-01-00
[size=2.0G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=100][active]   
 \_ 2:0:0:1 sdj 8:144 [active][ready] (controller B)
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdf 8:80  [active][ghost] (controller A)

1. Run I/O on both LUNs.
2. Pull the cable connected to controller A.
3. Failover happens and LUN 0 moves to controller B. Now both LUNs are on controller B.
4. Reconnect the cable to controller A.
5. The multipath tool detects the physical LUNs on controller A and runs the priority test.
6. It finds that LUN 0 is not on its preferred path and initiates a failback.
   Because this is a controller-level failover, it also moves LUN 1 to controller A.
   Now both LUNs are on controller A.
7. The multipath tool comes around again, finds that LUN 1 is not on its preferred path, and initiates a failback.
   This moves both LUNs to controller B.
   This continues forever.
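
The steps above can be sketched as a toy model. This is only an illustration of the ping-pong, not scsi_dh_rdac or multipathd code; the names PREFERRED, failback_pass and count_mode_selects are made up for this sketch, and it assumes one failback request per polling pass.

```python
# Toy model of EXAMPLE 1; not scsi_dh_rdac or multipathd code.
PREFERRED = {0: "A", 1: "B"}  # preferred owner per LUN, as in the example


def failback_pass(owner, controller_level):
    """One polling pass: fail back the first LUN found off its preferred path.

    Returns True if a mode select was issued, False if every LUN is
    already on its preferred controller.
    """
    for lun in sorted(owner):
        target = PREFERRED[lun]
        if owner[lun] != target:
            if controller_level:
                # Controller-level mode select: every LUN follows.
                for other in owner:
                    owner[other] = target
            else:
                # LUN-level mode select: only the requested LUN moves.
                owner[lun] = target
            return True
    return False


def count_mode_selects(controller_level, max_polls=10):
    # State after step 3: both LUNs ended up on controller B.
    owner = {0: "B", 1: "B"}
    issued = 0
    for _ in range(max_polls):
        if not failback_pass(owner, controller_level):
            break
        issued += 1
    return issued, owner


if __name__ == "__main__":
    print("controller-level:", count_mode_selects(True))
    print("LUN-level:       ", count_mode_selects(False))
```

In this model the controller-level variant issues a mode select on every one of the 10 polling passes and never settles, while the LUN-level variant issues a single mode select and then finds both LUNs on their preferred controllers.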

EXAMPLE 2: (mode select thrashing in a cluster setup)
=====================================================
Let's take a two-node cluster environment where LUNs are visible across multiple nodes, although any given LUN would only be accessible via one node at a time.  If a cluster configuration were to get into a state where one node only has visibility to one controller while the other node only has visibility to the alternate, a “thrashing” condition could happen.  Take this example:

•	32 LUNs have been mapped from the storage to all nodes.
•	LUNs 0-15 are owned by the ‘A’ controller and are being accessed by node #1; LUNs 16-31 are owned by ‘B’ and mapped to node #2.
•	Node #1 only has access to the ‘A’ controller; node #2 only has access to the ‘B’ controller.

Let’s say node #1 decides to access LUN 16.  Because it does not have visibility to the ‘B’ controller, it must issue a volume transfer request.  With the controller-level failover solution, the volume transfer request would also move LUNs 17-31.  If node #2 were accessing those LUNs, it would receive ownership errors, causing a volume transfer request to move them back.  However, this also moves LUN 16 from ‘A’ back to ‘B’, causing node #1 to issue the volume transfer request again, and so on.
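
The cluster scenario can be sketched the same way. Again this is only a toy model under stated assumptions, not handler code: VISIBLE, access and run are hypothetical names, each node is assumed to see exactly one controller, and a controller-level transfer is assumed to move every LUN to the requesting node's controller.

```python
# Toy model of EXAMPLE 2; not cluster or handler code.
VISIBLE = {"node1": "A", "node2": "B"}  # the only controller each node sees


def access(owner, node, lun, controller_level):
    """A node touches a LUN; returns 1 if a volume transfer was needed."""
    ctlr = VISIBLE[node]
    if owner[lun] == ctlr:
        return 0
    if controller_level:
        # Controller-level transfer: all LUNs move to this node's controller.
        for other in owner:
            owner[other] = ctlr
    else:
        # LUN-level transfer: only the accessed LUN moves.
        owner[lun] = ctlr
    return 1


def run(controller_level, rounds=5):
    # LUNs 0-15 start on controller A, LUNs 16-31 on controller B.
    owner = {lun: ("A" if lun < 16 else "B") for lun in range(32)}
    transfers = 0
    for _ in range(rounds):
        transfers += access(owner, "node1", 16, controller_level)
        transfers += access(owner, "node2", 17, controller_level)
    return transfers


if __name__ == "__main__":
    print("controller-level transfers:", run(True))
    print("LUN-level transfers:       ", run(False))
```

With controller-level transfers every access in the loop triggers another transfer (10 in 5 rounds), whereas with LUN-level transfers only the first access to LUN 16 does.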

> -----Original Message-----
> From: dm-devel-bounces at redhat.com [mailto:dm-devel-bounces at redhat.com]
> On Behalf Of Moger, Babu
> Sent: Tuesday, October 06, 2009 2:46 PM
> To: Hannes Reinecke; sekharan at linux.vnet.ibm.com
> Cc: michaelc at cs.wisc.edu; Stankey, Robert; linux-scsi at vger.kernel.org;
> Dachepalli, Sudhir; dm-devel at redhat.com; Chauhan, Vijay;
> Benoit_Arthur at emc.com; Qi, Yanling; Eddie.Williams at steeleye.com
> Subject: [dm-devel] RE: [PATCH 0/4] scsi_dh: Make scsi_dh_activate
> asynchronous
> 
> Thanks Hannes and Chandra for your patches and feedback.  Right now we
> are internally discussing the pros and cons of both these approaches.
> We will respond to all your questions once we arrive at a conclusion.
> 
> Also adding our Team (Bob Stankey, Sudhir, Yanling, Vijay ) in the
> response.
> 
> Thanks
> Babu Moger
> 
> > -----Original Message-----
> > From: Hannes Reinecke [mailto:hare at suse.de]
> > Sent: Tuesday, October 06, 2009 3:08 AM
> > To: sekharan at linux.vnet.ibm.com
> > Cc: linux-scsi at vger.kernel.org; dm-devel at redhat.com; Moger, Babu;
> > michaelc at cs.wisc.edu; Benoit_Arthur at emc.com;
> > Eddie.Williams at steeleye.com; berthiaume_wayne at emc.com
> > Subject: Re: [PATCH 0/4] scsi_dh: Make scsi_dh_activate asynchronous
> >
> > Chandra Seetharaman wrote:
> > > On Mon, 2009-10-05 at 15:01 +0200, Hannes Reinecke wrote:
> > >
> > > Thanks for the comment and the patch. I will try the patch.
> > >
> > >> Hmm. IIRC we added the synchronous mode by request of LSI/IBM, as
> > >> the RDAC handler could only support one MODE SELECT command at a
> > >> time. If LSI checked this, okay, no objections.
> > >
> > > The original patch (for rdac) came from LSI (Moger Babu), I made
> > > changes to the infrastructure and to the code to fit them together.
> > >
> > >> However: The main reason why we're getting flooded with MODE SELECT
> > >> commands is that the RDAC handler switches _each LUN_, not the
> > >> entire controller. Seeing that the controller simply cannot cope
> > >> with the resulting MODE SELECT flood wouldn't it be more sensible
> > >> to switch the entire controller here?
> > >
> > > I see your point of view.... but here is the rationale...
> > >
> > > When we originally added the rdac support (as dm_rdac), we decided
> > > this way consciously for the following reasons:
> > >  1. we do not know which link is broken (hba-ctlr or ctlr-lun).
> > >     The way it is currently implemented, both these cases are
> > >     handled.
> > Quite. But if the ctlr-lun link is broken, we really should receive
> > an appropriate error code, which then could be handled in dm-rdac
> > appropriately. After all, the controller is still accessible and
> > so we don't have to guess what happened (which is all what multipath
> > is about). So I doubt we need to worry about this.
> >
> > >  2. multipath layer to decide what to do and this module to just do
> > >     what it was told.
> > Yep.
> >
> > >  3. since multipath sends pg_init only if there is any IO sent to
> > >     the lun, (with the current implementation) we don't have to
> > >     change ownership (back and forth) of all the luns if the user
> > >     is using only a handful.
> > Well, yes. But if the implementation is such that changing ownership
> > for all LUNs is about as efficient as changing an individual LUN (or,
> > as the case might be, as _inefficient_ :-), surely it's better to
> > save us some coding here.
> >
> > >  4. to be consistent with LSI's original driver (which does one lun
> > >     at a time).
> > :-/ That's unfair.
> > But it still has the drawback that it doesn't scale; given enough
> > LUNs and access patterns you _inevitably_ have to send MODE SELECT
> > commands for each LUN, ie you only delay this issue.
> > Only by switching all LUNs you can avoid this.
> >
> > Cheers,
> >
> > Hannes
> > --
> > Dr. Hannes Reinecke		      zSeries & Storage
> > hare at suse.de			      +49 911 74053 688
> > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> > GF: Markus Rex, HRB 16746 (AG Nürnberg)