[dm-devel] [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers

Moger, Babu Babu.Moger at lsi.com
Fri Jul 30 18:12:39 UTC 2010


See my comments below..

> -----Original Message-----
> From: Chandra Seetharaman [mailto:sekharan at us.ibm.com]
> Sent: Thursday, July 29, 2010 7:05 PM
> To: Moger, Babu
> Cc: device-mapper development; linux-scsi at vger.kernel.org;
> Shyam_Iyer at Dell.com; Qi, Yanling; Chauhan, Vijay; Dachepalli, Sudhir;
> Stankey, Robert
> Subject: RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> handlers
> 
> See comment below
> 
> On Thu, 2010-07-29 at 16:35 -0600, Moger, Babu wrote:
> > Chandra, Shyam,
> >  Thanks for your comments.. Please see my response.
> >
> > > -----Original Message-----
> > > From: Chandra Seetharaman [mailto:sekharan at us.ibm.com]
> > > Sent: Thursday, July 29, 2010 4:54 PM
> > > To: Moger, Babu
> > > Cc: device-mapper development; linux-scsi at vger.kernel.org; Qi,
> Yanling;
> > > Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> > > Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> > > handlers
> > >
> > > Babu,
> > >
> > > Your main object is to protect scsi_dh_data across
> scsi_dh_activate()
> > > by
> > > way of getting kref around scsi_dh_activate(), right ?
> > >
> >
> >   Yes, That is correct.
> >
> > > Wouldn't doing what Shyam suggested (doing kref_put() and
> put_device())
> > > in scsi_activate() make it simpler and code still be readable ? (it
> > > would make all the patches except 2/6 not needed).
> > >
> > > Did you hit with any problems doing it that way ?
> > >
> >
> >   Yes, We can do that.  Problem is I am hitting the issue with BUG_ON
> >  in get_rdac_data which is there in the beginning of rdac_activate.
> 
> I do not understand the problem.
> 
> If the BUG_ON on get_rdac_data() is being triggered, that means
> sdev->scsi_dh_data is NULL, if that is the case, how can you access
> sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> it trip a oops ?

Test case is deleting both active and passive paths almost together during the multipath testing.  Looks like DM picked up the active path failure first. Then failed the active path and started scheduling activate_path to failover to passive path. Passive path is also about to go down pretty soon. When the control was in scsi_dh_activate, Looks like scsi_dh_data was still valid because I did not see panic here.  But scsi_dh_data became NULL when control went to rdac_activate. That is when I hit the bug on. 

kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
RIP: 0010:  rdac_activate+0x257/0x387 [scsi_dh_rdac]

My understanding is someone triggered scsi_dh->detach for passive path during this small window.  Only way I could see problem go away is holding reference counts between these calls.  Did i miss anything here? See the code snippet below.. 

int scsi_dh_activate(struct request_queue *q, activate_complete fn, void *data)
 {
         int err = 0;
         unsigned long flags;
         struct scsi_device *sdev;
         struct scsi_device_handler *scsi_dh = NULL;
 
         spin_lock_irqsave(q->queue_lock, flags);
         sdev = q->queuedata;
         if (sdev && sdev->scsi_dh_data)
                 scsi_dh = sdev->scsi_dh_data->scsi_dh;
         if (!scsi_dh || !get_device(&sdev->sdev_gendev))
                 err = SCSI_DH_NOSYS;
         spin_unlock_irqrestore(q->queue_lock, flags);
 
         if (err)
                 return err;
 
         if (scsi_dh->activate)
                 err = scsi_dh->activate(sdev, fn, data);
         put_device(&sdev->sdev_gendev);
         return err;
 }
 EXPORT_SYMBOL_GPL(scsi_dh_activate);
 

static int rdac_activate(struct scsi_device *sdev,
                         activate_complete fn, void *data)
 {
         struct rdac_dh_data *h = get_rdac_data(sdev); *******  I hit BUG_ON here *********
         int err = SCSI_DH_OK;
 
         err = check_ownership(sdev, h);
         if (err != SCSI_DH_OK)
                 goto done;
 
         if (h->lun_state == RDAC_LUN_UNOWNED) {
                 err = queue_mode_select(sdev, fn, data);
                 if (err == SCSI_DH_OK)
                         return 0;
         }
 done:
         if (fn)
                 fn(data, err);
         return 0;
 }


> 
> 
> >   If I have to go this way, then I need to remove a call
> get_rdac_data
> >  and just validate pointers. Report error(SCSI_DH_IO) if pointer is
> not valid.
> >   Then hold the reference counts and continue if everything is
> alright.
> >   I will send the new patches as soon as I can.
> 





More information about the dm-devel mailing list