[dm-devel] [PATCH 0/6] scsi_dh : Couple of fixes for scsi device handlers

Fri Jul 30 23:56:28 UTC 2010

On Fri, 2010-07-30 at 12:12 -0600, Moger, Babu wrote:

> > >
> > >   Yes, We can do that.  Problem is I am hitting the issue with BUG_ON
> > >  in get_rdac_data which is there in the beginning of rdac_activate.
> > 
> > I do not understand the problem.
> > 
> > If the BUG_ON on get_rdac_data() is being triggered, that means
> > sdev->scsi_dh_data is NULL, if that is the case, how can you access
> > sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> > it trip a oops ?
> 
> Test case is deleting both active and passive paths almost together during
>  the multipath testing.  Looks like DM picked up the active path failure 
> first. Then failed the active path and started scheduling activate_path to 
> failover to passive path. Passive path is also about to go down pretty soon. 
> When the control was in scsi_dh_activate, Looks like scsi_dh_data was still 
> valid because I did not see panic here.  But scsi_dh_data became NULL when 
> control went to rdac_activate. That is when I hit the bug on. 
> 
> kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> RIP: 0010:  rdac_activate+0x257/0x387 [scsi_dh_rdac]
> 
> My understanding is someone triggered scsi_dh->detach for passive path during
>  this small window.  Only way I could see problem go away is holding reference 
> counts between these calls.  Did i miss anything here? See the code snippet below.. 

So, basically, the new patch just reduces the window of race. IOW, if it
just spins for few seconds (or preempted) just before calling the
kref_get() in the new patch, it will generate an oops as scsi_dh_data
would be NULL.

Looks like you have to create a lock to protect scsi_dh_data and then
call kref_get() with the protection of lock.

> 
> int scsi_dh_activate(struct request_queue *q, activate_complete fn, void *data)
>  {
>          int err = 0;
>          unsigned long flags;
>          struct scsi_device *sdev;
>          struct scsi_device_handler *scsi_dh = NULL;
>  
>          spin_lock_irqsave(q->queue_lock, flags);
>          sdev = q->queuedata;
>          if (sdev && sdev->scsi_dh_data)
>                  scsi_dh = sdev->scsi_dh_data->scsi_dh;
>          if (!scsi_dh || !get_device(&sdev->sdev_gendev))
>                  err = SCSI_DH_NOSYS;
>          spin_unlock_irqrestore(q->queue_lock, flags);
>  
>          if (err)
>                  return err;
>  
>          if (scsi_dh->activate)
>                  err = scsi_dh->activate(sdev, fn, data);
>          put_device(&sdev->sdev_gendev);
>          return err;
>  }
>  EXPORT_SYMBOL_GPL(scsi_dh_activate);
>  
> 
> static int rdac_activate(struct scsi_device *sdev,
>                          activate_complete fn, void *data)
>  {
>          struct rdac_dh_data *h = get_rdac_data(sdev); *******  I hit BUG_ON here *********
>          int err = SCSI_DH_OK;
>  
>          err = check_ownership(sdev, h);
>          if (err != SCSI_DH_OK)
>                  goto done;
>  
>          if (h->lun_state == RDAC_LUN_UNOWNED) {
>                  err = queue_mode_select(sdev, fn, data);
>                  if (err == SCSI_DH_OK)
>                          return 0;
>          }
>  done:
>          if (fn)
>                  fn(data, err);
>          return 0;
>  }
> 
> 
> > 
> > 
> > >   If I have to go this way, then I need to remove a call
> > get_rdac_data
> > >  and just validate pointers. Report error(SCSI_DH_IO) if pointer is
> > not valid.
> > >   Then hold the reference counts and continue if everything is
> > alright.
> > >   I will send the new patches as soon as I can.
> > 
> 
> NrybXǧv^)޺{.n+{"{ay
ʇڙ,jfhz
w
j:+vwjmzZ+ݢj"!