[dm-devel] 4.1-rc2 dm-multipath-mq kernel warning

Wed May 27 12:57:32 UTC 2015

On Thu, May 07 2015 at  6:19am -0400,
Bart Van Assche <bart.vanassche at sandisk.com> wrote:

> On 05/06/15 20:29, Mike Snitzer wrote:
> >On Wed, May 06 2015 at  3:45am -0400,
> >Bart Van Assche <bart.vanassche at sandisk.com> wrote:
> >
> >>On 05/06/15 04:23, Mike Snitzer wrote:
> >>>On Tue, May 05 2015 at 10:04am -0400,
> >>>Bart Van Assche <bart.vanassche at sandisk.com> wrote:
> >>>>While retesting my SRP initiator patches on top of kernel v4.1-rc2
> >>>>with DM_MQ_DEFAULT=y I ran into the kernel warning below. Does this
> >>>>mean that I'm missing any device mapper related patches? This
> >>>>warning was reported shortly after scsi_remove_host() had been
> >>>>invoked.
> >>>
> >>>I put the warning in place because, to me, if it triggers it speaks to
> >>>unsafe teardown occurring (request is still completing but the queue it
> >>>was issued from no longer exists).
> >>>
> >>>Like I said before I'm open to removing the WARN_ON_ONCE() if this
> >>>scenario is perfectly valid.  But I just haven't had time to revisit
> >>>what appears to be a potentially serious problem with the underlying
> >>>paths' teardown vs upper level mpath IO.
> >>>
> >>>I'll try to revisit this week.  But I welcome input from others too.
> >>>
> >>>(Just thinking about it further now, it could be that the way the clone
> >>>request is allocated in the case of blk-mq DM is as part of the original
> >>>request's pdu... meaning there isn't a proper get_request() call against
> >>>the underlying queue.. so the expected refcounting likely isn't
> >>>happening.  And given the request won't be free'd from that underlying
> >>>request_queue there really isn't a need to artificially link these
> >>>cloned requests with the underlying request_queue... so I'm now leaning
> >>>toward just removing the WARN_ON_ONCE.. but I'll look closer tomorrow)
> >>
> >>Hello Mike,
> >>
> >>With CONFIG_SCSI_MQ_DEFAULT=y and CONFIG_DM_MQ_DEFAULT=n I just ran into
> >>the bug report below. I will continue my v4.1-rc2 tests with SCSI_MQ=n.
> >
> >What were you doing when this happened?  Quite a strange place to get a
> >NULL pointer (it should be noted that for 4.2 hch's patch does away with
> >cloning the request's bios).  Is there an easy reproducer (unlikely
> >considering I've tested CONFIG_SCSI_MQ_DEFAULT=y and
> >CONFIG_DM_MQ_DEFAULT=n a fair amount)?
> >
> >BTW, my "Just thinking about it further now" above was relative to
> >CONFIG_DM_MQ_DEFAULT=y and CONFIG_SCSI_MQ_DEFAULT=n.
> 
> Hello Mike,
> 
> With kernel v4.1-rc2, CONFIG_SCSI_MQ_DEFAULT=y and
> CONFIG_DM_MQ_DEFAULT=n, running "for p in
> /sys/class/srp_remote_ports/*; do echo 1 > $p/delete; done" while no
> I/O is in progress works fine. That command triggers a call to
> scsi_remove_host(). But if I run the same command while I/O is
> running, the message "BUG: unable to handle kernel NULL pointer
> dereference at 0000000000000068 / IP: blk_rq_prep_clone+0x87/0x160"
> appears. I just reproduced this after having rebuilt the kernel
> after a "make clean".

Hey Bart,

It looks like Junichi has likely fixed the issue you reported. Please try
this patch: https://patchwork.kernel.org/patch/6487321/

Thanks,
Mike