[dm-devel] block_abort_queue (blk_abort_request) racing with scsi_request_fn
Mike Christie
michaelc at cs.wisc.edu
Wed Nov 10 07:09:24 UTC 2010
On 05/12/2010 12:23 AM, Mike Anderson wrote:
> I was looking at a dump from a weekend run and I believe I am seeing a
> case where blk_abort_request through blk_abort_queue picked up a request
> for timeout that scsi_request_fn decided not to start. This test was under
> error injection.
>
> I assume the case in scsi_request_fn this is hitting is that a request has
> been put on the timeout_list with blk_start_request and then one of the
> not_ready checks is hit and the request is decided not to be started. I
> believe the drop
>
> It appears that my usage of walking the timeout_list in block_abort_queue
> and using blk_mark_rq_complete in block_abort_request will not work in
> this case.
>
> While it would be good to have way to ensure a command is started, it is
> unclear if even at a low timeout of 1 second that a user other than
> blk_abort_queue would hit this race.
>
> The dropping / acquiring of host_lock and queue_lock in scsi_request_fn
> and scsi_dispatch_cmd make it unclear to me if usage of
> blk_mark_rq_complete will cover all cases.
>
> I looked at checking serial_number in scsi_times_out along with a couple
> blk_mark_rq_complete additions, but unclear if this would be good and / or
> work in all cases.
>
> I looked at just accelerating deadline by some default value but unclear
> if that would be acceptable.
>
> I also looked at just using just the mark interface I previously posted
> and not calling blk_abort_request at all, but that would change current
> behavior that has been in use for a while.
>
Did you ever solve this? I am hitting this with the dm-multipath
blk_abort_queue case (the email I sent you a couple weeks ago).
It seems we could fix this by just having blk_requeue_request do a check
for if the request timedout similar to what scsi used to do. A hacky way
might be to have 2 requeue functions.
blk_requeue_completed_request - This is the blk_requeue_request we have
today. It unconditionally requeues the request. It should only be used
if the command has been completed either from blk_complete_request or
from block layer timeout handling
(blk_rq_timed_out_timer->blk_rq_timed_out->rq_timed_out_fn).
blk_requeue_running_request - This checks if the timer is running before
requeueing the request. If blk_rq_timed_out_timer/blk_rq_timed_out has
taken over the request and is going to handle it then this function just
returns and does not requeue. So for this we could just have it check if
the queue has a rq_timed_out_fn and if rq->timeout_list is empty or not.
I think this might be confusing to use, so I tried something slightly
different below.
I also tried a patch where we just add another req bit. We set it in
blk_rq_timed_out_timer and clear it in a new function that clears it
then calls blk_requeue_reqeust. The new function:
blk_requeue_timedout_request - used when request is to be requeued if a
LLD q->rq_timed_out_fn returned BLK_EH_NOT_HANDLED and has resolved the
problem and wants the request to be requeued. This function clears
REQ_ATOM_TIMEDOUT and then calls blk_requeue_request.
blk_reqeuue_request would then check if REQ_ATOM_TIMEDOUT is set and if
it was just drop the request assuming the rq_timed_out_fn was handling
it. This still requires the caller to know how the command is supposed
to be reqeueud. But I think it might be easier since the driver here has
returned BLK_EH_NOT_HANDLED in the q timeout fn so they know that they
are going to be handling the request in a special way.
I attached the last idea here. Made over Linus's tree. Only compile tested.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: blk-requeue-timedout-request.patch
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20101110/2ecaf507/attachment.ksh>
More information about the dm-devel
mailing list