[dm-devel] [PATCH 8/9] dm: Fix two race conditions related to stopping and starting queues

Thu Sep 1 20:15:17 UTC 2016

On 09/01/2016 12:05 PM, Mike Snitzer wrote:
> On Thu, Sep 01 2016 at  1:59pm -0400,
> Bart Van Assche <bart.vanassche at sandisk.com> wrote:
>> On 09/01/2016 09:12 AM, Mike Snitzer wrote:
>>> Please see/test the dm-4.8 and dm-4.9 branches (dm-4.9 being rebased
>>> on top of dm-4.8):
>>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8
>>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.9
>>
>> Hello Mike,
>>
>> The result of my tests of the dm-4.9 branch is as follows:
>> * With patch "dm mpath: check if path's request_queue is dying in
>> activate_path()" I still see every now and then that CPU usage of
>> one of the kworker threads jumps to 100%.
>
> So you're saying that the dying queue check is still needed in the path
> selector?  Would be useful to know why the 100% is occurring.  Can you
> get a stack trace during this time?

Hello Mike,

A few days ago I had already tried to obtain a stack trace with perf but 
the information reported by perf wasn't entirely accurate. What I know 
about that 100% CPU usage is as follows:
* "dmsetup table" showed three SRP SCSI device nodes but these SRP SCSI
   device nodes were not visible in /sys/block. This means that
   scsi_remove_host() had already removed these from sysfs.
* hctx->run_work kept being requeued over and over again on the kernel
   thread with name "kworker/3:1H". I assume this means that
   blk_mq_run_hw_queue() was called with the second argument (async) set
   to true. This probably means that the following dm-rq code was
   triggered:

	if (map_request(tio, rq, md) == DM_MAPIO_REQUEUE) {
		/* Undo dm_start_request() before requeuing */
		rq_end_stats(md, rq);
		rq_completed(md, rq_data_dir(rq), false);
		return BLK_MQ_RQ_QUEUE_BUSY;
	}
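
To illustrate the suspected loop, here is a minimal userspace sketch. It is not kernel code: every sim_* name is hypothetical, and it only models the assumption stated above, namely that a path whose queue is dying makes map_request() return a requeue code forever, so the queue_rq callback keeps reporting "busy" and the hardware queue work is rerun each time. The max_runs bound exists only so that the simulation terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for DM_MAPIO_* and BLK_MQ_RQ_QUEUE_*. */
enum sim_mapio { SIM_MAPIO_REMAPPED, SIM_MAPIO_REQUEUE };
enum sim_queue_rc { SIM_QUEUE_OK, SIM_QUEUE_BUSY };

/* Hypothetical model of a path; only the "dying" state matters here. */
struct sim_path {
	bool queue_dying;
};

/* Models map_request(): a dying path never accepts the request. */
static enum sim_mapio sim_map_request(const struct sim_path *p)
{
	return p->queue_dying ? SIM_MAPIO_REQUEUE : SIM_MAPIO_REMAPPED;
}

/* Models the queue_rq callback: a requeue result is reported as busy. */
static enum sim_queue_rc sim_queue_rq(const struct sim_path *p)
{
	if (sim_map_request(p) == SIM_MAPIO_REQUEUE)
		return SIM_QUEUE_BUSY;
	return SIM_QUEUE_OK;
}

/*
 * Models the hardware queue work being rerun for as long as queue_rq
 * reports busy. Returns how many times the work ran, capped at max_runs.
 */
static unsigned int sim_run_hw_queue(const struct sim_path *p,
				     unsigned int max_runs)
{
	unsigned int runs = 0;

	assert(max_runs > 0);
	while (runs < max_runs) {
		runs++;
		if (sim_queue_rq(p) == SIM_QUEUE_OK)
			break;	/* request dispatched, no rerun needed */
	}
	return runs;
}
```

Calling sim_run_hw_queue() with queue_dying set to false returns after a single run, while a path with queue_dying set to true consumes the whole max_runs budget. In this toy model that corresponds to a kworker thread that never goes idle.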

Bart.