[dm-devel] blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]

Jens Axboe axboe at kernel.dk
Tue Jan 6 16:15:53 UTC 2015


On 01/06/2015 09:05 AM, Mike Snitzer wrote:
> On Tue, Jan 06 2015 at  4:31am -0500,
> Bart Van Assche <bvanassche at acm.org> wrote:
>
>> On 01/05/15 22:35, Mike Snitzer wrote:
>>> On Fri, Jan 02 2015 at 12:53pm -0500,
>>> Bart Van Assche <bvanassche at acm.org> wrote:
>>>> Thanks, my tests confirm that this patch indeed fixes the issue I had
>>>> reported. Unfortunately that doesn't mean the blk-mq multipath code is
>>>> already working perfectly. Most of the time I/O requests are processed
>>>> within the expected time, but sometimes I/O processing takes much longer
>>>> than expected:
>>>>
>>>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>>>> 0.02
>>>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>>>> 0.02
>>>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>>>> 8.68
>>>>
>>>> However, if I run the same command on the underlying device it always
>>>> completes within the expected time.
>>>
>>> I don't have very large blk-mq devices, but I can work on that.
>>> How large is the blk-mq device in question?
>>>
>>> Also, how much memory does the system have?  Is memory fragmented at
>>> all?  With this change the requests are cloned using memory allocated
>>> from block core's blk_get_request (rather than a dedicated mempool in DM
>>> core).
>>>
>>> Any chance you could use 'perf record' to try to analyze where the
>>> kernel is spending its time?
>>
>> Hello Mike,
>>
>> The device used in this test was a 16 MB tmpfs file, created as follows:
>> dd if=/dev/zero of=/dev/vdisk bs=1M count=16. The initiator and target
>> systems had enough memory (32 GB and 4 GB respectively) to keep this tmpfs
>> file in RAM at all times.
>>
>> For the runs that took much longer than expected, the CPU load was low.
>> This probably means the system was waiting for some I/O timer to expire.
>> The output triggered by "echo w > /proc/sysrq-trigger" during one of the
>> slow runs was as follows:
>>
>> SysRq : Show Blocked State
>>    task                        PC stack   pid father
>> kdmwork-253:0   D ffff8807c1fd3b78     0 10396      2 0x00000000
>>   ffff8807c1fd3b78 ffff88083b6b6cc0 0000000000012ec0 ffff8807c1fd3fd8
>>   0000000000012ec0 ffff880824225aa0 ffff88083b6b6cc0 ffff88081b0cb2c0
>>   ffff88085fc537c8 ffff8807c1fd3c98 ffff8807f7a99d70 ffffe8ffffc43bc0
>> Call Trace:
>>   [<ffffffff814d5230>] io_schedule+0xa0/0x130
>>   [<ffffffff8125a3f7>] bt_get+0x117/0x1b0
>>   [<ffffffff81256580>] ? blk_mq_queue_enter+0x30/0x2a0
>>   [<ffffffff81094cf0>] ? prepare_to_wait_event+0x110/0x110
>>   [<ffffffff8125a76f>] blk_mq_get_tag+0x9f/0xd0
>>   [<ffffffff8125591b>] __blk_mq_alloc_request+0x1b/0x210
>>   [<ffffffff812571c9>] blk_mq_alloc_request+0x139/0x150
>>   [<ffffffff8124c16e>] blk_get_request+0x2e/0xe0
>>   [<ffffffff8109a60d>] ? trace_hardirqs_on+0xd/0x10
>>   [<ffffffffa07f7d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
>>   [<ffffffffa07f7d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
>>   [<ffffffffa039dbb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
>>   [<ffffffff8109a53d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
>>   [<ffffffff81075cbe>] kthread_worker_fn+0x7e/0x1b0
>>   [<ffffffff81075c40>] ? __init_kthread_worker+0x60/0x60
>>   [<ffffffff81075bc8>] kthread+0xf8/0x110
>>   [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
>>   [<ffffffff814dacac>] ret_from_fork+0x7c/0xb0
>>   [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
>
> Jens,
>
> This stack trace confirms my suspicion that switching DM-multipath over
> to allocating clone requests via blk_get_request (rather than using a
> dedicated mempool in DM core) is the cause of the slowdown that Bart has
> experienced.
>
> Given that blk_mq_get_tag() looks to be the culprit, is there anything we
> can do to speed up blk-mq request allocation?  I'm currently using
> GFP_KERNEL when calling blk_get_request().
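
For context, the allocation path being discussed boils down to roughly the
sketch below. This is not the actual dm-mpath code (the function name and
arguments are made up for illustration); it only shows where a GFP_KERNEL
allocation ends up waiting for a tag on the underlying blk-mq queue:

#include <linux/blkdev.h>
#include <linux/err.h>

/*
 * Minimal sketch, not the real dm-mpath clone path.  The old scheme took
 * the clone from a DM-private mempool; the new one takes a real request
 * (and therefore a blk-mq tag) from the underlying path's queue.  With
 * GFP_KERNEL this can sleep in bt_get(), as in the stack trace above,
 * whenever that queue has no free tags.
 */
static struct request *clone_onto_path(struct request *rq,
				       struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);
	struct request *clone;

	clone = blk_get_request(q, rq_data_dir(rq), GFP_KERNEL);
	if (IS_ERR(clone))
		return NULL;

	return clone;
}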

blk-mq request allocation is pretty much as optimized and fast as it can
be. The slowdown must be due to one of two reasons:

- A bug related to running out of requests, perhaps a missing queue run 
or something like that.
- A smaller number of available requests, due to the requested queue depth.
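
To illustrate the second point: the number of requests available on a
blk-mq queue is fixed when the driver sets up its tag set. The ops and
numbers below are made up, but the shape is the standard blk-mq setup:

#include <linux/blk-mq.h>
#include <linux/numa.h>

/* Sketch only: a driver's tag set determines the request pool size.    */
/* Once queue_depth requests are in flight on a hw queue, any further   */
/* blk_get_request() sleeps in bt_get() until a completion frees a tag. */
static struct blk_mq_ops example_mq_ops;	/* .queue_rq etc. elided */

static struct blk_mq_tag_set example_tag_set = {
	.ops		= &example_mq_ops,
	.nr_hw_queues	= 1,
	.queue_depth	= 64,		/* only 64 tags per hw queue */
	.cmd_size	= 0,
	.numa_node	= NUMA_NO_NODE,
};

The driver would then call blk_mq_alloc_tag_set(&example_tag_set) followed
by blk_mq_init_queue(&example_tag_set); every request allocated from that
queue, including the dm-mpath clones, comes out of those 64 tags.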

Looking at Bart's results, allocation is usually fast but sometimes very
slow, which would seem to indicate that reason #1 above is the issue.
Bart, since the allocation waits for quite a while, would it be possible
to cat the 'tags' file for that queue while it is stuck like that?
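
For reference, the per-hardware-queue tag state is exported under sysfs,
so something along these lines should work (the exact path is from memory
and may need adjusting for the device in question):

# cat /sys/block/<dev>/mq/0/tags

That should print counters such as nr_tags, nr_free and active_queues,
which would show whether the underlying queue's tag space is exhausted
while the allocation is stuck.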

-- 
Jens Axboe



