[dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
Jens Axboe
axboe at kernel.dk
Fri Jun 30 14:08:33 UTC 2017
On 06/30/2017 07:05 AM, Brian King wrote:
> On 06/29/2017 09:17 PM, Jens Axboe wrote:
>> On 06/29/2017 07:20 PM, Ming Lei wrote:
>>> On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe <axboe at kernel.dk> wrote:
>>>> On 06/29/2017 10:00 AM, Jens Axboe wrote:
>>>>> On 06/29/2017 09:58 AM, Jens Axboe wrote:
>>>>>> On 06/29/2017 02:40 AM, Ming Lei wrote:
>>>>>>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe at kernel.dk> wrote:
>>>>>>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>>>>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>>>>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>>>>>>> of atomics from the hot path. When running this on a Power system, to
>>>>>>>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>>>>>>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
>>>>>>>>
>>>>>>>> This has been done before, but I've never really liked it. The reason is
>>>>>>>> that it means that reading the part stat inflight count now has to
>>>>>>>> iterate over every possible CPU. Did you use partitions in your testing?
>>>>>>>> How many CPUs were configured? When I last tested this a few years ago
>>>>>>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>>>>>>> latencies), it was a net loss.
>>>>>>>
>>>>>>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>>>>>>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
>>>>>>> 96 cores, and dual numa nodes) too, the performance can be
>>>>>>> recovered basically if per numa-node counter is introduced and
>>>>>>> used in this case, but the patch was never posted out.
>>>>>>> If anyone is interested in that, I can rebase the patch on current
>>>>>>> block tree and post out. I guess the performance issue might be
>>>>>>> related with system cache coherency implementation more or less.
>>>>>>> This issue on ARM64 can be observed with the following userspace
>>>>>>> atomic counting test too:
>>>>>>>
>>>>>>> http://kernel.ubuntu.com/~ming/test/cache/
>>>>>>
>>>>>> How well did the per-node thing work? Doesn't seem to me like it would
>>>>>> go far enough. And per CPU is too much. One potential improvement would
>>>>>> be to change the part_stat_read() to just loop online CPUs, instead of
>>>>>> all possible CPUs. When CPUs go on/offline, use that as the slow path to
>>>>>> ensure the stats are sane. Often there's a huge difference between
>>>>>> NR_CPUS configured and what the system has. As Brian states, RH ships
>>>>>> with 2048, while I doubt a lot of customers actually run that...
>>>>>>
>>>>>> Outside of coming up with a more clever data structure that is fully
>>>>>> CPU topology aware, one thing that could work is just having X cache
>>>>>> line separated read/write inflight counters per node, where X is some
>>>>>> suitable value (like 4). That prevents us from having cross node
>>>>>> traffic, and it also keeps the cross cpu traffic fairly low. That should
>>>>>> provide a nice balance between cost of incrementing the inflight
>>>>>> counting, and the cost of looping for reading it.
>>>>>>
>>>>>> And that brings me to the next part...
>>>>>>
>>>>>>>> I do agree that we should do something about it, and it's one of those
>>>>>>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>>>>>>> up. It's just not great as it currently stands, but I don't think per
>>>>>>>> CPU counters is the right way to fix it, at least not for the inflight
>>>>>>>> counter.
>>>>>>>
>>>>>>> Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
>>>>>>> we can use some blk-mq knowledge(tagset?) to figure out the
>>>>>>> 'in_flight' counter. I thought about it before, but never got a
>>>>>>> perfect solution, and looks it is a bit hard, :-)
>>>>>>
>>>>>> The tags are already a bit spread out, so it's worth a shot. That would
>>>>>> remove the need to do anything in the inc/dec path, as the tags already
>>>>>> do that. The inlight count could be easily retrieved with
>>>>>> sbitmap_weight(). The only issue here is that we need separate read and
>>>>>> write counters, and the weight would obviously only get us the total
>>>>>> count. But we can have a slower path for that, just iterate the tags and
>>>>>> count them. The fast path only cares about total count.
>>>>>>
>>>>>> Let me try that out real quick.
>>>>>
>>>>> Well, that only works for whole disk stats, of course... There's no way
>>>>> around iterating the tags and checking for this to truly work.
>>>>
>>>> Totally untested proof of concept for using the tags for this. I based
>>>> this on top of Brian's patch, so it includes his patch plus the
>>>> _double() stuff I did which is no longer really needed.
>>>>
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index 9cf98b29588a..ec99d9ba0f33 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long sectors,
>>>> part_round_stats(cpu, part);
>>>> part_stat_inc(cpu, part, ios[rw]);
>>>> part_stat_add(cpu, part, sectors[rw], sectors);
>>>> - part_inc_in_flight(part, rw);
>>>> + part_inc_in_flight(cpu, part, rw);
>>>>
>>>> part_stat_unlock();
>>>> }
>>>> @@ -1751,7 +1751,7 @@ void generic_end_io_acct(int rw, struct hd_struct *part,
>>>>
>>>> part_stat_add(cpu, part, ticks[rw], duration);
>>>> part_round_stats(cpu, part);
>>>> - part_dec_in_flight(part, rw);
>>>> + part_dec_in_flight(cpu, part, rw);
>>>>
>>>> part_stat_unlock();
>>>> }
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index af393d5a9680..6ab2efbe940b 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -2434,8 +2434,13 @@ void blk_account_io_done(struct request *req)
>>>>
>>>> part_stat_inc(cpu, part, ios[rw]);
>>>> part_stat_add(cpu, part, ticks[rw], duration);
>>>> - part_round_stats(cpu, part);
>>>> - part_dec_in_flight(part, rw);
>>>> +
>>>> + if (req->q->mq_ops)
>>>> + part_round_stats_mq(req->q, cpu, part);
>>>> + else {
>>>> + part_round_stats(cpu, part);
>>>> + part_dec_in_flight(cpu, part, rw);
>>>> + }
>>>>
>>>> hd_struct_put(part);
>>>> part_stat_unlock();
>>>> @@ -2492,8 +2497,12 @@ void blk_account_io_start(struct request *rq, bool new_io)
>>>> part = &rq->rq_disk->part0;
>>>> hd_struct_get(part);
>>>> }
>>>> - part_round_stats(cpu, part);
>>>> - part_inc_in_flight(part, rw);
>>>> + if (rq->q->mq_ops)
>>>> + part_round_stats_mq(rq->q, cpu, part);
>>>> + else {
>>>> + part_round_stats(cpu, part);
>>>> + part_inc_in_flight(cpu, part, rw);
>>>> + }
>>>> rq->part = part;
>>>> }
>>>>
>>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>>> index 99038830fb42..3b5eb2d4b964 100644
>>>> --- a/block/blk-merge.c
>>>> +++ b/block/blk-merge.c
>>>> @@ -634,7 +634,7 @@ static void blk_account_io_merge(struct request *req)
>>>> part = req->part;
>>>>
>>>> part_round_stats(cpu, part);
>>>> - part_dec_in_flight(part, rq_data_dir(req));
>>>> + part_dec_in_flight(cpu, part, rq_data_dir(req));
>>>>
>>>> hd_struct_put(part);
>>>> part_stat_unlock();
>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>> index d0be72ccb091..a7b897740c47 100644
>>>> --- a/block/blk-mq-tag.c
>>>> +++ b/block/blk-mq-tag.c
>>>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>>> bitnr += tags->nr_reserved_tags;
>>>> rq = tags->rqs[bitnr];
>>>>
>>>> - if (rq->q == hctx->queue)
>>>> + if (rq && rq->q == hctx->queue)
>>>> iter_data->fn(hctx, rq, iter_data->data, reserved);
>>>> return true;
>>>> }
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 05dfa3f270ae..cad4d2c26285 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -43,6 +43,58 @@ static LIST_HEAD(all_q_list);
>>>> static void blk_mq_poll_stats_start(struct request_queue *q);
>>>> static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
>>>>
>>>> +struct mq_inflight {
>>>> + struct hd_struct *part;
>>>> + unsigned int inflight;
>>>> +};
>>>> +
>>>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>>>> + struct request *rq, void *priv,
>>>> + bool reserved)
>>>> +{
>>>> + struct mq_inflight *mi = priv;
>>>> +
>>>> + if (rq->part == mi->part &&
>>>> + test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>>>> + mi->inflight++;
>>>> +}
>>>> +
>>>> +unsigned long part_in_flight_mq(struct request_queue *q,
>>>> + struct hd_struct *part)
>>>> +{
>>>> + struct mq_inflight mi = { .part = part, .inflight = 0 };
>>>> +
>>>> + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
>>>> + return mi.inflight;
>>>> +}
>>>
>>> Compared with the totally percpu approach, this way might help 1:M or
>>> N:M mapping, but won't help 1:1 map(NVMe), when hctx is mapped to
>>> each CPU(especially there are huge hw queues on a big system), :-(
>>
>> Not disagreeing with that, without having some mechanism to only
>> loop queues that have pending requests. That would be similar to the
>> ctx_map for sw to hw queues. But I don't think that would be worthwhile
>> doing, I like your pnode approach better. However, I'm still not fully
>> convinced that one per node is enough to get the scalability we need.
>>
>> Would be great if Brian could re-test with your updated patch, so we
>> know how it works for him at least.
>
> I'll try running with both approaches today and see how they compare.
Focus on Ming's, a variant of that is the most likely path forward,
imho. It'd be great to do a quick run on mine as well, just to establish
how it compares to mainline, though.
--
Jens Axboe
More information about the dm-devel
mailing list