[dm-devel] Raid0 performance regression
Roger Willcocks
roger at filmlight.ltd.uk
Mon Jan 24 16:48:49 UTC 2022
> On 23 Jan 2022, at 21:34, Paul Menzel <pmenzel at molgen.mpg.de> wrote:
>
> Dear Roger,
>
>
> Am 23.01.22 um 19:00 schrieb Lukas Straub:
>> CC'ing Song Liu (md-raid maintainer) and linux-raid mailing list.
>> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:
>
>>> we noticed a thirty percent drop in performance on one of our raid
>>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
>
> For those outside the CentOS universe, what Linux kernel versions are those?
>
2.6.32 (and backported changes) and 4.18.0 (sim.)
>>> striping to balance (by time) access to a pair of hardware raid-6
>>> arrays. The underlying issue is also present in the native raid0
>>> driver so herewith the gory details; I'd appreciate your thoughts.
>>>
>>> --
>>>
>>> blkdev_direct_IO() calls submit_bio() which calls an outermost
>>> generic_make_request() (aka submit_bio_noacct()).
>>>
>>> md_make_request() calls blk_queue_split() which cuts an incoming
>>> request into two parts with the first no larger than get_max_io_size()
>>> bytes (which in the case of raid0, is the chunk size):
>>>
>>> R -> AB
>>> blk_queue_split() gives the second part 'B' to generic_make_request()
>>> to worry about later and returns the first part 'A'.
>>>
>>> md_make_request() then passes 'A' to a more specific request handler,
>>> In this case raid0_make_request().
>>>
>>> raid0_make_request() cuts its incoming request into two parts at the
>>> next chunk boundary:
>>>
>>> A -> ab
>>>
>>> it then fixes up the device (chooses a physical device) for 'a', and
>>> gives both parts, separately, to generic make request()
>>>
>>> This is where things go awry, because 'b' is still targetted to the
>>> original device (same as 'B'), but 'B' was queued before 'b'. So we
>>> end up with:
>>>
>>> R -> Bab
>>>
>>> The outermost generic_make_request() then cuts 'B' at
>>> get_max_io_size(), and the process repeats. Ascii art follows:
>>>
>>>
>>> /---------------------------------------------------/ incoming rq
>>>
>>> /--------/--------/--------/--------/--------/------/ max_io_size
>>> |--------|--------|--------|--------|--------|--------|--------| chunks
>>>
>>> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
>>> a b c d e f g h i j k l
>>>
>>> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
>>>
>>> --
>>>
>>> There are several potential fixes -
>>>
>>> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
>>> instead of chunk_size, so that raid0_make_request() receives the
>>> entire transfer length and cuts it up at chunk boundaries;
>>>
>>> neatest is for raid0_make_request() to recognise that 'b' doesn't
>>> cross a chunk boundary so it can be sent directly to the physical
>>> device;
>>>
>>> and correct is for blk_queue_split to requeue 'A' before 'B'.
>>>
>>> --
>>>
>>> There's also a second issue - with large raid0 chunk size (256K), the
>>> segments submitted to the physical device are at least 128K and
>>> trigger the early unplug code in blk_mq_make_request(), so the
>>> requests are never merged. There are legitimate reasons for a large
>>> chunk size so this seems unhelpful.
>>>
>>> --
>>>
>>> As I said, I'd appreciate your thoughts.
>
> Thank you for the report and the analysis.
>
> Is the second issue also a regression? If not, I suggest to split it into a separate thread.
>
Yes this is also a regression, both issues above have to be addressed to recover the
original performance.
Specifically, an md raid0 array with 256K chunk size interleaving two x 12-disk raid6
devices (Adaptec 3154 controller, 50MB files stored contiguously on disk, four threads)
can achieve a sequential read rate of 3800 MB/sec with the (very) old 6.5 kernel; this
falls to 2500 MB/sec with the relatively newer kernel.
This change to raid0.c:
- blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+ blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);
improves things somewhat, the sub-chunk requests are now submitted in order but we
still only get 2800 MB/sec because no merging takes place; the controller struggles to
keep up with the large number of sub-chunk transfers. This additional change to
blk-mq.c:
- if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+ if (request_count >= BLK_MAX_REQUEST_COUNT || (false && last &&
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
blk_flush_plug_list(plug, false);
Brings performance back to 6.5 levels.
>
> Kind regards,
>
> Paul
>
More information about the dm-devel
mailing list