[dm-devel] Raid0 performance regression

Paul Menzel pmenzel at molgen.mpg.de
Sun Jan 23 21:34:02 UTC 2022


Dear Roger,


On 23.01.22 at 19:00, Lukas Straub wrote:
> CC'ing Song Liu (md-raid maintainer) and linux-raid mailing list.
> 
> On Fri, 21 Jan 2022 16:38:03 +0000 Roger Willcocks wrote:

>> We noticed a thirty percent drop in performance on one of our raid
>> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like

For those outside the CentOS universe, what Linux kernel versions are those?

>> striping to balance (by time) access to a pair of hardware raid-6
>> arrays. The underlying issue is also present in the native raid0
>> driver, so herewith the gory details; I'd appreciate your thoughts.
>>
>> --
>>
>> blkdev_direct_IO() calls submit_bio(), which calls an outermost
>> generic_make_request() (aka submit_bio_noacct()).
>>
>> md_make_request() calls blk_queue_split(), which cuts an incoming
>> request into two parts, with the first no larger than get_max_io_size()
>> bytes (which, in the case of raid0, is the chunk size):
>>
>>    R -> AB
>>    
>> blk_queue_split() gives the second part 'B' to generic_make_request()
>> to worry about later and returns the first part 'A'.
>>
>> md_make_request() then passes 'A' to a more specific request handler,
>> in this case raid0_make_request().
>>
>> raid0_make_request() cuts its incoming request into two parts at the
>> next chunk boundary:
>>
>>    A -> ab
>>
>> It then fixes up the device (chooses a physical device) for 'a' and
>> gives both parts, separately, to generic_make_request().
>>
>> This is where things go awry, because 'b' is still targeted to the
>> original device (same as 'B'), but 'B' was queued before 'b'. So we
>> end up with:
>>
>>    R -> Bab
>>
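
To make the ordering concrete (my numbers, not yours): with 256K
chunks and a request starting 64K into a chunk, blk_queue_split()
returns A = the first 256K and requeues B = the remainder;
raid0_make_request() then cuts A at the chunk boundary into a = 192K
(mapped to a physical device) and b = 64K (the head of the next
chunk, still addressed to the md device, like B). Both go back to
generic_make_request(), but B is already queued ahead of b.
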
>> The outermost generic_make_request() then cuts 'B' at
>> get_max_io_size(), and the process repeats. ASCII art follows:
>>
>>
>>      /---------------------------------------------------/   incoming rq
>>
>>      /--------/--------/--------/--------/--------/------/   max_io_size
>>        
>> |--------|--------|--------|--------|--------|--------|--------| chunks
>>
>> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
>>        a    b  c     d  e     f  g     h  i     j  k     l
>>
>> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
>>
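
To check I had followed the mechanics, I put together a small
userspace simulation (below). It models only what you describe:
blk_queue_split() cutting at a max_io_size equal to the chunk size,
raid0_make_request() cutting at the next chunk boundary and mapping
the first piece, and the two-bio_list dispatch in
generic_make_request() (newly submitted lower-level bios first, then
same-level ones, then the older backlog). Offsets and units match
your ASCII art; none of the names or types are the kernel's. It
prints

  disk0: aeilhd
  disk1: cgkjfb

which matches your observed order.

#include <stdio.h>
#include <string.h>

#define CHUNK 8     /* chunk size; == get_max_io_size() for raid0 here */
#define MD    (-1)  /* device id: -1 = md device, 0/1 = physical disks */

struct bio { int dev, off, len; };

/* crude fixed-capacity FIFO standing in for struct bio_list */
struct biolist { struct bio b[32]; int n; };

static void push(struct biolist *l, struct bio b) { l->b[l->n++] = b; }

static struct bio pop(struct biolist *l)
{
    struct bio b = l->b[0];
    memmove(l->b, l->b + 1, --l->n * sizeof(b));
    return b;
}

static void merge(struct biolist *dst, const struct biolist *src)
{
    for (int i = 0; i < src->n; i++)
        push(dst, src->b[i]);
}

/* positional label matching the ASCII art above: a..l */
static char label(int off)
{
    return 'a' + (off / CHUNK) * 2 - (off % CHUNK == 0);
}

int main(void)
{
    struct biolist list = { .n = 0 };
    char out[2][16] = { "", "" };

    push(&list, (struct bio){ MD, 3, 46 });   /* incoming request R */

    while (list.n) {
        struct bio bio = pop(&list);

        if (bio.dev != MD) {            /* physical disk: record order */
            out[bio.dev][strlen(out[bio.dev])] = label(bio.off);
            continue;
        }

        struct biolist saved = list;    /* ~ bio_list_on_stack[1] */
        struct biolist sub = { .n = 0 };
        struct biolist lower = { .n = 0 }, same = { .n = 0 };

        /* md_make_request(): blk_queue_split() cuts R -> A B,
         * requeues B, keeps working on A */
        if (bio.len > CHUNK) {
            push(&sub, (struct bio){ MD, bio.off + CHUNK,
                                     bio.len - CHUNK });
            bio.len = CHUNK;
        }

        /* raid0_make_request(): cut A -> a b at the next chunk
         * boundary, map 'a' to a disk, resubmit both */
        int boundary = (bio.off / CHUNK + 1) * CHUNK;
        int alen = bio.len < boundary - bio.off ? bio.len
                                                : boundary - bio.off;

        push(&sub, (struct bio){ (bio.off / CHUNK) % 2, bio.off, alen });
        if (bio.len > alen)
            push(&sub, (struct bio){ MD, bio.off + alen,
                                     bio.len - alen });

        /* generic_make_request() dispatch: newly submitted lower-level
         * bios first, then same-level ones, then the older backlog */
        while (sub.n) {
            struct bio s = pop(&sub);
            push(s.dev == MD ? &same : &lower, s);
        }
        list.n = 0;
        merge(&list, &lower);
        merge(&list, &same);
        merge(&list, &saved);
    }
    printf("disk0: %s\ndisk1: %s\n", out[0], out[1]);
    return 0;
}
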
>> --
>>
>> There are several potential fixes -
>>
>> the simplest is to set raid0's blk_queue_max_hw_sectors() to UINT_MAX
>> instead of the chunk size, so that raid0_make_request() receives the
>> entire transfer length and cuts it up at chunk boundaries (a sketch
>> of this follows below);
>>
>> the neatest is for raid0_make_request() to recognise that 'b' doesn't
>> cross a chunk boundary so it can be sent directly to the physical
>> device;
>>
>> and the correct fix is for blk_queue_split() to requeue 'A' ahead of 'B'.
>>
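
For what it's worth, the first option looks like a one-liner against
raid0_run() in drivers/md/raid0.c (an untested sketch on my side; the
surrounding code differs between kernel versions):

-	blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+	blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

raid0_make_request() would then see the whole transfer and cut it up
at chunk boundaries itself.
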
>> --
>>
>> There's also a second issue: with a large raid0 chunk size (256K), the
>> segments submitted to the physical device are at least 128K and
>> trigger the early unplug code in blk_mq_make_request(), so the
>> requests are never merged. There are legitimate reasons for a large
>> chunk size, so this seems unhelpful.
>>
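
On this second point: if we are reading the same code, the early
unplug is this check in blk_mq_make_request() (paraphrased from
memory of block/blk-mq.c; the exact context moves around between
versions):

	if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
	    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
		blk_flush_plug_list(plug, false);
		trace_block_plug(q);
	}

BLK_PLUG_FLUSH_SIZE is 128K, so once raid0 has split a 256K chunk in
half, the request behind the plug is already at the flush threshold
and the contiguous follow-up can never merge with it.
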
>> --
>>
>> As I said, I'd appreciate your thoughts.

Thank you for the report and the analysis.

Is the second issue also a regression? If not, I suggest splitting it
off into a separate thread.


Kind regards,

Paul



