[dm-devel] Raid0 performance regression

Roger Willcocks roger at filmlight.ltd.uk
Fri Jan 21 16:38:03 UTC 2022


Hi folks,

we noticed a thirty percent drop in performance on one of our raid
arrays when switching from CentOS 6.5 to 8.4; the array uses raid0-like
striping to balance access (by time) across a pair of hardware raid-6
arrays. The underlying issue is also present in the native raid0
driver, so herewith the gory details; I'd appreciate your thoughts.

--

blkdev_direct_IO() calls submit_bio(), which calls an outermost
generic_make_request() (aka submit_bio_noacct()).

md_make_request() calls blk_queue_split(), which cuts an incoming
request into two parts, with the first no larger than get_max_io_size()
bytes (which, in the case of raid0, is the chunk size):

  R -> AB
  
blk_queue_split() gives the second part 'B' to generic_make_request()
to worry about later and returns the first part 'A'.
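
For reference, the relevant part of blk_queue_split() goes roughly
like this (abridged from memory, not verbatim kernel source):

    split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
    if (split) {
        bio_chain(split, *bio);    /* split is 'A', *bio is 'B' */
        submit_bio_noacct(*bio);   /* 'B' goes on current->bio_list */
        *bio = split;              /* 'A' is handed back to the caller */
    }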

md_make_request() then passes 'A' to a more specific request handler,
in this case raid0_make_request().

raid0_make_request() cuts its incoming request into two parts at the
next chunk boundary:

  A -> ab

It then fixes up the device (chooses a physical device) for 'a', and
gives both parts, separately, to generic_make_request().
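
Abridged (again from memory, not verbatim), that part of
raid0_make_request() looks like:

    if (sectors < bio_sectors(bio)) {
        struct bio *split = bio_split(bio, sectors, GFP_NOIO,
                                      &mddev->bio_set);
        bio_chain(split, bio);
        generic_make_request(bio);   /* 'b', still aimed at md */
        bio = split;                 /* 'a' */
    }
    bio_set_dev(bio, tmp_dev->bdev); /* fix up the device */
    generic_make_request(bio);       /* 'a' */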

This is where things go awry, because 'b' is still targeted at the
original device (same as 'B'), but 'B' was queued before 'b'. So we
end up with:

  R -> Bab

The outermost generic_make_request() then cuts 'B' at
get_max_io_size(), and the process repeats. ASCII art follows:


    /---------------------------------------------------/   incoming rq

    /--------/--------/--------/--------/--------/------/   max_io_size
      
|--------|--------|--------|--------|--------|--------|--------| chunks

|...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
      a    b  c     d  e     f  g     h  i     j  k     l

Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
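
For illustration, here's a throwaway userspace model of the above (all
names invented; it assumes, as I understand it, the order in which
generic_make_request() drains current->bio_list: bios for a lower-level
queue first, then bios for the same queue, then anything queued
earlier). It prints 'aeilhd' and 'cgkjfb' for the layout in the
diagram:

    /* Toy model of the above: two-disk raid0, chunk size 8 "units",
     * request [3,50) as in the diagram.  All names are invented. */
    #include <stdio.h>

    #define CHUNK 8
    #define MD   -1                /* bio still aimed at the md device */

    struct bio { int start, len, dev; };

    static struct bio fresh[8];    /* bios submitted by current step */
    static int nfresh;

    static void submit(int start, int len, int dev)
    {
        struct bio b = { start, len, dev };
        fresh[nfresh++] = b;
    }

    /* One pass through md: blk_queue_split(), then raid0_make_request() */
    static void md_make_request(struct bio b)
    {
        int room;

        if (b.len > CHUNK) {       /* blk_queue_split(): requeue 'B' */
            submit(b.start + CHUNK, b.len - CHUNK, MD);
            b.len = CHUNK;
        }
        room = CHUNK - b.start % CHUNK;
        if (b.len > room) {        /* raid0: requeue 'b', still md */
            submit(b.start + room, b.len - room, MD);
            b.len = room;
        }
        submit(b.start, b.len, (b.start / CHUNK) % 2);  /* map 'a' */
    }

    static int done[2][16], ndone[2];   /* per-disk dispatch order */

    int main(void)
    {
        struct bio pending[16], merged[16], b = { 3, 47, MD };
        int npend = 0, starts[16], n = 0, m, d, i, j, t;

        for (;;) {
            nfresh = 0;
            if (b.dev == MD)
                md_make_request(b);
            else                   /* reached a physical disk */
                done[b.dev][ndone[b.dev]++] = b.start;

            /* generic_make_request() drain order: bios for a lower
             * queue first, then the same queue, then older bios */
            m = 0;
            for (i = 0; i < nfresh; i++)
                if (fresh[i].dev != b.dev)
                    merged[m++] = fresh[i];
            for (i = 0; i < nfresh; i++)
                if (fresh[i].dev == b.dev)
                    merged[m++] = fresh[i];
            for (i = 0; i < npend; i++)
                merged[m++] = pending[i];
            if (!m)
                break;
            b = merged[0];
            for (i = 1; i < m; i++)
                pending[i - 1] = merged[i];
            npend = m - 1;
        }

        /* label the segments 'a'..'l' by position, as in the diagram */
        for (d = 0; d < 2; d++)
            for (i = 0; i < ndone[d]; i++)
                starts[n++] = done[d][i];
        for (i = 1; i < n; i++)    /* insertion sort by offset */
            for (j = i; j > 0 && starts[j - 1] > starts[j]; j--) {
                t = starts[j];
                starts[j] = starts[j - 1];
                starts[j - 1] = t;
            }
        for (d = 0; d < 2; d++) {
            printf("disk %d: ", d);
            for (i = 0; i < ndone[d]; i++)
                for (j = 0; j < n; j++)
                    if (starts[j] == done[d][i])
                        putchar('a' + j);
            putchar('\n');
        }
        return 0;
    }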

--

There are several potential fixes:

Simplest is to set raid0's blk_queue_max_hw_sectors() to UINT_MAX
instead of the chunk size, so that raid0_make_request() receives the
entire transfer length and cuts it up at chunk boundaries.
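
In raid0_run() that's a one-line change (sketch, untested; if memory
serves, the current code passes mddev->chunk_sectors here):

    blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);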

Neatest is for raid0_make_request() to recognise that 'b' doesn't
cross a chunk boundary, so it can be sent directly to the physical
device.
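
Roughly, in raid0_make_request() (a sketch only; raid0_map_submit()
is a hypothetical helper for the existing map-and-submit step):

    if (sectors < bio_sectors(bio)) {
        struct bio *split = bio_split(bio, sectors, GFP_NOIO,
                                      &mddev->bio_set);
        bio_chain(split, bio);
        if (bio_sectors(bio) <= chunk_sects)
            raid0_map_submit(mddev, bio);  /* 'b' fits one chunk:
                                              straight to the disk */
        else
            generic_make_request(bio);     /* 'b' spans chunks */
        bio = split;
    }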

And correct is for blk_queue_split() to requeue 'A' before 'B'.
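
I.e. something like this in blk_queue_split() (sketch; the caller
would have to cope with getting nothing back):

    if (split) {
        bio_chain(split, *bio);
        submit_bio_noacct(split);   /* queue 'A' first ... */
        submit_bio_noacct(*bio);    /* ... then 'B' */
        *bio = NULL;                /* nothing left to return */
    }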

--

There's also a second issue: with a large raid0 chunk size (256K), the
segments submitted to the physical device are at least 128K and
trigger the early unplug code in blk_mq_make_request(), so the
requests are never merged. There are legitimate reasons for a large
chunk size, so this seems unhelpful.
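
For reference, the trigger appears to be this test in
blk_mq_make_request() (quoted approximately from memory;
BLK_PLUG_FLUSH_SIZE is 128K, BLK_MAX_REQUEST_COUNT is 16):

    if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
        blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
        blk_flush_plug_list(plug, false);
        trace_block_plug(q);
    }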

--

As I said, I'd appreciate your thoughts.

--

Roger



