[dm-devel] [PATCH 1/6] dm raid45 target: export region hash functions and add a needed one

Heinz Mauelshagen heinzm at redhat.com
Thu Jul 2 12:52:10 UTC 2009


On Mon, 2009-06-22 at 21:10 +0200, Heinz Mauelshagen wrote:
> On Sun, 2009-06-21 at 22:06 +1000, Neil Brown wrote:
> > On Friday June 19, heinzm at redhat.com wrote:
> > > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
> > > > On Wednesday June 17, neilb at suse.de wrote:
> > > > > 
> > > > > I will try to find time to review your dm-raid5 code with a view to
> > > > > understanding how it plugs in to dm, and then how the md/raid5 engine
> > > > > can be used by dm-raid5.
> > > 
> > > Hi Neil.
> > > 
> > > > 
> > > > I've had a bit of a look through the dm-raid5 patches.
> > > 
> > > Thanks.
> > > 
> > > > 
> > > > Some observations:
> > > > 
> > > > - You have your own 'xor' code against which you do a run-time test of
> > > >   the 'xor_block' code which md/raid5 uses - then choose the fastest.
> > > >   This really should not be necessary.  If you have xor code that runs
> > > >   faster than anything in xor_block, it really would be best to submit
> > > >   it for inclusion in the common xor code base.
> > > 
> > > This is in because it actually shows better performance regularly by
> > > utilizing cache lines etc. more efficiently (tested on Intel, AMD and
> > > Sparc).
> > > 
> > > If xor_block had always performed best, I'd have dropped that
> > > optimization already.
<SNIP>

Dan, Neil,

as mentioned before I left for LinuxTag last week, here is an initial
take on dm-raid45 warm/cold CPU cache xor speed optimization metrics.

This should give us a basis for deciding whether to keep the dm-raid45
internal xor optimization magic, drop it, or move (part of) it into the
crypto subsystem.

Heinz


Howto:
------
I added a loop to xor_optimize() in dm-raid45.c that walks the list of
recovery stripes, so that the working set can overcommit the CPU cache,
plus some variables to display the absolute minimum and maximum number of
xor runs performed and the number of xor runs achieved per cycle, for both
xor_blocks() and the dm-raid45 built-in xor optimization.

In order to make results more deterministic, I run xor_speed() for <= 5 ticks.

See the attached diff against the dm-devel dm-raid45 patch (submitted Jun 15th).
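
For readers without the attachment, here is a minimal sketch of the timing
idea.  All names, the 2-source/1-parity page layout and the direct
xor_blocks() call are illustrative assumptions; the real code is in the
attached diff.

/*
 * Hypothetical sketch: count how many parity-calculation runs over all
 * recovery stripes fit into <= 5 ticks.
 */
#include <linux/jiffies.h>
#include <linux/raid/xor.h>

struct test_stripe {
	void *parity;		/* parity (destination) block */
	void *src[2];		/* 2 data blocks, i.e. a 3-drive set */
	unsigned int bytes;	/* chunk io size in bytes */
};

static unsigned int xor_speed(struct test_stripe *s, unsigned int nr_stripes)
{
	unsigned long end = jiffies + 5;	/* run for <= 5 ticks */
	unsigned int i, runs = 0;

	while (time_before(jiffies, end)) {
		/*
		 * Walking all recovery stripes overcommits the CPU cache
		 * once the working set exceeds the cache size.
		 */
		for (i = 0; i < nr_stripes; i++)
			xor_blocks(2, s[i].bytes, s[i].parity, s[i].src);

		runs++;
	}

	return runs;	/* compared for xor_blocks() vs. built-in xor() */
}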

Tests were performed on the following two systems:

   hostname: a4
   2.6.31-rc1 at 250HZ timer frequency
   Core i7 920 at 3.4GHz, 8 MB 3rd Level Cache
   6GB RAM

   hostname: t4
   2.6.31-rc1 at 250HZ timer frequency
   2 Opteron 280 CPUs at 2.4GHz, 2*1 MB 2nd Level Cache
   2GB RAM

with the xor optimization being the only load on the systems.


I've performed test runs on each of those systems with the following
mapping tables, using 128 iterations per table.  These represent a small
array case with 3 drives per set, running the xor optimization on a
single core:

Intel:
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 512 10 nosync 1  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0
...
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 512 10 nosync 13  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0

Opteron:
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 256 10 nosync 1  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0
...
0 58720256 raid45 core 2 8192 nosync  raid5_la 7 -1 -1 -1 256 10 nosync 13  3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0


Because no actual IO is performed, I simply mapped to error targets
(table used: "0 2199023255552 error"; I know it's large, but that doesn't
matter here).

The number following the 2nd nosync parameter is the number of recovery
stripes, with an io size of 512 sectors = 256 kilobytes per chunk (Intel)
or 256 sectors = 128 kilobytes per chunk (Opteron) respectively;
with 3 chunks per stripe, that is a working set of 3 * 256 = 768 or
3 * 128 = 384 kilobytes per recovery stripe.
These values are meant to ensure that results differ in the per mille
range (i.e. more than 100 cycles per test run) where appropriate.


The systems run out of cache at
~ >= 8 stripes on the Intel ((8192 KB - 2048 KB for code) / 768 KB per stripe = 8)
and
~ >= 0 stripes on the Opteron ((1024 KB - 768 KB for code) / 384 KB per stripe < 1),
assuming some cache utilization for code and other data.

See the raw kernel log extracts created by these test runs, attached as a
tarball, together with the script used to extract the metrics.


Intel results with 128 iterations each:
---------------------------------------

1 stripe  : NB:10 111/80 HM:118 111/82
2 stripes : NB:25 113/87 HM:103 112/91
3 stripes : NB:24 115/93 HM:104 114/93
4 stripes : NB:48 114/93 HM:80 114/93
5 stripes : NB:38 113/94 HM:90 114/94
6 stripes : NB:25 116/94 HM:103 114/94
7 stripes : NB:25 115/95 HM:103 115/95
8 stripes : NB:62 117/96 HM:66 116/95 <<<--- cold cache starts here
9 stripes : NB:66 117/96 HM:62 116/95
10 stripes: NB:73 117/96 HM:55 114/95
11 stripes: NB:63 114/96 HM:65 112/95
12 stripes: NB:51 111/96 HM:77 110/95
13 stripes: NB:65 109/96 HM:63 112/95

NB: number of iterations (out of 128) in which the xor_blocks() parity
    calculation wins
HM: number of iterations (out of 128) in which the dm-raid45 xor() parity
    calculation equals or beats xor_blocks()
NN/MM: maximum/minimum number of calculations achieved per iteration
    in <= 5 ticks

Opteron results with 128 iterations each:
-----------------------------------------
1 stripe  : NB:0 30/20 HM:128 64/53
2 stripes : NB:0 31/21 HM:128 68/55
3 stripes : NB:0 31/22 HM:128 68/57
4 stripes : NB:0 32/22 HM:128 70/61
5 stripes : NB:0 32/22 HM:128 70/63
6 stripes : NB:0 35/22 HM:128 70/64
7 stripes : NB:0 32/23 HM:128 69/63
8 stripes : NB:0 44/23 HM:128 76/65
9 stripes : NB:0 43/23 HM:128 73/65
10 stripes: NB:0 35/23 HM:128 72/64
11 stripes: NB:0 35/24 HM:128 72/64
12 stripes: NB:0 33/24 HM:128 72/65
13 stripes: NB:0 33/23 HM:128 71/64


Test analysis:
--------------
I must have done something wrong ;-)

On the Opteron, dm-raid45 xor() outperforms xor_blocks() by far.
No warm-cache effect is visible.

On the Intel, dm-raid45 xor() performs slightly better with a warm cache,
while xor_blocks() performs slightly better with a cold cache, which may
be a result of the lack of prefetching in dm-raid45 xor().
xor_blocks() achieves a slightly better maximum in 8 of the 13 test runs
vs. xor() in 2 test runs; in 3 runs they achieve the same maximum.

This is not deterministic:
min/max vary by over 200% on the Opteron
and by up to 46% on the Intel.


Questions/Recommendations:
--------------------------
Please review the code changes and the data analysis.

Please review the test cases and argue whether they are valid,
or recommend different ones.

Can we make this more deterministic (e.g. by using prefetching in
dm-raid45 xor())?
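
To illustrate the prefetching idea, here is a hypothetical sketch of a
prefetching xor inner loop.  The unroll factor and the PREFETCH_AHEAD
distance are assumed tuning values, not dm-raid45 code:

/*
 * Hypothetical sketch: xor one source block into a destination block,
 * prefetching upcoming cache lines.  Assumes bytes is a multiple of
 * 8 machine words, as it is for the chunk sizes used above.
 */
#include <linux/prefetch.h>

#define PREFETCH_AHEAD	256	/* assumed prefetch distance in bytes */

static void xor_one_prefetch(unsigned long *dst, const unsigned long *src,
			     unsigned int bytes)
{
	unsigned int i, words = bytes / sizeof(unsigned long);

	for (i = 0; i < words; i += 8) {
		/* Pull upcoming lines in before the xor needs them. */
		prefetch((const char *)(src + i) + PREFETCH_AHEAD);
		prefetchw((char *)(dst + i) + PREFETCH_AHEAD);

		dst[i + 0] ^= src[i + 0];
		dst[i + 1] ^= src[i + 1];
		dst[i + 2] ^= src[i + 2];
		dst[i + 3] ^= src[i + 3];
		dst[i + 4] ^= src[i + 4];
		dst[i + 5] ^= src[i + 5];
		dst[i + 6] ^= src[i + 6];
		dst[i + 7] ^= src[i + 7];
	}
}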

Regards,
Heinz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: xor_performance_metrics
Type: application/x-shellscript
Size: 808 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20090702/29c3d718/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dm-raid45-2.6.31-rc1.patch
Type: text/x-patch
Size: 6876 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20090702/29c3d718/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xor_optomize_test_data.tar.bz2
Type: application/x-bzip-compressed-tar
Size: 17107 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20090702/29c3d718/attachment-0002.bin>

