[dm-devel] [PATCH] dm-thin: optimize power of two block size

Mon Jun 25 01:53:22 UTC 2012

On Mon, 18 Jun 2012, Joe Thornber wrote:

> On Mon, Jun 18, 2012 at 10:09:56AM -0400, Mikulas Patocka wrote:
> > Hi
> > 
> > This patch should be applied after 
> > dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes 
> > power-of-two blocksize.
> 
> I'm going to nack this unless you can provide a benchmark that shows
> it measurably improves performance for some architecture somewhere.
> And a real benchmark, with io going through all the devices, not just
> a micro benchmark of the 'if' in a tight loop.
> 
> - Joe

Hi

Here are some tests run on my collection of computers.

This is a do_div benchmark, the source is here:
http://people.redhat.com/~mpatocka/testcases/do_div_benchmark.c
For the "bignum" test, I replaced 0x12345678 with 0xff12345678LL (so that 
do_div divides real 64-bit numbers).

do_div is especially slow on PA-RISC and Alpha because they don't have a 
divide instruction.
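
For readers who want to see what the two measured operations look like, 
here is a minimal sketch. It is not the benchmark source linked above; 
the helper names block_div and block_shift are hypothetical, and plain C 
division stands in for the kernel's do_div.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* General case: a 64-bit divide and remainder (what do_div has to do). */
static uint64_t block_div(uint64_t sector, uint32_t block_size,
			  uint32_t *offset)
{
	*offset = (uint32_t)(sector % block_size);
	return sector / block_size;
}

/* Power-of-two case: block_size == 1 << shift, so no divide is needed. */
static uint64_t block_shift(uint64_t sector, unsigned int shift,
			    uint32_t *offset)
{
	*offset = (uint32_t)(sector & ((UINT64_C(1) << shift) - 1));
	return sector >> shift;
}
```

For a block size of 128 sectors (shift 7), both helpers return the same 
block number and offset for any sector, including the "bignum" value 
0xff12345678.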

PA-RISC 900MHz 64-bit:
shift+mask:		4 ticks		(4.4ns)
shift+mask bignum:	4 ticks		(4.4ns)
do_div:			825 ticks	(917ns)
do_div bignum:		825 ticks	(917ns)

UltraSparc2 440MHz 64-bit:
shift+mask:		3 ticks		(6.8ns)
shift+mask bignum:	3 ticks		(6.8ns)
do_div:			87 ticks	(198ns)
do_div bignum:		93 ticks	(211ns)

Alpha ev45 233MHz 64-bit:
shift+mask:		7 ticks		(30ns)
shift+mask bignum:	8 ticks		(34ns)
do_div:			598 ticks	(2563ns)
do_div bignum:		897 ticks	(3844ns)

Pentium 3 850MHz:
shift+mask:		12.25 ticks	(14ns)
shift+mask bignum:	16 ticks	(19ns)
do_div:			63.5 ticks	(75ns)
do_div bignum:		94 ticks	(111ns)

Core2 Xeon 1600MHz 64-bit:
shift+mask:		3.2 ticks	(2ns)
shift+mask bignum:	3.4 ticks	(2.1ns)
do_div:			64 ticks	(40ns)
do_div bignum:		64 ticks	(40ns)

K10 Opteron 2300MHz 64-bit:
shift+mask:		3 ticks		(1.3ns)
shift+mask bignum:	3 ticks		(1.3ns)
do_div:			46 ticks	(20ns)
do_div bignum:		57 ticks	(28ns)

---

On that PA-RISC machine, I set up a dm-stripe target consisting of two 
stripes on a ramdisk, with 4k stripe size. I ran
dd if=/dev/mapper/stripe of=/dev/null bs=512 count=100000 iflag=direct
With the optimization patches: 38.2-38.5 MB/s
Without the optimization patches: 35.3-35.6 MB/s

With larger io size:
dd if=/dev/mapper/stripe of=/dev/null bs=1M count=200 iflag=direct
With the optimization patches: 269-272 MB/s
Without the optimization patches: 250-253 MB/s

Tests with dm-thin on PA-RISC:
A device with 512MB pool and 512MB metadata on ramdisks, 64k chunk.

Overwrite the first time with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 91.0-91.4 MB/s
With the optimization patches: 90.6-91.6 MB/s

Subsequent overwrite with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 104 MB/s
With the optimization patches: 104 MB/s

Read the overwritten device with
dd if=/dev/mapper/thin of=/dev/null bs=1M iflag=direct
Without the optimization patches: 252-254 MB/s
With the optimization patches: 257-258 MB/s

So the conclusion is that the divide instruction degrades transfer 
speed, especially on dm-stripe with 4k stripe size. On dm-thin the 
difference is measurable only with raw reads; it is smaller there 
because dm-thin has a minimum chunk size of 64k.
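
To illustrate why the chunk size matters, here is a rough sketch. It 
assumes (this is an assumption, not something measured above) that 
about one division is done per chunk-sized piece of an aligned io, and 
that sectors are 512 bytes. The function name pieces_per_io is 
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of chunk-sized pieces that an aligned io of io_sectors is
 * split into. Assumption: roughly one divide is done per piece.
 */
static uint64_t pieces_per_io(uint64_t io_sectors, uint64_t chunk_sectors)
{
	return (io_sectors + chunk_sectors - 1) / chunk_sectors;
}
```

Under these assumptions, a 1M io (2048 sectors) is 256 pieces with a 4k 
stripe (8 sectors) but only 16 pieces with a 64k chunk (128 sectors), so 
the cost of the divide is spread over much more data on dm-thin.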

The question is: why do you want to avoid this optimization? If it is 
because of source code clarity, we can create #define sector_div_optimized 
that optimizes the common case of a power-of-two divisor, and the code 
would be no more complicated than with sector_div. Or do you have some 
other reasons?
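
A possible shape for such a helper, as a userspace sketch only. The 
kernel's sector_div is a macro that divides its first argument in place 
and returns the remainder; here a pointer-taking stand-in named 
sector_div_sketch (hypothetical) plays that role, and an inline function 
is used instead of a #define. __builtin_ctz is a GCC/Clang builtin; 
kernel code would more likely use a precomputed shift.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for sector_div: divide *n in place, return remainder. */
static inline uint32_t sector_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/*
 * Same contract, but avoids the divide when base is a power of two.
 * Assumes base != 0.
 */
static inline uint32_t sector_div_optimized(uint64_t *n, uint32_t base)
{
	if ((base & (base - 1)) == 0) {
		uint32_t rem = (uint32_t)(*n & (base - 1));

		*n >>= __builtin_ctz(base);	/* log2 of a power of two */
		return rem;
	}
	return sector_div_sketch(n, base);
}
```

Callers would use it exactly like the plain version, so the call sites 
stay the same for both power-of-two and other block sizes.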

BTW, when unloading the dm-thin device with debugging enabled (the tests 
were done with debugging disabled), I got this message:
device-mapper: space map checker: free block counts differ, checker 
131060, sm-disk:130991
So there is presumably some bug? The kernel is 3.4.3.

Mikulas