[dm-devel] [PATCH] dm-thin: optimize power of two block size
Mikulas Patocka
mpatocka at redhat.com
Mon Jun 25 01:53:22 UTC 2012
On Mon, 18 Jun 2012, Joe Thornber wrote:
> On Mon, Jun 18, 2012 at 10:09:56AM -0400, Mikulas Patocka wrote:
> > Hi
> >
> > This patch should be applied after
> > dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes
> > power-of-two blocksize.
>
> I'm going to nack this unless you can provide a benchmark that shows
> it measurably improves performance for some architecture somewhere.
> And a real benchmark, with io going through all the devices, not just
> a micro benchmark of the 'if' in a tight loop.
>
> - Joe
Hi
Here are some tests run on my collection of computers.
This is a do_div benchmark, the source is here:
http://people.redhat.com/~mpatocka/testcases/do_div_benchmark.c
For the "bignum" test, I replaced 0x12345678 with 0xff12345678LL (so that
do_div divides real 64-bit numbers).
It is especially slow on PA-RISC and Alpha because they don't have a
divide instruction.
PA-RISC 900MHz 64-bit:
shift+mask: 4 ticks (4.4ns)
shift+mask bignum: 4 ticks (4.4ns)
do_div: 825 ticks (917ns)
do_div bignum: 825 ticks (917ns)
UltraSparc2 440MHz 64-bit:
shift+mask: 3 ticks (6.8ns)
shift+mask bignum: 3 ticks (6.8ns)
do_div: 87 ticks (198ns)
do_div bignum: 93 ticks (211ns)
Alpha ev45 233MHz 64-bit:
shift+mask: 7 ticks (30ns)
shift+mask bignum: 8 ticks (34ns)
do_div: 598 ticks (2563ns)
do_div bignum: 897 ticks (3844ns)
Pentium 3 850MHz:
shift+mask: 12.25 ticks (14ns)
shift+mask bignum: 16 ticks (19ns)
do_div: 63.5 ticks (75ns)
do_div bignum: 94 ticks (111ns)
Core2 Xeon 1600MHz 64-bit:
shift+mask: 3.2 ticks (2ns)
shift+mask bignum: 3.4 ticks (2.1ns)
do_div: 64 ticks (40ns)
do_div bignum: 64 ticks (40ns)
K10 Opteron 2300MHz 64-bit:
shift+mask: 3 ticks (1.3ns)
shift+mask bignum: 3 ticks (1.3ns)
do_div: 46 ticks (20ns)
do_div bignum: 57 ticks (28ns)
---
On that PA-RISC machine, I set up a dm-stripe target consisting of two
stripes on a ramdisk, with a 4k stripe size. I performed
dd if=/dev/mapper/stripe of=/dev/null bs=512 count=100000 iflag=direct
With the optimization patches: 38.2-38.5 MB/s
Without the optimization patches: 35.3-35.6 MB/s
With larger io size:
dd if=/dev/mapper/stripe of=/dev/null bs=1M count=200 iflag=direct
With the optimization patches: 269-272 MB/s
Without the optimization patches: 250-253 MB/s
Tests with dm-thin on PA-RISC:
A device with 512MB pool and 512MB metadata on ramdisks, 64k chunk.
Overwrite the first time with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 91.0-91.4 MB/s
With the optimization patches: 90.6-91.6 MB/s
Subsequent overwrite with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 104 MB/s
With the optimization patches: 104 MB/s
Read the overwritten device with
dd if=/dev/mapper/thin of=/dev/null bs=1M iflag=direct
Without the optimization patches: 252-254 MB/s
With the optimization patches: 257-258 MB/s
So the conclusion is that the divide instruction degrades transfer
speed, especially on dm-stripe with a 4k stripe size (on dm-thin it is
measurable only with raw reads; the difference is smaller because
dm-thin has a minimum chunk size of 64k).
The question is: why do you want to avoid such an optimization? If it
is because of source code clarity, we could create a #define
sector_div_optimized that handles the common case of a power-of-two
divisor, and the code would be no more complicated than with
sector_div. Or do you have some other reasons?
BTW, when unloading the dm-thin device with debugging enabled (the
tests above were done with debugging disabled), I got this message:
device-mapper: space map checker: free block counts differ, checker 131060, sm-disk: 130991
--- so there is presumably some bug? The kernel is 3.4.3.
Mikulas