Proper alignment between disk HW blocks, mdadm strides, and ext[23] blocks

Chris Worley worleys at gmail.com
Tue Nov 13 17:20:54 UTC 2007


On Nov 9, 2007 11:16 PM, Andreas Dilger <adilger at sun.com> wrote:
> On Nov 09, 2007  19:11 -0700, Chris Worley wrote:
> > How do you measure/gauge/assure proper alignment?
> >
> > The physical disk has a block structure.  What is it, or how do you
> > find it?  I'm guessing it's best not to partition disks, so that
> > whatever its native block read/write size is doesn't get bisected by
> > a partition boundary.
>
> For Lustre we never partition the disks for exactly this reason, and if
> you are using LVM/md on the whole device it doesn't make sense either.
>
> > Then, mdadm has some block structure.  The "-c" ("chunk") is in
> > "kibibytes" (feed the dog kibbles?), with a default of 64.  Not a clue
> > what they're trying to do.
>
> That just means for RAID 0/5/6 that the amount of data or parity in a
> stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:
>
>         disk0 disk1 disk2 disk3 disk4
>         [64kB][64kB][64kB][64kB][64kB]
>         [64kB][64kB]...
>
> > Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
> > size" in the man page (and I thought all your stripes added together
> > are a "stride"), as well as a block size.
>
> For ext2/3/4, the stride size * the filesystem block size == the mdadm
> chunk size.  Note that the ext2/3/4 stride size is in units of filesystem
> blocks, so if you have 4kB filesystem blocks (default for filesystems >
> 500MB) and a 64kB RAID5 chunk size, the stride is 16:
>
>         mke2fs -E stride=16 /dev/md0

So, if:

B=Ext Block size
S=Ext Stride size
C=MD Chunk size

Then:

S=C/B

Is that correct?
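
E.g., plugging in the defaults from your example (just my arithmetic as
a sanity check, with everything in kB):

C=64   # mdadm chunk size, kB
B=4    # ext2/3 block size, kB
echo $((C / B))   # prints 16, matching the stride=16 example above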

Shopping around for values somewhat blindly (using 1MB block sizes and
16GB transfers in dd as the benchmark), I found performance increased as
I increased the MD chunk size (testing just the MD device).  Above a
chunk of 1024, the raw MD performance kept improving, but the ext
filesystem got slower.  Strangely, the ext stride performed best when
set to 2048 (the equation above says 256 would have been correct):

mdadm --create /dev/md0 --level=0 --chunk=1024 --raid-devices 12  /dev/sd[b-m]
mkfs.ext2 -T largefile4 -b 4096 -E stride=2048 /dev/md0
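
For reference, the benchmark at each layer was just dd with 1MB blocks
and a 16GB transfer, roughly like this (the mount point and file name
are placeholders):

# raw MD device read
dd if=/dev/md0 of=/dev/null bs=1024k count=16000
# write through the filesystem (assuming /dev/md0 is mounted on /mnt/md0)
dd if=/dev/zero of=/mnt/md0/bigfile bs=1024k count=16000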

So it may be better to say that "S" in the equation above is only a
factor of the stride value that actually performs best.

Note that I am trying to optimize for big blocks and big files, with
little regard for data reliability.

I also found some strange performance differences between different
manufacturers' disks.  I have a bunch of Maxtor 15K and Seagate 10K
SCSI disks.  Streaming from a single drive at a time, the Maxtor disks
are faster, but in parallel the Seagate drives are faster.  I measure
this with something like:

for i in /dev/sd[e-r]
do /usr/bin/time -f "$i: %e" \
       dd bs=1024k count=16000 of=/dev/null if=$i 2>&1 \
         | grep -v records &
done
wait

This test doesn't truly emulate an MD device, as each disk is treated
independently and a given disk is allowed to get ahead of the rest.
Why the Seagates outperform the Maxtors is unknown; they are evenly
distributed across the SCSI channels (as many Seagates on each channel
as Maxtors).

I'm guessing the Seagate disks have deeper buffers.

I remember that, a few years ago, increasing the number of outstanding
scatter/gather requests improved the performance of QLogic FC
drivers... is there any such driver or kernel tweak these days?
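
The only knobs I've found to poke at so far are the block-layer queue
depth and readahead, e.g. (sdb is just an example device, and I'm not
sure these are the right ones for this workload):

cat /sys/block/sdb/device/queue_depth    # per-device SCSI queue depth
cat /sys/block/sdb/queue/nr_requests     # block-layer request queue size
blockdev --getra /dev/sdb                # readahead, in 512-byte sectors
echo 512 > /sys/block/sdb/queue/nr_requests
blockdev --setra 4096 /dev/sdb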

I'd still like to know what the disks use for a block size.
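
The closest I've found to querying that is the logical sector size, e.g.:

blockdev --getss /dev/sdb                  # logical sector size in bytes
cat /sys/block/sdb/queue/hw_sector_size    # same information via sysfs

though that presumably doesn't tell the whole story about track or
cache geometry.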

Thanks,

Chris
P.S. Andreas: Hope you're having fun at SC07... I don't get to go  :(
>
> > It's important to make sure these all align properly, but their definitions
> > do.
>
> ... do not?
>
> > Could somebody please clarify... with an example?
>
> Yes, I constantly wish the terminology were consistent between different tools,
> but sadly there isn't any "proper" terminology out there as far as I've been
> able to see.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Software Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>



