From lucisandor at gmail.com Tue Dec 8 16:48:18 2009 From: lucisandor at gmail.com (=?UTF-8?Q?Lucian_=C8=98andor?=) Date: Tue, 8 Dec 2009 11:48:18 -0500 Subject: botched RAID, now e2fsck or what? Message-ID: Hi all, Somehow I managed to mess with a RAID array containing an ext3 partition. Parenthesis, if it matters: I disconnected physically a drive while the array was online. Next thing, I lost the right order of the drives in the array. While trying to re-create it, I overwrote the raid superblocks. Luckily, the array was RAID5 degraded, so whenever I re-created it, it didn't go into sync; thus, everything besides the RAID superblocks is preserved (or so I think). Now, I am trying to re-create the array in the proper order. It takes me countless attempts, through hundreds of permutations. I am doing it programmatically, but I don't think I have the right tool. Now, after creating the array and mounting it with mount -t ext3 -n -r /dev/md2 /media/olddepot I issue an: e2fsck -n -f /media/olddepot However, I cycled through all the permutations without apparent success. I.e., in all combinations it just refused to check it, saying something about "short read" and, of course, about invalid file systems. Does anybody know a better tool to check whether the mounted partition is a slightly damaged ext3 file system? I am thinking about dumping ext3 superblocks, but I don't know how that works. Thanks. (I am on the latest openSuSE, 11.2, with the latest mdadm available.) From lists at nerdbynature.de Wed Dec 9 03:43:59 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Tue, 8 Dec 2009 19:43:59 -0800 (PST) Subject: botched RAID, now e2fsck or what? In-Reply-To: References: Message-ID: On Tue, 8 Dec 2009 at 11:48, Lucian Șandor wrote: > Now, after creating the array and mounting it with > mount -t ext3 -n -r /dev/md2 /media/olddepot > I issue an: > e2fsck -n -f /media/olddepot Huh? Normally you'd want to run fsck against the block device: $ umount /media/olddepot $ fsck.ext3 -nvf /dev/md2 If this still does not succeed, you could try specifying a different superblock (-b). But the important thing will be to get your raid in the right order, otherwise fsck could do more harm than good. Christian. -- BOFH excuse #25: Decreasing electron flux From sandeen at redhat.com Wed Dec 9 05:09:30 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 08 Dec 2009 23:09:30 -0600 Subject: botched RAID, now e2fsck or what? In-Reply-To: References: Message-ID: <4B1F310A.3070208@redhat.com> Lucian Șandor wrote: > Hi all, > > Somehow I managed to mess with a RAID array containing an ext3 partition. > > Parenthesis, if it matters: I disconnected physically a drive while > the array was online. Next thing, I lost the right order of the drives > in the array. While trying to re-create it, I overwrote the raid > superblocks. Luckily, the array was RAID5 degraded, so whenever I > re-created it, it didn't go into sync; thus, everything besides the > RAID superblocks is preserved (or so I think). > > Now, I am trying to re-create the array in the proper order. It takes > me countless attempts, through hundreds of permutations. I am doing it > programmatically, but I don't think I have the right tool. > Now, after creating the array and mounting it with > mount -t ext3 -n -r /dev/md2 /media/olddepot > I issue an: > e2fsck -n -f /media/olddepot > However, I cycled through all the permutations without apparent > success.
I.e., in all combinations it just refused to check it, saying > something about "short read" and, of course, about invalid file > systems. As Christian pointed out, use the device not the mountpoint for the fsck arg: [tmp]$ mkdir dir [tmp]$ e2fsck -fn dir/ e2fsck 1.41.4 (27-Jan-2009) e2fsck: Attempt to read block from filesystem resulted in short read while trying to open dir/ Could this be a zero-length partition? :) -Eric From lucisandor at gmail.com Thu Dec 10 01:50:47 2009 From: lucisandor at gmail.com (=?UTF-8?Q?Lucian_=C8=98andor?=) Date: Wed, 9 Dec 2009 20:50:47 -0500 Subject: botched RAID, now e2fsck or what? In-Reply-To: <4B1F310A.3070208@redhat.com> References: <4B1F310A.3070208@redhat.com> Message-ID: Hi, Thanks to both of you for the replies. Things are moving now, since I started using e2fsck -n -f -v /dev/md0 However, no combination seems useful. Sometimes I get: "e2fsck: Bad magic number in super-block while trying to open /dev/md0" Other times I get: "Superblock has an invalid journal (inode 8)." Other times I get: "e2fsck: Illegal inode number while checking ext3 journal for /dev/md2." None of these appears in only one permutation, so none is indicative for the correctness of the permutation. I also ran dumpe2fs /dev/md2, but I don't know how to make it more useful than it is now. Right now it finds superblocks in a series of permutations, so again, it is not of much help. Question 1: Is there a way to make dumpe2fs or another command estimate the number of files in what appears to be an ext3 partition? (I would then go by the permutation which finds the largest number of files.) Question 2: if I were to strike lucky and find the right combination, would dumpe2fs give me a very-very long list of superblocks? Do the superblocks extend far into the partition, or do they always stop early (thus showing the same number each time my RAID starts with the right drive)? Question 3: Is there any other tool that would search for files in the remains of an ext3 partition, and, this way, validate or invalidate the permutations I try? Thanks, Lucian Sandor 2009/12/9 Eric Sandeen : > Lucian Șandor wrote: >> Hi all, >> >> Somehow I managed to mess with a RAID array containing an ext3 partition. >> >> Parenthesis, if it matters: I disconnected physically a drive while >> the array was online. Next thing, I lost the right order of the drives >> in the array. While trying to re-create it, I overwrote the raid >> superblocks. Luckily, the array was RAID5 degraded, so whenever I >> re-created it, it didn't go into sync; thus, everything besides the >> RAID superblocks is preserved (or so I think). >> >> Now, I am trying to re-create the array in the proper order. It takes >> me countless attempts, through hundreds of permutations. I am doing it >> programmatically, but I don't think I have the right tool. >> Now, after creating the array and mounting it with >> mount -t ext3 -n -r /dev/md2 /media/olddepot >> I issue an: >> e2fsck -n -f /media/olddepot >> However, I cycled through all the permutations without apparent >> success. I.e., in all combinations it just refused to check it, saying >> something about "short read" and, of course, about invalid file >> systems. > > As Christian pointed out, use the device not the mountpoint for the fsck arg: > > [tmp]$ mkdir dir > [tmp]$ e2fsck -fn dir/ > e2fsck 1.41.4 (27-Jan-2009) > e2fsck: Attempt to read block from filesystem resulted in short read while trying to open dir/ > Could this be a zero-length partition?
> > > ?:) > > -Eric > From lists at nerdbynature.de Thu Dec 10 06:09:51 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Wed, 9 Dec 2009 22:09:51 -0800 (PST) Subject: botched RAID, now e2fsck or what? In-Reply-To: References: <4B1F310A.3070208@redhat.com> Message-ID: On Wed, 9 Dec 2009 at 20:50, Lucian ?andor wrote: > However, no combination seems useful. Sometimes I get: > "e2fsck: Bad magic number in super-block while trying to open /dev/md0" Did you try specifying a different superblock? If you can remember how the filesystem was initially created, you can use: $ mkfs.ext3 -n /dev/md0 (MIND THE -n SWITCH!) to get a list of the backup superblocks, which you can then use with fsck. Don't forget to man mkfs.ext3 :-) > Question 1: Is there a way to make dumpe2fs or another command > estimate the number of files in what appears to be an ext3 partition? I can only think of: $ dumpe2fs -h /dev/loop0 | egrep 'Inode count|Free inodes' The difference between both values should be the used inodes, i.e. files/directories on the filesystem. > Question: if I were to struck lucky and find the right combination, > would dumpe2fs give me a very-very long list of superblocks? The superblock count depends on how the fs was initially created. I could imagine that the list is longer for a real filesystem, as "garbage" won't have any superblocks at all. > superblocks extend far into the partition, or do they always stop Superblocks are usually spread all over the device. > Question 3: Is there any other tool that would search for files in the > remains of an ext3 partition, and, this way, validate or invalidate > the permutations I try? Have a look at: http://ext4.wiki.kernel.org/index.php/Undeletion Christian. -- BOFH excuse #208: Your mail is being routed through Germany ... and they're censoring us. From adilger at sun.com Thu Dec 10 06:54:54 2009 From: adilger at sun.com (Andreas Dilger) Date: Wed, 09 Dec 2009 23:54:54 -0700 Subject: botched RAID, now e2fsck or what? In-Reply-To: References: <4B1F310A.3070208@redhat.com> Message-ID: <2CF687DC-699B-4029-B607-B4376F3B3657@sun.com> On 2009-12-09, at 18:50, Lucian ?andor wrote: > However, no combination seems useful. Sometimes I get: > "e2fsck: Bad magic number in super-block while trying to open /dev/ > md0" > Other times I get: > "Superblock has an invalid journal (inode 8)." > Other times I get: > "e2fsck: Illegal inode number while checking ext3 journal for /dev/ > md2." > None of these appears in only one permutation, so none is indicative > for the corectness of the permutation. You need to know a bit about your RAID layout and the structure of ext*. One thing that is VERY important is whether your new MD config has the same chunk size as it did initially. It will be impossible to recover your config if you don't have the same chunk size. Also, if you haven't disabled RAID resync then it may well be that changing the RAID layout has caused a resync that has permanently corrupted your data. That said, I will assume the primary ext3 superblock will reside on the first disk in the RAID set, since it is located at an offset of 1kB from the start of the device. You should build and run the "findsuper" tool that is in the e2fsprogs source tree. It will scan the raw disk devices and locate the ext3 superblocks. Each superblock contains the group number in which it is stored, so you can find the first RAID disk by looking for the one that has superblock 0 at offset 1kB from the start of the disk. 
There may be other copies of the superblock #0 stored in the journal file, but those should be ignored. The backup superblocks have a non-zero group number, and "findsuper" prints the offset at which that superblock should be located from the start of the LUN. Depending on whether you have a non-power-of-two number of disks in your RAID set, you may find the superblock copies on different disks, and you can do some math to determine which order the disks should be in by computing the relative offset of the superblck within the RAID set. The other thing that can help order the disks (depending on the RAID chunksize and the total number of groups in the filesystem, proportional to the filesystem size) is the group descriptor table. It is located immediately after the superblocks, and contains a very regular list of block numbers for the block and inode bitmaps, and the inode table in each group. Using "od -Ax -tx4" on a regular ext3 filesystem you can see the group descriptor table starting at offset 0x1000, and the block numbers basically just "count" up. This may in fact be the easiest way to order the disks, if the group descriptor table is large enough to cover all of the disks: # od -Ax -tx4 /dev/hda1 | more : : 001000 0000012c 0000012d 0000012e 02430000 001010 000001f2 00000000 00000000 00000000 001020 0000812c 0000812d 0000812e 2e422b21 001030 0000000d 00000000 00000000 00000000 001040 00010000 00010001 00010002 27630074 001050 000000b8 00000000 00000000 00000000 001060 0001812c 0001812d 0001812e 27a70b8a 001070 00000231 00000000 00000000 00000000 001080 00020000 00020001 00020002 2cc10000 001090 00000008 00000000 00000000 00000000 0010a0 0002812c 0002812d 0002812e 25660134 0010b0 00000255 00000000 00000000 00000000 0010c0 00030000 00030001 00030002 17a50003 0010d0 000001c6 00000000 00000000 00000000 0010e0 0003812c 0003812d 0003812e 27a70000 0010f0 00000048 00000000 00000000 00000000 001100 00040000 00040001 00040002 2f8b0000 See nearly regular incrementing sequence every 0x20 bytes: 0000012c, 0000812c, 00010000, 0001812c, 00020000, 0002812c, 00030000, 0003812c Each group descriptor block (4kB = 0x1000) covers 16GB of filesystem space, so 64 blocks per 1TB of filesystem size. If your RAID chunk size is not too large, and the filesystem IS large, you will be able to fully order your disks in the RAID set. You can also verify the RAID chunk size by determining how many blocks of consecutive group descriptors are present before there is a "jump" where the group descriptor blocks were written to other disks before returning to the current disk. Remember that one of the disks in the set will also need to store parity, so there will be some number of "garbage" blocks before the proper data resumes. > I also ran dumpe2fs /dev/md2, but I don't know how to make it more > useful than it is now. Right now it finds supernodes in a series of > permutations, so again, it is not of much help. I would also make sure that you can get the correct ordering and MD chunk size before doing ANY kind of modification to the disks. It would only take a single mistake (e.g. RAID parity rebuild while not in the right order) to totally corrupt the filesystem. > Question 1: Is there a way to make dumpe2fs or another command > estimate the number of files in what appears to be an ext3 partition? > (I would then go by the permutation which fonds the largest number of > files.) > Question: if I were to struck lucky and find the right combination, > would dumpe2fs give me a very-very long list of superblocks? 
Do the > superblocks extend far into the partition, or do they always stop > early (thus showing the same number each time my RAID starts with the > right drive)? > > Question 3: Is there any other tool that would search for files in the > remains of an ext3 partition, and, this way, validate or invalidate > the permutations I try? > > Thanks, > Lucian Sandor > > > 2009/12/9 Eric Sandeen : >> Lucian Șandor wrote: >>> Hi all, >>> >>> Somehow I managed to mess with a RAID array containing an ext3 >>> partition. >>> >>> Parenthesis, if it matters: I disconnected physically a drive while >>> the array was online. Next thing, I lost the right order of the >>> drives >>> in the array. While trying to re-create it, I overwrote the raid >>> superblocks. Luckily, the array was RAID5 degraded, so whenever I >>> re-created it, it didn't go into sync; thus, everything besides the >>> RAID superblocks is preserved (or so I think). >>> >>> Now, I am trying to re-create the array in the proper order. It >>> takes >>> me countless attempts, through hundreds of permutations. I am >>> doing it >>> programmatically, but I don't think I have the right tool. >>> Now, after creating the array and mounting it with >>> mount -t ext3 -n -r /dev/md2 /media/olddepot >>> I issue an: >>> e2fsck -n -f /media/olddepot >>> However, I cycled through all the permutations without apparent >>> success. I.e., in all combinations it just refused to check it, >>> saying >>> something about "short read" and, of course, about invalid file >>> systems. >> >> As Christian pointed out, use the device not the mountpoint for the >> fsck arg: >> >> [tmp]$ mkdir dir >> [tmp]$ e2fsck -fn dir/ >> e2fsck 1.41.4 (27-Jan-2009) >> e2fsck: Attempt to read block from filesystem resulted in short >> read while trying to open dir/ >> Could this be a zero-length partition? >> >> >> :) >> >> -Eric >> > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From tytso at mit.edu Thu Dec 10 13:47:47 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Thu, 10 Dec 2009 08:47:47 -0500 Subject: botched RAID, now e2fsck or what? In-Reply-To: References: Message-ID: <20091210134747.GB4353@thunk.org> On Tue, Dec 08, 2009 at 11:48:18AM -0500, Lucian Șandor wrote: > > Now, I am trying to re-create the array in the proper order. It takes > me countless attempts, through hundreds of permutations. I am doing it > programmatically, but I don't think I have the right tool. Something that may help is to use the findsuper program, in the e2fsprogs sources; it's not built by default, but you can build it by hand. Each of the backup superblocks has a group number in one of the fields, if it was created with a relatively modern mke2fs, so you can use it to get information like this:

byte_offset  byte_start     byte_end  fs_blocks  blksz  grp  last_mount_time           sb_uuid   label
       1024           0  95999229952   23437312   4096    0  Thu Dec 10 00:24:39 2009  fd5210bd
  134217728           0  95999229952   23437312   4096    1  Wed Dec 31 19:00:00 1969  fd5210bd
  402653184           0  95999229952   23437312   4096    3  Wed Dec 31 19:00:00 1969  fd5210bd
  671088640           0  95999229952   23437312   4096    5  Wed Dec 31 19:00:00 1969  fd5210bd

The group number information should help you determine the order of the disks in the raid array. Good luck!
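(As an illustration only -- a sketch of how such a scan might be scripted; the device names /dev/sdb through /dev/sdg are placeholders for the raw component drives, not names taken from this thread, and findsuper is assumed to have been built by hand from the e2fsprogs tree:

  # Scan each raw member disk, not the assembled /dev/md* device.
  # The disk that reports a group-0 superblock at byte offset 1024
  # is the first member of the set.
  for dev in /dev/sd[b-g]; do
      echo "=== $dev ==="
      ./findsuper "$dev" | head -n 20    # stop after the first few hits
  done

The group numbers and byte offsets reported for the backup superblocks on the other disks can then be compared with where those superblocks ought to land, which constrains the rest of the ordering.)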
- Ted From lucisandor at gmail.com Thu Dec 10 20:30:19 2009 From: lucisandor at gmail.com (=?UTF-8?Q?Lucian_=C8=98andor?=) Date: Thu, 10 Dec 2009 15:30:19 -0500 Subject: botched RAID, now e2fsck or what? In-Reply-To: <20091210134747.GB4353@thunk.org> References: <20091210134747.GB4353@thunk.org> Message-ID: Thank you all for your kind replies. One extra thought and question: would it help if I had some large file that is also on the array? Could I search for a part of the file on the individual drives, or at least on permuted arrays? I tried findsuper, but it keeps finding the same backup superblocks, no matter how I switch the order of the disks except for the first one. It might be possible, I think, that the superblocks fall on the same disk. That is only a general impression, so I am running it thoroughly on a smaller array, to make sure. Another issue with this approach is that it takes a lot of time: I have a 4.5 TB array with 720 permutations to try. This sounds more like a job for a few years. 2009/12/10 : > Something that may help is to use the findsuper program, in the > e2fsprogs sources; it's not built by default, but you can build it by > hand. > The group number information should help you determine the order of the > disks in the raid array. Same issue if I use the used inode count: the permutations yield the same numbers over and over again. I think dumpe2fs -h doesn't go into the actual drive, but only reads the descriptions in the beginning, and these fall on the same drive... 2009/12/10 Christian Kujau : > On Wed, 9 Dec 2009 at 20:50, Lucian Șandor wrote: >> Question 1: Is there a way to make dumpe2fs or another command >> estimate the number of files in what appears to be an ext3 partition? > > I can only think of: > $ dumpe2fs -h /dev/loop0 | egrep 'Inode count|Free inodes' > The difference between both values should be the used inodes, i.e. > files/directories on the filesystem. 2009/12/10 Andreas Dilger : > On 2009-12-09, at 18:50, Lucian Șandor wrote: >> >> However, no combination seems useful. Sometimes I get: >> "e2fsck: Bad magic number in super-block while trying to open /dev/md0" >> Other times I get: >> "Superblock has an invalid journal (inode 8)." >> Other times I get: >> "e2fsck: Illegal inode number while checking ext3 journal for /dev/md2." >> None of these appears in only one permutation, so none is indicative >> for the correctness of the permutation. > > You need to know a bit about your RAID layout and the structure of ext*. > One thing that is VERY important is whether your new MD config has the same > chunk size as it did initially. It will be impossible to recover your > config if you don't have the same chunk size. > > Also, if you haven't disabled RAID resync then it may well be that changing > the RAID layout has caused a resync that has permanently corrupted your > data. I have the chunk size for one of the arrays. I thought that mdadm would automatically use the same values it used when it first created the arrays, but guess what, it did not. Now I have another headache for the other array. The arrays were degraded at the time of the whole mess, and I always re-created them as degraded. I wonder how long I can still pull this off, after being so messy in the first place. > That said, I will assume the primary ext3 superblock will reside on the > first disk in the RAID set, since it is located at an offset of 1kB from the > start of the device. > > You should build and run the "findsuper" tool that is in the e2fsprogs > source tree.
It will scan the raw disk devices and locate the ext3 > superblocks. Each superblock contains the group number in which it is > stored, so you can find the first RAID disk by looking for the one that has > superblock 0 at offset 1kB from the start of the disk. > > There may be other copies of the superblock #0 stored in the journal file, > but those should be ignored. > > The backup superblocks have a non-zero group number, and "findsuper" prints > the offset at which that superblock should be located from the start of the > LUN. Depending on whether you have a non-power-of-two number of disks in > your RAID set, you may find the superblock copies on different disks, and > you can do some math to determine which order the disks should be in by > computing the relative offset of the superblck within the RAID set. > > > The other thing that can help order the disks (depending on the RAID > chunksize and the total number of groups in the filesystem, proportional to > the filesystem size) is the group descriptor table. It is located > immediately after the superblocks, and contains a very regular list of block > numbers for the block and inode bitmaps, and the inode table in each group. > > Using "od -Ax -tx4" on a regular ext3 filesystem you can see the group > descriptor table starting at offset 0x1000, and the block numbers basically > just "count" up. This may in fact be the easiest way to order the disks, if > the group descriptor table is large enough to cover all of the disks: > > # od -Ax -tx4 /dev/hda1 | more > : > : > 001000 0000012c 0000012d 0000012e 02430000 > 001010 000001f2 00000000 00000000 00000000 > 001020 0000812c 0000812d 0000812e 2e422b21 > 001030 0000000d 00000000 00000000 00000000 > 001040 00010000 00010001 00010002 27630074 > 001050 000000b8 00000000 00000000 00000000 > 001060 0001812c 0001812d 0001812e 27a70b8a > 001070 00000231 00000000 00000000 00000000 > 001080 00020000 00020001 00020002 2cc10000 > 001090 00000008 00000000 00000000 00000000 > 0010a0 0002812c 0002812d 0002812e 25660134 > 0010b0 00000255 00000000 00000000 00000000 > 0010c0 00030000 00030001 00030002 17a50003 > 0010d0 000001c6 00000000 00000000 00000000 > 0010e0 0003812c 0003812d 0003812e 27a70000 > 0010f0 00000048 00000000 00000000 00000000 > 001100 00040000 00040001 00040002 2f8b0000 > > See nearly regular incrementing sequence every 0x20 bytes: > > 0000012c, 0000812c, 00010000, 0001812c, 00020000, 0002812c, 00030000, > 0003812c > > > Each group descriptor block (4kB = 0x1000) covers 16GB of filesystem space, > so 64 blocks per 1TB of filesystem size. If your RAID chunk size is not > too large, and the filesystem IS large, you will be able to fully order your > disks in the RAID set. You can also verify the RAID chunk size by > determining how many blocks of consecutive group descriptors are present > before there is a "jump" where the group descriptor blocks were written to > other disks before returning to the current disk. Remember that one of the > disks in the set will also need to store parity, so there will be some > number of "garbage" blocks before the proper data resumes. > This seems a great idea. The 4.5 TB array is huge (should have a 1100 kB table), and likely its group descriptor table extends on all partitions. I already found the pattern, but the job requires programming, since it would be troubling to read megs of data over the hundreds of permutations. I will try coding it, but I hope that somebody else wrote it before. 
Isn't there any utility that will take a group descriptor table and verify its integrity without modifying it? >> I also ran dumpe2fs /dev/md2, but I don't know how to make it more >> useful than it is now. Right now it finds supernodes in a series of >> permutations, so again, it is not of much help. > > I would also make sure that you can get the correct ordering and MD chunk > size before doing ANY kind of modification to the disks. It would only take > a single mistake (e.g. RAID parity rebuild while not in the right order) to > totally corrupt the filesystem. > >> Question 1: Is there a way to make dumpe2fs or another command >> estimate the number of files in what appears to be an ext3 partition? >> (I would then go by the permutation which fonds the largest number of >> files.) >> Question: if I were to struck lucky and find the right combination, >> would dumpe2fs give me a very-very long list of superblocks? Do the >> superblocks extend far into the partition, or do they always stop >> early (thus showing the same number each time my RAID starts with the >> right drive)? >> >> Question 3: Is there any other tool that would search for files in the >> remains of an ext3 partition, and, this way, validate or invalidate >> the permutations I try? >> >> Thanks, >> Lucian Sandor >> >> >> 2009/12/9 Eric Sandeen : >>> >>> Lucian ?andor wrote: >>>> >>>> Hi all, >>>> >>>> Somehow I managed to mess with a RAID array containing an ext3 >>>> partition. >>>> >>>> Parenthesis, if it matters: I disconnected physically a drive while >>>> the array was online. Next thing, I lost the right order of the drives >>>> in the array. While trying to re-create it, I overwrote the raid >>>> superblocks. Luckily, the array was RAID5 degraded, so whenever I >>>> re-created it, it didn't go into sync; thus, everything besides the >>>> RAID superblocks is preserved (or so I think). >>>> >>>> Now, I am trying to re-create the array in the proper order. It takes >>>> me countless attempts, through hundreds of permutations. I am doing it >>>> programatically, but I don't think I have the right tool. >>>> Now, after creating the array and mounting it with >>>> mount -t ext3 -n -r /dev/md2 /media/olddepot >>>> I issue an: >>>> e2fsck -n -f /media/olddepot >>>> However, I cycled through all the permutations without apparent >>>> success. I.e., in all combinations it just refused to check it, saying >>>> something about "short read" and, of course, about invalid file >>>> systems. >>> >>> As Christian pointed out, use the device not the mountpoint for the fsck >>> arg: >>> >>> [tmp]$ mkdir dir >>> [tmp]$ e2fsck -fn dir/ >>> e2fsck 1.41.4 (27-Jan-2009) >>> e2fsck: Attempt to read block from filesystem resulted in short read >>> while trying to open dir/ >>> Could this be a zero-length partition? >>> >>> >>> :) >>> >>> -Eric >>> >> > > Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc. > > From adilger at sun.com Thu Dec 10 20:41:32 2009 From: adilger at sun.com (Andreas Dilger) Date: Thu, 10 Dec 2009 13:41:32 -0700 Subject: botched RAID, now e2fsck or what? In-Reply-To: References: <20091210134747.GB4353@thunk.org> Message-ID: <6A887537-22B1-4227-BE6E-4E8C97CE272B@sun.com> On 2009-12-10, at 13:30, Lucian ?andor wrote: > 2009/12/10 Andreas Dilger : >> >> Using "od -Ax -tx4" on a regular ext3 filesystem you can see the >> group descriptor table starting at offset 0x1000, and the block >> numbers basically just "count" up. 
This may in fact be the easiest >> way to order the disks, if the group descriptor table is large >> enough to cover all of the disks: >> >> # od -Ax -tx4 /dev/hda1 | more >> : >> : >> 001000 0000012c 0000012d 0000012e 02430000 >> 001010 000001f2 00000000 00000000 00000000 >> 001020 0000812c 0000812d 0000812e 2e422b21 >> 001030 0000000d 00000000 00000000 00000000 >> 001040 00010000 00010001 00010002 27630074 >> 001050 000000b8 00000000 00000000 00000000 >> 001060 0001812c 0001812d 0001812e 27a70b8a >> 001070 00000231 00000000 00000000 00000000 >> 001080 00020000 00020001 00020002 2cc10000 >> 001090 00000008 00000000 00000000 00000000 >> 0010a0 0002812c 0002812d 0002812e 25660134 >> 0010b0 00000255 00000000 00000000 00000000 >> 0010c0 00030000 00030001 00030002 17a50003 >> 0010d0 000001c6 00000000 00000000 00000000 >> 0010e0 0003812c 0003812d 0003812e 27a70000 >> 0010f0 00000048 00000000 00000000 00000000 >> 001100 00040000 00040001 00040002 2f8b0000 >> >> See nearly regular incrementing sequence every 0x20 bytes: >> >> 0000012c, 0000812c, 00010000, 0001812c, 00020000, 0002812c, 00030000, >> 0003812c >> >> >> Each group descriptor block (4kB = 0x1000) covers 16GB of >> filesystem space, so 64 blocks per 1TB of filesystem size. If >> your RAID chunk size is not too large, and the filesystem IS large, >> you will be able to fully order your disks in the RAID set. You >> can also verify the RAID chunk size by determining how many blocks >> of consecutive group descriptors are present before there is a >> "jump" where the group descriptor blocks were written to other >> disks before returning to the current disk. Remember that one of >> the disks in the set will also need to store parity, so there will >> be some number of "garbage" blocks before the proper data resumes. > > This seems a great idea. The 4.5 TB array is huge (should have a 1100 > kB table), and likely its group descriptor table extends on all > partitions. I already found the pattern, but the job requires > programming, since it would be troubling to read megs of data over the > hundreds of permutations. I will try coding it, but I hope that > somebody else wrote it before. Isn't there any utility that will take > a group descriptor table and verify its integrity without modifying > it? I think you are going about this incorrectly... Run the "od" command on the raw component drives (e.g. /dev/sda, /dev/sdb, /dev/sdc, etc), not on the assembled MD RAID array (e.g. NOT /dev/md0). The data blocks on the raw devices will be correct, with every 1/N chunks of space being used for parity information (so will look like garbage). That won't prevent you from seeing the data in the group descriptor table and allowing you to see the order in which the disks are supposed to be AND the chunk size. Since the group descriptor table is only a few kB from the start of the disk (I'm assuming you used whole-disk devices for the MD array, instead of DOS partitions) you can just use "od ... | less" and your eyes to see what is there. No programming needed. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From lucisandor at gmail.com Fri Dec 11 19:33:01 2009 From: lucisandor at gmail.com (=?UTF-8?Q?Lucian_=C8=98andor?=) Date: Fri, 11 Dec 2009 14:33:01 -0500 Subject: botched RAID, now e2fsck or what? In-Reply-To: <6A887537-22B1-4227-BE6E-4E8C97CE272B@sun.com> References: <20091210134747.GB4353@thunk.org> <6A887537-22B1-4227-BE6E-4E8C97CE272B@sun.com> Message-ID: Hi, Thanks for your idea. 
It worked great in the first step. One other thing: immediately after the first table, there is a second one. Using both tables, I was able to tell the parity position. For me, with 6 drives. the tables fell into an annoying pattern of complementation, such as that four of them will always give 0000 0000 0000 and the other two drives had identical chunks. I am still no better because I don't know how to assemble it. Should I create it as 1 2 3 4 5 P, or maybe as P 1 2 3 4 5?. But that is something I might find trying a few combinations and looking at the way the beginning of /dev/md0 is assembled. One issue is that no matter how I will mix them, I have an extra drive that I need to keep out. (The array was degraded for a few days before the drive mix, and the failing drive is in the computer, now mixed up with the others.) I can try assemble the array with any of the six drives as missing, but I don't see a difference in the beginning of /dev/md0, that part being written back in the times when the array was running, and I get the same errors from e2fsck (complaining about journal invalidity). Findsuper finds the same superblocks, e2fsck find the same inodes :( There should be a way of telling whether one of the 6 left permutations makes a better combination. As I said, I even have files that are also on the array. Any other thoughts? Best, Lucian Sandor 2009/12/10 Andreas Dilger : > On 2009-12-10, at 13:30, Lucian ?andor wrote: >> >> 2009/12/10 Andreas Dilger : >>> >>> Using "od -Ax -tx4" on a regular ext3 filesystem you can see the group >>> descriptor table starting at offset 0x1000, and the block numbers basically >>> just "count" up. ?This may in fact be the easiest way to order the disks, if >>> the group descriptor table is large enough to cover all of the disks: >>> >>> # od -Ax -tx4 /dev/hda1 | more >>> : >>> : >>> 001000 0000012c 0000012d 0000012e 02430000 >>> 001010 000001f2 00000000 00000000 00000000 >>> 001020 0000812c 0000812d 0000812e 2e422b21 >>> 001030 0000000d 00000000 00000000 00000000 >>> 001040 00010000 00010001 00010002 27630074 >>> 001050 000000b8 00000000 00000000 00000000 >>> 001060 0001812c 0001812d 0001812e 27a70b8a >>> 001070 00000231 00000000 00000000 00000000 >>> 001080 00020000 00020001 00020002 2cc10000 >>> 001090 00000008 00000000 00000000 00000000 >>> 0010a0 0002812c 0002812d 0002812e 25660134 >>> 0010b0 00000255 00000000 00000000 00000000 >>> 0010c0 00030000 00030001 00030002 17a50003 >>> 0010d0 000001c6 00000000 00000000 00000000 >>> 0010e0 0003812c 0003812d 0003812e 27a70000 >>> 0010f0 00000048 00000000 00000000 00000000 >>> 001100 00040000 00040001 00040002 2f8b0000 >>> >>> See nearly regular incrementing sequence every 0x20 bytes: >>> >>> 0000012c, 0000812c, 00010000, 0001812c, 00020000, 0002812c, 00030000, >>> 0003812c >>> >>> >>> Each group descriptor block (4kB = 0x1000) covers 16GB of filesystem >>> space, so ?64 blocks per 1TB of filesystem size. ?If your RAID chunk size is >>> not too large, and the filesystem IS large, you will be able to fully order >>> your disks in the RAID set. ?You can also verify the RAID chunk size by >>> determining how many blocks of consecutive group descriptors are present >>> before there is a "jump" where the group descriptor blocks were written to >>> other disks before returning to the current disk. ?Remember that one of the >>> disks in the set will also need to store parity, so there will be some >>> number of "garbage" blocks before the proper data resumes. >> >> This seems a great idea. 
The 4.5 TB array is huge (should have a 1100 >> kB table), and likely its group descriptor table extends on all >> partitions. I already found the pattern, but the job requires >> programming, since it would be troubling to read megs of data over the >> hundreds of permutations. I will try coding it, but I hope that >> somebody else wrote it before. Isn't there any utility that will take >> a group descriptor table and verify its integrity without modifying >> it? > > I think you are going about this incorrectly... Run the "od" command on the > raw component drives (e.g. /dev/sda, /dev/sdb, /dev/sdc, etc), not on the > assembled MD RAID array (e.g. NOT /dev/md0). > > The data blocks on the raw devices will be correct, with every 1/N chunks of > space being used for parity information (so will look like garbage). That > won't prevent you from seeing the data in the group descriptor table and > allowing you to see the order in which the disks are supposed to be AND the > chunk size. > > Since the group descriptor table is only a few kB from the start of the disk > (I'm assuming you used whole-disk devices for the MD array, instead of DOS > partitions) you can just use "od ... | less" and your eyes to see what is > there. No programming needed. > > Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc. > > From oehme.markus at gmx.de Sun Dec 20 09:35:01 2009 From: oehme.markus at gmx.de (Markus Oehme) Date: Sun, 20 Dec 2009 10:35:01 +0100 Subject: ext3-fs error (bad entry in directory) Message-ID: <87tyvmrnsq.wl%oehme.markus@gmx.de> Hello everybody, we have a bit of a strange problem with our ext3 partition here. We have a lot of the following errors occurring in dmesg: Dec 16 02:09:06 hestia kernel: [1594272.845672] EXT3-fs error (device dm-2): ext3_readdir: bad entry in directory #17083: rec_len % 4 != 0 - offset=0, inode=76, rec_len=5121, name_len=2 Mount options are /dev/mapper/hestia-home on /srv/samba/homes type ext3 (rw,noexec,nosuid,nodev,noatime,usrquota,grpquota) We already did a complete test of the hard drives and they seem to be fine. e2fsck also doesn't fix the problem. Currently we suspect the controller, but that's pure speculation and otherwise the machine is running quietly. Sometimes the corresponding partition is automatically remounted read-only, which is quite a hassle. The message is always exactly the same. Does somebody have a clue as to what is going on here? And an easier question: How do I find out which directory is #17083? Markus Oehme PS: Please Cc me, since I'm not on the list. -- Aoccdrnig to a threoy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer are in the rghit pclae. The rset can be a taotl mses and you can sitll raed it in msot csaes. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. And I awlyas thought slpeling was ipmorantt. From sandeen at redhat.com Tue Dec 22 17:15:03 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 22 Dec 2009 11:15:03 -0600 Subject: ext3-fs error (bad entry in directory) In-Reply-To: <87tyvmrnsq.wl%oehme.markus@gmx.de> References: <87tyvmrnsq.wl%oehme.markus@gmx.de> Message-ID: <4B30FE97.8050502@redhat.com> Markus Oehme wrote: > Hello everybody, > > we have a bit of a strange problem with our ext3 partition here.
We have a > lot of the following errors occuring in dmesg: > > Dec 16 02:09:06 hestia kernel: [1594272.845672] EXT3-fs error (device dm-2): ext3_readdir: bad entry in directory #17083: rec_len % 4 != 0 - offset=0, inode=76, rec_len=5121, name_len=2 You didn't mention what kernel you were using; there was one significant fix in this area a while back, http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ef2b02d3e617cb0400eedf2668f86215e1b0e6af > Mount options are > > /dev/mapper/hestia-home on /srv/samba/homes type ext3 (rw,noexec,nosuid,nodev,noatime,usrquota,grpquota) > > We already did a complete test to the hard drives and they seem to be > fine. e2fsck also doesn't fix the problem. Currently we suspect the > controller, but that's pure speculation and otherwise the machine is running > quitely. does e2fsck -find- the problem? what version of e2fsprogs did you use? If it checks clean then maybe it is a controller or memory error... > Sometimes the corresponding partitions is automatically remounted read-only, > which is quite a hazzle. > > The message is always exactly the same. Somebody have a clue as to what is > going on here? And an easier question: How do I find out which directory is > #17083? you can use debugfs: debugfs: ncheck ncheck: Usage: ncheck ... debugfs: ncheck 6031 Inode Pathname 6031 //testfilename -Eric > Markus Oehme > > PS: Please Cc me, since I'm not on the list. From lists at nerdbynature.de Thu Dec 24 10:31:10 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 24 Dec 2009 02:31:10 -0800 (PST) Subject: benchmark results Message-ID: I've had the chance to use a testsystem here and couldn't resist running a few benchmark programs on them: bonnie++, tiobench, dbench and a few generic ones (cp/rm/tar/etc...) on ext{234}, btrfs, jfs, ufs, xfs, zfs. All with standard mkfs/mount options and +noatime for all of them. Here are the results, no graphs - sorry: http://nerdbynature.de/benchmarks/v40z/2009-12-22/ Reiserfs is locking up during dbench, so I removed it from the config, here are some earlier results: http://nerdbynature.de/benchmarks/v40z/2009-12-21/bonnie.html Bonnie++ couldn't complete on nilfs2, only the generic tests and tiobench were run. As nilfs2, ufs, zfs aren't supporting xattr, dbench could not be run on these filesystems. Short summary, AFAICT: - btrfs, ext4 are the overall winners - xfs to, but creating/deleting many files was *very* slow - if you need only fast but no cool features or journaling, ext2 is still a good choice :) Thanks, Christian. -- BOFH excuse #84: Someone is standing on the ethernet cable, causing a kink in the cable From tytso at mit.edu Thu Dec 24 21:27:56 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Thu, 24 Dec 2009 16:27:56 -0500 Subject: [Jfs-discussion] benchmark results In-Reply-To: <19251.26403.762180.228181@tree.ty.sabi.co.uk> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> Message-ID: <20091224212756.GM21594@thunk.org> On Thu, Dec 24, 2009 at 01:05:39PM +0000, Peter Grandi wrote: > > I've had the chance to use a testsystem here and couldn't > > resist > > Unfortunately there seems to be an overproduction of rather > meaningless file system "benchmarks"... 
One of the problems is that very few people are interested in writing or maintaining file system benchmarks, except for file system developers --- but many of them are more interested in developing (and unfortunately, in some cases, promoting) their file systems than they are in doing a good job maintaining a good set of benchmarks. Sad but true... > * In the "generic" test the 'tar' test bandwidth is exactly the > same ("276.68 MB/s") for nearly all filesystems. > > * There are read transfer rates higher than the one reported by > 'hdparm' which is "66.23 MB/sec" (comically enough *all* the > read transfer rates your "benchmarks" report are higher). If you don't do a "sync" after the tar, then in most cases you will be measuring the memory bandwidth, because data won't have been written to disk. Worse yet, it tends to skew the results of the what happens afterwards (*especially* if you aren't running the steps of the benchmark in a script). > BTW the use of Bonnie++ is also usually a symptom of a poor > misunderstanding of file system benchmarking. Dbench is also a really nasty benchmark. If it's tuned correctly, you are measuring memory bandwidth and the hard drive light will never go on. :-) The main reason why it was interesting was that it and tbench was used to model a really bad industry benchmark, netbench, which at one point a number of years ago I/T managers used to decide which CIFS server they would buy[1]. So it was useful for Samba developers who were trying to do competitive benchmkars, but it's not a very accurate benchmark for measuring real-life file system workloads. [1] http://samba.org/ftp/tridge/dbench/README > On the plus side, test setup context is provided in the "env" > directory, which is rare enough to be commendable. Absolutely. :-) Another good example of well done file system benchmarks can be found at http://btrfs.boxacle.net; it's done by someone who does performance benchmarks for a living. Note that JFS and XFS come off much better on a number of the tests --- and that there is a *large* number amount of variation when you look at different simulated workloads and with a varying number of threads writing to the file system at the same time. Regards, - Ted From lists at nerdbynature.de Fri Dec 25 01:52:34 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 24 Dec 2009 17:52:34 -0800 (PST) Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091224212756.GM21594@thunk.org> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> Message-ID: On Thu, 24 Dec 2009 at 16:27, tytso at mit.edu wrote: > If you don't do a "sync" after the tar, then in most cases you will be > measuring the memory bandwidth, because data won't have been written Well, I do "sync" after each operation, so the data should be on disk, but that doesn't mean it'll clear the filesystem buffers - but this doesn't happen that often in the real world too. Also, all filesystem were tested equally (I hope), yet some filesystem perform better than another - even if all the content copied/tar'ed/removed would perfectly well fit into the machines RAM. > Another good example of well done file system benchmarks can be found > at http://btrfs.boxacle.net Thanks, I'll have a look at it and perhaps even integrate it in the wrapper script. > benchmarks for a living. 
Note that JFS and XFS come off much better > on a number of the tests Indeed, I was surpised to see JFS perform that good and XFS of course is one of the best too - I just wanted to point out that both of them are strangely slow at times (removing or creating many files) - not what I expected. > --- and that there is a *large* number amount > of variation when you look at different simulated workloads and with a > varying number of threads writing to the file system at the same time. True, the TODO list in the script ("different benchmark options") is in there for a reason :-) Christian. -- BOFH excuse #291: Due to the CDA, we no longer have a root account. From tytso at mit.edu Fri Dec 25 16:11:46 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Fri, 25 Dec 2009 11:11:46 -0500 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091224234631.GA1028@ioremap.net> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091224234631.GA1028@ioremap.net> Message-ID: <20091225161146.GC32757@thunk.org> On Fri, Dec 25, 2009 at 02:46:31AM +0300, Evgeniy Polyakov wrote: > > [1] http://samba.org/ftp/tridge/dbench/README > > Was not able to resist to write a small notice, what no matter what, but > whatever benchmark is running, it _does_ show system behaviour in one > or another condition. And when system behaves rather badly, it is quite > a common comment, that benchmark was useless. But it did show that > system has a problem, even if rarely triggered one :) If people are using benchmarks to improve file system, and a benchmark shows a problem, then trying to remedy the performance issue is a good thing to do, of course. Sometimes, though the case which is demonstrated by a poor benchmark is an extremely rare corner case that doesn't accurately reflect common real-life workloads --- and if addressing it results in a tradeoff which degrades much more common real-life situations, then that would be a bad thing. In situations where benchmarks are used competitively, it's rare that it's actually a *problem*. Instead it's much more common that a developer is trying to prove that their file system is *better* to gullible users who think that a single one-dimentional number is enough for them to chose file system X over file system Y. For example, if I wanted to play that game and tell people that ext4 is better, I'd might pick this graph: http://btrfs.boxacle.net/repository/single-disk/2.6.29-rc2/2.6.29-rc2/2.6.29-rc2_Mail_server_simulation._num_threads=32.html On the other hand, this one shows ext4 as the worst compared to all other file systems: http://btrfs.boxacle.net/repository/single-disk/2.6.29-rc2/2.6.29-rc2/2.6.29-rc2_Large_file_random_writes_odirect._num_threads=8.html Benchmarking, like statistics, can be extremely deceptive, and if people do things like carefully order a tar file so the files are optimal for a file system, it's fair to ask whether that's a common thing for people to be doing (either unpacking tarballs or unpacking tarballs whose files have been carefully ordered for a particular file systems). When it's the only number used by a file system developer when trying to convince users they should use their file system, at least in my humble opinion it becomes murderously dishonest. 
- Ted From tytso at mit.edu Fri Dec 25 16:14:53 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Fri, 25 Dec 2009 11:14:53 -0500 Subject: [Jfs-discussion] benchmark results In-Reply-To: References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> Message-ID: <20091225161453.GD32757@thunk.org> On Thu, Dec 24, 2009 at 05:52:34PM -0800, Christian Kujau wrote: > > Well, I do "sync" after each operation, so the data should be on disk, but > that doesn't mean it'll clear the filesystem buffers - but this doesn't > happen that often in the real world too. Also, all filesystem were tested > equally (I hope), yet some filesystem perform better than another - even > if all the content copied/tar'ed/removed would perfectly well fit into the > machines RAM. Did you include the "sync" in part of what you timed? Peter was quite right --- the fact that the measured bandwidth in your "cp" test is five times faster than the disk bandwidth as measured by hdparm, and many file systems had exactly the same bandwidth, makes me very suspicious that what was being measured was primarily memory bandwidth --- and not very useful when trying to measure file system performance. - Ted From lists at nerdbynature.de Fri Dec 25 18:42:30 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 25 Dec 2009 10:42:30 -0800 (PST) Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091225161453.GD32757@thunk.org> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> Message-ID: On Fri, 25 Dec 2009 at 11:14, tytso at mit.edu wrote: > Did you include the "sync" in part of what you timed? In my "generic" tests[0] I do "sync" after each of the cp/tar/rm operations. > Peter was quite > right --- the fact that the measured bandwidth in your "cp" test is > five times faster than the disk bandwidth as measured by hdparm, and > many file systems had exactly the same bandwidth, makes me very > suspicious that what was being measured was primarily memory bandwidth That's right, and that's what I replied to Peter on jfs-discussion[1]: >> * In the "generic" test the 'tar' test bandwidth is exactly the >> same ("276.68 MB/s") for nearly all filesystems. True, because I'm tarring up ~2.7GB of content while the box is equipped with 8GB of RAM. So it *should* be the same for all filesystems, as Linux could easily hold all this in its caches. Still, jfs and zfs manage to be slower than the rest. > --- and not very useful when trying to measure file system > performance. For the bonnie++ tests I chose an explicit filesize of 16GB, two times the size of the machine's RAM, to make sure it tests the *disk's* performance. And to be consistent across one benchmark run, I should have copied/tarred/removed 16GB as well. However, I figured not to do that - but to *use* the filesystem buffers instead of ignoring them. After all, it's not about disk performance (that's what hdparm could be for) but filesystem performance (or comparison, more exactly) - and I'm not excited about the fact that almost all filesystems are copying with ~276MB/s but I'm wondering why zfs is 13 times slower when copying data or xfs takes 200 seconds longer than other filesystems, while it's handling the same size as all the others. So no, please don't compare the bonnie++ results against my "generic" results within these results - as they're (obviously, I thought) taken with different parameters/content sizes. Christian.
[0] http://nerdbynature.de/benchmarks/v40z/2009-12-22/env/fs-bench.sh.txt [1] http://tinyurl.com/yz6x2sj -- BOFH excuse #85: Windows 95 undocumented "feature" From markryde at gmail.com Tue Dec 29 14:32:32 2009 From: markryde at gmail.com (Mark Ryden) Date: Tue, 29 Dec 2009 16:32:32 +0200 Subject: Can cp copy files which are greater than 4GB ? Message-ID: Hello, Can "cp" under ext3 partition copy files which are greater than 4GB ? (the source file and the destination are on the same ext3 partition, and there is of course enough space) Is there a limit on the size of a file which can be copied thus ? "man cp" does not refer to this issue. Rgs, Mark From tytso at mit.edu Tue Dec 29 16:12:35 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Tue, 29 Dec 2009 11:12:35 -0500 Subject: Can cp copy files which are greater than 4GB ? In-Reply-To: References: Message-ID: <20091229161235.GG4429@thunk.org> On Tue, Dec 29, 2009 at 04:32:32PM +0200, Mark Ryden wrote: > Hello, > Can "cp" under ext3 partition copy files which are greater than 4GB ? > (the source file and the destination are on the same ext3 partition, and there > is of course enough space) > Is there a limit on the size of a file which can be copied thus ? It should, as long as /bin/cp is compiled with O_LARGEFILE support. - Ted From markryde at gmail.com Tue Dec 29 18:17:16 2009 From: markryde at gmail.com (Mark Ryden) Date: Tue, 29 Dec 2009 20:17:16 +0200 Subject: Can cp copy files which are greater than 4GB ? In-Reply-To: <20091229161235.GG4429@thunk.org> References: <20091229161235.GG4429@thunk.org> Message-ID: Well, In the meantime I tried it and it failed; probably the coreutils I am using was built without this flag. (coreutils-7.2-4.fc11.i586). I will try to build coreutils with this flag and then perform that cp. Rgs, Mark On Tue, Dec 29, 2009 at 6:12 PM, wrote: > On Tue, Dec 29, 2009 at 04:32:32PM +0200, Mark Ryden wrote: >> Hello, >> Can "cp" under ext3 partition copy files which are greater than 4GB ? >> (the source file and the destination are on the same ext3 partition, and there >> is of course enough space) >> Is there a limit on the size of a file which can be copied thus ? > > It should, as long as /bin/cp is compiled with O_LARGEFILE support. > > - Ted > From sandeen at redhat.com Tue Dec 29 18:49:35 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 29 Dec 2009 12:49:35 -0600 Subject: Can cp copy files which are greater than 4GB ? In-Reply-To: References: <20091229161235.GG4429@thunk.org> Message-ID: <4B3A4F3F.6020702@redhat.com> Mark Ryden wrote: > Well, > In the meantime I tried it and it failed; probably > the coreutils I am using was built without this > flag. (coreutils-7.2-4.fc11.i586). > > I will try to build coreutils with this flag and then perform that cp. I'd be very surprised if it's not built that way already. Can you strace the failing cp and see how/why it's failing? -Eric > Rgs, > Mark > > On Tue, Dec 29, 2009 at 6:12 PM, wrote: >> On Tue, Dec 29, 2009 at 04:32:32PM +0200, Mark Ryden wrote: >>> Hello, >>> Can "cp" under ext3 partition copy files which are greater than 4GB ? >>> (the source file and the destination are on the same ext3 partition, and there >>> is of course enough space) >>> Is there a limit on the size of a file which can be copied thus ? >> It should, as long as /bin/cp is compiled with O_LARGEFILE support.
>> >> - Ted >> > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users From zbr at ioremap.net Thu Dec 24 23:46:31 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Fri, 25 Dec 2009 02:46:31 +0300 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091224212756.GM21594@thunk.org> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> Message-ID: <20091224234631.GA1028@ioremap.net> Hi Ted. On Thu, Dec 24, 2009 at 04:27:56PM -0500, tytso at mit.edu (tytso at mit.edu) wrote: > > Unfortunately there seems to be an overproduction of rather > > meaningless file system "benchmarks"... > > One of the problems is that very few people are interested in writing > or maintaining file system benchmarks, except for file system > developers --- but many of them are more interested in developing (and > unfortunately, in some cases, promoting) their file systems than they > are in doing a good job maintaining a good set of benchmarks. Sad but > true... Hmmmm.... I suppose here should be a link to such set? :) No link? Than I suppose benchmark results are pretty much in sync with what they are supposed to show. > > * In the "generic" test the 'tar' test bandwidth is exactly the > > same ("276.68 MB/s") for nearly all filesystems. > > > > * There are read transfer rates higher than the one reported by > > 'hdparm' which is "66.23 MB/sec" (comically enough *all* the > > read transfer rates your "benchmarks" report are higher). > > If you don't do a "sync" after the tar, then in most cases you will be > measuring the memory bandwidth, because data won't have been written > to disk. Worse yet, it tends to skew the results of the what happens > afterwards (*especially* if you aren't running the steps of the > benchmark in a script). It depends on the size of untarred object, for linux kernel tarball and common several gigs of RAM it is very valid not to run a sync after the tar, since writeback will take care about it. > > BTW the use of Bonnie++ is also usually a symptom of a poor > > misunderstanding of file system benchmarking. > > Dbench is also a really nasty benchmark. If it's tuned correctly, you > are measuring memory bandwidth and the hard drive light will never go > on. :-) The main reason why it was interesting was that it and tbench > was used to model a really bad industry benchmark, netbench, which at > one point a number of years ago I/T managers used to decide which CIFS > server they would buy[1]. So it was useful for Samba developers who were > trying to do competitive benchmkars, but it's not a very accurate > benchmark for measuring real-life file system workloads. > > [1] http://samba.org/ftp/tridge/dbench/README Was not able to resist to write a small notice, what no matter what, but whatever benchmark is running, it _does_ show system behaviour in one or another condition. And when system behaves rather badly, it is quite a common comment, that benchmark was useless. But it did show that system has a problem, even if rarely triggered one :) Not an ext4 nitpick of course. 
-- Evgeniy Polyakov From veelai at jonglieren-jena.de Wed Dec 23 16:23:53 2009 From: veelai at jonglieren-jena.de (Markus Oehme) Date: Wed, 23 Dec 2009 17:23:53 +0100 Subject: ext3-fs error (bad entry in directory) In-Reply-To: <4B30FE97.8050502@redhat.com> References: <87tyvmrnsq.wl%oehme.markus@gmx.de> <4B30FE97.8050502@redhat.com> Message-ID: <87ljgtk6au.wl%veelai@jonglieren-jena.de> At Tue, 22 Dec 2009 11:15:03 -0600, Eric Sandeen wrote: > > The message is always exactly the same. Somebody have a clue as to what is > > going on here? And an easier question: How do I find out which directory is > > #17083? > > you can use debugfs: > > debugfs: ncheck > ncheck: Usage: ncheck ... > debugfs: ncheck 6031 > Inode Pathname > 6031 //testfilename That solved the problem. We must have had a failed write there, I found an empty directory, that should have been a file by the semantics, quite strange. Fortunately I could simply delete it and now everything seems to be fine. Thanks a lot for the pointer. Markus -- My key: http://users.minet.uni-jena.de/~veelai/veelai.gpg -- For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because he had achieved so much---the wheel, New York, wars and so on---while all the dolphins had ever done was muck about in the water having a good time. But conversely, the dolphins had always believed that they were far more intelligent than man---for precisely the same reasons. (Douglas Adams, The Hitchhikers Guide to the Galaxy.) From sega01 at gmail.com Thu Dec 24 12:59:20 2009 From: sega01 at gmail.com (Teran McKinney) Date: Thu, 24 Dec 2009 12:59:20 +0000 Subject: benchmark results In-Reply-To: References: Message-ID: Which I/O scheduler are you using? Pretty sure that ReiserFS is a little less deadlocky with CFQ or another over deadline, but that deadline usually gives the best results for me (especially for JFS). Thanks, Teran On Thu, Dec 24, 2009 at 10:31, Christian Kujau wrote: > I've had the chance to use a testsystem here and couldn't resist running a > few benchmark programs on them: bonnie++, tiobench, dbench and a few > generic ones (cp/rm/tar/etc...) on ext{234}, btrfs, jfs, ufs, xfs, zfs. > > All with standard mkfs/mount options and +noatime for all of them. > > Here are the results, no graphs - sorry: > ? http://nerdbynature.de/benchmarks/v40z/2009-12-22/ > > Reiserfs is locking up during dbench, so I removed it from the > config, here are some earlier results: > > ? http://nerdbynature.de/benchmarks/v40z/2009-12-21/bonnie.html > > Bonnie++ couldn't complete on nilfs2, only the generic tests > and tiobench were run. As nilfs2, ufs, zfs aren't supporting xattr, dbench > could not be run on these filesystems. > > Short summary, AFAICT: > ? ?- btrfs, ext4 are the overall winners > ? ?- xfs to, but creating/deleting many files was *very* slow > ? ?- if you need only fast but no cool features or journaling, ext2 > ? ? ?is still a good choice :) > > Thanks, > Christian. 
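On the scheduler question: the active elevator can be checked and switched per device at runtime, no reboot needed. A sketch, assuming the benchmarked disk is sda:

  # the scheduler shown in brackets is the active one
  cat /sys/block/sda/queue/scheduler
  # switch to deadline for the next run (needs root, takes effect immediately)
  echo deadline > /sys/block/sda/queue/scheduler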
> -- > BOFH excuse #84: > > Someone is standing on the ethernet cable, causing a kink in the cable > -- > To unsubscribe from this list: send the line "unsubscribe reiserfs-devel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at ?http://vger.kernel.org/majordomo-info.html > From pg_jf2 at jf2.for.sabi.co.UK Thu Dec 24 13:05:39 2009 From: pg_jf2 at jf2.for.sabi.co.UK (Peter Grandi) Date: Thu, 24 Dec 2009 13:05:39 +0000 Subject: [Jfs-discussion] benchmark results In-Reply-To: References: Message-ID: <19251.26403.762180.228181@tree.ty.sabi.co.uk> > I've had the chance to use a testsystem here and couldn't > resist Unfortunately there seems to be an overproduction of rather meaningless file system "benchmarks"... > running a few benchmark programs on them: bonnie++, tiobench, > dbench and a few generic ones (cp/rm/tar/etc...) on ext{234}, > btrfs, jfs, ufs, xfs, zfs. All with standard mkfs/mount options > and +noatime for all of them. > Here are the results, no graphs - sorry: [ ... ] After having a glance, I suspect that your tests could be enormously improved, and doing so would reduce the pointlessness of the results. A couple of hints: * In the "generic" test the 'tar' test bandwidth is exactly the same ("276.68 MB/s") for nearly all filesystems. * There are read transfer rates higher than the one reported by 'hdparm' which is "66.23 MB/sec" (comically enough *all* the read transfer rates your "benchmarks" report are higher). BTW the use of Bonnie++ is also usually a symptom of a poor misunderstanding of file system benchmarking. On the plus side, test setup context is provided in the "env" directory, which is rare enough to be commendable. > Short summary, AFAICT: > - btrfs, ext4 are the overall winners > - xfs to, but creating/deleting many files was *very* slow Maybe, and these conclusions are sort of plausible (but I prefer JFS and XFS for different reasons); however they are not supported by your results as they seem to me to lack much meaning, as what is being measured is far from clear, and in particular it does not seem to be the file system performance, or anyhow an aspect of filesystem performance that might relate to common usage. I think that it is rather better to run a few simple operations (like the "generic" test) properly (unlike the "generic" test), to give a feel for how well implemented are the basic operations of the file system design. Profiling a file system performance with a meaningful full scale benchmark is a rather difficult task requiring great intellectual fortitude and lots of time. > - if you need only fast but no cool features or > journaling, ext2 is still a good choice :) That is however a generally valid conclusion, but with a very, very important qualification: for freshly loaded filesystems. Also with several other important qualifications, but "freshly loaded" is a pet peeve of mine :-). 
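The point about read rates above the hdparm figure is easy to reproduce by hand: read the same large file twice, dropping the caches only before the first pass (the file name is a placeholder; the drop_caches write needs root):

  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/test/bigfile of=/dev/null bs=1M   # roughly the raw disk rate
  dd if=/mnt/test/bigfile of=/dev/null bs=1M   # much higher: served from the page cache

Any benchmark whose working set fits in RAM and which does nothing to defeat the cache will report numbers closer to the second run than to the first.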
From lm at bitmover.com Fri Dec 25 16:22:38 2009 From: lm at bitmover.com (Larry McVoy) Date: Fri, 25 Dec 2009 08:22:38 -0800 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091225161453.GD32757@thunk.org> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> Message-ID: <20091225162238.GB19303@bitmover.com> On Fri, Dec 25, 2009 at 11:14:53AM -0500, tytso at mit.edu wrote: > On Thu, Dec 24, 2009 at 05:52:34PM -0800, Christian Kujau wrote: > > > > Well, I do "sync" after each operation, so the data should be on disk, but > > that doesn't mean it'll clear the filesystem buffers - but this doesn't > > happen that often in the real world too. Also, all filesystem were tested > > equally (I hope), yet some filesystem perform better than another - even > > if all the content copied/tar'ed/removed would perfectly well fit into the > > machines RAM. > > Did you include the "sync" in part of what you timed? Peter was quite > right --- the fact that the measured bandwidth in your "cp" test is > five times faster than the disk bandwidth as measured by hdparm, and > many file systems had exactly the same bandwidth, makes me very > suspicious that what was being measured was primarily memory bandwidth > --- and not very useful when trying to measure file system > performance. Dudes, sync() doesn't flush the fs cache, you have to unmount for that. Once upon a time Linux had an ioctl() to flush the fs buffers, I used it in lmbench. ioctl(fd, BLKFLSBUF, 0); No idea if that is still supported, but sync() is a joke for benchmarking. -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com From tytso at mit.edu Fri Dec 25 16:33:41 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Fri, 25 Dec 2009 11:33:41 -0500 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091225162238.GB19303@bitmover.com> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> Message-ID: <20091225163341.GE32757@thunk.org> On Fri, Dec 25, 2009 at 08:22:38AM -0800, Larry McVoy wrote: > > Dudes, sync() doesn't flush the fs cache, you have to unmount for that. > Once upon a time Linux had an ioctl() to flush the fs buffers, I used > it in lmbench. > > ioctl(fd, BLKFLSBUF, 0); > > No idea if that is still supported, but sync() is a joke for benchmarking. Depends on what you are trying to do (flush has multiple meanings, so using can be ambiguous). BLKFLSBUF will write out any dirty buffers, *and* empty the buffer cache. I use it when benchmarking e2fsck optimization. It doesn't do anything for the page cache. If you are measuring the time to write a file, using fsync() or sync() will include the time to actually write the data to disk. It won't empty caches, though; if you are going to measure read as well as writes, then you'll probably want to do something like "echo 3 > /proc/sys/vm/drop-caches". 
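For completeness, the three mechanisms mentioned here can be combined into a pre-measurement flush; sdX is a placeholder for the device under test and all of it needs root:

  sync                                  # write out dirty data
  echo 3 > /proc/sys/vm/drop_caches     # drop clean page cache, dentries and inodes
  blockdev --flushbufs /dev/sdX         # issues BLKFLSBUF for the device

This is only a sketch of the usual sequence, not a guarantee that the data is on the platters; that still depends on the drive's write cache and on barriers.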
- Ted From lists at nerdbynature.de Fri Dec 25 18:51:05 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 25 Dec 2009 10:51:05 -0800 (PST) Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091225162238.GB19303@bitmover.com> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> Message-ID: On Fri, 25 Dec 2009 at 08:22, Larry McVoy wrote: > Dudes, sync() doesn't flush the fs cache, you have to unmount for that. Thanks Larry, that was exactly my point[0] too, I should add that to the results page to avoid further confusion or misassumptions: > Well, I do "sync" after each operation, so the data should be on > disk, but that doesn't mean it'll clear the filesystem buffers > - but this doesn't happen that often in the real world too. I realize however that on the same results page the bonnie++ tests were run with a filesize *specifically* set to not utilize the filesystem buffers any more but the measure *disk* performance while my "generic* tests do something else - and thus cannot be compared to the bonnie++ or hdparm results. > No idea if that is still supported, but sync() is a joke for benchmarking. I was using "sync" to make sure that the data "should" be on the disks now, I did not want to flush the filesystem buffers during the "generic" tests. Thanks, Christian. [0] http://www.spinics.net/lists/linux-ext4/msg16878.html -- BOFH excuse #210: We didn't pay the Internet bill and it's been cut off. From lists at nerdbynature.de Fri Dec 25 18:56:53 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 25 Dec 2009 10:56:53 -0800 (PST) Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091225163341.GE32757@thunk.org> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <20091225163341.GE32757@thunk.org> Message-ID: On Fri, 25 Dec 2009 at 11:33, tytso at mit.edu wrote: > caches, though; if you are going to measure read as well as writes, > then you'll probably want to do something like "echo 3 > > /proc/sys/vm/drop-caches". Thanks for the hint, I could find sys/vm/drop-caches documented in Documentation/ but it's good to know there's a way to flush all these caces via this knob. Maybe I should add this to those "genric" tests to be more comparable to the other benchmarks. Christian. -- BOFH excuse #210: We didn't pay the Internet bill and it's been cut off. From lists at nerdbynature.de Fri Dec 25 19:32:58 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 25 Dec 2009 11:32:58 -0800 (PST) Subject: [Jfs-discussion] benchmark results In-Reply-To: References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <20091225163341.GE32757@thunk.org> Message-ID: On Fri, 25 Dec 2009 at 10:56, Christian Kujau wrote: > Thanks for the hint, I could find sys/vm/drop-caches documented in ------------------------------^ not, was what I meant to say, but it's all there, as "drop_caches" in Documentation/sysctl/vm.txt Christian. > Documentation/ but it's good to know there's a way to flush all these > caces via this knob. Maybe I should add this to those "genric" tests to be > more comparable to the other benchmarks. 
-- BOFH excuse #129: The ring needs another token From lists at nerdbynature.de Sat Dec 26 19:06:38 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Sat, 26 Dec 2009 11:06:38 -0800 Subject: [Jfs-discussion] benchmark results In-Reply-To: <4B36333B.3030600@hp.com> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <4B36333B.3030600@hp.com> Message-ID: <4B365EBE.5050804@nerdbynature.de> On 26.12.09 08:00, jim owens wrote: >> I was using "sync" to make sure that the data "should" be on the disks > > Good, but not good enough for many tests... info sync [...] > On Linux, sync is only guaranteed to schedule the dirty blocks for > writing; it can actually take a short time before all the blocks are > finally written. Noted, many times already. That's why I wrote "should be" - but in this special scenario (filesystem speed tests) I don't care for file integrity: if I pull the plug after "sync" and some data didn't make it to the disks, I'll only look if the testscript got all the timestamps and move on to the next test. I'm not testing for "filesystem integrity after someone pulls the plug" here. And remember, I'm doing "sync" for all the filesystems tested, so the comparison still stands. Christian. From tytso at mit.edu Sat Dec 26 19:19:16 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Sat, 26 Dec 2009 14:19:16 -0500 Subject: [Jfs-discussion] benchmark results In-Reply-To: <4B36333B.3030600@hp.com> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <4B36333B.3030600@hp.com> Message-ID: <20091226191916.GI32757@thunk.org> On Sat, Dec 26, 2009 at 11:00:59AM -0500, jim owens wrote: > Christian Kujau wrote: > > > I was using "sync" to make sure that the data "should" be on the disks > > Good, but not good enough for many tests... info sync > > CONFORMING TO > POSIX.2 > > NOTES > On Linux, sync is only guaranteed to schedule the dirty blocks for > writing; it can actually take a short time before all the blocks are > finally written. > > This is consistent with all the feels-like-unix OSes I have used. Actually, Linux's sync does more than just schedule the writes; it has for quite some time: static void sync_filesystems(int wait) { ... } SYSCALL_DEFINE0(sync) { wakeup_flusher_threads(0); sync_filesystems(0); sync_filesystems(1); if (unlikely(laptop_mode)) laptop_sync_completion(); return 0; } At least for ext3 and ext4, we will even do a device barrier operation as a restult of a call to sb->s_op->sync_fs() --- which is called by __sync_filesystem, which is called in turn by sync_filesystems(). This isn't done for all file systems, though, as near as I can tell. (Ext2 at least doesn't.) But for quite some time, under Linux the sync(2) system call will wait for the blocks to be flushed out to HBA, although we currently don't wait for the blocks to have been committed to the platters (at least not for all file systems). Applications shouldn't depend on this, of course, since POSIX and other legacy Unix systems don't guarantee this. But in terms of knowing what Linux does, the man page is a bit out of date. 
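The waiting behaviour is visible from userspace by timing the sync separately from the write; sizes and paths below are only examples:

  dd if=/dev/zero of=/mnt/test/junk bs=1M count=2000   # usually returns before the data is on disk
  time sync                                            # blocks while the dirty data is pushed to the device
  rm /mnt/test/junk

On a kernel behaving as described, the sync can take on the order of the file size divided by the disk bandwidth rather than a few milliseconds, depending on how much background writeback has already happened.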
Best regards, - Ted From lists at nerdbynature.de Sun Dec 27 21:55:26 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Sun, 27 Dec 2009 13:55:26 -0800 (PST) Subject: [Jfs-discussion] benchmark results In-Reply-To: <4B37BA76.7050403@hp.com> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <4B36333B.3030600@hp.com> <4B365EBE.5050804@nerdbynature.de> <4B37BA76.7050403@hp.com> Message-ID: On Sun, 27 Dec 2009 at 14:50, jim owens wrote: > And I don't even care about comparing 2 filesystems, I only care about > timing 2 versions of code in the single filesystem I am working on, > and forgetting about hardware cache effects has screwed me there. Not me, I'm comparing filesystems - and when the HBA or whatever plays tricks and "sync" doesn't flush all the data, it'll do so for every tested filesystem. Of course, filesystem could handle "sync" differently, and they probably do, hence the different times they take to complete. That's what my tests are about: timing comparision (does that still fall under the "benchmark" category?), not functional comparision. That's left as a task for the reader of these results: "hm, filesystem xy is so much faster when doing foo, why is that? And am I willing to sacrifice e.g. proper syncs to gain more speed?" > So unless you are sure you have no hardware cache effects... > "the comparison still stands" is *false*. Again, I don't argue with "hardware caches will have effects", but that's not the point of these tests. Of course hardware is different, but filesystems are too and I'm testing filesystems (on the same hardware). Christian. -- BOFH excuse #278: The Dilithium Crystals need to be rotated. From tytso at mit.edu Sun Dec 27 22:33:07 2009 From: tytso at mit.edu (tytso at mit.edu) Date: Sun, 27 Dec 2009 17:33:07 -0500 Subject: [Jfs-discussion] benchmark results In-Reply-To: References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <4B36333B.3030600@hp.com> <4B365EBE.5050804@nerdbynature.de> <4B37BA76.7050403@hp.com> Message-ID: <20091227223307.GA4429@thunk.org> On Sun, Dec 27, 2009 at 01:55:26PM -0800, Christian Kujau wrote: > On Sun, 27 Dec 2009 at 14:50, jim owens wrote: > > And I don't even care about comparing 2 filesystems, I only care about > > timing 2 versions of code in the single filesystem I am working on, > > and forgetting about hardware cache effects has screwed me there. > > Not me, I'm comparing filesystems - and when the HBA or whatever plays > tricks and "sync" doesn't flush all the data, it'll do so for every tested > filesystem. Of course, filesystem could handle "sync" differently, and > they probably do, hence the different times they take to complete. That's > what my tests are about: timing comparision (does that still fall under > the "benchmark" category?), not functional comparision. That's left as a > task for the reader of these results: "hm, filesystem xy is so much faster > when doing foo, why is that? And am I willing to sacrifice e.g. proper > syncs to gain more speed?" 
Yes, but given many of the file systems have almost *exactly* the same bandwidth measurement for the "cp" test, and said bandwidth measurement is 5 times the disk bandwidith as measured by hdparm, it makes me suspect that you are doing this: /bin/time /bin/cp -r /source/tree /filesystem-under-test sync /bin/time /bin/rm -rf /filesystem-under-test/tree sync etc. It is *a* measurement, but the question is whether it's a useful comparison. Consider two different file systems. One file system which does a very good job making sure that file writes are done contiguously to disk, minimizing seek overhead --- and another file system which is really crappy at disk allocation, and writes the files to random locations all over the disk. If you are only measuring the "cp", then the fact that filesystem 'A' has a very good layout, and is able to write things to disk very efficiently, and filesystem 'B' has files written in a really horrible way, won't be measured by your test. This is especially true if, for example, you have 8GB of memory and you are copying 4GB worth of data. You might notice it if you include the "sync" in the timing, i.e.: /bin/time /bin/sh -c "/bin/cp -r /source/tree /filesystem-under-test;/bin/sync" > Again, I don't argue with "hardware caches will have effects", but that's > not the point of these tests. Of course hardware is different, but > filesystems are too and I'm testing filesystems (on the same hardware). The question is whether your tests are doing the best job of measuring how good the filesystem really is. If your workload is one where you will only be copying file sets much smaller than your memory, and you don't care about when the data actually hits the disk, only when "/bin/cp" returns, then sure, do whatever you want. But if you want the tests to have meaning if, for example, you have 2GB of memory and you are copying 8GB of data, or if later on will be continuously streaming data to the disk, and sooner or later the need to write data to the disk will start slowing down your real-life workload, then not including the time to do the sync in the time to copy your file set may cause you to assume that filesystems 'A' and 'B' are identical in performance, and then your filesystem comparison will end up misleading you. The bottom line is that it's very hard to do good comparisons that are useful in the general case. Best regards, - Ted From lists at nerdbynature.de Mon Dec 28 01:24:05 2009 From: lists at nerdbynature.de (Christian Kujau) Date: Sun, 27 Dec 2009 17:24:05 -0800 (PST) Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091227223307.GA4429@thunk.org> References: <19251.26403.762180.228181@tree.ty.sabi.co.uk> <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <4B36333B.3030600@hp.com> <4B365EBE.5050804@nerdbynature.de> <4B37BA76.7050403@hp.com> <20091227223307.GA4429@thunk.org> Message-ID: On Sun, 27 Dec 2009 at 17:33, tytso at mit.edu wrote: > Yes, but given many of the file systems have almost *exactly* the same "Almost" indeed - but curiously enough some filesystem are *not* the same, although they should. Again: we have 8GB RAM, I'm copying ~3GB of data, so why _are_ there differences? (Answer: because filesystems are different). That's the only point of this test. Also note the disclaimer[0] I added to the results page a few days ago. 
> measurement is 5 times the disk bandwidith as measured by hdparm, it > makes me suspect that you are doing this: > /bin/time /bin/cp -r /source/tree /filesystem-under-test > sync No, I'm not - see the test script[1] - I'm taking the time for cp/rm/tar *and* sync. But even if I would only take the time *only* for say "cp", not the sync part. Still, it would be a valid comparison across filesystems (the same operation for every filesystem) also a not very realistic one - because in the real world I *want* to make sure my data is on the disk. But that's as far as I go in these tests, I'm not even messing around with disk caches or HBA caches - that's not the scope of these tests. > You might notice it if you include the "sync" in the timing, i.e.: > /bin/time /bin/sh -c "/bin/cp -r /source/tree /filesystem-under-test;/bin/sync" Yes, that's exactly what the tests do. > "/bin/cp" returns, then sure, do whatever you want. But if you want > the tests to have meaning if, for example, you have 2GB of memory and > you are copying 8GB of data, For the bonnie++ tests I chose a filesize (16GB) so that disk performance will matter here. As the generic tests shuffle around much more smaller data, no disk performance, but filesystem performance is measured (and compared to other filesystems) - well aware of the fact that caches *Are* being used. Why would I want to discard caches? My daily usage pattern (opening webrowsers, terminal windows, spreadcheats deal with much smaller datasets and I'm happy that Linux is so hungry for cache - yet some filesystems do not seem to utilize this opportunity as good as others do. That's the whole point of this particular test. But constantly explaining my point over and over again I see what I have to do: I shall run the generic tests again with much bigger datasets, so that disk-performance is also reflected, as people do seem to care about this (I don't - I can switch filesystems more easily than disks). > The bottom line is that it's very hard to do good comparisons that are > useful in the general case. And it's difficult to find out what's a "useful comparison" for the general public :-) Christian. [0] http://nerdbynature.de/benchmarks/v40z/2009-12-22/ [1] http://nerdbynature.de/benchmarks/v40z/2009-12-22/env/fs-bench.sh.txt -- BOFH excuse #292: We ran out of dial tone and we're and waiting for the phone company to deliver another bottle. From lm at bitmover.com Mon Dec 28 14:08:55 2009 From: lm at bitmover.com (Larry McVoy) Date: Mon, 28 Dec 2009 06:08:55 -0800 Subject: [Jfs-discussion] benchmark results In-Reply-To: <20091227223307.GA4429@thunk.org> References: <20091224212756.GM21594@thunk.org> <20091225161453.GD32757@thunk.org> <20091225162238.GB19303@bitmover.com> <4B36333B.3030600@hp.com> <4B365EBE.5050804@nerdbynature.de> <4B37BA76.7050403@hp.com> <20091227223307.GA4429@thunk.org> Message-ID: <20091228140855.GD10982@bitmover.com> > The bottom line is that it's very hard to do good comparisons that are > useful in the general case. It has always amazed me watching people go about benchmarking. I should have a blog called "you're doing it wrong" or something. Personally, I use benchmarks to validate what I already believe to be true. So before I start I have a predicition as to what the answer should be, based on my understanding of the system being measured. Back when I was doing this a lot, I was always within a factor of 10 (not a big deal) and usually within a factor of 2 (quite a bit bigger deal). 
When things didn't match up that was a clue that either - the benchmark was broken - the code was broken - the hardware was broken - my understanding was broken If you start a benchmark and you don't know what the answer should be, at the very least within a factor of 10 and ideally within a factor of 2, you shouldn't be running the benchmark. Well, maybe you should, they are fun. But you sure as heck shouldn't be publishing results unless you know they are correct. This is why lmbench, to toot my own horn, measures what it does. If go run that, memorize the results, you can tell yourself "well, this machine has sustained memory copy bandwidth of 3.2GB/sec, the disk I'm using can read at 60MB/sec and write at 52MB/sec (on the outer zone where I'm going to run my tests), it does small seeks in about 6 milliseconds, I'm doing sequential I/O, the bcopy is in the noise, the blocks are big enough that the seeks are hidden, so I'd like to see a steady 50MB/sec or so on a sustained copy test". If you have a mental model for how the bits of the system works you can decompose the benchmark into the parts, predict the result, run it, and compare. It'll match or Lucy, you have some 'splainin to do. -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com From shadowbu at gmail.com Tue Dec 29 06:06:02 2009 From: shadowbu at gmail.com (George Butler) Date: Tue, 29 Dec 2009 00:06:02 -0600 Subject: ext3 partition size Message-ID: <4B399C4A.2010109@gmail.com> Hi all, I am running fedora 11 with kernel 2.6.30.9-102.fc11.x86_64 #1 SMP Fri Dec 4 00:18:53 EST 2009 x86_64 x86_64 x86_64 GNU/Linux. I am noticing a partition on my drive is reporting incorrect size with "df", the partition is ext3 size 204GB with about 79GB actual usage, the "df" result show the partition size to be 111GB, 93GB is missing. Please advice on what can be done to see why the system is reporting incorrect partition size. e2fsprogs version: rpm -qa | grep e2fsprogs e2fsprogs-libs-1.41.4-12.fc11.x86_64 e2fsprogs-1.41.4-12.fc11.x86_64 e2fsprogs-libs-1.41.4-12.fc11.i586 e2fsprogs-devel-1.41.4-12.fc11.x86_64 e2fsprogs-debuginfo-1.41.4-12.fc11.x86_64 mount: /dev/sdb8 on /srv/multimedia type ext3 (rw,relatime) $ df -hT Filesystem Type Size Used Avail Use% Mounted on /dev/sdb2 ext3 30G 1.1G 28G 4% / /dev/sdb7 ext3 20G 1.3G 18G 7% /var /dev/sdb6 ext3 30G 12G 17G 43% /usr /dev/sdb5 ext3 40G 25G 13G 67% /home /dev/sdb1 ext3 107M 52M 50M 52% /boot */dev/sdb8 ext3 111G 79G 27G 76% /srv/multimedia* tmpfs tmpfs 2.9G 35M 2.9G 2% /dev/shm Parted info: (parted) select /dev/sdb Using /dev/sdb (parted) print Model: ATA ST3500630AS (scsi) Disk /dev/sdb: 500GB Sector size (logical/physical): 512B/512B Partition Table: msdos Number Start End Size Type File system Flags 1 32.3kB 115MB 115MB primary ext3 boot 2 115MB 32.3GB 32.2GB primary ext3 3 32.3GB 35.5GB 3224MB primary linux-swap 4 35.5GB 500GB 465GB extended 5 35.5GB 78.5GB 43.0GB logical ext3 6 78.5GB 111GB 32.2GB logical ext3 7 111GB 132GB 21.5GB logical ext3 *8 132GB 352GB 220GB logical ext3* 9 352GB 492GB 140GB logical ext3 result of e2fsck: $ e2fsck -f -v -c -E fragcheck /dev/sdb8 e2fsck 1.41.4 (27-Jan-2009) Checking for bad blocks (read-only test): done sg500misc: Updating bad block inode. 
Pass 1: Checking inodes, blocks, and sizes 8(f): expecting 32768 got phys 34307 (blkcnt 31191) 491521(f): expecting 1049096 got phys 1081858 (blkcnt 6) 491521(f): expecting 1081864 got phys 1114626 (blkcnt -1) 491521(f): expecting 1114632 got phys 1180162 (blkcnt 17) 491521(f): expecting 1180168 got phys 1212930 (blkcnt 23) 491521(f): expecting 1212936 got phys 1505163 (blkcnt 29) ***********50K + more output lines****************** Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information sg500misc: ***** FILE SYSTEM WAS MODIFIED ***** 33658 inodes used (0.23%) 3325 non-contiguous files (9.9%) 24 non-contiguous directories (0.1%) # of inodes with ind/dind/tind blocks: 25055/2723/2 21136610 blocks used (71.72%) 0 bad blocks 6 large files 31899 regular files 1698 directories 0 character device files 0 block device files 0 fifos 0 links 52 symbolic links (52 fast symbolic links) 0 sockets -------- 33649 files -------------- next part -------------- An HTML attachment was scrubbed... URL: From adilger at sun.com Thu Dec 31 21:39:05 2009 From: adilger at sun.com (Andreas Dilger) Date: Thu, 31 Dec 2009 14:39:05 -0700 Subject: ext3 partition size In-Reply-To: <4B399C4A.2010109@gmail.com> References: <4B399C4A.2010109@gmail.com> Message-ID: <8005FD36-E520-4E0E-B461-30A0C0F4DFCB@sun.com> On 2009-12-28, at 23:06, George Butler wrote: > I am running fedora 11 with kernel 2.6.30.9-102.fc11.x86_64 #1 > SMP Fri Dec 4 00:18:53 EST 2009 x86_64 x86_64 x86_64 GNU/Linux. I am > noticing a partition on my drive is reporting incorrect size with > "df", the partition is ext3 size 204GB with about 79GB actual usage, > the "df" result show the partition size to be 111GB, 93GB is > missing. Please advice on what can be done to see why the system is > reporting incorrect partition size. > > mount: /dev/sdb8 on /srv/multimedia type ext3 (rw,relatime) > > $ df -hT > Filesystem Type Size Used Avail Use% Mounted on > /dev/sdb2 ext3 30G 1.1G 28G 4% / > /dev/sdb7 ext3 20G 1.3G 18G 7% /var > /dev/sdb6 ext3 30G 12G 17G 43% /usr > /dev/sdb5 ext3 40G 25G 13G 67% /home > /dev/sdb1 ext3 107M 52M 50M 52% /boot > /dev/sdb8 ext3 111G 79G 27G 76% /srv/multimedia > tmpfs tmpfs 2.9G 35M 2.9G 2% /dev/shm > > Parted info: > > (parted) select /dev/sdb > Using /dev/sdb > (parted) print > Model: ATA ST3500630AS (scsi) > Disk /dev/sdb: 500GB > Sector size (logical/physical): 512B/512B > Partition Table: msdos > > Number Start End Size Type File system Flags > 1 32.3kB 115MB 115MB primary ext3 boot > 2 115MB 32.3GB 32.2GB primary ext3 > 3 32.3GB 35.5GB 3224MB primary linux-swap > 4 35.5GB 500GB 465GB extended > 5 35.5GB 78.5GB 43.0GB logical ext3 > 6 78.5GB 111GB 32.2GB logical ext3 > 7 111GB 132GB 21.5GB logical ext3 > 8 132GB 352GB 220GB logical ext3 > 9 352GB 492GB 140GB logical ext3 It definitely looks strange. Did you resize this partition after it was created? In any case, running "resize2fs /dev/sdb8" should enlarge the filesystem to fill the partition. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
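A possible way to proceed, before resizing anything, is to compare the size recorded in the superblock with the size of the partition, and only then grow the filesystem. The offline variant is sketched below; device and mount point are taken from the report above, and a backup first is a good idea:

  # filesystem size according to the superblock
  dumpe2fs -h /dev/sdb8 | grep -E 'Block count|Block size'
  # partition size according to the kernel, in bytes
  blockdev --getsize64 /dev/sdb8
  # grow the filesystem to fill the partition
  umount /srv/multimedia
  e2fsck -f /dev/sdb8
  resize2fs /dev/sdb8
  mount /dev/sdb8 /srv/multimedia

resize2fs without an explicit size grows the filesystem to the size of the underlying device. ext3 of this vintage can usually also be grown while mounted, in which case the umount and fsck steps are unnecessary, but the offline path is the more conservative one.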