From kevin at ucsd.edu  Mon Jan  4 03:28:07 2010
From: kevin at ucsd.edu (Kevin Bowen)
Date: Sun, 3 Jan 2010 19:28:07 -0800
Subject: ext3 resize failed, data loss
Message-ID: <9c9db87d1001031928g470b9939v479d5af4c316e40@mail.gmail.com>

I used parted to resize (shrink) an ext3 filesystem and associated
partition, and it buggered my system. The operation completed
apparently successfully, reporting no errors, but after reboot, the fs
wouldn't mount, being marked as having errors, and and e2fsck said
"The filesystem size (according to the superblock) is xxx blocks
The physical size of the device is xxx blocks Either the superblock or
the partition table is likely to be corrupt!". So the fs still thought
it was its original size (larger than its partition).

At this point, the fs would actually mount without errors if I mounted
it manually (ro), and all my data seemed intact, it just thought it
had way more free space than it should have, and it couldn't complete
an fsck (and was obviously not safe to use mounted rw lest it try to
write to space it didn't actually own). Google turned up some accounts
of people with the identical issue, and suggestions to fix it by
writing a new superblock with e2sck -S, then fscking - I did this, and
it totally trashed my filesystem. The fs is now the right size and
mounts fine, but everything just got dumped into lost+found.

Is there any way I can fix this and get my data back? At least get it
back to its previous state so I can mount it ro and copy my data off?
Is my old superblock backed up somewhere, or does e2fsk update the
backup superblock as well? Would my old superblock even help, or did
the fsck trash my inode structure?

Currently I think I have all my data, just dumped in lost+found
without filenames - is there any way to salvage anything from that?

And is this a known bug in ext2resize? In parted?

-- 
Kevin Bowen
kevin at ucsd.edu


From pop3 at flachtaucher.de  Wed Jan  6 11:00:58 2010
From: pop3 at flachtaucher.de (Martin Baum)
Date: Wed, 06 Jan 2010 12:00:58 +0100
Subject: Optimizing dd images of ext3 partitions: Only copy blocks in use by fs
Message-ID: <20100106120058.12202uj1udebzcmc@webmail.df.eu>

Hello,

for bare-metal recovery I need to create complete disk images of ext3  
partitions of about 30 servers. I'm doing this by creating  
lvm2-snapshots and then dd'ing the snapshot-device to my backup media.  
(I am aware that backups created by this procedure are the equivalent  
of hitting the power switch at the time the snapshot was taken.)

This works great and avoids a lot of seeks on highly utilized file  
systems. However it wastes a lot of space for disks with nearly empty  
filesystems.

It would be a lot better if I could only read the blocks from raw disk  
that are really in use by ext3 (the rest could be sparse in the  
imagefile created). Is there a way to do this?

I am aware that e2image -r dumps all metadata. Is there a tool that  
does not only dump metadata but also the data blocks? (maybe even in a  
way that avoids seeks by compiling a list of blocks first and then  
reading them in disk-order) If not: Is there a tool I can extend to do  
so / can you point me into the righ direction?

(I tried dumpfs, however it dumps inodes on a per-directory base.  
Skimming through the source I did not see any optimization regarding  
seeks. So on highly populated filesystems dumpfs still is slower than  
full images with dd for me.)

Thanks a lot,
Martin


From adilger at sun.com  Wed Jan  6 21:09:10 2010
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 06 Jan 2010 14:09:10 -0700
Subject: Optimizing dd images of ext3 partitions: Only copy blocks in use
	by fs
In-Reply-To: <20100106120058.12202uj1udebzcmc@webmail.df.eu>
References: <20100106120058.12202uj1udebzcmc@webmail.df.eu>
Message-ID: <1F6683F6-966A-4AD6-932F-DC80AB36DDBA@sun.com>

On 2010-01-06, at 04:00, Martin Baum wrote:
> for bare-metal recovery I need to create complete disk images of  
> ext3 partitions of about 30 servers. I'm doing this by creating lvm2- 
> snapshots and then dd'ing the snapshot-device to my backup media. (I  
> am aware that backups created by this procedure are the equivalent  
> of hitting the power switch at the time the snapshot was taken.)
>
> This works great and avoids a lot of seeks on highly utilized file  
> systems. However it wastes a lot of space for disks with nearly  
> empty filesystems.
>
> It would be a lot better if I could only read the blocks from raw  
> disk that are really in use by ext3 (the rest could be sparse in the  
> imagefile created). Is there a way to do this?

You can use "dump" which will read only the in-use blocks, but it  
doesn't create a full disk image.

The other trick that I've used for similar situations is to write a  
file of all zeroes to the filesystem until it is full (e.g. dd if=/dev/ 
zero of=/foo) and then the backup will be able to compress quite  
well.  If the filesystem is in use, you should stop before the  
filesystem is completely full, and also unlink the file right after it  
is created, so in case of trouble the file will automatically be  
unlinked (even after a crash).

> I am aware that e2image -r dumps all metadata. Is there a tool that  
> does not only dump metadata but also the data blocks? (maybe even in  
> a way that avoids seeks by compiling a list of blocks first and then  
> reading them in disk-order) If not: Is there a tool I can extend to  
> do so / can you point me into the righ direction?
>
> (I tried dumpfs, however it dumps inodes on a per-directory base.  
> Skimming through the source I did not see any optimization regarding  
> seeks. So on highly populated filesystems dumpfs still is slower  
> than full images with dd for me.)


Optimizing dump to e.g. sort inodes might help the performance, if  
that isn't already done.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From mrubin at google.com  Mon Jan  4 18:57:49 2010
From: mrubin at google.com (Michael Rubin)
Date: Mon, 4 Jan 2010 10:57:49 -0800
Subject: [Jfs-discussion] benchmark results
In-Reply-To: <20100104162748.GA11932@think>
References: <alpine.DEB.2.01.0912240205510.3483@bogon.housecafe.de> 
	<19251.26403.762180.228181@tree.ty.sabi.co.uk>
	<20091224212756.GM21594@thunk.org> 
	<20091224234631.GA1028@ioremap.net> <20091225161146.GC32757@thunk.org> 
	<20100104162748.GA11932@think>
Message-ID: <532480951001041057w3ad8d1dfy361ced0346ebaaa4@mail.gmail.com>

Google is currently in the middle of upgrading from ext2 to a more up
to date file system. We ended up choosing ext4. This thread touches
upon many of the issues we wrestled with, so I thought it would be
interesting to share. We should be sending out more details soon.

The driving performance reason to upgrade is that while ext2 had been "good
enough" for a very long time the metadata arrangement on a stale file
system was leading to what we call "read inflation". This is where we
end up doing many seeks to read one block of data. In general latency
from poor block allocation was causing performance hiccups.

We spent a lot of time with unix standard benchmarks (dbench, compile
bench, et al) on xfs, ext4, jfs to try to see which one was going to
perform the best. In the end we mostly ended up using the benchmarks
to validate our assumptions and do functional testing. Larry is
completely right IMHO. These benchmarks were instrumental in helping
us understand how the file systems worked in controlled situations and
gain confidence from our customers.

For our workloads we saw ext4 and xfs as "close enough" in performance
in the areas we cared about. The fact that we had a much smoother
upgrade path with ext4 clinched the deal. The only upgrade option we
have is online. ext4 is already moving the bottleneck away from the
storage stack for some of our most intensive applications.

It was not until we moved from benchmarks to customer workload that we
were able to make detailed performance comparisons and find bugs in
our implementation.

"Iterate often" seems to be the winning strategy for SW dev. But when
it involves rebooting a cloud of systems and making a one way
conversion of their data it can get messy. That said I see benchmarks
as tools to build confidence before running traffic on redundant live
systems.

mrubin

PS for some reason "dbench" holds mythical power over many folks I
have met. They just believe it's the most trusted and standard
benchmark for file systems. In my experience it often acts as a random
number generator. It has found some bugs in our code as it exercises
the VFS layer very well.


From david at fromorbit.com  Tue Jan  5 00:41:17 2010
From: david at fromorbit.com (Dave Chinner)
Date: Tue, 5 Jan 2010 11:41:17 +1100
Subject: [Jfs-discussion] benchmark results
In-Reply-To: <20100104162748.GA11932@think>
References: <alpine.DEB.2.01.0912240205510.3483@bogon.housecafe.de>
	<19251.26403.762180.228181@tree.ty.sabi.co.uk>
	<20091224212756.GM21594@thunk.org>
	<20091224234631.GA1028@ioremap.net>
	<20091225161146.GC32757@thunk.org> <20100104162748.GA11932@think>
Message-ID: <20100105004117.GP13802@discord.disaster>

On Mon, Jan 04, 2010 at 11:27:48AM -0500, Chris Mason wrote:
> On Fri, Dec 25, 2009 at 11:11:46AM -0500, tytso at mit.edu wrote:
> > On Fri, Dec 25, 2009 at 02:46:31AM +0300, Evgeniy Polyakov wrote:
> > > > [1] http://samba.org/ftp/tridge/dbench/README
> > > 
> > > Was not able to resist to write a small notice, what no matter what, but
> > > whatever benchmark is running, it _does_ show system behaviour in one
> > > or another condition. And when system behaves rather badly, it is quite
> > > a common comment, that benchmark was useless. But it did show that
> > > system has a problem, even if rarely triggered one :)
> > 
> > If people are using benchmarks to improve file system, and a benchmark
> > shows a problem, then trying to remedy the performance issue is a good
> > thing to do, of course.  Sometimes, though the case which is
> > demonstrated by a poor benchmark is an extremely rare corner case that
> > doesn't accurately reflect common real-life workloads --- and if
> > addressing it results in a tradeoff which degrades much more common
> > real-life situations, then that would be a bad thing.
> > 
> > In situations where benchmarks are used competitively, it's rare that
> > it's actually a *problem*.  Instead it's much more common that a
> > developer is trying to prove that their file system is *better* to
> > gullible users who think that a single one-dimentional number is
> > enough for them to chose file system X over file system Y.
> 
> [ Look at all this email from my vacation...sorry for the delay ]
> 
> It's important that people take benchmarks from filesystem developers
> with a big grain of salt, which is one reason the boxacle.net results
> are so nice.  Steve more than willing to take patches and experiment to
> improve a given FS results, but his business is a fair representation of
> performance and it shows.

Just looking at the results there, I notice that the RAID system XFS
mailserver results dropped by an order of magnitude between
2.6.29-rc2 and 2.6.31. The single disk results are pretty
much identical across the two kernels.

IIRC, in 2.6.31 RAID0 started passing barriers through so I suspect
this is the issue. However, seeing as dmesg is not collected by
the scripts after the run and the output of the mounttab does
not show default options, I cannot tell if this is the case. This
might be worth checking by running XFS with the "nobarrier" mount
option....

FWIW, is it possible to get these benchmarks run on each filesystem for
each kernel release so ext/xfs/btrfs all get some regular basic
performance regression test coverage?

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com


From chris.mason at oracle.com  Mon Jan  4 16:27:48 2010
From: chris.mason at oracle.com (Chris Mason)
Date: Mon, 4 Jan 2010 11:27:48 -0500
Subject: [Jfs-discussion] benchmark results
In-Reply-To: <20091225161146.GC32757@thunk.org>
References: <alpine.DEB.2.01.0912240205510.3483@bogon.housecafe.de>
	<19251.26403.762180.228181@tree.ty.sabi.co.uk>
	<20091224212756.GM21594@thunk.org>
	<20091224234631.GA1028@ioremap.net>
	<20091225161146.GC32757@thunk.org>
Message-ID: <20100104162748.GA11932@think>

On Fri, Dec 25, 2009 at 11:11:46AM -0500, tytso at mit.edu wrote:
> On Fri, Dec 25, 2009 at 02:46:31AM +0300, Evgeniy Polyakov wrote:
> > > [1] http://samba.org/ftp/tridge/dbench/README
> > 
> > Was not able to resist to write a small notice, what no matter what, but
> > whatever benchmark is running, it _does_ show system behaviour in one
> > or another condition. And when system behaves rather badly, it is quite
> > a common comment, that benchmark was useless. But it did show that
> > system has a problem, even if rarely triggered one :)
> 
> If people are using benchmarks to improve file system, and a benchmark
> shows a problem, then trying to remedy the performance issue is a good
> thing to do, of course.  Sometimes, though the case which is
> demonstrated by a poor benchmark is an extremely rare corner case that
> doesn't accurately reflect common real-life workloads --- and if
> addressing it results in a tradeoff which degrades much more common
> real-life situations, then that would be a bad thing.
> 
> In situations where benchmarks are used competitively, it's rare that
> it's actually a *problem*.  Instead it's much more common that a
> developer is trying to prove that their file system is *better* to
> gullible users who think that a single one-dimentional number is
> enough for them to chose file system X over file system Y.

[ Look at all this email from my vacation...sorry for the delay ]

It's important that people take benchmarks from filesystem developers
with a big grain of salt, which is one reason the boxacle.net results
are so nice.  Steve more than willing to take patches and experiment to
improve a given FS results, but his business is a fair representation of
performance and it shows.

> 
> For example, if I wanted to play that game and tell people that ext4
> is better, I'd might pick this graph:
> 
> http://btrfs.boxacle.net/repository/single-disk/2.6.29-rc2/2.6.29-rc2/2.6.29-rc2_Mail_server_simulation._num_threads=32.html
> 
> On the other hand, this one shows ext4 as the worst compared to all
> other file systems:
> 
> http://btrfs.boxacle.net/repository/single-disk/2.6.29-rc2/2.6.29-rc2/2.6.29-rc2_Large_file_random_writes_odirect._num_threads=8.html
> 
> Benchmarking, like statistics, can be extremely deceptive, and if
> people do things like carefully order a tar file so the files are
> optimal for a file system, it's fair to ask whether that's a common
> thing for people to be doing (either unpacking tarballs or unpacking
> tarballs whose files have been carefully ordered for a particular file
> systems).

I tend to use compilebench for testing the ability to create lots of
small files, which puts the file names into FS native order (by
unpacking and then readdiring the results) before it does any timings.

I'd agree with Larry that benchmarking is most useful to test a theory.
Here's a patch that is supposed to do xyz, is that actually true.  With
that said we should also be trying to write benchmarks that show the
worst case...we know some of our design weakness and should be able to
show numbers for how bad it really is (see the random write
btrfs.boxacle.net tests for that one).

-chris


From slpratt at austin.ibm.com  Tue Jan  5 15:31:00 2010
From: slpratt at austin.ibm.com (Steven Pratt)
Date: Tue, 05 Jan 2010 09:31:00 -0600
Subject: [Jfs-discussion] benchmark results
In-Reply-To: <20100105004117.GP13802@discord.disaster>
References: <alpine.DEB.2.01.0912240205510.3483@bogon.housecafe.de>
	<19251.26403.762180.228181@tree.ty.sabi.co.uk>
	<20091224212756.GM21594@thunk.org>
	<20091224234631.GA1028@ioremap.net>
	<20091225161146.GC32757@thunk.org> <20100104162748.GA11932@think>
	<20100105004117.GP13802@discord.disaster>
Message-ID: <4B435B34.20003@austin.ibm.com>

Dave Chinner wrote:
> On Mon, Jan 04, 2010 at 11:27:48AM -0500, Chris Mason wrote:
>   
>> On Fri, Dec 25, 2009 at 11:11:46AM -0500, tytso at mit.edu wrote:
>>     
>>> On Fri, Dec 25, 2009 at 02:46:31AM +0300, Evgeniy Polyakov wrote:
>>>       
>>>>> [1] http://samba.org/ftp/tridge/dbench/README
>>>>>           
>>>> Was not able to resist to write a small notice, what no matter what, but
>>>> whatever benchmark is running, it _does_ show system behaviour in one
>>>> or another condition. And when system behaves rather badly, it is quite
>>>> a common comment, that benchmark was useless. But it did show that
>>>> system has a problem, even if rarely triggered one :)
>>>>         
>>> If people are using benchmarks to improve file system, and a benchmark
>>> shows a problem, then trying to remedy the performance issue is a good
>>> thing to do, of course.  Sometimes, though the case which is
>>> demonstrated by a poor benchmark is an extremely rare corner case that
>>> doesn't accurately reflect common real-life workloads --- and if
>>> addressing it results in a tradeoff which degrades much more common
>>> real-life situations, then that would be a bad thing.
>>>
>>> In situations where benchmarks are used competitively, it's rare that
>>> it's actually a *problem*.  Instead it's much more common that a
>>> developer is trying to prove that their file system is *better* to
>>> gullible users who think that a single one-dimentional number is
>>> enough for them to chose file system X over file system Y.
>>>       
>> [ Look at all this email from my vacation...sorry for the delay ]
>>
>> It's important that people take benchmarks from filesystem developers
>> with a big grain of salt, which is one reason the boxacle.net results
>> are so nice.  Steve more than willing to take patches and experiment to
>> improve a given FS results, but his business is a fair representation of
>> performance and it shows.
>>     
>
> Just looking at the results there, I notice that the RAID system XFS
> mailserver results dropped by an order of magnitude between
> 2.6.29-rc2 and 2.6.31. The single disk results are pretty
> much identical across the two kernels.
>
> IIRC, in 2.6.31 RAID0 started passing barriers through so I suspect
> this is the issue. However, seeing as dmesg is not collected by
> the scripts after the run and the output of the mounttab does
> not show default options, I cannot tell if this is the case. 
Well the dmesg collection is done by the actual benchmark run which 
occurs after the mount command is issued, so if you are looking for 
dmesg related to mounting the xfs volume, it should be in the dmesg we 
did collect.  If dmesg actually formatted timestamps, this would be 
easier to see.  It seems that nothing from xfs is ending up in dmesg 
since we are running xfs with different threads counts in order without 
reboot, so the dmesg for 16 thread xfs is run right after 1 thread xfs, 
but the dmesg show ext3 as the last thing, so safe to say no output from 
xfs is ending up in dmesg at all.


> This
> might be worth checking by running XFS with the "nobarrier" mount
> option....
>   
I could give that a try for you.

> FWIW, is it possible to get these benchmarks run on each filesystem for
> each kernel release so ext/xfs/btrfs all get some regular basic
> performance regression test coverage?
>   
Possible yes.  Just need to find the time to do the runs, and more 
importantly postprocess the data in some meaningful way.  I'll see what 
I can do.

Steve

> Cheers,
>
> Dave.
>   


From lakshmipathi.g at gmail.com  Wed Jan 13 09:05:14 2010
From: lakshmipathi.g at gmail.com (lakshmi pathi)
Date: Wed, 13 Jan 2010 14:35:14 +0530
Subject: Fwd: ext4_inode: i_block[] doubt
In-Reply-To: <ae2f51271001122348q41adf483te5d4b402f854e266@mail.gmail.com>
References: <ae2f51271001122348q41adf483te5d4b402f854e266@mail.gmail.com>
Message-ID: <ae2f51271001130105m1c654621w3a28dc733fb44ad5@mail.gmail.com>

~~~~~~~~~~~~~~~~
I checked for ext4-users mailing list - but unable to find it and
posted this question to  ext4-beta-list at redhat.com - It's seems like
that mailing list is less active.So I'm posting it again to ext3 list
.
~~~~~~~~~~~~~~~~
I was accessing ext4 file using ext2fs lib (from
e2fsprogs-1.41.9-Fedora 12 ) ,while parsing inode contents I got these
output Let me know whether my assumptions are correct?

---------------
//code-part : print inode values
ext2fs_read_inode(current_fs,d->d_ino,&inode);
for ( i = 0; i < 15; i++)

printf ("\ni_block[%d] :%u", i, inode.i_block[i]);

---------------

In struct ext4_inode i_block[EXT4_N_BLOCKS], i_block[0] to i_block[2]
denotes extent headers and tells whether this inode uses Htree for
storing files data blocks.

//output
i_block[0] :324362
i_block[1] :4
i_block[2] :0

//remaining i_block[3] to i_block[14] holds four extents ?in following format
//{extent index,number of blocks in extents,starting block number}
i_block[3] :0 ? ? ? ? ? // --> first extent denoted as 0
i_block[4] :1 ? ? ? ? ? // --> has single block
i_block[5] :36890 ? ? ? // --> block number is 36890
i_block[6] :1 ? ? ? ? ? // --> second extent denoted as 1
i_block[7] :2 ? ? ? ? ? // --> has two blocks
i_block[8] :36892 ? ? ? // --> it uses 36892 and 36893
i_block[9] :3 ? ? ? ? ? // --??--> third extent ?-- why is it numbered
as 3 instead of 2?
i_block[10] :2 ? ? ? ? ?// --> has two blocks
i_block[11] :36898 ? ? ?// --> uses 36898 and 36899
i_block[12] :5 ? ? ? ? ?// --??--> fourth and finally extent -- again why its
comes as 5 instead of 3.
i_block[13] :11 ? ? ? ? // --> it uses 11 blocks
i_block[14] :38402 ? ? ?// --> starting block is 38402.

If my assumption are correct,the question is ,why i_block[9] ?shows 3
instead of 2 and ? i_block[12] says 5 instead of 3?

Thanks.

-- 
----
Cheers,
Lakshmipathi.G
www.giis.co.in


From shadowbu at gmail.com  Wed Jan 13 15:27:41 2010
From: shadowbu at gmail.com (George Butler)
Date: Wed, 13 Jan 2010 09:27:41 -0600
Subject: ext3 partition size
In-Reply-To: <8005FD36-E520-4E0E-B461-30A0C0F4DFCB@sun.com>
References: <4B399C4A.2010109@gmail.com>
	<8005FD36-E520-4E0E-B461-30A0C0F4DFCB@sun.com>
Message-ID: <4B4DE66D.1060707@gmail.com>

Andreas.... thanks for the suggestion, I did a *resize2fs -d -p 
/dev/sdb8*  on the partition  and is now showing the size and disk usage 
correctly.

Thanks for your advise.

george

On 12/31/2009 03:39 PM, Andreas Dilger wrote:
> On 2009-12-28, at 23:06, George Butler wrote:
>>     I am running fedora 11 with kernel 2.6.30.9-102.fc11.x86_64 #1 
>> SMP Fri Dec 4 00:18:53 EST 2009 x86_64 x86_64 x86_64 GNU/Linux. I am 
>> noticing a partition on  my drive is reporting incorrect size with 
>> "df", the partition is ext3 size 204GB with about 79GB actual usage, 
>> the "df" result show the partition size to be 111GB, 93GB is missing. 
>> Please advice on what can be done to see why the system is reporting 
>> incorrect partition size.
>>
>> mount: /dev/sdb8 on /srv/multimedia type ext3 (rw,relatime)
>>
>> $ df -hT
>> Filesystem    Type    Size  Used Avail Use% Mounted on
>> /dev/sdb2     ext3     30G  1.1G   28G   4% /
>> /dev/sdb7     ext3     20G  1.3G   18G   7% /var
>> /dev/sdb6     ext3     30G   12G   17G  43% /usr
>> /dev/sdb5     ext3     40G   25G   13G  67% /home
>> /dev/sdb1     ext3    107M   52M   50M  52% /boot
>> /dev/sdb8     ext3    111G   79G   27G  76% /srv/multimedia
>> tmpfs        tmpfs    2.9G   35M  2.9G   2% /dev/shm
>>
>> Parted info:
>>
>> (parted) select /dev/sdb
>> Using /dev/sdb
>> (parted) print
>> Model: ATA ST3500630AS (scsi)
>> Disk /dev/sdb: 500GB
>> Sector size (logical/physical): 512B/512B
>> Partition Table: msdos
>>
>> Number  Start   End     Size    Type      File system  Flags
>>  1      32.3kB  115MB   115MB   primary   ext3         boot
>>  2      115MB   32.3GB  32.2GB  primary   ext3
>>  3      32.3GB  35.5GB  3224MB  primary   linux-swap
>>  4      35.5GB  500GB   465GB   extended
>>  5      35.5GB  78.5GB  43.0GB  logical   ext3
>>  6      78.5GB  111GB   32.2GB  logical   ext3
>>  7      111GB   132GB   21.5GB  logical   ext3
>>  8      132GB   352GB   220GB   logical   ext3
>>  9      352GB   492GB   140GB   logical   ext3
>
> It definitely looks strange.  Did you resize this partition after it 
> was created?  In any case, running "resize2fs /dev/sdb8" should 
> enlarge the
> filesystem to fill the partition.
>
> Cheers, Andreas
> -- 
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20100113/b28bc88a/attachment.htm>

From doug at warner.fm  Wed Jan 20 17:30:35 2010
From: doug at warner.fm (Doug Warner)
Date: Wed, 20 Jan 2010 12:30:35 -0500
Subject: Slow fsck on adaptec SAS/SATA raid
Message-ID: <4B573DBB.1070301@warner.fm>

I'm trying to do an fsck on an ext3 partition but I'm seeing abysmally slow
disk throughput; monitoring with "dstat" (like vmstat) shows ~1200-1500KB/s
throughput to the disks.  Even with 24hrs of fsck-ing I only get ~3% (still in
pass1).

The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and about 3.7TB on an
x86_64-based system with 4GB RAM.  e2fsprogs is 1.41.9.

During initial periods of the fsck I see throughput in the 30-70MB/s range
(ie, <0.4% complete).  Shortly after that throughput tanks and stays there.
Just a rough extrapolation of the size of my filesystem (3.4TB used; ~95%)
makes it look like this will take ~28 days to complete.  I'm using approx 12M
inodes out of my 243M available.

-Doug

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 190 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20100120/3395d5fd/attachment.sig>

From pg_ext3 at ext3.for.sabi.co.UK  Wed Jan 20 22:59:05 2010
From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi)
Date: Wed, 20 Jan 2010 22:59:05 +0000
Subject: Slow fsck on adaptec SAS/SATA raid
In-Reply-To: <4B573DBB.1070301@warner.fm>
References: <4B573DBB.1070301@warner.fm>
Message-ID: <19287.35513.523189.476521@tree.ty.sabi.co.uk>

>>> On Wed, 20 Jan 2010 12:30:35 -0500, Doug Warner
>>> <doug at warner.fm> said:

doug> I'm trying to do an fsck on an ext3 partition but I'm
doug> seeing abysmally slow disk throughput; monitoring with
doug> "dstat" (like vmstat) shows ~1200-1500KB/s throughput to
doug> the disks.

That seems pretty good to me.

Perhaps the impression of slowness is motivated by insufficient
understanding of how 'fsck' works and the IOP limitations of
small rotating mass storage arrays.

doug> Even with 24hrs of fsck-ing I only get ~3% (still in
doug> pass1).

That's pretty good too.

doug> The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and
doug> about 3.7TB on an x86_64-based system with 4GB RAM.
doug> e2fsprogs is 1.41.9.

There have been reports of a 1.5TB 'ext3' filesystem taking over
a month:

  http://www.sabi.co.uk/blog/anno05-4th.html#051009

even if in optimal cases it can be better:

  http://www.sabi.co.uk/blog/0802feb.html#080210

doug> Just a rough extrapolation of the size of my filesystem
doug> (3.4TB used; ~95%) makes it look like this will take ~28
doug> days to complete. I'm using approx 12M inodes out of my
doug> 243M available.

That sounds about right for lots of small files (~280KB average)
in a very large number of directories (or in directories that
are really very long), and which is 95% used.

You have designed that filesystem that way, and it is performing
as expected or better.

People who use filesystems as databases deserve whatn they get.


From sandeen at redhat.com  Fri Jan 22 20:01:48 2010
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 22 Jan 2010 14:01:48 -0600
Subject: Slow fsck on adaptec SAS/SATA raid
In-Reply-To: <4B573DBB.1070301@warner.fm>
References: <4B573DBB.1070301@warner.fm>
Message-ID: <4B5A042C.4020608@redhat.com>

Doug Warner wrote:
> I'm trying to do an fsck on an ext3 partition but I'm seeing abysmally slow
> disk throughput; monitoring with "dstat" (like vmstat) shows ~1200-1500KB/s
> throughput to the disks.  Even with 24hrs of fsck-ing I only get ~3% (still in
> pass1).
> 
> The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and about 3.7TB on an
> x86_64-based system with 4GB RAM.  e2fsprogs is 1.41.9.
> 
> During initial periods of the fsck I see throughput in the 30-70MB/s range
> (ie, <0.4% complete).  Shortly after that throughput tanks and stays there.
> Just a rough extrapolation of the size of my filesystem (3.4TB used; ~95%)
> makes it look like this will take ~28 days to complete.  I'm using approx 12M
> inodes out of my 243M available.

It'd be interesting to use blktrace and/or seekwatcher to see if you are seeking
madly all over the disk, that would certainly clobber perf.

-Eric
 
> -Doug


From dshaw at jabberwocky.com  Fri Jan 22 21:01:21 2010
From: dshaw at jabberwocky.com (David Shaw)
Date: Fri, 22 Jan 2010 16:01:21 -0500
Subject: Extended attributes being cleared by e2fsck?
Message-ID: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com>

Twice now I have rebooted a box and seen a hundred or so unexpected messages from e2fsck about extended attributes being cleared:

disk: Extended attribute in inode 1437565 has a value size (0) which is invalid
CLEARED.

(for many different inodes)

This filesystem has a lot of files with single-byte xattrs (it is user.test='x').  After the fsck, I looked at a few of the files that correspond to the inodes mentioned by e2fsck, and that xattr was missing.  However, some other files were not touched by e2fsck and still had the single-digit xattr.

The only other clue I have at the moment is that in at least one of the examples, the filesystem had just been resized (online) with resize2fs.

Both boxes are Fedora 11 with kernel-2.6.30.9-102.fc11.i586 and e2fsprogs-1.41.4-12.fc11.i586

Any suggestions on where to investigate next?

David


From rwheeler at redhat.com  Fri Jan 22 21:11:34 2010
From: rwheeler at redhat.com (Ric Wheeler)
Date: Fri, 22 Jan 2010 16:11:34 -0500
Subject: Extended attributes being cleared by e2fsck?
In-Reply-To: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com>
References: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com>
Message-ID: <4B5A1486.2020801@redhat.com>

On 01/22/2010 04:01 PM, David Shaw wrote:
> Twice now I have rebooted a box and seen a hundred or so unexpected messages from e2fsck about extended attributes being cleared:
>
> disk: Extended attribute in inode 1437565 has a value size (0) which is invalid
> CLEARED.
>
> (for many different inodes)
>
> This filesystem has a lot of files with single-byte xattrs (it is user.test='x').  After the fsck, I looked at a few of the files that correspond to the inodes mentioned by e2fsck, and that xattr was missing.  However, some other files were not touched by e2fsck and still had the single-digit xattr.
>
> The only other clue I have at the moment is that in at least one of the examples, the filesystem had just been resized (online) with resize2fs.
>
> Both boxes are Fedora 11 with kernel-2.6.30.9-102.fc11.i586 and e2fsprogs-1.41.4-12.fc11.i586
>
> Any suggestions on where to investigate next?
>
> David
>
>
>    

Hi David,

Could you open a fedora bugzilla ticket and fill in as much information 
as you have - the above would be a great start...

Thanks!

Ric


From tytso at mit.edu  Fri Jan 22 21:11:20 2010
From: tytso at mit.edu (tytso at mit.edu)
Date: Fri, 22 Jan 2010 16:11:20 -0500
Subject: Slow fsck on adaptec SAS/SATA raid
In-Reply-To: <4B573DBB.1070301@warner.fm>
References: <4B573DBB.1070301@warner.fm>
Message-ID: <20100122211120.GG21263@thunk.org>

On Wed, Jan 20, 2010 at 12:30:35PM -0500, Doug Warner wrote:
> I'm trying to do an fsck on an ext3 partition but I'm seeing abysmally slow
> disk throughput; monitoring with "dstat" (like vmstat) shows ~1200-1500KB/s
> throughput to the disks.  Even with 24hrs of fsck-ing I only get ~3% (still in
> pass1).
> 
> The filesystem is ext3 running "e2fsck -C0 /dev/sda3" and about 3.7TB on an
> x86_64-based system with 4GB RAM.  e2fsprogs is 1.41.9.
> 
> During initial periods of the fsck I see throughput in the 30-70MB/s range
> (ie, <0.4% complete).  Shortly after that throughput tanks and stays there.
> Just a rough extrapolation of the size of my filesystem (3.4TB used; ~95%)
> makes it look like this will take ~28 days to complete.  I'm using approx 12M
> inodes out of my 243M available.

Are you using some kind of backup scheme that creates huge number of
hard links to files?  E2fsck could be thrashing due to lack of memory
space.   

					- Ted


From dshaw at jabberwocky.com  Fri Jan 22 21:40:01 2010
From: dshaw at jabberwocky.com (David Shaw)
Date: Fri, 22 Jan 2010 16:40:01 -0500
Subject: Extended attributes being cleared by e2fsck?
In-Reply-To: <4B5A1486.2020801@redhat.com>
References: <23361D9D-0DB5-41EF-9D7F-6267C6873EDB@jabberwocky.com>
	<4B5A1486.2020801@redhat.com>
Message-ID: <E7CFBE4F-8471-41C6-8BBE-9ED5FB316F02@jabberwocky.com>

On Jan 22, 2010, at 4:11 PM, Ric Wheeler wrote:

> On 01/22/2010 04:01 PM, David Shaw wrote:
>> Twice now I have rebooted a box and seen a hundred or so unexpected messages from e2fsck about extended attributes being cleared:
>> 
>> disk: Extended attribute in inode 1437565 has a value size (0) which is invalid
>> CLEARED.
>> 
>> (for many different inodes)
>> 
>> This filesystem has a lot of files with single-byte xattrs (it is user.test='x').  After the fsck, I looked at a few of the files that correspond to the inodes mentioned by e2fsck, and that xattr was missing.  However, some other files were not touched by e2fsck and still had the single-digit xattr.
>> 
>> The only other clue I have at the moment is that in at least one of the examples, the filesystem had just been resized (online) with resize2fs.
>> 
>> Both boxes are Fedora 11 with kernel-2.6.30.9-102.fc11.i586 and e2fsprogs-1.41.4-12.fc11.i586
>> 
>> Any suggestions on where to investigate next?
>> 
>> David
>> 
>> 
>>   
> 
> Hi David,
> 
> Could you open a fedora bugzilla ticket and fill in as much information as you have - the above would be a great start...

Done: https://bugzilla.redhat.com/show_bug.cgi?id=557959

David


From jarmstrong at postpath.com  Fri Jan 22 21:49:32 2010
From: jarmstrong at postpath.com (Joe Armstrong)
Date: Fri, 22 Jan 2010 21:49:32 +0000 (GMT)
Subject: unsubscribe
Message-ID: <6EB2467CC553DE11AD6B00221957FF8E5D12AC@ppst1mlb01.ppst3.intra>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20100122/0fd696aa/attachment.htm>

From samix_119 at yahoo.com  Tue Jan 26 13:15:22 2010
From: samix_119 at yahoo.com (Muhammed Sameer)
Date: Tue, 26 Jan 2010 05:15:22 -0800 (PST)
Subject: Going readonly too frequently
Message-ID: <326877.26414.qm@web112608.mail.gq1.yahoo.com>

Hey,

* Our server is going readonly atleast 10 - 15 times a day with the below error

<snip>
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_free_blocks_sb: bit already cleared for block 53771686
Jan 26 12:57:36 mailbox kernel: Aborting journal on device sdb1.
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_truncate: Journal has aborted
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_orphan_del: Journal has aborted
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_delete_inode: Journal has aborted
Jan 26 12:57:36 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data
Jan 26 12:57:36 mailbox last message repeated 3 times
Jan 26 12:57:36 mailbox kernel: ext3_abort called.
Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Jan 26 12:57:36 mailbox kernel: Remounting filesystem read-only
Jan 26 12:58:04 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data
Jan 26 12:58:05 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data
</snip>

* Our kernel is 
2.6.18-182.el5

* Our OS is
Red Hat Enterprise Linux Server release 5.2 (Tikanga)

* We even tried fsck  but the problem persists


Regards,
Muhammed Sameer


From sandeen at redhat.com  Tue Jan 26 15:48:53 2010
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 26 Jan 2010 09:48:53 -0600
Subject: Going readonly too frequently
In-Reply-To: <326877.26414.qm@web112608.mail.gq1.yahoo.com>
References: <326877.26414.qm@web112608.mail.gq1.yahoo.com>
Message-ID: <4B5F0EE5.9060403@redhat.com>

Muhammed Sameer wrote:
> Hey,
> 
> * Our server is going readonly atleast 10 - 15 times a day with the below error

ouch - and even just once is "too frequently" :)
> 
> <snip>
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_free_blocks_sb: bit already cleared for block 53771686
> Jan 26 12:57:36 mailbox kernel: Aborting journal on device sdb1.
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_truncate: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_orphan_del: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in ext3_delete_inode: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data
> Jan 26 12:57:36 mailbox last message repeated 3 times
> Jan 26 12:57:36 mailbox kernel: ext3_abort called.
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
> Jan 26 12:57:36 mailbox kernel: Remounting filesystem read-only
> Jan 26 12:58:04 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data
> Jan 26 12:58:05 mailbox kernel: __journal_remove_journal_head: freeing b_committed_data
> </snip>
> 
> * Our kernel is 
> 2.6.18-182.el5
> 
> * Our OS is
> Red Hat Enterprise Linux Server release 5.2 (Tikanga)
> 
> * We even tried fsck  but the problem persists

It would probably be best to open a support ticket with Red Hat for this one,
since it's a Red Hat kernel.

If you have any way to reproduce it that'd be very useful information - 
perhaps even a test using an e2image of the filesystem in question.

Thanks,
-Eric

> Regards,
> Muhammed Sameer
> 
> 
>       
> 
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users


From adilger at sun.com  Tue Jan 26 23:09:32 2010
From: adilger at sun.com (Andreas Dilger)
Date: Tue, 26 Jan 2010 16:09:32 -0700
Subject: Going readonly too frequently
In-Reply-To: <4B5F0EE5.9060403@redhat.com>
References: <326877.26414.qm@web112608.mail.gq1.yahoo.com>
	<4B5F0EE5.9060403@redhat.com>
Message-ID: <20BC1566-10EB-4760-A50D-C13E1B9F3384@sun.com>

On 2010-01-26, at 08:48, Eric Sandeen wrote:
> Muhammed Sameer wrote:
>> Hey,
>>
>> * Our server is going readonly atleast 10 - 15 times a day with the  
>> below error
>
> ouch - and even just once is "too frequently" :)

>> <snip>
>> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1):  
>> ext3_free_blocks_sb: bit already cleared for block 53771686

I haven't seen any similar bug reports, and given that the RHEL5.2  
kernel has been out for a long time would steer me toward thinking  
this is a hardware error (bad RAM or cable, though only with PATA or  
SCSI drives).  You could check for this by converting the block  
numbers to hex, and looking for consistent bits being cleared.

> It would probably be best to open a support ticket with Red Hat for  
> this one,
> since it's a Red Hat kernel.
>
> If you have any way to reproduce it that'd be very useful  
> information -
> perhaps even a test using an e2image of the filesystem in question.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From darkonc at gmail.com  Wed Jan 27 07:25:23 2010
From: darkonc at gmail.com (Stephen Samuel (gmail))
Date: Tue, 26 Jan 2010 23:25:23 -0800
Subject: Going readonly too frequently
In-Reply-To: <326877.26414.qm@web112608.mail.gq1.yahoo.com>
References: <326877.26414.qm@web112608.mail.gq1.yahoo.com>
Message-ID: <6cd50f9f1001262325weee5738v3fa3fb1f65b05988@mail.gmail.com>

Clearly you have a corrupt filesystem. Of all these times that the
filesystem has gone read-only, how many times have you done FSCKs?  Have you
done two FSCKs in a row?  It's rare, but  I've seen a second FSCK find stuff
that was missed on the first run.

A smart report will at least tell you if the drive has been suffering errors
recently.

Do you have a SMART report for the drive?

On Tue, Jan 26, 2010 at 5:15 AM, Muhammed Sameer <samix_119 at yahoo.com>wrote:

> Hey,
>
> * Our server is going readonly atleast 10 - 15 times a day with the below
> error
>
> <snip>
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1):
> ext3_free_blocks_sb: bit already cleared for block 53771686
> Jan 26 12:57:36 mailbox kernel: Aborting journal on device sdb1.
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in
> ext3_reserve_inode_write: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in
> ext3_truncate: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in
> ext3_reserve_inode_write: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in
> ext3_orphan_del: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in
> ext3_reserve_inode_write: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1) in
> ext3_delete_inode: Journal has aborted
> Jan 26 12:57:36 mailbox kernel: __journal_remove_journal_head: freeing
> b_committed_data
> Jan 26 12:57:36 mailbox last message repeated 3 times
> Jan 26 12:57:36 mailbox kernel: ext3_abort called.
> Jan 26 12:57:36 mailbox kernel: EXT3-fs error (device sdb1):
> ext3_journal_start_sb: Detected aborted journal
> Jan 26 12:57:36 mailbox kernel: Remounting filesystem read-only
> Jan 26 12:58:04 mailbox kernel: __journal_remove_journal_head: freeing
> b_committed_data
> Jan 26 12:58:05 mailbox kernel: __journal_remove_journal_head: freeing
> b_committed_data
> </snip>
>
> * Our kernel is
> 2.6.18-182.el5
>
> * Our OS is
> Red Hat Enterprise Linux Server release 5.2 (Tikanga)
>
> * We even tried fsck  but the problem persists
>
>
> Regards,
> Muhammed Sameer
>
>
>
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>


-- 
Stephen Samuel http://www.bcgreen.com  Software, like love,
778-861-7641                              grows when you give it away
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20100126/4c7485f0/attachment.htm>