From a_lindeman at hotmail.com  Thu Mar  1 21:21:38 2007
From: a_lindeman at hotmail.com (Andy Lindeman)
Date: Thu, 01 Mar 2007 16:21:38 -0500
Subject: whoops, corrupted my filesystem
Message-ID: <BAY131-F2817C2642ADEE0D6B5374CF9800@phx.gbl>

Hi all-

I corrupted my filesystem by not doing a RTFM first...  I got an automated 
email that the process monitoring the SMART data from my hard drive detected 
a bad sector.  Not thinking (or RTFMing), I did a fsck on my partition- 
which is the main partition.  Now it appears that I've ruined the 
superblock.

I am running Fedora Core 6.  I am booting off the Fedora Core 6 Rescue CD in 
order to try to fix things (my system isn't bootable.)

Doing a e2fsck /dev/hda2 tells me that the superblock is corrupt.   When I 
do a mke2fs -n /dev/hda2, it tells me that other backups are stored on 
32768, 98304, 16840, 229376, 294912, 819200, 884736, 1605632, 265???? (cut 
off), 4096000, 7962624, 11239424, 20480000, 23887872.

When I try doing a e2fsck -b xxx /dev/hda2, on any of the superblocks <= 
4096000 I get the message that it's corrupted.  When I do >= 7962625, I get 
"Invalid argument while trying to open /dev/hda2."

By the way, there's some sort of weird Logical Volume thing going on with 
this partition.  On an old (out of date unfortunately) backup, the mtab file 
has it listed as /dev/mapper/VolGroup00-LogVol00.  Perhaps this partition 
can't be addressed as /dev/hda2 and it should be addressed differently??

Should I try a mke2fs -S on this drive or is there something else I should 
try first?  Everything I've read says to back up before mke2fs -S ing.  I 
have an external ext3 drive with enough space to hold this mangled partition 
on it, although it currently has a single ext3 partition.  Is there a way to 
copy the contents of the mangaled partition to the external ext3 partition 
w/o deleting what's already on it or resizing it and creating a 2nd 
partition?

If it is suggested that I try a mke2fs -S, how does that work?  mke2fs -n 
tells me that:

Block size=4096 (log=2)
Fragment size=4096 (log=2)
30523392 inodes, 61022902 blocks
3051145 blocks
First data block=0
Maximum filesystem blocsk=0
1863 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group

Thanks much for any help!  I'd love to recover this instead of having to 
rebuild my linux PC!

Andrew


ps- This is a 250 GB Parallel ATA drive.

_________________________________________________________________
Rates near 39yr lows!  $430K Loan for $1,399/mo - Paying Too Much? Calculate 
new payment 
http://www.lowermybills.com/lre/index.jsp?sourceid=lmb-9632-18226&moid=7581


From lakshmipathi.g at gmail.com  Fri Mar  2 07:53:14 2007
From: lakshmipathi.g at gmail.com (lakshmi pathi)
Date: Fri, 2 Mar 2007 13:23:14 +0530
Subject: Hi all
In-Reply-To: <20070226150818.26821.qmail@webmail89.rediffmail.com>
References: <20070226150818.26821.qmail@webmail89.rediffmail.com>
Message-ID: <ae2f51270703012353k4de94199re6bef6226d9d0c15@mail.gmail.com>

Hi,
basically i would like to know,is possible to include this package in Redhat?
is there any review panel to submit ur tools so that they can be
released  in distribution...
what's the procedure to be followed?
Thx in advance.

On 26 Feb 2007 15:08:18 -0000, bimal  pandit
<bimal_pandit at rediffmail.com> wrote:
>
>
>  Dear Laxmi,
>
>
>  On Mon, 26 Feb 2007 laksmi pathi wrote :
>
>  >Hi Beos,
>  >It's true you can't recover files from ext3 since file address are
>  >zeroed out  while deleting.
>  >This tool is crash proof recovery tool.
>  >You can the recover the files which are deleted only after it's
>  >installation.The concept is, once you install the tool,It make backup
>  >copy of  your files addresses.When you delete a file , it's address in
>  >inode is deleted ...but we can access file from it's address which we
>  >copied earlier-provided the content is not overwirtte-So it's like a
>  >crash proof tool.
>  >Hi Bruno Wolff ,
>  >Yes it's always  better to take regular backup-
>  >and fellow developers in freshmeat tested and rated this tool,
>  >i assume they are quite satisfied with the tool.
>  >Please check out :
>  >http://freshmeat.net/projects/giis/
>  >
>  >Warm Regards,
>  >Lakshmipathi.G
>  >
>  >
>  >
>  >
>  >On 2/25/07, Bruno Wolff III <bruno at wolff.to> wrote:
>  >>On Sat, Feb 24, 2007 at 22:19:02 -0800,
>  >>  "..:::BeOS Mr. X:::.." <mr._x at shaw.ca> wrote:
>  >> > Yes, but I always here that recover from ext3 is not possible...
>  >> > possibly explain some of the technology ? I have interest in using the
>  >> > program if I can in fact figure out how to use it. I accidently
> recently
>  >> > deleted a music folder with many mp3 files in it.
>  >>
>  >>You are probably better off regularly making backups rather than beta
> testing
>  >>This software.
>  >>
>  >
>
>  great job, will test it and would be keen to help and support you to the
> extent and the way I could be ...
>
>  regards,
>
>  Bimal Pandit
>
>
>


From adilger at clusterfs.com  Fri Mar  2 10:51:28 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 2 Mar 2007 18:51:28 +0800
Subject: whoops, corrupted my filesystem
In-Reply-To: <BAY131-F2817C2642ADEE0D6B5374CF9800@phx.gbl>
References: <BAY131-F2817C2642ADEE0D6B5374CF9800@phx.gbl>
Message-ID: <20070302105128.GU6573@schatzie.adilger.int>

On Mar 01, 2007  16:21 -0500, Andy Lindeman wrote:
> Doing a e2fsck /dev/hda2 tells me that the superblock is corrupt.   When I 
> do a mke2fs -n /dev/hda2, it tells me that other backups are stored on 
> 32768, 98304, 16840, 229376, 294912, 819200, 884736, 1605632, 265???? (cut 
> off), 4096000, 7962624, 11239424, 20480000, 23887872.
> 
> When I try doing a e2fsck -b xxx /dev/hda2, on any of the superblocks <= 
> 4096000 I get the message that it's corrupted.  When I do >= 7962625, I get 
> "Invalid argument while trying to open /dev/hda2."
> 
> By the way, there's some sort of weird Logical Volume thing going on with 
> this partition.  On an old (out of date unfortunately) backup, the mtab 
> file has it listed as /dev/mapper/VolGroup00-LogVol00.  Perhaps this 
> partition can't be addressed as /dev/hda2 and it should be addressed 
> differently??

Correct.  You should be running e2fsck /dev/mapper/VolGroup00-LogVol00
instead of /dev/hda2.  That's likely why your filesystem is "corrupted"...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From a_lindeman at hotmail.com  Fri Mar  2 11:43:34 2007
From: a_lindeman at hotmail.com (Andy Lindeman)
Date: Fri, 02 Mar 2007 06:43:34 -0500
Subject: whoops, corrupted my filesystem
In-Reply-To: <20070302105128.GU6573@schatzie.adilger.int>
Message-ID: <BAY131-F315801D48130DB9C448AF1F9870@phx.gbl>

Hi Andreas-

Is it known what happens when e2fsck is run on /dev/hda2 instead of the 
volume device?

I've run e2fsck on /dev/mapper/VolGroup00-LogVol00 and it gives me multiple 
"Block bitmap for group 0 is not in group. (block XXXXXX) Relocate<y>?".   I 
select y (actually, I ran with automatic mode.)  This doesn't seem to help 
matters.  When I rerun e2fsck, I get the same errors on the same blocks.

Thanks for your help!  Andy


----Original Message Follows----
From: Andreas Dilger <adilger at clusterfs.com>
To: Andy Lindeman <a_lindeman at hotmail.com>

Correct.  You should be running e2fsck /dev/mapper/VolGroup00-LogVol00
instead of /dev/hda2.  That's likely why your filesystem is "corrupted"...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

_________________________________________________________________
Mortgage rates as low as 4.625% - Refinance $150,000 loan for $579 a month. 
Intro*Terms  
https://www2.nextag.com/goto.jsp?product=100000035&url=%2fst.jsp&tm=y&search=mortgage_text_links_88_h27f6&disc=y&vers=743&s=4056&p=5117


From Matt_Dodson at messageone.com  Mon Mar  5 22:18:23 2007
From: Matt_Dodson at messageone.com (Matt Dodson)
Date: Mon, 5 Mar 2007 16:18:23 -0600
Subject: Missing blocks
Message-ID: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com>

Hopefully this is a simple issue or just my ignorance on the results
returned by "df -k" but can anyone explain why the available block is 0
if total 1k-blocks - Used is greater than 0? 


#df -k /ems/bigdisk/
Filesystem          			1K-blocks      	Used
Available 	Use% 	Mounted on
/dev/mapper/vg0-bigdisk                     397367512 	383562960
0 		100% 	/<name>

Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          de2b600f-120d-41d2-ba23-b48b50705432
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal resize_inode dir_index filetype
needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              50462720
Block count:              100925440
Reserved block count:     5046160
Free blocks:              3174088
Free inodes:              45030587
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1021
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Fri May 12 08:43:41 2006
Last mount time:          Sun Mar  4 23:26:08 2007
Last write time:          Sun Mar  4 23:37:03 2007
Mount count:              5
Maximum mount count:      28
Last checked:             Fri May 12 08:43:41 2006
Check interval:           15552000 (6 months)
Next check after:         Wed Nov  8 07:43:41 2006
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      f127e09e-0c0b-4f18-9e81-d822f8eadf4a
Journal backup:           inode blocks


Kernel  2.6.9-34.0.2.ELsmp
e2fsprogs-1.35-12.3.EL4

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070305/adc38b1a/attachment.htm>

From jburgess777 at googlemail.com  Mon Mar  5 22:39:34 2007
From: jburgess777 at googlemail.com (Jon Burgess)
Date: Mon, 05 Mar 2007 22:39:34 +0000
Subject: Missing blocks
In-Reply-To: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com>
References: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com>
Message-ID: <1173134374.29303.6.camel@localhost.localdomain>

On Mon, 2007-03-05 at 16:18 -0600, Matt Dodson wrote:
> Hopefully this is a simple issue or just my ignorance on the results
> returned by ?df -k? but can anyone explain why the available block is
> 0 if total 1k-blocks ? Used is greater than 0? 
> 
> 
You have 5% reserved for root use only, this is 20GB on your current
filesystem. See the -m option in 'man mke2fs' for details. 

tune2fs can adjust this if the filesystem is unmounted. 


> #df -k /ems/bigdisk/
> 
> Filesystem                              1K-blocks       Used
> Available       Use%    Mounted on
> 
> /dev/mapper/vg0-bigdisk                     397367512   383562960
> 0             100%    /<name>
> 

> Block count:              100925440
> 
> Reserved block count:     5046160
> 
> Free blocks:              3174088

> Block size:               4096

Above we see 5046160 x 4096 bytes are reserved.

	Jon


From Matt_Dodson at messageone.com  Mon Mar  5 22:45:27 2007
From: Matt_Dodson at messageone.com (Matt Dodson)
Date: Mon, 5 Mar 2007 16:45:27 -0600
Subject: Missing blocks
In-Reply-To: <1173134374.29303.6.camel@localhost.localdomain>
References: <44B5599C8B5B1347AFF903FDCEC003070174EA85@auscorpex-1.austin.messageone.com>
	<1173134374.29303.6.camel@localhost.localdomain>
Message-ID: <44B5599C8B5B1347AFF903FDCEC003070174EAB7@auscorpex-1.austin.messageone.com>

Thanks for explaining this to me. 

--------------------
 
Matt Dodson
Infrastructure Engineer
matt_dodson at messageone.com
http://www.messageone.com
 
MessageOne
11044 Research Blvd
Building C, Fith Floor
Austin, Tx 78759
(512) 652-4500 (office)

-----Original Message-----
From: Jon Burgess [mailto:jburgess777 at googlemail.com] 
Sent: Monday, March 05, 2007 4:40 PM
To: Matt Dodson
Cc: ext3-users at redhat.com
Subject: Re: Missing blocks

On Mon, 2007-03-05 at 16:18 -0600, Matt Dodson wrote:
> Hopefully this is a simple issue or just my ignorance on the results
> returned by "df -k" but can anyone explain why the available block is
> 0 if total 1k-blocks - Used is greater than 0? 
> 
> 
You have 5% reserved for root use only, this is 20GB on your current
filesystem. See the -m option in 'man mke2fs' for details. 

tune2fs can adjust this if the filesystem is unmounted. 


> #df -k /ems/bigdisk/
> 
> Filesystem                              1K-blocks       Used
> Available       Use%    Mounted on
> 
> /dev/mapper/vg0-bigdisk                     397367512   383562960
> 0             100%    /<name>
> 

> Block count:              100925440
> 
> Reserved block count:     5046160
> 
> Free blocks:              3174088

> Block size:               4096

Above we see 5046160 x 4096 bytes are reserved.

	Jon


From aj at dungeon.inka.de  Tue Mar  6 06:28:44 2007
From: aj at dungeon.inka.de (Andreas Jellinghaus)
Date: Tue, 06 Mar 2007 07:28:44 +0100
Subject: resume from swap files
Message-ID: <20070306062847.4F25E22A910@dungeon.inka.de>

Hi,

the latest kernel supports swap files, so I guess the resume
code also works with those. So I wonder: is this still a good
idea with ext3? As far as I know there is no such thing
as a "mount read-only" with journalling filesystems -
ext3 when mounted will always detect that it is not clean
and replay the journal etc.

So what do you think? Is it ok to use ext3 with swap files
and suspend/resume? Or is that a combination asking for trouble?
(The filesystem would only be used as in "mount /; resume",
i.e. no other write operations, but it needs to be mounted for
resume to work.)

Thanks for your advise.

Regards, Andreas


From jlforrest at berkeley.edu  Mon Mar 12 15:29:07 2007
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Mon, 12 Mar 2007 08:29:07 -0700
Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS Partition
 Table?
Message-ID: <45F571C3.9090303@berkeley.edu>

(I've already sent this message to Ted Ts'o directly. I should
have sent it to this list first but I didn't know about it
until today. My apologies to Ted.)

Last Friday a system that I just inherited refused to mount
a file system that had been working fine for about 6 months.
This is on a Scientific Linux 4.3 system using a 2.6.9
kernel. This is another Linux distribution based on RHEL 4.
I don't think the actual hardware is relevant
here so I won't mention it. If there's more information you'd
like to see I'd be happy to provide it.

It turns out that this 4.2TB file system was created in an
msdos partition table, as shown below:

----
GNU Parted 1.6.19
Using /dev/sdb
(parted) p
Disk geometry for /dev/sdb: 0.000-4291443.000 megabytes
Disk label type: msdos
Minor    Start       End     Type      Filesystem  Flags
1          0.031  97137.567  primary   ext3
----

Running fsck fails as shown below:

----
e2fsck 1.35 (28-Feb-2004)
The filesystem size (according to the superblock) is 1098609033 blocks
The physical size of the device is 24867209 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort<y>? yes

Error reading block 24870914 (Invalid argument) while doing inode scan.
----

I have 2 questions:

1) How did this system run just file for ~6 months using this
file system as a /home? I'm suspecting that the problem
actually occurred long ago when the file system allocated
meta or user data in blocks that are somehow unreachable
by fsck but exactly how this could have happened isn't
clear. Although it's too late now, I'd really like
to know what happened.

2) Given that this happened, how can I recover as many
files as possible from this file system? The professor
who owns this system had put his faith in hardware
RAID so he had never backed it up. He's very nervous
right now.

Any information or help you can provide would be
very much appreciated.

Cordially,
Jon Forrest
Unix Computing Support
College of Chemistry
Univ. of Cal. Berkeley
173 Tan Hall
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From ling at fnal.gov  Mon Mar 12 20:05:21 2007
From: ling at fnal.gov (Ling C. Ho)
Date: Mon, 12 Mar 2007 15:05:21 -0500
Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS
 Partition Table?
In-Reply-To: <45F571C3.9090303@berkeley.edu>
References: <45F571C3.9090303@berkeley.edu>
Message-ID: <45F5B281.5060403@fnal.gov>

Can u recreate your sdb1 using parted, but specifying a different end 
size, or just use "-1" ? And maybe try changing the label to "gpt" ? 
Then run e2fsck -n and see what it does. I wonder how you were able to 
create a 4TB ext3 filesystem with the msdos label under SL4.3. Never 
worked for me without the labelling it gpt.

Jon Forrest wrote:
> (I've already sent this message to Ted Ts'o directly. I should
> have sent it to this list first but I didn't know about it
> until today. My apologies to Ted.)
>
> Last Friday a system that I just inherited refused to mount
> a file system that had been working fine for about 6 months.
> This is on a Scientific Linux 4.3 system using a 2.6.9
> kernel. This is another Linux distribution based on RHEL 4.
> I don't think the actual hardware is relevant
> here so I won't mention it. If there's more information you'd
> like to see I'd be happy to provide it.
>
> It turns out that this 4.2TB file system was created in an
> msdos partition table, as shown below:
>
> ----
> GNU Parted 1.6.19
> Using /dev/sdb
> (parted) p
> Disk geometry for /dev/sdb: 0.000-4291443.000 megabytes
> Disk label type: msdos
> Minor    Start       End     Type      Filesystem  Flags
> 1          0.031  97137.567  primary   ext3
> ----
>
> Running fsck fails as shown below:
>
> ----
> e2fsck 1.35 (28-Feb-2004)
> The filesystem size (according to the superblock) is 1098609033 blocks
> The physical size of the device is 24867209 blocks
> Either the superblock or the partition table is likely to be corrupt!
> Abort<y>? yes
>
> Error reading block 24870914 (Invalid argument) while doing inode scan.
> ----
>
> I have 2 questions:
>
> 1) How did this system run just file for ~6 months using this
> file system as a /home? I'm suspecting that the problem
> actually occurred long ago when the file system allocated
> meta or user data in blocks that are somehow unreachable
> by fsck but exactly how this could have happened isn't
> clear. Although it's too late now, I'd really like
> to know what happened.
>
> 2) Given that this happened, how can I recover as many
> files as possible from this file system? The professor
> who owns this system had put his faith in hardware
> RAID so he had never backed it up. He's very nervous
> right now.
>
> Any information or help you can provide would be
> very much appreciated.
>
> Cordially,
> Jon Forrest
> Unix Computing Support
> College of Chemistry
> Univ. of Cal. Berkeley
> 173 Tan Hall
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users


From jlforrest at berkeley.edu  Mon Mar 12 21:00:26 2007
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Mon, 12 Mar 2007 14:00:26 -0700
Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS
 Partition Table?
In-Reply-To: <45F5B281.5060403@fnal.gov>
References: <45F571C3.9090303@berkeley.edu> <45F5B281.5060403@fnal.gov>
Message-ID: <45F5BF6A.8000701@berkeley.edu>

Ling C. Ho wrote:
> Can u recreate your sdb1 using parted, but specifying a different end 
> size, or just use "-1" ? And maybe try changing the label to "gpt" ? 
> Then run e2fsck -n and see what it does.

I'll add this to the small collection of suggestions. I clearly
have to be very careful in what I do to restore this because
I'll probably only have one chance.

> I wonder how you were able to 
> create a 4TB ext3 filesystem with the msdos label under SL4.3. Never 
> worked for me without the labelling it gpt.

There are two mysteries in my mind - 1) how the file system was allowed
to be created, and 2) what was the exact scenario that caused
the corruption, i.e. what is it about an msdos partition table
that causes problems when a file system is >2TB.

As for #1, I didn't create the file system. This is on a cluster
that I recently took over managing. The file system was created
before I started here. However, the person who did it is quite
knowledgeable. Since it was done on a system running Scientific
Linux 4.3, which is based on a fairly old kernel and tools,
I'm wondering if the tools didn't recognize the dangerous
configuration. Ted Ts'o was surprised to hear about this himself.

Regarding #2, there are a number of places where very knowledgeable
people describe the danger in creating >2TB file systems on msdos
partition tables but I haven't seen an explanation of the fundemental
problem. I would love to learn this (I'm not doubting that it's true).

Cordially,

-- 
Jon Forrest
Unix Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From jlb17 at duke.edu  Mon Mar 12 21:25:03 2007
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Mon, 12 Mar 2007 17:25:03 -0400 (EDT)
Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS
 Partition Table?
In-Reply-To: <45F5BF6A.8000701@berkeley.edu>
References: <45F571C3.9090303@berkeley.edu> <45F5B281.5060403@fnal.gov>
	<45F5BF6A.8000701@berkeley.edu>
Message-ID: <alpine.LRH.0.83.0703121719230.27016@chaos.egr.duke.edu>

On Mon, 12 Mar 2007 at 2:00pm, Jon Forrest wrote

> Regarding #2, there are a number of places where very knowledgeable
> people describe the danger in creating >2TB file systems on msdos
> partition tables but I haven't seen an explanation of the fundemental
> problem. I would love to learn this (I'm not doubting that it's true).

AIUI, msdos disk labels use a 32bit integer to describe the length of a 
partition.  2^32*512byte blocks=2TiB.

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


From bdavids1 at gmu.edu  Mon Mar 12 21:40:48 2007
From: bdavids1 at gmu.edu (Brian Davidson)
Date: Mon, 12 Mar 2007 17:40:48 -0400
Subject: e2fsck hanging
Message-ID: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>

I'm trying to run e2fsck on a ~6TB filesystem which is about 90%  
full.  We're doing backup to disk to this filesystem, and have a  
number of hard links (link counts up to 90).

strace shows:

write(1, "Pass 2: Checking ", 17)       = 17
write(1, "directory", 9)                = 9
write(1, " structure\n", 11)            = 11
mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,  
-1, 0) = 0x2b4299dbd000
mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,  
-1, 0) = 0x2b429f512000
mmap(NULL, 506724352, PROT_READ|PROT_WRITE, MAP_PRIVATE| 
MAP_ANONYMOUS, -1, 0) = 0x2b42a4c67000
mmap(NULL, 596029440, PROT_READ|PROT_WRITE, MAP_PRIVATE| 
MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x23e56000)                         = 0x5eb000
mmap(NULL, 596164608, PROT_READ|PROT_WRITE, MAP_PRIVATE| 
MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS| 
MAP_NORESERVE, -1, 0) = 0x2b430a09e000
munmap(0x2b430a09e000, 401408)          = 0
munmap(0x2b430a200000, 647168)          = 0
mprotect(0x2b430a100000, 135168, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 596029440, PROT_READ|PROT_WRITE, MAP_PRIVATE| 
MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
lseek(3, 6303744, SEEK_SET)             = 6303744
read(3, "\2\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\f\0\2\2..\0\0\v\0\0\0 
\24"..., 4096) = 4096
lseek(3, 6307840, SEEK_SET)             = 6307840
read(3, "\v\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\364\17\2\2..\0\0\0\0\0"...,  
4096) = 4096
lseek(3, 6311936, SEEK_SET)             = 6311936
read(3, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,  
4096) = 4096
lseek(3, 6316032, SEEK_SET)             = 6316032
read(3, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,  
4096) = 4096
lseek(3, 6320128, SEEK_SET)             = 6320128
read(3, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,  
4096) = 4096
lseek(3, 41709568, SEEK_SET)            = 41709568
read(3, "\323\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\324"...,  
4096) = 4096
lseek(3, 41713664, SEEK_SET)            = 41713664
read(3, "\324\0\0\0\f\0\1\2.\0\0\0\323\0\0\0\f\0\2\2..\0\0\214 
\300"..., 4096) = 4096
lseek(3, 41717760, SEEK_SET)            = 41717760
read(3, "\325\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\326"...,  
4096) = 4096

And, that's it.  No more output.

A backtrace from gdb shows:

(gdb) bt
#0  0x0000000000418aa5 in get_icount_el (icount=0x5cf170,
ino=732562070, create=1) at icount.c:251
#1  0x0000000000418dd7 in ext2fs_icount_increment (icount=0x5cf170,
ino=732562070, ret=0x7fffffa79a96)
      at icount.c:339
#2  0x000000000040a3cf in check_dir_block (fs=0x5af560,
db=0x2b7070cc6064, priv_data=0x7fffffa79c90) at pass2.c:1021
#3  0x0000000000416c69 in ext2fs_dblist_iterate (dblist=0x5c3f20,
func=0x409980 <check_dir_block>,
      priv_data=0x7fffffa79c90) at dblist.c:234
#4  0x0000000000408d9d in e2fsck_pass2 (ctx=0x5ae700) at pass2.c:149
#5  0x0000000000403102 in e2fsck_run (ctx=0x5ae700) at e2fsck.c:193
#6  0x0000000000401e50 in main (argc=Variable "argc" is not available.
) at unix.c:1075


It's stuck inside the while loop in get_icount_el() (line 251).

I've added more memory to the server (up to 6 GB now), and am re- 
running e2fsck.  Additionally, I upped /proc/sys/vm/max_map_count to  
20,000,000 (just pulled that number out of the air).  It takes 6 or 7  
hours to get the part where it locks up, so I'm not sure if this is  
going to help or not.  I figured while it's running I would post here  
to see if anyone has any additional insights.

Thanks!

Brian Davidson
George Mason University


From maxi.belino at gmail.com  Mon Mar 12 22:44:56 2007
From: maxi.belino at gmail.com (Maxi Belino)
Date: Mon, 12 Mar 2007 19:44:56 -0300
Subject: Error mounting
Message-ID: <29180abb0703121544s627df8dev5495bf165a10cf90@mail.gmail.com>

Hi all,

i'm new in the list so i'm sorry if this i'm posting is off-topic or it was
already answered before.

I'm having this problem; i've got an ext3 8GB partition and it doesn't
mount, the cause of this: a user (yes me!) running fsck.ext3 with the
filesystem mounted, ups! (snif, forgive me!!, totally newbie and mad)

errors while booting:

EXT3_fs error (device hda4): ext3_check_descriptors: Blockbitmap for group 0
not in group (block 41471)!
 EXT3_fs: group descriptors corrupted
mount: error 22 mounting ext3flags defaults
well, retrying without the options flags

and repeats this again twice; then:

pivotroot:pivot_root (/sysroot, /sysroot/initrd) failed:2
umont /initrd/sys failed: 2
 umont /initrd/proc failed: 2
Initrd finished
Freeing unused kernel memory: 240 K freed
Kernel panic - not syncing: No init found. Try passing init= option to
kernel


and it freezes

Booting with Knoppix 3.2 it mounts all partitions but hda4, it gives this
error:

mount: wrong fs type , bad option, bad superblock on /dev/hda4,
or too many mounted file systems


i've already test running dd_rhelp and it grabs an 8GB file without problems
but then, i can't mount it (using mount -o loop ...)

If there's a solution or any chance i can get data from this partition i
would love to hear how, if i'm really fried i'm already prepared.

regards,
Maxi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070312/74cc44b1/attachment.htm>

From bdavids1 at gmu.edu  Tue Mar 13 04:04:47 2007
From: bdavids1 at gmu.edu (Brian Davidson)
Date: Tue, 13 Mar 2007 00:04:47 -0400
Subject: e2fsck hanging
In-Reply-To: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
Message-ID: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>

Here's strace when running w/ 6GB of memory & with max_map_count set  
to 20000000.  It looks like that got rid of the ENOMEM's from mmap,  
but it's still hanging in the same place...

write(1, "Pass 2: Checking ", 17)       = 17
write(1, "directory", 9)                = 9
write(1, " structure\n", 11)            = 11
mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,  
-1, 0) = 0x2b1078c55000
mmap(NULL, 91574272, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,  
-1, 0) = 0x2b107e3aa000
mmap(NULL, 501645312, PROT_READ|PROT_WRITE, MAP_PRIVATE| 
MAP_ANONYMOUS, -1, 0) = 0x2b1083aff000
mmap(NULL, 588230656, PROT_READ|PROT_WRITE, MAP_PRIVATE| 
MAP_ANONYMOUS, -1, 0) = 0x2b10a1967000
munmap(0x2b10a1967000, 588230656)       = 0
lseek(5, 6303744, SEEK_SET)             = 6303744
read(5, "\2\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\f\0\2\2..\0\0\v\0\0\0 
\24"..., 4096) = 4096
lseek(5, 6307840, SEEK_SET)             = 6307840
read(5, "\v\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\364\17\2\2..\0\0\0\0\0"...,  
4096) = 4096
lseek(5, 6311936, SEEK_SET)             = 6311936
read(5, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,  
4096) = 4096
lseek(5, 6316032, SEEK_SET)             = 6316032
read(5, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,  
4096) = 4096
lseek(5, 6320128, SEEK_SET)             = 6320128
read(5, "\0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,  
4096) = 4096
lseek(5, 41709568, SEEK_SET)            = 41709568
read(5, "\323\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\324"...,  
4096) = 4096
lseek(5, 41713664, SEEK_SET)            = 41713664
read(5, "\324\0\0\0\f\0\1\2.\0\0\0\323\0\0\0\f\0\2\2..\0\0\214 
\300"..., 4096) = 4096
lseek(5, 41717760, SEEK_SET)            = 41717760
read(5, "\325\0\0\0\f\0\1\2.\0\0\0\226\2\252+\f\0\2\2..\0\0\326"...,  
4096) = 4096

The backtrace seems to be essentially the same:

(gdb) bt
#0  0x0000000000418aa5 in get_icount_el (icount=0x5cf170,  
ino=732562070, create=1) at icount.c:251
#1  0x0000000000418dd7 in ext2fs_icount_increment (icount=0x5cf170,  
ino=732562070, ret=0x7fffffad6e06)
     at icount.c:339
#2  0x000000000040a3cf in check_dir_block (fs=0x5af560,  
db=0x2b1011a88064, priv_data=0x7fffffad7000) at pass2.c:1021
#3  0x0000000000416c69 in ext2fs_dblist_iterate (dblist=0x5c3f20,  
func=0x409980 <check_dir_block>,
     priv_data=0x7fffffad7000) at dblist.c:234
#4  0x0000000000408d9d in e2fsck_pass2 (ctx=0x5ae700) at pass2.c:149
#5  0x0000000000403102 in e2fsck_run (ctx=0x5ae700) at e2fsck.c:193
#6  0x0000000000401e50 in main (argc=Variable "argc" is not available.
) at unix.c:1075
#7  0x0000000000421161 in __libc_start_main ()
#8  0x000000000040018a in _start ()
#9  0x00007fffffad7508 in ?? ()
#10 0x0000000000000000 in ?? ()

Additional info:

$ cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 4)

$ uname -a
Linux XXXXX.gmu.edu 2.6.16 #1 SMP Mon Mar 27 16:56:51 EST 2006 x86_64  
x86_64 x86_64 GNU/Linux

$ e2fsck -V
e2fsck 1.35 (28-Feb-2004)
         Using EXT2FS Library version 1.35, 28-Feb-2004

$ rpm -q e2fsprogs
e2fsprogs-1.35-12.4.EL4

Brian Davidson
George Mason University


From adilger at clusterfs.com  Tue Mar 13 07:04:33 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 13 Mar 2007 03:04:33 -0400
Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS
	Partition Table?
In-Reply-To: <45F571C3.9090303@berkeley.edu>
References: <45F571C3.9090303@berkeley.edu>
Message-ID: <20070313070433.GL5266@schatzie.adilger.int>

On Mar 12, 2007  08:29 -0700, Jon Forrest wrote:
> Last Friday a system that I just inherited refused to mount
> a file system that had been working fine for about 6 months.
> This is on a Scientific Linux 4.3 system using a 2.6.9
> kernel. This is another Linux distribution based on RHEL 4.
> I don't think the actual hardware is relevant
> here so I won't mention it. If there's more information you'd
> like to see I'd be happy to provide it.
> 
> ----
> e2fsck 1.35 (28-Feb-2004)
> The filesystem size (according to the superblock) is 1098609033 blocks
> The physical size of the device is 24867209 blocks
> Either the superblock or the partition table is likely to be corrupt!
> Abort<y>? yes
> 
> Error reading block 24870914 (Invalid argument) while doing inode scan.

Did you recently update your kernel?  Is your kernel using CONFIG_LBD?
If CONFIG_LBD is not set, then any use of > 2TB is completely unsafe.
It will silently and fatally corrupt your filesystem.  I'd pointed this
out previously, but the patch I submitted wasn't accepted.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From adilger at clusterfs.com  Tue Mar 13 07:27:32 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 13 Mar 2007 03:27:32 -0400
Subject: e2fsck hanging
In-Reply-To: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
	<749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
Message-ID: <20070313072732.GP5266@schatzie.adilger.int>

On Mar 13, 2007  00:04 -0400, Brian Davidson wrote:
> Here's strace when running w/ 6GB of memory & with max_map_count set  
> to 20000000.  It looks like that got rid of the ENOMEM's from mmap,  
> but it's still hanging in the same place...
> 
> The backtrace seems to be essentially the same:
> 
> (gdb) bt
> #0  0x0000000000418aa5 in get_icount_el (icount=0x5cf170,  
> ino=732562070, create=1) at icount.c:251
> #1  0x0000000000418dd7 in ext2fs_icount_increment (icount=0x5cf170,  
> ino=732562070, ret=0x7fffffad6e06)
>     at icount.c:339
> #2  0x000000000040a3cf in check_dir_block (fs=0x5af560,  
> db=0x2b1011a88064, priv_data=0x7fffffad7000) at pass2.c:1021
> #3  0x0000000000416c69 in ext2fs_dblist_iterate (dblist=0x5c3f20,  
> func=0x409980 <check_dir_block>,
>     priv_data=0x7fffffad7000) at dblist.c:234
> #4  0x0000000000408d9d in e2fsck_pass2 (ctx=0x5ae700) at pass2.c:149
> #5  0x0000000000403102 in e2fsck_run (ctx=0x5ae700) at e2fsck.c:193
> #6  0x0000000000401e50 in main (argc=Variable "argc" is not available.

The icount implementation assumes that the number of hard-linked files
is very low in comparison to the number of singly-linked files.  It
uses a linear list to look up the hard-linked inodes.  I suspect it
needs some algorithm lovin' to make it into a hash table (possibly
multi-level) if the number of links becomes too large in a given bucket.

We could consider the common case to be a single hash bucket if that
makes the code simpler and more efficient.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From adilger at clusterfs.com  Tue Mar 13 07:38:09 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 13 Mar 2007 03:38:09 -0400
Subject: Error mounting
In-Reply-To: <29180abb0703121544s627df8dev5495bf165a10cf90@mail.gmail.com>
References: <29180abb0703121544s627df8dev5495bf165a10cf90@mail.gmail.com>
Message-ID: <20070313073809.GR5266@schatzie.adilger.int>

On Mar 12, 2007  19:44 -0300, Maxi Belino wrote:
> I'm having this problem; i've got an ext3 8GB partition and it doesn't
> mount, the cause of this: a user (yes me!) running fsck.ext3 with the
> filesystem mounted, ups! (snif, forgive me!!, totally newbie and mad)

e2fsprogs should not allow you to run e2fsck while the filesystem is
mounted.

> If there's a solution or any chance i can get data from this partition i
> would love to hear how, if i'm really fried i'm already prepared.

Try e2fsck with a backup superblock (-b), not sure what else to try.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From tytso at mit.edu  Tue Mar 13 13:53:27 2007
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 13 Mar 2007 09:53:27 -0400
Subject: e2fsck hanging
In-Reply-To: <749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
	<749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
Message-ID: <20070313135326.GA7362@thunk.org>

At a first glance your report looks vaguely like this bugreport:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411838

I've been crazy busy the last few weeks so I haven't had a chance to
look at it yet.  There is a suggested fix in the above bug report, but
not a patch, and I haven't had time to validate it yet.

Regards,

						- Ted


From bdavids1 at gmu.edu  Tue Mar 13 14:59:43 2007
From: bdavids1 at gmu.edu (Brian Davidson)
Date: Tue, 13 Mar 2007 10:59:43 -0400
Subject: e2fsck hanging
In-Reply-To: <20070313135326.GA7362@thunk.org>
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
	<749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
	<20070313135326.GA7362@thunk.org>
Message-ID: <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu>

On Mar 13, 2007, at 9:53 AM, Theodore Tso wrote:

> At a first glance your report looks vaguely like this bugreport:
>
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411838
>
> I've been crazy busy the last few weeks so I haven't had a chance to
> look at it yet.  There is a suggested fix in the above bug report, but
> not a patch, and I haven't had time to validate it yet.
>
> Regards,
>
> 						- Ted

Yes, that's the same issue.  We reduced the issue to a floating point  
precision issue too:

main()
{
float range;
double range2;
unsigned int ino, lowval, highval;
int high, low;

ino=732562070;
lowval= 2;
highval = 732562081;
high=57402135;
low=0;

range = ((float) (ino - lowval)) / (highval - lowval);

printf("range=%f\n",range);
}

It outputs 1.0, rather than .99999...

We're trying the suggested fix from the bug report.  It'll take about  
6 hours or so to get to that point.  Here's specifically what we're  
doing:

--- e2fsprogs-1.39/lib/ext2fs/icount.c  2005-09-06 05:40:14.000000000  
-0400
+++ e2fsprogs-1.39-test/lib/ext2fs/icount.c     2007-03-13  
10:56:19.000000000 -0400
@@ -251,6 +251,10 @@
                                 range = ((float) (ino - lowval)) /
                                         (highval - lowval);
                         mid = low + ((int) (range * (high-low)));
+                       if (mid > high)
+                               mid = high;
+                       if (mid < low)
+                               mid = low;
                 }
#endif
                 if (ino == icount->list[mid].ino) {


From jlforrest at berkeley.edu  Tue Mar 13 15:43:43 2007
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Tue, 13 Mar 2007 08:43:43 -0700
Subject: How To Recover From Creating >2TB ext3 Filesystem on MSDOS
 Partition Table?
In-Reply-To: <20070313070433.GL5266@schatzie.adilger.int>
References: <45F571C3.9090303@berkeley.edu>
	<20070313070433.GL5266@schatzie.adilger.int>
Message-ID: <45F6C6AF.6080709@berkeley.edu>

Andreas Dilger wrote:

> Did you recently update your kernel?

No. The system had been running for months.

> Is your kernel using CONFIG_LBD?

Yes.

Jon


From bdavids1 at gmu.edu  Wed Mar 14 00:32:44 2007
From: bdavids1 at gmu.edu (Brian Davidson)
Date: Tue, 13 Mar 2007 20:32:44 -0400
Subject: e2fsck hanging
In-Reply-To: <070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu>
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
	<749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
	<20070313135326.GA7362@thunk.org>
	<070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu>
Message-ID: <65B0B3F4-4231-473B-9594-6BF8BCEFB6DA@gmu.edu>

This patch does the trick.

> --- e2fsprogs-1.39/lib/ext2fs/icount.c  2005-09-06  
> 05:40:14.000000000 -0400
> +++ e2fsprogs-1.39-test/lib/ext2fs/icount.c     2007-03-13  
> 10:56:19.000000000 -0400
> @@ -251,6 +251,10 @@
>                                 range = ((float) (ino - lowval)) /
>                                         (highval - lowval);
>                         mid = low + ((int) (range * (high-low)));
> +                       if (mid > high)
> +                               mid = high;
> +                       if (mid < low)
> +                               mid = low;
>                 }
> #endif
>                 if (ino == icount->list[mid].ino) {

Our inode count is 732,577,792 on a 5.4 TB filesystem with 5.0 TB in  
use (94% use).  It took about 9 hours to run, and used of 4GB of memory.


From jss at ast.cam.ac.uk  Wed Mar 14 09:17:16 2007
From: jss at ast.cam.ac.uk (Jeremy Sanders)
Date: Wed, 14 Mar 2007 09:17:16 +0000
Subject: e2fsck hanging
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
	<749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
	<20070313135326.GA7362@thunk.org>
	<070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu>
Message-ID: <et8eis$28n$1@sea.gmane.org>

Brian Davidson wrote:

> --- e2fsprogs-1.39/lib/ext2fs/icount.c  2005-09-06 05:40:14.000000000
> -0400
> +++ e2fsprogs-1.39-test/lib/ext2fs/icount.c     2007-03-13
> 10:56:19.000000000 -0400
> @@ -251,6 +251,10 @@
>                                  range = ((float) (ino - lowval)) /
>                                          (highval - lowval);
>                          mid = low + ((int) (range * (high-low)));
> +                       if (mid > high)
> +                               mid = high;
> +                       if (mid < low)
> +                               mid = low;
>                  }
> #endif
>                  if (ino == icount->list[mid].ino) {

I'm happy to report this patch solved the fsck hanging problem I reported a
few weeks ago.

Jeremy

-- 
Jeremy Sanders <jss at ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


From jlforrest at berkeley.edu  Wed Mar 14 21:07:55 2007
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Wed, 14 Mar 2007 14:07:55 -0700
Subject: Solution to Corrupt >2TB Filesystem in MSDOS Partition Table
In-Reply-To: <20070313070433.GL5266@schatzie.adilger.int>
References: <45F571C3.9090303@berkeley.edu>
	<20070313070433.GL5266@schatzie.adilger.int>
Message-ID: <45F8642B.5080908@berkeley.edu>

Thanks to Ted and several others, I was
able to recover 100% of the corrupted
file system that I posted about last week.
(This was an >2TB ext3 file system that had been
created in a MSDOS partition which had worked
until the server was rebooted, at which time
it wouldn't mount and fsck wouldn't fix the
problem.)

Based on the suggestions of various people
here's what I did:

1) Upgraded to the latest version of GNU parted.
The server is running Scientific Linux 4.3,
a RHEL4 derived distribution with a 2.6.9
kernel. This distribution contained parted 1.6.19
whereas the latest release was 1.8.2.

2) Using parted 1.8.2, I removed the partition
containing the corrupt file system. This was
the only partition on the disk.

3) I then used the parted "rescue" command
to recreate the partition. I gave it the original
starting point at the start value and "-1s" as
the ending value.

After this, I was able to mount the file system as before,
and all the files were there. The first thing I did
was to copy the whole file system to another disk
which completed without any errors.

I have to admit that I don't fully understand why this
worked. Clearly the combination of removing the partition
and then rescuing it reset something that was fouling
up the works before.

Anyway, we're all very happy about this and we all appreciate
the help we received from this list and elsewhere. I hope
we'll be able to help you one day.

Cordially,

-- 
Jon Forrest
Unix Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From adilger at clusterfs.com  Wed Mar 14 14:57:31 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 14 Mar 2007 10:57:31 -0400
Subject: e2fsck hanging
In-Reply-To: <et8eis$28n$1@sea.gmane.org>
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
	<749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
	<20070313135326.GA7362@thunk.org>
	<070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu>
	<et8eis$28n$1@sea.gmane.org>
Message-ID: <20070314145731.GB5513@schatzie.adilger.int>

On Mar 14, 2007  09:17 +0000, Jeremy Sanders wrote:
> > --- e2fsprogs-1.39/lib/ext2fs/icount.c  2005-09-06 05:40:14.000000000
> > -0400
> > +++ e2fsprogs-1.39-test/lib/ext2fs/icount.c     2007-03-13
> > 10:56:19.000000000 -0400
> > @@ -251,6 +251,10 @@
> >                                  range = ((float) (ino - lowval)) /
> >                                          (highval - lowval);
> >                          mid = low + ((int) (range * (high-low)));
> > +                       if (mid > high)
> > +                               mid = high;
> > +                       if (mid < low)
> > +                               mid = low;
> >                  }
> > #endif
> >                  if (ino == icount->list[mid].ino) {
> 
> I'm happy to report this patch solved the fsck hanging problem I reported a
> few weeks ago.

Any real reason we don't change this to a double instead of a float?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From jss at ast.cam.ac.uk  Thu Mar 15 09:36:36 2007
From: jss at ast.cam.ac.uk (Jeremy Sanders)
Date: Thu, 15 Mar 2007 09:36:36 +0000
Subject: e2fsck hanging
References: <F0B6EE98-54E5-4B4A-9A29-9E92A3625A87@gmu.edu>
	<749E66B8-C720-4FEB-8C66-5A4938E80C8E@gmu.edu>
	<20070313135326.GA7362@thunk.org>
	<070FB85A-AE98-4523-9F3F-28AFD13C3AC4@gmu.edu>
	<et8eis$28n$1@sea.gmane.org>
	<20070314145731.GB5513@schatzie.adilger.int>
Message-ID: <etb434$3h4$1@sea.gmane.org>

Andreas Dilger wrote:

> Any real reason we don't change this to a double instead of a float?

Presumably that would make it less likely to happen, not get rid of the
problem completely, although on a real filesystem the issue may never
happen with a double. It's probably a reasonable idea to change to a
double, but also check for the bounding issues.

Jeremy

-- 
Jeremy Sanders <jss at ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


From lakshmipathi.g at gmail.com  Thu Mar 15 14:25:47 2007
From: lakshmipathi.g at gmail.com (lakshmi pathi)
Date: Thu, 15 Mar 2007 19:55:47 +0530
Subject: How to name it?
Message-ID: <ae2f51270703150725i26b87eatf75451bda0de19aa@mail.gmail.com>

hi all,
The reason why writting this mail--i don't know how to name a tool written
by myself :-)
Following is the functionality of a file system tool :
When you install the tool it acts as a protection for your files.Tool copies
the address of files.
If you accidently deleted a file -if its contents are not modified-then the
tool retrives the contents of file.
How should i call this tool./
Saying file recovery is somewhat miss leading-(i got critisied for saying
it's recovery tool)--because it doesnt recover files deleted before it's
intallation of tool.
It's can't be backup tool - since the tool backup only address of file and
not the file itself.
Is their any other similar tool is out there?
Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070315/7635175f/attachment.htm>

From samuel at bcgreen.com  Fri Mar 16 05:31:08 2007
From: samuel at bcgreen.com (Stephen Samuel)
Date: Thu, 15 Mar 2007 22:31:08 -0700
Subject: How to name it?
In-Reply-To: <ae2f51270703150725i26b87eatf75451bda0de19aa@mail.gmail.com>
References: <ae2f51270703150725i26b87eatf75451bda0de19aa@mail.gmail.com>
Message-ID: <6cd50f9f0703152231v4af0f0e8v84bd34b4eb9fef3c@mail.gmail.com>

It's an undelete tool...

Although it only allows you to undelete files since installation, It
still allows undeletion of files deleted while it is working.

On 3/15/07, lakshmi pathi <lakshmipathi.g at gmail.com> wrote:
> hi all,
> The reason why writting this mail--i don't know how to name a tool written
> by myself :-)


From mats_a at MIT.EDU  Sun Mar 18 01:42:17 2007
From: mats_a at MIT.EDU (Mats Ahlgren)
Date: Sat, 17 Mar 2007 21:42:17 -0400
Subject: Frequent metadata corruption with ext3 + hard power-off
Message-ID: <200703172142.17868.mats_a@mit.edu>

Hello.

I'm having serious issues with ext3; any insight would be greatly appreciated:


_____ Overview:

I believe ext3 is supposed to be recoverable in the case of a power failure by 
replaying the log.

However, on two separate computers (running different operatings systems too), 
this has been everything but the case.


_____ Specifics:

Sometimes, my kernel will hard-freeze and I'll have to do a hard reboot. When 
this happens, sometimes fsck will insist on running and find some orphaned 
inodes, which it will proceed to put in the /lost+found directory.

This is unacceptable: The last time this happened, random files in my 
operating system were plucked from the file system and stuffed in lost+found, 
corrupting the OS and forcing a reinstall. Another time, files I had recently 
moved (a final project) a minute before the crash were orphaned and put in 
the lost+found, effectively destroying it.

Why should a lost+found folder even be necessary when the file hierarchy is 
guaranteed to be consistent?


In response to these problems, I changed the ext3 journaling mode to "journal" 
rather than "ordered" (frankly it seems deeply disturbing that "ordered" is 
the default). Since then, I've once had to hard-reboot and yet again found 
files in the /lost+found folder.

Might anyone know why ext3 is not fulfilling its promise of an 
always-consistent file system?


_____ Other interacting issues:

I'm running RAID1 (mirroring) on one computer, but I've had the same issues on 
another computer without RAID.

(In response to "you shouldn't hard-reboot your computer": I realize that most 
computers are not meant to be hard-rebooted, but I don't have a sysrq key and 
xmodmapping it has been difficult. I also realize that kernels shouldn't 
crash, but what's a person to do if the computer doesn't respond to 
ctrl-alt-f1 and doesn't leave any messages in the logs...)

(In response to "maybe your drive is defective": This is not a problem with a 
defective drive; I've tried multiple drives.)

(In response to "you should backup your data": Periodic backups clearly help, 
but it's ridiculous to restore a system from backup every week because a 
hard-freeze corrupted your filesystem...)


Any insight would be greatly appreciated. These problems have been making me 
look for other file systems (such as zfs, which unfortunately I can't use to 
boot; or reiser4, which also makes a filesystem-is-always-consistent 
guarantee); I would prefer to use ext3, but I've never had these sorts of 
problems with old Mac OS, OS X, or Windows.


Thank you,
Mats


From tytso at mit.edu  Sun Mar 18 13:33:59 2007
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 18 Mar 2007 09:33:59 -0400
Subject: Frequent metadata corruption with ext3 + hard power-off
In-Reply-To: <200703172142.17868.mats_a@mit.edu>
References: <200703172142.17868.mats_a@mit.edu>
Message-ID: <20070318133359.GA31914@thunk.org>

It sounds like you have a disk which is doing very aggressive write
caching.  If you are using a new enough kernel (2.6.9 or greater
should have this), adding "barrier=1" to your mount options should
help.  We should probably make this the default at this point...

						- Ted


From ahlist at gmail.com  Mon Mar 19 21:15:59 2007
From: ahlist at gmail.com (ahlist)
Date: Mon, 19 Mar 2007 17:15:59 -0400
Subject: rebooting more often to stop fsck problems and total disk loss
Message-ID: <bc5becf60703191415r17eb3e3aq581d2ec751a6ce81@mail.gmail.com>

Hi,

I run several hundred servers that are used heavily (webhosting, etc.)
all day long.

Quite often we'll have a server that either needs a really long fsck
(10 hours - 200 gig drive) or an fsck that evntually results in
everything going to lost+found (pretty much a total loss).

Would rebooting these servers monthly (or some other frequency) stop this?

Is it correct to visualize this as small errors compounding over time
thus more frequent reboots would allow quick fsck's to fix the errors
before they become huge?

(OS is redhat 7.3 and el3)

Thanks for any input!


From adilger at clusterfs.com  Mon Mar 19 21:27:19 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 19 Mar 2007 15:27:19 -0600
Subject: rebooting more often to stop fsck problems and total disk loss
In-Reply-To: <bc5becf60703191415r17eb3e3aq581d2ec751a6ce81@mail.gmail.com>
References: <bc5becf60703191415r17eb3e3aq581d2ec751a6ce81@mail.gmail.com>
Message-ID: <20070319212719.GF5967@schatzie.adilger.int>

On Mar 19, 2007  17:15 -0400, ahlist wrote:
> Quite often we'll have a server that either needs a really long fsck
> (10 hours - 200 gig drive) or an fsck that evntually results in
> everything going to lost+found (pretty much a total loss).

Strange.  We get 1TB/hr fscks these days unless the filesystem is
completely corrupted and has a lot of duplicate blocks.

> Would rebooting these servers monthly (or some other frequency) stop this?

What else is important is that if you do an fsck you run with "-f" to
actually check the filesystem instead of just the superblock.  e2fsck
will only do a full e2fsck if the kernel detected disk corruption, OR
if the "last checked" time is > 6 months or {20 < X < 40} mounts have
happened since the last check time.  See tune2fs(8) for details.

> Is it correct to visualize this as small errors compounding over time
> thus more frequent reboots would allow quick fsck's to fix the errors
> before they become huge?

That is definitely true.  If the bitmaps get corrupted, then this will
spread corruption throughout the filesystem.

> (OS is redhat 7.3 and el3)

I would instead suggest updating to a newer kernel (e.g. RHEL4 2.6.9) as
this has fixed a LOT of bugs in ext3.  Also, make sure you are using the
newest e2fsck available, as some bugs have been fixed there also.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From rjackson at mason.gmu.edu  Tue Mar 20 13:44:07 2007
From: rjackson at mason.gmu.edu (Richard Jackson)
Date: Tue, 20 Mar 2007 09:44:07 -0400 (EDT)
Subject: e2fsck hanging
Message-ID: <200703201344.l2KDi8u5017035@mason.gmu.edu>

There are are few issues with the get_icount_el() code.  First a simple
binary search may be sufficient.  Also, We now know the float type is
not sufficient to handle the large or small values handled by this
code.  One problem with using float is it does not have the precision
to divide two sufficently large numbers with a small enough
difference.  The other issue is with float value approximation that
causes 'mid' to be larger than 'high'.  The approximation is due to
float single-precision 23 bit mantissa.  Values up to integer
16,777,215 are handled as expected but starting at 16,777,216 the least
significant bits are truncated producing an approximation.  The
approximation could be more or less than what is expected.  This is a
feature of using float.  Double type for IEEE 754 double-precision 64
bit provides a 52 bit mantissa to play with.  That is a large number.

Since the e2fsck code must handle large numbers the use of float type should
be used with caution.

Reference
http://steve.hollasch.net/cgindex/coding/ieeefloat.html
http://en.wikipedia.org/wiki/IEEE_754


From tytso at mit.edu  Tue Mar 20 22:59:20 2007
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 20 Mar 2007 18:59:20 -0400
Subject: e2fsck hanging
In-Reply-To: <200703201344.l2KDi8u5017035@mason.gmu.edu>
References: <200703201344.l2KDi8u5017035@mason.gmu.edu>
Message-ID: <20070320225920.GA10134@thunk.org>

On Tue, Mar 20, 2007 at 09:44:07AM -0400, Richard Jackson wrote:
> There are are few issues with the get_icount_el() code.  First a simple
> binary search may be sufficient.  Also, We now know the float type is
> not sufficient to handle the large or small values handled by this
> code.  One problem with using float is it does not have the precision
> to divide two sufficently large numbers with a small enough
> difference.  The other issue is with float value approximation that
> causes 'mid' to be larger than 'high'.  The approximation is due to
> float single-precision 23 bit mantissa.  Values up to integer
> 16,777,215 are handled as expected but starting at 16,777,216 the least
> significant bits are truncated producing an approximation.  The
> approximation could be more or less than what is expected.  This is a
> feature of using float.  Double type for IEEE 754 double-precision 64
> bit provides a 52 bit mantissa to play with.  That is a large number.

Well, keep in mind that the float is just as an optimization to doing
a simple binary search.  So it doesn't have to be precise; an
approximation is fine, except when mid ends up being larger than high.
But it's simple enough to catch that particular case where the
division going to 1 instead of 0.99999 as we might expect.  Catching
that should be enough, I expect.

						- Ted


From bdavids1 at gmu.edu  Tue Mar 20 23:53:24 2007
From: bdavids1 at gmu.edu (Brian Davidson)
Date: Tue, 20 Mar 2007 19:53:24 -0400
Subject: e2fsck hanging
In-Reply-To: <20070320225920.GA10134@thunk.org>
References: <200703201344.l2KDi8u5017035@mason.gmu.edu>
	<20070320225920.GA10134@thunk.org>
Message-ID: <9409CCD0-3AB9-48BF-A3D7-7CA353E70CA6@gmu.edu>


On Mar 20, 2007, at 6:59 PM, Theodore Tso wrote:

> Well, keep in mind that the float is just as an optimization to doing
> a simple binary search.  So it doesn't have to be precise; an
> approximation is fine, except when mid ends up being larger than high.
> But it's simple enough to catch that particular case where the
> division going to 1 instead of 0.99999 as we might expect.  Catching
> that should be enough, I expect.
>
> 						- Ted

With a float, you're still trying to cram 32 bits into a 24 bit  
mantissa (23 bits + implicit bit).  If nothing else, the float should  
get changed to a double which has a 53 bit mantissa (52 + implicit  
bit).  Just catching the case where division goes to one causes it to  
do a linear search.  Given that this only occurs on really big  
filesystems, that's probably not what you want to do...

Brian


From armangau_philippe at emc.com  Wed Mar 21 17:18:10 2007
From: armangau_philippe at emc.com (armangau_philippe at emc.com)
Date: Wed, 21 Mar 2007 13:18:10 -0400
Subject: Ext3 behavior on power failure
Message-ID: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>

Hi all,

We are building a new system which is going to use ext3 FS. We would like to know more about the behavior of ext3 in the case of failure.  But before I procede, I would like to share more information about our future system. 

*	Our application always does an fsync on files
*	When symbolic links (more specifically fast symlink) are created, the host directory is also fsync'ed. 
*	Our application is also going to front an EMC disk array configured using RAID5 or RAID6.
*	We will be using multipathing  so that we can assume that no disk errors will be reported. 

In this context , we would like to know the following for recovery after a power outage:

1.	When will an fsck have to be run (not counting  the scheduled fsck every N-mounts)?
2.	In the case of a crash, are the fsync-ed file contents and symbolic links safe no matter what?

Thanks,


Philippe Armangau
Centera Software Group
Consultant Software Engineer

EMC? <http://www.emc.com/> 
Where Information Lives
* Office: 508-249-5575 (toll free 877-362-2887 x45475)
*   Cell: 978-760-0485
* Fax: 508-249-5495
* E-mail: armangau_philippe at emc.com


From skye0507 at yahoo.com  Wed Mar 21 23:51:56 2007
From: skye0507 at yahoo.com (brian stone)
Date: Wed, 21 Mar 2007 16:51:56 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
Message-ID: <221628.39405.qm@web59005.mail.re1.yahoo.com>

My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this.
 
 1. Do I use EXT2 and use fdatasync() or fsync()?
 
 2. Do I use EXT2 and mount with the "sync" option?
 
 3. Do I use EXT2 and use the O_DIRECT flag on open()?
 
 4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers?
 
 5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary.
 
 Thanks in advance
 
---------------------------------
The fish are biting.
 Get more visitors on your site using Yahoo! Search Marketing.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070321/282a25ef/attachment.htm>

From adilger at clusterfs.com  Thu Mar 22 04:14:24 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 21 Mar 2007 22:14:24 -0600
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <221628.39405.qm@web59005.mail.re1.yahoo.com>
References: <221628.39405.qm@web59005.mail.re1.yahoo.com>
Message-ID: <20070322041424.GM5967@schatzie.adilger.int>

On Mar 21, 2007  16:51 -0700, brian stone wrote:
> My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this.

>  4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers?
>  
>  5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary.

In theory, ext3 + data=journal will give you the best performance, because
sync IO will always be linear IO to the journal.  Unless your filesystem is
constantly busy, then the writes to the filesystem can happen asynchronously
after being committed to the journal without danger of being lost.

That said, nothing better than benchmarking your app with different
filesystem options to see which one is best.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From skye0507 at yahoo.com  Thu Mar 22 11:51:00 2007
From: skye0507 at yahoo.com (brian stone)
Date: Thu, 22 Mar 2007 04:51:00 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <20070322041424.GM5967@schatzie.adilger.int>
Message-ID: <823230.44351.qm@web59009.mail.re1.yahoo.com>

>>nothing better than benchmarking your app with different

IO performance is always a consideration, but for this application reliability is much more important.  

I am looking for the most reliable way of dumping files to disk.  We I call close(), I need to know that the data is one disk.  It doesn't need to be the highest performance method, just the most reliable.

>>Unless your filesystem is constantly busy

It is constantly busy.  Each file system manages around 10 millions files across a TB.  Each day, an average of 500,000 files totaling 100G are throw away while the same amount is generated.  Its a constant cycle.  The point is, these are very active file systems. 

I have already seen EXT3 corrupt its superblock(s) after a disk failure, using data=ordered.  Trying different superblocks didn't work, maybe -O sparse_super isn't the best idea.

No merit in EXT2 with fdatasync calls?

thanks for the response.

Andreas Dilger <adilger at clusterfs.com> wrote: On Mar 21, 2007  16:51 -0700, brian stone wrote:
> My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this.

>  4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers?
>  
>  5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary.

In theory, ext3 + data=journal will give you the best performance, because
sync IO will always be linear IO to the journal.  Unless your filesystem is
constantly busy, then the writes to the filesystem can happen asynchronously
after being committed to the journal without danger of being lost.

That said, nothing better than benchmarking your app with different
filesystem options to see which one is best.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


---------------------------------
Don't be flakey. Get Yahoo! Mail for Mobile and 
always stay connected to friends.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070322/19d2b4b5/attachment.htm>

From skye0507 at yahoo.com  Thu Mar 22 11:58:50 2007
From: skye0507 at yahoo.com (brian stone)
Date: Thu, 22 Mar 2007 04:58:50 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <823230.44351.qm@web59009.mail.re1.yahoo.com>
Message-ID: <957664.74539.qm@web59015.mail.re1.yahoo.com>

>>I have already seen EXT3 corrupt its superblock(s) after a disk failure, using data=ordered.

Not sure why this post printed data="".  I was using ordered mode, the default.

thanks

brian stone <skye0507 at yahoo.com> wrote: >>nothing better than benchmarking your app with different

IO performance is always a consideration, but for this application reliability is much more important.  

I am looking for the most reliable way of dumping files to disk.  We I call close(), I need to know that the data is one disk.  It doesn't need to be the highest performance method, just the most reliable.

>>Unless your filesystem is constantly busy

It is constantly busy.  Each file system manages around 10 millions files across a TB.  Each day, an average of 500,000 files totaling 100G are throw away while the same amount is generated.  Its a constant cycle.  The point is, these are very active file systems. 

I have already seen EXT3 corrupt its superblock(s) after a disk failure, using data=ordered.  Trying different superblocks didn't work, maybe -O sparse_super isn't the best idea.

No merit  in EXT2 with fdatasync calls?

thanks for the response.

Andreas Dilger <adilger at clusterfs.com> wrote: On Mar 21, 2007  16:51 -0700, brian stone wrote:
> My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this.

>  4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers?
>  
>  5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary.

In theory, ext3 + data=journal will give you the best performance,  because
sync IO will always be linear IO to the journal.  Unless your filesystem is
constantly busy, then the writes to the filesystem can happen asynchronously
after being committed to the journal without danger of being lost.

That said, nothing better than benchmarking your app with different
filesystem options to see which one is best.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


---------------------------------
Don't be flakey. Get Yahoo! Mail for Mobile and 
always stay connected to friends._______________________________________________
Ext3-users mailing list
Ext3-users at redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users

 
---------------------------------
The fish are biting.
 Get more visitors on your site using Yahoo! Search Marketing.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070322/b3bef136/attachment.htm>

From skye0507 at yahoo.com  Fri Mar 23 03:44:40 2007
From: skye0507 at yahoo.com (brian stone)
Date: Thu, 22 Mar 2007 20:44:40 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <20070322041424.GM5967@schatzie.adilger.int>
Message-ID: <810328.85867.qm@web59007.mail.re1.yahoo.com>

Ran some performance tests as suggested.

Machine A connects to machine B on a gigabit lan.  Machine A sends 
1024 1MB chucks of data; 1 GB in total. Machine B, the server, reads 
in the MB and writes it to a file.

NOTE: server and client are little test programs written in C.  

Machine B (Server) hardware:
- Single (no raid) Seagate Cheetah 70G Ultra320 15K
- Quad Opteron 870
- 16G DDR400
- Backplane: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 8)

Sync methods include:
1. mount with sync option
  - tried sync,dirsync which added no additional overhead
2. use O_SYNC open() flag
3. use fdatasync() just before closing the file
  - fsync() and fdatasync() produced the same results


EXT2 tests
==========================================
No sync                     12.3 seconds  (83 MB/Sec)
mount=sync                  44.3 seconds  (23 MB/Sec)
O_SYNC                      31.7 seconds  (32 MB/Sec)
fdatasync()                 31.3 seconds  (32 MB/Sec)


EXT3 tests
===========================================
No sync data=writeback      14.5 seconds  (70 MB/Sec)
No sync data=ordered        17 seconds    (60 MB/Sec)
No sync data=journal        65 seconds    (15 MB/Sec)
data=ordered O_SYNC         49 seconds    (20 MB/Sec)
data=ordered,sync           52 seconds    (19 MB/Sec)
data=ordered fdatasync()    45.5 seconds  (22 MB/Sec)
data=journal O_SYNC         72.5 seconds  (14 MB/Sec)
data=journal,sync           81 seconds    (12 MB/Sec)
data=journal fdatasync()    60.5 seconds  (17 MB/Sec)

thanks

Andreas Dilger <adilger at clusterfs.com> wrote: On Mar 21, 2007  16:51 -0700, brian stone wrote:
> My application always needs to sync file data after writing. I don't want anything handing around in the kernel buffers. I am wondering what is the best method to accomplish this.

>  4. Do I use EXT3 in full journaled mode, where the data and metadata are journaled? In this case, is the journaled data sync'd or async'd? When the journal commits the data to the file system, is that sync'd or dumped into kernel buffers?
>  
>  5. Since I will always be syncing the data, does it make any sense to use EXT3? It feels like the EXT3 journal would be unnecessary.

In theory, ext3 + data=journal will give you the best performance, because
sync IO will always be linear IO to the journal.  Unless your filesystem is
constantly busy, then the writes to the filesystem can happen asynchronously
after being committed to the journal without danger of being lost.

That said, nothing better than benchmarking your app with different
filesystem options to see which one is best.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


---------------------------------
No need to miss a message. Get email on-the-go 
with Yahoo! Mail for Mobile. Get started.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070322/e1811773/attachment.htm>

From skye0507 at yahoo.com  Fri Mar 23 03:50:38 2007
From: skye0507 at yahoo.com (brian stone)
Date: Thu, 22 Mar 2007 20:50:38 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <810328.85867.qm@web59007.mail.re1.yahoo.com>
Message-ID: <546593.25102.qm@web59015.mail.re1.yahoo.com>

Why does this forum convert the right side of an equal sign to ""??? 

Test results reformatted:

EXT2 tests
==========================================
No sync                     12.3 seconds  (83 MB/Sec)
sync                        44.3 seconds  (23 MB/Sec)
O_SYNC                      31.7 seconds  (32 MB/Sec)
fdatasync()                 31.3 seconds  (32 MB/Sec)


EXT3 tests
===========================================
No sync writeback      14.5 seconds  (70 MB/Sec)
No sync ordered         17 seconds    (60 MB/Sec)
No sync journal          65 seconds    (15 MB/Sec)
ordered O_SYNC       49 seconds    (20 MB/Sec)
ordered,sync             52 seconds    (19 MB/Sec)
ordered fdatasync()    45.5 seconds  (22 MB/Sec)
journal O_SYNC        72.5 seconds  (14 MB/Sec)
journal,sync              81 seconds    (12 MB/Sec)
journal fdatasync()      60.5 seconds  (17 MB/Sec)


---------------------------------
No need to miss a message. Get email on-the-go 
with Yahoo! Mail for Mobile. Get started.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070322/0312904d/attachment.htm>

From adilger at clusterfs.com  Fri Mar 23 06:18:40 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 23 Mar 2007 00:18:40 -0600
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <810328.85867.qm@web59007.mail.re1.yahoo.com>
References: <20070322041424.GM5967@schatzie.adilger.int>
	<810328.85867.qm@web59007.mail.re1.yahoo.com>
Message-ID: <20070323061840.GC5967@schatzie.adilger.int>

On Mar 22, 2007  20:44 -0700, brian stone wrote:
> Machine A connects to machine B on a gigabit lan.  Machine A sends 
> 1024 1MB chucks of data; 1 GB in total. Machine B, the server, reads 
> in the MB and writes it to a file.
> 
> NOTE: server and client are little test programs written in C.  
> 
> Machine B (Server) hardware:
> - Single (no raid) Seagate Cheetah 70G Ultra320 15K
> - Quad Opteron 870
> - 16G DDR400
> - Backplane: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 8)
> 
> Sync methods include:
> 1. mount with sync option
>   - tried sync,dirsync which added no additional overhead
> 2. use O_SYNC open() flag
> 3. use fdatasync() just before closing the file
>   - fsync() and fdatasync() produced the same results
> 
> 
> EXT2 tests
> ==========================================
> No sync                     12.3 seconds  (83 MB/Sec)
> mount=sync                  44.3 seconds  (23 MB/Sec)
> O_SYNC                      31.7 seconds  (32 MB/Sec)
> fdatasync()                 31.3 seconds  (32 MB/Sec)
> 
> 
> EXT3 tests
> ===========================================
> No sync data=writeback      14.5 seconds  (70 MB/Sec)
> No sync data=ordered        17 seconds    (60 MB/Sec)
> No sync data=journal        65 seconds    (15 MB/Sec)
> data=ordered O_SYNC         49 seconds    (20 MB/Sec)
> data=ordered,sync           52 seconds    (19 MB/Sec)
> data=ordered fdatasync()    45.5 seconds  (22 MB/Sec)
> data=journal O_SYNC         72.5 seconds  (14 MB/Sec)
> data=journal,sync           81 seconds    (12 MB/Sec)
> data=journal fdatasync()    60.5 seconds  (17 MB/Sec)

If you are doing a large number of 1MB writes then I agree that
data=journal is probably not the way to go because it means you
can get at most 1/2 of the bandwidth of the disk (unless you
create the journal on a separate disk).  data=journal is good
for small writes and lots of transactions, like mail servers
that need lots of sync operations.

For large writes, I'd suggest you put the journal on a separate
device, and make it 1 or 2 GB (your server has plenty of RAM,
so that isn't a problem).  Are you using EAs, like selinux or
similar?  If yes, then you should also format your filesystem
with large inodes (-I 256).

You may also want to try out ext4dev with the mballoc and delalloc
patches from Alex Tomas, as this code has been optimized for
doing large power-of-two allocations in the filesystem.  They've
been posted to the ext4-devel lists a couple of times.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From ric at emc.com  Fri Mar 23 10:47:26 2007
From: ric at emc.com (Ric Wheeler)
Date: Fri, 23 Mar 2007 06:47:26 -0400
Subject: Ext3 behavior on power failure
In-Reply-To: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
Message-ID: <4603B03E.7080302@emc.com>


armangau_philippe at emc.com wrote:
> Hi all,
>
> We are building a new system which is going to use ext3 FS. We would like to know more about the behavior of ext3 in the case of failure.  But before I procede, I would like to share more information about our future system. 
>
> *	Our application always does an fsync on files
> *	When symbolic links (more specifically fast symlink) are created, the host directory is also fsync'ed. 
> *	Our application is also going to front an EMC disk array configured using RAID5 or RAID6.
> *	We will be using multipathing  so that we can assume that no disk errors will be reported. 
>
> In this context , we would like to know the following for recovery after a power outage:
>
> 1.	When will an fsck have to be run (not counting  the scheduled fsck every N-mounts)?
> 2.	In the case of a crash, are the fsync-ed file contents and symbolic links safe no matter what?
>
> Thanks,

This is an interesting twist on some of the discussion that we have had 
at the recent workshop and in other forums on hardening  file system in 
order to prevent the need to fsck.

The twist is that we have a disk that will not lose power without being 
able to write to platter all of the data that has been sent - this is 
the case for most mid-range or higher disk arrays.

If the application can precisely use fsync() on files, directories and 
symlinks, it wants to know that all objects are safe on disk that have 
completed a successful fsync. It also wants to know that the file system 
will not need any recovery beyond replaying transactions after a power 
outage/reboot - simply mount, let the transactions get replayed and you 
should be good to go without the fsck.

The hard part of the question is to understand when and how often we 
will fail to deliver this easy case. Also, does any of the hardening in 
ext4 help here.

Maybe the Stanford explode work/analysis sheds some light on this behavior?

ric


From skye0507 at yahoo.com  Fri Mar 23 13:17:06 2007
From: skye0507 at yahoo.com (brian stone)
Date: Fri, 23 Mar 2007 06:17:06 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <20070323061840.GC5967@schatzie.adilger.int>
Message-ID: <663779.93645.qm@web59009.mail.re1.yahoo.com>

I am currently leaning towards: 
mount in ordered mode with the dirsync option and use fsync().

That seemed to be the most consistent in performance tests.  Some of the config tests would fart in the middle, hesitating for a second or two.  The ordered mode with fsync() was rock solid.  Also, I think journaling the data when you are syncing it is more than one needs.

Without going to unneeded details, I will give you a glimpse of what this application is doing.

Machine A, which I will call an app server, generates binary chucks/blocks of data ranging from 28 bytes to a maximum of 1MB.  There are multiple app servers.  The app servers need to quickly store these blocks on one of several Machine Bs, which I will call volume servers.  When a block is transferred from an app server to a volume server, it must be done reliably ... thus the need to sync.  If the volume server says, "I got that block", then it really must have it ... on disk.

>>Are you using EAs, like selinux or similar
File system permissions and security attributes are meaningless in this system.  selinux is disabled.  These blocks are not browsed by users.  I actually mount using "noatime,nodiratime,noacl,nouser_xattr".  Only the app servers have any idea what these blocks mean.  The volume server is nothing more than a dumping ground out on the network.  We even toyed with writing raw, opening a device directly with no fs and using O_DIRECT.  Not a bad idea just a heck of a lot of work!  Easier to fiddle with the correct config for ext3.

So, maybe the volume servers need two fs configs: one for blocks less than 128KB and one for blocks over 128KB.  

I tested with 1MB blocks because that would be the worst case; I wanted to know how it would perform.  The average block size is currently around 100KB.

thanks soo much for your thoughts


Andreas Dilger <adilger at clusterfs.com> wrote: On Mar 22, 2007  20:44 -0700, brian stone wrote:
> Machine A connects to machine B on a gigabit lan.  Machine A sends 
> 1024 1MB chucks of data; 1 GB in total. Machine B, the server, reads 
> in the MB and writes it to a file.
> 
> NOTE: server and client are little test programs written in C.  
> 
> Machine B (Server) hardware:
> - Single (no raid) Seagate Cheetah 70G Ultra320 15K
> - Quad Opteron 870
> - 16G DDR400
> - Backplane: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 8)
> 
> Sync methods include:
> 1. mount with sync option
>   - tried sync,dirsync which added no additional overhead
> 2. use O_SYNC open() flag
> 3. use fdatasync() just before closing the file
>   - fsync() and fdatasync() produced the same results
> 
> 
> EXT2 tests
> ==========================================
> No sync                     12.3 seconds  (83 MB/Sec)
> mount=sync                  44.3 seconds  (23 MB/Sec)
> O_SYNC                      31.7 seconds  (32 MB/Sec)
> fdatasync()                 31.3 seconds  (32 MB/Sec)
> 
> 
> EXT3 tests
> ===========================================
> No sync data=writeback      14.5 seconds  (70 MB/Sec)
> No sync data=ordered        17 seconds    (60 MB/Sec)
> No sync data=journal        65 seconds    (15 MB/Sec)
> data=ordered O_SYNC         49 seconds    (20 MB/Sec)
> data=ordered,sync           52 seconds    (19 MB/Sec)
> data=ordered fdatasync()    45.5 seconds  (22 MB/Sec)
> data=journal O_SYNC         72.5 seconds  (14 MB/Sec)
> data=journal,sync           81 seconds    (12 MB/Sec)
> data=journal fdatasync()    60.5 seconds  (17 MB/Sec)

If you are doing a large number of 1MB writes then I agree that
data=journal is probably not the way to go because it means you
can get at most 1/2 of the bandwidth of the disk (unless you
create the journal on a separate disk).  data=journal is good
for small writes and lots of transactions, like mail servers
that need lots of sync operations.

For large writes, I'd suggest you put the journal on a separate
device, and make it 1 or 2 GB (your server has plenty of RAM,
so that isn't a problem).  Are you using EAs, like selinux or
similar?  If yes, then you should also format your filesystem
with large inodes (-I 256).

You may also want to try out ext4dev with the mballoc and delalloc
patches from Alex Tomas, as this code has been optimized for
doing large power-of-two allocations in the filesystem.  They've
been posted to the ext4-devel lists a couple of times.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


---------------------------------
Finding fabulous fares is fun.
Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070323/16040812/attachment.htm>

From skye0507 at yahoo.com  Sat Mar 24 15:19:58 2007
From: skye0507 at yahoo.com (brian stone)
Date: Sat, 24 Mar 2007 08:19:58 -0700 (PDT)
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <20070323061840.GC5967@schatzie.adilger.int>
Message-ID: <361965.45536.qm@web59008.mail.re1.yahoo.com>

Final configuration and performance results.

Changed machines (for a RAID test):
 - 3ware 9550SX with BBU
 - Pentium D 940
 - 2G DDR2 667
- (4) 750G Seagate SATAII drives (AS series)

RAID levels:
 - machine was configured for RAID5 but that was horribly slow, 12 MB/Sec
 - created a (2) drive RAID0, then sliced out a 100G partition
 - journal was on a separate JBOD disk
- write caching was enabled for the RAID0 and journal disk
- 64K stripes was used on RAID0 and JBOD journal

File system configuration:
- 100G ext3 file system
- Used a 32M journal on a physically separate device
- used "ordered" mode for the journal
- mounted with "noatime,nodiratime,noauto,noacl,nouser_xattr,dirsync"
- used the mkfs.ext3 -E option to set stripes to 16
   - RAID0 was using 64K stripes.
   - fs was using 4K blocks
- each file transaction did: open(),write(),fsync(),close() 
- slammed 1024 1MB chucks at it

I got 36 MB/Sec consistently.  A good sign because with the proper hardware, this would perform really well.

In production, I would probably use a RAID10 with at least 12 15K SAS/FC drives with dual controllers in Active-Active mode: failover+load balancing.  Either fiber or SAS connected.  That should scream!

Fortunately, this config needs very little space ... maybe 500G in total.  So the hardware cost is not terrible.  This config is for a queue directory that is crawled by a background process.  That process moves the data from this queue to mass "slow" storage, fiber attached SATAII 7200RPM RAID5.  The queue needs to be as fast as possible and must sync the data.  Tricky problem :)

thanks.

Andreas Dilger <adilger at clusterfs.com> wrote: On Mar 22, 2007  20:44 -0700, brian stone wrote:
> Machine A connects to machine B on a gigabit lan.  Machine A sends 
> 1024 1MB chucks of data; 1 GB in total. Machine B, the server, reads 
> in the MB and writes it to a file.
> 
> NOTE: server and client are little test programs written in C.  
> 
> Machine B (Server) hardware:
> - Single (no raid) Seagate Cheetah 70G Ultra320 15K
> - Quad Opteron 870
> - 16G DDR400
> - Backplane: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 8)
> 
> Sync methods include:
> 1. mount with sync option
>   - tried sync,dirsync which added no additional overhead
> 2. use O_SYNC open() flag
> 3. use fdatasync() just before closing the file
>   - fsync() and fdatasync() produced the same results
> 
> 
> EXT2 tests
> ==========================================
> No sync                     12.3 seconds  (83 MB/Sec)
> mount=sync                  44.3 seconds  (23 MB/Sec)
> O_SYNC                      31.7 seconds  (32 MB/Sec)
> fdatasync()                 31.3 seconds  (32 MB/Sec)
> 
> 
> EXT3 tests
> ===========================================
> No sync data=writeback      14.5 seconds  (70 MB/Sec)
> No sync data=ordered        17 seconds    (60 MB/Sec)
> No sync data=journal        65 seconds    (15 MB/Sec)
> data=ordered O_SYNC         49 seconds    (20 MB/Sec)
> data=ordered,sync           52 seconds    (19 MB/Sec)
> data=ordered fdatasync()    45.5 seconds  (22 MB/Sec)
> data=journal O_SYNC         72.5 seconds  (14 MB/Sec)
> data=journal,sync           81 seconds    (12 MB/Sec)
> data=journal fdatasync()    60.5 seconds  (17 MB/Sec)

If you are doing a large number of 1MB writes then I agree that
data=journal is probably not the way to go because it means you
can get at most 1/2 of the bandwidth of the disk (unless you
create the journal on a separate disk).  data=journal is good
for small writes and lots of transactions, like mail servers
that need lots of sync operations.

For large writes, I'd suggest you put the journal on a separate
device, and make it 1 or 2 GB (your server has plenty of RAM,
so that isn't a problem).  Are you using EAs, like selinux or
similar?  If yes, then you should also format your filesystem
with large inodes (-I 256).

You may also want to try out ext4dev with the mballoc and delalloc
patches from Alex Tomas, as this code has been optimized for
doing large power-of-two allocations in the filesystem.  They've
been posted to the ext4-devel lists a couple of times.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


---------------------------------
TV dinner still cooling?
Check out "Tonight's Picks" on Yahoo! TV.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20070324/fad4bcc2/attachment.htm>

From adilger at clusterfs.com  Sat Mar 24 21:25:02 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Sat, 24 Mar 2007 15:25:02 -0600
Subject: EXT2 vs. EXT3: mount w/sync or fdatasync
In-Reply-To: <361965.45536.qm@web59008.mail.re1.yahoo.com>
References: <20070323061840.GC5967@schatzie.adilger.int>
	<361965.45536.qm@web59008.mail.re1.yahoo.com>
Message-ID: <20070324212502.GJ5967@schatzie.adilger.int>

On Mar 24, 2007  08:19 -0700, brian stone wrote:
> File system configuration:
> - 100G ext3 file system
> - Used a 32M journal on a physically separate device

We normally run our servers with at least 256MB journals - under metadata
intensive loads (including truncates) this can really help.

> - used "ordered" mode for the journal
> - mounted with "noatime,nodiratime,noauto,noacl,nouser_xattr,dirsync"
> - used the mkfs.ext3 -E option to set stripes to 16
>    - RAID0 was using 64K stripes.
>    - fs was using 4K blocks
> - each file transaction did: open(),write(),fsync(),close() 
> - slammed 1024 1MB chucks at it
> 
> I got 36 MB/Sec consistently.  A good sign because with the proper hardware, this would perform really well.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From jack at suse.cz  Wed Mar 28 12:40:16 2007
From: jack at suse.cz (Jan Kara)
Date: Wed, 28 Mar 2007 14:40:16 +0200
Subject: Ext3 behavior on power failure
In-Reply-To: <4603B03E.7080302@emc.com>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
	<4603B03E.7080302@emc.com>
Message-ID: <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>

> armangau_philippe at emc.com wrote:
> >Hi all,
> >
> >We are building a new system which is going to use ext3 FS. We would like 
> >to know more about the behavior of ext3 in the case of failure.  But 
> >before I procede, I would like to share more information about our future 
> >system. 
> >*	Our application always does an fsync on files
> >*	When symbolic links (more specifically fast symlink) are created, 
> >the host directory is also fsync'ed. *	Our application is also 
> >going to front an EMC disk array configured using RAID5 or RAID6.
> >*	We will be using multipathing  so that we can assume that no disk 
> >errors will be reported. 
> >In this context , we would like to know the following for recovery after a 
> >power outage:
> >
> >1.	When will an fsck have to be run (not counting  the scheduled fsck 
> >every N-mounts)?
> >2.	In the case of a crash, are the fsync-ed file contents and symbolic 
> >links safe no matter what?
> >
> >Thanks,
> 
> This is an interesting twist on some of the discussion that we have had 
> at the recent workshop and in other forums on hardening  file system in 
> order to prevent the need to fsck.
> 
> The twist is that we have a disk that will not lose power without being 
> able to write to platter all of the data that has been sent - this is 
> the case for most mid-range or higher disk arrays.
> 
> If the application can precisely use fsync() on files, directories and 
> symlinks, it wants to know that all objects are safe on disk that have 
> completed a successful fsync. It also wants to know that the file system 
> will not need any recovery beyond replaying transactions after a power 
> outage/reboot - simply mount, let the transactions get replayed and you 
> should be good to go without the fsck.
> 
> The hard part of the question is to understand when and how often we 
> will fail to deliver this easy case. Also, does any of the hardening in 
> ext4 help here.
  I'm probably misunderstanding something because the answer seems to be
too obvious to me :) But anyway I'll write it so that you can correct
me:
  Due to journalling guarantees you should get consistent FS whenever
you replay the log (unless there are some software bugs or hardware
problems which is why fsck is run once per several mounts anyway).
  If you fsync() your data, you are guaranteed that also your data are
safely on disk when fsync returns. So what is the question here?

								Honza
-- 
Jan Kara <jack at suse.cz>
SuSE CR Labs


From jack at suse.cz  Wed Mar 28 13:29:04 2007
From: jack at suse.cz (Jan Kara)
Date: Wed, 28 Mar 2007 15:29:04 +0200
Subject: Ext3 behavior on power failure
In-Reply-To: <alpine.DEB.0.83.0703280915400.30460@sigma.j-a-k-j.com>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
	<4603B03E.7080302@emc.com>
	<20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
	<alpine.DEB.0.83.0703280915400.30460@sigma.j-a-k-j.com>
Message-ID: <20070328132903.GI14935@atrey.karlin.mff.cuni.cz>

> > If you fsync() your data, you are guaranteed that also your data are
> >safely on disk when fsync returns. So what is the question here?
> Pardon a newbie's intrusion, but I do know this isn't true. There is a 
> window of possible loss because of the multitude of layers of caching, 
> especially within the drive itself. Unless there is a super_duper_fsync() 
> that is able to actually poll the hardware and get a confirmation that the 
> internal buffers are purged?
  OK :), to correct myself: After fsync() returns, all the data is acked from
the disk (or at least it should be like that unless there's a bug
somewhere). So if there are some caches in the hardware which the hardware
is not able to flush on power failure, that's a bad luck... That's why
you should turn off write caching on cheaper disks if you really care
about data integrity.

									Honza
-- 
Jan Kara <jack at suse.cz>
SuSE CR Labs


From armangau_philippe at emc.com  Wed Mar 28 14:17:33 2007
From: armangau_philippe at emc.com (armangau_philippe at emc.com)
Date: Wed, 28 Mar 2007 10:17:33 -0400
Subject: Ext3 behavior on power failure
In-Reply-To: <alpine.DEB.0.83.0703280915400.30460@sigma.j-a-k-j.com>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
	<4603B03E.7080302@emc.com>
	<20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
	<alpine.DEB.0.83.0703280915400.30460@sigma.j-a-k-j.com>
Message-ID: <A74E8B4A356D8143B79BBEB839421F300415A86B@CORPUSMX20B.corp.emc.com>

In my case the disk cache is not a  problem - We use an emc disk array
the write cache is protected - 
Once the data has made over the disk array we can assume it is safe - 
Thx 
Philippe  

-----Original Message-----
From: John Anthony Kazos Jr. [mailto:jakj at j-a-k-j.com] 
Sent: Wednesday, March 28, 2007 9:17 AM
To: Jan Kara
Cc: wheeler, richard; armangau, philippe; ext3-users at redhat.com;
linux-ext4 at vger.kernel.org; csar at stanford.edu
Subject: Re: Ext3 behavior on power failure

>  If you fsync() your data, you are guaranteed that also your data are
> safely on disk when fsync returns. So what is the question here?

Pardon a newbie's intrusion, but I do know this isn't true. There is a 
window of possible loss because of the multitude of layers of caching, 
especially within the drive itself. Unless there is a
super_duper_fsync() 
that is able to actually poll the hardware and get a confirmation that
the 
internal buffers are purged?


From jack at suse.cz  Wed Mar 28 15:00:03 2007
From: jack at suse.cz (Jan Kara)
Date: Wed, 28 Mar 2007 17:00:03 +0200
Subject: Ext3 behavior on power failure
In-Reply-To: <A74E8B4A356D8143B79BBEB839421F300415A86B@CORPUSMX20B.corp.emc.com>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
	<4603B03E.7080302@emc.com>
	<20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
	<alpine.DEB.0.83.0703280915400.30460@sigma.j-a-k-j.com>
	<A74E8B4A356D8143B79BBEB839421F300415A86B@CORPUSMX20B.corp.emc.com>
Message-ID: <20070328150003.GE29587@duck.suse.cz>

On Wed 28-03-07 10:17:33, armangau_philippe at emc.com wrote:
> In my case the disk cache is not a  problem - We use an emc disk array
> the write cache is protected - 
> Once the data has made over the disk array we can assume it is safe - 
  Then if you are able to reproduce the situation that not all data
is written after fsync(); poweroff; that is a bug worth reporting..

								Honza
> 
> -----Original Message-----
> From: John Anthony Kazos Jr. [mailto:jakj at j-a-k-j.com] 
> Sent: Wednesday, March 28, 2007 9:17 AM
> To: Jan Kara
> Cc: wheeler, richard; armangau, philippe; ext3-users at redhat.com;
> linux-ext4 at vger.kernel.org; csar at stanford.edu
> Subject: Re: Ext3 behavior on power failure
> 
> >  If you fsync() your data, you are guaranteed that also your data are
> > safely on disk when fsync returns. So what is the question here?
> 
> Pardon a newbie's intrusion, but I do know this isn't true. There is a 
> window of possible loss because of the multitude of layers of caching, 
> especially within the drive itself. Unless there is a
> super_duper_fsync() 
> that is able to actually poll the hardware and get a confirmation that
> the 
> internal buffers are purged?
> 
-- 
Jan Kara <jack at suse.cz>
SuSE CR Labs


From tsh at mrc-lmb.cam.ac.uk  Wed Mar 28 17:47:32 2007
From: tsh at mrc-lmb.cam.ac.uk (T. Horsnell)
Date: Wed, 28 Mar 2007 18:47:32 +0100
Subject: ext3 usage guidance
Message-ID: <20070328174732.GA31129@ls1.lmb.internal>

Is there a document anywhere offering guidance on the optimum
use of ext3 filesystems? Googling shows nothing useful and the
Linux ext3 FAQ is not very forthcoming.

I'm particularly interested in:

1. The effect on performance of large numbers of (generally) small files
   One of my ext3 filesystems has 750K files on a 36GB disk, and 
   backup with tar takes forever. Even 'find /fs -type f -ls'
   to establish ownership of the various files takes some hours.
   Are there thresholds for #files-per-directory or #total-files-per-filesystem
   beyond which performance degrades rapidly?

2. I have a number of filesystems on SCSI disks which I would
   like to fsck on demand, rather than have an unscheduled
   fsck at reboot because some mount-count has expired.
   I use 'tune2fs -c 0 and -t 0' to do this, and would like
   to use 'shutdown -F -r 'at a chosen time to force fsck on
   reboot, and I'd then like fsck to do things in parallel.
   What are the resources (memory etc) required for parallel
   fsck'ing? Can I reasonably expect to be able to fsck say,
   50 300GB filesystems in parallel, or should I group them into
   smaller groups? How small?

Thanks,
Terry.

-- 


From ric at emc.com  Wed Mar 28 23:00:54 2007
From: ric at emc.com (Ric Wheeler)
Date: Wed, 28 Mar 2007 19:00:54 -0400
Subject: Ext3 behavior on power failure
In-Reply-To: <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
	<4603B03E.7080302@emc.com>
	<20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
Message-ID: <460AF3A6.403@emc.com>


Jan Kara wrote:
>> armangau_philippe at emc.com wrote:
>>> Hi all,
>>>
>>> We are building a new system which is going to use ext3 FS. We would like 
>>> to know more about the behavior of ext3 in the case of failure.  But 
>>> before I procede, I would like to share more information about our future 
>>> system. 
>>> *	Our application always does an fsync on files
>>> *	When symbolic links (more specifically fast symlink) are created, 
>>> the host directory is also fsync'ed. *	Our application is also 
>>> going to front an EMC disk array configured using RAID5 or RAID6.
>>> *	We will be using multipathing  so that we can assume that no disk 
>>> errors will be reported. 
>>> In this context , we would like to know the following for recovery after a 
>>> power outage:
>>>
>>> 1.	When will an fsck have to be run (not counting  the scheduled fsck 
>>> every N-mounts)?
>>> 2.	In the case of a crash, are the fsync-ed file contents and symbolic 
>>> links safe no matter what?
>>>
>>> Thanks,
>> This is an interesting twist on some of the discussion that we have had 
>> at the recent workshop and in other forums on hardening  file system in 
>> order to prevent the need to fsck.
>>
>> The twist is that we have a disk that will not lose power without being 
>> able to write to platter all of the data that has been sent - this is 
>> the case for most mid-range or higher disk arrays.
>>
>> If the application can precisely use fsync() on files, directories and 
>> symlinks, it wants to know that all objects are safe on disk that have 
>> completed a successful fsync. It also wants to know that the file system 
>> will not need any recovery beyond replaying transactions after a power 
>> outage/reboot - simply mount, let the transactions get replayed and you 
>> should be good to go without the fsck.
>>
>> The hard part of the question is to understand when and how often we 
>> will fail to deliver this easy case. Also, does any of the hardening in 
>> ext4 help here.
>   I'm probably misunderstanding something because the answer seems to be
> too obvious to me :) But anyway I'll write it so that you can correct
> me:
>   Due to journalling guarantees you should get consistent FS whenever
> you replay the log (unless there are some software bugs or hardware
> problems which is why fsck is run once per several mounts anyway).
>   If you fsync() your data, you are guaranteed that also your data are
> safely on disk when fsync returns. So what is the question here?
> 
> 								Honza

I think that the real question here is in practice, how often does this really 
hold to be true? When it fails, how long does it take to recover the file system?

There are a lot of odd errors that can happen when you monitor a large enough 
number of file systems. In my experience, I would guess that disk errors are 
clearly the leading cause of issues, followed by software bugs (file system, 
firmware, etc) and then a group of errors caused by various occasional things 
(bad DRAM in the server/HBA/disk, bad cables/etc). Note that using a high end 
array does not eliminate errors, it just reduces the rate (hopefully by a large 
amount).

What is really hard to predict is the rate of the failures that require fsck 
with our current file system (say for a specific hardware setup) and how changes 
like the checksumming in ext4 can help us ride through these errors without 
needing a full fsck.

This rate has a direct impact on how much pain an fsck will inflict and how 
important redundancy is to avoid having the file system be a single point of 
failure.

ric


From jack at suse.cz  Thu Mar 29 08:00:59 2007
From: jack at suse.cz (Jan Kara)
Date: Thu, 29 Mar 2007 10:00:59 +0200
Subject: Ext3 behavior on power failure
In-Reply-To: <460AF3A6.403@emc.com>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
	<4603B03E.7080302@emc.com>
	<20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
	<460AF3A6.403@emc.com>
Message-ID: <20070329080059.GA7698@duck.suse.cz>

On Wed 28-03-07 19:00:54, Ric Wheeler wrote:
> Jan Kara wrote:
> >>armangau_philippe at emc.com wrote:
> >>>Hi all,
> >>>
> >>>We are building a new system which is going to use ext3 FS. We would 
> >>>like to know more about the behavior of ext3 in the case of failure.  
> >>>But before I procede, I would like to share more information about our 
> >>>future system. 
> >>>*	Our application always does an fsync on files
> >>>*	When symbolic links (more specifically fast symlink) are created, 
> >>>the host directory is also fsync'ed. *	Our application is also 
> >>>going to front an EMC disk array configured using RAID5 or RAID6.
> >>>*	We will be using multipathing  so that we can assume that no disk 
> >>>errors will be reported. 
> >>>In this context , we would like to know the following for recovery after 
> >>>a power outage:
> >>>
> >>>1.	When will an fsck have to be run (not counting  the scheduled fsck 
> >>>every N-mounts)?
> >>>2.	In the case of a crash, are the fsync-ed file contents and symbolic 
> >>>links safe no matter what?
> >>>
> >>>Thanks,
> >>This is an interesting twist on some of the discussion that we have had 
> >>at the recent workshop and in other forums on hardening  file system in 
> >>order to prevent the need to fsck.
> >>
> >>The twist is that we have a disk that will not lose power without being 
> >>able to write to platter all of the data that has been sent - this is 
> >>the case for most mid-range or higher disk arrays.
> >>
> >>If the application can precisely use fsync() on files, directories and 
> >>symlinks, it wants to know that all objects are safe on disk that have 
> >>completed a successful fsync. It also wants to know that the file system 
> >>will not need any recovery beyond replaying transactions after a power 
> >>outage/reboot - simply mount, let the transactions get replayed and you 
> >>should be good to go without the fsck.
> >>
> >>The hard part of the question is to understand when and how often we 
> >>will fail to deliver this easy case. Also, does any of the hardening in 
> >>ext4 help here.
> >  I'm probably misunderstanding something because the answer seems to be
> >too obvious to me :) But anyway I'll write it so that you can correct
> >me:
> >  Due to journalling guarantees you should get consistent FS whenever
> >you replay the log (unless there are some software bugs or hardware
> >problems which is why fsck is run once per several mounts anyway).
> >  If you fsync() your data, you are guaranteed that also your data are
> >safely on disk when fsync returns. So what is the question here?
> >
> >								Honza
> 
> I think that the real question here is in practice, how often does this 
> really hold to be true? When it fails, how long does it take to recover the 
> file system?
  I see, thanks for explanation :).

> There are a lot of odd errors that can happen when you monitor a large 
> enough number of file systems. In my experience, I would guess that disk 
> errors are clearly the leading cause of issues, followed by software bugs 
> (file system, firmware, etc) and then a group of errors caused by various 
> occasional things (bad DRAM in the server/HBA/disk, bad cables/etc). Note 
> that using a high end array does not eliminate errors, it just reduces the 
> rate (hopefully by a large amount).
> 
> What is really hard to predict is the rate of the failures that require 
> fsck with our current file system (say for a specific hardware setup) and 
> how changes like the checksumming in ext4 can help us ride through these 
> errors without needing a full fsck.
  OK. All the features I've seen so far were more aiming to detecting that
such an unexpected problem happened rather than trying to fix it or make
fixing it faster. So currently it seems to me that any such unexpected
failure requires fsck...

								Honza
-- 
Jan Kara <jack at suse.cz>
SuSE CR Labs


From adilger at clusterfs.com  Thu Mar 29 09:16:44 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 29 Mar 2007 03:16:44 -0600
Subject: ext3 usage guidance
In-Reply-To: <20070328174732.GA31129@ls1.lmb.internal>
References: <20070328174732.GA31129@ls1.lmb.internal>
Message-ID: <20070329091644.GC5967@schatzie.adilger.int>

On Mar 28, 2007  18:47 +0100, T. Horsnell wrote:
> 1. The effect on performance of large numbers of (generally) small files
>    One of my ext3 filesystems has 750K files on a 36GB disk, and 
>    backup with tar takes forever. Even 'find /fs -type f -ls'
>    to establish ownership of the various files takes some hours.
>    Are there thresholds for #files-per-directory or #total-files-per-filesystem
>    beyond which performance degrades rapidly?

You should enable directory indexing if you have > 5000 file directories,
then index the directories.  "tune2fs -O dir_index /dev/XXX; e2fsck -fD /dev/XXX"

> 2. I have a number of filesystems on SCSI disks which I would
>    like to fsck on demand, rather than have an unscheduled
>    fsck at reboot because some mount-count has expired.
>    I use 'tune2fs -c 0 and -t 0' to do this, and would like
>    to use 'shutdown -F -r 'at a chosen time to force fsck on
>    reboot, and I'd then like fsck to do things in parallel.
>    What are the resources (memory etc) required for parallel
>    fsck'ing? Can I reasonably expect to be able to fsck say,
>    50 300GB filesystems in parallel, or should I group them into
>    smaller groups? How small?

I think it was at least "(inodes_count * 7 + blocks_count * 3) / 8" per
filesystem when I last checked, but I don't recall exactly anymore.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From tsh at mrc-lmb.cam.ac.uk  Thu Mar 29 10:09:07 2007
From: tsh at mrc-lmb.cam.ac.uk (T. Horsnell)
Date: Thu, 29 Mar 2007 11:09:07 +0100
Subject: ext3 usage guidance
In-Reply-To: <20070329091644.GC5967@schatzie.adilger.int>
References: <20070328174732.GA31129@ls1.lmb.internal>
	<20070329091644.GC5967@schatzie.adilger.int>
Message-ID: <20070329100907.GA7238@ls1.lmb.internal>

On Thu, Mar 29, 2007 at 03:16:44AM -0600, Andreas Dilger wrote:
> On Mar 28, 2007  18:47 +0100, T. Horsnell wrote:
> > 1. The effect on performance of large numbers of (generally) small files
> >    One of my ext3 filesystems has 750K files on a 36GB disk, and 
> >    backup with tar takes forever. Even 'find /fs -type f -ls'
> >    to establish ownership of the various files takes some hours.
> >    Are there thresholds for #files-per-directory or #total-files-per-filesystem
> >    beyond which performance degrades rapidly?
> 
> You should enable directory indexing if you have > 5000 file directories,
> then index the directories.  "tune2fs -O dir_index /dev/XXX; e2fsck -fD /dev/XXX"

Thanks very much.
Do you mean '> 5000 directories-per-filesystem' or '> 5000 files-per-directory' ?
tune2fs refers to 'large directories' which implies to me that its files-per-directory

Cheers,
Terry.

> 
> > 2. I have a number of filesystems on SCSI disks which I would
> >    like to fsck on demand, rather than have an unscheduled
> >    fsck at reboot because some mount-count has expired.
> >    I use 'tune2fs -c 0 and -t 0' to do this, and would like
> >    to use 'shutdown -F -r 'at a chosen time to force fsck on
> >    reboot, and I'd then like fsck to do things in parallel.
> >    What are the resources (memory etc) required for parallel
> >    fsck'ing? Can I reasonably expect to be able to fsck say,
> >    50 300GB filesystems in parallel, or should I group them into
> >    smaller groups? How small?
> 
> I think it was at least "(inodes_count * 7 + blocks_count * 3) / 8" per
> filesystem when I last checked, but I don't recall exactly anymore.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 

-- 


From vcaron at bearstech.com  Thu Mar 29 12:17:56 2007
From: vcaron at bearstech.com (Vincent Caron)
Date: Thu, 29 Mar 2007 14:17:56 +0200
Subject: tune2fs -l stale info
Message-ID: <1175170676.5185.42.camel@localhost>

Hello,

  I just noticed that 'tune2fs -l' did not returned a "lively" updated
information regarding the free inodes count (looks like it's always
correct after unmounting). It became suprising after an online resizing
operation, where the total inode count was immediatly updated (grown in
my case) but the free inode count was the same: one could deduce that
suddenly a lot of inodes were used.

  Is it a normal/expected behaviour ? Stale info is okay (as long as
advertised as such), but partially updated info makes it look incoherent
to me.

  I'm using ext3 on a 2.6.18 (Debian's "vanilla") kernel, x86_64
platform and tune2fs 1.40-WIP (14-Nov-2006).


From tytso at mit.edu  Thu Mar 29 18:59:30 2007
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 29 Mar 2007 14:59:30 -0400
Subject: tune2fs -l stale info
In-Reply-To: <1175170676.5185.42.camel@localhost>
References: <1175170676.5185.42.camel@localhost>
Message-ID: <20070329185930.GA30858@thunk.org>

On Thu, Mar 29, 2007 at 02:17:56PM +0200, Vincent Caron wrote:
> Hello,
> 
>   I just noticed that 'tune2fs -l' did not returned a "lively" updated
> information regarding the free inodes count (looks like it's always
> correct after unmounting). It became suprising after an online resizing
> operation, where the total inode count was immediatly updated (grown in
> my case) but the free inode count was the same: one could deduce that
> suddenly a lot of inodes were used.

Yes, this is expected.  Don't use tune2fs -l for this.  Use df -i
instead.  It is accurate while the filesystem is moutned, and it's
even portable, which is important if you ever need to use other legacy
Unix systems, such as Solaris.  :-)

You can use tune2fs -l or dumpe2fs to obtain the free block/inode
quotes for unmounted filesystems, assuming they were uncleanly
mounted.  If the system had crashed and you haven't yet run the
journal using e2fsck, the dumpe2fs/tune2fs -l may print stale
information until you run the journal by running e2fsck, or by
mounting and unmounting the ext3 filesystem.

						- Ted


From adilger at clusterfs.com  Thu Mar 29 19:59:39 2007
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 29 Mar 2007 13:59:39 -0600
Subject: tune2fs -l stale info
In-Reply-To: <1175170676.5185.42.camel@localhost>
References: <1175170676.5185.42.camel@localhost>
Message-ID: <20070329195939.GI5967@schatzie.adilger.int>

On Mar 29, 2007  14:17 +0200, Vincent Caron wrote:
>   I just noticed that 'tune2fs -l' did not returned a "lively" updated
> information regarding the free inodes count (looks like it's always
> correct after unmounting).

This is a bit of a defect in all 2.6 kernels.  They never update the
on disk superblock free blocks/inodes information to avoid lock contention,
even if this info is available.

Can you please give the following patch a try?  It fixes this issue,
and also makes statfs MUCH more efficient for large filesystems, because
the filesystem overhead is constant unless the filesystem size changes
and checking that for 16k groups is slow (hence hack to add cond_resched()
instead of fixing problem correctly).  It has not been tested much, but
is very straight forward.

Only the last part is strictly necessary to fix your particular problem
(setting of es->s_free_inodes_count and es->s_free_blocks_count).  This
is lazy, in the sense that you need a "statfs" to update the count, and
then a truncate or unlink or rmdir in order to dirty the superblock to
flush it to disk.  However, it will be correct in the buffer cache, and
it is a lot better than what we have now.  We don't want a non-lazy version
anyways, because of performance.

Signed-off-by: Andreas Dilger <adilger at clusterfs.com>

======================= ext3-statfs-2.6.20.diff ==========================
Index: linux-stage/fs/ext3/super.c
===================================================================
--- linux-stage.orig/fs/ext3/super.c	2007-03-22 17:29:30.000000000 -0600
+++ linux-stage/fs/ext3/super.c	2007-03-23 01:48:41.000000000 -0600
@@ -2389,19 +2389,22 @@ restore_opts:
 	struct super_block *sb = dentry->d_sb;
 	struct ext3_sb_info *sbi = EXT3_SB(sb);
 	struct ext3_super_block *es = sbi->s_es;
-	ext3_fsblk_t overhead;
-	int i;
+	static ext3_fsblk_t overhead_last;
+	static __le32 blocks_last;
 	u64 fsid;
 
-	if (test_opt (sb, MINIX_DF))
-		overhead = 0;
-	else {
-		unsigned long ngroups;
-		ngroups = EXT3_SB(sb)->s_groups_count;
+	if (test_opt (sb, MINIX_DF)) {
+		overhead_last = 0;
+	} else if (blocks_last != es->s_blocks_count) {
+		unsigned long ngroups = sbi->s_groups_count, group, metabg = ~0;
+		unsigned three = 1, five = 5, seven = 7;
+		ext3_fsblk_t overhead = 0;
 		smp_rmb();
 
 		/*
-		 * Compute the overhead (FS structures)
+		 * Compute the overhead (FS structures).  This is constant
+		 * for a given filesystem unless the number of block groups
+		 * changes so we cache the previous value until it does.
 		 */
 
 		/*
@@ -2419,28 +2422,43 @@ static int ext3_statfs (struct super_blo
 		 * block group descriptors.  If the sparse superblocks
 		 * feature is turned on, then not all groups have this.
 		 */
-		for (i = 0; i < ngroups; i++) {
-			overhead += ext3_bg_has_super(sb, i) +
-				ext3_bg_num_gdb(sb, i);
-			cond_resched();
-		}
+		overhead += 1 + sbi->s_gdb_count +
+			le16_to_cpu(es->s_reserved_gdt_blocks); /* group 0 */
+		if (EXT3_HAS_INCOMPAT_FEATURE(sb,
+					      EXT3_FEATURE_INCOMPAT_META_BG)) {
+			metabg = le32_to_cpu(es->s_first_meta_bg) *
+					sbi->s_desc_per_block;
+			group = ngroups - metabg;
+			overhead += (group + 1) / sbi->s_desc_per_block * 3 +
+				((group%sbi->s_desc_per_block)>= 2?2:(group%2));
+		}
+
+		while ((group = ext3_list_backups(sb, &three, &five, &seven)) <
+		       ngroups) /* sb + group descriptors backups */
+			overhead += 1 +(group >= metabg ? 0 : sbi->s_gdb_count +
+					le16_to_cpu(es->s_reserved_gdt_blocks));
 
 		/*
 		 * Every block group has an inode bitmap, a block
 		 * bitmap, and an inode table.
 		 */
-		overhead += (ngroups * (2 + EXT3_SB(sb)->s_itb_per_group));
+		overhead += ngroups * (2 + sbi->s_itb_per_group);
+		overhead_last = overhead;
+		smp_wmb();
+		blocks_last = es->s_blocks_count;
 	}
 
 	buf->f_type = EXT3_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
-	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
+	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead_last;
 	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
 	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
 	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
 	buf->f_namelen = EXT3_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
 	       le64_to_cpup((void *)es->s_uuid + sizeof(u64));
Index: linux-stage/fs/ext3/resize.c
===================================================================
--- linux-stage.orig/fs/ext3/resize.c	2007-03-22 17:29:30.000000000 -0600
+++ linux-stage/fs/ext3/resize.c	2007-03-23 01:16:38.000000000 -0600
@@ -292,8 +292,8 @@ exit_journal:
  * sequence of powers of 3, 5, and 7: 1, 3, 5, 7, 9, 25, 27, 49, 81, ...
  * For a non-sparse filesystem it will be every group: 1, 2, 3, 4, ...
  */
-static unsigned ext3_list_backups(struct super_block *sb, unsigned *three,
-				  unsigned *five, unsigned *seven)
+unsigned ext3_list_backups(struct super_block *sb, unsigned *three,
+			   unsigned *five, unsigned *seven)
 {
 	unsigned *min = three;
 	int mult = 3;
Index: linux-stage/include/linux/ext3_fs.h
===================================================================
--- linux-stage.orig/include/linux/ext3_fs.h	2007-03-22 17:29:30.000000000 -0600
+++ linux-stage/include/linux/ext3_fs.h	2007-03-23 00:41:22.000000000 -0600
@@ -846,6 +846,8 @@ extern int ext3_group_add(struct super_b
 extern int ext3_group_extend(struct super_block *sb,
 				struct ext3_super_block *es,
 				ext3_fsblk_t n_blocks_count);
+extern unsigned ext3_list_backups(struct super_block *sb, unsigned *three,
+				  unsigned *five, unsigned *seven);
 
 /* super.c */
 extern void ext3_error (struct super_block *, const char *, const char *, ...)

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From vcaron at bearstech.com  Thu Mar 29 20:12:13 2007
From: vcaron at bearstech.com (Vincent Caron)
Date: Thu, 29 Mar 2007 22:12:13 +0200
Subject: tune2fs -l stale info
In-Reply-To: <20070329185930.GA30858@thunk.org>
References: <1175170676.5185.42.camel@localhost>
	<20070329185930.GA30858@thunk.org>
Message-ID: <1175199133.5185.60.camel@localhost>

On jeu, 2007-03-29 at 14:59 -0400, Theodore Tso wrote:
> On Thu, Mar 29, 2007 at 02:17:56PM +0200, Vincent Caron wrote:
> > Hello,
> > 
> >   I just noticed that 'tune2fs -l' did not returned a "lively" updated
> > information regarding the free inodes count (looks like it's always
> > correct after unmounting). It became suprising after an online resizing
> > operation, where the total inode count was immediatly updated (grown in
> > my case) but the free inode count was the same: one could deduce that
> > suddenly a lot of inodes were used.
> 
> Yes, this is expected.  Don't use tune2fs -l for this.  Use df -i
> instead.  It is accurate while the filesystem is moutned, and it's
> even portable, which is important if you ever need to use other legacy
> Unix systems, such as Solaris.  :-)

  Thanks for the tip, figures looks much better now...


From jakj at j-a-k-j.com  Wed Mar 28 13:17:42 2007
From: jakj at j-a-k-j.com (John Anthony Kazos Jr.)
Date: Wed, 28 Mar 2007 13:17:42 -0000
Subject: Ext3 behavior on power failure
In-Reply-To: <20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
References: <A74E8B4A356D8143B79BBEB839421F3004023496@CORPUSMX20B.corp.emc.com>
	<4603B03E.7080302@emc.com>
	<20070328124015.GG14935@atrey.karlin.mff.cuni.cz>
Message-ID: <alpine.DEB.0.83.0703280915400.30460@sigma.j-a-k-j.com>

>  If you fsync() your data, you are guaranteed that also your data are
> safely on disk when fsync returns. So what is the question here?

Pardon a newbie's intrusion, but I do know this isn't true. There is a 
window of possible loss because of the multitude of layers of caching, 
especially within the drive itself. Unless there is a super_duper_fsync() 
that is able to actually poll the hardware and get a confirmation that the 
internal buffers are purged?