From chris at cjx.com Tue Oct 3 22:30:38 2006
From: chris at cjx.com (Chris Allen)
Date: Tue, 03 Oct 2006 23:30:38 +0100
Subject: 16TB ext3 mainstream - when?
Message-ID: <4522E48E.9040905@cjx.com>

Are we likely to see patches to allow 16TB ext3 in the mainstream kernel any time soon?

I am working with a storage box that has 16x750GB drives RAID5-ed together to create a potential 10.5TB of storage. But because ext3 is limited to 8TB, I am forced to split into two smaller ext3 filesystems, which is really cumbersome for my app.

Any ideas anybody?

From adilger at clusterfs.com Wed Oct 4 00:06:11 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 3 Oct 2006 18:06:11 -0600
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <4522E48E.9040905@cjx.com>
References: <4522E48E.9040905@cjx.com>
Message-ID: <20061004000611.GX22010@schatzie.adilger.int>

On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
> Are we likely to see patches to allow 16TB ext3 in the mainstream
> kernel any time soon?

I think the patches are going into -mm (if not already), so start testing... If not, they have been posted here several times, along with a URL for download.

> I am working with a storage box that has 16x750GB drives RAID5-ed together
> to create a potential 10.5TB of storage. But because ext3 is
> limited to 8TB I am forced to split into two smaller ext3 filesystems
> which is really cumbersome for my app.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From menscher at uiuc.edu Wed Oct 4 00:20:20 2006
From: menscher at uiuc.edu (Damian Menscher)
Date: Tue, 3 Oct 2006 19:20:20 -0500 (CDT)
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <20061004000611.GX22010@schatzie.adilger.int>
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID:

On Tue, 3 Oct 2006, Andreas Dilger wrote:
> On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
>> Are we likely to see patches to allow 16TB ext3 in the mainstream
>> kernel any time soon?
>
> I think the patches are going into -mm (if not already), so start testing...
> If not, they have been posted here several times, along with a URL for
> download.

Will those patches work to grow an existing ext3 filesystem to >8TB, or do they only work on new filesystems (created with those patches or other special options)? I ask because we need to create an <8TB filesystem now, but with the option to grow it to >8TB in the future.

Damian Menscher
--
-=#| www.uiuc.edu/~menscher/ Ofc:(650)253-2757 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-

From adilger at clusterfs.com Wed Oct 4 05:50:03 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 3 Oct 2006 23:50:03 -0600
Subject: 16TB ext3 mainstream - when?
In-Reply-To:
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID: <20061004055003.GA22010@schatzie.adilger.int>

On Oct 03, 2006 19:20 -0500, Damian Menscher wrote:
> On Tue, 3 Oct 2006, Andreas Dilger wrote:
> >On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
> >>Are we likely to see patches to allow 16TB ext3 in the mainstream
> >>kernel any time soon?
> >
> >I think the patches are going into -mm (if not already), so start
> >testing...
> >If not, they have been posted here several times, along with a URL for
> >download.
>
> Will those patches work to grow an existing ext3 filesystem to >8TB, or
> do they only work on new filesystems (created with those patches or
> other special options)?

There are no special options or features needed to use >8TB filesystems, just bug fixes in the kernel.
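For concreteness, the 8TB wall being discussed is pure block-count arithmetic: ext3 addresses blocks with a 32-bit number, and older kernels effectively treat it as signed (the details are spelled out later in this thread). A quick Python sketch of the arithmetic:

```python
def max_fs_bytes(block_size, block_bits):
    """Largest filesystem addressable with 2**block_bits blocks of
    block_size bytes each."""
    return (2 ** block_bits) * block_size

# Signed 32-bit block numbers (2^31 blocks) with 4kB blocks: the 8TB wall.
assert max_fs_bytes(4096, 31) == 2 ** 43
# A full unsigned 32-bit block number lifts that to 16TB.
assert max_fs_bytes(4096, 32) == 2 ** 44
# ext4's 48-bit block numbers with 4kB blocks reach 2^60 bytes.
assert max_fs_bytes(4096, 48) == 2 ** 60
```

So the fix being discussed (going from 2^31 to 2^32 usable blocks) exactly doubles the ceiling for a given block size.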
> I ask because we need to create an <8TB filesystem now, but with the
> option to grow it to >8TB in the future.

I have never tested that, and I don't know anyone else who has. That said, I'm not aware of any inherent limitations on growing the filesystem up to 16TB. I haven't looked at that code for a long time, and never really with an eye toward scalability to 16TB. It definitely will NOT work to grow past 16TB at all.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From evilninja at gmx.net Wed Oct 4 15:40:31 2006
From: evilninja at gmx.net (Christian)
Date: Wed, 4 Oct 2006 16:40:31 +0100 (BST)
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <20061004000611.GX22010@schatzie.adilger.int>
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID:

On Tue, 3 Oct 2006, Andreas Dilger wrote:
> On Oct 03, 2006 23:30 +0100, Chris Allen wrote:
>> Are we likely to see patches to allow 16TB ext3 in the mainstream
>> kernel any time soon?
>
> I think the patches are going into -mm (if not already), so start testing...
> If not, they have been posted here several times, along with a URL for
> download.

I don't get it: I thought ext2/3 filesystems (volumes) can be 32TiB in size? At least that's what [0] says. If this is wrong, someone should correct this information. Although I must admit that 16 TiB per fs makes more sense, given that with a max blocksize of 4K and a max of 2^32 blocks we have 16TiB of data.... Where does this 2^32 limitation come from anyway?

Thanks, Christian.

[0] http://en.wikipedia.org/wiki/Comparison_of_file_systems
--
BOFH excuse #247:
Due to Federal Budget problems we have been forced to cut back on the number of users able to access the system at one time. (namely none allowed....)

From adilger at clusterfs.com Wed Oct 4 17:11:33 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 4 Oct 2006 11:11:33 -0600
Subject: 16TB ext3 mainstream - when?
In-Reply-To:
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int>
Message-ID: <20061004171133.GE22010@schatzie.adilger.int>

On Oct 04, 2006 16:40 +0100, Christian wrote:
> I don't get it: I thought ext2/3 filesystems (volumes) can be 32TiB in
> size? At least that's what [0] says. If this is wrong, someone should
> correct this information. Although I must admit that 16 TiB per fs makes
> more sense, given that with a max blocksize of 4K and a max of 2^32
> blocks we have 16TiB of data.... Where does this 2^32 limitation come
> from anyway?

The 2^32 limit is a 32-bit integer number of blocks. In older kernels (i.e. anything except the latest -mm) there is a signed-int problem, so the effective limit is 2^31 blocks. With 1kB blocks this limit is 2TB (2^41 bytes), with 4kB blocks (most common) it is 8TB (2^43 bytes), and with 64kB blocks (PPC64, ia64, other large PAGE_SIZE systems) this limit is 32TB (2^45 bytes). The ext4 filesystem allows up to 2^48 blocks in the filesystem, so the limit is 2^60 bytes for 4kB blocks, and 2^64 bytes for 64kB blocks.

The major problem at this point is e2fsck time, which is about 1h/TB for fast disks, at minimum (i.e. no major corruption found). One of the goals for future ext4 development is to include checksums into the fs to allow online sanity checking, and also to speed up e2fsck in various ways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From evilninja at gmx.net Wed Oct 4 18:30:50 2006
From: evilninja at gmx.net (Christian)
Date: Wed, 4 Oct 2006 19:30:50 +0100 (BST)
Subject: 16TB ext3 mainstream - when?
In-Reply-To: <20061004171133.GE22010@schatzie.adilger.int>
References: <4522E48E.9040905@cjx.com> <20061004000611.GX22010@schatzie.adilger.int> <20061004171133.GE22010@schatzie.adilger.int>
Message-ID:

On Wed, 4 Oct 2006, Andreas Dilger wrote:
> 2TB (2^41 bytes), with 4kB blocks (most common) it is 8TB (2^43 bytes)
> with 64kB blocks (PPC64, ia64, other large PAGE_SIZE systems) this limit
> is 32TB (2^45 bytes).

Ah, although a max of 4kB is documented in my e2fsprogs-1.39 manpage, I can override it (e.g. -b 65536, but I cannot mount it then). OK.

> The major problem at this point is e2fsck time, which is about 1h/TB for
> fast disks, at minimum (i.e. no major corruption found). One of the
> goals for future ext4 development is to include checksums into the fs
> to allow online sanity checking, and also speed up e2fsck in various ways.

I'm tracking -mm, hopefully ext4 will be included in it anytime soon...

Thanks, Christian.
--
BOFH excuse #140:
LBNC (luser brain not connected)

From Matt_Dodson at messageone.com Wed Oct 4 21:33:52 2006
From: Matt_Dodson at messageone.com (Matt Dodson)
Date: Wed, 4 Oct 2006 16:33:52 -0500
Subject: EXT3 and large directories
Message-ID: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>

I have an ext3 filesystem that has several directories and each directory gets a large number of files inserted and then deleted over time. The filesystem is basically used as a temp store before files are processed. The issue is over time the directory scans get extremely slow even if the directories are empty. I have noticed the directories can range in size from 4k - 100M even when they are empty. Is there a way to fix this without recreating the directories or bringing the filesystem offline?
File system Info:

tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:
Last mounted on:
Filesystem UUID:          7cbda7aa-e8e7-4da1-9c7c-de45668e98f3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              98304000
Block count:              196608000
Reserved block count:     9830400
Free blocks:              31795332
Free inodes:              83024519
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Thu Aug 10 11:10:59 2006
Last mount time:          Tue Oct 3 00:10:48 2006
Last write time:          Tue Oct 3 00:10:48 2006
Mount count:              4
Maximum mount count:      21
Last checked:             Thu Aug 10 11:10:59 2006
Check interval:           15552000 (6 months)
Next check after:         Tue Feb 6 10:10:59 2007
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      59fd108a-7ec7-45f9-8967-b9f3aaec3edf
Journal backup:           inode blocks

Matt D.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From evilninja at gmx.net Wed Oct 4 22:07:36 2006
From: evilninja at gmx.net (Christian)
Date: Wed, 4 Oct 2006 23:07:36 +0100 (BST)
Subject: EXT3 and large directories
In-Reply-To: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
Message-ID:

(please refrain from sending HTML mails)

On Wed, 4 Oct 2006, Matt Dodson wrote:
> I have an ext3 filesystem that has several directories and each
> directory gets a large number of files inserted and then deleted over
> time.

Can you specify "large number"? What does "ls large-directory | wc -l" say?
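Counting the entries of a directory this large is itself expensive if the tool sorts the listing first (plain ls does). A streaming count sidesteps that; a minimal Python sketch, where the path is whatever directory is suspect:

```python
import os

def count_entries(path):
    """Stream over a directory and count entries without sorting or
    stat()ing them, which matters when the directory holds hundreds of
    thousands of files."""
    n = 0
    with os.scandir(path) as it:
        for _ in it:
            n += 1
    return n
```

From the shell, `ls -f large-directory | wc -l` (-f disables sorting) gets close to the same cost.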
> The filesystem is basically used as a temp store before files are
> processed. The issue is over time the directory scans get extremely slow
> even if the directories are empty. I have noticed the directories can
> range in size from 4k - 100M even when they are empty.

Probably deleted-but-still-open files. When lsof(8) is installed, you can find out with: "lsof -ln | grep large-but-empty-directory"

Can you specify "slow" as well? You also might want to strace(1) an "ls" on your large directory to see what is taking so long.

> Is there a way to fix this without recreating the directories or
> bringing the filesystem offline?

You have enabled htree (dir_index) already:

> Filesystem features: has_journal resize_inode dir_index filetype
> needs_recovery sparse_super large_file

If you've enabled dir_index after the directories have been created, you might want to "e2fsck -D" (see the manpage for details) the filesystem. For partitions with temporary files you could play with the "noatime", "async" and "data" mount options (please read the manpage, really!).

Which kernel do you use? Which arch?

C.
--
BOFH excuse #83:
Support staff hung over, send aspirin and come back LATER.

From evilninja at gmx.net Wed Oct 4 23:18:07 2006
From: evilninja at gmx.net (Christian)
Date: Thu, 5 Oct 2006 00:18:07 +0100 (BST)
Subject: EXT3 and large directories (fwd)
Message-ID:

(please reply on-list, so everybody can comment/help)

Matt, thanks for the details, but apart from mount-option tuning and dir_index (which you've already enabled), I don't know why ls(1) would take *hours* to stat ~1M files... Out of curiosity: are you able to try a newer kernel? Does it change anything?

Christian.

---------- Forwarded message ----------

The dir_index was enabled during the filesystem creation.
The directories can have from 0 - 1,000,000 files at any given time. The slowness is on an open of the directory when it is empty. The size of the directory refers to the directory file itself, not the size of the directory's contents. There are no open files in the directory when it is being statted. I will add that when we do have 50,000 files or more in the directories, a listing is also very slow and can take hours. Slowness can be 5-10 minutes on a directory open:

open("/ems/bigdisk/132", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=79298560, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, /* 3 entries */, 4096) = 72
getdents64(3, /* 0 entries */, 4096) = 0
close(3) = 0

Kernel is 2.6.9-34.0.2.ELsmp

I have tried noatime, which does speed up reads when there are lots of files but does not fix the directory issue.

These directories are empty except for two files:

drwxr-xr-x 3 vbox132 root  76M Oct  2 10:04 132
drwxr-xr-x 3 vbox151 root 226M Oct  4 17:00 151
drwxr-xr-x 3 vbox229 root  33M Oct  2 10:16 229
drwxr-xr-x 3 vbox235 root 7.5M Oct  2 10:14 235
drwxr-xr-x 3 vbox246 root  52M Sep 30 20:59 246
drwxr-xr-x 3 vbox249 root 1.1M Oct  2 10:04 249

--
BOFH excuse #83:
Support staff hung over, send aspirin and come back LATER.
_______________________________________________
Ext3-users mailing list
Ext3-users at redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users

From adilger at clusterfs.com Thu Oct 5 00:43:22 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 4 Oct 2006 18:43:22 -0600
Subject: EXT3 and large directories
In-Reply-To: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com>
Message-ID: <20061005004322.GQ22010@schatzie.adilger.int>

On Oct 04, 2006 16:33 -0500, Matt Dodson wrote:
> I have an ext3 filesystem that has several directories and each
> directory gets a large number of files inserted and then deleted over
> time. The filesystem is basically used as a temp store before files are
> processed. The issue is over time the directory scans get extremely slow
> even if the directories are empty. I have noticed the directories can
> range in size from 4k - 100M even when they are empty. Is there a way
> to fix this without recreating the directories or bringing the
> filesystem offline?

No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink directories when deleting files.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From Matt_Dodson at messageone.com Thu Oct 5 02:12:24 2006
From: Matt_Dodson at messageone.com (Matt Dodson)
Date: Wed, 4 Oct 2006 21:12:24 -0500
Subject: EXT3 and large directories
In-Reply-To: <20061005004322.GQ22010@schatzie.adilger.int>
Message-ID: <44B5599C8B5B1347AFF903FDCEC00307A05945@auscorpex-1.austin.messageone.com>

Is this a bug or by design? Would there be a better filesystem to use for my situation?

Matt D.
-----Original Message-----
From: Andreas Dilger [mailto:adilger at clusterfs.com]
Sent: Wednesday, October 04, 2006 7:43 PM
To: Matt Dodson
Cc: ext3-users at redhat.com
Subject: Re: EXT3 and large directories

On Oct 04, 2006 16:33 -0500, Matt Dodson wrote:
> I have an ext3 filesystem that has several directories and each
> directory gets a large number of files inserted and then deleted over
> time. The filesystem is basically used as a temp store before files are
> processed. The issue is over time the directory scans get extremely slow
> even if the directories are empty. I have noticed the directories can
> range in size from 4k - 100M even when they are empty. Is there a way
> to fix this without recreating the directories or bringing the
> filesystem offline?

No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink directories when deleting files.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From bjacke at sernet.de Thu Oct 5 13:57:57 2006
From: bjacke at sernet.de (=?iso-8859-1?Q?Bj=F6rn?= JACKE)
Date: Thu, 5 Oct 2006 15:57:57 +0200
Subject: creation time stamps for ext4 ?
Message-ID:

Hi,

I would like to know if there are any plans to introduce a creation timestamp in future ext3/4 versions. Having a 4th timestamp saving the creation time would be very good for projects like Samba, for example. It would be important that the creation time can also be set manually later on by some system call. Systems like FreeBSD's UFS and Solaris' ZFS already support creation times. Unfortunately Linux doesn't have such a thing standardized anywhere, but it would be great if it did.

Are there any plans to add this?

Bjoern

From adilger at clusterfs.com Thu Oct 5 15:19:37 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 5 Oct 2006 09:19:37 -0600
Subject: creation time stamps for ext4 ?
In-Reply-To:
References:
Message-ID: <20061005151937.GV22010@schatzie.adilger.int>

On Oct 05, 2006 15:57 +0200, Björn JACKE wrote:
> I would like to know if there are any plans to introduce a creation
> timestamp in future ext3/4 versions. Having a 4th timestamp saving the
> creation time would be very good for projects like Samba for example.
> It would be important that creation time can also be set manually
> later on by some system call. Systems like FreeBSD's UFS and Solaris'
> ZFS already support creation times. Unfortunately Linux doesn't have
> such a thing standardized anywhere but it would be great if it did.
>
> Are there any plans to add this?

I've given this some thought for adding as part of the nsec timestamp patch. That is more feasible if we move the nsec ctime into the main inode to double as the version field.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From lists at nerdbynature.de Thu Oct 5 15:41:23 2006
From: lists at nerdbynature.de (Christian Kujau)
Date: Thu, 5 Oct 2006 16:41:23 +0100 (BST)
Subject: EXT3 and large directories
In-Reply-To: <44B5599C8B5B1347AFF903FDCEC00307A05945@auscorpex-1.austin.messageone.com>
References: <44B5599C8B5B1347AFF903FDCEC00307A05945@auscorpex-1.austin.messageone.com>
Message-ID:

On Wed, 4 Oct 2006, Matt Dodson wrote:
> Is this a bug or by design? Would there be a better filesystem to use
> for my situation?

I think it's a design issue; 1M files wasn't common when ext3 came out (1999). ReiserFS is said to be "fast with lots of small files", but as always: evaluate the fs before putting applications on it.

FWIW, I did a little test with ext3 and 0.1M/1M files on an already existing fs (rootfs of an existing FC6 installation): http://nerdbynature.de/bits/2.6.18-mm3/

cheers, Christian.
--
BOFH excuse #143:
had to use hammer to free stuck disk drive heads.
From tytso at mit.edu Thu Oct 5 16:55:04 2006
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 5 Oct 2006 12:55:04 -0400
Subject: creation time stamps for ext4 ?
In-Reply-To: <20061005151937.GV22010@schatzie.adilger.int>
References: <20061005151937.GV22010@schatzie.adilger.int>
Message-ID: <20061005165504.GA23727@thunk.org>

On Thu, Oct 05, 2006 at 09:19:37AM -0600, Andreas Dilger wrote:
> On Oct 05, 2006 15:57 +0200, Björn JACKE wrote:
> > I would like to know if there are any plans to introduce a creation
> > timestamp in future ext3/4 versions. Having a 4th timestamp saving the
> > creation time would be very good for projects like Samba for example.
> > It would be important that creation time can also be set manually
> > later on by some system call. Systems like FreeBSD's UFS and Solaris'
> > ZFS already support creation times. Unfortunately Linux doesn't have
> > such a thing standardized anywhere but it would be great if it did.
> >
> > Are there any plans to add this?
>
> I've given this some thought for adding as part of the nsec timestamp
> patch. That is more feasible if we move the nsec ctime into the main
> inode to double as the version field.

Shoehorning an extra creation time field into the inode is relatively easy, but it's also necessary to have system calls to get and set the creation time. The stat structure doesn't have room for the creation time, so that means a new version of the stat structure exported by the kernel, and a new version of the stat structure exported by glibc.

So there are VFS and glibc changes necessary to make this be useful. But that doesn't prevent us from reserving space in the inode and starting to fill it in with the creation time, although it may be quite a while before it will be easily available to user programs like Samba.
- Ted

From tytso at mit.edu Thu Oct 5 17:02:29 2006
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 5 Oct 2006 13:02:29 -0400
Subject: EXT3 and large directories
In-Reply-To: <20061005004322.GQ22010@schatzie.adilger.int>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int>
Message-ID: <20061005170229.GB23727@thunk.org>

On Wed, Oct 04, 2006 at 06:43:22PM -0600, Andreas Dilger wrote:
> On Oct 04, 2006 16:33 -0500, Matt Dodson wrote:
> > I have an ext3 filesystem that has several directories and each
> > directory gets a large number of files inserted and then deleted over
> > time. The filesystem is basically used as a temp store before files are
> > processed. The issue is over time the directory scans get extremely slow
> > even if the directories are empty. I have noticed the directories can
> > range in size from 4k - 100M even when they are empty. Is there a way
> > to fix this without recreating the directories or bringing the
> > filesystem offline?
>
> No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink directories
> when deleting files.

Well, if there isn't anyone else using the directory, you can also do the following:

mkdir foo.new
mv foo/* foo.new
rmdir foo
mv foo.new foo

And of course, if you know the directory is empty, just do:

rmdir foo
mkdir foo

Historically this is a pretty common restriction in Unix filesystems. If someone cared enough, it would be possible to change ext3/4 to release directory blocks when they are empty, but no one has found it important enough to create such a patch.
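The rebuild sequence above is easy to script; a minimal Python sketch (as noted, nothing else may be using the directory while it runs, and the swap is not atomic, so readers can briefly see the path missing):

```python
import os

def rebuild_directory(path):
    """Shrink a bloated directory by recreating it: move every entry
    into a fresh directory, remove the old (now empty) one, and rename
    the new one into place -- the mkdir/mv/rmdir/mv sequence."""
    tmp = path + ".new"
    os.mkdir(tmp)
    for name in os.listdir(path):
        os.rename(os.path.join(path, name), os.path.join(tmp, name))
    os.rmdir(path)        # the old directory's bloated blocks are freed
    os.rename(tmp, path)  # fresh, compact directory takes its place
```

Unlike the shell version with `mv foo/*`, listing the entries explicitly also catches dotfiles, which would otherwise make the rmdir fail.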
- Ted

From alex at alex.org.uk Thu Oct 5 18:10:30 2006
From: alex at alex.org.uk (Alex Bligh)
Date: Thu, 05 Oct 2006 19:10:30 +0100
Subject: EXT3 and large directories
In-Reply-To: <20061005170229.GB23727@thunk.org>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int> <20061005170229.GB23727@thunk.org>
Message-ID: <1BB7D14D25835639B3770CDB@[192.168.0.101]>

--On 05 October 2006 13:02 -0400 Theodore Tso wrote:
>> >The issue is over time the directory scans get extremely slow
>> > even if the directories are empty. I have noticed the directories can
>> > range in size from 4k - 100M even when they are empty.
...
>> No way to fix this w/o offline e2fsck -fD. ext3 doesn't shrink
>> directories when deleting files.
...
> Historically this is a pretty common restriction in Unix filesystems.
> If someone cared enough, it would be possible to change ext3/4 to
> release directory blocks when they are empty, but no one has found it
> important enough to create such a patch.

I had sort of assumed this wouldn't be a problem after htree was incorporated as far as speed, as opposed to size, is concerned - and speed was the original poster's problem, not size on disk. Does that imply there is still some linear searching going on, or that htree is not "enough" to speed up the searches?

How do the deleted entries get reused? E.g. if I have a mail spool application, where a given directory has around 100,000 files in it at any time, and they are periodically deleted by age in batches of (say) 10,000 such that the number in the directory never exceeds 100,000, does the size of the directory just keep growing for ever? Or do newly created directory entries take up the space in the directory of old ones (assume all the filenames are unique)?
Alex

From tytso at mit.edu Thu Oct 5 18:58:18 2006
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 5 Oct 2006 14:58:18 -0400
Subject: EXT3 and large directories
In-Reply-To: <1BB7D14D25835639B3770CDB@[192.168.0.101]>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int> <20061005170229.GB23727@thunk.org> <1BB7D14D25835639B3770CDB@[192.168.0.101]>
Message-ID: <20061005185818.GA7621@thunk.org>

On Thu, Oct 05, 2006 at 07:10:30PM +0100, Alex Bligh wrote:
> I had sort of assumed this wouldn't be a problem after htree was
> incorporated as far as speed, as opposed to size, is concerned - and speed
> was the original poster's problem, not size on disk. Does that imply there
> is still some linear searching going on, or that htree is not "enough" to
> speed up the searches?

The current implementation of htree doesn't shrink leaf nodes when they are empty, so if you create a really, really big directory, and then delete all of the files, the leaf nodes remain in the htree, empty. So htree will speed up the lookup of *specific* files, but it won't speed up readdir() scanning a large, empty directory.

- Ted

From bjacke at sernet.de Thu Oct 5 19:23:12 2006
From: bjacke at sernet.de (=?iso-8859-1?Q?Bj=F6rn?= JACKE)
Date: Thu, 5 Oct 2006 21:23:12 +0200
Subject: creation time stamps for ext4 ?
In-Reply-To: <20061005165504.GA23727@thunk.org>
References: <20061005151937.GV22010@schatzie.adilger.int> <20061005165504.GA23727@thunk.org>
Message-ID:

On 2006-10-05 at 12:55 -0400 Theodore Tso sent off:
> > I've given this some thought for adding as part of the nsec timestamp
> > patch. That is more feasible if we move the nsec ctime into the main
> > inode to double as the version field.
>
> Shoehorning an extra creation time field into the inode is relatively
> easy, but it's also necessary to have system calls to get and set the creation time.
> The stat structure doesn't have room for the creation time, so that
> means a new version of the stat structure exported by the kernel, and
> a new version of the stat structure exported by glibc.
>
> So there are VFS and glibc changes necessary to make this be useful.
> But that doesn't prevent us from reserving space in the inode and
> starting to fill it in with the creation time, although it may be
> quite a while before it will be easily available to user programs like
> Samba.

Yes, probably. But it's a reasonable effort to start that at some time. It's good if the ext3 developers have it in mind already now. Should I open a feature request at bugzilla.kernel.org for the needed VFS changes?

Bjoern

From adilger at clusterfs.com Thu Oct 5 20:07:26 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 5 Oct 2006 14:07:26 -0600
Subject: creation time stamps for ext4 ?
In-Reply-To: <20061005165504.GA23727@thunk.org>
References: <20061005151937.GV22010@schatzie.adilger.int> <20061005165504.GA23727@thunk.org>
Message-ID: <20061005200726.GW22010@schatzie.adilger.int>

On Oct 05, 2006 12:55 -0400, Theodore Tso wrote:
> > I've given this some thought for adding creation time as part of the nsec
> > timestamp patch. That is more feasible if we move the nsec ctime into
> > the main inode to double as the version field.
>
> Shoehorning an extra creation time field into the inode is relatively
> easy, but it's also necessary to have system calls to get and set the
> creation time. The stat structure doesn't have room for the creation
> time, so that means a new version of the stat structure exported by the
> kernel, and a new version of the stat structure exported by glibc.

For Lustre and NFSv4, an in-kernel interface is sufficient. I was thinking that as a preliminary userspace interface we can use getxattr with a standard name like user.crtime.
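That interface can be mocked up today with ordinary user EAs; a minimal Python sketch (user.crtime is the proposed name above, not an implemented kernel attribute, so this simply stores the stamp as a regular xattr and needs a filesystem with user xattr support):

```python
import os
import time

CRTIME_EA = "user.crtime"   # attribute name from the proposal above

def stamp_creation(path, ts=None):
    """Record a creation time for path in a user EA, the way Samba
    could in the absence of large inodes."""
    ts = time.time() if ts is None else ts
    os.setxattr(path, CRTIME_EA, repr(ts).encode())
    return ts

def read_creation(path):
    """Return the stored creation time, or None if the EA is missing
    or the filesystem doesn't support user xattrs."""
    try:
        return float(os.getxattr(path, CRTIME_EA))
    except OSError:
        return None
```

Unlike a real inode field, an EA set this way can be copied or forged by anyone with write access, which is one reason an in-inode field plus a dedicated interface is still preferable.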
Storing the crtime directly in the inode is more efficient than a separate EA, but it would also be compatible if Samba wanted to use real EAs to store this in the absence of large inodes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From adilger at clusterfs.com Thu Oct 5 21:30:36 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 5 Oct 2006 15:30:36 -0600
Subject: EXT3 and large directories
In-Reply-To: <1BB7D14D25835639B3770CDB@[192.168.0.101]>
References: <44B5599C8B5B1347AFF903FDCEC00307A058FD@auscorpex-1.austin.messageone.com> <20061005004322.GQ22010@schatzie.adilger.int> <20061005170229.GB23727@thunk.org> <1BB7D14D25835639B3770CDB@[192.168.0.101]>
Message-ID: <20061005213036.GZ22010@schatzie.adilger.int>

On Oct 05, 2006 19:10 +0100, Alex Bligh wrote:
> How do the deleted entries get reused? EG if I have a mail spool
> application, where a given directory has around 100,000 files in at any
> time, and they are periodically deleted by age in batches of (say) 10,000
> such that the number in the directory never exceeds 100,000, does the size
> of the directory just keep growing for ever? Or do newly created directory
> entries take up the space in the directory of old ones (assume all the
> filenames are unique).

It depends on the hash function, and the nature of the filenames being used. The hash function should be good at randomizing the hashes, and in the above case I would expect a very uniform hash distribution. That means the empty entries would be filled relatively uniformly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
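The uniform-spread point is easy to illustrate with a stand-in hash (MD5 here; ext3's actual htree hashes are TEA or half-MD4, selected by the "Default directory hash" superblock field shown in the earlier tune2fs output):

```python
import hashlib

def bucket_counts(names, buckets=16):
    """Spread filenames across hash buckets, the way htree spreads
    directory entries across leaf blocks.  MD5 is only a stand-in for
    ext3's real directory hashes."""
    counts = [0] * buckets
    for name in names:
        h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")
        counts[h % buckets] += 1
    return counts

# Sequential mail-spool style names still spread almost evenly,
# so slots freed by batch deletions get refilled across all buckets.
counts = bucket_counts("msg%07d" % i for i in range(20000))
assert all(abs(c - 1250) < 250 for c in counts)  # within 20% of 20000/16
```

A pathological hash (say, one keyed only on the numeric suffix modulo bucket count) would instead concentrate deletions and insertions in a few leaf blocks.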
From bothie at gmx.de Sun Oct 8 16:45:19 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 18:45:19 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20060925154818.GC22010@schatzie.adilger.int>
References: <4516C67E.10609@bcgreen.com> <20060925154818.GC22010@schatzie.adilger.int>
Message-ID: <20061008184519.6082fe00@30_bodo.rupinet>

Andreas Dilger wrote:
> On Sep 24, 2006 10:55 -0700, Stephen Samuel wrote:
> > Having just spent a day trying to recover a deleted ext3 file
> > for a friend, I'm wondering about this way of maintaining
> > undelete information in ext3, like is done for ext2:
> >
> > The last step in the deletion process would be to put back
> > the (previously zeroed) block pointers. Since it gets logged
> > to the journal, I _think_ that this should be safe. The worst
> > that would happen is that, if the plug gets pulled in the
> > middle of a file delete, the old block pointers would be
> > unavailable -- I don't see this as a killer issue, since
> > editing the filesystem to do an undelete should be considered an
> > emergency operation anyways.
>
> I've written a couple of times the best way to do this,

Your solution works only for small files. Big files must be managed another way, as I wrote on Sun, 1 Feb 2004 07:00:58 +0100 in the thread "Ext3 and undeletion - A way how it could work." But it seems that the problem is not ideas on how to implement it, but somebody actually doing it ... I don't have the knowledge currently, else I would have done it already.
Regards, Bodo

From bothie at gmx.de Sun Oct 8 16:52:14 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 18:52:14 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20060924195319.GC11083@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org>
Message-ID: <20061008185214.04ed1b8f@30_bodo.rupinet>

Theodore Tso wrote:
> The other caveat is that
> storing all of the previously zeroed block pointers temporarily in
> memory could take quite a bit of memory, especially if what is being
> deleted is really big.

Even Stephen's idea doesn't need MBs of space. After freeing all blocks pointed to by an ind, that ind is unlinked in its dind or in the inode, whichever applies. At that moment, we can already restore its contents. So the worst case for 8k blocks is to remember two ind blocks, two dind blocks and one tind block and the inode. That makes 41088 bytes. I don't consider this a problem ;)

> Of course, storing the information as a series of extents would be an
> obvious optimization, which would work on all but a very badly
> fragmented file (for example, if said DVD .iso image was created when
> the filesystem was close to 100% full).

Or just read my mail from Sun, 1 Feb 2004 07:00:58 +0100 (Ext3 and undeletion - A way how it could work.)

> There are some other ways it could be done that would be more optimized,
> but the bottom line is that the main reason why it hasn't been done is
> because the people who could do it haven't had the time to implement
> it. We've been working on other features that are higher priority,
> either for ourselves or for our employers. :(

But as I said: ideas are not the problem. Time is the problem.
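The 41088-byte worst case above checks out arithmetically (8kB blocks, classic 128-byte ext3 on-disk inode):

```python
BLOCK_SIZE = 8192   # 8k blocks, as in Bodo's worst case
INODE_SIZE = 128    # classic ext3 on-disk inode

# While freeing a triply-indirect file, at most this much must be
# remembered at once: the current and previous single-indirect (ind)
# blocks, the current and previous double-indirect (dind) blocks, the
# one triple-indirect (tind) block, and the inode itself.
worst_case = (2 + 2 + 1) * BLOCK_SIZE + INODE_SIZE
assert worst_case == 41088
```

With the more common 4kB blocks the same bound would be 5 * 4096 + 128 = 20608 bytes.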
Regards, Bodo

From tytso at mit.edu Sun Oct 8 17:03:29 2006
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 8 Oct 2006 13:03:29 -0400
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008185214.04ed1b8f@30_bodo.rupinet>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet>
Message-ID: <20061008170329.GA30816@thunk.org>

On Sun, Oct 08, 2006 at 06:52:14PM +0200, Bodo Thiesen wrote:
> Theodore Tso wrote:
>
> > The other caveat is that storing all of the previously zeroed block
> > pointers temporarily in memory could take quite a bit of memory,
> > especially if what is being deleted is really big.
>
> Even Stephen's idea doesn't need MBs of space. After freeing all blocks
> pointed to by an ind block, that ind block is unlinked from its dind
> block or from the inode, whichever applies. At that moment, we can
> already restore its contents. So the worst case for 8k blocks is to
> remember two ind blocks, two dind blocks, one tind block, and the inode.
> That makes 41088 bytes. I don't consider that a problem ;)

Actually, you can't --- that's the problem. Until the changes are
committed, which means that the changes represented in the filesystem are
self-consistent and in a transaction which has been committed to the
journal, you can't start restoring the information in the indirect block.
You could if you forced transaction boundaries between every single
indirect block, but that would seriously degrade ext3's unlink performance,
and slow down any other filesystem activity that might be happening at the
same time.

This is what makes the undelete problem so subtle. Doing it in a way that
is optimal for performance, preserves the journalling guarantees, and yet
still allows the undelete is more complicated than it first appears.

> But as I said: ideas are not the problem. Time is the problem.

Yep, exactly.
- Ted

From bothie at gmx.de Sun Oct 8 17:40:12 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 19:40:12 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008170329.GA30816@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org>
Message-ID: <20061008194012.35e49669@30_bodo.rupinet>

Theodore Tso wrote:

> Actually, you can't --- that's the problem. Until the changes are
> committed, which means that the changes represented in the filesystem
> are self-consistent and in a transaction which has been committed to
> the journal, you can't start restoring the information in the indirect
> block.

I don't see the problem here. Ok, I must admit I don't know the code very
well, especially the journalling part; I only know the non-journalling
on-disk structures. But just consider: we are talking about committing
transactions or not committing transactions. Assume we have a big file, and
ind block I1, dind block D1 and tind block T must be changed to be
self-consistent. Ok, no problem: we store the original contents of these
three blocks in memory, and then update (i.e. zero out) some parts. In the
next transaction, we need to change I2, D2 and T. If I1 != I2, we restore
I1 in this transaction - it's no longer needed - remember the old content
of I2, and log the changes for I2 in the journal. The same applies to D2
vs. D1. If I1 and I2 (or D1 and D2, respectively) are the same, they are
just updated, leaving the in-memory copy of the original data alone.

So I don't see the point why we would need to force the data to disk. If
the system crashes, I1 will just be written several times instead of just
once. But after the whole log has been replayed, the file system is
consistent again.
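The bookkeeping just described can be sketched as a toy in a few lines
(purely illustrative Python, not ext3 code; the block names, contents, and
transaction layout are all invented for the sketch):

```python
# Toy sketch of the scheme above (NOT ext3 code): while zeroing indirect
# block I_n, keep its original content in memory; the transaction that moves
# on to I_(n+1) also journals the restore of I_n's old pointers.
def plan_transactions(indirect_blocks):
    """indirect_blocks: list of (name, original_content) pairs, in deletion
    order. Returns a list of transactions, each a list of journalled writes."""
    transactions = []
    pending_restore = None
    for name, original in indirect_blocks:
        writes = []
        if pending_restore is not None:
            # the previous ind block is fully freed: restore its old pointers
            writes.append(("restore",) + pending_restore)
        writes.append(("zero", name))
        pending_restore = (name, original)
        transactions.append(writes)
    if pending_restore is not None:
        transactions.append([("restore",) + pending_restore])
    return transactions

for t in plan_transactions([("I1", "old-ptrs-1"), ("I2", "old-ptrs-2")]):
    print(t)
```

The point of the sketch is only the shape: the restore of I1 rides in the
same transaction that zeroes I2, so no extra commit is forced between them.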
The only misfeature now is that the in-memory copy of the original versions
of the blocks will be lost, but my proposal from long ago fixes that as
well, by storing the updates in places other than the original version,
which remains unmodified.

Regards, Bodo

From tytso at mit.edu Sun Oct 8 19:40:20 2006
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 8 Oct 2006 15:40:20 -0400
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008194012.35e49669@30_bodo.rupinet>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet>
Message-ID: <20061008194020.GA26726@thunk.org>

On Sun, Oct 08, 2006 at 07:40:12PM +0200, Bodo Thiesen wrote:
> We are talking about committing transactions or not committing
> transactions. Assume we have a big file, and ind block I1, dind
> block D1 and tind block T must be changed to be self-consistent. Ok,
> no problem: we store the original contents of these three blocks in
> memory, and then update (i.e. zero out) some parts. In the next
> transaction, we need to change I2, D2 and T. If I1 != I2, we restore
> I1 in this transaction - it's no longer needed - remember the old
> content of I2, and log the changes for I2 in the journal. Same

"In the next transaction" --- that's exactly the problem, as I said in my
earlier comment:

   You could if you forced transaction boundaries between every single
   indirect block, but that would seriously degrade ext3's unlink
   performance, and slow down any other filesystem activity that might be
   happening at the same time.

The way ext3 works is that we batch multiple operations into a single
transaction. This is because committing transactions is expensive, so we
amortize the cost over a potentially large number of filesystem operations
that might be happening very close together.
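The amortization effect can be put in rough numbers with a toy cost model
(the constants are invented purely for illustration; real commit costs
depend on the device and workload):

```python
# Toy cost model for transaction batching: each journal commit pays a fixed
# cost, each metadata update a small cost, so batching many operations into
# one transaction amortizes the commit overhead. Constants are made up.
COMMIT_COST = 100  # hypothetical fixed cost per journal commit
OP_COST = 1        # hypothetical cost of logging one metadata update

def total_cost(n_ops, ops_per_transaction):
    # ceil division: number of transactions needed to cover all operations
    transactions = -(-n_ops // ops_per_transaction)
    return n_ops * OP_COST + transactions * COMMIT_COST

print(total_cost(1000, 1))    # a commit per operation: 101000
print(total_cost(1000, 100))  # batched, 100 ops/txn:     2000
```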
So your "trick" would force a single unlink system call into multiple ext3
transactions, each of which would have to be written to the disk, and each
of which would have to stall until all journal blocks have been written to
the disk before the journal commit block is written. The resulting
performance degradation would be disastrous.

- Ted

From bothie at gmx.de Sun Oct 8 21:38:22 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 23:38:22 +0200
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008194020.GA26726@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet> <20061008194020.GA26726@thunk.org>
Message-ID: <20061008233822.481e8647@30_bodo.rupinet>

Theodore Tso wrote:

> "In the next transaction" --- that's exactly the problem, as I said
> in my earlier comment:
>
>    You could if you forced transaction boundaries between every single
>    indirect block, but that would seriously degrade ext3's unlink
>    performance, and slow down any other filesystem activity that might
>    be happening at the same time.
>
> The way ext3 works is that we batch multiple operations into a single
> transaction. This is because committing transactions is expensive, so
> we amortize the cost over a potentially large number of filesystem
> operations that might be happening very close together.

What does the journalling code do if a block x which was already written to
in a transaction gets written to again? Say we delete a small file from a
directory and immediately recreate it, so the same directory data block
needs to be updated again. Will this require a new transaction as well? If
not, my approach doesn't either.

BTW: When I talked about a transaction, I obviously meant something
different than you did; on the other hand, that was my fault. What I meant
by a transaction is something like an atom.
Moving a file from directory A to directory B needs (at least) four
updates: the inodes of the two directories and the directory data blocks. I
would say that this update is one transaction. But you would say that it is
only part of a transaction, as you would put the deletion of another file,
writing some data to an iso image, and whatever else into the same
transaction. So just replace my "transactions" by "transaction atoms", and
then read again what I wrote; maybe that makes my idea clearer.

As soon as I1 is completely zeroed, it will be unlinked from D1, and thus
I1 doesn't need to be written as containing zeros. So if no update to I1
was already committed to disk, there is no need to do it at all (something
like forget should be available in the journalling code as well). If it was
already committed, its original content needs to be committed in the next
transaction, but there is no need to force a commit at this place at all.

Regards, Bodo

From bothie at gmx.de Sun Oct 8 21:59:03 2006
From: bothie at gmx.de (Bodo Thiesen)
Date: Sun, 8 Oct 2006 23:59:03 +0200
Subject: Root filesystem on ext2
In-Reply-To: <47164.208.250.32.6.1158287646.squirrel@192.168.175.202>
References: <47164.208.250.32.6.1158287646.squirrel@192.168.175.202>
Message-ID: <20061008235903.19ecc8a2@30_bodo.rupinet>

"Jayjitkumar Lobhe" wrote:

> - My initrd image

... is irrelevant ...

> - /etc/fstab

... which is irrelevant as well, as the *kernel* doesn't look there.

> - I don't mount the real root during [...] linuxrc [...] the kernel will
> mount it after linuxrc is finished.

Right, using the general autoprobe order. You have to know: ext2 and ext3
(and ext4) are the same file system. There are three different drivers for
the same file system: ext2, which supports the "normal" filesystem
including many extensions but NOT the extension "journalling"; that
extension is only supported by the ext3 *driver*.
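That shared on-disk format is easy to see directly: whether a filesystem
"is ext3" is just a compat-feature bit in the common superblock. Here is a
sketch, assuming the standard ext2 superblock layout (superblock at byte
1024, s_feature_compat as a little-endian u32 at offset 92, and
EXT3_FEATURE_COMPAT_HAS_JOURNAL == 0x0004); the device path in the usage
comment is only an example:

```python
# ext2 and ext3 share one superblock format; "has a journal" is one compat
# feature bit. Offsets below follow the standard ext2 superblock layout.
import struct

EXT3_FEATURE_COMPAT_HAS_JOURNAL = 0x0004
S_FEATURE_COMPAT_OFFSET = 92  # byte offset of s_feature_compat in the superblock

def has_journal(superblock: bytes) -> bool:
    """True if the given raw superblock carries the HAS_JOURNAL compat flag."""
    (s_feature_compat,) = struct.unpack_from("<I", superblock,
                                             S_FEATURE_COMPAT_OFFSET)
    return bool(s_feature_compat & EXT3_FEATURE_COMPAT_HAS_JOURNAL)

# Example usage against a real device (path illustrative, needs read access):
# with open("/dev/hda1", "rb") as dev:
#     dev.seek(1024)              # superblock starts at byte 1024
#     print(has_journal(dev.read(1024)))
```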
But again: it's the same file system (thus calling it ext3 and ext4 is a
very, very bad misnomer, but that is another story). The kernel must
autoprobe for the driver to use when mounting its root file system, as it
doesn't get any hints. So it tries to mount as iso9660 ... and fails; it
tries to mount as vfat ... and fails; and so on. The order in which the
file system *drivers* are tried is controlled by the order in which they
are registered inside the kernel. Your ext3 driver will be registered later
than the ext2 driver. So ext2 will be tried first, and ext2 recognizes the
file system as ext2 as long as the file system was unmounted correctly. So
the ext2 driver can successfully mount the root file system, and you are
stuck without journalling forever [i.e. until you reboot].

Solutions:

a) Make the ext3 file system driver a part of the static kernel (i.e. NOT
   a module). In this case the kernel code makes sure ext3 gets registered
   before ext2. ext3 only works with file systems containing a journal,
   and thus leaves those file systems which don't have a journal alone.

b) Don't unmount the root file system before rebooting *scnr* [Of course
   you would need it to be mounted as ext3 at some point for this to work
   (a Knoppix boot would suffice), but you wouldn't consider that anyways,
   I hope ;)]

c) Change the file system of the initrd to ANYTHING other than ext2 AND
   make ext2 a module, like ext3. Then make sure to modprobe ext3 BEFORE
   ext2.

> - The system boots up successfully, mount command shows / partition
> mounted as ext3 but /proc/mounts shows it as ext2.

That's another thing. When you (the kernel) mount(s) your root file system,
1.) that file system isn't accessible, nor is it writable yet (it becomes
so only AFTER being mounted), and 2.) there is no mount utility doing it. I
guess your /etc is on your root partition. In /etc you will find a file
called mtab. That file contains the (wrong and very old) information that
your root was mounted using the ext3 file system driver.
mount just cats this file and thus shows the same wrong information.
/proc/mounts contains the current and correct information known by the
kernel. Some people have even deleted /etc/mtab already and replaced it by
a symlink to /proc/mounts. Other people said that can be a problem; I don't
see the point, but just to warn you ;). For me, it worked fine.

Regards, Bodo

From bryan at kadzban.is-a-geek.net Mon Oct 9 02:13:14 2006
From: bryan at kadzban.is-a-geek.net (Bryan Kadzban)
Date: Sun, 08 Oct 2006 22:13:14 -0400
Subject: Root filesystem on ext2
In-Reply-To: <20061008235903.19ecc8a2@30_bodo.rupinet>
References: <47164.208.250.32.6.1158287646.squirrel@192.168.175.202> <20061008235903.19ecc8a2@30_bodo.rupinet>
Message-ID: <4529B03A.5080508@kadzban.is-a-geek.net>

Bodo Thiesen wrote:

> Solutions:
>
> a) Make the ext3 file system driver a part of the static kernel
> b) Don't unmount the root file system before rebooting
> c) Change the file system of the initrd to ANYTHING other than ext2
>    AND make ext2 a module, like ext3. Then make sure to modprobe
>    ext3 BEFORE ext2.

d) Change your initramfs to manually mount the root filesystem. You will be
able to completely specify what you want mount to do, including using a
different FS than it normally would. (Er, wait, you still use an initrd? I
suppose that doesn't really matter, but initramfs is newer.)

> Some people have even deleted /etc/mtab already and replaced it by a
> symlink to /proc/mounts. Other people said that can be a problem; I
> don't see the point, but just to warn you ;). For me, it worked fine.

You must not be using any mount options that require keeping state between
mount and umount, then. That's not the case in general. One such option is
"user" -- with "user", any user can mount the FS, but only that same user
(or root) is allowed to umount it. To enforce this, mount has to keep track
of which user did the mount -- it does so in /etc/mtab.
The kernel doesn't care (this restriction is enforced by the setuid-root
mount and umount programs, not the kernel), so that information does not
appear in /proc/mounts at all. If you have a "user" FS in fstab, then I'd
be willing to bet that in your symlink setup any user can mount it, and any
other user can umount it. If you want to try it, mount one of them as root,
then see if you can umount it as a user. I can't when mtab is not a
symlink; this is correct behavior.

From tytso at mit.edu Mon Oct 9 03:12:09 2006
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 8 Oct 2006 23:12:09 -0400
Subject: Retaining undelete data on ext3
In-Reply-To: <20061008233822.481e8647@30_bodo.rupinet>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet> <20061008194020.GA26726@thunk.org> <20061008233822.481e8647@30_bodo.rupinet>
Message-ID: <20061009031209.GA24190@thunk.org>

On Sun, Oct 08, 2006 at 11:38:22PM +0200, Bodo Thiesen wrote:
> BTW: When I talked about a transaction, I obviously meant something
> different than you did; on the other hand, that was my fault. What I
> meant by a transaction is something like an atom. Moving a file from
> directory A to directory B needs (at least) four updates: the inodes of
> the two directories and the directory data blocks. I would say that
> this update is one transaction. But you would say that it is only part
> of a transaction, as you would put the deletion of another file,
> writing some data to an iso image, and whatever else into the same
> transaction. So just replace my "transactions" by "transaction atoms",
> and then read again what I wrote; maybe that makes my idea clearer.
Ah, but that brings up the other problem, which is that for a really big
file, your "transaction atom" might not fit in a single "transaction".
Remember, it's not just about keeping the inode, indirect block, double
indirect, and triple indirect blocks up to date; it's also about all of
those block allocation bitmaps, and for a big file, the number of block
bitmaps you might have to touch can grow very large indeed.

If the number of blocks that have to be touched during the unlink is larger
than the space left in the journal, then we have to write a consistent
snapshot of the inode, indirect, double indirect, and triple indirect
blocks, plus all of the block bitmaps. And if you try to "restore" the
blocks afterwards, that's potentially an extra block that needs to be
journaled in the new transaction, and getting that all right is more than a
little bit tricky.

Now, the good news is that we are using bforget in journal_forget now, and
that at least some of the time, restoring the i_blocks[] pointers will
allow the inode to be recovered --- although if the unlink operation takes
multiple transactions, you won't get the entire inode recovered that way.

The bottom line is that the interaction of truncate and journalling gets
tricky if you want it to be 100% reliable. If you're willing to settle for
"mostly working", it's probably not that hard.
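To put a rough number on "very large indeed", here is a back-of-the-envelope
sketch. The 4 KiB block size and 1 TiB file size are assumptions chosen for
illustration; the layout facts (one block bitmap per block group, each group
covering block_size * 8 blocks) are standard ext2/ext3:

```python
# Rough scale of how many block bitmaps a single unlink can touch.
# Assumptions: 4 KiB blocks; one bitmap block per block group; each group
# covers block_size * 8 blocks (standard ext2/ext3 layout); a 1 TiB file.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = BLOCK_SIZE * 8          # 32768 blocks per group

file_blocks = (1 << 40) // BLOCK_SIZE      # 268435456 data blocks in 1 TiB
groups_touched = file_blocks // BLOCKS_PER_GROUP
print(groups_touched)  # 8192 block bitmaps, at minimum, for this file
```

Since a group holds at most 32768 blocks, a file this size necessarily
spans thousands of groups, so freeing it dirties thousands of bitmap blocks.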
- Ted

From adilger at clusterfs.com Wed Oct 11 21:01:20 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 11 Oct 2006 15:01:20 -0600
Subject: Retaining undelete data on ext3
In-Reply-To: <20061009031209.GA24190@thunk.org>
References: <4516C67E.10609@bcgreen.com> <20060924195319.GC11083@thunk.org> <20061008185214.04ed1b8f@30_bodo.rupinet> <20061008170329.GA30816@thunk.org> <20061008194012.35e49669@30_bodo.rupinet> <20061008194020.GA26726@thunk.org> <20061008233822.481e8647@30_bodo.rupinet> <20061009031209.GA24190@thunk.org>
Message-ID: <20061011210120.GS22010@schatzie.adilger.int>

On Oct 08, 2006 23:12 -0400, Theodore Tso wrote:
> The bottom line is that the interaction of truncate and journalling
> gets tricky if you want it to be 100% reliable. If you're willing to
> settle for "mostly working", it's probably not that hard.

You can't be 100% reliable with undelete anyway, because there is no
guarantee that the blocks won't be reallocated right away. Having a 95%
undelete solution in a few lines of code would be worthwhile, IMHO, since
this topic comes up a lot and I've lamented on a few occasions the fact
that you can't ever salvage deleted files from ext3.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From jeffg at ahpcrc.org Mon Oct 16 17:04:56 2006
From: jeffg at ahpcrc.org (Jeff Garlough)
Date: Mon, 16 Oct 2006 12:04:56 -0500
Subject: dual-ported raid
Message-ID: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>

Hi,

I have a dual-ported raid controller which allows two computers to connect
to the same ext3 filesystem. I never mount both systems read-write at the
same time. What I would like to do is use one normally, and mount the
second system read-only to perform backups and to rsync the filesystem to
another filesystem. When it's mounted read-write from another system, will
mounting the same filesystem read-only cause the journal to be committed at
the time it's mounted?
If so, is that a bad thing, that is, will it corrupt the filesystem? Are
journal events handled similarly to databases with regard to transaction
processing, or could playing "partial" journal events (if there is such a
thing) cause corruption? Is mounting the read-only instance as an ext2
filesystem the best solution, or does it matter whether it's mounted ext2
or ext3 as long as it's read-only?

-- Jeff Garlough

From adilger at clusterfs.com Mon Oct 16 18:53:44 2006
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 16 Oct 2006 12:53:44 -0600
Subject: dual-ported raid
In-Reply-To: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>
References: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>
Message-ID: <20061016185344.GL6221@schatzie.adilger.int>

On Oct 16, 2006 12:04 -0500, Jeff Garlough wrote:
> What I would like to do is use one normally, and mount the second
> system read-only to perform backups and to rsync the filesystem to
> another filesystem. When it's mounted read-write from another system,
> will mounting the same filesystem read-only cause the journal to be
> committed at the time it's mounted?

Yes, that is very bad.

> If so, is that a bad thing, that is, will it corrupt the filesystem?

Yes, it can corrupt the filesystem.

> Are journal events handled similarly to databases with regard to
> transaction processing, or could playing "partial" journal events (if
> there is such a thing) cause corruption? Is mounting the read-only
> instance as an ext2 filesystem the best solution, or does it matter
> whether it's mounted ext2 or ext3 as long as it's read-only?

You can't mount it as ext2.

I would instead use a block-device level backup, like "dump", if you really
need to do it this way. You are probably better off just doing the backup
from the primary node.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
From worleys at gmail.com Mon Oct 16 19:43:26 2006
From: worleys at gmail.com (Chris Worley)
Date: Mon, 16 Oct 2006 13:43:26 -0600
Subject: dual-ported raid
In-Reply-To: <20061016185344.GL6221@schatzie.adilger.int>
References: <20061016170456.0418B4D291@que.ncs.ahpcrc.org> <20061016185344.GL6221@schatzie.adilger.int>
Message-ID:

You can do it if the two systems use different LUNs for their
ext/reiser/xfs file systems, or if you use GFS as the file system (then you
can mount the same FS read/write).

On 10/16/06, Andreas Dilger wrote:
> On Oct 16, 2006 12:04 -0500, Jeff Garlough wrote:
> > What I would like to do is use one normally, and mount the second
> > system read-only to perform backups and to rsync the filesystem to
> > another filesystem. When it's mounted read-write from another system,
> > will mounting the same filesystem read-only cause the journal to be
> > committed at the time it's mounted?
>
> Yes, that is very bad.
>
> > If so, is that a bad thing, that is, will it corrupt the filesystem?
>
> Yes, it can corrupt the filesystem.
>
> > Are journal events handled similar to databases, with regard to
> > transaction processing of journal events, or could playing "partial"
> > journal events (if there is such a thing) cause corruption? Is
> > mounting the read-only instance as a ext2 filesystem the best
> > solution, or does it matter if it's mounted ext2 or ext3 as long as
> > it's read-only?
>
> You can't mount it as ext2.
>
> I would instead use a block-device level backup, like "dump" if you
> really need to do it this way. You are probably better off just doing
> the backup from the primary node.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

From daniel at rimspace.net Mon Oct 16 23:20:22 2006
From: daniel at rimspace.net (Daniel Pittman)
Date: Tue, 17 Oct 2006 09:20:22 +1000
Subject: dual-ported raid
References: <20061016170456.0418B4D291@que.ncs.ahpcrc.org>
Message-ID: <87vemjrgp5.fsf@rimspace.net>

Jeff Garlough writes:

> I have a dual-ported raid controller which allows two computers to
> connect to the same ext3 filesystem. I never mount both systems
> read-write at the same time. What I would like to do is use one
> normally, and mount the second system read-only to perform backups and
> to rsync the filesystem to another filesystem.

That will not work, full stop, ever, with ext3. Find another solution.

If you did do this, envision: on the master node, where read/write
activities are going on, we have a bunch of on-disk data and a bunch of
meta-data in memory -- things like inode allocation tables, etc. These get
written out to disk every now and then, through the journal and for other
reasons, on whatever schedule the master node feels is worthwhile.

Meanwhile, over on the slave node you mount the file system. It reads some
meta-data into memory and keeps it there, for convenience. You start
working on data -- and, meanwhile, over on the master we update some of the
meta-data that the slave has in memory.

Now, the slave doesn't know that was updated, so it keeps using that
in-memory data happily. Except, then it needs to load some fresh data from
disk and, pow, huge inconsistency in the file system.

ext3 alone cannot do what you want. You might get away with it if you can
take a snapshot of the (consistent) state on the master, then mount that on
the slave, but that probably isn't a great plan either. I strongly suggest
you investigate some other solution like, say, simply running your backups
on the master.
You will have the same resource use in both cases, pretty much, unless your
rsync process is very checksum-intensive...

Regards, Daniel
--
Digital Infrastructure Solutions -- making IT simple, stable and secure
Phone: 0401 155 707
email: contact at digital-infrastructure.com.au
http://digital-infrastructure.com.au/

From neotericgnosis at yahoo.com Tue Oct 17 23:04:53 2006
From: neotericgnosis at yahoo.com (Jeff Garlough)
Date: Tue, 17 Oct 2006 16:04:53 -0700 (PDT)
Subject: Subject: Re: dual-ported raid
Message-ID: <20061017230453.81243.qmail@web52202.mail.yahoo.com>

>> What I would like to do is use one normally, and mount the second
>> system read-only to perform backups and to rsync the filesystem to
>> another filesystem. When it's mounted read-write from another system,
>> will mounting the same filesystem read-only cause the journal to be
>> committed at the time it's mounted?
>
> Yes, that is very bad.

Can you elaborate on why mounting a filesystem read-only is "dangerous"?

>> If so, is that a bad thing, that is, will it corrupt the filesystem?
>
> Yes, it can corrupt the filesystem.

I assume, then, that mounting the filesystem read-only flushes the journal.
Why does flushing it "early" corrupt the filesystem?

>> Are journal events handled similar to databases, with regard to
>> transaction processing of journal events, or could playing "partial"
>> journal events (if there is such a thing) cause corruption? Is
>> mounting the read-only instance as a ext2 filesystem the best
>> solution, or does it matter if it's mounted ext2 or ext3 as long as
>> it's read-only?
>
> You can't mount it as ext2.

Why? It seemed to work, although I'm not sure, from the comments I've been
getting, that it's safe. The ext3-faq says:

   How do I convert my ext3 partition back to ext2? Actually there is
   little need to do so, because in most cases it is sufficient to mount
   the partition explicitly as ext2.
> I would instead use a block-device level backup, like "dump", if you
> really need to do it this way. You are probably better off just doing
> the backup from the primary node.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

-- Jeff Garlough

From davids at webmaster.com Wed Oct 18 02:12:40 2006
From: davids at webmaster.com (David Schwartz)
Date: Tue, 17 Oct 2006 19:12:40 -0700
Subject: Subject: Re: dual-ported raid
In-Reply-To: <20061017230453.81243.qmail@web52202.mail.yahoo.com>
Message-ID:

> Can you elaborate on why mounting a filesystem read-only
> is "dangerous"?

You will be interpreting the filesystem based on a mix of current and stale
metadata. There is no way you can be sure what will happen in this case.
Metadata may be read as data or vice versa; pieces of one file may be read
as pieces of another one.

DS

From pengchengzou at gmail.com Fri Oct 20 23:02:53 2006
From: pengchengzou at gmail.com (Pengcheng Zou)
Date: Sat, 21 Oct 2006 07:02:53 +0800
Subject: the worst scenario of ext3 after abnormal powerdown
Message-ID: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com>

Hi,

I have seen and heard of many cases of ext3 corruption after an abnormal
powerdown (e.g. missing all the files in one directory). Yes, a UPS should
help, but I wonder what kind of worst-case scenario ext3 will present after
a powerdown.

Messed-up metadata has been seen in many cases: for example, the indirect
block of one inode contains garbage, which causes the automatic fsck to
fail, and the user has to repair the file system manually (which always
results in some missing files). Should I blame ext3 for it? Or should I
just turn off the disk write cache?
It seems Windows NTFS has fewer such problems than ext3, and whether it's a
problem in ext3 or misconfigured hardware, this behavior really causes lots
of people to doubt the stability of Linux file systems.

thanks,
-- Pengcheng

From mnalis-ml at voyager.hr Sat Oct 21 11:43:26 2006
From: mnalis-ml at voyager.hr (Matija Nalis)
Date: Sat, 21 Oct 2006 13:43:26 +0200
Subject: the worst scenario of ext3 after abnormal powerdown
In-Reply-To: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com>
References: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com>
Message-ID: <20061021114326.GA3149@eagle102.home.lan>

On Sat, Oct 21, 2006 at 07:02:53AM +0800, Pengcheng Zou wrote:
> Messed-up metadata has been seen in many cases: for example, the
> indirect block of one inode contains garbage, which causes the
> automatic fsck to fail, and the user has to repair the file system
> manually (which always results in some missing files). Should I blame
> ext3 for it? Or should I just turn off the disk write cache?

In recent 2.6.x you can mount ext3 with "-o barrier=1", and you should then
be able to safely use disks with the write cache on (if the disks support
it -- watch dmesg for "JBD: barrier-based sync failed" errors if not
supported). Read Documentation/block/barrier.txt for more info.

> It seems Windows NTFS has fewer such problems than ext3, and whether
> it's a problem in ext3 or misconfigured hardware, this behavior really
> causes lots of people to doubt the stability of Linux file systems.

It would be nice to know why "barrier=1" is not the default on ext3 (to be
safe by default, like with data=ordered instead of data=writeback)? (It is
on by default on XFS, for example.)

Also an interesting question on http://lkml.org/lkml/2005/12/18/99: "...
But if you want a different raid level you should ask the ext3 developers
if there is a reason they don't call blkdev_issue_flush if barriers aren't
supported."

-- Opinions above are GNU-copylefted.
From mnalis-ml at voyager.hr Mon Oct 23 17:15:31 2006
From: mnalis-ml at voyager.hr (Matija Nalis)
Date: Mon, 23 Oct 2006 19:15:31 +0200
Subject: the worst scenario of ext3 after abnormal powerdown
In-Reply-To: <24a313060610230727t5e2aa501wcb2258410fcdd1db@mail.gmail.com>
References: <24a313060610201602t6218a230h6a3059f8a2e50bf1@mail.gmail.com> <20061021114326.GA3149@eagle102.home.lan> <24a313060610230727t5e2aa501wcb2258410fcdd1db@mail.gmail.com>
Message-ID: <20061023171531.GA3240@eagle102.home.lan>

On Mon, Oct 23, 2006 at 10:27:20PM +0800, Pengcheng Zou wrote:
> Thanks a lot for the explanation. So if I understand it correctly, to
> get reliable data storage, I need to turn off the write cache or enable
> barriers. Both methods depend on the hardware. So how do I know whether
> a disk or drive supports a write cache? How do I turn off the write
> cache (I know hdparm -W0 for IDE, but how do I turn off the write cache
> of a SCSI drive)?

Maybe http://scsirastools.sourceforge.net/ ?

Also see:
http://www-dt.e-technik.uni-dortmund.de/~ma/linux/kernel/safe-write-caches.html

-- Opinions above are GNU-copylefted.

From ramanara at cse.psu.edu Wed Oct 25 23:43:24 2006
From: ramanara at cse.psu.edu (Rajaraman Ramanarayanan)
Date: Wed, 25 Oct 2006 19:43:24 -0400 (EDT)
Subject: FS corruption? bogus i_mode
Message-ID:

Hello,

I am doing some testing on a PXA270-based processor on a single board
computer, which makes the processor vulnerable to bit flips. One such bit
flip seems to have corrupted the file system. The debug port on the board
had the following messages when I think the FS corruption occurred:

<7>init_special_inode: bogus i_mode (33061)
init_special_inode: bogus i_mode (30071)
init_special_inode: bogus i_mode (34065)
init_special_inode: bogus i_mode (30061)
init_special_inode: bogus i_mode (33061)
init_special_inode: bogus i_mode (30071)

After this happened, directories like bin etc.
were corrupted ( I am pasting the screen shot of ll commands that i did) which meant that i could not start the board again using the same FS (I had to re install the root file system on the hard drive). My question is what error could have caused a file system corruption like this. Is it possible to trace and analyze if i have the whole FS backed up? The OS was debian linux. I hope the question is clear and the given information is useful enough to make some comments. Here is the screen shot of the ll commands for 2 of the directories: (The total space in the partition was 4GB) ************************************************************************* segrith.cse.psu.edu 66% du -khs bin 426G bin segrith.cse.psu.edu 67% ll total 446404348 cr-Sr-S--- 8240 959265076 876099129 32, 50 Oct 2 1997 bin drwxr-xr-x 2 root root 4096 Sep 30 2005 boot drwxr-xr-x 6 root root 24576 Oct 10 15:29 dev drwxr-xr-x 61 root root 4096 Oct 10 15:32 etc drwxr-xr-x 2 root root 4096 Sep 30 2005 home drwxr-xr-x 2 root root 4096 Dec 31 1969 initrd drwxr-xr-x 9 root root 4096 Jan 12 2006 lib drwxr-xr-x 2 root root 16384 Dec 31 1969 lost+found drwxr-xr-x 4 root root 4096 Dec 19 2005 media drwxr-xr-x 8 root root 4096 Apr 26 15:39 mnt drwxr-xr-x 3 root root 4096 Dec 19 2005 opt dr-xr-xr-x 2 root root 4096 Dec 31 1969 proc drwxr-xr-x 4 root root 4096 Oct 9 23:10 root drwxr-xr-x 2 root root 4096 May 1 11:38 sbin drwxr-xr-x 2 root root 4096 Jan 12 2006 selinux drwxr-xr-x 2 root root 4096 Dec 31 1969 srv drwxr-xr-x 2 root root 4096 Dec 31 1969 sys drwxrwxrwt 4 root root 4096 Oct 10 15:29 tmp drwxr-xr-x 12 root root 4096 Dec 19 2005 usr drwxr-xr-x 13 root root 4096 Dec 19 2005 var segrith.cse.psu.edu 68% cd root/samplecodes/test7/ segrith.cse.psu.edu 69% du -khs * 426G a.out 0 err.out 434G matrix_a 458G matrix_b 394G matrix_c 426G matrix_d 434G matrix_e 394G matrix_f segrith.cse.psu.edu 70% ll total 3107434033 ?---rw---x 11552 892546336 959789109 943207220 Dec 28 1993 a.out -rw-r--r-- 1 root root 0 Oct 
10 15:09 err.out ?---rwS--t 13869 909522483 540549173 926166304 Dec 28 1993 matrix_a ?--Srw-r-x 11552 943140128 757084720 808726580 Dec 28 1993 matrix_b ?---rwx--x 8246 842276912 540030005 859124013 Feb 11 1987 matrix_c ?---rw---x 11552 892546336 959789109 943207220 Dec 28 1993 matrix_d ?---rwS--t 13869 909522483 540549173 926166304 Dec 28 1993 matrix_e ?---rwx--x 8246 842276912 540030005 859124013 Feb 11 1987 matrix_f segrith.cse.psu.edu 71% ************************************************************************ Thank you! Sincerely, Rajaraman From lists at nerdbynature.de Thu Oct 26 14:57:17 2006 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 26 Oct 2006 15:57:17 +0100 (BST) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Wed, 25 Oct 2006, Rajaraman Ramanarayanan wrote: > I am doing some testing on a PXA270 based processor (on a single board > computer) which makes the processor vulnerable to bit flips. One > such bit flip seems to have corrupted the file system. I don't know these PXA270 processors, but your comment reads as if the processor is "prone to bit-flips by design", which I can't believe... so, I guess the CPU broke somehow, was overheated or something? If so, that's like having faulty memory or faulty data paths in general (bus errors, bad cabling, too-hot processors, etc.). All kinds of errors can be caused by this, and the fs can't do much about it, because the code in the fs-driver (any fs) isn't executed in the way it is meant to be. > segrith.cse.psu.edu 66% du -khs bin > 426G bin > segrith.cse.psu.edu 67% ll > total 446404348 > cr-Sr-S--- 8240 959265076 876099129 32, 50 Oct 2 1997 bin So, the system thinks /bin is a 426 GB character device on a 4 GB filesystem? You could run a recent version of e2fsck and see what can be repaired, but I'd suggest getting a stable hardware platform and playing back your backups :( Christian.
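One hedged way to follow that e2fsck suggestion offline, assuming a raw `dd` image of the damaged partition exists (the image file names here are placeholders):

```shell
# Work on a copy so the original image stays intact for later analysis
cp fs.img fs-work.img

# Read-only pass first: report what e2fsck *would* fix, changing nothing
e2fsck -fn fs-work.img

# Then an actual repair attempt, still only on the copy
e2fsck -fy fs-work.img

# Inspect a suspect inode by hand, e.g. the corrupted /bin entry,
# against the pristine image
debugfs -R "stat /bin" fs.img
```

Both e2fsck and debugfs operate happily on plain image files, so none of this needs the flaky hardware attached.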
-- BOFH excuse #54: Evil dogs hypnotised the night shift From ramanara at cse.psu.edu Thu Oct 26 15:46:11 2006 From: ramanara at cse.psu.edu (Rajaraman Ramanarayanan) Date: Thu, 26 Oct 2006 11:46:11 -0400 (EDT) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: Thanks for the response. I am actually exposing the processor to neutron radiation, which makes it vulnerable. Otherwise the processor and the system work fine once they are taken out of the radiation. But this one time when the FS was corrupted I had to re-install the full root file system, as it had corrupted the bin directory itself. But I have backed up the data (using the dd command) to find out what exactly happened. And it looks like the FS is corrupted such that many of the fields are corrupted (including size, file type, owner etc). Thanks again! Sincerely, Rajaraman On Thu, 26 Oct 2006, Christian Kujau wrote: > On Wed, 25 Oct 2006, Rajaraman Ramanarayanan wrote: >> I am doing some testing on a PXA270 based processor (on a single board >> computer) which makes the processor vulnerable to bit flips. One >> such bit flip seems to have corrupted the file system. > > I don't know these PXA270 processors, but your comment reads as if the > processor is "prone to bit-flips by design", which I can't believe... so, I > guess the CPU broke somehow, was overheated or something? > > If so, that's like having faulty memory or faulty data paths in general (bus > errors, bad cabling, too-hot processors, etc.). All kinds of errors can be > caused by this, and the fs can't do much about it, because the code in the > fs-driver (any fs) isn't executed in the way it is meant to be. > >> segrith.cse.psu.edu 66% du -khs bin >> 426G bin >> segrith.cse.psu.edu 67% ll >> total 446404348 >> cr-Sr-S--- 8240 959265076 876099129 32, 50 Oct 2 1997 bin > > So, the system thinks /bin is a 426 GB character device on a 4GB filesystem?
> > You could run a recent version of e2fsck and see what can be repaired, but I'd > suggest getting a stable hardware platform and playing back your backups :( > > Christian. > -- > BOFH excuse #54: > > Evil dogs hypnotised the night shift > From lists at nerdbynature.de Thu Oct 26 16:28:54 2006 From: lists at nerdbynature.de (Christian Kujau) Date: Thu, 26 Oct 2006 17:28:54 +0100 (BST) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Thu, 26 Oct 2006, Rajaraman Ramanarayanan wrote: > I am actually exposing the processor to neutron > radiation, which makes it vulnerable. Otherwise the processor and the system > work fine once they are taken out of the radiation. ROFL, this really is the best setup I've read about on ext3-users ;) > But this one time when the > FS was corrupted I had to re-install the full root file system as it had > corrupted the bin directory itself. But I have backed up the data (using the dd > command) to find out what exactly happened. So, if this were reproducible, one could activate the in-kernel debug flags, or more specifically JBD_DEBUG, or even try kdb[0] to see what's going on. Oh, and if we can see corruption patterns while the system is exposed to your special environment, I'd love to test the patch introducing CONFIG_EXT3_NEUTRON ;) Christian. [0] ftp://oss.sgi.com/www/projects/kdb/download/latest/ -- BOFH excuse #113: Root nameservers are out of sync From ramanara at cse.psu.edu Thu Oct 26 18:35:52 2006 From: ramanara at cse.psu.edu (Rajaraman Ramanarayanan) Date: Thu, 26 Oct 2006 14:35:52 -0400 (EDT) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Thu, 26 Oct 2006, Christian Kujau wrote: > > ROFL, this really is the best setup I've read about on ext3-users ;) > Thanks! ;) That's what my research is about: to test the effect of neutron-induced errors on memories, processors etc.
> So, if this were reproducible, one could activate the in-kernel debug > flags, or more specifically JBD_DEBUG, or even try kdb[0] to see what's going > on. Oh, and if we can see corruption patterns while > the system is exposed to your special environment, I'd love to test the patch > introducing CONFIG_EXT3_NEUTRON ;) > I have seen this only once, so as of now it is not reproducible, and I definitely cannot predict if and when it can occur. Also, I am not familiar with activating debug flags. Is there any document that I can refer to for these, or is it something I have to figure out myself? Thanks! Rajaraman From thomas_chris_666 at yahoo.co.in Fri Oct 27 05:09:50 2006 From: thomas_chris_666 at yahoo.co.in (Thomas chris) Date: Fri, 27 Oct 2006 06:09:50 +0100 (BST) Subject: Test Message-ID: <20061027050951.32461.qmail@web7704.mail.in.yahoo.com> This is a test -- Thomas Chris http://www.youbanking.com http://www.youbanking.com/email_page.html From lists at nerdbynature.de Fri Oct 27 16:24:04 2006 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 27 Oct 2006 17:24:04 +0100 (BST) Subject: FS corruption? bogus i_mode In-Reply-To: References: Message-ID: On Thu, 26 Oct 2006, Rajaraman Ramanarayanan wrote: > I have seen this only once, so as of now it is not reproducible, and I > definitely cannot predict if and when it can occur. Also, I am not familiar > with activating debug flags. Is there any document that I can refer to for > these..
I'm not a filesystem wizard and use debug flags only when things go wrong, and this question should really be answered by the e2fs crew, but for starters: when configuring your kernel (make menuconfig?), enabling "JBD (ext3) debugging support" (under "File systems") should make the ext3 fs-driver more verbose, especially when something goes wrong. Then there are the numerous "kernel debugging" options (under "Kernel hacking")... but I find it hard to propose a specific option here, because we don't know which part of the kernel would generate certain errors when exposed to the radiation. In general, these options make the various code paths more verbose. But I doubt that anything apart from this (being more chatty when something goes wrong) will actually help to debug, let alone code workarounds for, hardware-going-crazy-under-certain-conditions. But then again, the satellites in space have lots of chips inside too and are exposed to radiation as well... hm, dunno how this is done. Christian. -- BOFH excuse #253: We've run out of licenses From magnusm at massive.se Fri Oct 13 12:14:09 2006 From: magnusm at massive.se (Magnus Månsson) Date: Fri, 13 Oct 2006 12:14:09 -0000 Subject: e2defrag - Unable to allocate buffer for inode priorities Message-ID: Hi, first of all, apologies if this isn't the right mailing list, but it was the best I could find. If you know a better mailing list, please tell me. Today I tried to defrag one of my filesystems. It's a 3.5T filesystem that has 6 software RAIDs at the bottom, merged together using LVM. I was running ext3 but removed the journal flag with thor:~# tune2fs -O ^has_journal /dev/vgraid/data After that I fsck'd just to be sure I wouldn't meet any unexpected problems.
So now it was time to defrag; I used this command: thor:~# e2defrag -r /dev/vgraid/data After about 15 seconds (after it ate all my 1.5G of RAM) I got this answer: e2defrag (/dev/vgraid/data): Unable to allocate buffer for inode priorities I am using Debian unstable, and here is the version information from e2defrag: thor:~# e2defrag -V e2defrag 0.73pjm1 RCS version $Id: defrag.c,v 1.4 1997/08/17 14:23:57 linux Exp $ I also tried -p 256, -p 128 and -p 64 to see if it used less memory then; it didn't seem like that to me, it took the same time for the program to abort. Is there any way to get around this problem? The answer might be to get 10G of RAM, but that's not very realistic; 2G sure, but I think that's the limit of my motherboard. A huge number of swapfiles might solve it, and that's probably doable, but it will be enormously slow, I guess? Why do I want to defrag? Well, fsck gives this nice info to me: /dev/vgraid/data: 227652/475987968 files (41.2% non-contiguous), 847539147/951975936 blocks 41% sounds like a lot in my ears, and I have a constant read of files on the drives; it's too slow already. Very thankful for ideas or others' experiences; maybe it's just not possible with such a large partition with today's tools, hey, ext[23] only supports 4T. Let's hope ext4 comes within a year in the mainstream kernels. PS! Please CC me since I am not on the list, so I don't have to wait for marc's archive to get the mails. -- Magnus Månsson Systems administrator Massive Entertainment AB Malmö, Sweden Office: +46-40-6001000 From magnusm at massive.se Fri Oct 13 14:44:04 2006 From: magnusm at massive.se (Magnus Månsson) Date: Fri, 13 Oct 2006 14:44:04 -0000 Subject: FW: e2defrag - Unable to allocate buffer for inode priorities Message-ID: I have made some more research and found out the following ..
thor:~# df -i Filesystem Inodes IUsed IFree IUse% Mounted on -[cut]- /dev/mapper/vgraid-data 475987968 227652 475760316 1% /data thor:~# strace e2defrag -r /dev/vgraid/data -[cut]- mmap2(NULL, 1903955968, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x46512000 (delay 15 seconds while allocating memory) mmap2(NULL, 475992064, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) -[cut]- The first allocation seems to be 4 bytes per available inode on my filesystem. I wish now that I had created the FS with fewer inodes, and there is another question: what's the gain of having fewer available inodes? If I recreated my filesystem, would it be an idea to make one inode per hundred blocks or something, since that is still way more than I need? Would I gain speed from it? From magnusm at massive.se Fri Oct 13 16:55:13 2006 From: magnusm at massive.se (Magnus Månsson) Date: Fri, 13 Oct 2006 16:55:13 -0000 Subject: FW: e2defrag - Unable to allocate buffer for inode priorities Message-ID: I have now upgraded my server from 1.5G of RAM to 4G of RAM.
It gets a bit longer; it now looks like this with strace: mmap2(NULL, 1903955968, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x464a7000 (15 second delay) mmap2(NULL, 475992064, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x29eb6000 (this I didn't have memory enough for before) mmap2(NULL, 1903955968, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) (here it wants another 2G of RAM; sorry, I don't have 2G modules ..) So if no one has any idea, I am stuck until I can find 4 pieces of 2G DDR400 modules. :( -- Magnus Månsson
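The 4-bytes-per-inode guess matches the strace numbers almost exactly; a quick check, using the inode count from the `df -i` output above:

```shell
# e2defrag appears to allocate one 32-bit priority entry per inode:
inodes=475987968                 # total inodes reported by "df -i"
bytes=$((inodes * 4))
echo "$bytes"                    # prints 1903951872
echo $((1903955968 - bytes))     # prints 4096: the mmap2 size is exactly one 4 KiB page larger
```

So the table scales with the inode *count*, not with inodes in use. A filesystem created with a larger bytes-per-inode ratio (e.g. `mke2fs -j -i 65536`, an illustrative value, against the 8 KiB-ish default) would shrink that table proportionally, at the cost of capping how many files can ever be created.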
From adilger at clusterfs.com Tue Oct 31 17:10:50 2006 From: adilger at clusterfs.com (Andreas Dilger) Date: Wed, 1 Nov 2006 01:10:50 +0800 Subject: e2defrag - Unable to allocate buffer for inode priorities In-Reply-To: References: Message-ID: <20061031171050.GG5655@schatzie.adilger.int> On Oct 13, 2006 14:13 +0200, Magnus Månsson wrote: > Today I tried to defrag one of my filesystems. It's a 3.5T large > filesystem that has 6 software-raids in the bottom and then merged > together using lvm. I was running ext3 but removed the journal flag with > Why do I want to defrag? Well, fsck gives this nice info to me: > /dev/vgraid/data: 227652/475987968 files (41.2% non-contiguous), 847539147/951975936 blocks > > 41% sounds like a lot in my ears and I am having a constant read of files > on the drives, it's to slow already. The 41% isn't necessarily bad if the files are very large. For large files it is inevitable that there will be some fragmentation after 125MB or so. A bigger problem is if the filesystem is constantly very nearly full, or if your applications append a lot (e.g. a mailspool). > So now it was time to defrag, I used this command: > thor:~# e2defrag -r /dev/vgraid/data This program is dangerous to use and any attempts to use it should be stopped. It hasn't been updated in such a long time that it doesn't even KNOW that it is dangerous (i.e. it doesn't check the filesystem version number or feature flags). What I would suggest in the meantime is to make as much free space in the filesystem as you can, find files that are very fragmented (via the filefrag program), then copy these files to a new temp file and rename it over the old file. It should help for files that are very fragmented. There is also a discussion about implementing online defragmentation, but that is still a ways away. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
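A sketch of that copy-and-rename approach; the path and the 50-extent threshold are invented for illustration, and nothing else should be writing to the files while it runs:

```shell
# Rewrite heavily fragmented files in place by copying and renaming.
# /data and the threshold of 50 extents are placeholders.
threshold=50
find /data -type f -print | while read -r f; do
    # filefrag prints e.g. "/data/somefile: 87 extents found";
    # field 2 is the extent count
    extents=$(filefrag "$f" | awk '{print $2}')
    [ "$extents" -gt "$threshold" ] 2>/dev/null || continue
    # The fresh copy gets newly allocated (hopefully contiguous) blocks
    cp -p "$f" "$f.defrag.tmp" && mv "$f.defrag.tmp" "$f"
done
```

This only helps if enough contiguous free space exists for the allocator to place the copy well, which is why Andreas suggests freeing space first.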
From tytso at mit.edu Tue Oct 31 19:29:48 2006 From: tytso at mit.edu (Theodore Tso) Date: Tue, 31 Oct 2006 14:29:48 -0500 Subject: e2defrag - Unable to allocate buffer for inode priorities In-Reply-To: <20061031171050.GG5655@schatzie.adilger.int> References: <20061031171050.GG5655@schatzie.adilger.int> Message-ID: <20061031192947.GA12277@thunk.org> Package: defrag Version: 0.73pjm1-8 Severity: grave On Wed, Nov 01, 2006 at 01:10:50AM +0800, Andreas Dilger wrote: > > So now it was time to defrag, I used this command: > > thor:~# e2defrag -r /dev/vgraid/data > > This program is dangerous to use and any attempts to use it should be > stopped. It hasn't been updated in such a long time that it doesn't > even KNOW that it is dangerous (i.e. it doesn't check the filesystem > version number or feature flags). In fact we need to create a Debian bug report indicating that this package should *NOT* be included when the Debian etch distribution releases. Goswin, I am setting the severity to grave (a release-critical severity) because defrag right now is almost guaranteed to corrupt the filesystem if used with modern ext3 filesystems, leading to data loss, and this satisfies the definition of grave. I believe the correct answer is either (a) to make defrag refuse to run if any filesystem features are enabled (at the very least resize_inode, but some of the other newer ext3 filesystem features make me nervous with respect to e2defrag), or (b), since (a) would make e2defrag mostly useless, especially since filesystems with resize inodes are created by default in etch, and as far as I know upstream abandoned defrag a long time ago, that we should simply remove e2defrag from etch and probably from Debian altogether.
If you are interested in doing a huge amount of auditing and testing of e2defrag with modern ext3 (and soon ext4) filesystems, that's great, but I suspect that will not at all be trivial, and even making sure e2defrag won't scramble users' data probably can't be achieved before etch releases. Regards, - Ted From brederlo at informatik.uni-tuebingen.de Tue Oct 31 21:44:03 2006 From: brederlo at informatik.uni-tuebingen.de (Goswin von Brederlow) Date: Tue, 31 Oct 2006 22:44:03 +0100 Subject: e2defrag - Unable to allocate buffer for inode priorities In-Reply-To: <20061031192947.GA12277@thunk.org> (Theodore Tso's message of "Tue, 31 Oct 2006 14:29:48 -0500") References: <20061031171050.GG5655@schatzie.adilger.int> <20061031192947.GA12277@thunk.org> Message-ID: <87iri0ma8s.fsf@informatik.uni-tuebingen.de> Theodore Tso writes: > Package: defrag > Version: 0.73pjm1-8 > Severity: grave > > On Wed, Nov 01, 2006 at 01:10:50AM +0800, Andreas Dilger wrote: >> > So now it was time to defrag, I used this command: >> > thor:~# e2defrag -r /dev/vgraid/data >> >> This program is dangerous to use and any attempts to use it should be >> stopped. It hasn't been updated in such a long time that it doesn't >> even KNOW that it is dangerous (i.e. it doesn't check the filesystem >> version number or feature flags). It should be doing that (checking for ext3, I can confirm) as of defrag (0.73pjm1-8) unstable; urgency=low * ext3-notwork.dpatch: reverse testcase (Closes: #310800) It doesn't handle ext3 right and knows it: # mke2fs -j /dev/ram0 # e2defrag -r /dev/ram0 e2defrag (/dev/ram0): ext3 filesystems not (yet) supported It happily defrags a filesystem with resize_inode, though. Is it destroying the resize capability or directly destroying data? > In fact we need to create a Debian bug report indicating that this > package should *NOT* be included when the Debian etch distribution > releases. Yes, please do so, and preferably with a script to reproduce this without resorting to a big image file.
Something in the form of "mke2fs; mount; unpack kernel source; umount; defrag; mount fails" would be perfect. (Well, not for defrag, but to debug it. :) > Goswin, I am setting the severity to grave (a release-critical You should have used debbugs-CC so I get to see the bug number directly and can reply to the bug. :) > severity) because defrag right now is almost guaranteed to corrupt the > filesystem if used with modern ext3 filesystems leading to data loss, > and this satisfies the definition of grave. I believe the correct > answer is either to (a) make defrag refuse to run if any filesystem > features are enabled (at the very least, resize_inode, but some of the > other newer ext3 filesystem features make me nervous with respect to > e2defrag), or (b) since (a) would make e2defrag mostly useless > especially since filesystems with resize inodes are created by default > in etch, and as far as I know upstream abandoned defrag a long time > ago, that we should simply remove e2defrag from etch and probably from > Debian altogether. > > If you are interested in doing a huge amount of auditing and testing > of e2defrag with modern ext3 (and soon ext4) filesystems, that's > great, but I suspect that will not at all be trivial, and even making > sure e2defrag won't scramble users' data probably can't be achievable > before etch releases. There is '#235498: defrag: ext3 support would be nice :-)' for this issue, but I need some serious help there to add all the new features. Preferably a new active upstream. Maybe some people working on ext4 would be willing to help? But that won't happen before etch, I'm certain of that. I'm also confident that I can patch in checks to keep e2defrag from running on a filesystem with incompatible features (like has_journal from ext3). Checking those is just an extension of the ext3 check. But people that still have ext2, or can disable the extra features (e.g. delete the journal, e2defrag, re-create the journal), can still use e2defrag.
I would prefer keeping it in. > Regards, > > - Ted MfG Goswin
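The requested reproduction recipe (mke2fs, mount, unpack, umount, defrag, mount fails) might be sketched like this; the image size, paths, and loop-device use are all assumptions, it needs root, and the final mount is *expected* to fail if e2defrag corrupts the filesystem:

```shell
#!/bin/sh
# Hypothetical reproduction script for the e2defrag/resize_inode bug.
set -e
dd if=/dev/zero of=/tmp/defragtest.img bs=1M count=64
mke2fs -q -F -j -O resize_inode /tmp/defragtest.img
mkdir -p /mnt/defragtest
mount -o loop /tmp/defragtest.img /mnt/defragtest
cp -r /usr/src/linux /mnt/defragtest/ || true   # any tree of small files will do
umount /mnt/defragtest
tune2fs -O ^has_journal /tmp/defragtest.img     # e2defrag refuses ext3 outright
e2fsck -fy /tmp/defragtest.img
e2defrag -r /tmp/defragtest.img
mount -o loop /tmp/defragtest.img /mnt/defragtest   # does it still mount cleanly?
```

A run of e2fsck after the defrag step would show directly whether the resize inode or real data got scrambled.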