From magawake at gmail.com Mon Sep 1 17:18:31 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 13:18:31 -0400 Subject: dynamic inode allocation Message-ID: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> This may be a newbie question, but how come other file systems such as ReiserFS and Veritas' VxFS dynamically allocate inodes, while for filesystems such as ext2/ext3 and JFS we need to allocate them when creating the filesystem? Is there a performance or maintenance gain when pre-allocating? TIA From tytso at mit.edu Mon Sep 1 18:37:44 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 1 Sep 2008 14:37:44 -0400 Subject: dynamic inode allocation In-Reply-To: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> Message-ID: <20080901183744.GD13069@mit.edu> On Mon, Sep 01, 2008 at 01:18:31PM -0400, Mag Gam wrote: > This maybe a newbie question but how come other file systems such as > ReiserFS and Veritas' Vxfs dynamically allocate inodes and filesystems > such as ext2/ext3 and JFS we need to allocate them when creating the > filesystem? Is there a performance or maintenance gain when pre > allocating? Having a static inode table is definitely much simpler than a dynamic inode table, and that's why ext2 originally used a static inode allocation system. Ext2 drew much of its initial design inspiration from the BSD Fast Filesystem, and it (along with most traditional Unix filesystems) used a static inode table. One of the advantages of having a static inode table is that you can always reliably find it. With a dynamic inode table, it can often be much more difficult to find it in the face of filesystem corruption, caused by either hardware or software failure. For example, with Reiserfs, the inodes are stored in a B-Tree. If the root node, or a relatively high-level node of the B-tree, is lost, the only way to recover all of the inodes is by looking at each block, and trying to determine if it "looks" like part of the filesystem B-tree or not. This is what reiserfs's fsck program will do if the filesystem is sufficiently damaged. Unfortunately, this means that if you store a reiserfs filesystem image (for example, for use by vmware, or qemu, or kvm, or xen) in a reiserfs filesystem, and the filesystem gets damaged, the recovery procedure will take every single block that looks like it could have been part of a Reiserfs B-tree, and stitch them together into a new B-tree. The result, if you have Reiserfs filesystem images, is that those blocks will get treated as if they were part of the containing filesystem, and the outcome is not pretty. These problems can be solved (although they were not for Reiserfs), but it means a lot more complexity. - Ted From magawake at gmail.com Mon Sep 1 20:29:06 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 16:29:06 -0400 Subject: dynamic inode allocation In-Reply-To: <20080901183744.GD13069@mit.edu> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> Message-ID: <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> On Mon, Sep 1, 2008 at 2:37 PM, Theodore Tso wrote: > On Mon, Sep 01, 2008 at 01:18:31PM -0400, Mag Gam wrote: >> This maybe a newbie question but how come other file systems such as >> ReiserFS and Veritas' Vxfs dynamically allocate inodes and filesystems >> such as ext2/ext3 and JFS we need to allocate them when creating the >> filesystem?
Is there a performance or maintenance gain when pre >> allocating? > > Having a static inode table is definitely much simpler than a dynamic > inode table, and that's why ext2 originally used a static inode > allocation system. Ext2 drew much of its initial design inspiration > from the BSD Fast Filesystem, and it (along with most traditional Unix > filesystems) used a static inode table. > > One of the advantages of having a static inode table is you can always > reliably find it. With a dynamic inode table, it can often be much > more difficult to find it in the face of filesystem corruption, caused > by either hardware or software failure. For example, with Reiserfs, > the inodes are stored in a B-Tree. If the root node, or a relatively > high-level node of the B-tree is lost, the only way to recover all of > the inodes is by looking at each block, and trying to determine if it > "looks" like part of the filesystem B-tree or not. This is what the > reiserfs's fsck program will do if the filesystem is sufficiently > damaged. Unfortuntaely, this means that if you store reiserfs > filesystem image (for example, for use by vmware, or qemu, or kvm, or > xen) in a reiserfs filesystem, and the filesystem gets damaged, the > recovery procedure will take every single block that looks like it > could have been part Reiserfs B-tree, and stich them together into a > new-btree. The result, if you have Reiserfs filesystem images is > those blocks will get treated as if they were part of the containing > filesystem, and the result is not pretty. > > These problems can be solved (although they were not for Reiserfs), > but it means a lot more complexity. > > - Ted > Ted, Thanks for the explanation and dumb-ing it down for me :-) So, if a reiserFs filesystem is damaged and it naturally do a fsck. The fsck basically recreated the b-tree by scanning from 1 to end of the filesystem? From tytso at mit.edu Mon Sep 1 20:39:13 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 1 Sep 2008 16:39:13 -0400 Subject: dynamic inode allocation In-Reply-To: <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> Message-ID: <20080901203913.GF13069@mit.edu> On Mon, Sep 01, 2008 at 04:29:06PM -0400, Mag Gam wrote: > > So, if a reiserFs filesystem is damaged and it naturally do a fsck. > The fsck basically recreated the b-tree by scanning from 1 to end of > the filesystem? If the filesystem is sufficiently damaged such that portions of the b-tree can't be found, then yes. Otherwise, the data would be totally lost. As you can imagine, scaning every single block on the disk to see if it looks like filesystem metadata is quite slow, so naturally the reiserfs's fsck will avoid doing it if at all possible. But if the root or top-level nodes of the B-tree is damaged, it doesn't have much choice. 
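To make the contrast concrete: with ext2/ext3's static table, an inode's on-disk location follows from a couple of superblock fields, so fsck never has to go searching for it. A minimal sketch of the arithmetic, with made-up numbers standing in for what "dumpe2fs -h" would actually report:

    # sketch only: where does inode 635113 live in a static inode table?
    ino=635113
    inodes_per_group=16384                    # "Inodes per group" from dumpe2fs -h
    echo $(( (ino - 1) / inodes_per_group ))  # block group whose inode table holds it
    echo $(( (ino - 1) % inodes_per_group ))  # slot within that group's table

Nothing comparable exists for a filesystem whose inodes live in a floating B-tree, which is exactly why the recovery path described above has to fall back to scanning every block.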
- Ted From magawake at gmail.com Mon Sep 1 21:16:01 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 17:16:01 -0400 Subject: dynamic inode allocation In-Reply-To: <20080901203913.GF13069@mit.edu> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> <20080901203913.GF13069@mit.edu> Message-ID: <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> On Mon, Sep 1, 2008 at 4:39 PM, Theodore Tso wrote: > On Mon, Sep 01, 2008 at 04:29:06PM -0400, Mag Gam wrote: >> >> So, if a reiserFs filesystem is damaged and it naturally do a fsck. >> The fsck basically recreated the b-tree by scanning from 1 to end of >> the filesystem? > > If the filesystem is sufficiently damaged such that portions of the > b-tree can't be found, then yes. Otherwise, the data would be totally > lost. As you can imagine, scaning every single block on the disk to > see if it looks like filesystem metadata is quite slow, so naturally > the reiserfs's fsck will avoid doing it if at all possible. But if > the root or top-level nodes of the B-tree is damaged, it doesn't have > much choice. > > - Ted > > But, if thats the last and worst case scenario why don't they do the full scan? Sure its going to take a long time if its a big filesystem (there should be no changes since it would be unmounted), but its better than not having any data at all... From tytso at mit.edu Mon Sep 1 21:23:04 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 1 Sep 2008 17:23:04 -0400 Subject: dynamic inode allocation In-Reply-To: <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> <20080901203913.GF13069@mit.edu> <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> Message-ID: <20080901212304.GI13069@mit.edu> On Mon, Sep 01, 2008 at 05:16:01PM -0400, Mag Gam wrote: > > If the filesystem is sufficiently damaged such that portions of the > > b-tree can't be found, then yes. Otherwise, the data would be totally > > lost. As you can imagine, scaning every single block on the disk to > > see if it looks like filesystem metadata is quite slow, so naturally > > the reiserfs's fsck will avoid doing it if at all possible. But if > > the root or top-level nodes of the B-tree is damaged, it doesn't have > > much choice. > > > > But, if thats the last and worst case scenario why don't they do the > full scan? Sure its going to take a long time if its a big filesystem > (there should be no changes since it would be unmounted), but its > better than not having any data at all... As I said, in the worst case, it will do a full scan. But (a) it takes a long time, and (b) if the filesystem has any files that contain images of reiserfs filesystem, it will be totally scrambled. So it makes sense that the reiserfs fsck would try to avoid this if it can (i.e., if the b-tree is only mildly corrupted). With that said, this is really going out of scope of this mailing list. And I am not an expert on reiserfs's filesystem checker, although I have had people confirm to me that indeed, you can lose really big if your reiserfs filesystem contains files that have are images of other reiserfs filesystems for things like Virtualization. This problem is apparently solved in reiser4, it is NOT solved in reiserfs (i.e., version 3). 
As far as I am concerned, that's ample reason not to use reiserfs, but obviously I'm biased. :-) - Ted From magawake at gmail.com Mon Sep 1 21:47:26 2008 From: magawake at gmail.com (Mag Gam) Date: Mon, 1 Sep 2008 17:47:26 -0400 Subject: dynamic inode allocation In-Reply-To: <20080901212304.GI13069@mit.edu> References: <1cbd6f830809011018we95f119p55f697a2111c9e1a@mail.gmail.com> <20080901183744.GD13069@mit.edu> <1cbd6f830809011329mdc2a3e3v763a70a18d7dc383@mail.gmail.com> <20080901203913.GF13069@mit.edu> <1cbd6f830809011416t5edffaa3p7e98b0324f3a13ac@mail.gmail.com> <20080901212304.GI13069@mit.edu> Message-ID: <1cbd6f830809011447j7d48467cmb732ce4b5b1082b9@mail.gmail.com> Thanks! This has satisfied my curiosity (for now...) On Mon, Sep 1, 2008 at 5:23 PM, Theodore Tso wrote: > On Mon, Sep 01, 2008 at 05:16:01PM -0400, Mag Gam wrote: >> > If the filesystem is sufficiently damaged such that portions of the >> > b-tree can't be found, then yes. Otherwise, the data would be totally >> > lost. As you can imagine, scaning every single block on the disk to >> > see if it looks like filesystem metadata is quite slow, so naturally >> > the reiserfs's fsck will avoid doing it if at all possible. But if >> > the root or top-level nodes of the B-tree is damaged, it doesn't have >> > much choice. >> > >> >> But, if thats the last and worst case scenario why don't they do the >> full scan? Sure its going to take a long time if its a big filesystem >> (there should be no changes since it would be unmounted), but its >> better than not having any data at all... > > As I said, in the worst case, it will do a full scan. But (a) it > takes a long time, and (b) if the filesystem has any files that > contain images of reiserfs filesystem, it will be totally scrambled. > So it makes sense that the reiserfs fsck would try to avoid this if it > can (i.e., if the b-tree is only mildly corrupted). > > With that said, this is really going out of scope of this mailing > list. And I am not an expert on reiserfs's filesystem checker, > although I have had people confirm to me that indeed, you can lose > really big if your reiserfs filesystem contains files that have are > images of other reiserfs filesystems for things like Virtualization. > This problem is apparently solved in reiser4, it is NOT solved in > reiserfs (i.e., version 3). As far as I am concerned, that's ample > reason not to use reiserfs, but obviously I'm basied. :-) > > - Ted > > > From thorsten.henrici at gfd.de Tue Sep 2 20:03:36 2008 From: thorsten.henrici at gfd.de (thorsten.henrici at gfd.de) Date: Tue, 2 Sep 2008 22:03:36 +0200 Subject: Thorsten Henrici is out of the office. Message-ID: I will not be in the office from 27.08.2008 and will return on 22.09.2008. I will answer your message after my return. In urgent cases please contact Mr. Stöver. I'm out of the office until the 22nd of September. In urgent cases please contact Mr. Karl-Heinz Stöver. -- IMPORTANT NOTICE: This email is confidential, may be legally privileged, and is for the intended recipient only. Access, disclosure, copying, distribution, or reliance on any of it by anyone else is prohibited and may be a criminal offence. Please delete if obtained in error and email confirmation to the sender.
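Since the original question in this thread was about allocating inodes at filesystem creation time, it is worth spelling out how that is done on ext2/ext3: the inode count is fixed by mke2fs and cannot be grown afterwards, so the knobs are the bytes-per-inode ratio and the explicit inode count. The invocations below are only illustrative; the device name and numbers are placeholders:

    # one inode per 16 KB of space (the default ratio comes from /etc/mke2fs.conf)
    mke2fs -j -i 16384 /dev/sdb1
    # or ask for an explicit number of inodes
    mke2fs -j -N 4000000 /dev/sdb1
    # -T selects a canned usage profile (e.g. -T largefile) with its own ratio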
From tytso at mit.edu Wed Sep 3 13:45:36 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Sep 2008 09:45:36 -0400 Subject: spd_readdir.c and readdir_r [real new version] In-Reply-To: <1213587981.8578.189.camel@corn.betterworld.us> References: <1212903039.7158.31.camel@corn.betterworld.us> <1212985588.32113.13.camel@corn.betterworld.us> <1213587981.8578.189.camel@corn.betterworld.us> Message-ID: <20080903134536.GD8360@mit.edu> Hey Ross, Sorry for not responding early; I was travelling a lot over the summer, and I never got around to responding to your e-mail. Many thanks for adding support for readdir_r and readdir64_r! As it turns out, I was doing some updates to spd_readdir.c to support fdopendir (which rm uses). Also, it looks like you based your changes off of an older version of spd_readdir.c that didn't support the dirfd() call. I probably will try to package this up into its own package, since I suspect it would be useful to a larger set of people. In any case here's the merged version I have. Please let me know if this works for you, and if you have any other suggested improvements! - Ted -------------- next part -------------- A non-text attachment was scrubbed... Name: spd_readdir.c Type: text/x-csrc Size: 10396 bytes Desc: not available URL: From tytso at mit.edu Wed Sep 3 16:09:52 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 3 Sep 2008 12:09:52 -0400 Subject: Problem in HTREE directory node In-Reply-To: <1219689606.12088.50.camel@corn.betterworld.us> References: <1219689606.12088.50.camel@corn.betterworld.us> Message-ID: <20080903160952.GE8360@mit.edu> On Mon, Aug 25, 2008 at 11:40:06AM -0700, Ross Boylan wrote: > Short version: > > fsck said > "invalid HTREE directory inode 635113 > (mail/r/user/ross/comp/admin-wheat) clear HTREE index?" To which I > replied Yes. > > What exactly does this mean was corrupted? In particular, does it mean > the list of files in the directory .../comp/admin-wheat was damaged? Or > is the trouble in the comp directory? > > Is fsck likely to have fixed up things as good as new, or might > something be lost or corrupted? I don't know what clearing the HTREE > index does. That just means that the interior nodes in the HTREE were corrupt. If you give permission to clear the htree index, e2fsck put the inode on the list of directories that need to have their HTREE indexes rebuilt, and a "Pass 3A" will rebuild the directory's (or directories') HTREE indexes. This is similar to what "e2fsck -fD" does, except it only rebuilds directories whose HTREE indexes were corrupted, instead of rebuilding and optimize all of the directories in the system. So if that was the only message you received, and there were no other reports of damage to the directory, you wouldn't have lost any directory names. It's in all likelihood "good as new". Regards, - Ted From l.allegrucci at gmail.com Mon Sep 8 19:27:32 2008 From: l.allegrucci at gmail.com (Lorenzo Allegrucci) Date: Mon, 8 Sep 2008 21:27:32 +0200 Subject: tune2fs Message-ID: <4dcf7d360809081227y3a536642saca35f4ecad3f2b3@mail.gmail.com> Hi all, I was wondering if it's safe to run tune2fs with the -c or -i option on a rw mounted filesystem. Should I remount read only first? My man page doesn't mention it. 
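For reference, the options being asked about are tune2fs's maximal-mount-count (-c) and check-interval (-i) settings; a typical invocation, with a placeholder device name, looks like this:

    # check after 50 mounts or 6 months, whichever comes first
    tune2fs -c 50 -i 6m /dev/sda1
    # confirm the new values
    tune2fs -l /dev/sda1 | grep -Ei 'mount count|check'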
Thanks -- Lorenzo From tytso at MIT.EDU Mon Sep 8 21:02:25 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Mon, 8 Sep 2008 17:02:25 -0400 Subject: tune2fs In-Reply-To: <4dcf7d360809081227y3a536642saca35f4ecad3f2b3@mail.gmail.com> References: <4dcf7d360809081227y3a536642saca35f4ecad3f2b3@mail.gmail.com> Message-ID: <20080908210225.GM8161@mit.edu> On Mon, Sep 08, 2008 at 09:27:32PM +0200, Lorenzo Allegrucci wrote: > Hi all, I was wondering if it's safe to run tune2fs with the -c or -i option > on a rw mounted filesystem. > Should I remount read only first? My man page doesn't mention it. It is safe to use tune2fs on an rw-mounted filesystem; tune2fs is very careful about how it modifies the superblock in order to make it safe. - Ted From tobi at oetiker.ch Wed Sep 10 11:30:45 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Wed, 10 Sep 2008 13:30:45 +0200 (CEST) Subject: journal on an ssd Message-ID: Experts, What happens if the disk hosting an external journal of a filesystem running with data=journal goes bust? The Backstory ... I have been battling with filesystem performance for some time now. Our setup is a HW Raid(6) with LVM on top and ext3 filesystems. Recently we added an SSD to our setup and have moved all the journals to this ssd. This has dramatically improved performance and especially reduced the interdependence between performance of different partitions hosted on the same RAID. http://insights.oetiker.ch/linux/external-journal-on-ssd.html I really like the performance of this new setup, but I am not all that sure about the data security aspects of it. Especially after reading http://www.cs.wisc.edu/adsl/Publications/sfa-dsn05.pdf which suggests that damaged journals are the worst that can happen to ext3. Any insights on this? cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From holger at wizards.de Wed Sep 10 13:41:45 2008 From: holger at wizards.de (Holger Hoffstaette) Date: Wed, 10 Sep 2008 15:41:45 +0200 Subject: journal on an ssd References: Message-ID: On Wed, 10 Sep 2008 13:30:45 +0200, Tobias Oetiker wrote: > What happens if the disk hosting an external journal of a filesytem > running with data=journal goes bust. Probably the same as if the journal was on the same disk, going bust. :-) Or rather :-( as this can indeed get pretty ugly. With ext3 you can always fall back to mounting as ext2 and at least try to recover as much as possible. > Recently we added an SSD to our setup and have moved all the journals to > this ssd. This has dramatically improved performance and especially > reduced the interdependence between performance of different partitions > hosted on the same RAID. That is one of the great SSD uses, yes. > http://insights.oetiker.ch/linux/external-journal-on-ssd.html Very interesting, thanks! I was planning to do the same but was waiting for the Intel SSDs to come to market or the large OCZs to come down in price, whichever happened first.. > I realy like the performance of this new setup, but I am not all that sure > about the data security aspects of it. Especially after reading > > http://www.cs.wisc.edu/adsl/Publications/sfa-dsn05.pdf > > which suggests that damaged journals are the worst that can happen to > ext3.
True, a borked journal is bad but with the SSD you should actually have *less* chance of corruption (of the type mentioned in the paper), since the wear-leveling should keep the journal blocks alive without the file system/block layer noticing. At least in theory.. :-D You may also find this interesting: http://labs.google.com/papers/disk_failures.html Holger From sandeen at redhat.com Wed Sep 10 15:27:25 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 10 Sep 2008 10:27:25 -0500 Subject: journal on an ssd In-Reply-To: References: Message-ID: <48C7E75D.8040909@redhat.com> Tobias Oetiker wrote: > Experts, > > What happens if the disk hosting an external journal of a filesytem > running with data=journal goes bust. > > The Backstory ... > > I have been batteling with filesystem performance for some time > now. Our setup is a HW Raid(6) with LVM on top and ext3 filesytems. > > Recently we added an SSD to our setup and have moved all the journals > to this ssd. This has dramatically improved performance and > especially reduced the interdependence between performance of > different partitions hosted on the same RAID. > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html How does this compare to putting journals on a separate non-ssd device? -Eric From tobi at oetiker.ch Wed Sep 10 16:05:00 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Wed, 10 Sep 2008 18:05:00 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <48C7E75D.8040909@redhat.com> References: <48C7E75D.8040909@redhat.com> Message-ID: Hi Eric, I have not tested this, but since we are putting about 16 different journals on this one ssd, I would assume that the loss through seeking between the journals would be pretty bad, and again bring back that inter-filesystem-dependency we were trying to loose with this measure. cheers tobi Today Eric Sandeen wrote: > Tobias Oetiker wrote: > > Experts, > > > > What happens if the disk hosting an external journal of a filesytem > > running with data=journal goes bust. > > > > The Backstory ... > > > > I have been batteling with filesystem performance for some time > > now. Our setup is a HW Raid(6) with LVM on top and ext3 filesytems. > > > > Recently we added an SSD to our setup and have moved all the journals > > to this ssd. This has dramatically improved performance and > > especially reduced the interdependence between performance of > > different partitions hosted on the same RAID. > > > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html > > How does this compare to putting journals on a separate non-ssd device? > > -Eric > > -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From holger at wizards.de Wed Sep 10 15:31:53 2008 From: holger at wizards.de (Holger Hoffstaette) Date: Wed, 10 Sep 2008 17:31:53 +0200 Subject: journal on an ssd References: Message-ID: Another followup.. On Wed, 10 Sep 2008 13:30:45 +0200, Tobias Oetiker wrote: > Recently we added an SSD to our setup and have moved all the journals to > this ssd. This has dramatically improved performance and especially > reduced the interdependence between performance of different partitions > hosted on the same RAID. > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html You mention that you chose data=journal, i.e. full journaling. Have you tried ordered mode as well? This should still yield a significant performance win because of reduced head movement and faster metadata writes. 
It may or may not be faster depending on the size of the written data itself..I'm just curious if you tested this. thanks Holger From sandeen at redhat.com Wed Sep 10 16:21:32 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 10 Sep 2008 11:21:32 -0500 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: <48C7F40C.2040006@redhat.com> Tobias Oetiker wrote: > Hi Eric, > > I have not tested this, but since we are putting about 16 different > journals on this one ssd, I would assume that the loss through > seeking between the journals would be pretty bad, and again bring > back that inter-filesystem-dependency we were trying to loose with > this measure. Ah, ok - I missed that you had several journals on one device. Thanks, -Eric From worleys at gmail.com Wed Sep 10 17:23:31 2008 From: worleys at gmail.com (Chris Worley) Date: Wed, 10 Sep 2008 11:23:31 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: Look at: http://www.fusionio.com/Products.aspx At 120K IOPS @1K blocks, it should make for a very good journaling device. It's not an SSD per se; it bypasses old disk controllers altogether (very innovative block device design). The block device layer and hardware are tailored for NAND failure idiosyncrasies... which results in their data loss is less than any available SSD or rotating disk. Put two together in a RAID1 configuration to compensate for device failures (assure you have 2 PCIe x8 slots available). Chris On Wed, Sep 10, 2008 at 10:05 AM, Tobias Oetiker wrote: > Hi Eric, > > I have not tested this, but since we are putting about 16 different > journals on this one ssd, I would assume that the loss through > seeking between the journals would be pretty bad, and again bring > back that inter-filesystem-dependency we were trying to loose with > this measure. > > cheers > tobi > > Today Eric Sandeen wrote: > > > Tobias Oetiker wrote: > > > Experts, > > > > > > What happens if the disk hosting an external journal of a filesytem > > > running with data=journal goes bust. > > > > > > The Backstory ... > > > > > > I have been batteling with filesystem performance for some time > > > now. Our setup is a HW Raid(6) with LVM on top and ext3 filesytems. > > > > > > Recently we added an SSD to our setup and have moved all the journals > > > to this ssd. This has dramatically improved performance and > > > especially reduced the interdependence between performance of > > > different partitions hosted on the same RAID. > > > > > > http://insights.oetiker.ch/linux/external-journal-on-ssd.html > > > > How does this compare to putting journals on a separate non-ssd device? > > > > -Eric > > > > > > -- > Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland > http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tobi at oetiker.ch Wed Sep 10 22:58:28 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 00:58:28 +0200 (CEST) Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: Hi Chris, Yesterday Chris Worley wrote: > Note that I do have one to experiment with. > What's a good way to measure journal performance, and/or in what cases do > you need a faster journal (i.e. 
an EXT3 atop an MD device with big block > stripes)? > > Chris Well, the 'problem' we had to solve was the following: setup: - large HW raid6 array - lvm on top - many ext3 partitions when there was a lot of write or metadata update activity on one partition, performance on all other partitions went to 0. (processes hanging for 10-20 seconds as soon as they accessed the filesystem). I am sure that there is a bad-bad bug in the linux kernel somewhere which is causing this, but all the upgrading and patching did not help, the condition remained. Until we moved the journals off to that external ssd. Now I can copy partition A over to partition B and the server remains nicely responsive. I am attributing that to the external journal. Obviously I would like to know how badly off we are going to be when the ssd dies. cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From adilger at sun.com Thu Sep 11 04:10:53 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 10 Sep 2008 22:10:53 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: <20080911041053.GT3086@webber.adilger.int> On Sep 10, 2008 18:05 +0200, Tobias Oetiker wrote: > I have not tested this, but since we are putting about 16 different > journals on this one ssd, I would assume that the loss through > seeking between the journals would be pretty bad, and again bring > back that inter-filesystem-dependency we were trying to loose with > this measure. The cost of putting the journals on 16 separate, relatively small disk devices would probably be comparable to the cost of the SSD and not have a single point of failure. The journal does mostly linear IO, so performance is probably equal or better. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From tobi at oetiker.ch Thu Sep 11 05:43:18 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 07:43:18 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <20080911041053.GT3086@webber.adilger.int> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: Folks, Yesterday Andreas Dilger wrote: > On Sep 10, 2008 18:05 +0200, Tobias Oetiker wrote: > > I have not tested this, but since we are putting about 16 different > > journals on this one ssd, I would assume that the loss through > > seeking between the journals would be pretty bad, and again bring > > back that inter-filesystem-dependency we were trying to loose with > > this measure. > > The cost of putting the journals on 16 separate, relatively small > disk devices would probably be comparable to the cost of the SSD > and not have a single point of failure. The journal does mostly > linear IO, so performance is probably equal or better. You are telling me things that I am aware of. The reason I wrote to this group is to figure out what would happen to an ext3 fs when the external journal was lost, especially what happens when it is lost on a filesystem where 'data=journal' is set. Because if it is catastrophic, then it basically means that the journal has to reside on a device that is as secure as the rest of the data, meaning that if the data is on RAID6 then the journal should be on RAID6 too.
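(For readers who want to reproduce the arrangement under discussion: an external ext3 journal is created and attached roughly as below. The device names are placeholders, the journal device's block size must match the filesystem's, and the filesystem has to be unmounted while the journal is switched over.)

    # create the journal device on the SSD partition
    mke2fs -O journal_dev -b 4096 /dev/sdc1
    # drop the internal journal and attach the external one
    tune2fs -O ^has_journal /dev/vg0/data
    tune2fs -J device=/dev/sdc1 /dev/vg0/data
    # then mount with data=journal as before, e.g. in /etc/fstab:
    #   /dev/vg0/data  /srv/data  ext3  data=journal  0 2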
What I am hoping for, is that someone tells me, that in the case of 'data=journal' the loss would only be the material that is still in the journal (eg 30 seconds worth of data) and the rest of the fs would have a fair chance of being recoverd with fsck. cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From chris at harvington.org.uk Thu Sep 11 08:13:21 2008 From: chris at harvington.org.uk (Chris Haynes) Date: Thu, 11 Sep 2008 09:13:21 +0100 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: <106136964.20080911091321@harvington.org.uk> Just a random thought, and anticipating that the experts will say that if an entire journal is lost (not present) the main data is still accessible / recoverable (in its previous state). Is it perhaps the case that, to maximize the integrity of the main data, one would *want* the journal to have a different failure pattern? That, if there were any doubt about journal integrity, it would be better (for the integrity of the main file system) to discard the journal entirely? This would suggest the use of a robust hash / cryptographic digest of the journal contents, stored with it and checked each time the journal is about to be used. These are quite quick to compute nowadays. Any potential in this speculation? Chris Haynes On Thursday, September 11, 2008 at 6:43:18 AM, Tobias Oetiker wrote: > Folks, > Yesterday Andreas Dilger wrote: >> On Sep 10, 2008 18:05 +0200, Tobias Oetiker wrote: >> > I have not tested this, but since we are putting about 16 different >> > journals on this one ssd, I would assume that the loss through >> > seeking between the journals would be pretty bad, and again bring >> > back that inter-filesystem-dependency we were trying to loose with >> > this measure. >> The cost of putting the journals on 16 separate, relatively small >> disk devices would probably be comparable to the cost of the SSD >> and not have a single point of failure. The journal does mostly >> linear IO, so performance is probably equal or better. > You are telling me things that I am aware of. The reason I wrote to > this group is to figure what would happen to an ext3 fs when the > external journal was lost, especially what happens when it is lost > on a filesystem where 'data=journal' is set. > Because if it is catastrophic, then it basically means that the > journal has to reside on a device that is as secure as to rest of > the data, meaning that if the data is on RAID6 then the journal > should be on RAID6 too. > What I am hoping for, is that someone tells me, that in the case of > 'data=journal' the loss would only be the material that is still in > the journal (eg 30 seconds worth of data) and the rest of the fs > would have a fair chance of being recoverd with fsck. > cheers > tobi From rwheeler at redhat.com Thu Sep 11 11:06:07 2008 From: rwheeler at redhat.com (Ric Wheeler) Date: Thu, 11 Sep 2008 07:06:07 -0400 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> Message-ID: <48C8FB9F.4030904@redhat.com> Tobias Oetiker wrote: > Hi Chris, > > Yesterday Chris Worley wrote: > > >> Note that I do have one to experiment with. >> What's a good way to measure journal performance, and/or in what cases do >> you need a faster journal (i.e. an EXT3 atop an MD device with big block >> stripes)? 
>> >> Chris >> > > Well, the 'problem' we had to solve was the following: > > setup: > > - large HW raid6 array > - lvm on top > - many ext3 partitions > > when there was a lot of write or meta data update activity on one > partition, performance on all other partitions went to 0. > (processes hanging for 10-20 seconds as soon as they accessed the > filesystem). I am sure that there is a bad-bad bug in the linux > kernel somewhere which is causing this, but all the upgrading and > patching did not help, the condition remained. > I assume that you have a hardware RAID card, not an external array? If you do have an array (IBM Shark, EMC box, etc) with battery backed internal cache, then you should get better than SSD speeds from one of its LUNs assuming your cache is large enough ;-) ric > Until we moved the journals off to that external ssd. > > Now I can copy partition A over to partition B and the server > remains nicely responsive. I am atributing that to the external > journal. > > Obviously I would like to know how bad we are going to be had when > the ssd dies. > > cheers > tobi > > From tobi at oetiker.ch Thu Sep 11 11:45:33 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 13:45:33 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <48C8FB9F.4030904@redhat.com> References: <48C7E75D.8040909@redhat.com> <48C8FB9F.4030904@redhat.com> Message-ID: Hi Ric, Today Ric Wheeler wrote: [...] > I assume that you have a hardware RAID card, not an external array? If you do > have an array (IBM Shark, EMC box, etc) with battery backed internal cache, > then you should get better than SSD speeds from one of its LUNs assuming your > cache is large enough ;-) It is a hwraid card with battery backed cache (areca). I think the problem with the built-in cache is that the raid manages it without knowing about the structure of the filesystem. At the bottom of it, it is a strong argument for zfs :-) still wondering what happens when ext3 loses a journal. cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From tytso at MIT.EDU Thu Sep 11 13:07:15 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Thu, 11 Sep 2008 09:07:15 -0400 Subject: journal on an ssd In-Reply-To: <106136964.20080911091321@harvington.org.uk> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <106136964.20080911091321@harvington.org.uk> <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: <20080911130715.GA4759@mit.edu> On Thu, Sep 11, 2008 at 07:43:18AM +0200, Tobias Oetiker wrote: > > What I am hoping for, is that someone tells me, that in the case of > 'data=journal' the loss would only be the material that is still in > the journal (eg 30 seconds worth of data) and the rest of the fs > would have a fair chance of being recoverd with fsck. > The paper you quoted essentially indicated that ext3's JBD layer wasn't checking for error cases sufficiently. It has improved since then, but when I did a quick audit of the code paths, I was still able to find a few places where we aren't checking the error returns when calling sync_dirty_buffer(), for example. In general, though, if there is a failure to write to the SSD, it should get detected fairly quickly, at which point the journal will get aborted, which will suspend writes to the filesystem.
It may not happen as quickly as we might like, and if you get really unlucky and a singleton write fails and it's one where the error return doesn't get checked, you could end up writing garbage to the filesystem on a journal replay. In that worst-case scenario, you might end up losing a full inode table block's worth of inodes, but in general, the loss should be the last few minutes' worth of data. Fsck has a better than normal chance of recovering from a busted journal. That being said, it would be wise to monitor the health of the SSD via S.M.A.R.T., since I would suspect that failures of the SSD should be easily predicted by the firmware. On Thu, Sep 11, 2008 at 09:13:21AM +0100, Chris Haynes wrote: > > Is it perhaps the case that, to maximize the integrity of the main > data, one would *want* the journal to have a different failure > pattern? > > That, if there were any doubt about journal integrity, it would be > better (for the integrity of the main file system) to discard the > journal entirely? > > This would suggest the use of a robust hash / cryptographic digest > of the journal contents, stored with it and checked each time the > journal is about to be used. These are quite quick to compute > nowadays. Indeed, this is what ext4 does; there is a checksum (you don't need a cryptographic digest since contrary to most sysadmins' fears, hard drives are *not* malicious sentient beings :-), in each commit record to detect these problems, and if a problem is found, we abort running the journal right then and there. It is possible this change can mean that you will lose more data, not less. If there is a singleton failure writing a single block, early in the journal, aborting the journal means that we don't replay any of the later journal commits, and it could very well be that the corrupted data block was later rewritten successfully to the journal in a later commit, and in fact, continuing the journal recovery is the right thing to do. On the other hand, if the corrupted data block was a journal descriptor, aborting the journal commit is the best thing you could do. But this could mean that in theory you might end up losing more than just the last 30 seconds, but more like the last couple of minutes' worth of data. (Even data which was fsync'ed, since fsync only guarantees that the data was written to some stable storage; fsync makes no guarantees about what might happen if your stable storage, including the journal, fails to store data correctly.) We've talked about changing the journalling code to write a separate checksum for each block, which would allow us to more intelligently recover from a failed checksum in the journal block. It wouldn't be a trivial thing to add, so we haven't added that to date. And this is a relatively unlikely case, which involves an (undetected) single write failure, followed by a crash at just the wrong time, before the journal has a chance to wrap. Also, ext4 is even better than ext3 in terms of checking error returns (although to be honest when I did a quick audit just now I still did find a few places where we should add some error checks; I'll work on getting fixes submitted for both ext3 and ext4).
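The per-commit checksum described here is exposed, on the ext4 side, as a pair of mount options; the option names below are from memory of the ext4 documentation of that era and should be checked against your kernel, and the device and mount point are placeholders:

    # ext4 (ext4dev on older kernels): verify the checksum in each commit
    # record during journal replay
    mount -t ext4 -o journal_checksum /dev/vg0/data /srv/data
    # journal_async_commit implies journal_checksum and lets the commit block
    # be written without waiting for the preceding journal blocks
    mount -t ext4 -o journal_async_commit /dev/vg0/data /srv/data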
- Ted From tobi at oetiker.ch Thu Sep 11 14:38:15 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 16:38:15 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <20080911130715.GA4759@mit.edu> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <106136964.20080911091321@harvington.org.uk> <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911130715.GA4759@mit.edu> Message-ID: Teo, Today Theodore Tso wrote: [...] > In that worst case scenario, you might end up losing a full inode > table block's worth of inodes, but in general, the loss should be the > last few minutes worth of data. Fsck has a better than normal chance > of recoverying from a busted journal. That being said, it would be > wise to monitor the health of the SSD via S.M.A.R.T., since I would > suspect that failures of the SSD should be easily predicted by the > firmware. you are the man, thanks ... that was the kind of answer I was looking for :-) I have started to smart mon my journal disk ... it has interesting properties in smart, a whole lot of which my version of smartmontools not seems to know about ... do you have any insight in this ? is there a list of relevant smart properties ? I have also set errors=panic as a mount option, or is this unwise in this context ? -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From adilger at sun.com Thu Sep 11 21:07:01 2008 From: adilger at sun.com (Andreas Dilger) Date: Thu, 11 Sep 2008 15:07:01 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> Message-ID: <20080911210701.GD3086@webber.adilger.int> On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > You are telling me things that I am aware of. The reason I wrote to > this group is to figure what would happen to an ext3 fs when the > external journal was lost, especially what happens when it is lost > on a filesystem where 'data=journal' is set. Losing a journal will, in 99% of the cases, mean the loss of only a few seconds of data. In some rare cases it may be that an inconsistency from a partially-updated commit will cause e2fsck to become confused and possibly clean up a small number more files than it would have otherwise. > Because if it is catastrophic, then it basically means that the > journal has to reside on a device that is as secure as to rest of > the data, meaning that if the data is on RAID6 then the journal > should be on RAID6 too. No, because RAID6 is terribly sucky for performance. If you need this kind of reliability triple-mirrored RAID 1 would be better. Much less CPU overhead, and no extra IO. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From tobi at oetiker.ch Thu Sep 11 21:10:36 2008 From: tobi at oetiker.ch (Tobias Oetiker) Date: Thu, 11 Sep 2008 23:10:36 +0200 (CEST) Subject: journal on an ssd In-Reply-To: <20080911210701.GD3086@webber.adilger.int> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> Message-ID: Hi Andreas, Today Andreas Dilger wrote: > On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > > You are telling me things that I am aware of. 
The reason I wrote to > > this group is to figure what would happen to an ext3 fs when the > > external journal was lost, especially what happens when it is lost > > on a filesystem where 'data=journal' is set. > > Losing a journal will, in 99% of the cases, mean the loss of only a > few seconds of data. In some rare cases it may be that an inconsistency > from a partially-updated commit will cause e2fsck to become confused > and possibly clean up a small number more files than it would have > otherwise. glad to hear > > Because if it is catastrophic, then it basically means that the > > journal has to reside on a device that is as secure as to rest of > > the data, meaning that if the data is on RAID6 then the journal > > should be on RAID6 too. > > No, because RAID6 is terribly sucky for performance. If you need this > kind of reliability triple-mirrored RAID 1 would be better. Much less > CPU overhead, and no extra IO. true ... do you happen to know how zfs handles it when the intent log is on an ssd ? cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900 From adilger at sun.com Thu Sep 11 21:17:41 2008 From: adilger at sun.com (Andreas Dilger) Date: Thu, 11 Sep 2008 15:17:41 -0600 Subject: journal on an ssd In-Reply-To: References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> Message-ID: <20080911211741.GG3086@webber.adilger.int> On Sep 11, 2008 23:10 +0200, Tobias Oetiker wrote: > Today Andreas Dilger wrote: > > No, because RAID6 is terribly sucky for performance. If you need this > > kind of reliability triple-mirrored RAID 1 would be better. Much less > > CPU overhead, and no extra IO. > > do you happen to know how zfs handles it when the intent log is on > an ssd ? My (limited) understanding is that it will also mirror the intent log. I'm not really a ZFS guru, and Lustre's use of the DMU doesn't (yet) include use of the intent log. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From carlo at alinoe.com Thu Sep 11 21:58:22 2008 From: carlo at alinoe.com (Carlo Wood) Date: Thu, 11 Sep 2008 23:58:22 +0200 Subject: pthread? Message-ID: <20080911215822.GA5731@alinoe.com> A user of ext3grep had a configuration problem that I tracked down to the fact that pkg-config --cflags ext2fs returns -pthread Why does it return -pthread ? That seems a bug to me. Please keep this user in the CC. Note on my (debian) system `pkg-config --cflags ext2fs` returns nothing. I don't know why his returns -pthread. Siegward, any ideas? -- Carlo Wood From tytso at MIT.EDU Thu Sep 11 21:57:23 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Thu, 11 Sep 2008 17:57:23 -0400 Subject: journal on an ssd In-Reply-To: References: <20080911041053.GT3086@webber.adilger.int> <106136964.20080911091321@harvington.org.uk> <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911130715.GA4759@mit.edu> Message-ID: <20080911215723.GP5082@mit.edu> On Thu, Sep 11, 2008 at 04:38:15PM +0200, Tobias Oetiker wrote: > > you are the man, thanks ... that was the kind of answer I was > looking for :-) I have started to smart mon my journal disk > ... it has interesting properties in smart, a whole lot of which my > version of smartmontools not seems to know about ... do you have > any insight in this ? is there a list of relevant smart properties ? Sorry, I don't. 
You might try upgrading to a newer version of smartmontools, since as people figure out what some of the properties mean (especially the ones with the high numbers that tend to be hard drive specific, and not standardized) they get added to the smartmontools program. Fortunately, it's not necessary to know what the properties mean in order for smartmontools to know if the hard drive is about to fail. > I have also set errors=panic as a mount option, or is this unwise > in this context ? It's a good thing. I would recommend using some kind of serial console logger though, so that if there are failures, you can see what the system emitted as its last gasp before panicking and rebooting (since if the filesystem containing /var/log is set with errors=panic, you won't find that information in /var/log/messages). In general, for any production machine, I recommend serial console loggers, since if you have attackers who have broken into your machine with a rootkit, and attempt to hide their tracks by editing the logs, presumably they won't have access to whatever machine you have dedicated to capturing and storing the logs from the serial console for all of your servers. - Ted From tytso at MIT.EDU Thu Sep 11 22:39:37 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Thu, 11 Sep 2008 18:39:37 -0400 Subject: pthread? In-Reply-To: <20080911215822.GA5731@alinoe.com> References: <20080911215822.GA5731@alinoe.com> Message-ID: <20080911223937.GQ5082@mit.edu> On Thu, Sep 11, 2008 at 11:58:22PM +0200, Carlo Wood wrote: > A user of ext3grep had a configuration problem > that I tracked down to the fact that > > pkg-config --cflags ext2fs > > returns > > -pthread > > Why does it return -pthread ? What distribution and what version of e2fsprogs is this user using? I'm going to guess that he is using SuSE or some OpenSuSE derivative, and it's because SuSE bludgeoned in a pthreads mutex into the internals of libcom_err. Since libext2fs can call libcom_err, it follows that a program that links with libext2fs needs to also be compiled and linked with -pthread. It's for this reason I've resisted including SuSE's change, because the race they are concerned about is largely theoretical, and it causes problems for people who want to link against libcom_err. What I probably should do is add locking using sem_wait/sem_post, which doesn't require any Posix pthread nonsense. - Ted From keld at dkuug.dk Fri Sep 12 08:17:12 2008 From: keld at dkuug.dk (Keld Jørn Simonsen) Date: Fri, 12 Sep 2008 10:17:12 +0200 Subject: journal on an ssd In-Reply-To: <20080911210701.GD3086@webber.adilger.int> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> Message-ID: <20080912081712.GB21798@rap.rap.dk> On Thu, Sep 11, 2008 at 03:07:01PM -0600, Andreas Dilger wrote: > On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > > Because if it is catastrophic, then it basically means that the > > journal has to reside on a device that is as secure as to rest of > > the data, meaning that if the data is on RAID6 then the journal > > should be on RAID6 too. > > No, because RAID6 is terribly sucky for performance. If you need this > kind of reliability triple-mirrored RAID 1 would be better. Much less > CPU overhead, and no extra IO. RAID6 performs nicely for reads, but has quite bad performance for some writes (non-sequential). Raid6 is actually surprisingly fast for sequential reads.
Best regards Keld From adilger at sun.com Fri Sep 12 09:12:33 2008 From: adilger at sun.com (Andreas Dilger) Date: Fri, 12 Sep 2008 03:12:33 -0600 Subject: journal on an ssd In-Reply-To: <20080912081712.GB21798@rap.rap.dk> References: <48C7E75D.8040909@redhat.com> <20080911041053.GT3086@webber.adilger.int> <20080911210701.GD3086@webber.adilger.int> <20080912081712.GB21798@rap.rap.dk> Message-ID: <20080912091233.GX3086@webber.adilger.int> On Sep 12, 2008 10:17 +0200, Keld J?rn Simonsen wrote: > On Thu, Sep 11, 2008 at 03:07:01PM -0600, Andreas Dilger wrote: > > On Sep 11, 2008 07:43 +0200, Tobias Oetiker wrote: > > > Because if it is catastrophic, then it basically means that the > > > journal has to reside on a device that is as secure as to rest of > > > the data, meaning that if the data is on RAID6 then the journal > > > should be on RAID6 too. > > > > No, because RAID6 is terribly sucky for performance. If you need this > > kind of reliability triple-mirrored RAID 1 would be better. Much less > > CPU overhead, and no extra IO. > > RAID6 performs nicely for reads, but has quite bad performance for some > writes (non-sequential). Raid6 is actually surprisingly fast for > sequential reads. The journal is NEVER read during normal operation, only once during journal recovery after a crash. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From worleys at gmail.com Tue Sep 16 20:10:05 2008 From: worleys at gmail.com (Chris Worley) Date: Tue, 16 Sep 2008 14:10:05 -0600 Subject: When is a block free? Message-ID: Where in the ext2/3 code does it know that a block on the disk is now free to reuse? Thanks, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From rwheeler at redhat.com Tue Sep 16 20:17:12 2008 From: rwheeler at redhat.com (Ric Wheeler) Date: Tue, 16 Sep 2008 16:17:12 -0400 Subject: When is a block free? In-Reply-To: References: Message-ID: <48D01448.4050107@redhat.com> Chris Worley wrote: > Where in the ext2/3 code does it know that a block on the disk is now > free to reuse? > > Thanks, > > Chris Hi Chris, File systems track which blocks are free from the file system creation time (mkfs), creation of new files and deletion. Ext2/3 is the gatekeeper for all deletions, so it knows when file system blocks transition from the used state to the free state. Ext file system use bitmaps to track the blocks that are allocated or not. Regards, Ric From articpenguin3800 at gmail.com Fri Sep 19 02:19:46 2008 From: articpenguin3800 at gmail.com (John Nelson) Date: Thu, 18 Sep 2008 22:19:46 -0400 Subject: directorys Message-ID: Does ext3 journal directory changes? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tytso at mit.edu Fri Sep 19 15:01:48 2008 From: tytso at mit.edu (Theodore Tso) Date: Fri, 19 Sep 2008 11:01:48 -0400 Subject: directorys In-Reply-To: References: Message-ID: <20080919150148.GB13113@mit.edu> On Thu, Sep 18, 2008 at 10:19:46PM -0400, John Nelson wrote: > Does ext3 journal directory changes? Yes, it does; it has to, if you want the filesystem to be recoverable across an unclean shutdown. - Ted From rmichael-ext3 at edgeofthenet.org Mon Sep 22 00:44:57 2008 From: rmichael-ext3 at edgeofthenet.org (Richard Michael) Date: Sun, 21 Sep 2008 20:44:57 -0400 Subject: Rsync --link-dest and ext3: can I increase the number of inodes? 
Message-ID: <20080922004457.GC17339@nexus.edgeofthenet.org> Hello list, (I run rsync --link-dest backups onto ext3 and am anticipating running out of inodes.) Is there a tool I can use to increase the number of inodes on an ext3 filesystem? Also, are there any other implications I should be aware of when using rsync in this way on ext3? Specifically, what became of this discussion related to e2fsck and memory use? https://www.redhat.com/archives/ext3-users/2007-April/msg00017.html Thanks, Richard From tytso at mit.edu Mon Sep 22 02:27:24 2008 From: tytso at mit.edu (Theodore Tso) Date: Sun, 21 Sep 2008 22:27:24 -0400 Subject: Rsync --link-dest and ext3: can I increase the number of inodes? In-Reply-To: <20080922004457.GC17339@nexus.edgeofthenet.org> References: <20080922004457.GC17339@nexus.edgeofthenet.org> Message-ID: <20080922022724.GA9914@mit.edu> On Sun, Sep 21, 2008 at 08:44:57PM -0400, Richard Michael wrote: > (I run rsync --link-dest backups onto ext3 and am anticipating running > out of inodes.) > > Is there a tool I can use to increase the number of inodes on an ext3 > filesystem? Not without backing up your data to tape/DVD/whatever, reformatting the filesystem, and restoring from backups, sorry. > Also, are there any other implications I should be aware of when using > rsync in this way on ext3? Specifically, what became of this discussion > related to e2fsck and memory use? > > https://www.redhat.com/archives/ext3-users/2007-April/msg00017.html This is still a problem, and it's pretty fundamental to how e2fsck works. Calculating the number of hard links so we can make sure that i_links_count is correct requires a large amount of memory; there's no getting around that. E2fsck has a short-cut optimization that works for the common case where i_links_count=1, but that's not true if you are using backup strategies such as rsync --link-dest. The solution described above is present in mainline e2fsprogs, as an emergency method of allowing e2fsck to fix broken filesystems, but if you have to resort to it, it's *S*L*O*W*. I haven't gotten enough feedback to know whether it would be faster to use a 64-bit system and then enable swap; obviously the best way would be to use a 64-bit system and then have gobs and gobs of memory installed on your system. If you have a 32-bit system, and e2fsck needs more than 3-GB of user address space, you can try using a statically linked e2fsck to try to use the 3GB of address space most efficiently, but in the long run you will probably have to use the workaround described in the above link, and resign yourself to a very long fsck process. Alternatively, you could try using a backup program which uses a real database to keep track of reused files, instead of trying to use directory inodes and hard links as a bad substitute for the same. - Ted From cs at zip.com.au Mon Sep 22 04:12:57 2008 From: cs at zip.com.au (Cameron Simpson) Date: Mon, 22 Sep 2008 14:12:57 +1000 Subject: Rsync --link-dest and ext3: can I increase the number of inodes? In-Reply-To: <20080922022724.GA9914@mit.edu> Message-ID: <20080922041257.GA4867@cskk.homeip.net> On 21Sep2008 22:27, Theodore Tso wrote: | On Sun, Sep 21, 2008 at 08:44:57PM -0400, Richard Michael wrote: | > (I run rsync --link-dest backups onto ext3 and am anticipating running | > out of inodes.) [...] Hmm. While I take the point that each link tree consumes inodes for the directories, in a tree that changes little the use of new inodes for new/changed files should be quite slow. 
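One way to put numbers on that worry is to compare the filesystem's remaining inode headroom against what a single snapshot costs; every directory is re-created for each --link-dest snapshot, while unchanged files are only hard-linked. A rough check, with hypothetical paths:

    # how much inode headroom is left on the backup filesystem?
    df -i /backups
    # roughly what one more snapshot will cost in inodes
    find /backups/2008-09-21 -type d | wc -l      # directories: always new inodes
    find /backups/2008-09-21 \! -type d | wc -l   # files: new inodes only if changed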
[...snip e2fsck memory requirements...]
| Alternatively, you could try using a backup program which uses a real
| database to keep track of reused files, instead of trying to use
| directory inodes and hard links as a bad substitute for the same.

But a database is... more complicated and then requires special db-aware
tools for a real recovery. The hard link thing is very simple and very
direct. It has its drawbacks (chmod/chown history being the main one
that comes to my mind) but for many scenarios it works quite well.

For Richard's benefit, I can report that I've used the hard link backup
tree approach extensively on ext3 filesystems made with default mke2fs
options (i.e. no special inode count size) and have never run out of
inodes. Have you actually done some figuring to decide that running out
of inodes is probable?

Cheers,
--
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

Peeve: Going to our favorite breakfast place, only to find that they were
hit by a car...AND WE MISSED IT. - Don Baldwin,

From jelledejong at powercraft.nl Mon Sep 22 08:01:37 2008
From: jelledejong at powercraft.nl (Jelle de Jong)
Date: Mon, 22 Sep 2008 10:01:37 +0200
Subject: badblocks output format question
Message-ID: <48D750E1.5090905@powercraft.nl>

Hello List,

I was testing a hard disk with badblocks, but I can't find what the exact
output means. I have attached my logfile. Is the drive bad or ok :-p.

Reading and comparing: 1556108 done, 339455:18:25 elapsed
3656908 done, 339455:20:00 elapsed
10566092done, 339455:25:11 elapsed

Package: e2fsprogs
Architecture: i386
Version: 1.41.1-3

Best regards,

Jelle
-------------- next part --------------
A non-text attachment was scrubbed...
Name: badblocks.log
Type: text/x-log
Size: 4932 bytes
Desc: not available
URL:

From tytso at mit.edu Mon Sep 22 13:51:56 2008
From: tytso at mit.edu (Theodore Tso)
Date: Mon, 22 Sep 2008 09:51:56 -0400
Subject: Rsync --link-dest and ext3: can I increase the number of inodes?
In-Reply-To: <20080922041257.GA4867@cskk.homeip.net>
References: <20080922022724.GA9914@mit.edu> <20080922041257.GA4867@cskk.homeip.net>
Message-ID: <20080922135156.GD9914@mit.edu>

On Mon, Sep 22, 2008 at 02:12:57PM +1000, Cameron Simpson wrote:
> On 21Sep2008 22:27, Theodore Tso wrote:
> | On Sun, Sep 21, 2008 at 08:44:57PM -0400, Richard Michael wrote:
> | > (I run rsync --link-dest backups onto ext3 and am anticipating running
> | > out of inodes.) [...]
>
> Hmm. While I take the point that each link tree consumes inodes for the
> directories, in a tree that changes little the use of new inodes for
> new/changed files should be quite slow.

There are two problems. The first is that the number of inodes you can
consume with directories will increase with each incremental backup.
If you don't eventually delete some of your older backups, then you
will eventually run out of inodes. There's no getting around that.

The second problem is that each inode which has multiple hard links
takes up a small amount of memory per inode. If you are backing up a
very large number of files, this number may consume more address space
than you have on a 32-bit system. I have a workaround that uses tdb,
but it is quite slow. (I have another idea that might be faster, but
I'll have to try it to see how well or poorly it works.)

> But a database is... more complicated and then requires special db-aware
> tools for a real recovery. The hard link thing is very simple and very
> direct.
It has its drawbacks (chmod/chown history being the main one
> that comes to my mind) but for many scenarios it works quite well.

Sure, but the solution may not scale so well for folks who are backing
up 50+ machines and backing up all of /usr, including all of the
distribution maintained files, or for folks who never delete any of
their past incremental backups.

> For Richard's benefit, I can report that I've used the hard link backup
> tree approach extensively on ext3 filesystems made with default mke2fs
> options (i.e. no special inode count size) and have never run out of
> inodes. Have you actually done some figuring to decide that running out
> of inodes is probable?

Sure, but how many machines are you backing up this way, and how many
days of backups are you keeping? And have you ever tried running
"e2fsck -nftt /dev/hdXX" (you can do this on a live system if you
want; the -n means you won't write anything to disk, and the goal is
to see how much memory e2fsck needs) to make sure you can fix the
filesystem if you need it?

- Ted

From jelledejong at powercraft.nl Mon Sep 22 20:55:36 2008
From: jelledejong at powercraft.nl (Jelle de Jong)
Date: Mon, 22 Sep 2008 22:55:36 +0200
Subject: badblocks output format question
In-Reply-To: <48D750E1.5090905@powercraft.nl>
References: <48D750E1.5090905@powercraft.nl>
Message-ID: <48D80648.40703@powercraft.nl>

Jelle de Jong wrote:
> Hello List,
>
> I was testing a hard disk with badblocks, but I can't find what the exact
> output means. I have attached my logfile. Is the drive bad or ok :-p.
>
> Reading and comparing: 1556108 done, 339455:18:25 elapsed
> 3656908 done, 339455:20:00 elapsed
> 10566092done, 339455:25:11 elapsed
>
> Package: e2fsprogs
> Architecture: i386
> Version: 1.41.1-3
>
> Best regards,
>
> Jelle
>
It seems the log I sent is of a broken device; I ran badblocks on another
disk and there were no sectors in the output. However, the time indicator
is a bit awkward (339455:20:00 elapsed), which seems to me like an integer
overflow or initialization bug.

Best regards,

Jelle

From ulf at openlane.com Mon Sep 22 23:10:34 2008
From: ulf at openlane.com (Ulf Zimmermann)
Date: Mon, 22 Sep 2008 16:10:34 -0700
Subject: ext3 zerofree option and RedHat back port?
Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com>

Can anyone tell me if the zerofree option for ext3 has been back ported
to RedHat EL4 or EL5?

Regards, Ulf.

---------------------------------------------------------------------
OPENLANE Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
---------------------------------------------------------------------

From cs at zip.com.au Tue Sep 23 00:00:54 2008
From: cs at zip.com.au (Cameron Simpson)
Date: Tue, 23 Sep 2008 10:00:54 +1000
Subject: Rsync --link-dest and ext3: can I increase the number of inodes?
In-Reply-To: <20080922135156.GD9914@mit.edu>
Message-ID: <20080923000054.GA8244@cskk.homeip.net>

On 22Sep2008 09:51, Theodore Tso wrote:
[...snip a lot of remarks I entirely agree with...]
| > But a database is... more complicated [...]
|
| Sure, but the solution may not scale so well for folks who are backing
| up 50+ machines and backing up all of /usr, including all of the
| distribution maintained files, or for folks who never delete any of
| their past incremental backups.

Sure. There's plenty of stuff I wouldn't back up this way.
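For concreteness, the pattern under discussion is the usual --link-dest
one; the names below are invented, but something like:

  rsync -a --link-dest=/backup/daily.1 /data/ /backup/daily.0/

where files unchanged since daily.1 show up in daily.0 as hard links to
the existing copies, so only changed files and the directory skeleton
cost new inodes.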
| > For Richard's benefit, I can report that I've used the hard link backup
| > tree approach extensively on ext3 filesystems made with default mke2fs
| > options (i.e. no special inode count size) and have never run out of
| > inodes. Have you actually done some figuring to decide that running out
| > of inodes is probable?
|
| Sure, but how many machines are you backing up this way, and how many
| days of backups are you keeping?

My own current use case is pretty small, and they're not machines but
data trees (eg static web site trees, configuration files etc - they
have well defined and simple permissions and usually low change rates
so I don't need "machine image" quality, just data integrity). Some
10s of GB and 4 months of dailies; I do prune old trees, but for
overall disc space reasons, not lack of inodes. Only half of this is
on ext3; the other is on xfs which I think has dynamic inode
allocation.

Probably we need to know more about Richard's plans.

| And have you ever tried running
| "e2fsck -nftt /dev/hdXX" (you can do this on a live system if you
| want; the -n means you won't write anything to disk, and the goal is
| to see how much memory e2fsck needs) to make sure you can fix the
| filesystem if you need it?

I'll queue this up as something to try, though the backups themselves
are replicated to elsewhere anyway.

Cheers,
--
Cameron Simpson DoD#743

From tytso at mit.edu Tue Sep 23 05:07:30 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 23 Sep 2008 01:07:30 -0400
Subject: badblocks output format question
In-Reply-To: <48D750E1.5090905@powercraft.nl>
References: <48D750E1.5090905@powercraft.nl>
Message-ID: <20080923050730.GA8920@mit.edu>

On Mon, Sep 22, 2008 at 10:01:37AM +0200, Jelle de Jong wrote:
> Hello List,
>
> I was testing a hard disk with badblocks, but I can't find what the exact
> output means. I have attached my logfile. Is the drive bad or ok :-p.

Your drive is fine. This was a bug in the badblocks program which was
introduced in e2fsprogs 1.41.1. This caused the percentage and elapsed
time to be incorrectly displayed when the badblocks options -w and -s
were given.

Thanks for mentioning it. I'll fix it for the next release.

- Ted

From sandeen at redhat.com Tue Sep 23 15:13:20 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 23 Sep 2008 10:13:20 -0500
Subject: ext3 zerofree option and RedHat back port?
In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com>
References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com>
Message-ID: <48D90790.2020202@redhat.com>

Ulf Zimmermann wrote:
> Can anyone tell me if the zerofree option for ext3 has been back ported
> to RedHat EL4 or EL5?

there appears to be no backporting to do; it's a single .c file that
makes simple use (I assume...) of libext2...

But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 box.

If you wanted to, you could be the maintainer for Fedora, and put it
into EPEL, which would make it available for RHEL :)

-Eric

From tytso at mit.edu Tue Sep 23 16:49:26 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 23 Sep 2008 12:49:26 -0400
Subject: ext3 zerofree option and RedHat back port?
In-Reply-To: <48D90790.2020202@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> Message-ID: <20080923164926.GC12889@mit.edu> On Tue, Sep 23, 2008 at 10:13:20AM -0500, Eric Sandeen wrote: > Ulf Zimmermann wrote: > > Can anyone tell me if the zerofree option for ext3 has been back ported > > to RedHat EL4 or EL5? > > there appears to be no backporting to do; it's a single .c file that > makes simple use (I assume...) of libext2... > > But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 box. > > If you wanted to, you could be the maintainer for Fedora, and put it > into EPEL, which would make it available for RHEL :) Or it would be roughly a 5 line change to e2image (3 for option parsing, 1 for the usage line, and 1 to the if statement in write_raw_image_file() :-) to add an option to extend the "raw dump" functionality to also dump the data blocks of files, at which point it would create a sparse file containing only the used blocks in the filesystem for you, automatically. - Ted From sandeen at redhat.com Tue Sep 23 17:01:47 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 23 Sep 2008 12:01:47 -0500 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <20080923164926.GC12889@mit.edu> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> Message-ID: <48D920FB.6030206@redhat.com> Theodore Tso wrote: > On Tue, Sep 23, 2008 at 10:13:20AM -0500, Eric Sandeen wrote: >> Ulf Zimmermann wrote: >>> Can anyone tell me if the zerofree option for ext3 has been back ported >>> to RedHat EL4 or EL5? >> there appears to be no backporting to do; it's a single .c file that >> makes simple use (I assume...) of libext2... >> >> But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 box. >> >> If you wanted to, you could be the maintainer for Fedora, and put it >> into EPEL, which would make it available for RHEL :) > > Or it would be roughly a 5 line change to e2image (3 for option > parsing, 1 for the usage line, and 1 to the if statement in > write_raw_image_file() :-) to add an option to extend the "raw dump" > functionality to also dump the data blocks of files, at which point it > would create a sparse file containing only the used blocks in the > filesystem for you, automatically. > > - Ted hey that sounds even better than a random collection of single-purpose utilities! ;) (But I suppose the original util had the other useful purpose of scrubbing free blocks even if you don't intend to compress the fs image...) -Eric From ulf at openlane.com Wed Sep 24 03:22:09 2008 From: ulf at openlane.com (Ulf Zimmermann) Date: Tue, 23 Sep 2008 20:22:09 -0700 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <48D920FB.6030206@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> > -----Original Message----- > From: Eric Sandeen [mailto:sandeen at redhat.com] > Sent: 09/23/2008 10:02 > To: Theodore Tso > Cc: Ulf Zimmermann; ext3-users at redhat.com > Subject: Re: ext3 zerofree option and RedHat back port? 
> > Theodore Tso wrote: > > On Tue, Sep 23, 2008 at 10:13:20AM -0500, Eric Sandeen wrote: > >> Ulf Zimmermann wrote: > >>> Can anyone tell me if the zerofree option for ext3 has been back > ported > >>> to RedHat EL4 or EL5? > >> there appears to be no backporting to do; it's a single .c file that > >> makes simple use (I assume...) of libext2... > >> > >> But no, it's not in Fedora, EPEL, or RHEL. Builds fine on my rhel5 > box. > >> > >> If you wanted to, you could be the maintainer for Fedora, and put it > >> into EPEL, which would make it available for RHEL :) > > > > Or it would be roughly a 5 line change to e2image (3 for option > > parsing, 1 for the usage line, and 1 to the if statement in > > write_raw_image_file() :-) to add an option to extend the "raw dump" > > functionality to also dump the data blocks of files, at which point > it > > would create a sparse file containing only the used blocks in the > > filesystem for you, automatically. > > > > - Ted > > hey that sounds even better than a random collection of single-purpose > utilities! ;) > > (But I suppose the original util had the other useful purpose of > scrubbing free blocks even if you don't intend to compress the fs > image...) > > -Eric Reason I asked is this. We use currently 3Par S400 and E200 as SAN arrays. The new T400 and T800 has a built in chip to do more intelligent thin provisioning but I believe even the S400 and E200 we have will free on the SAN level a block of a thin provisioned volume if it gets zero'ed out. Haven't gotten around yet to test it, but I am planning on. We are currently using 3 different file system types, one is a propriety from Onstor for their Bobcats (NFS/CIFS heads) where I believe I have observed just freeing of SAN level blocks. The two other are EXT3 and OCFS2. Ulf Zimmermann From sandeen at redhat.com Wed Sep 24 03:30:19 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 23 Sep 2008 22:30:19 -0500 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> Message-ID: <48D9B44B.9000707@redhat.com> Ulf Zimmermann wrote: > Reason I asked is this. We use currently 3Par S400 and E200 as SAN > arrays. The new T400 and T800 has a built in chip to do more intelligent > thin provisioning but I believe even the S400 and E200 we have will free > on the SAN level a block of a thin provisioned volume if it gets zero'ed > out. Haven't gotten around yet to test it, but I am planning on. We are > currently using 3 different file system types, one is a propriety from > Onstor for their Bobcats (NFS/CIFS heads) where I believe I have > observed just freeing of SAN level blocks. The two other are EXT3 and > OCFS2. Ok, so you really want to zero the unused blocks in-place, and e2image writing out a new sparsified image isn't a ton of help. The tool does that, I guess - but only on an unmounted or RO-mounted filesystem, right? (plus I'd triple-check that it's doing things correctly, opening a block device and splatting zeros around, one hopes that it is!) But in any case the util itself is simple enough that building (or even packaging) for fedora/EPEL should be trivial. 
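From the description I'd expect the workflow to look roughly like this
(mount point and device names are made up, and I haven't run this build
myself):

  mount -o remount,ro /thinvol       # or umount it entirely
  zerofree /dev/mapper/vg0-thinvol   # walk the block bitmap, zero unallocated blocks
  mount -o remount,rw /thinvol

i.e. the filesystem's own free-space bitmap drives the writes, rather
than userspace guessing which blocks are unused.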
(FWIW, there is work upstream for filesystems to actually communicate freed blocks to the underlying storage, just for this purpose...) -Eric From ulf at openlane.com Wed Sep 24 04:17:26 2008 From: ulf at openlane.com (Ulf Zimmermann) Date: Tue, 23 Sep 2008 21:17:26 -0700 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <48D9B44B.9000707@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A66F@msmpk01.corp.autc.com> > -----Original Message----- > From: Eric Sandeen [mailto:sandeen at redhat.com] > Sent: 09/23/2008 20:30 > To: Ulf Zimmermann > Cc: Theodore Tso; ext3-users at redhat.com > Subject: Re: ext3 zerofree option and RedHat back port? > > Ulf Zimmermann wrote: > > > Reason I asked is this. We use currently 3Par S400 and E200 as SAN > > arrays. The new T400 and T800 has a built in chip to do more > intelligent > > thin provisioning but I believe even the S400 and E200 we have will > free > > on the SAN level a block of a thin provisioned volume if it gets > zero'ed > > out. Haven't gotten around yet to test it, but I am planning on. We > are > > currently using 3 different file system types, one is a propriety > from > > Onstor for their Bobcats (NFS/CIFS heads) where I believe I have > > observed just freeing of SAN level blocks. The two other are EXT3 and > > OCFS2. > > Ok, so you really want to zero the unused blocks in-place, and e2image > writing out a new sparsified image isn't a ton of help. > > The tool does that, I guess - but only on an unmounted or RO-mounted > filesystem, right? (plus I'd triple-check that it's doing things > correctly, opening a block device and splatting zeros around, one hopes > that it is!) > > But in any case the util itself is simple enough that building (or even > packaging) for fedora/EPEL should be trivial. > > (FWIW, there is work upstream for filesystems to actually communicate > freed blocks to the underlying storage, just for this purpose...) > > -Eric I am going to try it out by hand. Create a thin provisioned volume, write random crap to it, then zero the blocks. See if that shrinks the physical allocated space. Ulf. From adilger at sun.com Wed Sep 24 06:35:11 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 24 Sep 2008 00:35:11 -0600 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <48D9B44B.9000707@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> Message-ID: <20080924063511.GX10950@webber.adilger.int> On Sep 23, 2008 22:30 -0500, Eric Sandeen wrote: > Ulf Zimmermann wrote: > Ok, so you really want to zero the unused blocks in-place, and e2image > writing out a new sparsified image isn't a ton of help. > > The tool does that, I guess - but only on an unmounted or RO-mounted > filesystem, right? (plus I'd triple-check that it's doing things > correctly, opening a block device and splatting zeros around, one hopes > that it is!) That is WAY to scary for me on a mounted filesystem. It is racy if the blocks become allocated. 
Instead, what I always do when creating a sparse image for e2fsck test cases is just "dd if=/dev/zero of=/mnt/fs/zeroes bs=64k; rm /mnt/fs/zeroes" until the filesystem is full, then the file is deleted. This will leave blocks "empty" for the free space in the filesystem without any special tools. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From rmy at tigress.co.uk Wed Sep 24 08:12:37 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 09:12:37 +0100 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> Message-ID: <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> "Ulf Zimmermann" wrote: >Can anyone tell me if the zerofree option for ext3 has been back ported >to RedHat EL4 or EL5? I used to maintain backports of zerofree (the kernel patch, not the utility) to EL4 and EL5, but since I wasn't actually using them I gave up. The last RPMs I have are from December of last year. Contact me directly if you want them. I don't recommend the ext3 patch as it hasn't seen much use. I regularly use the ext2 version (on Fedora 9), but be warned that Ted has expressed concerns about it. Ron From rmy at tigress.co.uk Wed Sep 24 08:19:53 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 09:19:53 +0100 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> Message-ID: <200809240819.m8O8JrfC010279@tiffany.internal.tigress.co.uk> "Ulf Zimmermann" wrote: >Reason I asked is this. We use currently 3Par S400 and E200 as SAN >arrays. The new T400 and T800 has a built in chip to do more intelligent >thin provisioning but I believe even the S400 and E200 we have will free >on the SAN level a block of a thin provisioned volume if it gets zero'ed >out. Haven't gotten around yet to test it, but I am planning on. We are >currently using 3 different file system types, one is a propriety from >Onstor for their Bobcats (NFS/CIFS heads) where I believe I have >observed just freeing of SAN level blocks. The two other are EXT3 and >OCFS2. Interesting. A similar case I've seen recently is s3backer, a FUSE filesystem that keeps its blocks as objects in Amazon S3: http://code.google.com/p/s3backer/ Blocks of zeroes aren't actually stored, so they suggest using zerofree to get rid of non-zero deleted blocks and avoid being charged for them. Ron From rmy at tigress.co.uk Wed Sep 24 08:23:20 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 09:23:20 +0100 Subject: ext3 zerofree option and RedHat back port? 
In-Reply-To: <20080924063511.GX10950@webber.adilger.int> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> <20080924063511.GX10950@webber.adilger.int> Message-ID: <200809240823.m8O8NK55010286@tiffany.internal.tigress.co.uk> Andreas Dilger wrote: >> Ulf Zimmermann wrote: >> Ok, so you really want to zero the unused blocks in-place, and e2image >> writing out a new sparsified image isn't a ton of help. >> >> The tool does that, I guess - but only on an unmounted or RO-mounted >> filesystem, right? (plus I'd triple-check that it's doing things >> correctly, opening a block device and splatting zeros around, one hopes >> that it is!) > >That is WAY to scary for me on a mounted filesystem. It is racy if the >blocks become allocated. The 1.0.0 version of the zerofree utility only worked on unmounted filesystems, but then someone suggested that it should be safe on a read-only mount. Is that not so? Ron From rwheeler at redhat.com Wed Sep 24 11:19:02 2008 From: rwheeler at redhat.com (Ric Wheeler) Date: Wed, 24 Sep 2008 07:19:02 -0400 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <5DE4B7D3E79067418154C49A739C125104C4A66F@msmpk01.corp.autc.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <48D90790.2020202@redhat.com> <20080923164926.GC12889@mit.edu> <48D920FB.6030206@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66C@msmpk01.corp.autc.com> <48D9B44B.9000707@redhat.com> <5DE4B7D3E79067418154C49A739C125104C4A66F@msmpk01.corp.autc.com> Message-ID: <48DA2226.70509@redhat.com> Ulf Zimmermann wrote: >> -----Original Message----- >> From: Eric Sandeen [mailto:sandeen at redhat.com] >> Sent: 09/23/2008 20:30 >> To: Ulf Zimmermann >> Cc: Theodore Tso; ext3-users at redhat.com >> Subject: Re: ext3 zerofree option and RedHat back port? >> >> Ulf Zimmermann wrote: >> >> >>> Reason I asked is this. We use currently 3Par S400 and E200 as SAN >>> arrays. The new T400 and T800 has a built in chip to do more >>> >> intelligent >> >>> thin provisioning but I believe even the S400 and E200 we have will >>> >> free >> >>> on the SAN level a block of a thin provisioned volume if it gets >>> >> zero'ed >> >>> out. Haven't gotten around yet to test it, but I am planning on. We >>> >> are >> >>> currently using 3 different file system types, one is a propriety >>> >> from >> >>> Onstor for their Bobcats (NFS/CIFS heads) where I believe I have >>> observed just freeing of SAN level blocks. The two other are EXT3 >>> > and > >>> OCFS2. >>> >> Ok, so you really want to zero the unused blocks in-place, and e2image >> writing out a new sparsified image isn't a ton of help. >> >> The tool does that, I guess - but only on an unmounted or RO-mounted >> filesystem, right? (plus I'd triple-check that it's doing things >> correctly, opening a block device and splatting zeros around, one >> > hopes > >> that it is!) >> >> But in any case the util itself is simple enough that building (or >> > even > >> packaging) for fedora/EPEL should be trivial. >> >> (FWIW, there is work upstream for filesystems to actually communicate >> freed blocks to the underlying storage, just for this purpose...) >> >> -Eric >> > > I am going to try it out by hand. Create a thin provisioned volume, > write random crap to it, then zero the blocks. 
See if that shrinks the > physical allocated space. > > Ulf. > > > Note that there is work on getting file systems to use the new TRIM (for S-ATA drives) and its equivalent proposed standard in T10 SCSI for arrays which will give you this automatically. David Woodhouse was pushing patches for TRIM, we are still thinking about the SCSI versions... ric From tytso at mit.edu Wed Sep 24 13:31:47 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 24 Sep 2008 09:31:47 -0400 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> Message-ID: <20080924133147.GD9929@mit.edu> On Wed, Sep 24, 2008 at 09:12:37AM +0100, Ron Yorston wrote: > "Ulf Zimmermann" wrote: > >Can anyone tell me if the zerofree option for ext3 has been back ported > >to RedHat EL4 or EL5? > > I used to maintain backports of zerofree (the kernel patch, not the > utility) to EL4 and EL5, but since I wasn't actually using them I gave > up. The last RPMs I have are from December of last year. Contact me > directly if you want them. > > I don't recommend the ext3 patch as it hasn't seen much use. I regularly > use the ext2 version (on Fedora 9), but be warned that Ted has expressed > concerns about it. I just searched my sent-mail archives for the last 5 years, and I can't find any references to "zerofree" previous to this mail thread. Maybe I commented about them under some other name. Having quickly looked at the ext3 patch here: http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00026.html ...the big thing I will note is that if you crash after a file is deleted, but before the journal transaction is committed, the file may end up being cleared but not deleted. This may or may not be problematic for your appication; in particular, if the file deletion was implied with the intent of doing an atomic replacement of some critical file, i.e. such as a vipw script which does: cp /etc/passwd /etc/passwd.vipw vi /etc/passwd.vipw # atomically update /etc/passwd mv /etc/passwd.vipw /etc/passwd ... and you crash before the transaction is commited but after the "mv" command has run, you could end up with a partially or completely zero'ed /etc/passwd file. Some might call that unfortunate. :-) I will admit that the chances of this happening are somewhat remote, but in terms of potential issues that would have to be fixed before such a patch could be included in mainline, or before (I suspect) Red Hat would feel comfortable taking responsibility for their customers' data after such a patch were committed, that would probably be a real issue. The code for supporting the "trim" command could also be used to implement a proper zero-free command, but it gets tricky, since the blocks in question would have to be remembered until the commit block is written out, and then only zero'ed (or trimmed) right after the commit has happened, but before the pinned block bitmaps are released (which would allow the block allocator to allocate to the blocks that had just been released). - Ted From rmy at tigress.co.uk Wed Sep 24 14:35:32 2008 From: rmy at tigress.co.uk (Ron Yorston) Date: Wed, 24 Sep 2008 15:35:32 +0100 Subject: ext3 zerofree option and RedHat back port? 
In-Reply-To: <20080924133147.GD9929@mit.edu> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> <20080924133147.GD9929@mit.edu> Message-ID: <200809241435.m8OEZY7Y010555@tiffany.internal.tigress.co.uk> Theodore Tso wrote: >I just searched my sent-mail archives for the last 5 years, and I >can't find any references to "zerofree" previous to this mail thread. >Maybe I commented about them under some other name. > >Having quickly looked at the ext3 patch here: > > http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00026.html Your response is in the same thread: http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00031.html Unless that was some other Theodore Tso. >..the big thing I will note is that if you crash after a file is >deleted, but before the journal transaction is committed, the file may >end up being cleared but not deleted. Indeed, that was the concern last time. The ext3 patch hasn't changed significantly since then because, truth be told, I don't entirely understand journalling and was unable to fix it up. The ext2 patch now writes out the zeroed blocks immediately, which may or may not help. The latest versions of the patches are available on my website: http://intgat.tigress.co.uk/rmy/uml/sparsify.html Ron From sandeen at redhat.com Wed Sep 24 15:04:56 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 24 Sep 2008 10:04:56 -0500 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> Message-ID: <48DA5718.60403@redhat.com> Ron Yorston wrote: > "Ulf Zimmermann" wrote: >> Can anyone tell me if the zerofree option for ext3 has been back ported >> to RedHat EL4 or EL5? > > I used to maintain backports of zerofree (the kernel patch, not the > utility) to EL4 and EL5, but since I wasn't actually using them I gave > up. The last RPMs I have are from December of last year. Contact me > directly if you want them. > > I don't recommend the ext3 patch as it hasn't seen much use. I regularly > use the ext2 version (on Fedora 9), but be warned that Ted has expressed > concerns about it. oh, whoops - I guess my google-fu is weak, I searched for zerofree and assumed we were talking about the userspace utility I found ... /me runs off to look at that patch... -Eric From tytso at mit.edu Wed Sep 24 15:19:20 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 24 Sep 2008 11:19:20 -0400 Subject: ext3 zerofree option and RedHat back port? In-Reply-To: <200809241435.m8OEZY7Y010555@tiffany.internal.tigress.co.uk> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> <20080924133147.GD9929@mit.edu> <200809241435.m8OEZY7Y010555@tiffany.internal.tigress.co.uk> Message-ID: <20080924151919.GG9929@mit.edu> On Wed, Sep 24, 2008 at 03:35:32PM +0100, Ron Yorston wrote: > Your response is in the same thread: > > http://osdir.com/ml/file-systems.ext3.user/2006-09/msg00031.html > > Unless that was some other Theodore Tso. Hmm, I must have sent that from a non-primary computer, so it wasn't in my sent-mail archive. My bad. :-) - Ted From ulf at openlane.com Wed Sep 24 16:23:56 2008 From: ulf at openlane.com (Ulf Zimmermann) Date: Wed, 24 Sep 2008 09:23:56 -0700 Subject: ext3 zerofree option and RedHat back port? 
In-Reply-To: <48DA5718.60403@redhat.com> References: <5DE4B7D3E79067418154C49A739C125104C4A664@msmpk01.corp.autc.com> <200809240812.m8O8CdIh010269@tiffany.internal.tigress.co.uk> <48DA5718.60403@redhat.com> Message-ID: <5DE4B7D3E79067418154C49A739C125104C4A671@msmpk01.corp.autc.com> > -----Original Message----- > From: Eric Sandeen [mailto:sandeen at redhat.com] > Sent: 09/24/2008 08:05 > To: Ron Yorston > Cc: Ulf Zimmermann; ext3-users at redhat.com > Subject: Re: ext3 zerofree option and RedHat back port? > > Ron Yorston wrote: > > "Ulf Zimmermann" wrote: > >> Can anyone tell me if the zerofree option for ext3 has been back > ported > >> to RedHat EL4 or EL5? > > > > I used to maintain backports of zerofree (the kernel patch, not the > > utility) to EL4 and EL5, but since I wasn't actually using them I > gave > > up. The last RPMs I have are from December of last year. Contact me > > directly if you want them. > > > > I don't recommend the ext3 patch as it hasn't seen much use. I > regularly > > use the ext2 version (on Fedora 9), but be warned that Ted has > expressed > > concerns about it. > > oh, whoops - I guess my google-fu is weak, I searched for zerofree and > assumed we were talking about the userspace utility I found ... > > /me runs off to look at that patch... > > -Eric Sorry, I meant the mount option for zero'ing blocks which are getting freed. Ulf. From lakshmipathi.g at gmail.com Mon Sep 29 07:43:44 2008 From: lakshmipathi.g at gmail.com (lakshmi pathi) Date: Mon, 29 Sep 2008 13:13:44 +0530 Subject: giis file undelete tool-new features Message-ID: Hi I have released giis4.4.It includes following features *Deleted files are recovered and restored into their original directories, if the path exists. *Dropped database tables are recovered. *Several Bug fixes. :) Limitation: If directory size greater than block_size,it giis won't work. -This will be fixed in next release. Homepage: www.giis.co.in Cheers, Lakshmipathi.G From worleys at gmail.com Mon Sep 29 15:24:33 2008 From: worleys at gmail.com (Chris Worley) Date: Mon, 29 Sep 2008 09:24:33 -0600 Subject: When is a block free? In-Reply-To: References: <48D01448.4050107@redhat.com> Message-ID: On Tue, Sep 16, 2008 at 3:32 PM, Chris Worley wrote: > For example, in balloc.c I'm seeing ext3_free_blocks_sb > calls ext3_clear_bit_atomic at the bottom... is that when the block is > freed? Are all blocks freed here? David Woodhouse, in an article at http://lwn.net/Articles/293658/, is implementing the T10/T13 committees "Trim" request in 2.6.28 kernels. Would it be appropriate to call "blkdev_issue_discard" at the bottom of ext3_free_blocks_sb where ext3_clear_bit_atomic is being called? Chris > > On Tue, Sep 16, 2008 at 3:03 PM, Chris Worley wrote: >> >> On Tue, Sep 16, 2008 at 2:17 PM, Ric Wheeler wrote: >>> >>> Chris Worley wrote: >>>> >>>> Where in the ext2/3 code does it know that a block on the disk is now >>>> free to reuse? >>>> >>>> Thanks, >>>> >>>> Chris >>> >>> Hi Chris, >>> >>> File systems track which blocks are free from the file system creation >>> time (mkfs), creation of new files and deletion. Ext2/3 is the gatekeeper >>> for all deletions, so it knows when file system blocks transition from the >>> used state to the free state. Ext file system use bitmaps to track the >>> blocks that are allocated or not. >> >> Where (in the code... what routine... or what's the name of the bitmap) is >> the "free" bit set? I've been looking through the code and don't see >> exactly where the block is marked as free. 
>> Thanks, >> Chris >>> >>> Regards, >>> >>> Ric >>> >> > > From tytso at mit.edu Mon Sep 29 16:39:17 2008 From: tytso at mit.edu (Theodore Tso) Date: Mon, 29 Sep 2008 12:39:17 -0400 Subject: When is a block free? In-Reply-To: References: <48D01448.4050107@redhat.com> Message-ID: <20080929163917.GB10831@mit.edu> On Mon, Sep 29, 2008 at 09:24:33AM -0600, Chris Worley wrote: > On Tue, Sep 16, 2008 at 3:32 PM, Chris Worley wrote: > > For example, in balloc.c I'm seeing ext3_free_blocks_sb > > calls ext3_clear_bit_atomic at the bottom... is that when the block is > > freed? Are all blocks freed here? > > David Woodhouse, in an article at http://lwn.net/Articles/293658/, is > implementing the T10/T13 committees "Trim" request in 2.6.28 kernels. > > Would it be appropriate to call "blkdev_issue_discard" at the bottom > of ext3_free_blocks_sb where ext3_clear_bit_atomic is being called? Unfortunately, it's not as simple as that. The problem is that as soon as you call trim, the drive is allowed to discard the contents of that block so that future attempts to read from that block returns all zeros. Therefore we can't call Trim until after the transaction has committed. That means we have to keep a linked list of block extents that are to be trimmed attached to the commit object, and only send the trim requests once the commit block has been written to disk. It's on the ext4 developer's TODO list to add Trim support to ext3 and ext4. - Ted From whats at wekk.net Wed Sep 24 21:10:17 2008 From: whats at wekk.net (Albert =?ISO-8859-1?Q?Sellar=E8s?=) Date: Wed, 24 Sep 2008 23:10:17 +0200 Subject: init_special_inode: bogus i_mode Message-ID: <1222290617.6307.33.camel@x61s> Hi everyone, I have a server running Redhat 5 that have attached a SAN of 5TB. The SAN filesystem is formated with ext3. One month ago, the kernel was started to send this error messages: init_special_inode: bogus i_mode (56333) init_special_inode: bogus i_mode (111367) init_special_inode: bogus i_mode (114022) init_special_inode: bogus i_mode (34016) init_special_inode: bogus i_mode (7170) init_special_inode: bogus i_mode (117576) init_special_inode: bogus i_mode (74600) init_special_inode: bogus i_mode (111237) init_special_inode: bogus i_mode (151624) init_special_inode: bogus i_mode (132565) init_special_inode: bogus i_mode (175003) init_special_inode: bogus i_mode (54343) init_special_inode: bogus i_mode (161626) init_special_inode: bogus i_mode (114644) init_special_inode: bogus i_mode (53215) init_special_inode: bogus i_mode (54563) init_special_inode: bogus i_mode (110115) init_special_inode: bogus i_mode (160572) init_special_inode: bogus i_mode (35607) init_special_inode: bogus i_mode (156516) init_special_inode: bogus i_mode (50005) init_special_inode: bogus i_mode (5362) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) init_special_inode: bogus i_mode (136237) Every day, kernel prints out one or two lines like these. I haven't found nothing in the list archives, and searching in google I found that it could be that the filesystem is corrupted. On my last step to know what has happening in the filesystem, I have look the kernel source code. Now I really think that it is an error, but I'm not sure what it means. Can anybody tell me what exactly means this message? Thanks you very much. 
--
Albert Sellarès GPG id: 0x13053FFE
http://www.wekk.net
whats_up at jabber.org
Linux User: 324456
Catalunya
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL:

From sandeen at redhat.com Tue Sep 30 14:53:58 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 30 Sep 2008 09:53:58 -0500
Subject: init_special_inode: bogus i_mode
In-Reply-To: <1222290617.6307.33.camel@x61s>
References: <1222290617.6307.33.camel@x61s>
Message-ID: <48E23D86.9070300@redhat.com>

Albert Sellarès wrote:
> Hi everyone,
>
> I have a server running Redhat 5 that have attached a SAN of 5TB. The
> SAN filesystem is formated with ext3.

I suppose you mean RHEL5?

> One month ago, the kernel was started to send this error messages:
>
> init_special_inode: bogus i_mode (56333)
> init_special_inode: bogus i_mode (136237)
>
> Every day, kernel prints out one or two lines like these.
>
> I haven't found nothing in the list archives, and searching in google I
> found that it could be that the filesystem is corrupted.
>
> On my last step to know what has happening in the filesystem, I have
> look the kernel source code. Now I really think that it is an error, but
> I'm not sure what it means.
>
> Can anybody tell me what exactly means this message?

For an inode which is not recognized as a regular file, directory, or
link when it is read, init_special_inode is called. At that point, if
it's not a char, block, fifo, or socket, you get this error. Basically
it doesn't know what this thing is.

It'd be nice if it printed the inode number as well, to make it easier
to find.

I'd probably suggest fsck at this point, run it with -n first if you
want to see what it *would* do, just to be safe.

-Eric
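Concretely, something like this; the device name is only an example, so
substitute whichever SAN LUN holds the filesystem:

  e2fsck -fn /dev/sdb1   # forced check, read-only: reports problems, changes nothing
  umount /san            # if the report looks sane, take it offline...
  e2fsck -fy /dev/sdb1   # ...and repair for real

The -n pass can be run while the filesystem is mounted (expect a few
spurious complaints, since it is changing underneath the check), but the
actual repair should only be done unmounted.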