From andre.nitschke at versanet.de  Mon Mar  3 06:32:44 2008
From: andre.nitschke at versanet.de (Andre Nitschke)
Date: Mon,  3 Mar 2008 07:32:44 +0100
Subject: h-tree for ext2
Message-ID: <20080303073244.xbd2fs8b7o0800w8@webmail.versatel.de>

Hello,
with tune2fs -O dir_index i activate the h-tree function for ext3 to improve
performance. now i am not interested in the journaling function, but the
journal makes the system a little bit slower. is it possible to use ext2 (also
ext3 - journal) with a h-tree index to improve the speed?
or must the filesystem be ext3 for these feature?
greetings

Andre


From adilger at sun.com  Mon Mar  3 15:52:04 2008
From: adilger at sun.com (Andreas Dilger)
Date: Mon, 03 Mar 2008 08:52:04 -0700
Subject: h-tree for ext2
In-Reply-To: <20080303073244.xbd2fs8b7o0800w8@webmail.versatel.de>
References: <20080303073244.xbd2fs8b7o0800w8@webmail.versatel.de>
Message-ID: <20080303155204.GA3616@webber.adilger.int>

On Mar 03, 2008  07:32 +0100, Andre Nitschke wrote:
> with tune2fs -O dir_index i activate the h-tree function for ext3 to improve
> performance. now i am not interested in the journaling function, but the
> journal makes the system a little bit slower. is it possible to use ext2 (also
> ext3 - journal) with a h-tree index to improve the speed?
> or must the filesystem be ext3 for these feature?

The filesystem must be ext3, because the dir_index (htree) feature was not
ported to ext2.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From articpenguin3800 at gmail.com  Tue Mar  4 23:12:08 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Tue, 04 Mar 2008 18:12:08 -0500
Subject: Filefrag
Message-ID: <47CDD748.6070002@gmail.com>

hi
I have a virtualbox image of ubuntu hardy. I did filefrag and i got this


hardy.vdi: 73 extents found, perfection would be 69 extents


Why does it say perfection would be 69 extents. Shouldnt it be 1 extent?


From ling at fnal.gov  Tue Mar  4 23:18:44 2008
From: ling at fnal.gov (Ling C. Ho)
Date: Tue, 04 Mar 2008 17:18:44 -0600
Subject: Filefrag
In-Reply-To: <47CDD748.6070002@gmail.com>
References: <47CDD748.6070002@gmail.com>
Message-ID: <47CDD8D4.6070409@fnal.gov>

If your blocksize is 4k, there are 32k blocks in a group, and therefore 
about 128MB per group. So, your file size must be slightly less than 69 
* 128MB, correct?

...
ling

John Nelson wrote:
> hi
> I have a virtualbox image of ubuntu hardy. I did filefrag and i got this
>
>
> hardy.vdi: 73 extents found, perfection would be 69 extents
>
>
> Why does it say perfection would be 69 extents. Shouldnt it be 1 extent?
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users


From sandeen at redhat.com  Wed Mar  5 02:41:37 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Tue, 04 Mar 2008 20:41:37 -0600
Subject: Filefrag
In-Reply-To: <47CDD748.6070002@gmail.com>
References: <47CDD748.6070002@gmail.com>
Message-ID: <47CE0861.7030809@redhat.com>

John Nelson wrote:
> hi
> I have a virtualbox image of ubuntu hardy. I did filefrag and i got this
> 
> 
> hardy.vdi: 73 extents found, perfection would be 69 extents
> 
> 
> Why does it say perfection would be 69 extents. Shouldnt it be 1 extent?

Not if it's sparse.  As your fs image almost certainly is.

-Eric


From articpenguin3800 at gmail.com  Thu Mar  6 00:54:57 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Wed, 05 Mar 2008 19:54:57 -0500
Subject: Journal questions
Message-ID: <47CF40E1.9000303@gmail.com>

hi  i have a couple questions about the journal in ext3.

1. Will there be performance lose with a smaller journal say 32MB 
instead of 128MB?

2. Is there a way to see free space left in the journal or is it cleared 
at each mount?

3. Is journal_data_ordered atomic like reiser4 where either a 
transaction will happen or it wont happen?


From adilger at sun.com  Thu Mar  6 06:03:11 2008
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 05 Mar 2008 23:03:11 -0700
Subject: Journal questions
In-Reply-To: <47CF40E1.9000303@gmail.com>
References: <47CF40E1.9000303@gmail.com>
Message-ID: <20080306060311.GM3616@webber.adilger.int>

On Mar 05, 2008  19:54 -0500, John Nelson wrote:
> 1. Will there be performance lose with a smaller journal say 32MB instead 
> of 128MB?

Depends on how high an IO/metadata rate you have.  If you are just doing
light desktop IO it won't make any difference.

> 2. Is there a way to see free space left in the journal or is it cleared at 
> each mount?

The journal is a circular buffer, so this is hard to determine exactly.

> 3. Is journal_data_ordered atomic like reiser4 where either a transaction 
> will happen or it wont happen?

I'm not sure what you mean - there is data=journal and data=ordered
mode.  data=journal means all data and metadata changes are atomic.
data=ordered (the default) means that data is written to disk before
metadata so if there is a crash that you don't get garbage in your
files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From mrunal.gawade at gmail.com  Sun Mar  9 19:55:39 2008
From: mrunal.gawade at gmail.com (Mrunal Gawade)
Date: Sun, 9 Mar 2008 11:55:39 -0800
Subject: Disk hash table
Message-ID: <c62583fb0803091255p21a0b6e3i64aac546652e0c0c@mail.gmail.com>

Hi,

I need information on ext3 representation of disk based and memory hash
tables. I browsed through the code but could not understand much. Could you
point me in the right direction. If not ext3 hash table then any disk based
hash table implementation example.


Thank you,
Mrunal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080309/7a88bd82/attachment.htm>

From val.henson at gmail.com  Sun Mar  9 22:50:28 2008
From: val.henson at gmail.com (Valerie Henson)
Date: Sun, 9 Mar 2008 15:50:28 -0700
Subject: Avoid Fragmentation of ext3
In-Reply-To: <20080228142221.o6xo2cl0bkg80g0k@webmail.versatel.de>
References: <20080228142221.o6xo2cl0bkg80g0k@webmail.versatel.de>
Message-ID: <70b6f0bf0803091550v54e282d1kbc8212f8fec8917c@mail.gmail.com>

On Thu, Feb 28, 2008 at 6:22 AM, Andre Nitschke
<andre.nitschke at versanet.de> wrote:
> Hello,
>  i just want to know, how ext3 avoids fragmentation. Well, i think it works like
>  this (but i dont know...):
>  When the OS says to the filesystem, save the file, the file system looks, where
>  are free sectors laying together to use. when there is enough place the
>  filesystem try's to write the file without fragments. is there not enough
>  place, the fs wrote the file in the way, to create less fragemnts. some file
>  systems keep space after the file, for when the file grows. i dont know, works
>  ext3 in this way?
>  maybe somebody can explain it shortly.

Yes, that's the basic theory.  Various file systems execute it more or
less successfully.  I'd say ext3 is about average and XFS is quite
good at it.

There was a paper comparing file system fragmentation at OLS a few
years ago.  "The Effects of File System Fragmentation" by Ard
Biesheuvel, et. al.:

http://www.kernel.org/doc/ols/2006/ols2006v1-pages-193-208.pdf

-VAL


From carlo at alinoe.com  Tue Mar 11 19:23:15 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Tue, 11 Mar 2008 20:23:15 +0100
Subject: New undelete tool for ext3
Message-ID: <20080311192315.GA27329@alinoe.com>

Hi all,

I developed a tool to undelete files and directories.
I did this after I accidently deleted 3 GB of my home
directory.

I have been able to successfully recover all 50,000 files.

Note that this works WITHOUT prior installed patches or
changes (like giis).

I have sent a mail to Juri Haberland, asking him to
change the FAQ entry that claims that it is impossible
to undelete files on ext3, but he has not replied to
my mail at all. I still hope that the FAQ can be changed
to point to the HOWTO that I have just written:

http://www.xs4all.nl/~carlo17/howto/undelete_ext3.html

-- 
Carlo Wood <carlo at alinoe.com>


From mike at doubleplum.net  Tue Mar 11 19:30:12 2008
From: mike at doubleplum.net (Michael Biggs)
Date: Tue, 11 Mar 2008 15:30:12 -0400 (EDT)
Subject: New undelete tool for ext3
In-Reply-To: <20080311192315.GA27329@alinoe.com>
References: <20080311192315.GA27329@alinoe.com>
Message-ID: <Pine.LNX.4.64.0803111526450.2688@theorem.ca>

On Tue, 11 Mar 2008, Carlo Wood wrote:
> I developed a tool to undelete files and directories.
> I did this after I accidently deleted 3 GB of my home
> directory.

Sounds good to me.

> http://www.xs4all.nl/~carlo17/howto/undelete_ext3.html

Why isn't the source available?  What license is it under / do you plan to
release it under, and when?
Just wondering.

__
Michael Biggs


From carlo at alinoe.com  Tue Mar 11 20:16:06 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Tue, 11 Mar 2008 21:16:06 +0100
Subject: New undelete tool for ext3
In-Reply-To: <Pine.LNX.4.64.0803111526450.2688@theorem.ca>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
Message-ID: <20080311201606.GA30848@alinoe.com>

On Tue, Mar 11, 2008 at 03:30:12PM -0400, Michael Biggs wrote:
> Why isn't the source available?  What license is it under / do you plan to
> release it under, and when?
> Just wondering.

I'll release it under the GPL version 3.

I still need to make a package of it, ... I'm not REALLY in a hurry
to that though, because even though I'm not asking money for this tool,
I'd very much like it to hear from people who use it, what their
experiences are - and hopefully hear from them about success.

I've written quite some howto's and usually I never get ANY mail
about them. That kinda sucks. People should realize that I'm
doing this as a volunteer and that it costs me considerable
amount of time. A 'thank you' would be nice every now and then ;)

-- 
Carlo Wood <carlo at alinoe.com>


From Mike.Miller at hp.com  Tue Mar 11 20:17:54 2008
From: Mike.Miller at hp.com (Miller, Mike (OS Dev))
Date: Tue, 11 Mar 2008 20:17:54 +0000
Subject: New undelete tool for ext3
In-Reply-To: <20080311201606.GA30848@alinoe.com>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
Message-ID: <DEE40752C8AF224BB890E455C7B8E0FE042B985EE3@G6W0268.americas.hpqcorp.net>


> -----Original Message-----
> From: ext3-users-bounces at redhat.com
> [mailto:ext3-users-bounces at redhat.com] On Behalf Of Carlo Wood
> Sent: Tuesday, March 11, 2008 3:16 PM
> To: Michael Biggs
> Cc: ext3-users at redhat.com
> Subject: Re: New undelete tool for ext3
>
> On Tue, Mar 11, 2008 at 03:30:12PM -0400, Michael Biggs wrote:
> > Why isn't the source available?  What license is it under / do you
> > plan to release it under, and when?
> > Just wondering.
>
> I'll release it under the GPL version 3.
>
> I still need to make a package of it, ... I'm not REALLY in a
> hurry to that though, because even though I'm not asking
> money for this tool, I'd very much like it to hear from
> people who use it, what their experiences are - and hopefully
> hear from them about success.
>
> I've written quite some howto's and usually I never get ANY
> mail about them. That kinda sucks. People should realize that
> I'm doing this as a volunteer and that it costs me
> considerable amount of time. A 'thank you' would be nice
> every now and then ;)
>


Thank you, Carlo.


> --
> Carlo Wood <carlo at alinoe.com>
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>


From tytso at mit.edu  Tue Mar 11 20:24:23 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 11 Mar 2008 16:24:23 -0400
Subject: New undelete tool for ext3
In-Reply-To: <20080311192315.GA27329@alinoe.com>
References: <20080311192315.GA27329@alinoe.com>
Message-ID: <20080311202423.GL15804@mit.edu>

On Tue, Mar 11, 2008 at 08:23:15PM +0100, Carlo Wood wrote:
> 
> I developed a tool to undelete files and directories.
> I did this after I accidently deleted 3 GB of my home
> directory.
> 
> I have been able to successfully recover all 50,000 files.
> 
> Note that this works WITHOUT prior installed patches or
> changes (like giis).
> 
> I have sent a mail to Juri Haberland, asking him to
> change the FAQ entry that claims that it is impossible
> to undelete files on ext3, but he has not replied to
> my mail at all. I still hope that the FAQ can be changed
> to point to the HOWTO that I have just written:
> 
> http://www.xs4all.nl/~carlo17/howto/undelete_ext3.html

That's a clever technique.  It only works so long as the journal
blocks haven't been reused, so you would need to use your tool very
*quickly* after the files had been deleted.  If the inode table block
hadn't been modified before the deletion, it might also not appear in
the journal, so it's also not guaranteed to work.  But certainly it's
a better shot than no chance at all.....

					- Ted


From keld at dkuug.dk  Tue Mar 11 20:42:42 2008
From: keld at dkuug.dk (Keld =?iso-8859-1?Q?J=F8rn?= Simonsen)
Date: Tue, 11 Mar 2008 21:42:42 +0100
Subject: New undelete tool for ext3
In-Reply-To: <20080311201606.GA30848@alinoe.com>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
Message-ID: <20080311204242.GB4312@rap.rap.dk>

On Tue, Mar 11, 2008 at 09:16:06PM +0100, Carlo Wood wrote:
> On Tue, Mar 11, 2008 at 03:30:12PM -0400, Michael Biggs wrote:
> > Why isn't the source available?  What license is it under / do you plan to
> > release it under, and when?
> > Just wondering.
> 
> I'll release it under the GPL version 3.
> 
> I still need to make a package of it, ... I'm not REALLY in a hurry
> to that though, because even though I'm not asking money for this tool,
> I'd very much like it to hear from people who use it, what their
> experiences are - and hopefully hear from them about success.
> 
> I've written quite some howto's and usually I never get ANY mail
> about them. That kinda sucks. People should realize that I'm
> doing this as a volunteer and that it costs me considerable
> amount of time. A 'thank you' would be nice every now and then ;)

I just want to point out that I have also made a tool for undeleting
files on ext2/3, but it works in another way.

Available at: http://std.dkuug.dk/keld/readme-salvage.html

best regards
keld


From tpo2 at sourcepole.ch  Tue Mar 11 22:56:07 2008
From: tpo2 at sourcepole.ch (Tomas Pospisek's Mailing Lists)
Date: Tue, 11 Mar 2008 23:56:07 +0100 (CET)
Subject: New undelete tool for ext3
In-Reply-To: <20080311201606.GA30848@alinoe.com>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
Message-ID: <Pine.LNX.4.64.0803112335420.29712@localhost>

On Tue, 11 Mar 2008, Carlo Wood wrote:

> On Tue, Mar 11, 2008 at 03:30:12PM -0400, Michael Biggs wrote:
>> Why isn't the source available?  What license is it under / do you plan to
>> release it under, and when?
>> Just wondering.
>
> I'll release it under the GPL version 3.
>
> I still need to make a package of it, ... I'm not REALLY in a hurry
> to that though, because even though I'm not asking money for this tool,
> I'd very much like it to hear from people who use it, what their
> experiences are - and hopefully hear from them about success.
>
> I've written quite some howto's and usually I never get ANY mail
> about them. That kinda sucks. People should realize that I'm
> doing this as a volunteer and that it costs me considerable
> amount of time. A 'thank you' would be nice every now and then ;)

I understand your frustration. The fact that people use your stuff but 
won't come around to say thank you will probably not change. It's 
however possible to shift your perspective and that has the potential to 
reduce the frustration.

Consider that with your work you will give other people the impetus 
or the energy or the excitement, that will make them or let them do 
their little contribution to the common wealth of open source.

Someone will google desperately, discover your tool, rescue her FS, be 
thrilled, and contribute back their enthusiasm by writing a HOWTO about 
her favoured tool - say Gimp f.ex. Some other person much later on will go 
off and write a filesystem instead - say ext3. You and me - maybe 
everybody in the OSS world are really standing on the shoulders of 
*giants* and we are only able to do what we want/need because there were 
so many others that added their little or big piece to the base we're 
using - and hey, have we gone out and thanked all those people?

   $ dpkg --get-selections|wc -l
      2840	# roughly the number of installed packages on my sys

Anyway, I was thrilled by your detailed description of the on-disk ext3 
structures. Brilliant! Two months ago this would have been *exaclty* what 
I had needed, I'm sure somebody will be *very happy* to find all this 
information, nicely tended and comprehensible in one place. Thanks, very 
cool!
*t

--
-----------------------------------------------------------
   Tomas Pospisek
   http://sourcepole.com -  Linux & Open Source Solutions
-----------------------------------------------------------


From tytso at mit.edu  Tue Mar 11 23:37:57 2008
From: tytso at mit.edu (Theodore Tso)
Date: Tue, 11 Mar 2008 19:37:57 -0400
Subject: New undelete tool for ext3
In-Reply-To: <Pine.LNX.4.64.0803112335420.29712@localhost>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
	<Pine.LNX.4.64.0803112335420.29712@localhost>
Message-ID: <20080311233757.GQ15804@mit.edu>

On Tue, Mar 11, 2008 at 11:56:07PM +0100, Tomas Pospisek's Mailing Lists wrote:
> I understand your frustration. The fact that people use your stuff but 
> won't come around to say thank you will probably not change. It's however 
> possible to shift your perspective and that has the potential to reduce the 
> frustration.

Carlo,

Absolutely.  I'm not sure how many people use e2fsprogs, but granted
it's "somewhat large", yes?  (As in every single Linux computer out
there.  :-)  I get a thank you note about maybe once a year.  I think
*once* a grateful user sent me a paypal payment of $20, but that was
extremely rare in the over 12 or so years that e2fsprogs has been in 
existence, and I was completely (but pleasantly) surprised when it 
happened.

If you're in this business to get thank you notes, or virtual beers,
you're in the wrong business.  :-)

Now, what you *could* get out of it if you are willing to write a
paper and submit it to OLS, or Linux.conf.au, or some other
conference, might be an invitation to tell others about your cool
tool.  And if you are one of those who get satisfaction at download
statistics, that can be good too.

BTW, if you are willing to relicense your code to GPLv2, I would be
interested in reworking bits of your tool into e2fsprogs's debugfs.
Or if you'd like to keep it as a standalone tool, that's cool as well.

Regards,

							- Ted


From bruno at wolff.to  Wed Mar 12 03:03:26 2008
From: bruno at wolff.to (Bruno Wolff III)
Date: Tue, 11 Mar 2008 22:03:26 -0500
Subject: New undelete tool for ext3
In-Reply-To: <Pine.LNX.4.64.0803112335420.29712@localhost>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
	<Pine.LNX.4.64.0803112335420.29712@localhost>
Message-ID: <20080312030326.GD15182@wolff.to>

On Tue, Mar 11, 2008 at 23:56:07 +0100,
  Tomas Pospisek's Mailing Lists <tpo2 at sourcepole.ch> wrote:
> 
> I understand your frustration. The fact that people use your stuff but 
> won't come around to say thank you will probably not change. It's 
> however possible to shift your perspective and that has the potential to 
> reduce the frustration.

If your software becomes popular enough you may not really want everyone
who uses it personally thanking you.


From lm at bitmover.com  Wed Mar 12 03:24:52 2008
From: lm at bitmover.com (Larry McVoy)
Date: Tue, 11 Mar 2008 20:24:52 -0700
Subject: New undelete tool for ext3
In-Reply-To: <20080312030326.GD15182@wolff.to>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
	<Pine.LNX.4.64.0803112335420.29712@localhost>
	<20080312030326.GD15182@wolff.to>
Message-ID: <20080312032452.GA823@bitmover.com>

On Tue, Mar 11, 2008 at 10:03:26PM -0500, Bruno Wolff III wrote:
> On Tue, Mar 11, 2008 at 23:56:07 +0100,
>   Tomas Pospisek's Mailing Lists <tpo2 at sourcepole.ch> wrote:
> > 
> > I understand your frustration. The fact that people use your stuff but 
> > won't come around to say thank you will probably not change. It's 
> > however possible to shift your perspective and that has the potential to 
> > reduce the frustration.
> 
> If your software becomes popular enough you may not really want everyone
> who uses it personally thanking you.

Actually, I think you should let the author decide that.
-- 
---
Larry McVoy                lm at bitmover.com           http://www.bitkeeper.com


From lm at bitmover.com  Wed Mar 12 03:27:03 2008
From: lm at bitmover.com (Larry McVoy)
Date: Tue, 11 Mar 2008 20:27:03 -0700
Subject: New undelete tool for ext3
In-Reply-To: <20080311233757.GQ15804@mit.edu>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
	<Pine.LNX.4.64.0803112335420.29712@localhost>
	<20080311233757.GQ15804@mit.edu>
Message-ID: <20080312032703.GA1621@bitmover.com>

On Tue, Mar 11, 2008 at 07:37:57PM -0400, Theodore Tso wrote:
> On Tue, Mar 11, 2008 at 11:56:07PM +0100, Tomas Pospisek's Mailing Lists wrote:
> > I understand your frustration. The fact that people use your stuff but 
> > won't come around to say thank you will probably not change. It's however 
> > possible to shift your perspective and that has the potential to reduce the 
> > frustration.
> 
> Carlo,
> 
> If you're in this business to get thank you notes, or virtual beers,
> you're in the wrong business.  :-)

Amen to that.

> BTW, if you are willing to relicense your code to GPLv2

I'd be another one who would thank you for doing that.  GPLv2 is free.  v3 
has an agenda which is not that in line with freeness.  IMO.
-- 
---
Larry McVoy                lm at bitmover.com           http://www.bitkeeper.com


From carlo at alinoe.com  Wed Mar 12 05:46:21 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Wed, 12 Mar 2008 06:46:21 +0100
Subject: New undelete tool for ext3
In-Reply-To: <DEE40752C8AF224BB890E455C7B8E0FE042B985EE3@G6W0268.americas.hpqcorp.net>
References: <20080311192315.GA27329@alinoe.com>
	<Pine.LNX.4.64.0803111526450.2688@theorem.ca>
	<20080311201606.GA30848@alinoe.com>
	<DEE40752C8AF224BB890E455C7B8E0FE042B985EE3@G6W0268.americas.hpqcorp.net>
Message-ID: <20080312054621.GA30974@alinoe.com>

On Tue, Mar 11, 2008 at 08:17:54PM +0000, Miller, Mike (OS Dev) wrote:
> Thank you, Carlo.

Thanks Mike ;).

This is a reply also to the others who replied to this thread
(I read all of them that were sent after this one, too);
there is definitely no need to thank me unless you actually used
my software and/or HOWTO and found it helpful, heheh.

Also, reading all those replies I feel that I have to straighten
something out: I am NOT mad or even demotivated because I get so
little "thank you" mails. And I'm certainly not in this business
to get pats on my shoulder.

I believe that my philosophy matches the comment of Tomas. Let
me elaborate:

1) I'm no different than most humans in that I would like, that
   when I die I can look back at my life and say: I made a difference.
   Most people (the masses) think that getting a child is the
   answer to that, they hope that their son or daughter won't make
   the same mistakes; even do something great-- so that their own
   life won't have been meaningless. Note that I do not have
   offspring and never will have. I'll have to get my self-esteem
   from my work.

2) I believe in the theory that the universe, and the existance
   of life might be a chance of 1 in <a very large number>, but
   we're here to think about that anyway because we are thinking about it.
   It DOES make it rather important to me to get the most out of this
   evolution though, and it seems unacceptable to me that humanity will
   become extinct before we explored every corner of the universe.
   The current situation is a VERY critical stage: the energy we have,
   the minerals and raw material we need to run our current civilisation
   of technology is a once-in-an-evolution chance. If we can't break free
   from this planet THIS time, before the next World War, or before we
   run out of minerals/materials and energy (no doubt leading to a world
   war anyway), it will be too late. In fact, I think we're not going
   to make it, UNLESS we can bootstrap artificial intelligence, soon.

3) I believe that next step in evolution (if at all, thus) is what is
   called the 'Singularity' (you already know what that is: A.I's making
   A.I's, giving rise to an exponential growth of technological advancement).
   Whether or not humanity survives that doesn't even really interest
   me. As long as something we created will expore the universe, then
   it wasn't all for nothing (and who knows, in the end a civiliation
   of A.I. that grows exponentially smarter towards the end of time
   might become what now we call God; ascent to a high level of
   existance and recreate the Big Bang in such a way that we exist(ed)
   at all (in which case I wouldn't have to worry that this will happen,
   but ok).

4) Contrary to most believers in the Singularity, I don't believe it
   will happen during my time. However, I DO embrace the idea of
   exponential growth: The *ONLY* work that is really significant is
   work that *amplifies* the development. If I can put in a factor
   of 1.000001, then that might JUST be enough to get us there, because
   1.000001 to the power N will be JUST large enough to prevent the
   extinction of mankind before we can leave this planet.
   Thus: _productivity_ increasing software has my interest (as opposed
   to, say 3D game engines). Software that leads to FASTER development
   of the next generation of development software.

Well, ... to make a long story short, as you see, I'm not driven by
"thank you" mails, but by, well, "something else" ;)

Regards,

-- 
Carlo Wood <carlo at alinoe.com>


From jprats at cesca.es  Wed Mar 12 07:56:44 2008
From: jprats at cesca.es (Jordi Prats)
Date: Wed, 12 Mar 2008 08:56:44 +0100
Subject: error reading block
Message-ID: <47D78CBC.9040901@cesca.es>

Hi all,
I'm getting this error using fsck on my fs:

Error reading block 35979726 (Attempt to read block from filesystem 
resulted in short read) while getting next inode from scan.  Ignore error?

Anyone can explain me what exactly does it mean?

cheers!
Jordi


From sandeen at redhat.com  Wed Mar 12 12:35:24 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 12 Mar 2008 07:35:24 -0500
Subject: error reading block
In-Reply-To: <47D78CBC.9040901@cesca.es>
References: <47D78CBC.9040901@cesca.es>
Message-ID: <47D7CE0C.1030006@redhat.com>

Jordi Prats wrote:
> Hi all,
> I'm getting this error using fsck on my fs:
> 
> Error reading block 35979726 (Attempt to read block from filesystem 
> resulted in short read) while getting next inode from scan.  Ignore error?
> 
> Anyone can explain me what exactly does it mean?

It means it could not read block 35979726 ...


Is your disk healthy?  Were there any IO errors from the kernel?  Is
your filesystem reall (35979726 * blocksize) bytes long?

-Eric


From articpenguin3800 at gmail.com  Thu Mar 13 02:35:57 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Wed, 12 Mar 2008 22:35:57 -0400
Subject: indirect blocks
Message-ID: <47D8930D.6010606@gmail.com>

what are indirects blocks? LIke double indirect triple indirect?


From davids at webmaster.com  Thu Mar 13 02:44:28 2008
From: davids at webmaster.com (David Schwartz)
Date: Wed, 12 Mar 2008 19:44:28 -0700
Subject: indirect blocks
In-Reply-To: <47D8930D.6010606@gmail.com>
Message-ID: <MDEHLPKNGKAHNMBLJOLKCEIFLBAC.davids@webmaster.com>


> what are indirects blocks? LIke double indirect triple indirect?

Indirect blocks are blocks that point to (contain the address of) other
blocks (which hold data). Double indirect blocks point to indirect blocks.
Triple indirect blocks point to double indirect blocks.

DS


From sandeen at redhat.com  Thu Mar 13 02:49:08 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Wed, 12 Mar 2008 21:49:08 -0500
Subject: indirect blocks
In-Reply-To: <MDEHLPKNGKAHNMBLJOLKCEIFLBAC.davids@webmaster.com>
References: <MDEHLPKNGKAHNMBLJOLKCEIFLBAC.davids@webmaster.com>
Message-ID: <47D89624.1000002@redhat.com>

David Schwartz wrote:
>> what are indirects blocks? LIke double indirect triple indirect?
> 
> Indirect blocks are blocks that point to (contain the address of) other
> blocks (which hold data). Double indirect blocks point to indirect blocks.
> Triple indirect blocks point to double indirect blocks.

And there's a nice picture at
http://web.mit.edu/tytso/www/linux/ext2intro.html

-Eric


From ianbrn at gmail.com  Thu Mar 13 13:54:08 2008
From: ianbrn at gmail.com (Ian Brown)
Date: Thu, 13 Mar 2008 15:54:08 +0200
Subject: The maximum number of files under a folder
Message-ID: <d0383f90803130654h4ae0454x1913b10806bb4928@mail.gmail.com>

Hello,
In an ext3-based file system,  what is the maximum number of files I
can create under a folder ?

Is it configurable somehow ?

Regards,
Ian


From articpenguin3800 at gmail.com  Thu Mar 13 16:48:50 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Thu, 13 Mar 2008 12:48:50 -0400
Subject: The maximum number of files under a folder
Message-ID: <47D95AF2.6030301@gmail.com>

i think not more than 5k files without dir_index on. The max limit of 
subfolders is 32k


From tytso at mit.edu  Thu Mar 13 17:23:18 2008
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 13 Mar 2008 13:23:18 -0400
Subject: The maximum number of files under a folder
In-Reply-To: <47D95AF2.6030301@gmail.com>
References: <47D95AF2.6030301@gmail.com>
Message-ID: <20080313172318.GB31653@mit.edu>

On Thu, Mar 13, 2008 at 12:48:50PM -0400, John Nelson wrote:
> i think not more than 5k files without dir_index on. The max limit of 
> subfolders is 32k

There is no limit to the number of files in a folder, except for the
fact that the directory itself can't be bigger than 2GB, and the
number of inodes that the entire filesystem has available to it.  Of
course, if you don't have directory indexing turned on, you may not
like the performance of doing directory lookups, but that's a
different story.

						- Ted


From articpenguin3800 at gmail.com  Thu Mar 13 17:57:18 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Thu, 13 Mar 2008 13:57:18 -0400
Subject: The maximum number of files under a folder
In-Reply-To: <20080313172318.GB31653@mit.edu>
References: <47D95AF2.6030301@gmail.com> <20080313172318.GB31653@mit.edu>
Message-ID: <47D96AFE.4020701@gmail.com>

is an h-tree the same thing as a b+ tree?


From adilger at sun.com  Thu Mar 13 18:26:31 2008
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 13 Mar 2008 11:26:31 -0700
Subject: The maximum number of files under a folder
In-Reply-To: <20080313172318.GB31653@mit.edu>
References: <47D95AF2.6030301@gmail.com> <20080313172318.GB31653@mit.edu>
Message-ID: <20080313182631.GE3217@webber.adilger.int>

On Mar 13, 2008  13:23 -0400, Theodore Ts'o wrote:
> There is no limit to the number of files in a folder, except for the
> fact that the directory itself can't be bigger than 2GB, and the
> number of inodes that the entire filesystem has available to it.  Of
> course, if you don't have directory indexing turned on, you may not
> like the performance of doing directory lookups, but that's a
> different story.

There is also a limit in the current ext3 htree code to be only 2 levels
deep.  Along with the 2GB limit you hit problems around 15M files,
depending on the length of the filenames.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From carlo at alinoe.com  Sat Mar 15 02:20:35 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Sat, 15 Mar 2008 03:20:35 +0100
Subject: Kernel header vs libext2fs headers
Message-ID: <20080315022035.GA28894@alinoe.com>

The kernel headers define EXT3_ORPHAN_FS
while libext2fs header defines EXT4_ORPHAN_FS

This means that one of the two is wrong.
Does ext3 use/have EXT3_ORPHAN_FS, or that
something that is new in ext4?

-- 
Carlo Wood <carlo at alinoe.com>


From adilger at sun.com  Sat Mar 15 03:26:37 2008
From: adilger at sun.com (Andreas Dilger)
Date: Sat, 15 Mar 2008 11:26:37 +0800
Subject: Kernel header vs libext2fs headers
In-Reply-To: <20080315022035.GA28894@alinoe.com>
References: <20080315022035.GA28894@alinoe.com>
Message-ID: <20080315032637.GO3542@webber.adilger.int>

On Mar 15, 2008  03:20 +0100, Carlo Wood wrote:
> The kernel headers define EXT3_ORPHAN_FS
> while libext2fs header defines EXT4_ORPHAN_FS
> 
> This means that one of the two is wrong.

That isn't necessarily a correct assumption.  All of the definitions in
the fs/ext3 code are EXT3_*, and similarly, all of the definitions in
fs/ext2 are EXT2_*, and in fs/ext4 they are EXT4_*.  This avoids name
conflicts.

Conversely (though I don't necessarily agree with this) the definitions
in libext2fs declare these flags depending on what "version" of extN
the feature was first added (EXT2_*, EXT3_*, EXT4_*).  That makes it
easier to see what kernel is using which feature, but isn't always 100%
accurate or correct.

> Does ext3 use/have EXT3_ORPHAN_FS, or that
> something that is new in ext4?

Note that EXT3_ORPHAN_FS isn't an on disk format or feature at all,
but just an in-memory state flag to convey the fact that the filesystem
is just being mounted and orphans are being cleaned up down to lower
levels of the code that are reading the inodes from disk.  Otherwise,
the low level ext3_read_inode() will consider inodes with i_nlink == 0
to be unlinked and return a bad inode to the caller, to avoid issues
with NFS trying to access inodes that were deleted.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From carlo at alinoe.com  Sat Mar 15 04:17:02 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Sat, 15 Mar 2008 05:17:02 +0100
Subject: Kernel header vs libext2fs headers
In-Reply-To: <20080315032637.GO3542@webber.adilger.int>
References: <20080315022035.GA28894@alinoe.com>
	<20080315032637.GO3542@webber.adilger.int>
Message-ID: <20080315041702.GA2172@alinoe.com>

On Sat, Mar 15, 2008 at 11:26:37AM +0800, Andreas Dilger wrote:
> the fs/ext3 code are EXT3_*, and similarly, all of the definitions in
> fs/ext2 are EXT2_*, and in fs/ext4 they are EXT4_*.  This avoids name
> conflicts.
> 
> Conversely (though I don't necessarily agree with this) the definitions
> in libext2fs declare these flags depending on what "version" of extN
> the feature was first added (EXT2_*, EXT3_*, EXT4_*).  That makes it
> easier to see what kernel is using which feature, but isn't always 100%
> accurate or correct.

But if EXT4_ORPHAN_FS is defined, then you imply that ext4 is the
first version of ext that has implemented it; however, the ext3 kernel
header defines it, so you should use EXT3_ORPHAN_FS in e2fsprogs.
Or am I missing something? If ORPHAN_FS was truely new since ext4,
shouldn't it be missing in /usr/include/linux/ext3_fs.h ?

-- 
Carlo Wood <carlo at alinoe.com>


From adilger at sun.com  Sat Mar 15 04:27:38 2008
From: adilger at sun.com (Andreas Dilger)
Date: Sat, 15 Mar 2008 12:27:38 +0800
Subject: Kernel header vs libext2fs headers
In-Reply-To: <20080315041702.GA2172@alinoe.com>
References: <20080315022035.GA28894@alinoe.com>
	<20080315032637.GO3542@webber.adilger.int>
	<20080315041702.GA2172@alinoe.com>
Message-ID: <20080315042738.GQ3542@webber.adilger.int>

On Mar 15, 2008  05:17 +0100, Carlo Wood wrote:
> On Sat, Mar 15, 2008 at 11:26:37AM +0800, Andreas Dilger wrote:
> > the fs/ext3 code are EXT3_*, and similarly, all of the definitions in
> > fs/ext2 are EXT2_*, and in fs/ext4 they are EXT4_*.  This avoids name
> > conflicts.
> > 
> > Conversely (though I don't necessarily agree with this) the definitions
> > in libext2fs declare these flags depending on what "version" of extN
> > the feature was first added (EXT2_*, EXT3_*, EXT4_*).  That makes it
> > easier to see what kernel is using which feature, but isn't always 100%
> > accurate or correct.
> 
> But if EXT4_ORPHAN_FS is defined, then you imply that ext4 is the
> first version of ext that has implemented it; however, the ext3 kernel
> header defines it, so you should use EXT3_ORPHAN_FS in e2fsprogs.
> Or am I missing something? If ORPHAN_FS was truely new since ext4,
> shouldn't it be missing in /usr/include/linux/ext3_fs.h ?

Actually, I'm not sure what is going on there.  In lib/ext2fs/ext2_fs.h
it is in fact defined as EXT4_ORPHAN_FS, but this has been in use on
ext3 for a long time, so you are right - there is a bug in the e2fsprogs
version of ext2_fs.h.

Can you please submit a patch to Ted with this change.  It is probably
also worth noting that this flag is only used in memory and not on disk.
Since it shares the same in-memory variable with EXT2_ERROR_FS it needs
to be declared in e2fsprogs to avoid conflict, but otherwise has no meaning.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From carlo at alinoe.com  Sat Mar 15 04:32:04 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Sat, 15 Mar 2008 05:32:04 +0100
Subject: Can journal_revoke_header_s::r_count be changed from int to __s32
	please?
Message-ID: <20080315043204.GB2172@alinoe.com>

I find that linux/jbd.h defines:

typedef struct journal_revoke_header_s
{
        journal_header_t r_header;
	__be32           r_count;       /* Count of bytes used in the block */
} journal_revoke_header_t;

thus, sizeof(r_count) == 4

However, in e2progs, in kernel-jbd.h I find:

typedef struct journal_revoke_header_s
{
        journal_header_t r_header;
        int              r_count;       /* Count of bytes used in the block */
} journal_revoke_header_t;

and this sizeof(r_count) depends on the architecture.

Using e2fslibs this is probably not a problem because
all current OS have sizeof(int) >= 4, and r_count is assigned
rather than mapped to the disk image (even on big endian
machines?).

Nevertheless, since I believe that kernel-jbd.h should
be made public (installed along with the other header files)
in order to make at least journal_superblock_t available
to user programs, I'd like to request to change this
int into __s32. That simply makes more sense as
journal_revoke_header_t represents a data structure
on disk and sizeof(journal_revoke_header_s) might be
used somewhere.

-- 
Carlo Wood <carlo at alinoe.com>


From tambewilliam at gmail.com  Sun Mar 16 20:13:47 2008
From: tambewilliam at gmail.com (William Tambe)
Date: Sun, 16 Mar 2008 15:13:47 -0500
Subject: Filesystem fragmentation and scatter-gather DMA
Message-ID: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>

When designing a filesystem, is fragmentation really an issue if
access to the disk can be done using scatter-gather DMA techics ?


From adilger at sun.com  Sun Mar 16 22:29:03 2008
From: adilger at sun.com (Andreas Dilger)
Date: Mon, 17 Mar 2008 06:29:03 +0800
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>
References: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>
Message-ID: <20080316222903.GC3542@webber.adilger.int>

On Mar 16, 2008  15:13 -0500, William Tambe wrote:
> When designing a filesystem, is fragmentation really an issue if
> access to the disk can be done using scatter-gather DMA techics ?

Yes!!!  Scatter-gather only handles "fragmentation" in memory, where
seek time is zero.  If there is fragmentation on disk you pay 8ms
for each fragment in the read.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From tambewilliam at gmail.com  Sun Mar 16 23:56:27 2008
From: tambewilliam at gmail.com (William Tambe)
Date: Sun, 16 Mar 2008 18:56:27 -0500
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <20080316222903.GC3542@webber.adilger.int>
References: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>
	<20080316222903.GC3542@webber.adilger.int>
Message-ID: <b5cdb1f10803161656q226cfc56k5daafb62f6ff3db7@mail.gmail.com>

Is the delay due to mechanical parts or the electronics gathering the fragments?

Would that same delay still apply to a solid state drive? Since a
solid state drive is really just a slower version of system memory
(Please correct me if I am wrong).


On Sun, Mar 16, 2008 at 5:29 PM, Andreas Dilger <adilger at sun.com> wrote:
> On Mar 16, 2008  15:13 -0500, William Tambe wrote:
> > When designing a filesystem, is fragmentation really an issue if
> > access to the disk can be done using scatter-gather DMA techics ?
>
> Yes!!!  Scatter-gather only handles "fragmentation" in memory, where
> seek time is zero.  If there is fragmentation on disk you pay 8ms
> for each fragment in the read.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>


From jlforrest at berkeley.edu  Mon Mar 17 01:40:19 2008
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Sun, 16 Mar 2008 18:40:19 -0700
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>
References: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>
Message-ID: <47DDCC03.1060002@berkeley.edu>

The following is a short note I wrote a while back,
mainly in response to a discussion of filesystem
fragmentation in Windows operating systems. Most
of what I saw also applies to *nix systems.

Jon Forrest

----------------
Why PC Disk Fragmentation Doesn't Matter (much)

Jon Forrest (jlforrest at berkeley.edu)

[The following is an hypothesis. I don't have
any real data to back this up. I'd like to know
if I'm overlooking any technical details.]

Disk fragmentation can mean several things.
On one hand it can mean that the disk blocks
that a file occupies aren't right next to each
other physically. The more pieces that make up a file, the
more fragmented the file is. Or, it can mean
that the unused blocks on a disk aren't all right
next to each other. Win9X, Windows 2000, and Windows XP
come with defragmentation programs. Such programs
are also available for other Microsoft and non-Microsoft
operating systems from commercial vendors.

The question of whether a fragmented disk really
results in anything bad has always been a topic
of heated discussion. On one side of the issue
the vendors of disk defragmentation programs can
always be found. The other side is usually occupied
by skeptical system managers, such as yours truly.

For example, the following claim is made by the
vendor of one commercial vendor:

"Disk fragmentation can cripple performance even worse
than running with insufficient memory. Eliminate it
and you've eliminated the primary performance bottleneck
plaguing even the best-equipped systems." But can it, and
does it? The user's guide for this product spends some 60 pages
describing how to run the product but never justifies this
claim.

I'm not saying that fragmentation is good. That's one reason
why you can't buy a product whose purpose is to fragment a disk.
But, it's hard to imagine how fragmentation can cause any noticeable
performance problems. Here's why:

1) The greatest benefit from having a contiguous file would
be when the whole file is read (let's stick with reads) in
one I/O operation. The would result in the minimal amount of
disk arm movement, which is the slowest part of a disk I/O
operation. But, this isn't the way most I/Os take place. Instead,
most I/Os are fairly small. Plus, and this is the kicker, on
a modern multitasking operating system, those small I/Os are coming
from different processes reading from different files. Assuming that the
data to be read isn't in a memory cache, this means that the disk arm is
going to be flying all over the place, trying to satisfy all
the seek operations being issued by the operating system.
Sure, the operating system, and maybe even the disk controller,
might be trying to re-order I/Os but there's only so much of
this that can be done. A contiguous file doesn't really help
much because there's a very good change that the disk arm is
going to have to move elsewhere on the disk between the time
that pieces of a file are read.

2) The metadata for managing a filesystem is probably
cached in RAM. This means when a file is created, or
extended, the necessary metadata updates are done at memory
speed, not at disk speed. So, the overhead of allocating
multiple pieces for a new file is probably in the noise.
Of course, the in-memory metadata eventually has to be flushed
to disk but this is usually done after the original I/O completes,
so there won't be any visible slowdown in the program that issued
the I/O.

3) Modern disks do all kind of internal block remapping so there's
no guarantee that what appears to be contiguous to the operating
system is actually really and truly contiguous on the disk. I have
no idea how often this possibility occurs, or how bad the skew is
between "fake" blocks and "real" blocks. But, it could happen.

So, go ahead and run your favorite disk defragmenter. I know I do.
Now that W2K and later have an official API for moving files in an 
atomic operation, such programs probably can't cause any harm. But
don't be surprised if you don't see any noticeable performance
improvements.

The mystery that really puzzles and sometimes frightens me is
why an NTFS file system becomes fragmented so easily in the first
place. Let's say I'm installing Windows 2000 on a newly formatted
20GB disk. Let's say that the total amount of space used by the
new installation is 600MB. Why should I see any fragmented files,
other than registry files, after such an installation? I have no
idea. My thinking is that all files that aren't created and then
later extended should be able to be created contiguously to begin with.


From ling at aliko.com  Mon Mar 17 05:48:03 2008
From: ling at aliko.com (Ling C. Ho)
Date: Mon, 17 Mar 2008 00:48:03 -0500
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <47DDCC03.1060002@berkeley.edu>
References: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>
	<47DDCC03.1060002@berkeley.edu>
Message-ID: <47DE0613.4040907@aliko.com>

I have this experience a couple of years ago. Under some version of 
Redhat Linux Enterprise 3 using kernel 2.4x, I tested scping two files 
slightly over 1Gig to a freshly formated ext3 filesystems 
simultaneously. It turned out the version of ext3 did not have 
reservation implemented, and we ended up with 2 files with more than 
10,000 non-contiguous fragments.

Even though the two files sat physically very close together on disk, 
the fragmentation was so bad that instead of getting over 50MB/s read we 
were expecting from reading a file at a time, we were getting  about 
10MB/s.  It's not day to day usage pattern on many desktop or servers, 
but unfortunately for us,  that's what  hundreds of our servers were  
set up to do. That is to run 2 jobs at a time, where they would first 
copy the data files from some where else, read them and then  analyze 
the data, and write some result onto another file systems.

So fragmentation could be very bad, but fortunately the later versions 
of ext3 have done much better in preventing just that.

...
ling


Jon Forrest wrote:
> The following is a short note I wrote a while back,
> mainly in response to a discussion of filesystem
> fragmentation in Windows operating systems. Most
> of what I saw also applies to *nix systems.
>
> Jon Forrest
>
> ----------------
> Why PC Disk Fragmentation Doesn't Matter (much)
>
> Jon Forrest (jlforrest at berkeley.edu)
>
> [The following is an hypothesis. I don't have
> any real data to back this up. I'd like to know
> if I'm overlooking any technical details.]
>
> Disk fragmentation can mean several things.
> On one hand it can mean that the disk blocks
> that a file occupies aren't right next to each
> other physically. The more pieces that make up a file, the
> more fragmented the file is. Or, it can mean
> that the unused blocks on a disk aren't all right
> next to each other. Win9X, Windows 2000, and Windows XP
> come with defragmentation programs. Such programs
> are also available for other Microsoft and non-Microsoft
> operating systems from commercial vendors.
>
> The question of whether a fragmented disk really
> results in anything bad has always been a topic
> of heated discussion. On one side of the issue
> the vendors of disk defragmentation programs can
> always be found. The other side is usually occupied
> by skeptical system managers, such as yours truly.
>
> For example, the following claim is made by the
> vendor of one commercial vendor:
>
> "Disk fragmentation can cripple performance even worse
> than running with insufficient memory. Eliminate it
> and you've eliminated the primary performance bottleneck
> plaguing even the best-equipped systems." But can it, and
> does it? The user's guide for this product spends some 60 pages
> describing how to run the product but never justifies this
> claim.
>
> I'm not saying that fragmentation is good. That's one reason
> why you can't buy a product whose purpose is to fragment a disk.
> But, it's hard to imagine how fragmentation can cause any noticeable
> performance problems. Here's why:
>
> 1) The greatest benefit from having a contiguous file would
> be when the whole file is read (let's stick with reads) in
> one I/O operation. The would result in the minimal amount of
> disk arm movement, which is the slowest part of a disk I/O
> operation. But, this isn't the way most I/Os take place. Instead,
> most I/Os are fairly small. Plus, and this is the kicker, on
> a modern multitasking operating system, those small I/Os are coming
> from different processes reading from different files. Assuming that the
> data to be read isn't in a memory cache, this means that the disk arm is
> going to be flying all over the place, trying to satisfy all
> the seek operations being issued by the operating system.
> Sure, the operating system, and maybe even the disk controller,
> might be trying to re-order I/Os but there's only so much of
> this that can be done. A contiguous file doesn't really help
> much because there's a very good change that the disk arm is
> going to have to move elsewhere on the disk between the time
> that pieces of a file are read.
>
> 2) The metadata for managing a filesystem is probably
> cached in RAM. This means when a file is created, or
> extended, the necessary metadata updates are done at memory
> speed, not at disk speed. So, the overhead of allocating
> multiple pieces for a new file is probably in the noise.
> Of course, the in-memory metadata eventually has to be flushed
> to disk but this is usually done after the original I/O completes,
> so there won't be any visible slowdown in the program that issued
> the I/O.
>
> 3) Modern disks do all kind of internal block remapping so there's
> no guarantee that what appears to be contiguous to the operating
> system is actually really and truly contiguous on the disk. I have
> no idea how often this possibility occurs, or how bad the skew is
> between "fake" blocks and "real" blocks. But, it could happen.
>
> So, go ahead and run your favorite disk defragmenter. I know I do.
> Now that W2K and later have an official API for moving files in an 
> atomic operation, such programs probably can't cause any harm. But
> don't be surprised if you don't see any noticeable performance
> improvements.
>
> The mystery that really puzzles and sometimes frightens me is
> why an NTFS file system becomes fragmented so easily in the first
> place. Let's say I'm installing Windows 2000 on a newly formatted
> 20GB disk. Let's say that the total amount of space used by the
> new installation is 600MB. Why should I see any fragmented files,
> other than registry files, after such an installation? I have no
> idea. My thinking is that all files that aren't created and then
> later extended should be able to be created contiguously to begin with.
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users


From davids at webmaster.com  Mon Mar 17 06:05:48 2008
From: davids at webmaster.com (David Schwartz)
Date: Sun, 16 Mar 2008 23:05:48 -0700
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <47DDCC03.1060002@berkeley.edu>
Message-ID: <MDEHLPKNGKAHNMBLJOLKGEAGLDAC.davids@webmaster.com>


Jon Forrest wrote:

> 1) The greatest benefit from having a contiguous file would
> be when the whole file is read (let's stick with reads) in
> one I/O operation. The would result in the minimal amount of
> disk arm movement, which is the slowest part of a disk I/O
> operation. But, this isn't the way most I/Os take place. Instead,
> most I/Os are fairly small. Plus, and this is the kicker, on
> a modern multitasking operating system, those small I/Os are coming
> from different processes reading from different files. Assuming that the
> data to be read isn't in a memory cache, this means that the disk arm is
> going to be flying all over the place, trying to satisfy all
> the seek operations being issued by the operating system.
> Sure, the operating system, and maybe even the disk controller,
> might be trying to re-order I/Os but there's only so much of
> this that can be done. A contiguous file doesn't really help
> much because there's a very good change that the disk arm is
> going to have to move elsewhere on the disk between the time
> that pieces of a file are read.

That's not really the issue. The issue is whether a read of a chunk of a
file can take place without any extra seeks or whether it does require extra
seeks. Further, for the vast majority of cases, there is only one I/O stream
going on at a time. The disk will read ahead. If that can satisfy even a
small fraction of the subsequent I/Os the OS issues, that's a big win.

> 3) Modern disks do all kind of internal block remapping so there's
> no guarantee that what appears to be contiguous to the operating
> system is actually really and truly contiguous on the disk. I have
> no idea how often this possibility occurs, or how bad the skew is
> between "fake" blocks and "real" blocks. But, it could happen.

Not bad enough to make a significant difference on any but a nearly-failing
drive.

> The mystery that really puzzles and sometimes frightens me is
> why an NTFS file system becomes fragmented so easily in the first
> place. Let's say I'm installing Windows 2000 on a newly formatted
> 20GB disk. Let's say that the total amount of space used by the
> new installation is 600MB. Why should I see any fragmented files,
> other than registry files, after such an installation? I have no
> idea. My thinking is that all files that aren't created and then
> later extended should be able to be created contiguously to begin with.

Only if you're willing to leave big holes behind, which will rapidly lead to
a full disk and massive fragmentation. As files are being created, files are
also being deleted. There is no way for the OS to know ahead of time which
files are going to be around for a long time, so it has to mix the
short-term files with the long-term files. But, of course, once you
defragment a large chunk of non-changing files, they should stay that way.

DS


From liuyue at ncic.ac.cn  Mon Mar 17 07:29:32 2008
From: liuyue at ncic.ac.cn (liuyue)
Date: Mon, 17 Mar 2008 15:29:32 +0800
Subject: The maximum number of files under a folder
Message-ID: <20080317071048.402B21368F7@ncic.ac.cn>

John Nelson

    I see that EXT3_LINK_MAX was set to 32000, if we will meet with problems if we change this limit to 65000?
	Thanks!


>i think not more than 5k files without dir_index on. The max limit of 
>subfolders is 32k
>
>_______________________________________________
>Ext3-users mailing list
>Ext3-users at redhat.com
>https://www.redhat.com/mailman/listinfo/ext3-users
>
>

= = = = = = = = = = = = = = = = = = = =
			

?????????
??
 
				 
????????liuyue
????????liuyue at ncic.ac.cn
??????????2008-03-17


From liuyue at ncic.ac.cn  Mon Mar 17 07:40:36 2008
From: liuyue at ncic.ac.cn (liuyue)
Date: Mon, 17 Mar 2008 15:40:36 +0800
Subject: The maximum number of files under a folder
Message-ID: <20080317072152.05DB51369B8@ncic.ac.cn>

Theodore Tso,

	In 64bit system, directory size can not be bigger than 2GB?

======= 2008-03-14 01:23:18 ????????=======

>On Thu, Mar 13, 2008 at 12:48:50PM -0400, John Nelson wrote:
>> i think not more than 5k files without dir_index on. The max limit of 
>> subfolders is 32k
>
>There is no limit to the number of files in a folder, except for the
>fact that the directory itself can't be bigger than 2GB, and the
>number of inodes that the entire filesystem has available to it.  Of
>course, if you don't have directory indexing turned on, you may not
>like the performance of doing directory lookups, but that's a
>different story.
>
>						- Ted
>
>_______________________________________________
>Ext3-users mailing list
>Ext3-users at redhat.com
>https://www.redhat.com/mailman/listinfo/ext3-users
>
>

= = = = = = = = = = = = = = = = = = = =
			

?????????
??
 
				 
????????liuyue
????????liuyue at ncic.ac.cn
??????????2008-03-17


From tytso at mit.edu  Mon Mar 17 13:32:07 2008
From: tytso at mit.edu (Theodore Tso)
Date: Mon, 17 Mar 2008 09:32:07 -0400
Subject: The maximum number of files under a folder
In-Reply-To: <20080317072152.05DB51369B8@ncic.ac.cn>
References: <20080317072152.05DB51369B8@ncic.ac.cn>
Message-ID: <20080317133207.GB8368@mit.edu>

On Mon, Mar 17, 2008 at 03:40:36PM +0800, liuyue wrote:
> Theodore Tso,
> 
> 	In 64bit system, directory size can not be bigger than 2GB?

No, because the high 32-bits for i_size are overloaded to store the
directory creation acl.

In practice, you really don't want to have a directory that huge
anyway.  Iterating through it all with readdir() gets horribly slow,
and applications that try do anything with really huge directories
would be well advised to use a database, because they will get *much*
better performance that way....

							- Ted


From ric at emc.com  Mon Mar 17 13:45:54 2008
From: ric at emc.com (Ric Wheeler)
Date: Mon, 17 Mar 2008 09:45:54 -0400
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <b5cdb1f10803161656q226cfc56k5daafb62f6ff3db7@mail.gmail.com>
References: <b5cdb1f10803161313n14468601u1d30150bb8e7a7f9@mail.gmail.com>	<20080316222903.GC3542@webber.adilger.int>
	<b5cdb1f10803161656q226cfc56k5daafb62f6ff3db7@mail.gmail.com>
Message-ID: <47DE7612.2040306@emc.com>

William Tambe wrote:
> Is the delay due to mechanical parts or the electronics gathering the fragments?
> 
> Would that same delay still apply to a solid state drive? Since a
> solid state drive is really just a slower version of system memory
> (Please correct me if I am wrong).
> 

With spinning media,the big cost is moving the physical heads of the drive.

With an SSD FLASH-based device, you might also prefer having contiguous 
writes since flash needs to be erased before the write can happen (and 
that occurs in chunks). Non-contiguous writes of single sectors would 
have a high chance of causing extra erasures & read-modify-writes...

ric


From jlforrest at berkeley.edu  Mon Mar 17 16:52:04 2008
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Mon, 17 Mar 2008 09:52:04 -0700
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <MDEHLPKNGKAHNMBLJOLKGEAGLDAC.davids@webmaster.com>
References: <MDEHLPKNGKAHNMBLJOLKGEAGLDAC.davids@webmaster.com>
Message-ID: <47DEA1B4.70204@berkeley.edu>

David Schwartz wrote:

> That's not really the issue. The issue is whether a read of a chunk of a
> file can take place without any extra seeks or whether it does require extra
> seeks. Further, for the vast majority of cases, there is only one I/O stream
> going on at a time. The disk will read ahead. If that can satisfy even a
> small fraction of the subsequent I/Os the OS issues, that's a big win.

Maybe on a single user PC, some of the time there is only one I/O
stream going on a time. But, once you start doing anything in parallel,
or have multiple users, the number of sources (and destinations) of I/O
goes way up. This, the arm is going to have to be moving around randomly
even if the files involved aren't fragmented. Some (most?) OSs sort
I/Os so that the movement is minimized but it still occurs.

>> 3) Modern disks do all kind of internal block remapping so there's
>> no guarantee that what appears to be contiguous to the operating
>> system is actually really and truly contiguous on the disk. I have
>> no idea how often this possibility occurs, or how bad the skew is
>> between "fake" blocks and "real" blocks. But, it could happen.
> 
> Not bad enough to make a significant difference on any but a nearly-failing
> drive.

It would be interesting to see what I'm calling the skew between
the true sector layout and what an O/S sees on modern SATA drives.
I'm not aware of any way to see this. Does anybody know?

I stand by my assertion that while disk fragmentation is in no way
a good thing, it isn't something to fear, at least not in the way
shown in the advertisements for defragmentation products.

-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From ric at emc.com  Mon Mar 17 17:11:24 2008
From: ric at emc.com (Ric Wheeler)
Date: Mon, 17 Mar 2008 13:11:24 -0400
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <47DEA1B4.70204@berkeley.edu>
References: <MDEHLPKNGKAHNMBLJOLKGEAGLDAC.davids@webmaster.com>
	<47DEA1B4.70204@berkeley.edu>
Message-ID: <47DEA63C.2010305@emc.com>

Jon Forrest wrote:
> David Schwartz wrote:
> 
>> That's not really the issue. The issue is whether a read of a chunk of a
>> file can take place without any extra seeks or whether it does require 
>> extra
>> seeks. Further, for the vast majority of cases, there is only one I/O 
>> stream
>> going on at a time. The disk will read ahead. If that can satisfy even a
>> small fraction of the subsequent I/Os the OS issues, that's a big win.
> 
> Maybe on a single user PC, some of the time there is only one I/O
> stream going on a time. But, once you start doing anything in parallel,
> or have multiple users, the number of sources (and destinations) of I/O
> goes way up. This, the arm is going to have to be moving around randomly
> even if the files involved aren't fragmented. Some (most?) OSs sort
> I/Os so that the movement is minimized but it still occurs.

You should keep in mind that big servers also have higher end storage 
systems (or at least multiple devices).  Heads don't tend to move about 
randomly - they will normally try to read (or write) in a specific 
order. Normally, that order is in increasing sector order.

Every level of the the system tries to guess how to combine and read 
ahead, all the way from the file system down to the internal firmware in 
  the storage.

The best way to get read-ahead to work is to use really obvious patterns 
- sequential, increasing and large IO's work best ;-)

> 
>>> 3) Modern disks do all kind of internal block remapping so there's
>>> no guarantee that what appears to be contiguous to the operating
>>> system is actually really and truly contiguous on the disk. I have
>>> no idea how often this possibility occurs, or how bad the skew is
>>> between "fake" blocks and "real" blocks. But, it could happen.
>>
>> Not bad enough to make a significant difference on any but a 
>> nearly-failing
>> drive.
> 
> It would be interesting to see what I'm calling the skew between
> the true sector layout and what an O/S sees on modern SATA drives.
> I'm not aware of any way to see this. Does anybody know?

I would not spend any time worrying about the sector remapping. SMART 
can tell you how many sectors have been remapped, but even with a really 
large disk the maximum number of remapped sectors is tiny (say 2000 or 
so for a 500GB disk).  Your chances of hitting them are tiny, especially 
since most drives end up with very, very few remapped sectors before 
they get tossed. Those with more than 100 sectors, for example, tend to 
complain a lot.

The short answer is to look at the sector level order of your file and 
assume (pretend) that it reflects the media layout as well.

Note that the whole deal changes when you have multi-drive RAID devices 
(software or hardware).

> I stand by my assertion that while disk fragmentation is in no way
> a good thing, it isn't something to fear, at least not in the way
> shown in the advertisements for defragmentation products.
> 

I think that fragmentation is a bad performance hit, but that we 
actually do relatively well in keeping our files contiguous in normal cases.

I have a simple bit of c code that uses fibmap to dump the 
sectors/blocks for a specific file. If you like, I can send it over to you.

Regards,

Ric


From jlforrest at berkeley.edu  Mon Mar 17 17:24:56 2008
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Mon, 17 Mar 2008 10:24:56 -0700
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <47DEA63C.2010305@emc.com>
References: <MDEHLPKNGKAHNMBLJOLKGEAGLDAC.davids@webmaster.com>
	<47DEA1B4.70204@berkeley.edu> <47DEA63C.2010305@emc.com>
Message-ID: <47DEA968.5080700@berkeley.edu>

Ric Wheeler wrote:

> Every level of the the system tries to guess how to combine and read 
> ahead, all the way from the file system down to the internal firmware in 
>  the storage.

I remember Kirk McKusick once complaining about how hard it was to write
a file system when so many other levels in a system try to second guess
what he was trying to do. I've also heard disk engineers complain about
the same thing, except they complain about the OS people not leaving
optimization techniques to them. Go figure.

> I think that fragmentation is a bad performance hit, but that we 
> actually do relatively well in keeping our files contiguous in normal 
> cases.

We might disagree on how bad the performance hit is, but I'm really
trying to prevent non-technical people from panicking when they see
a fragmented filesystem (or file).

> I have a simple bit of c code that uses fibmap to dump the 
> sectors/blocks for a specific file. If you like, I can send it over to you.

Sure. Thanks.

-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From ric at emc.com  Mon Mar 17 17:29:58 2008
From: ric at emc.com (Ric Wheeler)
Date: Mon, 17 Mar 2008 13:29:58 -0400
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <47DEA968.5080700@berkeley.edu>
References: <MDEHLPKNGKAHNMBLJOLKGEAGLDAC.davids@webmaster.com>	<47DEA1B4.70204@berkeley.edu>
	<47DEA63C.2010305@emc.com> <47DEA968.5080700@berkeley.edu>
Message-ID: <47DEAA96.6070005@emc.com>

Jon Forrest wrote:
> Ric Wheeler wrote:
> 
>> Every level of the the system tries to guess how to combine and read 
>> ahead, all the way from the file system down to the internal firmware 
>> in  the storage.
> 
> I remember Kirk McKusick once complaining about how hard it was to write
> a file system when so many other levels in a system try to second guess
> what he was trying to do. I've also heard disk engineers complain about
> the same thing, except they complain about the OS people not leaving
> optimization techniques to them. Go figure.

The trick is just to do the obvious thing (big, sequential IO's) from 
the application to give the various layers the easiest job of second 
guessing ;-)

There are certainly advantages to doing the read ahead (and coalescing) 
at the different layers. For example, a file system can do predictive 
read ahead across the non-contiguous chunks of a single file while the 
IO layer can coalesce multiple write or read commands on the same host 
and a multi-ported drive can do the same for multiple hosts.


> 
>> I think that fragmentation is a bad performance hit, but that we 
>> actually do relatively well in keeping our files contiguous in normal 
>> cases.
> 
> We might disagree on how bad the performance hit is, but I'm really
> trying to prevent non-technical people from panicking when they see
> a fragmented filesystem (or file).

I agree - most casual users will never see anything close to a 
performance issue until they have completely filled the file system. In 
that case, defragmentation will not be the real help.

> 
>> I have a simple bit of c code that uses fibmap to dump the 
>> sectors/blocks for a specific file. If you like, I can send it over to 
>> you.
> 
> Sure. Thanks.

I will send it to you out of band. Mark Lord had some tweaks to this 
that I have not rolled in, let me know if it is useful.

ric


From davids at webmaster.com  Mon Mar 17 22:20:29 2008
From: davids at webmaster.com (David Schwartz)
Date: Mon, 17 Mar 2008 15:20:29 -0700
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <47DEAA96.6070005@emc.com>
Message-ID: <MDEHLPKNGKAHNMBLJOLKEEGFLDAC.davids@webmaster.com>


Ric Wheeler wrote:

> There are certainly advantages to doing the read ahead (and coalescing)
> at the different layers. For example, a file system can do predictive
> read ahead across the non-contiguous chunks of a single file while the
> IO layer can coalesce multiple write or read commands on the same host
> and a multi-ported drive can do the same for multiple hosts.

If the file system does predictive read-ahead, and the data is not used, the
penalty will be *much* larger if the predictive read-ahead required an extra
seek than if it didn't. This is one of the biggest ways that fragmentation
hurts performance. The other is if the disk does read-ahead and the next
chunk of data in the file was needed, but wasn't read by the disk because of
fragmentation.

> > We might disagree on how bad the performance hit is, but I'm really
> > trying to prevent non-technical people from panicking when they see
> > a fragmented filesystem (or file).

> I agree - most casual users will never see anything close to a
> performance issue until they have completely filled the file system. In
> that case, defragmentation will not be the real help.

I agree with this as well. The only significant differences I've seen with
disk defragmenters were in two cases:

1) The filesystem was close to full, and the defragmenter bought a bit of
extra time before something had to be done.

2) The defragmenter was smart enough to move frequenty-accessed files to the
fastest parts of the disk, and the disk had a large (20%) difference between
its fastest and slowest tracks.

Otherwise, it's a miniscule difference.

I'd love to see smarter disks with much larger caches so that the OS could
say to the disk "here's the data I need now, and here's what I might need
later".

DS


From adilger at sun.com  Tue Mar 18 01:14:12 2008
From: adilger at sun.com (Andreas Dilger)
Date: Tue, 18 Mar 2008 09:14:12 +0800
Subject: Filesystem fragmentation and scatter-gather DMA
In-Reply-To: <47DEA63C.2010305@emc.com>
References: <MDEHLPKNGKAHNMBLJOLKGEAGLDAC.davids@webmaster.com>
	<47DEA1B4.70204@berkeley.edu> <47DEA63C.2010305@emc.com>
Message-ID: <20080318011342.GH3542@webber.adilger.int>

On Mar 17, 2008  13:11 -0400, Ric Wheeler wrote:
> I have a simple bit of c code that uses fibmap to dump the sectors/blocks 
> for a specific file. If you like, I can send it over to you.

Hmm,  I could have sworn "filefrag" did this, but it doesn't have any
mode that actually prints out a list of blocks, only the discontinuities
in the file...  We are adding a new "extents" output mode to filefrag
which prints block mappings in a more useful manner, but it isn't in
upstream e2fsprogs yet.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From adilger at sun.com  Tue Mar 18 22:56:58 2008
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 19 Mar 2008 06:56:58 +0800
Subject: The maximum number of files under a folder
In-Reply-To: <20080317133207.GB8368@mit.edu>
References: <20080317072152.05DB51369B8@ncic.ac.cn>
	<20080317133207.GB8368@mit.edu>
Message-ID: <20080318225658.GA2971@webber.adilger.int>

On Mar 17, 2008  09:32 -0400, Theodore Ts'o wrote:
> On Mon, Mar 17, 2008 at 03:40:36PM +0800, liuyue wrote:
> > Theodore Tso,
> > 
> > 	In 64bit system, directory size can not be bigger than 2GB?
> 
> No, because the high 32-bits for i_size are overloaded to store the
> directory creation acl.

I think we should change the code (kernel and e2fsprogs) to allow
i_size_high for directories also.

> In practice, you really don't want to have a directory that huge
> anyway.  Iterating through it all with readdir() gets horribly slow,
> and applications that try do anything with really huge directories
> would be well advised to use a database, because they will get *much*
> better performance that way....

Actually, for many HPC applications they never do readdir at all.
The job creates 1 file/process and always uses a predefined filename
like {job}-{timestamp}-{process} that it will directly look up.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


From darkonc at gmail.com  Wed Mar 19 06:35:03 2008
From: darkonc at gmail.com (Stephen Samuel)
Date: Tue, 18 Mar 2008 23:35:03 -0700
Subject: The maximum number of files under a folder
In-Reply-To: <20080318225658.GA2971@webber.adilger.int>
References: <20080317072152.05DB51369B8@ncic.ac.cn>
	<20080317133207.GB8368@mit.edu>
	<20080318225658.GA2971@webber.adilger.int>
Message-ID: <6cd50f9f0803182335o45fa23echd8128cd6ddd2216e@mail.gmail.com>

The OS will have to search the directory to see if the file already exists
before creating it.

Well, if you hash it such that it splits up something like:
jobid(upper part)/jobid(lower- part)[/-]timestamp-process,
 you'll find that your access times will be must faster (especially if you
don't use H-Trees).  This also applies if  you're just creating a file,
because you'll have to search the entire directory to see if that filename
exists

With regular directories, searching through them to see if a file already
exist increases linearly with the number of entries.  If you hash on 3
levels with 8-bits per level, you'll have to open 2 or 3 extra inodes, but
you'll cut your directory search times down by a factor of 20000-1.  You'll
also skip having to deal with any sort of directory-size limit.
(=2^24/256/3)

I did something similar on a Solaris box which had 200000 emails in the
/var/spool/mqueue directory. That many messages was slowing the system to a
crawl.  I hashed it into 100 directories with 2000  entries each,   it sped
things up *enormously.*

On Tue, Mar 18, 2008 at 3:56 PM, Andreas Dilger <adilger at sun.com> wrote:

> On Mar 17, 2008  09:32 -0400, Theodore Ts'o wrote:
> > On Mon, Mar 17, 2008 at 03:40:36PM +0800, liuyue wrote:
> > > Theodore Tso,
> > >
> > >     In 64bit system, directory size can not be bigger than 2GB?
> >
> > No, because the high 32-bits for i_size are overloaded to store the
> > directory creation acl.
>
> I think we should change the code (kernel and e2fsprogs) to allow
> i_size_high for directories also.
>
> > In practice, you really don't want to have a directory that huge
> > anyway.  Iterating through it all with readdir() gets horribly slow,
> > and applications that try do anything with really huge directories
> > would be well advised to use a database, because they will get *much*
> > better performance that way....
>
> Actually, for many HPC applications they never do readdir at all.
> The job creates 1 file/process and always uses a predefined filename
> like {job}-{timestamp}-{process} that it will directly look up.
>
> Cheers, Andreas
>


-- 
Stephen Samuel http://www.bcgreen.com
778-861-7641
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080318/b8c5b1ac/attachment.htm>

From articpenguin3800 at gmail.com  Wed Mar 19 12:16:15 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Wed, 19 Mar 2008 08:16:15 -0400
Subject: The maximum number of files under a folder
In-Reply-To: <6cd50f9f0803182335o45fa23echd8128cd6ddd2216e@mail.gmail.com>
References: <20080317072152.05DB51369B8@ncic.ac.cn>	
	<20080317133207.GB8368@mit.edu>	
	<20080318225658.GA2971@webber.adilger.int>
	<6cd50f9f0803182335o45fa23echd8128cd6ddd2216e@mail.gmail.com>
Message-ID: <47E1040F.5060408@gmail.com>

What does what does the h stand for in h-tree? Like the b in btree is 
binary Tree


Stephen Samuel wrote:
> The OS will have to search the directory to see if the file already 
> exists before creating it.
>
> Well, if you hash it such that it splits up something like:
> jobid(upper part)/jobid(lower- part)[/-]timestamp-process,
>  you'll find that your access times will be must faster (especially if 
> you don't use H-Trees).  This also applies if  you're just creating a 
> file, because you'll have to search the entire directory to see if 
> that filename exists
>
> With regular directories, searching through them to see if a file 
> already exist increases linearly with the number of entries.  If you 
> hash on 3 levels with 8-bits per level, you'll have to open 2 or 3 
> extra inodes, but you'll cut your directory search times down by a 
> factor of 20000-1.  You'll also skip having to deal with any sort of 
> directory-size limit. (=2^24/256/3)
>
> I did something similar on a Solaris box which had 200000 emails in 
> the /var/spool/mqueue directory. That many messages was slowing the 
> system to a crawl.  I hashed it into 100 directories with 2000  
> entries each,   it sped things up *enormously.*
>
> On Tue, Mar 18, 2008 at 3:56 PM, Andreas Dilger <adilger at sun.com 
> <mailto:adilger at sun.com>> wrote:
>
>     On Mar 17, 2008  09:32 -0400, Theodore Ts'o wrote:
>     > On Mon, Mar 17, 2008 at 03:40:36PM +0800, liuyue wrote:
>     > > Theodore Tso,
>     > >
>     > >     In 64bit system, directory size can not be bigger than 2GB?
>     >
>     > No, because the high 32-bits for i_size are overloaded to store the
>     > directory creation acl.
>
>     I think we should change the code (kernel and e2fsprogs) to allow
>     i_size_high for directories also.
>
>     > In practice, you really don't want to have a directory that huge
>     > anyway.  Iterating through it all with readdir() gets horribly slow,
>     > and applications that try do anything with really huge directories
>     > would be well advised to use a database, because they will get
>     *much*
>     > better performance that way....
>
>     Actually, for many HPC applications they never do readdir at all.
>     The job creates 1 file/process and always uses a predefined filename
>     like {job}-{timestamp}-{process} that it will directly look up.
>
>     Cheers, Andreas
>
>
>
>
> -- 
> Stephen Samuel http://www.bcgreen.com
> 778-861-7641 


From tytso at MIT.EDU  Wed Mar 19 16:01:51 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Wed, 19 Mar 2008 12:01:51 -0400
Subject: The maximum number of files under a folder
In-Reply-To: <47E1040F.5060408@gmail.com>
References: <20080317072152.05DB51369B8@ncic.ac.cn>
	<20080317133207.GB8368@mit.edu>
	<20080318225658.GA2971@webber.adilger.int>
	<6cd50f9f0803182335o45fa23echd8128cd6ddd2216e@mail.gmail.com>
	<47E1040F.5060408@gmail.com>
Message-ID: <20080319160151.GK3158@mit.edu>

On Wed, Mar 19, 2008 at 08:16:15AM -0400, John Nelson wrote:
> What does what does the h stand for in h-tree? Like the b in btree is 
> binary Tree

Hash-tree.  (And the 'b' in btree usually standards for balanced tree).

What we do is we hash the directory name, and use the hashed name to
put into the tree.  For simplicity's sake, we don't do balancing in
ext3's htree implementation.

						- Ted


From ashitpro at yahoo.co.in  Thu Mar 20 10:51:04 2008
From: ashitpro at yahoo.co.in (ashish mahamuni)
Date: Thu, 20 Mar 2008 16:21:04 +0530 (IST)
Subject: How to get device name with device id?
Message-ID: <558394.76006.qm@web94612.mail.in2.yahoo.com>

Hi all,

I want to open a device(/dev/sda1, /dev/hda2 etc) in which my file exists.
I've used 'stat' system call to get the device id.

But now I want the device name from this id(st_dev).
How to get that one?
Or 
Do you have any other method to know the device name where my file resides?

Thanks 


      Bollywood, fun, friendship, sports and more. You name it, we have it on http://in.promos.yahoo.com/groups  


From liuyue at ncic.ac.cn  Thu Mar 20 10:59:59 2008
From: liuyue at ncic.ac.cn (liuyue)
Date: Thu, 20 Mar 2008 18:59:59 +0800
Subject: The maximum number of files under a folder
Message-ID: <20080320104039.F2DFB13692A@ncic.ac.cn>

Thank you all.

Now I find a patch which can extend ext3 subdirectory limit.
 http://osdir.com/ml/file-systems.ext2.devel/2004-12/msg00026.html

======= 2008-03-19 06:56:58 ????????=======

>On Mar 17, 2008  09:32 -0400, Theodore Ts'o wrote:
>> On Mon, Mar 17, 2008 at 03:40:36PM +0800, liuyue wrote:
>> > Theodore Tso,
>> > 
>> > 	In 64bit system, directory size can not be bigger than 2GB?
>> 
>> No, because the high 32-bits for i_size are overloaded to store the
>> directory creation acl.
>
>I think we should change the code (kernel and e2fsprogs) to allow
>i_size_high for directories also.
>
>> In practice, you really don't want to have a directory that huge
>> anyway.  Iterating through it all with readdir() gets horribly slow,
>> and applications that try do anything with really huge directories
>> would be well advised to use a database, because they will get *much*
>> better performance that way....
>
>Actually, for many HPC applications they never do readdir at all.
>The job creates 1 file/process and always uses a predefined filename
>like {job}-{timestamp}-{process} that it will directly look up.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
>
>
>

= = = = = = = = = = = = = = = = = = = =
			

?????????
??
 
				 
????????liuyue
????????liuyue at ncic.ac.cn
??????????2008-03-20


From liuyue at ncic.ac.cn  Thu Mar 20 11:04:51 2008
From: liuyue at ncic.ac.cn (liuyue)
Date: Thu, 20 Mar 2008 19:04:51 +0800
Subject: How to get device name with device id?
Message-ID: <20080320104532.1EB38136845@ncic.ac.cn>

ashish mahamuni,

I guess maybe the following function does what you want.
But it is a kernel function, sorry :(

int __file_to_disk (char * file_name,  char *disk_name) {
        int err = 0;
        struct nameidata        nd;
        struct super_block      * sb;
        struct vfsmount *mnt;
        err = path_lookup(file_name, LOOKUP_FOLLOW, &nd);
        if(err){
                DCFS3_ERROR("error to parse the file name, %s\n", file_name);
                goto exit;
        }
        mnt = nd.mnt;
        sb = mnt->mnt_sb;
        strcpy (disk_name, sb->s_bdev->bd_disk->disk_name);
        path_release(&nd);
exit:
        return err;
}

======= 2008-03-20 19:21:04 ????????=======

>Hi all,
>
>I want to open a device(/dev/sda1, /dev/hda2 etc) in which my file exists.
>I've used 'stat' system call to get the device id.
>
>But now I want the device name from this id(st_dev).
>How to get that one?
>Or 
>Do you have any other method to know the device name where my file resides?
>
>Thanks 
>
>
>      Bollywood, fun, friendship, sports and more. You name it, we have it on http://in.promos.yahoo.com/groups  
>
>
>_______________________________________________
>Ext3-users mailing list
>Ext3-users at redhat.com
>https://www.redhat.com/mailman/listinfo/ext3-users
>
>

= = = = = = = = = = = = = = = = = = = =
			

?????????
??
 
				 
????????liuyue
????????liuyue at ncic.ac.cn
??????????2008-03-20


From ashitpro at yahoo.co.in  Thu Mar 20 11:13:09 2008
From: ashitpro at yahoo.co.in (ashish mahamuni)
Date: Thu, 20 Mar 2008 16:43:09 +0530 (IST)
Subject: How to get device name with device id?
In-Reply-To: <20080320104532.1EB38136845@ncic.ac.cn>
Message-ID: <171853.43026.qm@web94601.mail.in2.yahoo.com>


Can you suggest any other method(in user space) for this?

--- On Thu, 20/3/08, liuyue <liuyue at ncic.ac.cn> wrote:

> From: liuyue <liuyue at ncic.ac.cn>
> Subject: Re: How to get device name with device id?
> To: "ashitpro at yahoo.co.in" <ashitpro at yahoo.co.in>, "ext3-users at redhat.com" <ext3-users at redhat.com>
> Date: Thursday, 20 March, 2008, 4:34 PM
> ashish mahamuni,
> 
> I guess maybe the following function does what you want.
> But it is a kernel function, sorry :(
> 
> int __file_to_disk (char * file_name,  char *disk_name) {
>         int err = 0;
>         struct nameidata        nd;
>         struct super_block      * sb;
>         struct vfsmount *mnt;
>         err = path_lookup(file_name, LOOKUP_FOLLOW,
> &nd);
>         if(err){
>                 DCFS3_ERROR("error to parse the file
> name, %s\n", file_name);
>                 goto exit;
>         }
>         mnt = nd.mnt;
>         sb = mnt->mnt_sb;
>         strcpy (disk_name,
> sb->s_bdev->bd_disk->disk_name);
>         path_release(&nd);
> exit:
>         return err;
> }
> 
> ======= 2008-03-20 19:21:04
> ????????=======
> 
> >Hi all,
> >
> >I want to open a device(/dev/sda1, /dev/hda2 etc) in
> which my file exists.
> >I've used 'stat' system call to get the
> device id.
> >
> >But now I want the device name from this id(st_dev).
> >How to get that one?
> >Or 
> >Do you have any other method to know the device name
> where my file resides?
> >
> >Thanks 
> >
> >
> >      Bollywood, fun, friendship, sports and more. You
> name it, we have it on http://in.promos.yahoo.com/groups  
> >
> >
> >_______________________________________________
> >Ext3-users mailing list
> >Ext3-users at redhat.com
> >https://www.redhat.com/mailman/listinfo/ext3-users
> >
> >
> 
> = = = = = = = = = = = = = = = = = = = =
> 			
> 
> ?????????
> ??
>  
> 				 
> ????????liuyue
> ????????liuyue at ncic.ac.cn
> ??????????2008-03-20


      Chat on a cool, new interface. No download required. Go to http://in.messenger.yahoo.com/webmessengerpromo.php


From tytso at MIT.EDU  Thu Mar 20 11:28:49 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Thu, 20 Mar 2008 07:28:49 -0400
Subject: The maximum number of files under a folder
In-Reply-To: <20080320104039.F2DFB13692A@ncic.ac.cn>
References: <20080320104039.F2DFB13692A@ncic.ac.cn>
Message-ID: <20080320112849.GU3158@mit.edu>

On Thu, Mar 20, 2008 at 06:59:59PM +0800, liuyue wrote:
> Thank you all.
> 
> Now I find a patch which can extend ext3 subdirectory limit.
>  http://osdir.com/ml/file-systems.ext2.devel/2004-12/msg00026.html

That's *subdirectories*, not files.  The maximum number of files per
directory are basically limited as discussed in this thread.  The
number of subdirectories was limited by the 16-bit i_nlink field.
Andreas' idea for extending this limit, as described above, is in
ext4.

Regards,

					- Ted


From ashitpro at yahoo.co.in  Fri Mar 21 12:16:57 2008
From: ashitpro at yahoo.co.in (ashish mahamuni)
Date: Fri, 21 Mar 2008 17:46:57 +0530 (IST)
Subject: Unable to change the 'name' field from 'ext3_dir_entry_2' structure.
Message-ID: <688934.98172.qm@web94601.mail.in2.yahoo.com>

Hello everybody,

I am trying to rename the file/directory by renaming the 'name' field from ext3_dir_entry_2 structure.

I can easily do it for directories.

I am reading the structure then I change this field, and writing it back as it is.

New file name length will be similar as the old(just for simplicity).  

But whenever I do this for file. It doesn't do any thing.

'write' sys call gets execute properly. Next time if I read dir entry for this file it shows me older one.

Am I doing anything wrong? 


      Chat on a cool, new interface. No download required. Go to http://in.messenger.yahoo.com/webmessengerpromo.php


From tytso at MIT.EDU  Fri Mar 21 12:38:46 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Fri, 21 Mar 2008 08:38:46 -0400
Subject: Unable to change the 'name' field from 'ext3_dir_entry_2'
	structure.
In-Reply-To: <688934.98172.qm@web94601.mail.in2.yahoo.com>
References: <688934.98172.qm@web94601.mail.in2.yahoo.com>
Message-ID: <20080321123846.GF7991@mit.edu>

On Fri, Mar 21, 2008 at 05:46:57PM +0530, ashish mahamuni wrote:
> Hello everybody,
> 
> I am trying to rename the file/directory by renaming the 'name' field from ext3_dir_entry_2 structure.
> 
> I can easily do it for directories.
> 
> I am reading the structure then I change this field, and writing it back as it is.
> 
> New file name length will be similar as the old(just for simplicity).  
> 
> But whenever I do this for file. It doesn't do any thing.
> 
> 'write' sys call gets execute properly. Next time if I read dir entry for this file it shows me older one.
> 
> Am I doing anything wrong? 

#1.  *Why* are you trying to do this?

#2.  Are you doing this on an unmounted filesystem?  Or is the
     filesystem mounted when you tried to modify the filesystem directly
     using the write system call?

							- Ted


From htmldeveloper at gmail.com  Sat Mar 22 04:39:32 2008
From: htmldeveloper at gmail.com (Peter Teoh)
Date: Sat, 22 Mar 2008 12:39:32 +0800
Subject: "Write once only but read many" filesystem
In-Reply-To: <20080314232403.GI3542@webber.adilger.int>
References: <804dabb00803140917o2abebd2dh12c77b21a48094c4@mail.gmail.com>
	<20080314232403.GI3542@webber.adilger.int>
Message-ID: <47E48D84.7070701@gmail.com>


For reasons of auditability/accountability, I would like a filesystem 
such that I can write to it only ONCE, subsequently not 
modifiable/deletable, but always readable.   Kind of a database journal 
logs - it is continuously being written, sequentiall appending, but not 
circular buffer based, so that upon running out of space, logging will 
be paused in memory, and after new storage devices added to it, it will 
continue to flush out whatever is outstanding in memory.

Can ext3 / ext4 or current jbd2 be easily configured to serve this purpose?

Thanks.


From ashitpro at yahoo.co.in  Sat Mar 22 07:47:04 2008
From: ashitpro at yahoo.co.in (ashish mahamuni)
Date: Sat, 22 Mar 2008 13:17:04 +0530 (IST)
Subject: Unable to change the 'name' field from 'ext3_dir_entry_2'
	structure.
In-Reply-To: <20080321123846.GF7991@mit.edu>
Message-ID: <464689.50372.qm@web94615.mail.in2.yahoo.com>

1: I am trying to write a tool to hide a file/directory.
   So I am changing the 'name' field to NULL.
   Directories get hide properly. But nothing for file(Unable to change the  'name' field)

2: Of course filesystem is mounted.   


--- On Fri, 21/3/08, Theodore Tso <tytso at MIT.EDU> wrote:

> From: Theodore Tso <tytso at MIT.EDU>
> Subject: Re: Unable to change the 'name' field from 'ext3_dir_entry_2' structure.
> To: "ashish mahamuni" <ashitpro at yahoo.co.in>
> Cc: ext3-users at redhat.com
> Date: Friday, 21 March, 2008, 6:08 PM
> On Fri, Mar 21, 2008 at 05:46:57PM +0530, ashish mahamuni
> wrote:
> > Hello everybody,
> > 
> > I am trying to rename the file/directory by renaming
> the 'name' field from ext3_dir_entry_2 structure.
> > 
> > I can easily do it for directories.
> > 
> > I am reading the structure then I change this field,
> and writing it back as it is.
> > 
> > New file name length will be similar as the old(just
> for simplicity).  
> > 
> > But whenever I do this for file. It doesn't do any
> thing.
> > 
> > 'write' sys call gets execute properly. Next
> time if I read dir entry for this file it shows me older
> one.
> > 
> > Am I doing anything wrong? 
> 
> #1.  *Why* are you trying to do this?
> 
> #2.  Are you doing this on an unmounted filesystem?  Or is
> the
>      filesystem mounted when you tried to modify the
> filesystem directly
>      using the write system call?
> 
> 							- Ted


      Unlimited freedom, unlimited storage. Get it now, on http://help.yahoo.com/l/in/yahoo/mail/yahoomail/tools/tools-08.html/


From tytso at MIT.EDU  Sat Mar 22 12:29:33 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Sat, 22 Mar 2008 08:29:33 -0400
Subject: Unable to change the 'name' field from 'ext3_dir_entry_2'
	structure.
In-Reply-To: <464689.50372.qm@web94615.mail.in2.yahoo.com>
References: <20080321123846.GF7991@mit.edu>
	<464689.50372.qm@web94615.mail.in2.yahoo.com>
Message-ID: <20080322122933.GQ7991@mit.edu>

On Sat, Mar 22, 2008 at 01:17:04PM +0530, ashish mahamuni wrote:
> 1: I am trying to write a tool to hide a file/directory.
>    So I am changing the 'name' field to NULL.
>    Directories get hide properly. But nothing for file(Unable to change the  'name' field)

So you're deliberately corrupting the filesystem.  This wouldn't be
for some university class assignment, would it?

> 2: Of course filesystem is mounted.   

Well, there's your problem.  The name is cached in the kernel's dentry
cache.  It won't necessarily work for directories, either, BTW.  I
think you've just been getting lucky.

					- Ted


From htmldeveloper at gmail.com  Sat Mar 22 15:55:53 2008
From: htmldeveloper at gmail.com (Peter Teoh)
Date: Sat, 22 Mar 2008 23:55:53 +0800
Subject: "Write once only but read many" filesystem
In-Reply-To: <20080322150626.GB19347@logfs.org>
References: <804dabb00803140917o2abebd2dh12c77b21a48094c4@mail.gmail.com>
	<20080314232403.GI3542@webber.adilger.int>
	<47E48D84.7070701@gmail.com> <20080322102331.GA19347@logfs.org>
	<804dabb00803220752h670757d8o9c1b7fa3696467bc@mail.gmail.com>
	<20080322150626.GB19347@logfs.org>
Message-ID: <804dabb00803220855q1aa41fc7mc30c7ce7951fe98@mail.gmail.com>

Thank you for your reply :-).

On Sat, Mar 22, 2008 at 11:06 PM, J?rn Engel <joern at logfs.org> wrote:
> On Sat, 22 March 2008 22:52:12 +0800, Peter Teoh wrote:
>  >
>  > what are the difference in terms of final features provided by these
>  > two different filesystem?   what is this "garbage collection"?   u
>  > still have features like creating different directories, and creating
>  > different files, and writing the files?   How about setting the file
>  > attributes...it should be set before writing right (so that after
>  > writing and handle is closed it becomes permanently not
>  > modifiable)..but creating a subdirectory below the current dir should
>  > be possible right (even after closing the previous directory)?
>
>  Your requirements aren't quite clear to me.  Do you want the complete
>  filesystem to be read-only after being written once?

YES....

>   Or do you want individual files/directories to be immutable - chattr?

chattr is not good enough, as root can still modify it.   So if
current feature is not there, then some small development may be
needed.

>  And in either case, what problem do you want to solve with a read-only filesystem?

Simple:   i want to record down everything that a user does, or a
database does, or any applications running - just record down its
state permanently securely into the filesystem, knowing that for sure,
there is not way to modify the data, short of recreating the
filesystem again.    Sound logical?   Or is there any loophole in this
concept?

In summary, are there any strong demand for such a concept/filesystem?
  I may take the plunge to implementing it, if justfiable and
everybody is interested..:-)...

-- 
Regards,
Peter Teoh


From ashitpro at yahoo.co.in  Sun Mar 23 18:13:02 2008
From: ashitpro at yahoo.co.in (ashish mahamuni)
Date: Sun, 23 Mar 2008 23:43:02 +0530 (IST)
Subject: Unable to change the 'name' field from 'ext3_dir_entry_2'
	structure.
In-Reply-To: <20080322122933.GQ7991@mit.edu>
Message-ID: <707992.29162.qm@web94603.mail.in2.yahoo.com>


ok..
I'll find some other way to hide the file/directory..
Can you suggest me the better and secure way to modify the dentry?
I mean, which one should I modify? On disk structure or kernel cache(I guess this is what we called as memory data structure).
Certainly this question is not only for dentry. The case should  be common while modifying other data structures also.

 
--- On Sat, 22/3/08, Theodore Tso <tytso at MIT.EDU> wrote:

> From: Theodore Tso <tytso at MIT.EDU>
> Subject: Re: Unable to change the 'name' field from 'ext3_dir_entry_2' structure.
> To: "ashish mahamuni" <ashitpro at yahoo.co.in>
> Cc: ext3-users at redhat.com
> Date: Saturday, 22 March, 2008, 5:59 PM
> On Sat, Mar 22, 2008 at 01:17:04PM +0530, ashish mahamuni
> wrote:
> > 1: I am trying to write a tool to hide a
> file/directory.
> >    So I am changing the 'name' field to NULL.
> >    Directories get hide properly. But nothing for
> file(Unable to change the  'name' field)
> 
> So you're deliberately corrupting the filesystem.  This
> wouldn't be
> for some university class assignment, would it?
> 
> > 2: Of course filesystem is mounted.   
> 
> Well, there's your problem.  The name is cached in the
> kernel's dentry
> cache.  It won't necessarily work for directories,
> either, BTW.  I
> think you've just been getting lucky.
> 
> 					- Ted


      Save all your chat conversations. Find them online at http://in.messenger.yahoo.com/webmessengerpromo.php


From tytso at MIT.EDU  Mon Mar 24 00:19:16 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Sun, 23 Mar 2008 20:19:16 -0400
Subject: Unable to change the 'name' field from 'ext3_dir_entry_2'
	structure.
In-Reply-To: <707992.29162.qm@web94603.mail.in2.yahoo.com>
References: <20080322122933.GQ7991@mit.edu>
	<707992.29162.qm@web94603.mail.in2.yahoo.com>
Message-ID: <20080324001916.GC24943@mit.edu>

On Sun, Mar 23, 2008 at 11:43:02PM +0530, ashish mahamuni wrote:
> 
> ok..
> I'll find some other way to hide the file/directory..
> Can you suggest me the better and secure way to modify the dentry?
> I mean, which one should I modify? On disk structure or kernel cache(I guess this is what we called as memory data structure).
> Certainly this question is not only for dentry. The case should  be common while modifying other data structures also.

So what's the high level problem?  *Why* are you trying to hide file
names or directories?

I repeat, is this for a university problem set or project?

Or is there a practical real-life use for it.  If so, *what* is the
practical real-life use?  What are you trying accomplish at the high
level, and why is it useful to try to hide filenames or directories?

Is this for a root kit, where you are trying to write malware?  

   	      	   	      	      	     - Ted


From scott.lovenberg at gmail.com  Mon Mar 24 04:49:17 2008
From: scott.lovenberg at gmail.com (Scott Lovenberg)
Date: Mon, 24 Mar 2008 00:49:17 -0400
Subject: "Write once only but read many" filesystem
In-Reply-To: <20080322165906.GC19347@logfs.org>
References: <804dabb00803140917o2abebd2dh12c77b21a48094c4@mail.gmail.com>
	<20080314232403.GI3542@webber.adilger.int>
	<47E48D84.7070701@gmail.com> <20080322102331.GA19347@logfs.org>
	<804dabb00803220752h670757d8o9c1b7fa3696467bc@mail.gmail.com>
	<20080322150626.GB19347@logfs.org>
	<804dabb00803220855q1aa41fc7mc30c7ce7951fe98@mail.gmail.com>
	<20080322165906.GC19347@logfs.org>
Message-ID: <47E732CD.3070202@gmail.com>

J?rn Engel wrote:
> On Sat, 22 March 2008 23:55:53 +0800, Peter Teoh wrote:
>>>   Or do you want individual files/directories to be immutable - chattr?
>> chattr is not good enough, as root can still modify it.   So if
>> current feature is not there, then some small development may be
>> needed.
>>
>>>  And in either case, what problem do you want to solve with a read-only filesystem?
>> Simple:   i want to record down everything that a user does, or a
>> database does, or any applications running - just record down its
>> state permanently securely into the filesystem, knowing that for sure,
>> there is not way to modify the data, short of recreating the
>> filesystem again.    Sound logical?   Or is there any loophole in this
>> concept?
> 
> The loophole is called root.  In a normal setup, root can do anything,
> including writing directly to the device your filesystem resides in,
> writing to kernel memory, etc.
> 
> It may be rather inconvenient to change a filesystem by writing to the
> block device, but far from impossible.  If you want to make such changes
> impossible, you are facing an uphill battle that I personally don't care
> about.  And if inconvenience is good enough, wouldn't chattr be
> sufficiently inconvenient?
> 
> J?rn
> 

How about mounting an isofs via loopback?  This has the added benefit of 
being ready to be exported to disc.  You can make it with mkisofs on a 
directory structure and mount it to the tree with a normal mount(1).  If 
it asks for fs type on mount, I think its 'iso9660'.


From htmldeveloper at gmail.com  Mon Mar 24 06:35:46 2008
From: htmldeveloper at gmail.com (Peter Teoh)
Date: Mon, 24 Mar 2008 14:35:46 +0800
Subject: "Write once only but read many" filesystem
In-Reply-To: <47E732CD.3070202@gmail.com>
References: <804dabb00803140917o2abebd2dh12c77b21a48094c4@mail.gmail.com>
	<20080314232403.GI3542@webber.adilger.int>
	<47E48D84.7070701@gmail.com> <20080322102331.GA19347@logfs.org>
	<804dabb00803220752h670757d8o9c1b7fa3696467bc@mail.gmail.com>
	<20080322150626.GB19347@logfs.org>
	<804dabb00803220855q1aa41fc7mc30c7ce7951fe98@mail.gmail.com>
	<20080322165906.GC19347@logfs.org> <47E732CD.3070202@gmail.com>
Message-ID: <47E74BC2.7040408@gmail.com>

An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20080324/2c36f5e7/attachment.htm>

From ashitpro at yahoo.co.in  Mon Mar 24 06:42:57 2008
From: ashitpro at yahoo.co.in (ashish mahamuni)
Date: Mon, 24 Mar 2008 12:12:57 +0530 (IST)
Subject: Unable to change the 'name' field from 'ext3_dir_entry_2'
	structure.
In-Reply-To: <20080324001916.GC24943@mit.edu>
Message-ID: <888119.83086.qm@web94606.mail.in2.yahoo.com>

Oh sir,
This is not any university problem set or project.
It really dont have any practical real-life use.
This is not a root kit or any malware.

I just want to learn the file system(ext2/ext3).
I know there are number of books on filesystem,but my way of learning is bit different. I don't like thearotical ways. I like practical implementions. 
So I thought why not to start with some little tool like hiding file.
If you don't like my idea,then suggest me somthing different which has some practical use.

Thanks
Ashish


--- On Mon, 24/3/08, Theodore Tso <tytso at MIT.EDU> wrote:

> From: Theodore Tso <tytso at MIT.EDU>
> Subject: Re: Unable to change the 'name' field from 'ext3_dir_entry_2' structure.
> To: "ashish mahamuni" <ashitpro at yahoo.co.in>
> Cc: ext3-users at redhat.com
> Date: Monday, 24 March, 2008, 5:49 AM
> On Sun, Mar 23, 2008 at 11:43:02PM +0530, ashish mahamuni
> wrote:
> > 
> > ok..
> > I'll find some other way to hide the
> file/directory..
> > Can you suggest me the better and secure way to modify
> the dentry?
> > I mean, which one should I modify? On disk structure
> or kernel cache(I guess this is what we called as memory
> data structure).
> > Certainly this question is not only for dentry. The
> case should  be common while modifying other data
> structures also.
> 
> So what's the high level problem?  *Why* are you trying
> to hide file
> names or directories?
> 
> I repeat, is this for a university problem set or project?
> 
> Or is there a practical real-life use for it.  If so,
> *what* is the
> practical real-life use?  What are you trying accomplish at
> the high
> level, and why is it useful to try to hide filenames or
> directories?
> 
> Is this for a root kit, where you are trying to write
> malware?  
> 
>    	      	   	      	      	     - Ted


      Did you know? You can CHAT without downloading messenger. Go to http://in.messenger.yahoo.com/webmessengerpromo.php/ 


From articpenguin3800 at gmail.com  Mon Mar 24 19:48:04 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Mon, 24 Mar 2008 15:48:04 -0400
Subject: resize2fs
Message-ID: <47E80574.3090400@gmail.com>

hi
Why does resize2fs have to scan the whole partition when expanding? it 
dosent do this when it shrinks


From tytso at MIT.EDU  Mon Mar 24 22:23:08 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Mon, 24 Mar 2008 18:23:08 -0400
Subject: resize2fs
In-Reply-To: <47E80574.3090400@gmail.com>
References: <47E80574.3090400@gmail.com>
Message-ID: <20080324222308.GD30110@mit.edu>

On Mon, Mar 24, 2008 at 03:48:04PM -0400, John Nelson wrote:
> hi
> Why does resize2fs have to scan the whole partition when expanding? it 
> dosent do this when it shrinks

Resize2fs sometimes, when either expanding or shrinking a partition,
will need to scan the inode table so it can move blocks.  It may need
to do this if it is shrinking a partition, and there are files which
are using blocks at the end of partition which will no longer be
available at the end of the srhink operation, so it needs to scan the
inode tables to determine which inodes need to be updated as part of
moving the data blocks.

When resize2fs is expanding the filesystem, if the filesystem grows
enough that more blocks need to be reserved for the block group
descriptors, then similarly it will need to scan the inode table to
determine which inodes will need to be updated when moving blocks out
of the way so the block group descriptors can be expanded.

Regards,

							- Ted


From sebastia at l00-bugdead-prods.de  Mon Mar 31 06:36:45 2008
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Mon, 31 Mar 2008 08:36:45 +0200
Subject: with dir_index ls is slower than without?
Message-ID: <20080331063645.F1A3AD13DA@smtp.l00-bugdead-prods.de>

Hi,

I try to tune a ext3 filesystem, as I've heard, that when enabling dir_index 
option, then ls -l or find will be a lot faster than before. So I did.

I created 2 partition on the harddisc, each 20GB:

installhost2:~ # fdisk -l /dev/sda

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1         192     1542208+  82  Linux swap / Solaris
/dev/sda2   *         193        2803    20972857+  83  Linux
/dev/sda3            2804        5236    19543072+  83  Linux
/dev/sda4            5237        7669    19543072+  83  Linux

/dev/sda3 was formatted with the dir_index option enabled, /dev/sda4 with 
dir_index disabled:

installhost2:/ # tune2fs -l /dev/sda3
tune2fs 1.38 (30-Jun-2005)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          d90ccbb9-f45a-4304-87d8-805fce775c23
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal dir_index filetype needs_recovery 
sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2443200
Block count:              4885768
Reserved block count:     244288
Free blocks:              4273422
Free inodes:              1943188
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16288
Inode blocks per group:   509
Filesystem created:       Thu Mar 27 17:14:40 2008
Last mount time:          Fri Mar 28 11:39:47 2008
Last write time:          Fri Mar 28 11:39:47 2008
Mount count:              7
Maximum mount count:      28
Last checked:             Thu Mar 27 17:14:40 2008
Check interval:           15552000 (6 months)
Next check after:         Tue Sep 23 18:14:40 2008
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      306a3c58-3cbb-4f4a-856a-e48ae3006a07
Journal backup:           inode blocks

installhost2:/ # tune2fs -l /dev/sda4
tune2fs 1.38 (30-Jun-2005)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          2bb124a4-f7c7-4cac-b0c1-16aa8afc67eb
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal filetype needs_recovery sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2443200
Block count:              4885768
Reserved block count:     244288
Free blocks:              4274331
Free inodes:              1943188
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16288
Inode blocks per group:   509
Filesystem created:       Thu Mar 27 17:15:03 2008
Last mount time:          Fri Mar 28 11:39:47 2008
Last write time:          Fri Mar 28 11:39:47 2008
Mount count:              7
Maximum mount count:      23
Last checked:             Thu Mar 27 17:15:03 2008
Check interval:           15552000 (6 months)
Next check after:         Tue Sep 23 18:15:03 2008
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      1cfc2290-e289-4c49-a57f-2b2e3b9e91c4
Journal backup:           inode blocks

The partitions are mounted: 
/dev/sda3 on /mnt/index type ext3 (rw)
/dev/sda4 on /mnt/noindex type ext3 (rw)

If I create 500000 files, each 1kB from /dev/urandom, the ls -la command 
needs 
a lot of time on the partition with dir_index enabled (the wc -l is to 
eleminate the slow terminal :), the files were created on one partition and 
rsynced to the other: 

installhost2:~ # time ls -la /mnt/index/ | wc -l
 500005
 
 real 2m41.015s
 user 0m4.568s
 sys 0m6.520s
 
 
 installhost2:~ # time ls -la /mnt/noindex/ | wc -l
 500005
 
 real 0m10.792s
 user 0m3.172s
 sys 0m6.000s

I expected the dir_index should speedup this a little bit?
I assume I'm still missing sth?

I'm on SLES10sp1, kernel 2.6.16.46 x86_64.

kind regards
Sebastian


From niko at petole.dyndns.org  Mon Mar 31 08:36:46 2008
From: niko at petole.dyndns.org (Nicolas KOWALSKI)
Date: Mon, 31 Mar 2008 10:36:46 +0200
Subject: with dir_index ls is slower than without?
In-Reply-To: <20080331063645.F1A3AD13DA@smtp.l00-bugdead-prods.de>
References: <20080331063645.F1A3AD13DA@smtp.l00-bugdead-prods.de>
Message-ID: <874panl275.fsf@petole.dyndns.org>

"Sebastian Reitenbach" <sebastia at l00-bugdead-prods.de> writes:

> installhost2:~ # time ls -la /mnt/index/ | wc -l
>  500005
>  
>  real 2m41.015s
>  user 0m4.568s
>  sys 0m6.520s
>  
>  
>  installhost2:~ # time ls -la /mnt/noindex/ | wc -l
>  500005
>  
>  real 0m10.792s
>  user 0m3.172s
>  sys 0m6.000s
>
> I expected the dir_index should speedup this a little bit?
> I assume I'm still missing sth?

I think the point of dir_index is "only" to quickly find in a large
directory a file when you _already_ have its name.

The performance of listing is not its purpose, and as you noted it,
even makes performance worse.

-- 
Nicolas


From sebastia at l00-bugdead-prods.de  Mon Mar 31 11:18:06 2008
From: sebastia at l00-bugdead-prods.de (Sebastian Reitenbach)
Date: Mon, 31 Mar 2008 13:18:06 +0200
Subject: with dir_index ls is slower than without?
Message-ID: <20080331111807.6A293D148D@smtp.l00-bugdead-prods.de>

Hi Nicolas,

Nicolas KOWALSKI <niko at petole.dyndns.org> wrote: 
> "Sebastian Reitenbach" <sebastia at l00-bugdead-prods.de> writes:
> 
> > installhost2:~ # time ls -la /mnt/index/ | wc -l
> >  500005
> >  
> >  real 2m41.015s
> >  user 0m4.568s
> >  sys 0m6.520s
> >  
> >  
> >  installhost2:~ # time ls -la /mnt/noindex/ | wc -l
> >  500005
> >  
> >  real 0m10.792s
> >  user 0m3.172s
> >  sys 0m6.000s
> >
> > I expected the dir_index should speedup this a little bit?
> > I assume I'm still missing sth?
> 
> I think the point of dir_index is "only" to quickly find in a large
> directory a file when you _already_ have its name.
> 
> The performance of listing is not its purpose, and as you noted it,
> even makes performance worse.

ah, that would explain what I've seen here. 

after reading your answer, I found this older mail in the archives:
http://osdir.com/ml/file-systems.ext3.user/2004-09/msg00029.html

So everything seems to depend on how the application is using the
filesystem. 
Picking a single given file might be faster than with a plain ext3, but 
scanning and opening all files in a directory might become slower. I wanted 
to use the dir_index for some partitions, like for cyrus imap server, and 
for some other applications. I think I have to benchmark the applications, 
to see whether they get a speed gain of the dir_index or not.

kind regards
Sebastian
>