From criley at erad.com  Tue Sep  1 17:20:32 2009
From: criley at erad.com (Charles Riley)
Date: Tue, 1 Sep 2009 13:20:32 -0400 (EDT)
Subject: How many files can I have safely in a subdirectory?
In-Reply-To: <10290235.17241251825434412.JavaMail.root@boardwalk2.erad.com>
Message-ID: <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com>

Greetings,

I'm not sure if it's still the case, but there used to be a limit to how many subdirectories a directory can have.  32k, to be exact.  We ended up creating our own (application level) directory hashing algorithm to work around it several years ago.  This might only be a kernel 2.4 thing though.

I'm unaware of any limit to number of files.  However once the number of files in a directory gets above about 64k, filesystem performance will significantly decrease unless the filesystem has the dir_index option.  dir_index can be specified at filesystem creation or added later using tune2fs (an fsck is required).

Charles

Charles Riley
eRAD, Inc.



----- Original Message -----
From: "z0diac" <web2009 at zeroreality.com>
To: ext3-users at redhat.com
Sent: Sunday, August 30, 2009 12:00:12 PM GMT -05:00 US/Canada Eastern
Subject: How many files can I have safely in a subdirectory?


Ok, I'm running a vBulletin forum (3.8.4) and found that all user attachments
go into 1 single directory for each user. For each attached file in the
forum, there's 2 files on disk (*.attach and *.thumb), for pictures that is.

One user already has over 100,000 attachements, thus, over 200,000 files in
his attach directory.

Someone recently told me to 'keep an eye on it' because certain setups can't
hold more than X number of files in a single directory. Yet someone else
said I could have over 1 trillion files in a single directory if the HDD was
large enough...

Here's my setup:

linux version: 2.6.18-92.1.10.el5
php: 5.1.6
mySQL: 5.0.45
File system: ext3

vB support has told me any limitation there might be, will not be the result
of vB, so now I'm looking at either Linux and the way it handles files, or
the ext3 file system.

Does anyone know if I can just keep going with putting files into one
directory? (there will be over 1million probably by year's end. Hopefully
not more than 5 million ever).

And, will having so many files in a single directory cause any performance
problems? (ie: slowdowns)

My only option is to hire a coder to somehow have it split the 1M+ files
into several subdirs, say 50,000 per subdir. But even though it's messy, if
it really doesn't make a difference in the end whether they're in 50
subdirs, or just 1 dir, then I won't bother (and can sigh a breath of
relief)


Thanks in advance!!
z0diac is offline   	
Looking for Linux Hosting? Click Here.

-- 
View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html
Sent from the Ext3 - User mailing list archive at Nabble.com.

_______________________________________________
Ext3-users mailing list
Ext3-users at redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users



From darkonc at gmail.com  Tue Sep  1 17:50:31 2009
From: darkonc at gmail.com (Stephen Samuel (gmail))
Date: Tue, 1 Sep 2009 10:50:31 -0700
Subject: How many files can I have safely in a subdirectory?
In-Reply-To: <25212801.post@talk.nabble.com>
References: <25212801.post@talk.nabble.com>
Message-ID: <6cd50f9f0909011050h6e71b754g67d065b68b54c3df@mail.gmail.com>

Well, if you presume the possibility of running into bugs when the
directory gets over 2GB,
and directory entries averaging  under 20 bytes, then you might see a
problem at around
100million entries.
You can probably expect performance issues before that point.
If you expect these directories to keep growing year after year,
you might want to consider doing directory hashing...
If nothing else, it could get ugly if someone decides to do an 'ls'
on a directory with 10million entries.


On Sun, Aug 30, 2009 at 9:00 AM, z0diac<web2009 at zeroreality.com> wrote:
>
> Ok, I'm running a vBulletin forum (3.8.4) and found that all user attachments
> go into 1 single directory for each user. For each attached file in the
> forum, there's 2 files on disk (*.attach and *.thumb), for pictures that is.
>
> One user already has over 100,000 attachements, thus, over 200,000 files in
> his attach directory.
>
> Someone recently told me to 'keep an eye on it' because certain setups can't
> hold more than X number of files in a single directory. Yet someone else
> said I could have over 1 trillion files in a single directory if the HDD was
> large enough...
>
> Here's my setup:
>
> linux version: 2.6.18-92.1.10.el5
> php: 5.1.6
> mySQL: 5.0.45
> File system: ext3
>
> vB support has told me any limitation there might be, will not be the result
> of vB, so now I'm looking at either Linux and the way it handles files, or
> the ext3 file system.
>
> Does anyone know if I can just keep going with putting files into one
> directory? (there will be over 1million probably by year's end. Hopefully
> not more than 5 million ever).
>
> And, will having so many files in a single directory cause any performance
> problems? (ie: slowdowns)
>
> My only option is to hire a coder to somehow have it split the 1M+ files
> into several subdirs, say 50,000 per subdir. But even though it's messy, if
> it really doesn't make a difference in the end whether they're in 50
> subdirs, or just 1 dir, then I won't bother (and can sigh a breath of
> relief)
>
>
> Thanks in advance!!
> z0diac is offline
> Looking for Linux Hosting? Click Here.
>
> --
> View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html
> Sent from the Ext3 - User mailing list archive at Nabble.com.
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>



-- 
Stephen Samuel http://www.bcgreen.com  Software, like love,
778-861-7641                              grows when you give it away



From adilger at sun.com  Tue Sep  1 17:54:50 2009
From: adilger at sun.com (Andreas Dilger)
Date: Tue, 01 Sep 2009 11:54:50 -0600
Subject: How many files can I have safely in a subdirectory?
In-Reply-To: <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com>
References: <10290235.17241251825434412.JavaMail.root@boardwalk2.erad.com>
	<31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com>
Message-ID: <20090901175450.GR4197@webber.adilger.int>

On Sep 01, 2009  13:20 -0400, Charles Riley wrote:
> I'm not sure if it's still the case, but there used to be a limit
> to how many subdirectories a directory can have.  32k, to be exact.
> We ended up creating our own (application level) directory hashing
> algorithm to work around it several years ago.  This might only be a
> kernel 2.4 thing though.

This is true for ext3 (max 32000 subdirectories), but in ext4 there
is no specific limit on the number of subdirectories.  The subdirectory
limit is the same as the number of entries in the directory.

> I'm unaware of any limit to number of files.  However once the number
> of files in a directory gets above about 64k, filesystem performance will
> significantly decrease unless the filesystem has the dir_index option.
> dir_index can be specified at filesystem creation or added later using
> tune2fs (an fsck is required).

If formatted with dir_index (which is the default for newer mke2fs for
some time now) we tested up to 10M files in a single directory on a
regular basis.  The maximum limit depends on the filename length, but
is somewhere around 15-20M for "short" filenames (e.g. 32 characters
or less).


> ----- Original Message -----
> From: "z0diac" <web2009 at zeroreality.com>
> To: ext3-users at redhat.com
> Sent: Sunday, August 30, 2009 12:00:12 PM GMT -05:00 US/Canada Eastern
> Subject: How many files can I have safely in a subdirectory?
> 
> 
> Ok, I'm running a vBulletin forum (3.8.4) and found that all user attachments
> go into 1 single directory for each user. For each attached file in the
> forum, there's 2 files on disk (*.attach and *.thumb), for pictures that is.
> 
> One user already has over 100,000 attachements, thus, over 200,000 files in
> his attach directory.
> 
> Someone recently told me to 'keep an eye on it' because certain setups can't
> hold more than X number of files in a single directory. Yet someone else
> said I could have over 1 trillion files in a single directory if the HDD was
> large enough...
> 
> Here's my setup:
> 
> linux version: 2.6.18-92.1.10.el5
> php: 5.1.6
> mySQL: 5.0.45
> File system: ext3
> 
> vB support has told me any limitation there might be, will not be the result
> of vB, so now I'm looking at either Linux and the way it handles files, or
> the ext3 file system.
> 
> Does anyone know if I can just keep going with putting files into one
> directory? (there will be over 1million probably by year's end. Hopefully
> not more than 5 million ever).
> 
> And, will having so many files in a single directory cause any performance
> problems? (ie: slowdowns)
> 
> My only option is to hire a coder to somehow have it split the 1M+ files
> into several subdirs, say 50,000 per subdir. But even though it's messy, if
> it really doesn't make a difference in the end whether they're in 50
> subdirs, or just 1 dir, then I won't bother (and can sigh a breath of
> relief)
> 
> 
> Thanks in advance!!
> z0diac is offline   	
> Looking for Linux Hosting? Click Here.
> 
> -- 
> View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html
> Sent from the Ext3 - User mailing list archive at Nabble.com.
> 
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
> 
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



From bruno at wolff.to  Tue Sep  1 21:00:24 2009
From: bruno at wolff.to (Bruno Wolff III)
Date: Tue, 1 Sep 2009 16:00:24 -0500
Subject: How many files can I have safely in a subdirectory?
In-Reply-To: <25212801.post@talk.nabble.com>
References: <25212801.post@talk.nabble.com>
Message-ID: <20090901210024.GA26393@wolff.to>

On Sun, Aug 30, 2009 at 09:00:12 -0700,
  z0diac <web2009 at zeroreality.com> wrote:
> 
> Someone recently told me to 'keep an eye on it' because certain setups can't
> hold more than X number of files in a single directory. Yet someone else
> said I could have over 1 trillion files in a single directory if the HDD was
> large enough...

When I have directories in the few million range doing mass changes gets
extremely slow.

Besides some the other other things mentioned, you need to worry about the
inode limit on the file system. The default now, is lower than it used to be.
This bit me once when I was moving a directory with lots of files to another
system with a similar size partition when i was expecting a similar inode
limit.



From web2009 at zeroreality.com  Tue Sep  1 21:13:50 2009
From: web2009 at zeroreality.com (z0diac)
Date: Tue, 1 Sep 2009 14:13:50 -0700 (PDT)
Subject: How many files can I have safely in a subdirectory?
In-Reply-To: <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com>
References: <25212801.post@talk.nabble.com>
	<31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com>
Message-ID: <25244663.post@talk.nabble.com>


Thanks!  And thanks to all who have replied to this thread!  I will see if I
can get dir_index active.



Bugzilla from criley at erad.com wrote:
> 
> Greetings,
> 
> I'm not sure if it's still the case, but there used to be a limit to how
> many subdirectories a directory can have.  32k, to be exact.  We ended up
> creating our own (application level) directory hashing algorithm to work
> around it several years ago.  This might only be a kernel 2.4 thing
> though.
> 
> I'm unaware of any limit to number of files.  However once the number of
> files in a directory gets above about 64k, filesystem performance will
> significantly decrease unless the filesystem has the dir_index option. 
> dir_index can be specified at filesystem creation or added later using
> tune2fs (an fsck is required).
> 
> Charles
> 
> Charles Riley
> eRAD, Inc.
> 
> 
> 
> ----- Original Message -----
> From: "z0diac" <web2009 at zeroreality.com>
> To: ext3-users at redhat.com
> Sent: Sunday, August 30, 2009 12:00:12 PM GMT -05:00 US/Canada Eastern
> Subject: How many files can I have safely in a subdirectory?
> 
> 
> Ok, I'm running a vBulletin forum (3.8.4) and found that all user
> attachments
> go into 1 single directory for each user. For each attached file in the
> forum, there's 2 files on disk (*.attach and *.thumb), for pictures that
> is.
> 
> One user already has over 100,000 attachements, thus, over 200,000 files
> in
> his attach directory.
> 
> Someone recently told me to 'keep an eye on it' because certain setups
> can't
> hold more than X number of files in a single directory. Yet someone else
> said I could have over 1 trillion files in a single directory if the HDD
> was
> large enough...
> 
> Here's my setup:
> 
> linux version: 2.6.18-92.1.10.el5
> php: 5.1.6
> mySQL: 5.0.45
> File system: ext3
> 
> vB support has told me any limitation there might be, will not be the
> result
> of vB, so now I'm looking at either Linux and the way it handles files, or
> the ext3 file system.
> 
> Does anyone know if I can just keep going with putting files into one
> directory? (there will be over 1million probably by year's end. Hopefully
> not more than 5 million ever).
> 
> And, will having so many files in a single directory cause any performance
> problems? (ie: slowdowns)
> 
> My only option is to hire a coder to somehow have it split the 1M+ files
> into several subdirs, say 50,000 per subdir. But even though it's messy,
> if
> it really doesn't make a difference in the end whether they're in 50
> subdirs, or just 1 dir, then I won't bother (and can sigh a breath of
> relief)
> 
> 
> Thanks in advance!!
> z0diac is offline   	
> Looking for Linux Hosting? Click Here.
> 
> -- 
> View this message in context:
> http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html
> Sent from the Ext3 - User mailing list archive at Nabble.com.
> 
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
> 
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
> 
> 

-- 
View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25244663.html
Sent from the Ext3 - User mailing list archive at Nabble.com.



From web2009 at zeroreality.com  Tue Sep  1 21:23:06 2009
From: web2009 at zeroreality.com (z0diac)
Date: Tue, 1 Sep 2009 14:23:06 -0700 (PDT)
Subject: How many files can I have safely in a subdirectory?
In-Reply-To: <25244663.post@talk.nabble.com>
References: <25212801.post@talk.nabble.com>
	<31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com>
	<25244663.post@talk.nabble.com>
Message-ID: <25245073.post@talk.nabble.com>


There was something mentioned in a search about dir_index and I checked and
apparently it *is* running:

# sudo tune2fs -l /dev/sdb1 | grep dir_index
Filesystem features:      has_journal resize_inode dir_index filetype
needs_recovery sparse_super large_file


.. so I'm not sure if I can keep dumping files into the same directory and
not have to worry as much about performance ( ? )  It would be MUCh easier
for me if I could, instead of having to login under multiple accounts. 
There shouldn't be much more than 1-2M files in the directory anyway.  The
partition is only a 250GB anyway, and each file is anywhere from 50-200kb on
average, so there's just not enough space to hold that quantity of files
anyway.  ie: I'm sure I"ll run out of drive space before having too many
files affects performance (hope hope)
-- 
View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25245073.html
Sent from the Ext3 - User mailing list archive at Nabble.com.



From darkonc at gmail.com  Wed Sep  2 00:42:20 2009
From: darkonc at gmail.com (Stephen Samuel (gmail))
Date: Tue, 1 Sep 2009 17:42:20 -0700
Subject: How many files can I have safely in a subdirectory?
In-Reply-To: <7.0.1.0.2.20090901145754.0269eca8@zeroreality.com>
References: <25212801.post@talk.nabble.com>
	<6cd50f9f0909011050h6e71b754g67d065b68b54c3df@mail.gmail.com>
	<7.0.1.0.2.20090901145754.0269eca8@zeroreality.com>
Message-ID: <6cd50f9f0909011742r587c4bc4tce4c51bc262dd07b@mail.gmail.com>

The 2GB worry isn't about the contents of files in the directory, but
rather what happens when the directory 'file' itself gets to be over
2GB  in size.   If something's likely to break, then it's either there
or at 4GB (( i.e. if somebody made the mistake of using a 32bit
pointer in the wrong place )).

There are probably not a whole lot of examples of directories (as
opposed to data files) getting larger than 2GB, so I'd consider it
relatively uncharted territory.


On Tue, Sep 1, 2009 at 12:00 PM, Marc<marc_n at zeroreality.com> wrote:
> Thank you for the reply. ?(it seemed to come only via email as your reply
> and one other to my post, aren't showing up in the thread).
>
> Yes there is definitely over 2gb of files in that directory. ?My' inodes'
> are only at 3% of the 590M or whatever it was, that were created at the time
> the disc structure was created. ?I just wasn't sure if there was a limit
> with ext3 to the # of files that could reside in a single directory. ?I
> guess I will have to start adding new files under a new user account (which
> will put them in that user's attachment subdir).
>
> Thanks.!
>
> At 01:50 PM 9/1/2009, you wrote:
>>
>> Well, if you presume the possibility of running into bugs when the
>> directory gets over 2GB,
>> and directory entries averaging ?under 20 bytes, then you might see a
>> problem at around
>> 100million entries.
>> You can probably expect performance issues before that point.
>> If you expect these directories to keep growing year after year,
>> you might want to consider doing directory hashing...
>> If nothing else, it could get ugly if someone decides to do an 'ls'
>> on a directory with 10million entries.
>>
>>
>> On Sun, Aug 30, 2009 at 9:00 AM, z0diac<web2009 at zeroreality.com> wrote:
>> >
>> > Ok, I'm running a vBulletin forum (3.8.4) and found that all user
>> > attachments
>> > go into 1 single directory for each user. For each attached file in the
>> > forum, there's 2 files on disk (*.attach and *.thumb), for pictures that
>> > is.
>> >
>> > One user already has over 100,000 attachements, thus, over 200,000 files
>> > in
>> > his attach directory.
>> >
>> > Someone recently told me to 'keep an eye on it' because certain setups
>> > can't
>> > hold more than X number of files in a single directory. Yet someone else
>> > said I could have over 1 trillion files in a single directory if the HDD
>> > was
>> > large enough...
>> >
>> > Here's my setup:
>> >
>> > linux version: 2.6.18-92.1.10.el5
>> > php: 5.1.6
>> > mySQL: 5.0.45
>> > File system: ext3
>> >
>> > vB support has told me any limitation there might be, will not be the
>> > result
>> > of vB, so now I'm looking at either Linux and the way it handles files,
>> > or
>> > the ext3 file system.
>> >
>> > Does anyone know if I can just keep going with putting files into one
>> > directory? (there will be over 1million probably by year's end.
>> > Hopefully
>> > not more than 5 million ever).
>> >
>> > And, will having so many files in a single directory cause any
>> > performance
>> > problems? (ie: slowdowns)
>> >
>> > My only option is to hire a coder to somehow have it split the 1M+ files
>> > into several subdirs, say 50,000 per subdir. But even though it's messy,
>> > if
>> > it really doesn't make a difference in the end whether they're in 50
>> > subdirs, or just 1 dir, then I won't bother (and can sigh a breath of
>> > relief)
>> >
>> >
>> > Thanks in advance!!
>> > z0diac is offline
>> > Looking for Linux Hosting? Click Here.
>> >
>> > --
>> > View this message in context:
>> > http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html
>> > Sent from the Ext3 - User mailing list archive at Nabble.com.
>> >
>> > _______________________________________________
>> > Ext3-users mailing list
>> > Ext3-users at redhat.com
>> > https://www.redhat.com/mailman/listinfo/ext3-users
>> >
>>
>>
>>
>> --
>> Stephen Samuel http://www.bcgreen.com ?Software, like love,
>> 778-861-7641 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?grows when you give it away
>
>



-- 
Stephen Samuel http://www.bcgreen.com  Software, like love,
778-861-7641                              grows when you give it away



From per.lanvin at fouredge.se  Wed Sep  9 13:00:09 2009
From: per.lanvin at fouredge.se (=?iso-8859-1?Q?P=E4r_Lanvin?=)
Date: Wed, 9 Sep 2009 15:00:09 +0200
Subject: Many small files, best practise.
Message-ID: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>


//Sys
RHEL 5.3
~1000.000.000 files (1-30k)
~7TB in total
//

Hi,

I'm looking for a best practice when implementing this using EXT3 (or some other FS if it shouldn't do the job.).

On average the reads dominate (99%), writes are only used for updating and isn't a part of the service provided.
The data is divided into 200k directories with each some 5k files. This ratio (dir/files) can be altered to
optimize FS performance.

Any suggestions are greatly appreciated.


Rgds

/PL




From rwheeler at redhat.com  Wed Sep  9 13:37:44 2009
From: rwheeler at redhat.com (Ric Wheeler)
Date: Wed, 09 Sep 2009 09:37:44 -0400
Subject: Many small files, best practise.
In-Reply-To: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
Message-ID: <4AA7AFA8.5040502@redhat.com>

On 09/09/2009 09:00 AM, P?r Lanvin wrote:
>
> //Sys
> RHEL 5.3
> ~1000.000.000 files (1-30k)
> ~7TB in total
> //
>
> Hi,
>
> I'm looking for a best practice when implementing this using EXT3 (or some other FS if it shouldn't do the job.).
>
> On average the reads dominate (99%), writes are only used for updating and isn't a part of the service provided.
> The data is divided into 200k directories with each some 5k files. This ratio (dir/files) can be altered to
> optimize FS performance.
>
> Any suggestions are greatly appreciated.
>
>
> Rgds
>
> /PL


Hi Par,

This sounds a lot like the challenges I had in my recent past working on a 
similar storage system.

One key that you will find is to make sure that you minimize head movement while 
doing the writing. The best performance would be to have a few threads (say 4-8) 
write to the same subdirectory for a period of time of a few minutes (say 3-5) 
before moving on to a new directory.

If you are writing to a local S-ATA disk, ext3/4 can write a few thousand 
files/sec without doing any fsync() operations. With fsync(), you will drop down 
quite a lot.

One layout for directories that works well with this kind of thing is a time 
based one (say YEAR/MONTH/DAY/HOUR/MIN where MIN might be 0, 5, 10, ..., 55 for 
example).

When reading files in ext3 (and ext4) or doing other bulk operations like a 
large deletion, it is important to sort the files by inode (do the readdir, get 
say all of the 5k files in your subdir and then sort by inode before doing your 
bulk operation).

Good luck!

Ric



From pg_ext3 at ext3.for.sabi.co.UK  Mon Sep 14 09:40:18 2009
From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi)
Date: Mon, 14 Sep 2009 10:40:18 +0100
Subject: Many small files, best practise.
In-Reply-To: <4AA7AFA8.5040502@redhat.com>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
	<4AA7AFA8.5040502@redhat.com>
Message-ID: <19118.3970.628895.372996@tree.ty.sabi.co.uk>


>> RHEL 5.3
>> ~1000.000.000 files (1-30k)
>> ~7TB in total
>> //

>> I'm looking for a best practice when implementing this using
>> EXT3 (or some other FS if it shouldn't do the job.). 

"best practice" would be a rather radical solution.

>> On average the reads dominate (99%), writes are only used for
>> updating and isn't a part of the service provided.  The data
>> is divided into 200k directories with each some 5k files.
>> This ratio (dir/files) can be altered to optimize FS
>> performance.

> If you are writing to a local S-ATA disk, ext3/4 can write a
> few thousand files/sec without doing any fsync() operations.
> With fsync(), you will drop down quite a lot.

Unfortunately using 'fsync' is a good idea for production
systems.

Also note that in order to write 10^9 files at 10^3/s rate takes
10^6 seconds; roughly 10 days to populate the filesystem (or at
least that to restore it from backups).

> One layout for directories that works well with this kind of
> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
> MIN might be 0, 5, 10, ..., 55 for example).

As to the problem above and ths kind of solution, I reckon that
it is utterly absurd (and I could have used much stronger words).

  BTW, the sort of people who consider seriously such utter
  absurdities try to do a thorough job, and I don't want to
  know how the underlying storage system is structured :-).

If anything, consider the obvious (obvious except to those who
want to use a filesystem as a small record database), which is
'fsck' time, in particular given the structure of 'ext3' (or
'ext4') metadata.

So: just don't use a filesystem as a database, spare us the
horror; use a database, even a simple one, which is not utterly
absurd.

Compare these two:

  http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html
  http://lists.gllug.org.uk/pipermail/gllug/2005-October/055488.html

Anyhow I do see a lot of inane questions and "solutions" like
the above in various lists (usually the XFS one, which attracts
a lot of utter absurdities).

> When reading files in ext3 (and ext4) or doing other bulk
> operations like a large deletion, it is important to sort the
> files by inode (do the readdir, get say all of the 5k files in
> your subdir and then sort by inode before doing your bulk
> operation).

Good idea, but it is best to avoid the cases where this matters.



From r.majumdar at globallogic.com  Mon Sep 14 09:50:01 2009
From: r.majumdar at globallogic.com (Ritesh Majumdar)
Date: Mon, 14 Sep 2009 15:20:01 +0530
Subject: Untar hangs on ext3 file system
Message-ID: <1252921801.10534.9.camel@ripper.synapse.com>

Hello List,

I am trying to untar a 450 MB tar file on ext3 file system, but every
time untar (using the command "tar zxvf <filename>) hangs and I see no
disk activity. 

While I use ReiserFS file system I can untar the same file successfully.

I am not sure what is missing here.

Please Help!!!

Many Thanks,
Ritesh.



From rwheeler at redhat.com  Mon Sep 14 11:34:39 2009
From: rwheeler at redhat.com (Ric Wheeler)
Date: Mon, 14 Sep 2009 07:34:39 -0400
Subject: Many small files, best practise.
In-Reply-To: <19118.3970.628895.372996@tree.ty.sabi.co.uk>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>	<4AA7AFA8.5040502@redhat.com>
	<19118.3970.628895.372996@tree.ty.sabi.co.uk>
Message-ID: <4AAE2A4F.8010409@redhat.com>

On 09/14/2009 05:40 AM, Peter Grandi wrote:
>    
>>> RHEL 5.3
>>> ~1000.000.000 files (1-30k)
>>> ~7TB in total
>>> //
>>>        
>    
>>> I'm looking for a best practice when implementing this using
>>> EXT3 (or some other FS if it shouldn't do the job.).
>>>        
> "best practice" would be a rather radical solution.
>
>    
>>> On average the reads dominate (99%), writes are only used for
>>> updating and isn't a part of the service provided.  The data
>>> is divided into 200k directories with each some 5k files.
>>> This ratio (dir/files) can be altered to optimize FS
>>> performance.
>>>        
>    
>> If you are writing to a local S-ATA disk, ext3/4 can write a
>> few thousand files/sec without doing any fsync() operations.
>> With fsync(), you will drop down quite a lot.
>>      
> Unfortunately using 'fsync' is a good idea for production
> systems.
>
> Also note that in order to write 10^9 files at 10^3/s rate takes
> 10^6 seconds; roughly 10 days to populate the filesystem (or at
> least that to restore it from backups).
>
>    

One thing that you can do when doing bulk loads of files (say, during a 
restore or migration), is to use a two phase write. First, write each of 
a batch of files (say 1000 files at a time), then go back and 
reopen/fsync/close them.

This will give you performance levels closer to not using fsync() and 
still give you good data integrity. Note that this usually is a good fit 
for this class of operations since you can always restart the bulk load 
if you have a crash/error/etc.

To give this a try, you can use "fs_mark" to write say 100k files with 
the fsync one file at a time (-S 1, its default) or use one of the batch 
fsync modes (-S 3 for example).

>> One layout for directories that works well with this kind of
>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
>> MIN might be 0, 5, 10, ..., 55 for example).
>>      
> As to the problem above and ths kind of solution, I reckon that
> it is utterly absurd (and I could have used much stronger words).
>    

When you deal with systems that store millions of files, you pretty much 
always are going to use some kind of made up directory layout. The above 
scheme works pretty well in that it correlates well to normal usage 
patterns and queries (and tends to have those subdirectories laid out 
contiguously).

You can always try to write 1 million files in a single subdirectory, 
but if you are writing your own application, using this kind of scheme 
is pretty trivial.

>    BTW, the sort of people who consider seriously such utter
>    absurdities try to do a thorough job, and I don't want to
>    know how the underlying storage system is structured :-).
>
> If anything, consider the obvious (obvious except to those who
> want to use a filesystem as a small record database), which is
> 'fsck' time, in particular given the structure of 'ext3' (or
> 'ext4') metadata.
>    

fsck time has improved quite a lot recently with ext4 (and with xfs).

> So: just don't use a filesystem as a database, spare us the
> horror; use a database, even a simple one, which is not utterly
> absurd.
>
> Compare these two:
>
>    http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html
>    

In this case, doing the bulk load I described above (reading in sorted 
order, writing out in the same), would significantly reduce the time of 
the restore.

>    http://lists.gllug.org.uk/pipermail/gllug/2005-October/055488.html
>
> Anyhow I do see a lot of inane questions and "solutions" like
> the above in various lists (usually the XFS one, which attracts
> a lot of utter absurdities).
>
>    
>> When reading files in ext3 (and ext4) or doing other bulk
>> operations like a large deletion, it is important to sort the
>> files by inode (do the readdir, get say all of the 5k files in
>> your subdir and then sort by inode before doing your bulk
>> operation).
>>      
> Good idea, but it is best to avoid the cases where this matters.
>
>    




From sandeen at redhat.com  Mon Sep 14 16:43:22 2009
From: sandeen at redhat.com (Eric Sandeen)
Date: Mon, 14 Sep 2009 11:43:22 -0500
Subject: Untar hangs on ext3 file system
In-Reply-To: <1252921801.10534.9.camel@ripper.synapse.com>
References: <1252921801.10534.9.camel@ripper.synapse.com>
Message-ID: <4AAE72AA.2020404@redhat.com>

Ritesh Majumdar wrote:
> Hello List,
> 
> I am trying to untar a 450 MB tar file on ext3 file system, but every
> time untar (using the command "tar zxvf <filename>) hangs and I see no
> disk activity. 

Stating which kernel you are using would be a help ...

stracing the untar might tell you where it's at from the userspace
perspective; echo t > /proc/sysrq-trigger would give you all of the
kernel thread tracebacks, and you could find the tar process in there to
see where it is stuck.

-Eric

> While I use ReiserFS file system I can untar the same file successfully.
> 
> I am not sure what is missing here.
> 
> Please Help!!!
> 
> Many Thanks,
> Ritesh.



From pg_ext3 at ext3.for.sabi.co.UK  Mon Sep 14 21:08:58 2009
From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi)
Date: Mon, 14 Sep 2009 22:08:58 +0100
Subject: Many small files, best practise.
In-Reply-To: <4AAE2A4F.8010409@redhat.com>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
	<4AA7AFA8.5040502@redhat.com>
	<19118.3970.628895.372996@tree.ty.sabi.co.uk>
	<4AAE2A4F.8010409@redhat.com>
Message-ID: <19118.45290.900343.204958@tree.ty.sabi.co.uk>

[ ... ]

>> Also note that in order to write 10^9 files at 10^3/s rate
>> takes 10^6 seconds; roughly 10 days to populate the
>> filesystem (or at least that to restore it from backups).

> One thing that you can do when doing bulk loads of files (say,
> during a restore or migration), is to use a two phase
> write. First, write each of a batch of files (say 1000 files
> at a time), then go back and reopen/fsync/close them.

Why not just restore a database?

>>> One layout for directories that works well with this kind of
>>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
>>> MIN might be 0, 5, 10, ..., 55 for example).

>> As to the problem above and ths kind of solution, I reckon that
>> it is utterly absurd (and I could have used much stronger words).

> When you deal with systems that store millions of files,

Millions of files may work; but 1 billion is an utter absurdity.
A filesystem that can store reasonably 1 billion small files in
7TB is an unsolved research issue...

The obvious thing to do is to use a database, and there is no
way around this point.

If one genuinely needs to store a lot of files, why not split
them into many independent filesystems? A single large one is
only need to allow for hard linking or for having a single large
space pool, and in applications where the directory structure
above makes any kind of sense that neither is usually required.

> you pretty much always are going to use some kind of made up
> directory layout.

File systems are usually used for storing somewhat unstructured
information, not records that can be looked up with a simple
"YEAR/MONTH/DAY/HOUR/MIN" key, which seems very suitable for
something like a simpel DBMS.

There is even a tendency to move filesystems into databases, as
they scale a lot better.

And for cases where a filesystem still makes sense I would
rather use, instead of the inane manylevel directory structure
above, a file system design with proper tree indexes and perhaps
even one with the ability to store small files into inodes.

[ ... ]

> You can always try to write 1 million files in a single
> subdirectory,

Again, I'd rather avoid anything like that.

> but if you are writing your own application, using this kind
> of scheme is pretty trivial.

And an utter absurdity, for 1 billion files in 200k directories.
Both on its own merits and compared to the OBVIOUS alternative.

>> If anything, consider the obvious (obvious except to those
>> who want to use a filesystem as a small record database),
>> which is 'fsck' time, in particular given the structure of
>> 'ext3' (or 'ext4') metadata.

> fsck time has improved quite a lot recently with ext4 (and
> with xfs).

How many months do you think a 7TB filesystem with 1 billion
files would take to 'fsck' even with those improvements? Even
with the nice improvements?

[ ... ]



From rwheeler at redhat.com  Wed Sep 16 18:56:54 2009
From: rwheeler at redhat.com (Ric Wheeler)
Date: Wed, 16 Sep 2009 14:56:54 -0400
Subject: Many small files, best practise.
In-Reply-To: <19118.45290.900343.204958@tree.ty.sabi.co.uk>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>	<4AA7AFA8.5040502@redhat.com>	<19118.3970.628895.372996@tree.ty.sabi.co.uk>	<4AAE2A4F.8010409@redhat.com>
	<19118.45290.900343.204958@tree.ty.sabi.co.uk>
Message-ID: <4AB134F6.2060900@redhat.com>

On 09/14/2009 05:08 PM, Peter Grandi wrote:
> [ ... ]
>
>>> Also note that in order to write 10^9 files at 10^3/s rate
>>> takes 10^6 seconds; roughly 10 days to populate the
>>> filesystem (or at least that to restore it from backups).
>
>> One thing that you can do when doing bulk loads of files (say,
>> during a restore or migration), is to use a two phase
>> write. First, write each of a batch of files (say 1000 files
>> at a time), then go back and reopen/fsync/close them.
>
> Why not just restore a database?

If you started with a database, that would be reasonable. If you started with a 
file system, I guess I don't understand what you are suggesting.

>
>>>> One layout for directories that works well with this kind of
>>>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where
>>>> MIN might be 0, 5, 10, ..., 55 for example).
>
>>> As to the problem above and ths kind of solution, I reckon that
>>> it is utterly absurd (and I could have used much stronger words).
>
>> When you deal with systems that store millions of files,
>
> Millions of files may work; but 1 billion is an utter absurdity.
> A filesystem that can store reasonably 1 billion small files in
> 7TB is an unsolved research issue...

Strangely enough, I have been testing ext4 and stopped filling it at a bit over 
1 billion 20KB files on Monday (with 60TB of storage).

Running fsck on it took only 2.4 hours.


>
> The obvious thing to do is to use a database, and there is no
> way around this point.

Everything has a use case. I am certainly not an anti-DB person, but your 
assertion alone is not convincing.

>
> If one genuinely needs to store a lot of files, why not split
> them into many independent filesystems? A single large one is
> only need to allow for hard linking or for having a single large
> space pool, and in applications where the directory structure
> above makes any kind of sense that neither is usually required.

Splitting a big file system into small ones means that you (the application or 
sys admin) must load balance where to put new files instead of having the system 
do it for you.



>> you pretty much always are going to use some kind of made up
>> directory layout.


The use case for big file systems with lots of small files (at least the one 
that I know of) is for object based file systems where files usually have odd, 
non-humanly generated file names (think guids with time stamps and digital 
signatures).

These are pretty trivial to map into the time based directory scheme I mentioned 
before.

>
> File systems are usually used for storing somewhat unstructured
> information, not records that can be looked up with a simple
> "YEAR/MONTH/DAY/HOUR/MIN" key, which seems very suitable for
> something like a simpel DBMS.
>
> There is even a tendency to move filesystems into databases, as
> they scale a lot better.
>
> And for cases where a filesystem still makes sense I would
> rather use, instead of the inane manylevel directory structure
> above, a file system design with proper tree indexes and perhaps
> even one with the ability to store small files into inodes.
>
> [ ... ]

Have you tried to make a production DB with 1 billion records? Or done 
experiments with fs vs db schemes?

>
>> You can always try to write 1 million files in a single
>> subdirectory,
>
> Again, I'd rather avoid anything like that.
>
>> but if you are writing your own application, using this kind
>> of scheme is pretty trivial.
>
> And an utter absurdity, for 1 billion files in 200k directories.
> Both on its own merits and compared to the OBVIOUS alternative.
>
>>> If anything, consider the obvious (obvious except to those
>>> who want to use a filesystem as a small record database),
>>> which is 'fsck' time, in particular given the structure of
>>> 'ext3' (or 'ext4') metadata.
>
>> fsck time has improved quite a lot recently with ext4 (and
>> with xfs).
>
> How many months do you think a 7TB filesystem with 1 billion
> files would take to 'fsck' even with those improvements? Even
> with the nice improvements?
>

20KB files written to ext4 run at around 3,000 files/sec. It took us about 4 
days to fill it to 1 billion files and 2.4 hours to fsck.

Not to be mean, but I have worked in this exact area and have benchmarked both 
large DB instances and large file systems.  Good use cases exist for both, but 
the facts do not back up your DB is the only solution proposal :-)

ric



From adilger at sun.com  Wed Sep 16 22:28:29 2009
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 16 Sep 2009 16:28:29 -0600
Subject: Many small files, best practise.
In-Reply-To: <19118.45290.900343.204958@tree.ty.sabi.co.uk>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
	<4AA7AFA8.5040502@redhat.com>
	<19118.3970.628895.372996@tree.ty.sabi.co.uk>
	<4AAE2A4F.8010409@redhat.com>
	<19118.45290.900343.204958@tree.ty.sabi.co.uk>
Message-ID: <20090916222829.GQ2537@webber.adilger.int>

On Sep 14, 2009  22:08 +0100, Peter Grandi wrote:
> > When you deal with systems that store millions of files,
> 
> Millions of files may work; but 1 billion is an utter absurdity.
> A filesystem that can store reasonably 1 billion small files in
> 7TB is an unsolved research issue...

I'd disagree.  We have Lustre filesystems with 500M files on
the ext4(ish) metadata server, and these are only 4TB.  Note
there is NO DATA in the metadata files, so it isn't quite like
a normal filesystem.

It also depends on what you mean by "small files".
We've previously discussed storing small file data in an
extended attribute, and if you are tuning for this and the
file size is small enough (3kB or less) the file data could
be stored inside the inode (i.e. zero seek data IO).

> > fsck time has improved quite a lot recently with ext4 (and
> > with xfs).
> 
> How many months do you think a 7TB filesystem with 1 billion
> files would take to 'fsck' even with those improvements? Even
> with the nice improvements?

I think you aren't backing your comments with any facts.

The e2fsck time on our MDS filesystems with 500M IN USE inodes
is on the order of 4 hours (disk-based RAID-1+0 array).  If
this was on a RAID-1+0 SSD it could be noticably faster.

Ric also commented previously about single-digit hours for e2fsck
on a test 1B file ext4 filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



From pg_ext3 at ext3.for.sabi.co.UK  Mon Sep 21 13:54:44 2009
From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi)
Date: Mon, 21 Sep 2009 14:54:44 +0100
Subject: Many small files, best practise.
In-Reply-To: <4AB134F6.2060900@redhat.com>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
	<4AA7AFA8.5040502@redhat.com>
	<19118.3970.628895.372996@tree.ty.sabi.co.uk>
	<4AAE2A4F.8010409@redhat.com>
	<19118.45290.900343.204958@tree.ty.sabi.co.uk>
	<4AB134F6.2060900@redhat.com>
Message-ID: <19127.34212.425251.424259@tree.ty.sabi.co.uk>

[ ... whether storing 1 bilion 7KB (average) records are best
stored in a database or 1 per file in a file system ... ]

>>> One thing that you can do when doing bulk loads of files
>>> (say, during a restore or migration), is to use a two phase
>>> write. First, write each of a batch of files (say 1000 files
>>> at a time), then go back and reopen/fsync/close them.

>> Why not just restore a database?

> If you started with a database, that would be reasonable. If
> you started with a file system, I guess I don't understand
> what you are suggesting.

Well, the topic of this discussion is whether one *should* start
with a database for the "lots of small records" case. 

It is not a new topic by any means -- there have been many
debates in the past as to how silly it is to have immense
file-per-message news/mail spool archives with lots of little
files. The outcome has always been to store them in databased of
one sort or another.

>>>>> One layout for directories that works well with this kind
>>>>> of thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN
>>>>> where MIN might be 0, 5, 10, ..., 55 for example).

>>> As to the problem above and ths kind of solution, I reckon
>>> that it is utterly absurd (and I could have used much
>>> stronger words).

>>> When you deal with systems that store millions of files,

>> Millions of files may work; but 1 billion is an utter
>> absurdity.  A filesystem that can store reasonably 1 billion
>> small files in 7TB is an unsolved research issue ... [
>> ... and fsck ... ]

> Strangely enough, I have been testing ext4 and stopped filling
> it at a bit over 1 billion 20KB files on Monday (with 60TB of
> storage).

Is that a *reasonable* use of a filesystem? Have you compared to
storing 1 billion 20KB records in a simple database?

As an aside, 20KB is no longer than much in the "small files"
range. For example, one stupid idea of storing records as "small
files" is the enormous internal fragmentation caused by 4KiB
allocation granularity, which swells space used too. Even for
the original problem, which was about:

  > ~1000.000.000 files (1-30k)
  > ~7TB in total

that is presumably lots of files under 4KiB if the average file
size is 7KB in a range between 1-30KB.

Also looking at my humble home system, at the root filesystem
and a media (RPMs, TARs, ZIPs, JPGs, ISOs, ...) archival
filesystem (both JFS):

  base# df / /fs/basho
  Filesystem           1M-blocks      Used Available Use% Mounted on
  /dev/sdb1                11902      9712      2191  82% /
  /dev/sda8               238426    228853      9573  96% /fs/basho
  base# df -i / /fs/basho
  Filesystem            Inodes   IUsed   IFree IUse% Mounted on
  /dev/sdb1            4873024  359964 4513060    8% /
  /dev/sda8            19738976  126493 19612483    1% /fs/basho

I see that files under 4K are the vast majority on one and a
large majority on the other:

  base# find / -xdev -type f -size -4000 | wc -l
  305064
  base# find /fs/basho -xdev -type f -size -4000 | wc -l
  107255

Anyhow, because while some people make (because they do "work")
fielsystems with millions and even billion inodes and/or 60TB
capacities (on 60+1 RAID5s sometimes), the question is whether
it makes sense or is an absurdity on its own merits and when
compared to a database.

That something stupid can be done is not an argument for doing it.

The arguments I referred to in my original comments show just
how expensive is to misuse a directory hierarchy in a filesystem
as if it were an index in a database, by comparing them:

 "I have a little script, the job of which is to create a lot of
  very small files (~1 million files, typically ~50-100 bytes each)."
 "It's a bit of a one-off (or twice, maybe) script, and
  currently due to finish in about 15 hours,"

 "creates a Berkeley DB database of K records of random length
  varying between I and J bytes,"
 "So, we got 130MiB of disc space used in a single file, >2500
  records sustained per second inserted over 6 minutes and a half,"

Perhaps 50-100 bytes is a bit extreme, but still compare "due to
finish in about 15 hours" with "6 minutes and a half".

Now, in that case a large part of the speedup is that the
records were small enough that 1m of them as a database would
fit into memory (that BTW was part of the point why using a
filesystem for that was utterly absurd).

I'd rather not do a test with 1G 6-7KB records on my (fairly
standard, small, 2GHz PCU, 2GiB RAM) home PC, but 1M 6-7KB
records is of course feasible, and on a single modern disk with
1 TB (and a slightly prettified updated script using BTREE) I
get (1M records with a 12 byte key, record length random between
2000 and 10000 bytes):

  base# rm manyt.db
  base# time perl manymake.pl manyt.db 1000000 2000 10000
    1 percent done, 990000 to go
    2 percent done, 980000 to go
    3 percent done, 970000 to go
  ....
   98 percent done, 20000 to go
   99 percent done, 10000 to go
  100 percent done, 0 to go

  real	81m6.812s
  user	0m29.957s
  sys	0m30.124s
  base# ls -ld manyt.db 
  -rw------- 1 root root 8108961792 Sep 19 20:36 manyt.db

The creation script flushes every 1% too, but from the pathetic
peak 3-4MB/s write rate it is pretty obvious that on my system
things don't get cached a lot (by design...).

As to reading, 10000 records at random among those 1M:

  base# time perl manyseek.pl manyt.db 1000000 10000
    1 percent done, 9900 to go
    2 percent done, 9800 to go
    3 percent done, 9700 to go
  ....
   98 percent done, 200 to go
   99 percent done, 100 to go
  100 percent done, 0 to go
  average length: 5984.4108

  real	7m22.016s
  user	0m0.210s
  sys	0m0.442s

That is on the slower half of a 1T drive in a half empty JFS
filesystem. That's 200/s 6KB average records inserted, and about
22/s looked up, which is about as good as the drive can do, all
in a single 8GB file. Sure, a lot slower than 50-100 bytes as it
can no longer much fit into memory, but still way off "due to
finish in about 15 hours". Sure the system I used for the new
test is a bit faster than the one used for the "in about 15
hours" test, but we are still talking one arm, which is largely
the bottleneck.

But wait -- I am JOKING. because it is ridiculous to load a 1M
record dataset into an indexed database one record at a time.

Sure it is *possible*, but any sensible database has a bulk
loader that builds the index after loading the data. So in any
reasonable scenario the difference when *restoring* a backedup
filesystem will be rather bigger than for the scenario above.
Sure, some file systems have 'dump' like tools that help, but
they don't recreate a nice index, they just restore it. Ah well.

Now let's see a much bigger scale test:

> [ ... ] testing ext4 and stopped filling it at a bit over 1
> billion 20KB files on Monday (with 60TB of storage). Running
> fsck on it took only 2.4 hours. [ ... ]

> [ ... ] 20KB files written to ext4 run at around 3,000
> files/sec. It took us about 4 days to fill it to 1 billion
> files [ ... ]

That sounds like you did use 'fsync' per file or something
similar, as you had written:

>>>> If you are writing to a local S-ATA disk, ext3/4 can write a
>>>> few thousand files/sec without doing any fsync() operations.
>>>> With fsync(), you will drop down quite a lot.

and here you report around 3000/s over a 60TB array.

Then 20KBx3000/s is 60MB/s -- rather unimpressive score for a
60TB filesystem (presumably spread over 60 drives or more), even
with 'fsync'. And the creation record rate itself looks like
about 50 records/s per drive. That is rather disappointing. Yes,
they are larger files, but that should not cause that much
slowdown.

Also, the storage layout is not declared (except that you are
storing 20TB of data in 60TB of drives, which is a bit of a cheat),
and it would be also quite interesting to see the output of that
'fsck' run:

> and 2.4 hours to fsck.

But that is an unreasonable test, even if it is the type of test
popular with some file system designers, precisely because...

Testing file system performance just after loading is a naive or
cheating exercise, especially with 'ext4' (and 'ext3'), as after
loading all those inodes and files are going to be nearly
optimally laid out (e.g. list of inode numbers in a directory
pretty much sequential), and with 'ext4' each file will consist
of a single extent (hopefully), so less metadata.

But a filesystem that simulates a simple small object database
will as a rule not be so lucky; it will grow and be modified.

Even worse, 'fsck' on a filesystem *without damage* is just an
exercise in enumerating inodes and other metadata. What is
interesting is that happens when there is damage and 'fsck' has
to start cross-correlating metadata.

So here are some more realistic 'fsck' estimates from other
filesystems and other times, who should be very familiar to
those considering utterly absurd designs:

  http://ukai.org/b/log/debian/snapshot

   "long fsck on disks for old snapshot.debian.net is completed
    today. It takes 75 days!"

   "It still fsck for a month....

    root      6235 36.1 59.7 1080080 307808 pts/2 D+  Jun21 15911:50 fsck.ext3 /dev/md5"

That was I think before some improvements to 'ext3' checking.

  http://groups.google.com/group/linux.debian.ports.x86-64/msg/fd2b4d46a4c294b5
  
   "Keep in mind if you go with XFS, you're going to need 10-15
    gig of memory or swap space to fsck 6tb.. it needs about 9
    gig to xfs_check, and 3 gig to xfs_repair a 4tb array on one
    of my systems.. oh, and a couple days to do either. :)"

   "> Generally, IMHO no. A fsck will cost a lot of time with
    > all filesystems.

    Some worse than others though.. looks like this 4tb is going
    to take 3 weeks.. it took about 3-4 hours on ext3.. If i had
    a couple gig of ram to put in the server that'd probably
    help though, as it's constantly swapping out a few meg a
    second."

  http://lists.us.dell.com/pipermail/linux-poweredge/2007-November/033821.html

   "> I'll definitely be considering that, as I already had to
    > wait hours for fsck to run on some 2 to 3TB ext3
    > filesystems after crashes. I know it can be disabled, but
    > I do feel better forcing a complete check after a system
    > crash, especially if the filesystem had been mounted for
    > very long, like a year or so, and heavily used.

    The decision process for using ext3 on large volumes is simple:

    Can you accept downtimes measured in hours (or days) due to
    fsck? No - don't use ext3."

  http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/

   "Yesterday I had fun time repairing 1.5Tb ext3 partition,
    containing many millions of files. Of course it should have
    never happened - this was decent PowerEdge 2850 box with RAID
    volume, ECC memory and reliable CentOS 4.4 distribution but
    still it did. We had "journal failed" message in kernel log
    and filesystem needed to be checked and repaired even though
    it is journaling file system which should not need checks in
    normal use, even in case of power failures. Checking and
    repairing took many hours especially as automatic check on
    boot failed and had to be manually restarted."

Another factor is just how "complicated" the filesystem is, and
for example 'fsck' times with large numbers of hard links can be
very bad (and there are quite a few use cases like 'rdiff-backup').

Also, what about the few numbers you mention above? The 2.4
hours for 1 billion files mean 110K inodes examined per second.
Now 60TB probably means like 60 1TB drives to store 20TB of
data, a pretty large degree of parallelism. T'so reports:

  http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/

which shows that on a single (laptop) drive an 800K inode/90GB
'ext4' filesystem could be checked in 63s or around 12K inodes/s
per drive, not less than 2K.

There seems to be a scalability problem -- but of course: one of
the "unsolved research issue"s is that while read/write/etc. can
be parallelized (for large files) by using wide RAIDs, it is not
so easy to parallelize 'fsck' (except by using multiple mostly
independent filesystems).

[ ... ]

> The use case for big file systems with lots of small files (at
> least the one that I know of) is for object based file systems
> where files usually have odd, non-humanly generated file names
> (think guids with time stamps and digital signatures).

> These are pretty trivial to map into the time based directory
> scheme I mentioned before.

And it is utterly absurd to do so (see below).

> [ ... ] benchmarked both large DB instances and large file
> systems.  Good use cases exist for both, but the facts do not
> back up your DB is the only solution proposal :-)

Sure, large filesystems (to a point, which for me is the single
digit TB range) with large files have their place, even if
people seem to prefer metafilesystem like Lustre even for those,
for good reasons.

But the discussion is whether it makes sense, for a case like 1G
records averaging about 7KB, to use a filesystem with 200K
directories with each 5K files (or something similar) one file
per record, or a database with a nice overall index and a single
or a few files for all records.

Your facts above show that it is *possible* to create a similar
(1G x 20K records) filesystem, and that it seem to make a rather
poor use of a very large storage system.

The facts that I referred to in my original comment show that
there is a VERY LARGE performance difference between using a
filesystem as a (very) small-record database for just 1M
records, and a PRETTY LARGE difference even for 6KB records, and
that doing something stupid on the database side.

In the end the facts just confirm the overall discussion that
I referred to in my original comment:

  http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html

   "* The size of the tree will be around 1M filesystem blocks on
      most filesystems, whose block size usually defaults to 4KiB,
      for a total of around 4GiB, or can be set as low as 512B, for
      a total of around 0.5GiB.

    * With 1,000,000 files and a fanout of 50, we need 20,000
      directories above them, 400 above those and 8 above those.
      So 3 directory opens/reads every time a file has to be
      accessed, in addition to opening and reading the file.

    * Each file access will involve therefore four inode accesses
      and four filesystem block accesses, probably rather widely
      scattered. Depending on the size of the filesystem block and
      whether the inode is contiguous to the body of the file this
      can involve anything between 32KiB and 2KiB of logical IO per
      file access.

    * It is likely that of the logical IOs those relating to the two
      top levels (those comprising 8 and 400 directories) of the
      subtree will be avoided by caching between 200KiB and 1.6MiB,
      but the other two levels, the 20,000 bottom directories and
      the 1,000,000 leaf files, won't likely be cached."

These are pretty elementary considerations, and boil down to the
issue of whether for a given dataset of "small" records the best
index structure is a tree of directories or a nicely balanced
index tree, and whether the "small" records should be at most
one per (4KiB usually) block or can share blocks, and there is
little doubt that tha latter wins pretty big.

Your proposed directory based index "YEAR/MONTH/DAY/HOUR/MIN"
seems to me particularly inane, as it has a *fixed fanout*, of
12 at the "MONTH" level, around 30 at the "DAY" level, 24 at the
hour level, and 60 at the "MIN" level with no balancing. Fine if
the record creation rate is constant.

Perhaps not -- it involves 500K "MIN" directories per year.
If we create 1G files per year we get around 2K files per "MIN"
directory, each of which is then likely to be a few 4KiB blocks
long. Fabulous :-).

Sure, it is a *doable* structure, but it is not *reasonable*,
especially if one knows the better alternative.

Overall the data and arguments above suggests that:

* Large filesystems (2 digits TB and more) usually should be
  avoided.

* Filesystems with large numbers (more than a few millions) of
  files, even large files, should be avoided.

* Large filesystems with a large number of small (around 4KiB)
  inodes (not just files) are utterly absurd, on their own
  merits, and even more so when compared with a database.

* Two big issues are that while parallel storage scales up data
  performance, it does not do that well with metadata, and in
  particular metadata crawls such as 'fsck' are hard to
  parallelize (they are hard even when they in effect resolve
  just in mostly-linear scans).

* If one *has* to have any of the above, separate filesystems,
  and/or filesystems based on a database-like design (e.g. based
  on indices throughout like HFS+ or Reiser3 or to some degree
  JFS and even XFS) may be the lesser evils, even if they have
  some limitations. But that is still fairly crazy. 'ar' files
  for one thing have been invented decades ago precisely because
  lots of small files and filesystems are a bad combination.

These are conclusions well supported by experiment, data and
simple reasoning as in the above. I should not have to explain
these pretty obvious points in detail -- that databases are much
better for large small record collections is not exactly a
recent discovery.

Sure, a lot of people "know better" and adopt what I call the
"syntactically valid" approach, where if a combination is
possible it is then fine. Good luck!



From pg_ext3 at ext3.for.sabi.co.UK  Mon Sep 21 15:37:25 2009
From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi)
Date: Mon, 21 Sep 2009 16:37:25 +0100
Subject: Many small files, best practise.
In-Reply-To: <20090916222829.GQ2537@webber.adilger.int>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
	<4AA7AFA8.5040502@redhat.com>
	<19118.3970.628895.372996@tree.ty.sabi.co.uk>
	<4AAE2A4F.8010409@redhat.com>
	<19118.45290.900343.204958@tree.ty.sabi.co.uk>
	<20090916222829.GQ2537@webber.adilger.int>
Message-ID: <19127.40373.159150.385799@tree.ty.sabi.co.uk>

[ ... whether datasets like 1G records for a total of 7TB should be
stored as one-record-per-file in a filesystem or as a database ... ]

>>> When you deal with systems that store millions of files,

>> Millions of files may work; but 1 billion is an utter absurdity.
>> A filesystem that can store reasonably 1 billion small files in
>> 7TB is an unsolved research issue...

> I'd disagree.  We have Lustre filesystems with 500M files on
> the ext4(ish) metadata server, and these are only 4TB. Note
> there is NO DATA in the metadata files, so it isn't quite like
> a normal filesystem.

That is possible, but to me seems quite unreasonable. How long
does that take to RSYNC, for example? To just backup? What about
doing a 'find'? These are mad things.

This is the special case of an MDS as you mention, but it is still
fairly dangerous.

Just like many other similar choices (e.g. 19+1 RAID5 arrays), it
works (not so awesomely) as long as it works, and when it breaks it
is very bad.

I like the Lustre idea, and to me it is currently the best of a not
very enthusing lot, but the MDT is by far the weakest bit, and the
``lots of tiny files'' idea is one of the big deals.

In particular size of MDTs is a significant scalability issue
with Lustre, which was designed in older gentler times for
purposes to which metadata scalability might not have been so
essential. Like most good ideas it has been scaled up beyond
expectations (UNIX-style), and perhaps it is reaching the end
of its useful range.

Fortunately sensible Lustre people keep frequent and wholesame
MDS backups, and restoring a backup, and even a 500M 800B file
backup/restore is hopefully much faster than an 'fsck' if there
is damage.

> It also depends on what you mean by "small files". We've
> previously discussed storing small file data in an extended
> attribute, and if you are tuning for this and the file size is
> small enough (3kB or less) the file data could be stored
> inside the inode (i.e. zero seek data IO).

If I were to use a filesystem as a makeshift database I would
indeed use one of those filesystems that store small files or file
tails in the metadata, as I wrote:

  >> And for cases where a filesystem still makes sense I would
  >> rather use, instead of the inane manylevel directory
  >> structure above, a file system design with proper tree
  >> indexes and perhaps even one with the ability to store
  >> small files into inodes.

You might consider storing Lustre MDTs on Reiser3 instead of
'ldiskfs' :-).

But this is backwards; the database guys have spent the past
several decades working on the ``lots of small records reliably''
problem (and with "bushy" indices), and the main work by the file
system guys has been solving the ``massive massively parallel
files'' one. To the point that people like Reiser who did work
(with database like techniques) on the small files problems for
filesystems have been at best ignored.

[ ... ]

> I think you aren't backing your comments with any facts.

You may think that -- but that's only because you think wrong,
as you haven't read my comments or you want to misrepresent
them.

I made at the very start a clear example of a case with 1M small
files engendering a difference between more than 15 hours vs. 6
minutes for just creation.

For amusement I just rerun it in a nicer form on a somewhat faster
system:

  base$  rm /fs/jugen/tmp/manysmall.db 
  base$  time perl manymake.pl /fs/jugen/tmp/manysmall.db 1000000 50 100
    1 percent done, 990000 to go
    2 percent done, 980000 to go
    3 percent done, 970000 to go
  ....
   98 percent done, 20000 to go
   99 percent done, 10000 to go
  100 percent done, 0 to go

  real	0m48.209s
  user	0m6.240s
  sys	0m0.348s
  base$  ls -ld /fs/jugen/tmp/manysmall.db 
  -rw------- 1 pcg pcg 98197504 Sep 21 16:19 /fs/jugen/tmp/manysmall.db

That's 1M records in 10MB in less than a minute or 20K records/s,
for around 1.5MB/s, which is fairly typical for random access to a
fairly standard 1TB consumer drive in its latter half.

  base$  sudo sysctl vm.drop_caches=1
  vm.drop_caches = 1
  base$  time perl manyseek.pl /fs/jugen/tmp/manysmall.db 1000000 10000
    1 percent done, 9900 to go
    2 percent done, 9800 to go
    3 percent done, 9700 to go
  ....
   98 percent done, 200 to go
   99 percent done, 100 to go
  100 percent done, 0 to go
  average length: 69.3816

  real	2m4.265s
  user	0m0.150s
  sys	0m0.126s

Seeking of course is not awesome, and we get 10K records in 2m, or
around 80 records/s. Ah well. I need an SSD :-).

And as to the 'fsck', I confess that I had a list of cases in
mind but was waiting for the usual worn out dodgy technique of
quoting undamaged filesystem times:

> The e2fsck time on our MDS filesystems with 500M IN USE inodes
> is on the order of 4 hours (disk-based RAID-1+0 array). If
> this was on a RAID-1+0 SSD it could be noticably faster. Ric
> also commented previously about single-digit hours for e2fsck
> on a test 1B file ext4 filesystem.

That is a classic "benchmark" -- undamaged filesystem 'fsck'
tests, like the other favourite, freshly loaded filesystem
benchmarks, are just dodgy marketing tools.

And even so! 1 hour per TB, or 1h per 100M files. To me keeping
what may be production filesystem with 500M files unavailable
for 4 hours because one occasionally has to run 'fsck' (even if
in fact there is no damage) with an upside risk of weeks or
months sounds not such a good idea. But who knows.

There are been reports, which are sadly familiar to those who
work as sysadms, of single digit TB filesystems taking weeks to
months to repair, if damaged. The difference of course is
between scanning the metadata and crawling it.

Which is of course perfectly obvious, as RAIDs allow for
parallelizing of read/write but not easily for scanning and less so
for crawls. Scaling 'fsck' is not easy, is an unsolved research
problem, even if things like Lustre help somewhat (minus the MDTs
of course).

Now I feel a bit preachy, I'll mention some wider concepts (mostly
from the database guys) that should fit well in this discussion:

* A "database" is defined as something including a dataset whose
  working set does not fit in memory (it thrashes -- every access
  involves at least one IO). There are several types of databases,
  structured/unstructured, factual/textual/...; a filesystem is a
  kind of database, as that definition applies. But to me and
  several decades of practice and theory it is a database of
  record _containers_ (as suggested by the very word "file"), not
  of records. It is exceptionally hard to do a DBMS that handles
  equally well records and record containers.

* A "very large database" is a database that cannot be practically
  backed up (or checked) offline, as backup (or check) take too
  long wrt to requirements. Many filesystems are moving into the
  "very large database" category (can your customers accept that it
  might take 4 hours or 4 weeks to check, and 4 days to restore,
  their filesystem?). Storing small records (or small containers
  even) in a filesystem makes it much more likely that it becomes a
  "very large database", and while the technology for "very large
  databases" DBMSes is mature, that for "very large database" file
  system designs is not there or at least not as mature, even if
  the fun guys at Sun have been trying lately with ZFS.

* These are not novel or little know concepts and experiences. 'ar'
  files have been around for a long time, for some good reason.



From awk at google.com  Wed Sep 23 22:59:02 2009
From: awk at google.com (Abhijit Karmarkar)
Date: Wed, 23 Sep 2009 22:59:02 -0000
Subject: jbd/kjournald oops on 2.6.30.1
Message-ID: <88cc3e770909231558u5109aca1u1409ba6877a6c8f@mail.gmail.com>

Hi,

I am getting the following Oops on 2.6.30.1 kernel. The bad part is,
it happens rarely (twice in last 1.5 months) and the system is pretty
lightly loaded when this happens (no heavy file/disk io).

Any insights or patches that I can try? (i searched lkml and ext3
lists but could not find any similar oops/reports).


== Oops ===================
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff80373520>] __journal_remove_journal_head+0x10/0x120
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/class/scsi_host/host0/proc_name
CPU 0
Pid: 3834, comm: kjournald Not tainted 2.6.30.1_test #1
RIP: 0010:[<ffffffff80373520>]  [<ffffffff80373520>]
__journal_remove_journal_head+0x10/0x120
RSP: 0018:ffff880c7ee11d80  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000034
RDX: 0000000000000002 RSI: ffff8804ee82aa20 RDI: ffff8804ee82aa20
RBP: ffff880c7ee11d90 R08: 0400000000000000 R09: 0000000000000000
R10: ffffffff803706af R11: 0000000000000000 R12: ffff8808659bc198
R13: 0000000000000001 R14: ffff880bd435a980 R15: ffff880c7959d000
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kjournald (pid: 3834, threadinfo ffff880c7ee10000, task
ffff880c794900c0)
Stack:
ffff8804ee82aa20 ffff8808659bc198 ffff880c7ee11db0 ffffffff80374fd4
ffff880c7ee11db0 ffff8804ee82aa20 ffff880c7ee11e90 ffffffff8037073d
ffff880c7959d3a8 ffff880c7ee11e48 ffff880c7959d028 ffff880c7959d338
Call Trace:
[<ffffffff80374fd4>] journal_remove_journal_head+0x24/0x50
[<ffffffff8037073d>] journal_commit_transaction+0x41d/0x1150
[<ffffffff8024f8cc>] ? try_to_del_timer_sync+0x5c/0x70
[<ffffffff8037498f>] kjournald+0xff/0x270
[<ffffffff8025c370>] ? autoremove_wake_function+0x0/0x40
[<ffffffff80374890>] ? kjournald+0x0/0x270
[<ffffffff8025bf63>] kthread+0x63/0x90
[<ffffffff8020cffa>] child_rip+0xa/0x20
[<ffffffff8025bf00>] ? kthread+0x0/0x90
[<ffffffff8020cff0>] ? child_rip+0x0/0x20
Code: 1f 44 00 00 48 89 f8 48 8b 3d 7d 0d ca 00 48 89 c6 e8 85 35 f5
ff c9 c3 0f 1f 00 55 48 89 e5 41 54 53 0f 1f 44 00 00 48 8b 5f 40 <8b>
4b 08 85 c9 0f 88 f2 00 00 00 f0 ff 47 60 8b 53 08 85 d2 75
RIP  [<ffffffff80373520>] __journal_remove_journal_head+0x10/0x120
RSP <ffff880c7ee11d80>
CR2: 0000000000000008
---[ end trace 2a47799c65258934 ]---



Looking at the disassembly of journal_remove_head():
==============
0xffffffff8037b760 <__journal_remove_journal_head+0>:   push   %rbp
0xffffffff8037b761 <__journal_remove_journal_head+1>:   mov    %rsp,%rbp
0xffffffff8037b764 <__journal_remove_journal_head+4>:   push   %r12
0xffffffff8037b766 <__journal_remove_journal_head+6>:   push   %rbx
0xffffffff8037b767 <__journal_remove_journal_head+7>:   callq
0xffffffff8020bcc0 <mcount>
0xffffffff8037b76c <__journal_remove_journal_head+12>:  mov    0x40(%rdi),%rbx
0xffffffff8037b770 <__journal_remove_journal_head+16>:  mov
0x8(%rbx),%r8d     <====== Oops
0xffffffff8037b774 <__journal_remove_journal_head+20>:  test   %r8d,%r8d
0xffffffff8037b777 <__journal_remove_journal_head+23>:  js
0xffffffff8037b86d <__journal_remove_journal_head+269>
0xffffffff8037b77d <__journal_remove_journal_head+29>:  lock incl 0x60(%rdi)
0xffffffff8037b781 <__journal_remove_journal_head+33>:  mov    0x8(%rbx),%esi
0xffffffff8037b784 <__journal_remove_journal_head+36>:  test   %esi,%esi
0xffffffff8037b786 <__journal_remove_journal_head+38>:  jne
0xffffffff8037b78f <__journal_remove_journal_head+47>
0xffffffff8037b788 <__journal_remove_journal_head+40>:  cmpq   $0x0,0x28(%rbx)
0xffffffff8037b78d <__journal_remove_journal_head+45>:  je
0xffffffff8037b794 <__journal_remove_journal_head+52>
.......
.......
==============


The oops seems be due to NULL journal head while evaluating J_ASSERT_JH() macro:
==============
static void __journal_remove_journal_head(struct buffer_head *bh)
{
       struct journal_head *jh = bh2jh(bh);
       J_ASSERT_JH(jh, jh->b_jcount >= 0);  <=== jh is NULL
       get_bh(bh);
       if (jh->b_jcount == 0) {
               if (jh->b_transaction == NULL &&
....
=============

Not sure why would that happen (corruption?).


Few system details:
================
- 64-bit, 2 quad-core (total 8 cores) Xeon, 48GB RAM
- Stock 2.6.30.1 kernel, *no* modules
- ext3 file-system (data=ordered mode) used over encrypted (dmcrypt) disks.
- underlying storage: h/w RAID.
- ext*/jbd config values:

CONFIG_EXT3_FS=y
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
CONFIG_EXT3_FS_XATTR=y
# CONFIG_EXT3_FS_POSIX_ACL is not set
# CONFIG_EXT3_FS_SECURITY is not set
CONFIG_EXT4_FS=y
# CONFIG_EXT4DEV_COMPAT is not set
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
===================

Let me know if you need any more details. Reproducing this (or finding
a good test to trigger this) is proving to be difficult :-( It sorta
oops once in a while ;-)


thanks
abhijit

ps: please Cc: me the replies. I am not subscribed to either of the
lists -- thanks!