From criley at erad.com Tue Sep 1 17:20:32 2009 From: criley at erad.com (Charles Riley) Date: Tue, 1 Sep 2009 13:20:32 -0400 (EDT) Subject: How many files can I have safely in a subdirectory? In-Reply-To: <10290235.17241251825434412.JavaMail.root@boardwalk2.erad.com> Message-ID: <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com> Greetings, I'm not sure if it's still the case, but there used to be a limit to how many subdirectories a directory can have. 32k, to be exact. We ended up creating our own (application level) directory hashing algorithm to work around it several years ago. This might only be a kernel 2.4 thing though. I'm unaware of any limit to number of files. However once the number of files in a directory gets above about 64k, filesystem performance will significantly decrease unless the filesystem has the dir_index option. dir_index can be specified at filesystem creation or added later using tune2fs (an fsck is required). Charles Charles Riley eRAD, Inc. ----- Original Message ----- From: "z0diac" To: ext3-users at redhat.com Sent: Sunday, August 30, 2009 12:00:12 PM GMT -05:00 US/Canada Eastern Subject: How many files can I have safely in a subdirectory? Ok, I'm running a vBulletin forum (3.8.4) and found that all user attachments go into 1 single directory for each user. For each attached file in the forum, there's 2 files on disk (*.attach and *.thumb), for pictures that is. One user already has over 100,000 attachements, thus, over 200,000 files in his attach directory. Someone recently told me to 'keep an eye on it' because certain setups can't hold more than X number of files in a single directory. Yet someone else said I could have over 1 trillion files in a single directory if the HDD was large enough... Here's my setup: linux version: 2.6.18-92.1.10.el5 php: 5.1.6 mySQL: 5.0.45 File system: ext3 vB support has told me any limitation there might be, will not be the result of vB, so now I'm looking at either Linux and the way it handles files, or the ext3 file system. Does anyone know if I can just keep going with putting files into one directory? (there will be over 1million probably by year's end. Hopefully not more than 5 million ever). And, will having so many files in a single directory cause any performance problems? (ie: slowdowns) My only option is to hire a coder to somehow have it split the 1M+ files into several subdirs, say 50,000 per subdir. But even though it's messy, if it really doesn't make a difference in the end whether they're in 50 subdirs, or just 1 dir, then I won't bother (and can sigh a breath of relief) Thanks in advance!! z0diac is offline Looking for Linux Hosting? Click Here. -- View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html Sent from the Ext3 - User mailing list archive at Nabble.com. _______________________________________________ Ext3-users mailing list Ext3-users at redhat.com https://www.redhat.com/mailman/listinfo/ext3-users From darkonc at gmail.com Tue Sep 1 17:50:31 2009 From: darkonc at gmail.com (Stephen Samuel (gmail)) Date: Tue, 1 Sep 2009 10:50:31 -0700 Subject: How many files can I have safely in a subdirectory? 
In-Reply-To: <25212801.post@talk.nabble.com> References: <25212801.post@talk.nabble.com> Message-ID: <6cd50f9f0909011050h6e71b754g67d065b68b54c3df@mail.gmail.com> Well, if you presume the possibility of running into bugs when the directory gets over 2GB, and directory entries averaging under 20 bytes, then you might see a problem at around 100million entries. You can probably expect performance issues before that point. If you expect these directories to keep growing year after year, you might want to consider doing directory hashing... If nothing else, it could get ugly if someone decides to do an 'ls' on a directory with 10million entries. On Sun, Aug 30, 2009 at 9:00 AM, z0diac wrote: > > Ok, I'm running a vBulletin forum (3.8.4) and found that all user attachments > go into 1 single directory for each user. For each attached file in the > forum, there's 2 files on disk (*.attach and *.thumb), for pictures that is. > > One user already has over 100,000 attachements, thus, over 200,000 files in > his attach directory. > > Someone recently told me to 'keep an eye on it' because certain setups can't > hold more than X number of files in a single directory. Yet someone else > said I could have over 1 trillion files in a single directory if the HDD was > large enough... > > Here's my setup: > > linux version: 2.6.18-92.1.10.el5 > php: 5.1.6 > mySQL: 5.0.45 > File system: ext3 > > vB support has told me any limitation there might be, will not be the result > of vB, so now I'm looking at either Linux and the way it handles files, or > the ext3 file system. > > Does anyone know if I can just keep going with putting files into one > directory? (there will be over 1million probably by year's end. Hopefully > not more than 5 million ever). > > And, will having so many files in a single directory cause any performance > problems? (ie: slowdowns) > > My only option is to hire a coder to somehow have it split the 1M+ files > into several subdirs, say 50,000 per subdir. But even though it's messy, if > it really doesn't make a difference in the end whether they're in 50 > subdirs, or just 1 dir, then I won't bother (and can sigh a breath of > relief) > > > Thanks in advance!! > z0diac is offline > Looking for Linux Hosting? Click Here. > > -- > View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html > Sent from the Ext3 - User mailing list archive at Nabble.com. > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -- Stephen Samuel http://www.bcgreen.com Software, like love, 778-861-7641 grows when you give it away From adilger at sun.com Tue Sep 1 17:54:50 2009 From: adilger at sun.com (Andreas Dilger) Date: Tue, 01 Sep 2009 11:54:50 -0600 Subject: How many files can I have safely in a subdirectory? In-Reply-To: <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com> References: <10290235.17241251825434412.JavaMail.root@boardwalk2.erad.com> <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com> Message-ID: <20090901175450.GR4197@webber.adilger.int> On Sep 01, 2009 13:20 -0400, Charles Riley wrote: > I'm not sure if it's still the case, but there used to be a limit > to how many subdirectories a directory can have. 32k, to be exact. > We ended up creating our own (application level) directory hashing > algorithm to work around it several years ago. This might only be a > kernel 2.4 thing though. 
This is true for ext3 (max 32000 subdirectories), but in ext4 there is no specific limit on the number of subdirectories. The subdirectory limit is the same as the number of entries in the directory. > I'm unaware of any limit to number of files. However once the number > of files in a directory gets above about 64k, filesystem performance will > significantly decrease unless the filesystem has the dir_index option. > dir_index can be specified at filesystem creation or added later using > tune2fs (an fsck is required). If formatted with dir_index (which is the default for newer mke2fs for some time now) we tested up to 10M files in a single directory on a regular basis. The maximum limit depends on the filename length, but is somewhere around 15-20M for "short" filenames (e.g. 32 characters or less). > ----- Original Message ----- > From: "z0diac" > To: ext3-users at redhat.com > Sent: Sunday, August 30, 2009 12:00:12 PM GMT -05:00 US/Canada Eastern > Subject: How many files can I have safely in a subdirectory? > > > Ok, I'm running a vBulletin forum (3.8.4) and found that all user attachments > go into 1 single directory for each user. For each attached file in the > forum, there's 2 files on disk (*.attach and *.thumb), for pictures that is. > > One user already has over 100,000 attachements, thus, over 200,000 files in > his attach directory. > > Someone recently told me to 'keep an eye on it' because certain setups can't > hold more than X number of files in a single directory. Yet someone else > said I could have over 1 trillion files in a single directory if the HDD was > large enough... > > Here's my setup: > > linux version: 2.6.18-92.1.10.el5 > php: 5.1.6 > mySQL: 5.0.45 > File system: ext3 > > vB support has told me any limitation there might be, will not be the result > of vB, so now I'm looking at either Linux and the way it handles files, or > the ext3 file system. > > Does anyone know if I can just keep going with putting files into one > directory? (there will be over 1million probably by year's end. Hopefully > not more than 5 million ever). > > And, will having so many files in a single directory cause any performance > problems? (ie: slowdowns) > > My only option is to hire a coder to somehow have it split the 1M+ files > into several subdirs, say 50,000 per subdir. But even though it's messy, if > it really doesn't make a difference in the end whether they're in 50 > subdirs, or just 1 dir, then I won't bother (and can sigh a breath of > relief) > > > Thanks in advance!! > z0diac is offline > Looking for Linux Hosting? Click Here. > > -- > View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html > Sent from the Ext3 - User mailing list archive at Nabble.com. > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From bruno at wolff.to Tue Sep 1 21:00:24 2009 From: bruno at wolff.to (Bruno Wolff III) Date: Tue, 1 Sep 2009 16:00:24 -0500 Subject: How many files can I have safely in a subdirectory? 
In-Reply-To: <25212801.post@talk.nabble.com> References: <25212801.post@talk.nabble.com> Message-ID: <20090901210024.GA26393@wolff.to> On Sun, Aug 30, 2009 at 09:00:12 -0700, z0diac wrote: > > Someone recently told me to 'keep an eye on it' because certain setups can't > hold more than X number of files in a single directory. Yet someone else > said I could have over 1 trillion files in a single directory if the HDD was > large enough... When I have directories in the few million range doing mass changes gets extremely slow. Besides some the other other things mentioned, you need to worry about the inode limit on the file system. The default now, is lower than it used to be. This bit me once when I was moving a directory with lots of files to another system with a similar size partition when i was expecting a similar inode limit. From web2009 at zeroreality.com Tue Sep 1 21:13:50 2009 From: web2009 at zeroreality.com (z0diac) Date: Tue, 1 Sep 2009 14:13:50 -0700 (PDT) Subject: How many files can I have safely in a subdirectory? In-Reply-To: <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com> References: <25212801.post@talk.nabble.com> <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com> Message-ID: <25244663.post@talk.nabble.com> Thanks! And thanks to all who have replied to this thread! I will see if I can get dir_index active. Bugzilla from criley at erad.com wrote: > > Greetings, > > I'm not sure if it's still the case, but there used to be a limit to how > many subdirectories a directory can have. 32k, to be exact. We ended up > creating our own (application level) directory hashing algorithm to work > around it several years ago. This might only be a kernel 2.4 thing > though. > > I'm unaware of any limit to number of files. However once the number of > files in a directory gets above about 64k, filesystem performance will > significantly decrease unless the filesystem has the dir_index option. > dir_index can be specified at filesystem creation or added later using > tune2fs (an fsck is required). > > Charles > > Charles Riley > eRAD, Inc. > > > > ----- Original Message ----- > From: "z0diac" > To: ext3-users at redhat.com > Sent: Sunday, August 30, 2009 12:00:12 PM GMT -05:00 US/Canada Eastern > Subject: How many files can I have safely in a subdirectory? > > > Ok, I'm running a vBulletin forum (3.8.4) and found that all user > attachments > go into 1 single directory for each user. For each attached file in the > forum, there's 2 files on disk (*.attach and *.thumb), for pictures that > is. > > One user already has over 100,000 attachements, thus, over 200,000 files > in > his attach directory. > > Someone recently told me to 'keep an eye on it' because certain setups > can't > hold more than X number of files in a single directory. Yet someone else > said I could have over 1 trillion files in a single directory if the HDD > was > large enough... > > Here's my setup: > > linux version: 2.6.18-92.1.10.el5 > php: 5.1.6 > mySQL: 5.0.45 > File system: ext3 > > vB support has told me any limitation there might be, will not be the > result > of vB, so now I'm looking at either Linux and the way it handles files, or > the ext3 file system. > > Does anyone know if I can just keep going with putting files into one > directory? (there will be over 1million probably by year's end. Hopefully > not more than 5 million ever). > > And, will having so many files in a single directory cause any performance > problems? 
(ie: slowdowns) > > My only option is to hire a coder to somehow have it split the 1M+ files > into several subdirs, say 50,000 per subdir. But even though it's messy, > if > it really doesn't make a difference in the end whether they're in 50 > subdirs, or just 1 dir, then I won't bother (and can sigh a breath of > relief) > > > Thanks in advance!! > z0diac is offline > Looking for Linux Hosting? Click Here. > > -- > View this message in context: > http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html > Sent from the Ext3 - User mailing list archive at Nabble.com. > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > > -- View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25244663.html Sent from the Ext3 - User mailing list archive at Nabble.com. From web2009 at zeroreality.com Tue Sep 1 21:23:06 2009 From: web2009 at zeroreality.com (z0diac) Date: Tue, 1 Sep 2009 14:23:06 -0700 (PDT) Subject: How many files can I have safely in a subdirectory? In-Reply-To: <25244663.post@talk.nabble.com> References: <25212801.post@talk.nabble.com> <31886612.17261251825632868.JavaMail.root@boardwalk2.erad.com> <25244663.post@talk.nabble.com> Message-ID: <25245073.post@talk.nabble.com> There was something mentioned in a search about dir_index and I checked and apparently it *is* running: # sudo tune2fs -l /dev/sdb1 | grep dir_index Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file .. so I'm not sure if I can keep dumping files into the same directory and not have to worry as much about performance ( ? ) It would be MUCh easier for me if I could, instead of having to login under multiple accounts. There shouldn't be much more than 1-2M files in the directory anyway. The partition is only a 250GB anyway, and each file is anywhere from 50-200kb on average, so there's just not enough space to hold that quantity of files anyway. ie: I'm sure I"ll run out of drive space before having too many files affects performance (hope hope) -- View this message in context: http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25245073.html Sent from the Ext3 - User mailing list archive at Nabble.com. From darkonc at gmail.com Wed Sep 2 00:42:20 2009 From: darkonc at gmail.com (Stephen Samuel (gmail)) Date: Tue, 1 Sep 2009 17:42:20 -0700 Subject: How many files can I have safely in a subdirectory? In-Reply-To: <7.0.1.0.2.20090901145754.0269eca8@zeroreality.com> References: <25212801.post@talk.nabble.com> <6cd50f9f0909011050h6e71b754g67d065b68b54c3df@mail.gmail.com> <7.0.1.0.2.20090901145754.0269eca8@zeroreality.com> Message-ID: <6cd50f9f0909011742r587c4bc4tce4c51bc262dd07b@mail.gmail.com> The 2GB worry isn't about the contents of files in the directory, but rather what happens when the directory 'file' itself gets to be over 2GB in size. If something's likely to break, then it's either there or at 4GB (( i.e. if somebody made the mistake of using a 32bit pointer in the wrong place )). There are probably not a whole lot of examples of directories (as opposed to data files) getting larger than 2GB, so I'd consider it relatively uncharted territory. 
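One cheap way to keep an eye on it is to check how big the directory's own entry "file" has grown; a minimal sketch for illustration only (the path is a placeholder, this is not code from the thread):

/* dirsize.c -- print how large a directory's own entry "file" is.
   Build: cc dirsize.c -o dirsize ; run: ./dirsize /path/to/attach */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s DIRECTORY\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0) {
        perror(argv[1]);
        return 1;
    }
    /* st_size of a directory counts its entry blocks,
       not the data of the files stored in it */
    printf("%s: %lld bytes of directory entries\n",
           argv[1], (long long)st.st_size);
    return 0;
}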
On Tue, Sep 1, 2009 at 12:00 PM, Marc wrote: > Thank you for the reply. (it seemed to come only via email as your reply > and one other to my post, aren't showing up in the thread). > > Yes there is definitely over 2gb of files in that directory. My 'inodes' > are only at 3% of the 590M or whatever it was, that were created at the time > the disc structure was created. I just wasn't sure if there was a limit > with ext3 to the # of files that could reside in a single directory. I > guess I will have to start adding new files under a new user account (which > will put them in that user's attachment subdir). > > Thanks.! > > At 01:50 PM 9/1/2009, you wrote: >> >> Well, if you presume the possibility of running into bugs when the >> directory gets over 2GB, >> and directory entries averaging under 20 bytes, then you might see a >> problem at around >> 100million entries. >> You can probably expect performance issues before that point. >> If you expect these directories to keep growing year after year, >> you might want to consider doing directory hashing... >> If nothing else, it could get ugly if someone decides to do an 'ls' >> on a directory with 10million entries. >> >> >> On Sun, Aug 30, 2009 at 9:00 AM, z0diac wrote: >> > >> > Ok, I'm running a vBulletin forum (3.8.4) and found that all user >> > attachments >> > go into 1 single directory for each user. For each attached file in the >> > forum, there's 2 files on disk (*.attach and *.thumb), for pictures that >> > is. >> > >> > One user already has over 100,000 attachements, thus, over 200,000 files >> > in >> > his attach directory. >> > >> > Someone recently told me to 'keep an eye on it' because certain setups >> > can't >> > hold more than X number of files in a single directory. Yet someone else >> > said I could have over 1 trillion files in a single directory if the HDD >> > was >> > large enough... >> > >> > Here's my setup: >> > >> > linux version: 2.6.18-92.1.10.el5 >> > php: 5.1.6 >> > mySQL: 5.0.45 >> > File system: ext3 >> > >> > vB support has told me any limitation there might be, will not be the >> > result >> > of vB, so now I'm looking at either Linux and the way it handles files, >> > or >> > the ext3 file system. >> > >> > Does anyone know if I can just keep going with putting files into one >> > directory? (there will be over 1million probably by year's end. >> > Hopefully >> > not more than 5 million ever). >> > >> > And, will having so many files in a single directory cause any >> > performance >> > problems? (ie: slowdowns) >> > >> > My only option is to hire a coder to somehow have it split the 1M+ files >> > into several subdirs, say 50,000 per subdir. But even though it's messy, >> > if >> > it really doesn't make a difference in the end whether they're in 50 >> > subdirs, or just 1 dir, then I won't bother (and can sigh a breath of >> > relief) >> > >> > >> > Thanks in advance!! >> > z0diac is offline >> > Looking for Linux Hosting? Click Here. >> > >> > -- >> > View this message in context: >> > http://www.nabble.com/How-many-files-can-I-have-safely-in-a-subdirectory--tp25212801p25212801.html >> > Sent from the Ext3 - User mailing list archive at Nabble.com. >> > >> > _______________________________________________ >> > Ext3-users mailing list >> > Ext3-users at redhat.com >> > https://www.redhat.com/mailman/listinfo/ext3-users >> > >> >> >> >> -- >> Stephen Samuel http://www.bcgreen.com Software, like love, >> 778-861-7641 grows when you give it away > > -- Stephen Samuel http://www.bcgreen.com Software, like love, 778-861-7641 grows when you give it away From per.lanvin at fouredge.se Wed Sep 9 13:00:09 2009 From: per.lanvin at fouredge.se (Pär Lanvin) Date: Wed, 9 Sep 2009 15:00:09 +0200 Subject: Many small files, best practise. Message-ID: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> //Sys RHEL 5.3 ~1000.000.000 files (1-30k) ~7TB in total // Hi, I'm looking for a best practice when implementing this using EXT3 (or some other FS if it shouldn't do the job.). On average the reads dominate (99%), writes are only used for updating and isn't a part of the service provided. The data is divided into 200k directories with each some 5k files. This ratio (dir/files) can be altered to optimize FS performance. Any suggestions are greatly appreciated. Rgds /PL From rwheeler at redhat.com Wed Sep 9 13:37:44 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Wed, 09 Sep 2009 09:37:44 -0400 Subject: Many small files, best practise. In-Reply-To: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> Message-ID: <4AA7AFA8.5040502@redhat.com> On 09/09/2009 09:00 AM, Pär Lanvin wrote: > > //Sys > RHEL 5.3 > ~1000.000.000 files (1-30k) > ~7TB in total > // > > Hi, > > I'm looking for a best practice when implementing this using EXT3 (or some other FS if it shouldn't do the job.). > > On average the reads dominate (99%), writes are only used for updating and isn't a part of the service provided. > The data is divided into 200k directories with each some 5k files. This ratio (dir/files) can be altered to > optimize FS performance. > > Any suggestions are greatly appreciated. > > > Rgds > > /PL Hi Par, This sounds a lot like the challenges I had in my recent past working on a similar storage system. One key that you will find is to make sure that you minimize head movement while doing the writing. The best performance would be to have a few threads (say 4-8) write to the same subdirectory for a period of time of a few minutes (say 3-5) before moving on to a new directory. If you are writing to a local S-ATA disk, ext3/4 can write a few thousand files/sec without doing any fsync() operations. With fsync(), you will drop down quite a lot. One layout for directories that works well with this kind of thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where MIN might be 0, 5, 10, ..., 55 for example). When reading files in ext3 (and ext4) or doing other bulk operations like a large deletion, it is important to sort the files by inode (do the readdir, get say all of the 5k files in your subdir and then sort by inode before doing your bulk operation). Good luck! Ric From pg_ext3 at ext3.for.sabi.co.UK Mon Sep 14 09:40:18 2009 From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi) Date: Mon, 14 Sep 2009 10:40:18 +0100 Subject: Many small files, best practise. In-Reply-To: <4AA7AFA8.5040502@redhat.com> References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> <4AA7AFA8.5040502@redhat.com> Message-ID: <19118.3970.628895.372996@tree.ty.sabi.co.uk> >> RHEL 5.3 >> ~1000.000.000 files (1-30k) >> ~7TB in total >> // >> I'm looking for a best practice when implementing this using >> EXT3 (or some other FS if it shouldn't do the job.). "best practice" would be a rather radical solution.
>> On average the reads dominate (99%), writes are only used for >> updating and isn't a part of the service provided. The data >> is divided into 200k directories with each some 5k files. >> This ratio (dir/files) can be altered to optimize FS >> performance. > If you are writing to a local S-ATA disk, ext3/4 can write a > few thousand files/sec without doing any fsync() operations. > With fsync(), you will drop down quite a lot. Unfortunately using 'fsync' is a good idea for production systems. Also note that in order to write 10^9 files at 10^3/s rate takes 10^6 seconds; roughly 10 days to populate the filesystem (or at least that to restore it from backups). > One layout for directories that works well with this kind of > thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where > MIN might be 0, 5, 10, ..., 55 for example). As to the problem above and ths kind of solution, I reckon that it is utterly absurd (and I could have used much stronger words). BTW, the sort of people who consider seriously such utter absurdities try to do a thorough job, and I don't want to know how the underlying storage system is structured :-). If anything, consider the obvious (obvious except to those who want to use a filesystem as a small record database), which is 'fsck' time, in particular given the structure of 'ext3' (or 'ext4') metadata. So: just don't use a filesystem as a database, spare us the horror; use a database, even a simple one, which is not utterly absurd. Compare these two: http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html http://lists.gllug.org.uk/pipermail/gllug/2005-October/055488.html Anyhow I do see a lot of inane questions and "solutions" like the above in various lists (usually the XFS one, which attracts a lot of utter absurdities). > When reading files in ext3 (and ext4) or doing other bulk > operations like a large deletion, it is important to sort the > files by inode (do the readdir, get say all of the 5k files in > your subdir and then sort by inode before doing your bulk > operation). Good idea, but it is best to avoid the cases where this matters. From r.majumdar at globallogic.com Mon Sep 14 09:50:01 2009 From: r.majumdar at globallogic.com (Ritesh Majumdar) Date: Mon, 14 Sep 2009 15:20:01 +0530 Subject: Untar hangs on ext3 file system Message-ID: <1252921801.10534.9.camel@ripper.synapse.com> Hello List, I am trying to untar a 450 MB tar file on ext3 file system, but every time untar (using the command "tar zxvf ) hangs and I see no disk activity. While I use ReiserFS file system I can untar the same file successfully. I am not sure what is missing here. Please Help!!! Many Thanks, Ritesh. From rwheeler at redhat.com Mon Sep 14 11:34:39 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Mon, 14 Sep 2009 07:34:39 -0400 Subject: Many small files, best practise. In-Reply-To: <19118.3970.628895.372996@tree.ty.sabi.co.uk> References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> <4AA7AFA8.5040502@redhat.com> <19118.3970.628895.372996@tree.ty.sabi.co.uk> Message-ID: <4AAE2A4F.8010409@redhat.com> On 09/14/2009 05:40 AM, Peter Grandi wrote: > >>> RHEL 5.3 >>> ~1000.000.000 files (1-30k) >>> ~7TB in total >>> // >>> > >>> I'm looking for a best practice when implementing this using >>> EXT3 (or some other FS if it shouldn't do the job.). >>> > "best practice" would be a rather radical solution. > > >>> On average the reads dominate (99%), writes are only used for >>> updating and isn't a part of the service provided. 
The data >>> is divided into 200k directories with each some 5k files. >>> This ratio (dir/files) can be altered to optimize FS >>> performance. >>> > >> If you are writing to a local S-ATA disk, ext3/4 can write a >> few thousand files/sec without doing any fsync() operations. >> With fsync(), you will drop down quite a lot. >> > Unfortunately using 'fsync' is a good idea for production > systems. > > Also note that in order to write 10^9 files at 10^3/s rate takes > 10^6 seconds; roughly 10 days to populate the filesystem (or at > least that to restore it from backups). > > One thing that you can do when doing bulk loads of files (say, during a restore or migration), is to use a two phase write. First, write each of a batch of files (say 1000 files at a time), then go back and reopen/fsync/close them. This will give you performance levels closer to not using fsync() and still give you good data integrity. Note that this usually is a good fit for this class of operations since you can always restart the bulk load if you have a crash/error/etc. To give this a try, you can use "fs_mark" to write say 100k files with the fsync one file at a time (-S 1, its default) or use one of the batch fsync modes (-S 3 for example). >> One layout for directories that works well with this kind of >> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where >> MIN might be 0, 5, 10, ..., 55 for example). >> > As to the problem above and ths kind of solution, I reckon that > it is utterly absurd (and I could have used much stronger words). > When you deal with systems that store millions of files, you pretty much always are going to use some kind of made up directory layout. The above scheme works pretty well in that it correlates well to normal usage patterns and queries (and tends to have those subdirectories laid out contiguously). You can always try to write 1 million files in a single subdirectory, but if you are writing your own application, using this kind of scheme is pretty trivial. > BTW, the sort of people who consider seriously such utter > absurdities try to do a thorough job, and I don't want to > know how the underlying storage system is structured :-). > > If anything, consider the obvious (obvious except to those who > want to use a filesystem as a small record database), which is > 'fsck' time, in particular given the structure of 'ext3' (or > 'ext4') metadata. > fsck time has improved quite a lot recently with ext4 (and with xfs). > So: just don't use a filesystem as a database, spare us the > horror; use a database, even a simple one, which is not utterly > absurd. > > Compare these two: > > http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html > In this case, doing the bulk load I described above (reading in sorted order, writing out in the same), would significantly reduce the time of the restore. > http://lists.gllug.org.uk/pipermail/gllug/2005-October/055488.html > > Anyhow I do see a lot of inane questions and "solutions" like > the above in various lists (usually the XFS one, which attracts > a lot of utter absurdities). > > >> When reading files in ext3 (and ext4) or doing other bulk >> operations like a large deletion, it is important to sort the >> files by inode (do the readdir, get say all of the 5k files in >> your subdir and then sort by inode before doing your bulk >> operation). >> > Good idea, but it is best to avoid the cases where this matters. 
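As a concrete illustration of the two-phase bulk load described above (a minimal sketch only, not fs_mark itself; file names and sizes are made up): write a whole batch of files without fsync(), then reopen, fsync() and close each one in a second pass.

/* twophase.c -- batch create, then batch fsync, as sketched above.
   Build: cc twophase.c -o twophase */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH 1000

static void write_one(const char *name, const char *buf, size_t len)
{
    int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, buf, len) != (ssize_t)len) {
        perror(name);
        exit(1);
    }
    close(fd);                  /* phase 1: no fsync yet */
}

static void sync_one(const char *name)
{
    int fd = open(name, O_RDONLY);
    if (fd < 0 || fsync(fd) != 0) {
        perror(name);
        exit(1);
    }
    close(fd);                  /* phase 2: reopen, fsync, close */
}

int main(void)
{
    static char payload[20 * 1024];   /* dummy 20KB record */
    char name[64];
    int i;

    memset(payload, 'x', sizeof(payload));
    for (i = 0; i < BATCH; i++) {     /* phase 1: create the whole batch */
        snprintf(name, sizeof(name), "file-%06d", i);
        write_one(name, payload, sizeof(payload));
    }
    for (i = 0; i < BATCH; i++) {     /* phase 2: make the batch durable */
        snprintf(name, sizeof(name), "file-%06d", i);
        sync_one(name);
    }
    return 0;
}

The batch fsync mode of fs_mark mentioned above (-S 3) exercises essentially this pattern.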
> > From sandeen at redhat.com Mon Sep 14 16:43:22 2009 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 14 Sep 2009 11:43:22 -0500 Subject: Untar hangs on ext3 file system In-Reply-To: <1252921801.10534.9.camel@ripper.synapse.com> References: <1252921801.10534.9.camel@ripper.synapse.com> Message-ID: <4AAE72AA.2020404@redhat.com> Ritesh Majumdar wrote: > Hello List, > > I am trying to untar a 450 MB tar file on ext3 file system, but every > time untar (using the command "tar zxvf ) hangs and I see no > disk activity. Stating which kernel you are using would be a help ... stracing the untar might tell you where it's at from the userspace perspective; echo t > /proc/sysrq-trigger would give you all of the kernel thread tracebacks, and you could find the tar process in there to see where it is stuck. -Eric > While I use ReiserFS file system I can untar the same file successfully. > > I am not sure what is missing here. > > Please Help!!! > > Many Thanks, > Ritesh. From pg_ext3 at ext3.for.sabi.co.UK Mon Sep 14 21:08:58 2009 From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi) Date: Mon, 14 Sep 2009 22:08:58 +0100 Subject: Many small files, best practise. In-Reply-To: <4AAE2A4F.8010409@redhat.com> References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> <4AA7AFA8.5040502@redhat.com> <19118.3970.628895.372996@tree.ty.sabi.co.uk> <4AAE2A4F.8010409@redhat.com> Message-ID: <19118.45290.900343.204958@tree.ty.sabi.co.uk> [ ... ] >> Also note that in order to write 10^9 files at 10^3/s rate >> takes 10^6 seconds; roughly 10 days to populate the >> filesystem (or at least that to restore it from backups). > One thing that you can do when doing bulk loads of files (say, > during a restore or migration), is to use a two phase > write. First, write each of a batch of files (say 1000 files > at a time), then go back and reopen/fsync/close them. Why not just restore a database? >>> One layout for directories that works well with this kind of >>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where >>> MIN might be 0, 5, 10, ..., 55 for example). >> As to the problem above and ths kind of solution, I reckon that >> it is utterly absurd (and I could have used much stronger words). > When you deal with systems that store millions of files, Millions of files may work; but 1 billion is an utter absurdity. A filesystem that can store reasonably 1 billion small files in 7TB is an unsolved research issue... The obvious thing to do is to use a database, and there is no way around this point. If one genuinely needs to store a lot of files, why not split them into many independent filesystems? A single large one is only need to allow for hard linking or for having a single large space pool, and in applications where the directory structure above makes any kind of sense that neither is usually required. > you pretty much always are going to use some kind of made up > directory layout. File systems are usually used for storing somewhat unstructured information, not records that can be looked up with a simple "YEAR/MONTH/DAY/HOUR/MIN" key, which seems very suitable for something like a simpel DBMS. There is even a tendency to move filesystems into databases, as they scale a lot better. And for cases where a filesystem still makes sense I would rather use, instead of the inane manylevel directory structure above, a file system design with proper tree indexes and perhaps even one with the ability to store small files into inodes. [ ... 
] > You can always try to write 1 million files in a single > subdirectory, Again, I'd rather avoid anything like that. > but if you are writing your own application, using this kind > of scheme is pretty trivial. And an utter absurdity, for 1 billion files in 200k directories. Both on its own merits and compared to the OBVIOUS alternative. >> If anything, consider the obvious (obvious except to those >> who want to use a filesystem as a small record database), >> which is 'fsck' time, in particular given the structure of >> 'ext3' (or 'ext4') metadata. > fsck time has improved quite a lot recently with ext4 (and > with xfs). How many months do you think a 7TB filesystem with 1 billion files would take to 'fsck' even with those improvements? Even with the nice improvements? [ ... ] From rwheeler at redhat.com Wed Sep 16 18:56:54 2009 From: rwheeler at redhat.com (Ric Wheeler) Date: Wed, 16 Sep 2009 14:56:54 -0400 Subject: Many small files, best practise. In-Reply-To: <19118.45290.900343.204958@tree.ty.sabi.co.uk> References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> <4AA7AFA8.5040502@redhat.com> <19118.3970.628895.372996@tree.ty.sabi.co.uk> <4AAE2A4F.8010409@redhat.com> <19118.45290.900343.204958@tree.ty.sabi.co.uk> Message-ID: <4AB134F6.2060900@redhat.com> On 09/14/2009 05:08 PM, Peter Grandi wrote: > [ ... ] > >>> Also note that in order to write 10^9 files at 10^3/s rate >>> takes 10^6 seconds; roughly 10 days to populate the >>> filesystem (or at least that to restore it from backups). > >> One thing that you can do when doing bulk loads of files (say, >> during a restore or migration), is to use a two phase >> write. First, write each of a batch of files (say 1000 files >> at a time), then go back and reopen/fsync/close them. > > Why not just restore a database? If you started with a database, that would be reasonable. If you started with a file system, I guess I don't understand what you are suggesting. > >>>> One layout for directories that works well with this kind of >>>> thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN where >>>> MIN might be 0, 5, 10, ..., 55 for example). > >>> As to the problem above and ths kind of solution, I reckon that >>> it is utterly absurd (and I could have used much stronger words). > >> When you deal with systems that store millions of files, > > Millions of files may work; but 1 billion is an utter absurdity. > A filesystem that can store reasonably 1 billion small files in > 7TB is an unsolved research issue... Strangely enough, I have been testing ext4 and stopped filling it at a bit over 1 billion 20KB files on Monday (with 60TB of storage). Running fsck on it took only 2.4 hours. > > The obvious thing to do is to use a database, and there is no > way around this point. Everything has a use case. I am certainly not an anti-DB person, but your assertion alone is not convincing. > > If one genuinely needs to store a lot of files, why not split > them into many independent filesystems? A single large one is > only need to allow for hard linking or for having a single large > space pool, and in applications where the directory structure > above makes any kind of sense that neither is usually required. Splitting a big file system into small ones means that you (the application or sys admin) must load balance where to put new files instead of having the system do it for you. >> you pretty much always are going to use some kind of made up >> directory layout. 
The use case for big file systems with lots of small files (at least the one that I know of) is for object based file systems where files usually have odd, non-humanly generated file names (think guids with time stamps and digital signatures). These are pretty trivial to map into the time based directory scheme I mentioned before. > > File systems are usually used for storing somewhat unstructured > information, not records that can be looked up with a simple > "YEAR/MONTH/DAY/HOUR/MIN" key, which seems very suitable for > something like a simpel DBMS. > > There is even a tendency to move filesystems into databases, as > they scale a lot better. > > And for cases where a filesystem still makes sense I would > rather use, instead of the inane manylevel directory structure > above, a file system design with proper tree indexes and perhaps > even one with the ability to store small files into inodes. > > [ ... ] Have you tried to make a production DB with 1 billion records? Or done experiments with fs vs db schemes? > >> You can always try to write 1 million files in a single >> subdirectory, > > Again, I'd rather avoid anything like that. > >> but if you are writing your own application, using this kind >> of scheme is pretty trivial. > > And an utter absurdity, for 1 billion files in 200k directories. > Both on its own merits and compared to the OBVIOUS alternative. > >>> If anything, consider the obvious (obvious except to those >>> who want to use a filesystem as a small record database), >>> which is 'fsck' time, in particular given the structure of >>> 'ext3' (or 'ext4') metadata. > >> fsck time has improved quite a lot recently with ext4 (and >> with xfs). > > How many months do you think a 7TB filesystem with 1 billion > files would take to 'fsck' even with those improvements? Even > with the nice improvements? > 20KB files written to ext4 run at around 3,000 files/sec. It took us about 4 days to fill it to 1 billion files and 2.4 hours to fsck. Not to be mean, but I have worked in this exact area and have benchmarked both large DB instances and large file systems. Good use cases exist for both, but the facts do not back up your DB is the only solution proposal :-) ric From adilger at sun.com Wed Sep 16 22:28:29 2009 From: adilger at sun.com (Andreas Dilger) Date: Wed, 16 Sep 2009 16:28:29 -0600 Subject: Many small files, best practise. In-Reply-To: <19118.45290.900343.204958@tree.ty.sabi.co.uk> References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> <4AA7AFA8.5040502@redhat.com> <19118.3970.628895.372996@tree.ty.sabi.co.uk> <4AAE2A4F.8010409@redhat.com> <19118.45290.900343.204958@tree.ty.sabi.co.uk> Message-ID: <20090916222829.GQ2537@webber.adilger.int> On Sep 14, 2009 22:08 +0100, Peter Grandi wrote: > > When you deal with systems that store millions of files, > > Millions of files may work; but 1 billion is an utter absurdity. > A filesystem that can store reasonably 1 billion small files in > 7TB is an unsolved research issue... I'd disagree. We have Lustre filesystems with 500M files on the ext4(ish) metadata server, and these are only 4TB. Note there is NO DATA in the metadata files, so it isn't quite like a normal filesystem. It also depends on what you mean by "small files". We've previously discussed storing small file data in an extended attribute, and if you are tuning for this and the file size is small enough (3kB or less) the file data could be stored inside the inode (i.e. zero seek data IO). 
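From user space that idea looks roughly like the following (a sketch for illustration, not code from this thread): keep the record body in a user.* attribute on an otherwise empty file, so that with a large enough inode size (chosen at mke2fs time with -I) a small attribute can live in the inode table instead of a separate block.

/* earecord.c -- store a small record as an extended attribute.
   Build: cc earecord.c -o earecord */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
    const char *path = "record-000001";          /* hypothetical name */
    const char *payload = "small record body";   /* <= ~3kB per the above */
    char buf[4096];
    ssize_t n;
    int fd;

    fd = open(path, O_WRONLY | O_CREAT, 0644);   /* zero-length file */
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (setxattr(path, "user.payload", payload, strlen(payload), 0) != 0) {
        perror("setxattr");
        return 1;
    }
    n = getxattr(path, "user.payload", buf, sizeof(buf));
    if (n < 0) { perror("getxattr"); return 1; }
    printf("read back %zd bytes\n", n);
    return 0;
}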
> > fsck time has improved quite a lot recently with ext4 (and > > with xfs). > > How many months do you think a 7TB filesystem with 1 billion > files would take to 'fsck' even with those improvements? Even > with the nice improvements? I think you aren't backing your comments with any facts. The e2fsck time on our MDS filesystems with 500M IN USE inodes is on the order of 4 hours (disk-based RAID-1+0 array). If this was on a RAID-1+0 SSD it could be noticeably faster. Ric also commented previously about single-digit hours for e2fsck on a test 1B file ext4 filesystem. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From pg_ext3 at ext3.for.sabi.co.UK Mon Sep 21 13:54:44 2009 From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi) Date: Mon, 21 Sep 2009 14:54:44 +0100 Subject: Many small files, best practise. In-Reply-To: <4AB134F6.2060900@redhat.com> References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se> <4AA7AFA8.5040502@redhat.com> <19118.3970.628895.372996@tree.ty.sabi.co.uk> <4AAE2A4F.8010409@redhat.com> <19118.45290.900343.204958@tree.ty.sabi.co.uk> <4AB134F6.2060900@redhat.com> Message-ID: <19127.34212.425251.424259@tree.ty.sabi.co.uk> [ ... whether 1 billion 7KB (average) records are best stored in a database or 1 per file in a file system ... ] >>> One thing that you can do when doing bulk loads of files >>> (say, during a restore or migration), is to use a two phase >>> write. First, write each of a batch of files (say 1000 files >>> at a time), then go back and reopen/fsync/close them. >> Why not just restore a database? > If you started with a database, that would be reasonable. If > you started with a file system, I guess I don't understand > what you are suggesting. Well, the topic of this discussion is whether one *should* start with a database for the "lots of small records" case. It is not a new topic by any means -- there have been many debates in the past as to how silly it is to have immense file-per-message news/mail spool archives with lots of little files. The outcome has always been to store them in databases of one sort or another. >>>>> One layout for directories that works well with this kind >>>>> of thing is a time based one (say YEAR/MONTH/DAY/HOUR/MIN >>>>> where MIN might be 0, 5, 10, ..., 55 for example). >>> As to the problem above and this kind of solution, I reckon >>> that it is utterly absurd (and I could have used much >>> stronger words). >>> When you deal with systems that store millions of files, >> Millions of files may work; but 1 billion is an utter >> absurdity. A filesystem that can store reasonably 1 billion >> small files in 7TB is an unsolved research issue ... [ >> ... and fsck ... ] > Strangely enough, I have been testing ext4 and stopped filling > it at a bit over 1 billion 20KB files on Monday (with 60TB of > storage). Is that a *reasonable* use of a filesystem? Have you compared to storing 1 billion 20KB records in a simple database? As an aside, 20KB is no longer really in the "small files" range. For example, one stupid idea of storing records as "small files" is the enormous internal fragmentation caused by 4KiB allocation granularity, which swells space used too. Even for the original problem, which was about: > ~1000.000.000 files (1-30k) > ~7TB in total that is presumably lots of files under 4KiB if the average file size is 7KB in a range between 1-30KB.
Also looking at my humble home system, at the root filesystem and a media (RPMs, TARs, ZIPs, JPGs, ISOs, ...) archival filesystem (both JFS): base# df / /fs/basho Filesystem 1M-blocks Used Available Use% Mounted on /dev/sdb1 11902 9712 2191 82% / /dev/sda8 238426 228853 9573 96% /fs/basho base# df -i / /fs/basho Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sdb1 4873024 359964 4513060 8% / /dev/sda8 19738976 126493 19612483 1% /fs/basho I see that files under 4K are the vast majority on one and a large majority on the other: base# find / -xdev -type f -size -4000 | wc -l 305064 base# find /fs/basho -xdev -type f -size -4000 | wc -l 107255 Anyhow, while some people do make (because they do "work") filesystems with millions and even billions of inodes and/or 60TB capacities (on 60+1 RAID5s sometimes), the question is whether it makes sense or is an absurdity on its own merits and when compared to a database. That something stupid can be done is not an argument for doing it. The arguments I referred to in my original comments show just how expensive it is to misuse a directory hierarchy in a filesystem as if it were an index in a database, by comparing them: "I have a little script, the job of which is to create a lot of very small files (~1 million files, typically ~50-100 bytes each)." "It's a bit of a one-off (or twice, maybe) script, and currently due to finish in about 15 hours," "creates a Berkeley DB database of K records of random length varying between I and J bytes," "So, we got 130MiB of disc space used in a single file, >2500 records sustained per second inserted over 6 minutes and a half," Perhaps 50-100 bytes is a bit extreme, but still compare "due to finish in about 15 hours" with "6 minutes and a half". Now, in that case a large part of the speedup is that the records were small enough that 1m of them as a database would fit into memory (that BTW was part of the point why using a filesystem for that was utterly absurd). I'd rather not do a test with 1G 6-7KB records on my (fairly standard, small, 2GHz CPU, 2GiB RAM) home PC, but 1M 6-7KB records is of course feasible, and on a single modern disk with 1 TB (and a slightly prettified updated script using BTREE) I get (1M records with a 12 byte key, record length random between 2000 and 10000 bytes): base# rm manyt.db base# time perl manymake.pl manyt.db 1000000 2000 10000 1 percent done, 990000 to go 2 percent done, 980000 to go 3 percent done, 970000 to go .... 98 percent done, 20000 to go 99 percent done, 10000 to go 100 percent done, 0 to go real 81m6.812s user 0m29.957s sys 0m30.124s base# ls -ld manyt.db -rw------- 1 root root 8108961792 Sep 19 20:36 manyt.db The creation script flushes every 1% too, but from the pathetic peak 3-4MB/s write rate it is pretty obvious that on my system things don't get cached a lot (by design...). As to reading, 10000 records at random among those 1M: base# time perl manyseek.pl manyt.db 1000000 10000 1 percent done, 9900 to go 2 percent done, 9800 to go 3 percent done, 9700 to go .... 98 percent done, 200 to go 99 percent done, 100 to go 100 percent done, 0 to go average length: 5984.4108 real 7m22.016s user 0m0.210s sys 0m0.442s That is on the slower half of a 1T drive in a half empty JFS filesystem. That's 200/s 6KB average records inserted, and about 22/s looked up, which is about as good as the drive can do, all in a single 8GB file. Sure, a lot slower than 50-100 bytes as it can no longer mostly fit into memory, but still way off "due to finish in about 15 hours".
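(The Perl scripts driving the runs above are not included in the message; purely to illustrate the same approach, a rough C sketch of such a bulk loader -- an assumed reconstruction, not the actual manymake.pl -- could look like this, built with 'cc bulkload.c -o bulkload -ldb'.)

/* bulkload.c -- insert N random-length records into one BTREE file. */
#include <db.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    DB *dbp;
    DBT key, val;
    char kbuf[16];
    static char vbuf[30 * 1024];
    long i, n = (argc > 1) ? atol(argv[1]) : 1000000;

    if (db_create(&dbp, NULL, 0) != 0 ||
        dbp->open(dbp, NULL, "many.db", NULL, DB_BTREE, DB_CREATE, 0644) != 0) {
        fprintf(stderr, "cannot create/open many.db\n");
        return 1;
    }
    memset(vbuf, 'x', sizeof(vbuf));
    for (i = 0; i < n; i++) {
        size_t len = 2000 + (size_t)(rand() % 8001);  /* 2000..10000 bytes */

        memset(&key, 0, sizeof(key));
        memset(&val, 0, sizeof(val));
        snprintf(kbuf, sizeof(kbuf), "%012ld", i);    /* 12-byte key */
        key.data = kbuf;  key.size = 12;
        val.data = vbuf;  val.size = (u_int32_t)len;
        if (dbp->put(dbp, NULL, &key, &val, 0) != 0) {
            fprintf(stderr, "put failed at record %ld\n", i);
            return 1;
        }
    }
    dbp->close(dbp, 0);    /* all records end up in one indexed file */
    return 0;
}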
Sure the system I used for the new test is a bit faster than the one used for the "in about 15 hours" test, but we are still talking one arm, which is largely the bottleneck. But wait -- I am JOKING, because it is ridiculous to load a 1M record dataset into an indexed database one record at a time. Sure it is *possible*, but any sensible database has a bulk loader that builds the index after loading the data. So in any reasonable scenario the difference when *restoring* a backed-up filesystem will be rather bigger than for the scenario above. Sure, some file systems have 'dump' like tools that help, but they don't recreate a nice index, they just restore it. Ah well. Now let's see a much bigger scale test: > [ ... ] testing ext4 and stopped filling it at a bit over 1 > billion 20KB files on Monday (with 60TB of storage). Running > fsck on it took only 2.4 hours. [ ... ] > [ ... ] 20KB files written to ext4 run at around 3,000 > files/sec. It took us about 4 days to fill it to 1 billion > files [ ... ] That sounds like you did use 'fsync' per file or something similar, as you had written: >>>> If you are writing to a local S-ATA disk, ext3/4 can write a >>>> few thousand files/sec without doing any fsync() operations. >>>> With fsync(), you will drop down quite a lot. and here you report around 3000/s over a 60TB array. Then 20KBx3000/s is 60MB/s -- rather unimpressive score for a 60TB filesystem (presumably spread over 60 drives or more), even with 'fsync'. And the creation record rate itself looks like about 50 records/s per drive. That is rather disappointing. Yes, they are larger files, but that should not cause that much slowdown. Also, the storage layout is not declared (except that you are storing 20TB of data in 60TB of drives, which is a bit of a cheat), and it would also be quite interesting to see the output of that 'fsck' run: > and 2.4 hours to fsck. But that is an unreasonable test, even if it is the type of test popular with some file system designers, precisely because... Testing file system performance just after loading is a naive or cheating exercise, especially with 'ext4' (and 'ext3'), as after loading all those inodes and files are going to be nearly optimally laid out (e.g. list of inode numbers in a directory pretty much sequential), and with 'ext4' each file will consist of a single extent (hopefully), so less metadata. But a filesystem that simulates a simple small object database will as a rule not be so lucky; it will grow and be modified. Even worse, 'fsck' on a filesystem *without damage* is just an exercise in enumerating inodes and other metadata. What is interesting is what happens when there is damage and 'fsck' has to start cross-correlating metadata. So here are some more realistic 'fsck' estimates from other filesystems and other times, which should be very familiar to those considering utterly absurd designs: http://ukai.org/b/log/debian/snapshot "long fsck on disks for old snapshot.debian.net is completed today. It takes 75 days!" "It still fsck for a month.... root 6235 36.1 59.7 1080080 307808 pts/2 D+ Jun21 15911:50 fsck.ext3 /dev/md5" That was I think before some improvements to 'ext3' checking. http://groups.google.com/group/linux.debian.ports.x86-64/msg/fd2b4d46a4c294b5 "Keep in mind if you go with XFS, you're going to need 10-15 gig of memory or swap space to fsck 6tb.. it needs about 9 gig to xfs_check, and 3 gig to xfs_repair a 4tb array on one of my systems.. oh, and a couple days to do either. :)" "> Generally, IMHO no.
A fsck will cost a lot of time with > all filesystems. Some worse than others though.. looks like this 4tb is going to take 3 weeks.. it took about 3-4 hours on ext3.. If i had a couple gig of ram to put in the server that'd probably help though, as it's constantly swapping out a few meg a second." http://lists.us.dell.com/pipermail/linux-poweredge/2007-November/033821.html "> I'll definitely be considering that, as I already had to > wait hours for fsck to run on some 2 to 3TB ext3 > filesystems after crashes. I know it can be disabled, but > I do feel better forcing a complete check after a system > crash, especially if the filesystem had been mounted for > very long, like a year or so, and heavily used. The decision process for using ext3 on large volumes is simple: Can you accept downtimes measured in hours (or days) due to fsck? No - don't use ext3." http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/ "Yesterday I had fun time repairing 1.5Tb ext3 partition, containing many millions of files. Of course it should have never happened - this was decent PowerEdge 2850 box with RAID volume, ECC memory and reliable CentOS 4.4 distribution but still it did. We had "journal failed" message in kernel log and filesystem needed to be checked and repaired even though it is journaling file system which should not need checks in normal use, even in case of power failures. Checking and repairing took many hours especially as automatic check on boot failed and had to be manually restarted." Another factor is just how "complicated" the filesystem is, and for example 'fsck' times with large numbers of hard links can be very bad (and there are quite a few use cases like 'rdiff-backup'). Also, what about the few numbers you mention above? The 2.4 hours for 1 billion files mean 110K inodes examined per second. Now 60TB probably means like 60 1TB drives to store 20TB of data, a pretty large degree of parallelism. T'so reports: http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/ which shows that on a single (laptop) drive an 800K inode/90GB 'ext4' filesystem could be checked in 63s or around 12K inodes/s per drive, not less than 2K. There seems to be a scalability problem -- but of course: one of the "unsolved research issue"s is that while read/write/etc. can be parallelized (for large files) by using wide RAIDs, it is not so easy to parallelize 'fsck' (except by using multiple mostly independent filesystems). [ ... ] > The use case for big file systems with lots of small files (at > least the one that I know of) is for object based file systems > where files usually have odd, non-humanly generated file names > (think guids with time stamps and digital signatures). > These are pretty trivial to map into the time based directory > scheme I mentioned before. And it is utterly absurd to do so (see below). > [ ... ] benchmarked both large DB instances and large file > systems. Good use cases exist for both, but the facts do not > back up your DB is the only solution proposal :-) Sure, large filesystems (to a point, which for me is the single digit TB range) with large files have their place, even if people seem to prefer metafilesystem like Lustre even for those, for good reasons. But the discussion is whether it makes sense, for a case like 1G records averaging about 7KB, to use a filesystem with 200K directories with each 5K files (or something similar) one file per record, or a database with a nice overall index and a single or a few files for all records. 
Your facts above show that it is *possible* to create a similar (1G x 20K records) filesystem, and that it seems to make a rather poor use of a very large storage system. The facts that I referred to in my original comment show that there is a VERY LARGE performance difference between using a filesystem as a (very) small-record database for just 1M records, and a PRETTY LARGE difference even for 6KB records, and that is while doing something stupid on the database side. In the end the facts just confirm the overall discussion that I referred to in my original comment: http://lists.gllug.org.uk/pipermail/gllug/2005-October/055445.html "* The size of the tree will be around 1M filesystem blocks on most filesystems, whose block size usually defaults to 4KiB, for a total of around 4GiB, or can be set as low as 512B, for a total of around 0.5GiB. * With 1,000,000 files and a fanout of 50, we need 20,000 directories above them, 400 above those and 8 above those. So 3 directory opens/reads every time a file has to be accessed, in addition to opening and reading the file. * Each file access will involve therefore four inode accesses and four filesystem block accesses, probably rather widely scattered. Depending on the size of the filesystem block and whether the inode is contiguous to the body of the file this can involve anything between 32KiB and 2KiB of logical IO per file access. * It is likely that of the logical IOs those relating to the two top levels (those comprising 8 and 400 directories) of the subtree will be avoided by caching between 200KiB and 1.6MiB, but the other two levels, the 20,000 bottom directories and the 1,000,000 leaf files, won't likely be cached." These are pretty elementary considerations, and boil down to the issue of whether for a given dataset of "small" records the best index structure is a tree of directories or a nicely balanced index tree, and whether the "small" records should be at most one per (4KiB usually) block or can share blocks, and there is little doubt that the latter wins pretty big. Your proposed directory based index "YEAR/MONTH/DAY/HOUR/MIN" seems to me particularly inane, as it has a *fixed fanout*, of 12 at the "MONTH" level, around 30 at the "DAY" level, 24 at the hour level, and 60 at the "MIN" level with no balancing. Fine if the record creation rate is constant. Perhaps not -- it involves 500K "MIN" directories per year. If we create 1G files per year we get around 2K files per "MIN" directory, each of which is then likely to be a few 4KiB blocks long. Fabulous :-). Sure, it is a *doable* structure, but it is not *reasonable*, especially if one knows the better alternative. Overall the data and arguments above suggest that: * Large filesystems (2 digits TB and more) usually should be avoided. * Filesystems with large numbers (more than a few millions) of files, even large files, should be avoided. * Large filesystems with a large number of small (around 4KiB) inodes (not just files) are utterly absurd, on their own merits, and even more so when compared with a database. * Two big issues are that while parallel storage scales up data performance, it does not do that well with metadata, and in particular metadata crawls such as 'fsck' are hard to parallelize (they are hard even when they in effect resolve just in mostly-linear scans). * If one *has* to have any of the above, separate filesystems, and/or filesystems based on a database-like design (e.g.
  based on indices throughout like HFS+ or Reiser3 or, to some
  degree, JFS and even XFS) may be the lesser evils, even if they
  have some limitations. But that is still fairly crazy; 'ar'
  files, for one thing, were invented decades ago precisely because
  lots of small files and filesystems are a bad combination.

These are conclusions well supported by experiment, data and simple
reasoning, as in the above. I should not have to explain these
pretty obvious points in detail -- that databases are much better
for large collections of small records is not exactly a recent
discovery. Sure, a lot of people "know better" and adopt what I
call the "syntactically valid" approach, where if a combination is
possible then it is fine. Good luck!

From pg_ext3 at ext3.for.sabi.co.UK Mon Sep 21 15:37:25 2009
From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi)
Date: Mon, 21 Sep 2009 16:37:25 +0100
Subject: Many small files, best practise.
In-Reply-To: <20090916222829.GQ2537@webber.adilger.int>
References: <2F3893D6F401F74695CE4AE1BA204E685C705E46C6@wfeex01pv.ad.fouredge.se>
	<4AA7AFA8.5040502@redhat.com>
	<19118.3970.628895.372996@tree.ty.sabi.co.uk>
	<4AAE2A4F.8010409@redhat.com>
	<19118.45290.900343.204958@tree.ty.sabi.co.uk>
	<20090916222829.GQ2537@webber.adilger.int>
Message-ID: <19127.40373.159150.385799@tree.ty.sabi.co.uk>

[ ... whether datasets like 1G records for a total of 7TB should be
stored as one-record-per-file in a filesystem or as a database ... ]

>>> When you deal with systems that store millions of files,

>> Millions of files may work; but 1 billion is an utter absurdity.
>> A filesystem that can store reasonably 1 billion small files in
>> 7TB is an unsolved research issue...

> I'd disagree. We have Lustre filesystems with 500M files on
> the ext4(ish) metadata server, and these are only 4TB. Note
> there is NO DATA in the metadata files, so it isn't quite like
> a normal filesystem.

That is possible, but to me seems quite unreasonable. How long does
that take to RSYNC, for example? To just backup? What about doing a
'find'? These are mad things. This is the special case of an MDS,
as you mention, but it is still fairly dangerous. Just like many
other similar choices (e.g. 19+1 RAID5 arrays), it works (not so
awesomely) as long as it works, and when it breaks it is very bad.

I like the Lustre idea, and to me it is currently the best of a not
very enthusing lot, but the MDT is by far the weakest bit, and the
``lots of tiny files'' idea is one of the big issues. In particular
the size of MDTs is a significant scalability issue with Lustre,
which was designed in older, gentler times for purposes to which
metadata scalability might not have been so essential. Like most
good ideas it has been scaled up beyond expectations (UNIX-style),
and perhaps it is reaching the end of its useful range.

Fortunately sensible Lustre people keep frequent and wholesale MDS
backups, and restoring a backup, even one of 500M 800-byte files,
is hopefully much faster than an 'fsck' if there is damage.

> It also depends on what you mean by "small files". We've
> previously discussed storing small file data in an extended
> attribute, and if you are tuning for this and the file size is
> small enough (3kB or less) the file data could be stored
> inside the inode (i.e. zero seek data IO).
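To put the quoted extended-attribute idea in more concrete terms
(this sketch is not from the thread; the file name and payload are
made up, and it assumes the attr tools are installed and the
filesystem is mounted with user_xattr -- with inodes made large
enough at mkfs time, e.g. 'mke2fs -I 512', such an EA can end up
inside the inode itself):

  #!/usr/bin/perl
  # Sketch: store a small payload as a user.* extended attribute on a
  # zero-length file via the setfattr/getfattr utilities, so the "data"
  # travels with the inode rather than in a separate data block.
  use strict;
  use warnings;

  my $file    = "record-0001";                       # made-up name
  my $payload = "a small record, well under 3kB";    # made-up content

  open my $fh, '>', $file or die "create $file: $!";
  close $fh;                                         # zero-length file

  system('setfattr', '-n', 'user.data', '-v', $payload, $file) == 0
      or die "setfattr failed";

  my $back = qx(getfattr --only-values -n user.data "$file");
  print "read back ", length($back), " bytes from the EA of $file\n";

Whether the attribute really lands inside the inode (the ``zero
seek data IO'' case) depends on the inode size and the amount of
other metadata, but the access pattern is the point: reading the
record back needs no separate data-block IO.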
If I were to use a filesystem as a makeshift database I would
indeed use one of those filesystems that store small files or file
tails in the metadata, as I wrote:

>> And for cases where a filesystem still makes sense I would
>> rather use, instead of the inane manylevel directory
>> structure above, a file system design with proper tree
>> indexes and perhaps even one with the ability to store
>> small files into inodes.

You might consider storing Lustre MDTs on Reiser3 instead of
'ldiskfs' :-).

But this is backwards; the database guys have spent the past
several decades working on the ``lots of small records reliably''
problem (and with "bushy" indices), and the main work by the file
system guys has been solving the ``massive, massively parallel
files'' one. To the point that people like Reiser, who did work
(with database-like techniques) on the small-files problem for
filesystems, have been at best ignored.

[ ... ]

> I think you aren't backing your comments with any facts.

You may think that -- but that's only because you think wrong, as
you haven't read my comments or you want to misrepresent them. I
made at the very start a clear example of a case with 1M small
files engendering a difference of more than 15 hours vs. 6 minutes
for just creation. For amusement I just reran it in a nicer form on
a somewhat faster system:

  base$ rm /fs/jugen/tmp/manysmall.db
  base$ time perl manymake.pl /fs/jugen/tmp/manysmall.db 1000000 50 100
  1 percent done, 990000 to go
  2 percent done, 980000 to go
  3 percent done, 970000 to go
  ....
  98 percent done, 20000 to go
  99 percent done, 10000 to go
  100 percent done, 0 to go

  real    0m48.209s
  user    0m6.240s
  sys     0m0.348s
  base$ ls -ld /fs/jugen/tmp/manysmall.db
  -rw------- 1 pcg pcg 98197504 Sep 21 16:19 /fs/jugen/tmp/manysmall.db

That's 1M records in roughly 100MB in less than a minute, or 20K
records/s, for around 1.5MB/s of record data, which is fairly
typical for random access to a fairly standard 1TB consumer drive
in its latter half.

  base$ sudo sysctl vm.drop_caches=1
  vm.drop_caches = 1
  base$ time perl manyseek.pl /fs/jugen/tmp/manysmall.db 1000000 10000
  1 percent done, 9900 to go
  2 percent done, 9800 to go
  3 percent done, 9700 to go
  ....
  98 percent done, 200 to go
  99 percent done, 100 to go
  100 percent done, 0 to go
  average length: 69.3816

  real    2m4.265s
  user    0m0.150s
  sys     0m0.126s

Seeking of course is not awesome, and we get 10K records in about
2 minutes, or around 80 records/s. Ah well. I need an SSD :-).

And as to the 'fsck', I confess that I had a list of cases in mind
but was waiting for the usual worn-out dodgy technique of quoting
undamaged-filesystem times:

> The e2fsck time on our MDS filesystems with 500M IN USE inodes
> is on the order of 4 hours (disk-based RAID-1+0 array). If
> this was on a RAID-1+0 SSD it could be noticeably faster. Ric
> also commented previously about single-digit hours for e2fsck
> on a test 1B file ext4 filesystem.

That is a classic "benchmark" -- undamaged-filesystem 'fsck' tests,
like the other favourite, freshly-loaded-filesystem benchmarks, are
just dodgy marketing tools. And even so! 1 hour per TB, or 1h per
~100M files. To me, keeping what may be a production filesystem
with 500M files unavailable for 4 hours because one occasionally
has to run 'fsck' (even if in fact there is no damage), with an
upside risk of weeks or months, sounds like not such a good idea.
But who knows.

There have been reports, sadly familiar to those who work as
sysadmins, of single-digit-TB filesystems taking weeks to months to
repair, if damaged.
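The manymake.pl and manyseek.pl scripts themselves are not included
in the message, so here is a purely illustrative sketch of the kind
of test reported above (slot size, key distribution and record
lengths are guesses, not the original code):

  #!/usr/bin/perl
  # Guess at a manymake.pl-style test: N records of random length are
  # written at random fixed-size slots in a single flat file, so that a
  # companion seek test can later find record K at offset K * SLOT.
  use strict;
  use warnings;

  my ($db, $nrec, $minlen, $maxlen) = @ARGV;
  die "usage: $0 DBFILE NRECORDS MINLEN MAXLEN\n" unless defined $maxlen;

  my $slot = 100;    # assumed fixed slot size per record

  open my $fh, '+>', $db or die "open $db: $!";
  for my $i (1 .. $nrec) {
      my $key = int rand $nrec;                              # random slot index
      my $len = $minlen + int rand($maxlen - $minlen + 1);   # e.g. 50..100 bytes
      seek $fh, $key * $slot, 0 or die "seek: $!";
      print {$fh} 'x' x $len;
      printf "%d percent done, %d to go\n", $i * 100 / $nrec, $nrec - $i
          if $nrec >= 100 && $i % int($nrec / 100) == 0;
  }
  close $fh or die "close: $!";

A companion seek test would then drop the caches, pick random
slots, seek to each and read the record back, which is what the
second, much slower, run above is measuring.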
Coming back to the 'fsck' point: the difference of course is
between scanning the metadata and crawling it, which is of course
perfectly obvious, as RAIDs allow for parallelizing read/write but
not easily scanning, and even less so crawls. Scaling 'fsck' is not
easy -- it is an unsolved research problem -- even if things like
Lustre help somewhat (minus the MDTs of course).

Since I am feeling a bit preachy, I'll mention some wider concepts
(mostly from the database guys) that should fit well in this
discussion:

* A "database" is defined as something including a dataset whose
  working set does not fit in memory (it thrashes -- every access
  involves at least one IO). There are several types of databases,
  structured/unstructured, factual/textual/...; a filesystem is a
  kind of database, as that definition applies. But to me, and to
  several decades of practice and theory, it is a database of
  record _containers_ (as suggested by the very word "file"), not
  of records. It is exceptionally hard to do a DBMS that handles
  records and record containers equally well.

* A "very large database" is a database that cannot be practically
  backed up (or checked) offline, as backup (or check) takes too
  long wrt requirements. Many filesystems are moving into the "very
  large database" category (can your customers accept that it might
  take 4 hours or 4 weeks to check, and 4 days to restore, their
  filesystem?). Storing small records (or even small containers) in
  a filesystem makes it much more likely that it becomes a "very
  large database", and while the technology for "very large
  database" DBMSes is mature, that for "very large database"
  file system designs is not there, or at least not as mature, even
  if the fun guys at Sun have been trying lately with ZFS.

* These are not novel or little-known concepts and experiences.
  'ar' files have been around for a long time, for some good
  reason.

From awk at google.com Wed Sep 23 22:59:02 2009
From: awk at google.com (Abhijit Karmarkar)
Date: Wed, 23 Sep 2009 22:59:02 -0000
Subject: jbd/kjournald oops on 2.6.30.1
Message-ID: <88cc3e770909231558u5109aca1u1409ba6877a6c8f@mail.gmail.com>

Hi,

I am getting the following Oops on the 2.6.30.1 kernel. The bad
part is that it happens rarely (twice in the last 1.5 months) and
the system is pretty lightly loaded when it happens (no heavy
file/disk I/O). Any insights or patches that I can try? (I searched
LKML and the ext3 lists but could not find any similar
oops/reports.)
== Oops ===================
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [] __journal_remove_journal_head+0x10/0x120
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/class/scsi_host/host0/proc_name
CPU 0
Pid: 3834, comm: kjournald Not tainted 2.6.30.1_test #1
RIP: 0010:[]  [] __journal_remove_journal_head+0x10/0x120
RSP: 0018:ffff880c7ee11d80  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000034
RDX: 0000000000000002 RSI: ffff8804ee82aa20 RDI: ffff8804ee82aa20
RBP: ffff880c7ee11d90 R08: 0400000000000000 R09: 0000000000000000
R10: ffffffff803706af R11: 0000000000000000 R12: ffff8808659bc198
R13: 0000000000000001 R14: ffff880bd435a980 R15: ffff880c7959d000
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kjournald (pid: 3834, threadinfo ffff880c7ee10000, task ffff880c794900c0)
Stack:
 ffff8804ee82aa20 ffff8808659bc198 ffff880c7ee11db0 ffffffff80374fd4
 ffff880c7ee11db0 ffff8804ee82aa20 ffff880c7ee11e90 ffffffff8037073d
 ffff880c7959d3a8 ffff880c7ee11e48 ffff880c7959d028 ffff880c7959d338
Call Trace:
 [] journal_remove_journal_head+0x24/0x50
 [] journal_commit_transaction+0x41d/0x1150
 [] ? try_to_del_timer_sync+0x5c/0x70
 [] kjournald+0xff/0x270
 [] ? autoremove_wake_function+0x0/0x40
 [] ? kjournald+0x0/0x270
 [] kthread+0x63/0x90
 [] child_rip+0xa/0x20
 [] ? kthread+0x0/0x90
 [] ? child_rip+0x0/0x20
Code: 1f 44 00 00 48 89 f8 48 8b 3d 7d 0d ca 00 48 89 c6 e8 85 35 f5 ff c9 c3 0f 1f 00 55 48 89 e5 41 54 53 0f 1f 44 00 00 48 8b 5f 40 <8b> 4b 08 85 c9 0f 88 f2 00 00 00 f0 ff 47 60 8b 53 08 85 d2 75
RIP  [] __journal_remove_journal_head+0x10/0x120
 RSP 
CR2: 0000000000000008
---[ end trace 2a47799c65258934 ]---

Looking at the disassembly of __journal_remove_journal_head():
==============
0xffffffff8037b760 <__journal_remove_journal_head+0>:   push   %rbp
0xffffffff8037b761 <__journal_remove_journal_head+1>:   mov    %rsp,%rbp
0xffffffff8037b764 <__journal_remove_journal_head+4>:   push   %r12
0xffffffff8037b766 <__journal_remove_journal_head+6>:   push   %rbx
0xffffffff8037b767 <__journal_remove_journal_head+7>:   callq  0xffffffff8020bcc0
0xffffffff8037b76c <__journal_remove_journal_head+12>:  mov    0x40(%rdi),%rbx
0xffffffff8037b770 <__journal_remove_journal_head+16>:  mov    0x8(%rbx),%r8d    <====== Oops
0xffffffff8037b774 <__journal_remove_journal_head+20>:  test   %r8d,%r8d
0xffffffff8037b777 <__journal_remove_journal_head+23>:  js     0xffffffff8037b86d <__journal_remove_journal_head+269>
0xffffffff8037b77d <__journal_remove_journal_head+29>:  lock incl 0x60(%rdi)
0xffffffff8037b781 <__journal_remove_journal_head+33>:  mov    0x8(%rbx),%esi
0xffffffff8037b784 <__journal_remove_journal_head+36>:  test   %esi,%esi
0xffffffff8037b786 <__journal_remove_journal_head+38>:  jne    0xffffffff8037b78f <__journal_remove_journal_head+47>
0xffffffff8037b788 <__journal_remove_journal_head+40>:  cmpq   $0x0,0x28(%rbx)
0xffffffff8037b78d <__journal_remove_journal_head+45>:  je     0xffffffff8037b794 <__journal_remove_journal_head+52>
.......
.......
==============
The oops seems to be due to a NULL journal head while evaluating
the J_ASSERT_JH() macro:
==============
static void __journal_remove_journal_head(struct buffer_head *bh)
{
        struct journal_head *jh = bh2jh(bh);

        J_ASSERT_JH(jh, jh->b_jcount >= 0);      <=== jh is NULL
        get_bh(bh);
        if (jh->b_jcount == 0) {
                if (jh->b_transaction == NULL &&
....
=============

Not sure why that would happen (corruption?).

A few system details:
================
- 64-bit, 2 quad-core (total 8 cores) Xeon, 48GB RAM
- Stock 2.6.30.1 kernel, *no* modules
- ext3 file-system (data=ordered mode) used over encrypted (dm-crypt) disks
- underlying storage: h/w RAID
- ext*/jbd config values:

CONFIG_EXT3_FS=y
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
CONFIG_EXT3_FS_XATTR=y
# CONFIG_EXT3_FS_POSIX_ACL is not set
# CONFIG_EXT3_FS_SECURITY is not set
CONFIG_EXT4_FS=y
# CONFIG_EXT4DEV_COMPAT is not set
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
===================

Let me know if you need any more details. Reproducing this (or
finding a good test to trigger it) is proving to be difficult :-(
It just sorta oopses once in a while ;-)

thanks
abhijit

ps: please Cc: me on the replies; I am not subscribed to either of
the lists -- thanks!