From smccauliff at mail.arc.nasa.gov Thu Aug 2 01:55:53 2007 From: smccauliff at mail.arc.nasa.gov (Sean McCauliff) Date: Wed, 01 Aug 2007 18:55:53 -0700 Subject: Poor Performance WhenNumber of Files > 1M Message-ID: <46B139A9.8070808@mail.arc.nasa.gov> Hi all, I plan on having about 100M files totaling about 8.5TiBytes. To see how ext3 would perform with large numbers of files I've written a test program which creates a configurable number of files into a configurable number of directories, reads from those files, lists them and then deletes them. Even up to 1M files ext3 seems to perform well and scale linearly; the time to execute the program on 1M files is about double the time it takes it to execute on .5M files. But past 1M files it seems to have n^2 scalability. Test details appear below. Looking at the various options for ext3 nothing jumps out as the obvious one to use to improve performance. Any recommendations? Thanks! Sean ------ Parameter one is number of files, parameter two is number of directories to write into. Dell MD3000 + LVM2. 2x7 10k rpm SAS disks 128k stripe RAID-0 for 3.8 TiBytes of total storage. Fedora Core 6 x86_64. 2xQuad Core Xeon. Default mount and ext3 options used.
[root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 10000 1000 real 0m1.054s user 0m0.128s sys 0m0.382s
[root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 1000000 1000 real 1m0.938s user 0m12.203s sys 0m40.358s
[root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 10000000 1000 real 13m39.881s user 2m6.645s sys 7m26.665s
[root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 20000000 1000 real 44m46.359s user 4m22.911s sys 17m2.792s
From davids at webmaster.com Thu Aug 2 04:42:28 2007 From: davids at webmaster.com (David Schwartz) Date: Wed, 1 Aug 2007 21:42:28 -0700 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <46B139A9.8070808@mail.arc.nasa.gov> Message-ID: > Hi all, > > I plan on having about 100M files totaling about 8.5TiBytes. To see > how ext3 would perform with large numbers of files I've written a test > program which creates a configurable number of files into a configurable > number of directories, reads from those files, lists them and then > deletes them. Even up to 1M files ext3 seems to perform well and scale > linearly; the time to execute the program on 1M files is about double > the time it takes it to execute on .5M files. But past 1M files it > seems to have n^2 scalability. Test details appear below. > > Looking at the various options for ext3 nothing jumps out as the obvious > one to use to improve performance. > > Any recommendations? If you want performance that's not O(n^2), the number of directory levels must go up one each time the order of magnitude of the number of files goes up. That is, the number of files per directory must be constant. Suppose you have a directory of N files. Locating each of the N files requires a lookup that scans an average of N/2 entries, so doing all N lookups is O(N*(N/2)), which is O(N^2). Add another level of directories each time you increase the number of files by a factor of 10. 
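As a rough illustration of that layout, here is a minimal bash sketch that hashes each file name into a fixed two-level directory tree. The /filestore path and the md5-based bucketing are made-up examples, not part of Sean's fileSystemTest.pl:

  #!/bin/bash
  # Store a file under /filestore/<aa>/<bb>/, where aa and bb are the first
  # four hex digits of the md5 of its name, giving 256*256 = 65536 buckets.
  name=$(basename "$1")
  hash=$(printf '%s' "$name" | md5sum | cut -c1-4)
  dir=/filestore/${hash:0:2}/${hash:2:2}
  mkdir -p "$dir"
  cp "$1" "$dir/$name"

With 100M files spread over 65536 buckets that works out to roughly 1500 files per directory, which stays well inside the range where the timings above scale linearly.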
DS From darkonc at gmail.com Thu Aug 2 05:11:29 2007 From: darkonc at gmail.com (Stephen Samuel) Date: Wed, 1 Aug 2007 22:11:29 -0700 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <46B139A9.8070808@mail.arc.nasa.gov> References: <46B139A9.8070808@mail.arc.nasa.gov> Message-ID: <6cd50f9f0708012211j3e065301nf6db714125d89538@mail.gmail.com> Searching for directories (to ensure no duplicates, etc) is going to be order N^2. Size of the directory is likely to be a limiting factor. Try increasing to 10000 directories (in two layors of 100 each). I'll bet you that the result will be a pretty good increase in speed (getting back to the speeds that you had with 1M directories). On 8/1/07, Sean McCauliff wrote: > Hi all, > > I plan on having about 100M files totaling about 8.5TiBytes. To see > how ext3 would perform with large numbers of files I've written a test > program which creates a configurable number of files into a configurable > number of directories, reads from those files, lists them and then > deletes them. Even up to 1M files ext3 seems to perform well and scale > linearly; the time to execute the program on 1M files is about double > the time it takes it to execute on .5M files. But past 1M files it > seems to have n^2 scalability. Test details appear below. > > Looking at the various options for ext3 nothing jumps out as the obvious > one to use to improve performance. > > Any recommendations? > -- Stephen Samuel http://www.bcgreen.com 778-861-7641 From ext3-users at harkless.org Thu Aug 2 16:01:34 2007 From: ext3-users at harkless.org (Dan Harkless) Date: Thu, 02 Aug 2007 09:01:34 -0700 Subject: "htree_dirblock_to_tree: bad entry in directory" error Message-ID: <200708021601.l72G1YxC029439@www.harkless.org> Hi. I woke up this morning to find a ton of waiting emails complaining that some cron jobs on my system couldn't run because one of my filesystems (ext3 on software RAID 1) was suddenly mounted read-only. Always nice when you're away from the server due to travel. ;^> I investigated in the logs and found: 2007-08-02 04:02:25 kern.crit www kernel: EXT3-fs error (device md2): htree_dirblock_to_tree: bad entry in directory #3616894: rec_len is too small for name_len - offset=103576, inode=3619715, rec_len=12, name_len=132 2007-08-02 04:02:25 kern.err www kernel: Aborting journal on device md2. 2007-08-02 04:02:25 kern.crit www kernel: ext3_abort called. 2007-08-02 04:02:25 kern.crit www kernel: EXT3-fs error (device md2): ext3_journal_start_sb: Detected aborted journal 2007-08-02 04:02:25 kern.crit www kernel: Remounting filesystem read-only I unmounted the filesystem and ran fsck, but though it detected that the filesystem had errors, it didn't report any findings during the check: fsck 1.35 (28-Feb-2004) e2fsck 1.35 (28-Feb-2004) /dev/md2: recovering journal /dev/md2 contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/md2: 200231/5248992 files (1.4% non-contiguous), 1563304/10492432 blocks I remounted the filesystem and all *seems* to be okay now. I was curious what "directory #3616894" (inode 3619715) was, so I did 'find / -inum 3619715 -exec ls -dioF {} \;', but the output showed that that was a non-directory file created and last modified in 2004. How could this be? And what would cause an error like the above? 
Am I out of the woods now, or is there more checking of some kind that I should do to make sure this isn't going to be happening again? Thank you for your time! -- Dan Harkless http://harkless.org/dan/ From ulrich.windl at rz.uni-regensburg.de Thu Aug 2 09:55:53 2007 From: ulrich.windl at rz.uni-regensburg.de (Ulrich Windl) Date: Thu, 02 Aug 2007 11:55:53 +0200 Subject: kernel: EXT3-fs: Unsupported filesystem blocksize 8192 on md0. Message-ID: <46B1C648.3202.1058EE1A@Ulrich.Windl.rkdvmks1.ngate.uni-regensburg.de> Hi, I made an ext3 filesystem with 8kB block size: # mkfs.ext3 -T largefile -v -b 8192 /dev/md0 Warning: blocksize 8192 not usable on most systems. mke2fs 1.38 (30-Jun-2005) Filesystem label= OS type: Linux Block size=8192 (log=3) Fragment size=8192 (log=3) 148480 inodes, 18940704 blocks 947035 blocks (5.00%) reserved for the super user First data block=0 290 block groups 65528 blocks per group, 65528 fragments per group 512 inodes per group Superblock backups stored on blocks: 65528, 196584, 327640, 458696, 589752, 1638200, 1769256, 3210872, 5307768, 8191000, 15923304 Writing inode tables: done Creating journal (32768 blocks): done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 22 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. When mounting it as ext2 (by mistake), the kernel says: EXT2-fs warning (device md0): ext2_fill_super: mounting ext3 filesystem as ext2 When finally mounting the filesystem as ext3, the kernel says: kernel: EXT3-fs: Unsupported filesystem blocksize 8192 on md0. I don't quite understand: About any larger filesystem of today can support blocks > 4kB. What is the problem here? Kernel is 2.6.16.46-0.12-default (SUSE SLES10 SP1) on IA64 Regards, Ulrich From adilger at clusterfs.com Fri Aug 3 16:49:42 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Fri, 3 Aug 2007 10:49:42 -0600 Subject: "htree_dirblock_to_tree: bad entry in directory" error In-Reply-To: <200708021601.l72G1YxC029439@www.harkless.org> References: <200708021601.l72G1YxC029439@www.harkless.org> Message-ID: <20070803164942.GP6142@schatzie.adilger.int> On Aug 02, 2007 09:01 -0700, Dan Harkless wrote: > 2007-08-02 04:02:25 kern.crit www kernel: EXT3-fs error (device md2): htree_dirblock_to_tree: bad entry in directory #3616894: rec_len is too small for name_len - offset=103576, inode=3619715, rec_len=12, name_len=132 > > I unmounted the filesystem and ran fsck, but though it detected that the > filesystem had errors, it didn't report any findings during the check: > > fsck 1.35 (28-Feb-2004) > e2fsck 1.35 (28-Feb-2004) > /dev/md2: recovering journal > /dev/md2 contains a file system with errors, check forced. > Pass 1: Checking inodes, blocks, and sizes > Pass 2: Checking directory structure > Pass 3: Checking directory connectivity > Pass 4: Checking reference counts > Pass 5: Checking group summary information > /dev/md2: 200231/5248992 files (1.4% non-contiguous), 1563304/10492432 blocks > > I remounted the filesystem and all *seems* to be okay now. I was curious > what "directory #3616894" (inode 3619715) was, so I did 'find / -inum > 3619715 -exec ls -dioF {} \;', but the output showed that that was a > non-directory file created and last modified in 2004. How could this be? Note that the DIRECTORY is 3616894, and the entry within that directory that was corrupted is 3619715. > And what would cause an error like the above? 
Am I out of the woods now, or > is there more checking of some kind that I should do to make sure this isn't > going to be happening again? Given that there is no corruption on disk, I would put this toward some kind of memory corruption. It might be a single-bit error though, because 12 = 0xc and 132 = 0x84 so if you clear bit 0x80 from the name_len (leaving a name_len = 4) it would be correct for a rec_len of 12. Is the filename 4 characters long? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From coyli at suse.de Fri Aug 3 17:54:48 2007 From: coyli at suse.de (Coly Li) Date: Sat, 04 Aug 2007 01:54:48 +0800 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <46B139A9.8070808@mail.arc.nasa.gov> References: <46B139A9.8070808@mail.arc.nasa.gov> Message-ID: How about the file size ? If the size is small, another performance kill should be on disk inode layout. Because the order of access dentries of dir is probably different from the order of allocating inodes in inode tables. This will make much time wast on hard disk seeking for the first time. Just FYI. Coly Sean McCauliff wrote: > Hi all, > > I plan on having about 100M files totaling about 8.5TiBytes. To see > how ext3 would perform with large numbers of files I've written a test > program which creates a configurable number of files into a configurable > number of directories, reads from those files, lists them and then > deletes them. Even up to 1M files ext3 seems to perform well and scale > linearly; the time to execute the program on 1M files is about double > the time it takes it to execute on .5M files. But past 1M files it > seems to have n^2 scalability. Test details appear below. > > Looking at the various options for ext3 nothing jumps out as the obvious > one to use to improve performance. > > Any recommendations? > > Thanks! > Sean > > ------ > Parameter one is number of files, parameter two is number of directories > to write into. > > Dell MD3000 + LVM2. 2x7 10k rpm SAS disks 128k stripe RAID-0 for 3.8 > TiBytes of total storage. Fedora Core 6 x86_64. 2xQuad Core Xeon. > Default mount and ext3 options used. > > [root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 10000 1000 > > real 0m1.054s > user 0m0.128s > sys 0m0.382s > > [root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 1000000 > 1000 > > real 1m0.938s > user 0m12.203s > sys 0m40.358s > [root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 10000000 > 1000 > > real 13m39.881s > user 2m6.645s > sys 7m26.665s > [root at galaxy filestore]# time /soc/abyss/test/fileSystemTest.pl 20000000 > 1000 > > real 44m46.359s > user 4m22.911s > sys 17m2.792s From smccauliff at mail.arc.nasa.gov Fri Aug 3 18:15:37 2007 From: smccauliff at mail.arc.nasa.gov (Sean McCauliff) Date: Fri, 03 Aug 2007 11:15:37 -0700 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <6cd50f9f0708012211j3e065301nf6db714125d89538@mail.gmail.com> References: <46B139A9.8070808@mail.arc.nasa.gov> <6cd50f9f0708012211j3e065301nf6db714125d89538@mail.gmail.com> Message-ID: <46B370C9.8080801@mail.arc.nasa.gov> It seems like my code has a bug where it would create ndirs = nfiles / nFilesPerDir. So it was making more directories. I thought that with dir_index option directory entry look up would be more like O(1) so that scale up should be completely linear. Multi level directory schemes seem to degrade performance more. 
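Since dir_index keeps coming up in this thread, here is a hedged sketch of how to confirm the htree index is actually in effect and how to index directories that already exist; /dev/sdXn is a placeholder and the filesystem must be unmounted for the e2fsck step:

  dumpe2fs -h /dev/sdXn | grep -i 'features'   # 'dir_index' should appear in the list
  tune2fs -O dir_index /dev/sdXn               # enable the feature if it is missing
  e2fsck -fD /dev/sdXn                         # -D rebuilds/optimizes existing directories

Note that tune2fs -O dir_index only affects directories created afterwards; it is the e2fsck -D pass that builds indexes for directories that already exist.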
Sean Stephen Samuel wrote: > Searching for directories (to ensure no duplicates, etc) is going to > be order N^2. > > Size of the directory is likely to be a limiting factor. > > Try increasing to 10000 directories (in two layors of 100 each). I'll > bet you that the result will be a pretty good increase in speed > (getting back to the speeds that you had with 1M directories). > > > On 8/1/07, Sean McCauliff wrote: >> Hi all, >> >> I plan on having about 100M files totaling about 8.5TiBytes. To see >> how ext3 would perform with large numbers of files I've written a test >> program which creates a configurable number of files into a configurable >> number of directories, reads from those files, lists them and then >> deletes them. Even up to 1M files ext3 seems to perform well and scale >> linearly; the time to execute the program on 1M files is about double >> the time it takes it to execute on .5M files. But past 1M files it >> seems to have n^2 scalability. Test details appear below. >> >> Looking at the various options for ext3 nothing jumps out as the obvious >> one to use to improve performance. >> >> Any recommendations? >> From coyli at suse.de Sat Aug 4 03:03:01 2007 From: coyli at suse.de (Coly Li) Date: Sat, 04 Aug 2007 11:03:01 +0800 Subject: kernel: EXT3-fs: Unsupported filesystem blocksize 8192 on md0. In-Reply-To: <46B1C648.3202.1058EE1A@Ulrich.Windl.rkdvmks1.ngate.uni-regensburg.de> References: <46B1C648.3202.1058EE1A@Ulrich.Windl.rkdvmks1.ngate.uni-regensburg.de> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 In ext3, it is better to make the filesystem block not large than page size. On x86 the typical page size is 4KB. Coly Ulrich Windl wrote: > Hi, > > I made an ext3 filesystem with 8kB block size: > # mkfs.ext3 -T largefile -v -b 8192 /dev/md0 > Warning: blocksize 8192 not usable on most systems. > mke2fs 1.38 (30-Jun-2005) > Filesystem label= > OS type: Linux > Block size=8192 (log=3) > Fragment size=8192 (log=3) > 148480 inodes, 18940704 blocks > 947035 blocks (5.00%) reserved for the super user > First data block=0 > 290 block groups > 65528 blocks per group, 65528 fragments per group > 512 inodes per group > Superblock backups stored on blocks: > 65528, 196584, 327640, 458696, 589752, 1638200, 1769256, 3210872, > 5307768, 8191000, 15923304 > > Writing inode tables: done > Creating journal (32768 blocks): done > Writing superblocks and filesystem accounting information: done > > This filesystem will be automatically checked every 22 mounts or > 180 days, whichever comes first. Use tune2fs -c or -i to override. > > When mounting it as ext2 (by mistake), the kernel says: > EXT2-fs warning (device md0): ext2_fill_super: mounting ext3 filesystem as ext2 > > When finally mounting the filesystem as ext3, the kernel says: > kernel: EXT3-fs: Unsupported filesystem blocksize 8192 on md0. > > I don't quite understand: About any larger filesystem of today can support blocks >> 4kB. What is the problem here? 
> > Kernel is 2.6.16.46-0.12-default (SUSE SLES10 SP1) on IA64 > > Regards, > Ulrich -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org iD8DBQFGs+xOuTp8cyZ5lTERAt73AKC9oxc+H1zEH3A/iaSNAst8qYA/KACglDB4 nNKY6K8+cDbknOwIBECNvFQ= =h4EV -----END PGP SIGNATURE----- From rsalmon74 at gmail.com Wed Aug 8 20:21:05 2007 From: rsalmon74 at gmail.com (Rene Salmon) Date: Wed, 8 Aug 2007 15:21:05 -0500 Subject: Poor ext3 performance on RAID array Message-ID: Hi list, I am having some strange performance issues with ext3 and I am hoping to get some advice/hints on how to make this better. First some background on the setup. We have a RAID 5 array 10+1+1 with one LUN. That is 10 SATA drives one parity drive and one dedicated spare. The LUN is about 6.5TB. Using a 2Gbit/sec fiber channel card I can do some dd writes to the raw device and get speeds close to 200Mbytes/sec which is more or less the max the card can do. Next I create an xfs file system on the LUN and do a dd to xfs and get speeds close to 150Mbytes/sec. I want to use ext3 not xfs so next I put ext3 on the lun. Now when I do the dd to the ext3 lun I get 25-50Mbytes/sec depending on whether I have the Raid controller cache turned on or off. Getting 50MBytes/sec with the raid controller cache turned off. I know that ext3 should perform better so I must be doing something wrong. Here is my mkfs.ext3 mkfs.ext3 -b4096 -Tlagefile4 /dev/dm-0 Thanks in advanced for any help on this. Rene -------------- next part -------------- An HTML attachment was scrubbed... URL: From Ionel.Gardais at tech-advantage.com Wed Aug 8 20:31:40 2007 From: Ionel.Gardais at tech-advantage.com (GARDAIS Ionel) Date: Wed, 8 Aug 2007 22:31:40 +0200 Subject: =?iso-8859-1?q?RE=A0=3A_Poor_ext3_performance_on_RAID_array?= References: Message-ID: Hi Rene, You should try to add the "-E stride=X" option to the mkfs command line. Where X is expalined in the man page. This will basically map ext3 "blocks" on the RAID stripe size. Ionel -------- Message d'origine-------- De: ext3-users-bounces at redhat.com de la part de Rene Salmon Date: mer. 08/08/2007 22:21 ?: ext3-users at redhat.com Objet : Poor ext3 performance on RAID array Hi list, I am having some strange performance issues with ext3 and I am hoping to get some advice/hints on how to make this better. First some background on the setup. We have a RAID 5 array 10+1+1 with one LUN. That is 10 SATA drives one parity drive and one dedicated spare. The LUN is about 6.5TB. Using a 2Gbit/sec fiber channel card I can do some dd writes to the raw device and get speeds close to 200Mbytes/sec which is more or less the max the card can do. Next I create an xfs file system on the LUN and do a dd to xfs and get speeds close to 150Mbytes/sec. I want to use ext3 not xfs so next I put ext3 on the lun. Now when I do the dd to the ext3 lun I get 25-50Mbytes/sec depending on whether I have the Raid controller cache turned on or off. Getting 50MBytes/sec with the raid controller cache turned off. I know that ext3 should perform better so I must be doing something wrong. Here is my mkfs.ext3 mkfs.ext3 -b4096 -Tlagefile4 /dev/dm-0 Thanks in advanced for any help on this. Rene -------------- next part -------------- An HTML attachment was scrubbed... URL: From rsalmon74 at gmail.com Wed Aug 8 20:53:43 2007 From: rsalmon74 at gmail.com (Rene Salmon) Date: Wed, 8 Aug 2007 15:53:43 -0500 Subject: Poor ext3 performance on RAID array Message-ID: Hi, Thanks for the reply. 
I tried using the -E stride=X option as follows: mkfs.ext3 -b4096 -Tlargefile4 -E stride=640 /dev/dm-1 and got the same results around 50 MBytes/sec. Maybe I have the wrong number for stride so here is the math for that: stride=stripe-size Configure the filesystem for a RAID array with stripe-size filesystem blocks per stripe. Here is some more detail on the RAID array. RAID level : 5 (10 drives + 1 parity) Chunk Size : 256 KB Stripe Size : 2560 KB (10 drives * 256KB) stride=640 * 4096(byte blocks) = 2560KB I will try other stride options but they don't seem to change much. Thanks Rene On 8/8/07, GARDAIS Ionel wrote: > > Hi Rene, > > You should try to add the "-E stride=X" option to the mkfs command line. > Where X is expalined in the man page. > > This will basically map ext3 "blocks" on the RAID stripe size. > > Ionel > > > -------- Message d'origine-------- > De: ext3-users-bounces at redhat.com de la part de Rene Salmon > Date: mer. 08/08/2007 22:21 > ?: ext3-users at redhat.com > Objet : Poor ext3 performance on RAID array > > Hi list, > > > I am having some strange performance issues with ext3 and I am hoping to > get some advice/hints on how to make this better. First some background > on the setup. > > We have a RAID 5 array 10+1+1 with one LUN. That is 10 SATA drives one > parity drive and one dedicated spare. The LUN is about 6.5TB. > > Using a 2Gbit/sec fiber channel card I can do some dd writes to the raw > device and get speeds close to 200Mbytes/sec which is more or less the > max the card can do. > > Next I create an xfs file system on the LUN and do a dd to xfs and get > speeds close to 150Mbytes/sec. > > I want to use ext3 not xfs so next I put ext3 on the lun. Now when I do > the dd to the ext3 lun I get 25-50Mbytes/sec depending on whether I have > the Raid controller cache turned on or off. Getting 50MBytes/sec with > the raid controller cache turned off. > > I know that ext3 should perform better so I must be doing something > wrong. Here is my mkfs.ext3 > > mkfs.ext3 -b4096 -Tlagefile4 /dev/dm-0 > > Thanks in advanced for any help on this. > > Rene > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Ionel.Gardais at tech-advantage.com Wed Aug 8 20:58:33 2007 From: Ionel.Gardais at tech-advantage.com (GARDAIS Ionel) Date: Wed, 8 Aug 2007 22:58:33 +0200 Subject: =?iso-8859-1?q?RE=A0=3A_RE_=3A_Poor_ext3_performance_on_RAID_arr?= =?iso-8859-1?q?ay?= References: Message-ID: stride is the chunk size of your raid whatever the number of physical disks composing the RAID array. So for 256k chunks with a block size of 4k, stride should be 256/4 = 64 instead of 640. Maybe it will help. Ionel -------- Message d'origine-------- De: Rene Salmon [mailto:rsalmon74 at gmail.com] Date: mer. 08/08/2007 22:53 ?: GARDAIS Ionel Cc: ext3-users at redhat.com Objet : Re: RE : Poor ext3 performance on RAID array Hi, Thanks for the reply. I tried using the -E stride=X option as follows: mkfs.ext3 -b4096 -Tlargefile4 -E stride=640 /dev/dm-1 and got the same results around 50 MBytes/sec. Maybe I have the wrong number for stride so here is the math for that: stride=stripe-size Configure the filesystem for a RAID array with stripe-size filesystem blocks per stripe. Here is some more detail on the RAID array. RAID level : 5 (10 drives + 1 parity) Chunk Size : 256 KB Stripe Size : 2560 KB (10 drives * 256KB) stride=640 * 4096(byte blocks) = 2560KB I will try other stride options but they don't seem to change much. 
Thanks Rene On 8/8/07, GARDAIS Ionel wrote: > > Hi Rene, > > You should try to add the "-E stride=X" option to the mkfs command line. > Where X is expalined in the man page. > > This will basically map ext3 "blocks" on the RAID stripe size. > > Ionel > > > -------- Message d'origine-------- > De: ext3-users-bounces at redhat.com de la part de Rene Salmon > Date: mer. 08/08/2007 22:21 > ?: ext3-users at redhat.com > Objet : Poor ext3 performance on RAID array > > Hi list, > > > I am having some strange performance issues with ext3 and I am hoping to > get some advice/hints on how to make this better. First some background > on the setup. > > We have a RAID 5 array 10+1+1 with one LUN. That is 10 SATA drives one > parity drive and one dedicated spare. The LUN is about 6.5TB. > > Using a 2Gbit/sec fiber channel card I can do some dd writes to the raw > device and get speeds close to 200Mbytes/sec which is more or less the > max the card can do. > > Next I create an xfs file system on the LUN and do a dd to xfs and get > speeds close to 150Mbytes/sec. > > I want to use ext3 not xfs so next I put ext3 on the lun. Now when I do > the dd to the ext3 lun I get 25-50Mbytes/sec depending on whether I have > the Raid controller cache turned on or off. Getting 50MBytes/sec with > the raid controller cache turned off. > > I know that ext3 should perform better so I must be doing something > wrong. Here is my mkfs.ext3 > > mkfs.ext3 -b4096 -Tlagefile4 /dev/dm-0 > > Thanks in advanced for any help on this. > > Rene > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rsalmon74 at gmail.com Wed Aug 8 21:31:14 2007 From: rsalmon74 at gmail.com (Rene Salmon) Date: Wed, 8 Aug 2007 16:31:14 -0500 Subject: Poor ext3 performance on RAID array Message-ID: Hi, I think I found part of the problem. Our RAID array vendor explained to me that they have a problem of not getting enough data from ext3 to do full stripe writes when ext3 issues an cache flush command. Basically something to do with the journaling and cache flushing. To test this I replaced ext3 with ext2 and now I get the expected results: mkfs.ext2 -Tlargefile4 -b4096 /dev/dm-1 dd if=/dev/zero of=/mnt/test bs=1024K count=1024 conv=fsync 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB) copied, 6.85545 seconds, 157 MB/s So now I think I will just try to just have ext3 journal to a different device. See if that helps any. Thanks Rene On 8/8/07, GARDAIS Ionel wrote: > > stride is the chunk size of your raid whatever the number of physical > disks composing the RAID array. > So for 256k chunks with a block size of 4k, stride should be 256/4 = 64 > instead of 640. > > Maybe it will help. > > Ionel > > > -------- Message d'origine-------- > De: Rene Salmon [mailto:rsalmon74 at gmail.com ] > Date: mer. 08/08/2007 22:53 > ?: GARDAIS Ionel > Cc: ext3-users at redhat.com > Objet : Re: RE : Poor ext3 performance on RAID array > > Hi, > > Thanks for the reply. I tried using the -E stride=X option as follows: > > mkfs.ext3 -b4096 -Tlargefile4 -E stride=640 /dev/dm-1 > > and got the same results around 50 MBytes/sec. Maybe I have the wrong > number for stride so here is the math for that: > > stride=stripe-size > Configure the filesystem for a RAID array > with > stripe-size filesystem blocks per stripe. > > > Here is some more detail on the RAID array. 
> > RAID level : 5 (10 drives + 1 parity) > Chunk Size : 256 KB > Stripe Size : 2560 KB (10 drives * 256KB) > > stride=640 * 4096(byte blocks) = 2560KB > > I will try other stride options but they don't seem to change much. > > Thanks > Rene > > > > > On 8/8/07, GARDAIS Ionel wrote: > > > > Hi Rene, > > > > You should try to add the "-E stride=X" option to the mkfs command line. > > Where X is expalined in the man page. > > > > This will basically map ext3 "blocks" on the RAID stripe size. > > > > Ionel > > > > > > -------- Message d'origine-------- > > De: ext3-users-bounces at redhat.com de la part de Rene Salmon > > Date: mer. 08/08/2007 22:21 > > ?: ext3-users at redhat.com > > Objet : Poor ext3 performance on RAID array > > > > Hi list, > > > > > > I am having some strange performance issues with ext3 and I am hoping to > > get some advice/hints on how to make this better. First some background > > on the setup. > > > > We have a RAID 5 array 10+1+1 with one LUN. That is 10 SATA drives one > > parity drive and one dedicated spare. The LUN is about 6.5TB. > > > > Using a 2Gbit/sec fiber channel card I can do some dd writes to the raw > > device and get speeds close to 200Mbytes/sec which is more or less the > > max the card can do. > > > > Next I create an xfs file system on the LUN and do a dd to xfs and get > > speeds close to 150Mbytes/sec. > > > > I want to use ext3 not xfs so next I put ext3 on the lun. Now when I do > > the dd to the ext3 lun I get 25-50Mbytes/sec depending on whether I have > > the Raid controller cache turned on or off. Getting 50MBytes/sec with > > the raid controller cache turned off. > > > > I know that ext3 should perform better so I must be doing something > > wrong. Here is my mkfs.ext3 > > > > mkfs.ext3 -b4096 -Tlagefile4 /dev/dm-0 > > > > Thanks in advanced for any help on this. > > > > Rene > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adilger at clusterfs.com Wed Aug 8 23:21:21 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Wed, 8 Aug 2007 17:21:21 -0600 Subject: Poor ext3 performance on RAID array In-Reply-To: References: Message-ID: <20070808232121.GW6689@schatzie.adilger.int> On Aug 08, 2007 15:21 -0500, Rene Salmon wrote: > I am having some strange performance issues with ext3 and I am hoping to > get some advice/hints on how to make this better. First some background > on the setup. > > We have a RAID 5 array 10+1+1 with one LUN. That is 10 SATA drives one > parity drive and one dedicated spare. The LUN is about 6.5TB. > > Using a 2Gbit/sec fiber channel card I can do some dd writes to the raw > device and get speeds close to 200Mbytes/sec which is more or less the > max the card can do. > > Next I create an xfs file system on the LUN and do a dd to xfs and get > speeds close to 150Mbytes/sec. > > I want to use ext3 not xfs so next I put ext3 on the lun. Now when I do > the dd to the ext3 lun I get 25-50Mbytes/sec depending on whether I have > the Raid controller cache turned on or off. Getting 50MBytes/sec with > the raid controller cache turned off. > > I know that ext3 should perform better so I must be doing something > wrong. Here is my mkfs.ext3 > > mkfs.ext3 -b4096 -Tlagefile4 /dev/dm-0 You could also try out ext4, that's where the real performance improvements are... Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. 
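Pulling the suggestions in this thread together, a possible next experiment would be to match the stride to the 256 KB chunk size and move the journal to a separate device, roughly along these lines. The device names are placeholders and this is only a sketch, not a tested recipe; stride=64 follows Ionel's 256 KB chunk / 4 KB block arithmetic:

  mke2fs -b 4096 -O journal_dev /dev/sdc1      # small dedicated journal device, same block size as the fs
  mkfs.ext3 -b 4096 -T largefile4 -E stride=64 -J device=/dev/sdc1 /dev/dm-1
  mount -t ext3 -o data=writeback /dev/dm-1 /mnt   # optional: journal metadata only

data=writeback trades away ext3's ordered-data guarantees for speed, so whether it is acceptable depends on how much you care about file contents after a crash.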
From bruno at wolff.to Thu Aug 9 03:52:41 2007 From: bruno at wolff.to (Bruno Wolff III) Date: Wed, 8 Aug 2007 22:52:41 -0500 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <46B139A9.8070808@mail.arc.nasa.gov> References: <46B139A9.8070808@mail.arc.nasa.gov> Message-ID: <20070809035241.GB26169@wolff.to> On Wed, Aug 01, 2007 at 18:55:53 -0700, Sean McCauliff wrote: > Hi all, > > I plan on having about 100M files totaling about 8.5TiBytes. To see > how ext3 would perform with large numbers of files I've written a test > program which creates a configurable number of files into a configurable > number of directories, reads from those files, lists them and then > deletes them. Even up to 1M files ext3 seems to perform well and scale > linearly; the time to execute the program on 1M files is about double > the time it takes it to execute on .5M files. But past 1M files it > seems to have n^2 scalability. Test details appear below. > > Looking at the various options for ext3 nothing jumps out as the obvious > one to use to improve performance. > > Any recommendations? Did you make sure directory indexing is available? I think that is the default now for ext3, but maybe it wasn't turned on for your test. From smccauliff at mail.arc.nasa.gov Thu Aug 9 19:02:31 2007 From: smccauliff at mail.arc.nasa.gov (Sean McCauliff) Date: Thu, 09 Aug 2007 12:02:31 -0700 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <20070809035241.GB26169@wolff.to> References: <46B139A9.8070808@mail.arc.nasa.gov> <20070809035241.GB26169@wolff.to> Message-ID: <46BB64C7.7090802@mail.arc.nasa.gov> dumpe2fs reports that dir_index option is enabled. But thank you for the suggestion. Sean Bruno Wolff III wrote: > On Wed, Aug 01, 2007 at 18:55:53 -0700, > Sean McCauliff wrote: >> Hi all, >> >> I plan on having about 100M files totaling about 8.5TiBytes. To see >> how ext3 would perform with large numbers of files I've written a test >> program which creates a configurable number of files into a configurable >> number of directories, reads from those files, lists them and then >> deletes them. Even up to 1M files ext3 seems to perform well and scale >> linearly; the time to execute the program on 1M files is about double >> the time it takes it to execute on .5M files. But past 1M files it >> seems to have n^2 scalability. Test details appear below. >> >> Looking at the various options for ext3 nothing jumps out as the obvious >> one to use to improve performance. >> >> Any recommendations? > > Did you make sure directory indexing is available? I think that is the > default now for ext3, but maybe it wasn't turned on for your test. > From adilger at clusterfs.com Thu Aug 9 19:51:55 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Thu, 9 Aug 2007 13:51:55 -0600 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <46BB64C7.7090802@mail.arc.nasa.gov> References: <46B139A9.8070808@mail.arc.nasa.gov> <20070809035241.GB26169@wolff.to> <46BB64C7.7090802@mail.arc.nasa.gov> Message-ID: <20070809195155.GZ6689@schatzie.adilger.int> Sean McCauliff wrote: >I plan on having about 100M files totaling about 8.5TiBytes. To see >how ext3 would perform with large numbers of files I've written a test >program which creates a configurable number of files into a configurable >number of directories, reads from those files, lists them and then >deletes them. 
Even up to 1M files ext3 seems to perform well and scale >linearly; the time to execute the program on 1M files is about double >the time it takes it to execute on .5M files. But past 1M files it >seems to have n^2 scalability. Test details appear below. > >Looking at the various options for ext3 nothing jumps out as the obvious >one to use to improve performance. Try increasing your journal size (mke2fs -J size=400), and having a lot of RAM. When you say "having about 100M files", does that mean "need to be constantly accessing 100M files" or just "need to store a total of 100M files in this filesystem"? The former means you need to keep the whole working set in RAM for maximum performance, about 100M * (128 + 32) = 19GB of RAM. The latter is no problem, we have ext3 filesystems with > 250M files in them. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From smccauliff at mail.arc.nasa.gov Thu Aug 9 20:04:03 2007 From: smccauliff at mail.arc.nasa.gov (Sean McCauliff) Date: Thu, 09 Aug 2007 13:04:03 -0700 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <20070809195155.GZ6689@schatzie.adilger.int> References: <46B139A9.8070808@mail.arc.nasa.gov> <20070809035241.GB26169@wolff.to> <46BB64C7.7090802@mail.arc.nasa.gov> <20070809195155.GZ6689@schatzie.adilger.int> Message-ID: <46BB7333.3010602@mail.arc.nasa.gov> > Try increasing your journal size (mke2fs -J size=400), and having a lot > of RAM. > > When you say "having about 100M files", does that mean "need to be > constantly accessing 100M files" or just "need to store a total of > 100M files in this filesystem"? Likely only 10M will be accessed at any time. > > The former means you need to keep the whole working set in RAM for > maximum performance, about 100M * (128 + 32) = 19GB of RAM. The > latter is no problem, we have ext3 filesystems with > 250M files > in them. The system has 16G of RAM; getting 32G in the future is a possibility. Where do you get 128 + 32 from? Is 128 the inode size? this is running a 64bit os. Does that change the memory requirements? Thanks, I will try your suggestion. Sean From adilger at clusterfs.com Thu Aug 9 22:12:49 2007 From: adilger at clusterfs.com (Andreas Dilger) Date: Thu, 9 Aug 2007 16:12:49 -0600 Subject: Poor Performance WhenNumber of Files > 1M In-Reply-To: <46BB7333.3010602@mail.arc.nasa.gov> References: <46B139A9.8070808@mail.arc.nasa.gov> <20070809035241.GB26169@wolff.to> <46BB64C7.7090802@mail.arc.nasa.gov> <20070809195155.GZ6689@schatzie.adilger.int> <46BB7333.3010602@mail.arc.nasa.gov> Message-ID: <20070809221249.GC6689@schatzie.adilger.int> On Aug 09, 2007 13:04 -0700, Sean McCauliff wrote: > >When you say "having about 100M files", does that mean "need to be > >constantly accessing 100M files" or just "need to store a total of > >100M files in this filesystem"? > Likely only 10M will be accessed at any time. If you can structure it so the 10M files that will be accessed together are stored to disk together, then your application will work better, no matter what the filesystem. > >The former means you need to keep the whole working set in RAM for > >maximum performance, about 100M * (128 + 32) = 19GB of RAM. The > >latter is no problem, we have ext3 filesystems with > 250M files > >in them. > The system has 16G of RAM; getting 32G in the future is a possibility. > Where do you get 128 + 32 from? Is 128 the inode size? this is > running a 64bit os. Does that change the memory requirements? 128 = inode size, 32 = directory entry size. 
there will be other overhead as well, but this will get you into the right ballpark. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. From rd at powerset.com Tue Aug 14 18:08:28 2007 From: rd at powerset.com (Ryan Dooley) Date: Tue, 14 Aug 2007 11:08:28 -0700 Subject: unlink performance Message-ID: <84E2AE771361E9419DD0EFBD31F09C4D48356CF8B0@EXVMBX015-1.exch015.msoutlookonline.net> Greetings, I'm looking for clues about speeding up unlink performance. I have an ext3 file system (well, across several machines) that have a temp directory. The application needs to clean up the temp directory before the next run of the application begins and the engineers want to clean out this temp directory "quickly". I have no hard numbers yet on what they are seeing but the contents of the directory involves "many files" and "many directories" of various sizes. The file system in question is mounted with noatime,nodiratime with the following filesystem features: has_journal, resize_inode, dir_index, filetype, needs_recovery, sparse_super and large_file The operating system is Fedora Core 6 with fedora's 2.6.20-1.2933 kernel. The mounted file system is about 1.2TB in size and is a software raid-5 over four, 7200 rpm SATA disks. The disks were formatted with all the defaults. Pre-loading the file system cache (a la "find /path/to/temp -type f -print >/dev/null") followed by an "rm -rf /path/to/temp" seems to be pretty speedy to me. Any other suggestions of things I can experiment with to build performance numbers? Cheers, Ryan From mnalis-ml at voyager.hr Tue Aug 14 23:43:21 2007 From: mnalis-ml at voyager.hr (Matija Nalis) Date: Wed, 15 Aug 2007 01:43:21 +0200 Subject: unlink performance In-Reply-To: <84E2AE771361E9419DD0EFBD31F09C4D48356CF8B0@EXVMBX015-1.exch015.msoutlookonline.net> References: <84E2AE771361E9419DD0EFBD31F09C4D48356CF8B0@EXVMBX015-1.exch015.msoutlookonline.net> Message-ID: <20070814234321.GA4179@eagle102.home.lan> On Tue, Aug 14, 2007 at 11:08:28AM -0700, Ryan Dooley wrote: > I'm looking for clues about speeding up unlink performance. I have an > ext3 file system (well, across several machines) that have a temp > directory. The application needs to clean up the temp directory before > the next run of the application begins and the engineers want to clean out > this temp directory "quickly". Depending on your usage (and not dependant of specifics of filesystem), you might be able to get away with: mv /path/to/temp /path/to/temp.old mkdir /path/to/temp rm -rf /path/to/temp.old & as a workaround. in that way, the new run which fills the '/path/to/temp' can commence practically without any delay even while old data is still being removed, without any clash. > The file system in question is mounted with noatime,nodiratime with the > following filesystem features: > > has_journal, resize_inode, dir_index, filetype, needs_recovery, sparse_super and large_file What is the journal size ? (echo 'stat <8>' | debugfs /dev/md0) You can try increasing it. How much RAM is in the machine ? > The operating system is Fedora Core 6 with fedora's 2.6.20-1.2933 kernel. > The mounted file system is about 1.2TB in size and is a software raid-5 > over four, 7200 rpm SATA disks. The disks were formatted with all the > defaults. RAID5 would NOT be fastest choice for writes (including deletes), especially depending on the stripe size, and is dependant on "mke2fs -E stride" option. 
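If the journal does turn out to be small, one way to grow it on an unmounted, cleanly checked filesystem (reusing the /dev/md0 name from the debugfs example above as a placeholder) is:

  tune2fs -l /dev/md0 | grep -i journal    # confirm has_journal is set
  tune2fs -O ^has_journal /dev/md0         # remove the existing journal
  tune2fs -J size=400 /dev/md0             # recreate it at 400 MB

The filesystem must be cleanly unmounted before removing the journal, otherwise tune2fs will refuse and ask for e2fsck first.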
> Pre-loading the file system cache (a la "find /path/to/temp -type f -print > >/dev/null") followed by an "rm -rf /path/to/temp" seems to be pretty > speedy to me. Then there is no problem, yes ? :) -- Opinions above are GNU-copylefted. From nigel.metheringham at dev.intechnology.co.uk Wed Aug 15 08:13:27 2007 From: nigel.metheringham at dev.intechnology.co.uk (Nigel Metheringham) Date: Wed, 15 Aug 2007 09:13:27 +0100 Subject: unlink performance In-Reply-To: <84E2AE771361E9419DD0EFBD31F09C4D48356CF8B0@EXVMBX015-1.exch015.msoutlookonline.net> References: <84E2AE771361E9419DD0EFBD31F09C4D48356CF8B0@EXVMBX015-1.exch015.msoutlookonline.net> Message-ID: <2D70CC67-EE54-484A-96E0-51C6CF1B6590@dev.intechnology.co.uk> On 14 Aug 2007, at 19:08, Ryan Dooley wrote: > Pre-loading the file system cache (a la "find /path/to/temp -type f > -print >/dev/null") followed by an "rm -rf /path/to/temp" seems to > be pretty speedy to me. Do you mean that:- find /path/to/temp -type f -print >/dev/null rm -rf /path/to/temp is faster than just rm -rf /path/to/temp or do you mean that you have arranged to do the find before the point where you want to delete? If the former, that surprises me somewhat. Nigel. -- [ Nigel Metheringham Nigel.Metheringham at InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ] From rd at powerset.com Wed Aug 15 17:12:31 2007 From: rd at powerset.com (Ryan Dooley) Date: Wed, 15 Aug 2007 10:12:31 -0700 Subject: unlink performance In-Reply-To: <2D70CC67-EE54-484A-96E0-51C6CF1B6590@dev.intechnology.co.uk> References: <84E2AE771361E9419DD0EFBD31F09C4D48356CF8B0@EXVMBX015-1.exch015.msoutlookonline.net> <2D70CC67-EE54-484A-96E0-51C6CF1B6590@dev.intechnology.co.uk> Message-ID: <84E2AE771361E9419DD0EFBD31F09C4D48356CFA4D@EXVMBX015-1.exch015.msoutlookonline.net> Actually I was curious if it would make any difference at all. The only two "benchmarks" (really they are not) that I have are two different attempts: find /path/to/temp -type f -exec rm {} \; This in about 18.39 seconds. rm -rf /path/to/temp That finished in 16.43 seconds so you're assumption that the find didn't actually help anything is true. What I don't have handy is how big those temp directories were or how many files were included (and what size those files were). Now probably wondering why I can't wait 18-20 seconds but this was just a small test case. The data set will be much larger in normal cases. Cheers, Ryan -----Original Message----- From: Nigel Metheringham [mailto:nigel.metheringham at dev.intechnology.co.uk] Sent: Wednesday, August 15, 2007 1:13 AM To: Ryan Dooley Cc: ext3-users at redhat.com Subject: Re: unlink performance On 14 Aug 2007, at 19:08, Ryan Dooley wrote: > Pre-loading the file system cache (a la "find /path/to/temp -type f > -print >/dev/null") followed by an "rm -rf /path/to/temp" seems to > be pretty speedy to me. Do you mean that:- find /path/to/temp -type f -print >/dev/null rm -rf /path/to/temp is faster than just rm -rf /path/to/temp or do you mean that you have arranged to do the find before the point where you want to delete? If the former, that surprises me somewhat. Nigel. 
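For what it's worth, one way to answer that empirically is to time both variants on throwaway copies with the page cache dropped in between; the paths below are placeholders, and writing to /proc/sys/vm/drop_caches needs root on a 2.6.16 or later kernel:

  cp -a /path/to/temp /path/to/temp.a && cp -a /path/to/temp /path/to/temp.b
  sync; echo 3 > /proc/sys/vm/drop_caches
  time rm -rf /path/to/temp.a
  sync; echo 3 > /proc/sys/vm/drop_caches
  time sh -c 'find /path/to/temp.b -type f -print > /dev/null; rm -rf /path/to/temp.b'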
-- [ Nigel Metheringham Nigel.Metheringham at InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ] From darkonc at gmail.com Thu Aug 16 14:10:22 2007 From: darkonc at gmail.com (Stephen Samuel) Date: Thu, 16 Aug 2007 07:10:22 -0700 Subject: unlink performance In-Reply-To: <84E2AE771361E9419DD0EFBD31F09C4D48356CFA4D@EXVMBX015-1.exch015.msoutlookonline.net> References: <84E2AE771361E9419DD0EFBD31F09C4D48356CF8B0@EXVMBX015-1.exch015.msoutlookonline.net> <2D70CC67-EE54-484A-96E0-51C6CF1B6590@dev.intechnology.co.uk> <84E2AE771361E9419DD0EFBD31F09C4D48356CFA4D@EXVMBX015-1.exch015.msoutlookonline.net> Message-ID: <6cd50f9f0708160710o1bfbfee4ob97251e59f12d7b9@mail.gmail.com> I think that the mv / mkdir is going to be your best move... If you can't move the whole directory, try: mkdir /tmp2 mv /tmp/{*,.??*} /tmp2 (restart the application) rm -rf /tmp2/* /tmp2/.??* (it depends, of course, on the two directories being on the same partition). That approach especially works if the bulk of your files are in sub-directories You'll only be doing work on the inodes directly in the main directory, and then you can take your time deleting the subdirectories from /tmp2 while your app runs. I'm looking for clues about speeding up unlink performance. I have an ext3 > file system (well, across several machines) that have a temp directory. The > application needs to clean up the temp directory before the next run of the > application begins and the engineers want to clean out this temp directory > "quickly". > -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rd at powerset.com Mon Aug 20 15:11:20 2007 From: rd at powerset.com (Ryan Dooley) Date: Mon, 20 Aug 2007 08:11:20 -0700 Subject: unlink performance In-Reply-To: <6cd50f9f0708160710o1bfbfee4ob97251e59f12d7b9@mail.gmail.com> Message-ID: On 8/16/07 7:10 AM, "Stephen Samuel" wrote: I think that the mv / mkdir is going to be your best move... If you can't move the whole directory, try: mkdir /tmp2 mv /tmp/{*,.??*} /tmp2 (restart the application) rm -rf /tmp2/* /tmp2/.??* (it depends, of course, on the two directories being on the same partition). That approach especially works if the bulk of your files are in sub-directories You'll only be doing work on the inodes directly in the main directory, and then you can take your time deleting the subdirectories from /tmp2 while your app runs. I'll give that a shot and see what happens. Thanks! Cheers, Ryan -------------- next part -------------- An HTML attachment was scrubbed... URL: From beevis at libero.it Tue Aug 28 13:29:46 2007 From: beevis at libero.it (beevis at libero.it) Date: Tue, 28 Aug 2007 15:29:46 +0200 Subject: Reserved space Message-ID: Hello list, I have a doubt. When creating an ext3 fs, 5% of its space is reserved to the superuser. I understand this should be important for /, maybe /var and /tmp. But is it compulsory for other fs, like, say, an external disk with data? Or it's just a heritage, no more needed? Could one safely reclaim this 5%? I understand that no other fs (jfs, xfs, reiser) reserves some space. Thanks for clarification From tytso at mit.edu Tue Aug 28 14:28:42 2007 From: tytso at mit.edu (Theodore Tso) Date: Tue, 28 Aug 2007 10:28:42 -0400 Subject: Reserved space In-Reply-To: References: Message-ID: <20070828142842.GC31120@thunk.org> On Tue, Aug 28, 2007 at 03:29:46PM +0200, beevis at libero.it wrote: > Hello list, > I have a doubt. 
When creating an ext3 fs, 5% of its space is reserved to the superuser. > I understand this should be important for /, maybe /var and /tmp. > But is it compulsory for other fs, like, say, an external disk with data? > Or it's just a heritage, no more needed? Could one safely reclaim this 5%? > I understand that no other fs (jfs, xfs, reiser) reserves some space. You can, but the performance of the filesystem will go down as you use the last 5%, especially if the filesystem is dynamic and constantly changing, since it will cause the files to become very badly fragmented. UFS historically used 10% for its anti-fragmentation reserve. With ext3 we decreased it to 5%. If the filesystem is going to be essentially static after you fill it up, sure you can reduce it down to 0%. But if the filesystem is going to be continuously active, you will get better performance by buying a bigger hard drive and using a filesystem with an average utilization of 50-80% than one which is hovering between 99-100%. Aside from spending 100-200 Euro's on extra memory, speading 100-200 Euro's on a newer, bigger hard drive can be one of the easist and cheapest way to improve the performance of your system. Regards, - Ted From beevis at libero.it Tue Aug 28 18:49:37 2007 From: beevis at libero.it (beevis at libero.it) Date: Tue, 28 Aug 2007 20:49:37 +0200 Subject: Reserved space Message-ID: > Hello list, > I have a doubt. When creating an ext3 fs, 5% of its space is reserved to the superuser. > I understand this should be important for /, maybe /var and /tmp. > But is it compulsory for other fs, like, say, an external disk with data? > Or it's just a heritage, no more needed? Could one safely reclaim this 5%? > I understand that no other fs (jfs, xfs, reiser) reserves some space. You can, but the performance of the filesystem will go down as you use the last 5%, especially if the filesystem is dynamic and constantly changing, since it will cause the files to become very badly fragmented. UFS historically used 10% for its anti-fragmentation reserve. With ext3 we decreased it to 5%. If the filesystem is going to be essentially static after you fill it up, sure you can reduce it down to 0%. But if the filesystem is going to be continuously active, you will get better performance by buying a bigger hard drive and using a filesystem with an average utilization of 50-80% than one which is hovering between 99-100%. Aside from spending 100-200 Euro's on extra memory, speading 100-200 Euro's on a newer, bigger hard drive can be one of the easist and cheapest way to improve the performance of your system. Regards, - Ted Thanks for clarification. So I understand this 5% is reserved in order to prevent fragmentation. From tytso at mit.edu Tue Aug 28 21:59:44 2007 From: tytso at mit.edu (Theodore Tso) Date: Tue, 28 Aug 2007 17:59:44 -0400 Subject: Reserved space In-Reply-To: References: Message-ID: <20070828215944.GD31120@thunk.org> On Tue, Aug 28, 2007 at 08:49:37PM +0200, beevis at libero.it wrote: > Thanks for clarification. So I understand this 5% is reserved in > order to prevent fragmentation. If you want to be pedantic, to allow ext3's anti-fragmentation algorithsm to work more efficiently. It will not (by a long shot!) completely remove fragmentation, but rather, fragmentation will increase as you use last 5-10% of the filesystem. We arbitrarily set 5% as the reserve; UFS (as used in Solaris, BSD, and many other historical Unix systems) set the reserve at 10%. 
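For anyone who does decide to trim the reserve, it can be adjusted at any time without reformatting; /dev/sdXn below is a placeholder:

  tune2fs -m 1 /dev/sdXn      # shrink the reserved area to 1% of the blocks
  tune2fs -r 0 /dev/sdXn      # or set an exact reserved-block count (0 removes it entirely)
  mkfs.ext3 -m 0 /dev/sdXn    # or set the percentage at mkfs time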
- Ted From jpiszcz at lucidpixels.com Tue Aug 28 22:45:56 2007 From: jpiszcz at lucidpixels.com (Justin Piszcz) Date: Tue, 28 Aug 2007 18:45:56 -0400 (EDT) Subject: Reserved space In-Reply-To: <20070828215944.GD31120@thunk.org> References: <20070828215944.GD31120@thunk.org> Message-ID: On Tue, 28 Aug 2007, Theodore Tso wrote: > On Tue, Aug 28, 2007 at 08:49:37PM +0200, beevis at libero.it wrote: >> Thanks for clarification. So I understand this 5% is reserved in >> order to prevent fragmentation. > > If you want to be pedantic, to allow ext3's anti-fragmentation > algorithsm to work more efficiently. It will not (by a long shot!) > completely remove fragmentation, but rather, fragmentation will > increase as you use last 5-10% of the filesystem. We arbitrarily set > 5% as the reserve; UFS (as used in Solaris, BSD, and many other > historical Unix systems) set the reserve at 10%. > > - Ted > > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > Does anyone have any metrics as to how bad it gets for 1% 2% 3% 4% and 5% reservations? The difference between 1% and 5% can be a lot with 1TB drives etc. Justin. From raghuprasath at yahoo.com Wed Aug 29 10:02:06 2007 From: raghuprasath at yahoo.com (Kannan Raghuprasath) Date: Wed, 29 Aug 2007 03:02:06 -0700 (PDT) Subject: ext3-fs error with RAID 5 Array. Message-ID: <564417.24036.qm@web34711.mail.mud.yahoo.com> Hi, I have a Fedora core 4 machine (kernel- 2.6.11-1.1369_FC4smp) conneted to external DAS using Ultra 320 SCSI controller card. The DAS is configured as RAID 5 of 3.5 TB. I partitioned this array into two partitions of size 1.9TB and 1.6 TB. These are assigned to 2 different LUNS so that these two appear as two partitions in the my Linux machine. I use this DAS for backup and after backing up for 30 hours i observe that partition gets remounted as read-only with following error message in dmesg, EXT3-fs error (device sdb1): ext3_add_entry: bad entry in directory #2: directory entry across blocks - offset=1080, inode=135216, rec_len=4132, name_len=25 Aborting journal on device sdb1. ext3_abort called. EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only EXT3-fs error (device sdb1) in start_transaction: Journal has aborted EXT3-fs error (device sdb1) in ext3_create: IO failure EXT3-fs error (device sdb1): ext3_readdir: bad entry in directory #2: directory entry across blocks - offset=1080, inode=135216, rec_len=4132, name_len=25 EXT3-fs error (device sdb1): ext3_readdir: bad entry in directory #2: directory entry across blocks - offset=1080, inode=135216, rec_len=4132, name_len=25 EXT3-fs error (device sdb1): ext3_readdir: bad entry in directory #2: directory entry across blocks - offset=1080, inode=135216, rec_len=4132, name_len=25 EXT3-fs error (device sdb1): ext3_readdir: bad entry in directory #2: directory entry across blocks - offset=1080, inode=135216, rec_len=4132, name_len=25 EXT3-fs error (device sdb1): ext3_readdir: bad entry in directory #2: directory entry across blocks - offset=1080, inode=135216, rec_len=4132, name_len=25 Any suggestions on how to solve this problem? Best regards and thanks in advance for your help, Raghu ____________________________________________________________________________________ Building a website is a piece of cake. Yahoo! Small Business gives you all the tools to get online. http://smallbusiness.yahoo.com/webhosting