From vcaron at bearstech.com Tue Mar 12 23:56:15 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Wed, 13 Mar 2013 00:56:15 +0100
Subject: ext4 and extremely slow filesystem traversal
Message-ID: <513FC09F.4030708@bearstech.com>

Hello list,

  I have trouble with the daily backup of a modest filesystem which tends to take more than 10 hours. I have ext4 all over the place on ~200 servers and never ran into such a problem.

  The filesystem capacity is 300 GB (19.6M inodes) with 196 GB (9.3M inodes) used. It's mounted 'defaults,noatime'. It sits on a hardware RAID array through plain LVM slices. The RAID array is a RAID5 running on 5x SATA 500G disks, with a battery-backed (RAM) cache and a write-back cache policy. To be precise, it's an Areca 1231. The hardware RAID array uses 64kB stripes and I've configured the filesystem with 4kB blocks and stride=16. It also has 0 reserved blocks. In other words the fs was created with 'mkfs -t ext4 -E stride=16 -m 0 -L volname /dev/vgX/Y'. I'm attaching the mke2fs.conf for reference too.

  Everything is running with Debian Squeeze and its 2.6.32 kernel (amd64 flavour), on a server with 4 cores and 4 GB RAM. I ran a tiobench tonight on an idle instance (I have two identical systems - hw, sw, data - with exactly the same problem). I've attached the results as plain text to protect them from line wrapping. They look fine to me.

  When I try to back up the problematic filesystem with tar, rsync or whatever tool traversing the whole filesystem, things are awful. I know that this filesystem has *lots* of directories, most with few or no files in them. Tonight I ran a simple 'find /path/to/vol -type d |pv -bl' (counts directories as they are found); I stopped it more than 2 hours later: it was not done, and had already counted more than 2M directories. IO stats showed 1000 read calls/sec with avq=1 and avio=5 ms. CPU is at 2%, so it is totally I/O bound. This looks like the worst random read case to me. I even tried a hack which tries to sort directories while traversing the filesystem, to no avail.

  Right now I don't even know how to analyze my filesystem further. Sorry for not being able to describe it more accurately. I'm in search of any advice or direction to improve this situation, while keeping ext4 of course :).

  PS: I did ask the developers not to abuse the filesystem that way, and told them that in 2013 it's okay to have 10k+ files per directory... No success, so I guess I'll have to work around it.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: tiobench.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: mke2fs.conf
URL:

From tytso at mit.edu Wed Mar 13 02:52:26 2013
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 12 Mar 2013 22:52:26 -0400
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <513FC09F.4030708@bearstech.com>
References: <513FC09F.4030708@bearstech.com>
Message-ID: <20130313025226.GE16919@thunk.org>

On Wed, Mar 13, 2013 at 12:56:15AM +0100, Vincent Caron wrote:
> I even tried a hack which tries to sort directories while traversing the filesystem, to no avail.

Did you sort results from readdir() by inode number? i.e., such as what the following LD_PRELOAD hack does?

https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c?h=maint

> Right now I don't even know how to analyze my filesystem further. Sorry for not being able to describe it more accurately. I'm in search of any advice or direction to improve this situation, while keeping ext4 of course :).

Try running "e2fsck -fv /dev/XXX" and send me the output.

Also useful would be the output of "e2freefrag /dev/XXX" and "dumpe2fs -h"

- Ted
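The approach is small enough to sketch. The following is only an illustration (not the actual spd_readdir.c, and untested): it lists a single directory with scandir(), ordered by inode number, so that a caller which then stat()s or opens the entries hits the inode tables in roughly ascending on-disk order rather than in whatever order readdir() happens to return. Something like 'gcc -o sorted_ls sorted_ls.c' would build it; the name sorted_ls.c is made up.

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Compare two directory entries by inode number, so that entries are
 * handed back in roughly ascending on-disk inode-table order. */
static int by_inode(const struct dirent **a, const struct dirent **b)
{
    if ((*a)->d_ino < (*b)->d_ino) return -1;
    if ((*a)->d_ino > (*b)->d_ino) return 1;
    return 0;
}

int main(int argc, char **argv)
{
    struct dirent **names;
    int n, i;

    n = scandir(argc > 1 ? argv[1] : ".", &names, NULL, by_inode);
    if (n < 0) {
        perror("scandir");
        return 1;
    }
    for (i = 0; i < n; i++) {
        printf("%lu\t%s\n", (unsigned long)names[i]->d_ino, names[i]->d_name);
        free(names[i]);
    }
    free(names);
    return 0;
}

spd_readdir.c applies the same sorting transparently, by interposing on opendir()/readdir() via LD_PRELOAD, which is why it can be tried with an unmodified find as in the next message.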
From vcaron at bearstech.com Wed Mar 13 09:19:52 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Wed, 13 Mar 2013 10:19:52 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <20130313025226.GE16919@thunk.org>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org>
Message-ID: <514044B8.6000206@bearstech.com>

On 13/03/2013 03:52, Theodore Ts'o wrote:
> Did you sort results from readdir() by inode number? i.e., such as what the following LD_PRELOAD hack does?
>
> https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c?h=maint

  I don't think I tried this specific hack, I'm having a go right now. Is it still useful if each directory only holds a few inodes ?

>> Right now I don't even know how to analyze my filesystem further. Sorry for not being able to describe it more accurately. I'm in search of any advice or direction to improve this situation, while keeping ext4 of course :).
>
> Try running "e2fsck -fv /dev/XXX" and send me the output.
>
> Also useful would be the output of "e2freefrag /dev/XXX" and "dumpe2fs -h"

  Information attached. Dumpe2fs said: dumpe2fs 1.42.5 (29-Jul-2012).

  Thanks for your help !

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: e2freefrag.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: e2fsck-fv.txt
URL:

From vcaron at bearstech.com Wed Mar 13 11:59:53 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Wed, 13 Mar 2013 12:59:53 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <514044B8.6000206@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com>
Message-ID: <51406A39.8040508@bearstech.com>

On 13/03/2013 10:19, Vincent Caron wrote:
>> Did you sort results from readdir() by inode number? i.e., such as what the following LD_PRELOAD hack does?
>>
>> https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c?h=maint
> I don't think I tried this specific hack, I'm having a go right now. Is it still useful if each directory only holds a few inodes ?

  Same slowness, I ran :

filer:~# gcc -shared -fPIC -ldl -o spd_readdir.so spd_readdir.c
filer:~# LD_PRELOAD=./spd_readdir.so find /srv/vol -type d |pv -bl

  I stopped the experiment at +54min with 845k directories found (which gives roughly the same rate of 1M directories / hour, and I know there are more than 2M of them).

From vcaron at bearstech.com Wed Mar 13 20:49:20 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Wed, 13 Mar 2013 21:49:20 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <20130313203346.GJ5604@thunk.org>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com> <20130313203346.GJ5604@thunk.org>
Message-ID: <5140E650.5070900@bearstech.com>

On 13/03/2013 21:33, Theodore Ts'o wrote:
> Wow. You have more directories than regular files! Given that there are no hard links, that implies that you have at least 2,079,271 directories which are ***empty***.

  Awful, isn't it ?
I knew directories were abused, but didn't know that 'e2fsck -v' would display the exact figures (since I never waited 5+ hours to scan the whole filesystem). Nice to know.

> The inline data feature (which is still in testing and isn't something I can recommend for production use yet) is probably the best hope for you. But probably the best thing you can do is to harangue your developers to ask what the heck they are doing....

  Indeed, these filers are storing live and sensitive data and are conservatively running a stable OS and well-known kernels. Thanks for your advice, I'll actively work with the devs in order to refactor their filesystem layout.

From tytso at mit.edu Wed Mar 13 20:52:10 2013
From: tytso at mit.edu (Theodore Ts'o)
Date: Wed, 13 Mar 2013 16:52:10 -0400
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <5140E650.5070900@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com> <20130313203346.GJ5604@thunk.org> <5140E650.5070900@bearstech.com>
Message-ID: <20130313205210.GK5604@thunk.org>

On Wed, Mar 13, 2013 at 09:49:20PM +0100, Vincent Caron wrote:
> On 13/03/2013 21:33, Theodore Ts'o wrote:
>> Wow. You have more directories than regular files! Given that there are no hard links, that implies that you have at least 2,079,271 directories which are ***empty***.
>
> Awful, isn't it ? I knew directories were abused, but didn't know that 'e2fsck -v' would display the exact figures (since I never waited 5+ hours to scan the whole filesystem). Nice to know.

To be clear, that's at least two million directories assuming that all of the other directories have but a single file in them(!). In reality you probably have a lot more than 2 million empty directories....

- Ted

From tytso at mit.edu Wed Mar 13 20:33:46 2013
From: tytso at mit.edu (Theodore Ts'o)
Date: Wed, 13 Mar 2013 16:33:46 -0400
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <514044B8.6000206@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com>
Message-ID: <20130313203346.GJ5604@thunk.org>

On Wed, Mar 13, 2013 at 10:19:52AM +0100, Vincent Caron wrote:
>
>  3633315 regular files
>  5712586 directories
>        0 character device files
>        0 block device files
>        0 fifos
>        0 links
>        0 symbolic links (0 fast symbolic links)
>        0 sockets
> --------
>  9345901 files (really in-use inodes)

Wow. You have more directories than regular files! Given that there are no hard links, that implies that you have at least 2,079,271 directories which are ***empty***.

The inline data feature (which is still in testing and isn't something I can recommend for production use yet) is probably the best hope for you. But probably the best thing you can do is to harangue your developers to ask what the heck they are doing....

- Ted

From pg_ext3 at ext3.for.sabi.co.uk Wed Mar 13 21:29:32 2013
From: pg_ext3 at ext3.for.sabi.co.uk (Peter Grandi)
Date: Wed, 13 Mar 2013 21:29:32 +0000
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <513FC09F.4030708@bearstech.com>
References: <513FC09F.4030708@bearstech.com>
Message-ID: <20800.61372.638398.8301@tree.ty.sabi.co.uk>

> I have trouble with the daily backup of a modest filesystem which tends to take more than 10 hours. [ ... ] with 196 GB (9.3M inodes) used.

That is roughly 1M inodes/hour and 20GB/hour, or nearly 300 inodes/s and nearly 6MB/s.
These are very good numbers for high random IOPS loads, and as seen later, you have one.

> It's mounted 'defaults,noatime'.

That helps.

> It sits on a hardware RAID array through plain LVM slices.

That's the pointless default... But it does not particularly slow things down here.

> The RAID array is a RAID5 running on 5x SATA 500G disks, with a battery-backed (RAM) cache and a write-back cache policy. To be precise, it's an Areca 1231. The hardware RAID array uses 64kB stripes and I've configured the filesystem with 4kB blocks and stride=16.

The striping or alignment are not relevant on reads, but the stride matters a great deal as to metadata parallelism, and here it is set to 64KiB while the array stride is 16KiB (a 4-wide stripe of 64KiB). Since it is an integral multiple it should be about as good, and since the backup performance is pretty good, that seems to be the case.

> It also has 0 reserved blocks.

That's usually a truly terrible setting (20% is a much better value), but your filesystem is not very full anyhow.

> When I try to back up the problematic filesystem with tar, rsync or whatever tool traversing the whole filesystem, things are awful.

Rather, they are pretty good. Each 500GB SATA disk can usually do somewhat less than 100 random IOPS, there are 4 disks in each stripe when reading, and you are getting nearly 300 inodes/s and 5MB/s, quite close to the maximum. On random loads with smallish records typical rotating disks have transfer rates of 0.5MB to 1.5MB/s, and you are getting rather more than that (mostly thanks to the 20KiB average file size). You are getting pretty good delivery from 'ext4' and a very low random IOPS storage system on a highly randomized workload:

> I know that this filesystem has *lots* of directories, most with few or no files in them.

That's a really bad idea.

> Tonight I ran a simple 'find /path/to/vol -type d |pv -bl' (counts directories as they are found); I stopped it more than 2 hours later: it was not done, and had already counted more than 2M directories.

That's the usual 1M inodes/hour.

> [ ... ] I'm in search of any advice or direction to improve this situation, while keeping ext4 of course :).

Well, any system administrator would tell you the same: your backup workload and your storage system are mismatched, and the best solution is probably to use 146GB SAS 15K RPM disks for the same capacity (or more). Or perhaps recent enterprise-level SSDs.

The "small file" problem is ancient, and I call it the "mailstore" problem after its typical incarnation:

  http://www.sabi.co.uk/blog/12-thr.html#120429

> PS: I did ask the developers not to abuse the filesystem that way,

The "I use the filesystem as a DBMS" attitude is really very common among developers. It is cost-free to them, and backup (and system) administrators bear the cost when the filesystem fills up. Because at the beginning everything looks fine. Designing stuff that seems cheap and fast at the beginning, even if it becomes very bad after some time, is a good way to look like a winner in most organizations.

> and told them that in 2013 it's okay to have 10k+ files per directory...

It's not, it is a very bad idea. In 2013, just like in 1973 or in 1993, it is a much better idea to use simple indexed files to keep a collection of smallish records. Directories are a classification system, not a database indexing system. Here is an amusing report of the difference between the two:

  http://www.sabi.co.uk/blog/anno05-4th.html#051016
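To make "simple indexed files" concrete, any small key/value store will do; the sketch below uses GNU dbm purely as one example (the database name and key are made up, and this is an illustration rather than a recommendation of gdbm specifically). Millions of smallish records then live in one file with one index instead of in millions of directory entries. It would build with 'gcc -o records records.c -lgdbm'.

#include <gdbm.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Keep many smallish records in one indexed file instead of one file
 * (or one directory) per record. */
int main(void)
{
    GDBM_FILE db;
    datum key, val, out;
    char *name = "customer/1234/profile";      /* logical record name      */
    char *blob = "...smallish record data..."; /* would-be file contents   */

    db = gdbm_open("records.gdbm", 0, GDBM_WRCREAT, 0644, NULL);
    if (!db) {
        fprintf(stderr, "gdbm_open failed\n");
        return 1;
    }

    key.dptr = name;
    key.dsize = (int)strlen(name);
    val.dptr = blob;
    val.dsize = (int)strlen(blob);

    gdbm_store(db, key, val, GDBM_REPLACE);    /* one write, no new inode   */

    out = gdbm_fetch(db, key);                 /* one indexed lookup        */
    if (out.dptr) {
        printf("%.*s\n", out.dsize, out.dptr);
        free(out.dptr);
    }

    gdbm_close(db);
    return 0;
}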
> No success, so I guess I'll have to work around it.

As a backup administrator you can't do much better in your situation. You are already getting nearly the best performance for whole-tree scans of very random small records on a low random IOPS storage layer.

From vcaron at bearstech.com Wed Mar 13 21:50:13 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Wed, 13 Mar 2013 22:50:13 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <20800.61372.638398.8301@tree.ty.sabi.co.uk>
References: <513FC09F.4030708@bearstech.com> <20800.61372.638398.8301@tree.ty.sabi.co.uk>
Message-ID: <5140F495.2040407@bearstech.com>

On 13/03/2013 22:29, Peter Grandi wrote:
>> It also has 0 reserved blocks.
> That's usually a truly terrible setting (20% is a much better value), but your filesystem is not very full anyhow.

  This filesystem has no file owned by root and won't have any. I thought in this case -m0 would be a good idea.

  Thanks a lot for your detailed insight into the various performance figures; I hadn't done the proper math to realize that this inode reading rate was actually *good*. Fortunately the client is technically savvy, and pointing him at this mailing-list thread will help him make the right decision.

From tytso at mit.edu Thu Mar 14 01:05:51 2013
From: tytso at mit.edu (Theodore Ts'o)
Date: Wed, 13 Mar 2013 21:05:51 -0400
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <5140E650.5070900@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com> <20130313203346.GJ5604@thunk.org> <5140E650.5070900@bearstech.com>
Message-ID: <20130314010551.GA9962@thunk.org>

On Wed, Mar 13, 2013 at 09:49:20PM +0100, Vincent Caron wrote:
> On 13/03/2013 21:33, Theodore Ts'o wrote:
>> Wow. You have more directories than regular files! Given that there are no hard links, that implies that you have at least 2,079,271 directories which are ***empty***.
>
> Awful, isn't it ? I knew directories were abused, but didn't know that 'e2fsck -v' would display the exact figures (since I never waited 5+ hours to scan the whole filesystem). Nice to know.

Just as a note, e2fsck -v can sometimes get this information much more quickly than other alternatives, since it can scan the file system in inode order, instead of in essentially random order.

Just as an aside, if you just want to get a rough count of the number of directories, you can get that by grabbing the information out of dumpe2fs.

Group 624: (Blocks 20447232-20479999) [ITABLE_ZEROED]
  Checksum 0xd3f5, unused inodes 4821
  Block bitmap at 20447232 (+0), Inode bitmap at 20447248 (+16)
  Inode table at 20447264-20447775 (+32)
  24103 free blocks, 4821 free inodes, 435 directories, 4821 unused inodes
                                       ^^^^^^^^^^^^^^^
  Free blocks: 20455889, 20455898-20479999
  Free inodes: 5115180-5120000

Dumpe2fs doesn't actually sum the number of directories, and you won't be able to differentiate regular files from symlinks, device nodes, etc., but if you just want the number of directories, you can get it out of dumpe2fs without having to wait for e2fsck to complete. You can even do this with a mounted file system, but the number will of course not necessarily be completely accurate if you do that.
(You can get the number of inodes in use by subtracting the number of free inodes from the number of inodes in the file system. If you then subtract the number of directories, you get the number of non-directory inodes versus directory inodes.)

- Ted

From pg_ext3 at ext3.for.sabi.co.UK Thu Mar 14 20:57:49 2013
From: pg_ext3 at ext3.for.sabi.co.UK (Peter Grandi)
Date: Thu, 14 Mar 2013 20:57:49 +0000
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <5140F495.2040407@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20800.61372.638398.8301@tree.ty.sabi.co.uk> <5140F495.2040407@bearstech.com>
Message-ID: <20802.14797.748473.461517@tree.ty.sabi.co.uk>

>>> It also has 0 reserved blocks.
>> That's usually a truly terrible setting (20% is a much better value), but your filesystem is not very full anyhow.
> This filesystem has no file owned by root and won't have any. I thought in this case -m0 would be a good idea.

Why does it matter here that "no file owned by root"?

What has that got to do with the much greater difficulty of finding contiguous space the fuller the filetree is?

From vcaron at bearstech.com Thu Mar 14 22:56:47 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Thu, 14 Mar 2013 23:56:47 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <20802.14797.748473.461517@tree.ty.sabi.co.uk>
References: <513FC09F.4030708@bearstech.com> <20800.61372.638398.8301@tree.ty.sabi.co.uk> <5140F495.2040407@bearstech.com> <20802.14797.748473.461517@tree.ty.sabi.co.uk>
Message-ID: <514255AF.5040701@bearstech.com>

On 14/03/2013 21:57, Peter Grandi wrote:
>> This filesystem has no file owned by root and won't have any. I thought in this case -m0 would be a good idea.
> Why does it matter here that "no file owned by root"?
>
> What has that got to do with the much greater difficulty of finding contiguous space the fuller the filetree is?

  Because the man page says that reserved blocks are there to keep root-level daemons from misbehaving should unprivileged programs try to fill the disk. And uh, to avoid fragmentation; I missed that part.

  OTOH I monitor disk space and never let it go past 95% block usage without specific action (freeing inodes or enlarging the filesystem). Were I to use -m5 and oversize my filesystems (because I sell the capacity: say I sell 100GB, then I need a 105GB blockdev), I would still monitor the disk usage and take action before it's 100% filled up. But I'd end up reserving more blocks without more guarantees than in the -m0 case.

  So technically it looks wrong, but politically I'm not sure it's stupid. Or is it ?

From vcaron at bearstech.com Thu Mar 14 23:07:31 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Fri, 15 Mar 2013 00:07:31 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <20130314010551.GA9962@thunk.org>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com> <20130313203346.GJ5604@thunk.org> <5140E650.5070900@bearstech.com> <20130314010551.GA9962@thunk.org>
Message-ID: <51425833.9090708@bearstech.com>

On 14/03/2013 02:05, Theodore Ts'o wrote:
> Just as a note, e2fsck -v can sometimes get this information much more quickly than other alternatives, since it can scan the file system in inode order, instead of in essentially random order.
>
> Just as an aside, if you just want to get a rough count of the number of directories, you can get that by grabbing the information out of dumpe2fs.

  Very useful.
Global stats without having to scan the whole filesystem are very precious...

  I was wondering : couldn't we use dumpe2fs or something based on libext2fs to quickly extract a snapshot of all inodes from a given filesystem ? For incremental backups, simply checking the mtime on millions of inodes and discovering that only a handful of them were updated since the previous pass looks very inefficient with readdir()+lstat(). So many syscalls, so many spoonfed bits of information. When I had a peek, I thought I'd get a list of inodes but would not be able to link them back to their name(s) without inducing the same cost as a regular find-like filesystem traversal. Does it make sense ?

  AFAIK I would be better served with block-level snapshot solutions, but LVM snapshots are supposed to double your writes if I got it right, and I'm not sure there's something else in the Linux and free software world. Plus I'd love to not migrate away from my ext3/4's without a compelling reason. Btrfs is not (yet) an option and ZFS doesn't fit legally with Linux.

From adilger at dilger.ca Fri Mar 15 07:14:17 2013
From: adilger at dilger.ca (Andreas Dilger)
Date: Fri, 15 Mar 2013 00:14:17 -0700
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <51425833.9090708@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com> <20130313203346.GJ5604@thunk.org> <5140E650.5070900@bearstech.com> <20130314010551.GA9962@thunk.org> <51425833.9090708@bearstech.com>
Message-ID: <0BCD1099-4028-4328-9428-64112BE4ED56@dilger.ca>

On 2013-03-14, at 16:07, Vincent Caron wrote:
> On 14/03/2013 02:05, Theodore Ts'o wrote:
>> Just as a note, e2fsck -v can sometimes get this information much more quickly than other alternatives, since it can scan the file system in inode order, instead of in essentially random order.
>>
>> Just as an aside, if you just want to get a rough count of the number of directories, you can get that by grabbing the information out of dumpe2fs.
>
> Very useful. Global stats without having to scan the whole filesystem are very precious...
>
> I was wondering : couldn't we use dumpe2fs or something based on libext2fs to quickly extract a snapshot of all inodes from a given filesystem ? For incremental backups, simply checking the mtime on millions of inodes and discovering that only a handful of them were updated since the previous pass looks very inefficient with readdir()+lstat().

That's exactly what e2scan does. I'm pretty sure that is in upstream e2fsprogs now (not just our Lustre version), but I'm on a plane and cannot check.

It will scan the inode table directly and can generate the pathnames of files efficiently. It can filter on timestamps.

Cheers, Andreas

> So many syscalls, so many spoonfed bits of information. When I had a peek, I thought I'd get a list of inodes but would not be able to link them back to their name(s) without inducing the same cost as a regular find-like filesystem traversal. Does it make sense ?
>
> AFAIK I would be better served with block-level snapshot solutions, but LVM snapshots are supposed to double your writes if I got it right, and I'm not sure there's something else in the Linux and free software world. Plus I'd love to not migrate away from my ext3/4's without a compelling reason. Btrfs is not (yet) an option and ZFS doesn't fit legally with Linux.
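The kind of scan e2scan performs can also be sketched directly on top of libext2fs. The following is only a rough, untested illustration, not e2scan itself: it walks the inode tables in order and prints the numbers of in-use inodes whose mtime is newer than a cutoff, without mapping them back to pathnames (the part e2scan adds). It would be linked with -lext2fs -lcom_err; the file name scan_mtime.c and the usage are made up.

#include <stdio.h>
#include <stdlib.h>
#include <ext2fs/ext2fs.h>

/* Walk the inode tables in on-disk order (no directory traversal) and
 * print the numbers of in-use inodes modified after a cutoff time. */
int main(int argc, char **argv)
{
    ext2_filsys fs;
    ext2_inode_scan scan;
    ext2_ino_t ino;
    struct ext2_inode inode;
    unsigned long cutoff;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <cutoff-unix-time>\n", argv[0]);
        return 1;
    }
    cutoff = strtoul(argv[2], NULL, 10);

    if (ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs)) {
        fprintf(stderr, "cannot open %s\n", argv[1]);
        return 1;
    }
    if (ext2fs_open_inode_scan(fs, 0, &scan)) {
        fprintf(stderr, "cannot start inode scan\n");
        ext2fs_close(fs);
        return 1;
    }

    while (ext2fs_get_next_inode(scan, &ino, &inode) == 0 && ino != 0) {
        if (inode.i_links_count == 0)            /* unused or deleted inode */
            continue;
        if ((unsigned long)inode.i_mtime > cutoff)
            printf("%u\n", (unsigned int)ino);
    }

    ext2fs_close_inode_scan(scan);
    ext2fs_close(fs);
    return 0;
}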
From bothie at gmx.de Sat Mar 16 11:17:01 2013
From: bothie at gmx.de (Bodo Thiesen)
Date: Sat, 16 Mar 2013 12:17:01 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <51425833.9090708@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com> <20130313203346.GJ5604@thunk.org> <5140E650.5070900@bearstech.com> <20130314010551.GA9962@thunk.org> <51425833.9090708@bearstech.com>
Message-ID: <20130316121701.5c8386db@phenom>

* Vincent Caron wrote:
> AFAIK I would be better served with block-level snapshot solutions, but LVM snapshots are supposed to double your writes if I got it right, and I'm not sure there's something else in the Linux and free software world.

There is a simple and reliable solution for block level backups: dd

umount && dd if="our raid partition" of="some new big enough disc" && mount

and then wait for the data to go at 100MB/s or so to the new disc. Using snapshots is not a reliable way to do backups, since you would still have to trust the LVM code to be totally error-free and to protect your data under any circumstances (including hardware failures in your raid array etc).

For your actual problem: ask your developers to use some mapping scheme. When they want to access a file "filename", they calculate the md5sum of "filename", take the first 6 characters of the ASCII representation (here it would be 435ed7) and create a file called "43/5e/d7-X". This way you would end up with at most 65792 directories. The X is needed to distinguish between files whose md5sums share the same first 6 characters. So the first such file gets the name "43/5e/d7-1", the second one "43/5e/d7-2" and so on. Somewhere else they would then store the mapping table, mapping file "filename" to "43/5e/d7-2". All accesses go through this mapping table.

Regards, Bodo
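A sketch of that naming scheme, using OpenSSL's MD5() for the digest (any MD5 implementation would do) and leaving out both the "-X" collision suffix and the name-to-path mapping table, which the application keeps elsewhere; 'gcc -o bucket bucket.c -lcrypto' would build it, and the file name bucket.c is made up.

#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

/* Derive the "aa/bb/cc" prefix from a logical file name: the first 6 hex
 * characters of md5(name), with the first two pairs used as directory
 * levels and the third pair starting the on-disk file name. */
static void bucket_prefix(const char *name, char out[9])
{
    unsigned char md[MD5_DIGEST_LENGTH];
    char hex[7];

    MD5((const unsigned char *)name, strlen(name), md);
    snprintf(hex, sizeof(hex), "%02x%02x%02x", md[0], md[1], md[2]);
    snprintf(out, 9, "%.2s/%.2s/%.2s", hex, hex + 2, hex + 4);
}

int main(void)
{
    char prefix[9];

    bucket_prefix("filename", prefix);
    printf("%s\n", prefix);   /* the scheme above stores this record as
                                 "<prefix>-1", "<prefix>-2", and so on */
    return 0;
}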
From bothie at gmx.de Sat Mar 16 11:29:07 2013
From: bothie at gmx.de (Bodo Thiesen)
Date: Sat, 16 Mar 2013 12:29:07 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <514255AF.5040701@bearstech.com>
References: <513FC09F.4030708@bearstech.com> <20800.61372.638398.8301@tree.ty.sabi.co.uk> <5140F495.2040407@bearstech.com> <20802.14797.748473.461517@tree.ty.sabi.co.uk> <514255AF.5040701@bearstech.com>
Message-ID: <20130316122907.7343a0f0@phenom>

* Vincent Caron wrote:
> On 14/03/2013 21:57, Peter Grandi wrote:
>>> This filesystem has no file owned by root and won't have any. I thought in this case -m0 would be a good idea.
>> Why does it matter here that "no file owned by root"?
>>
>> What has that got to do with the much greater difficulty of finding contiguous space the fuller the filetree is?
>
> Because the man page says that reserved blocks are there to keep root-level daemons from misbehaving should unprivileged programs try to fill the disk. And uh, to avoid fragmentation; I missed that part.
>
> OTOH I monitor disk space and never let it go past 95% block usage without specific action (freeing inodes or enlarging the filesystem). Were I to use -m5 and oversize my filesystems (because I sell the capacity: say I sell 100GB, then I need a 105GB blockdev), I would still monitor the disk usage and take action before it's 100% filled up. But I'd end up reserving more blocks without more guarantees than in the -m0 case.
>
> So technically it looks wrong, but politically I'm not sure it's stupid. Or is it ?

Using -m technically makes the file system driver report ENOSPC when there is in fact still free space available. So, as long as you make sure that you have at least 5% free space at any given time, it doesn't matter whether you have -m0 or -m5. However, tools like df show the available capacity to user space, so 100GB with 95GB used will show 100% used with -m5 and 95% used with -m0. Effectively that means that if you go with -m5 and make sure df shows at least 5% free space, you end up with about 10% free all the time - reducing fragmentation that way - in theory.

Since you're misusing your file system as a database management system with many small files anyway, inter-file fragmentation is the least of your problems. So it's totally safe for you to stay with -m0.

Regards, Bodo

From vcaron at bearstech.com Sun Mar 24 22:40:56 2013
From: vcaron at bearstech.com (Vincent Caron)
Date: Sun, 24 Mar 2013 23:40:56 +0100
Subject: ext4 and extremely slow filesystem traversal
In-Reply-To: <0BCD1099-4028-4328-9428-64112BE4ED56@dilger.ca>
References: <513FC09F.4030708@bearstech.com> <20130313025226.GE16919@thunk.org> <514044B8.6000206@bearstech.com> <20130313203346.GJ5604@thunk.org> <5140E650.5070900@bearstech.com> <20130314010551.GA9962@thunk.org> <51425833.9090708@bearstech.com> <0BCD1099-4028-4328-9428-64112BE4ED56@dilger.ca>
Message-ID: <514F80F8.9050908@bearstech.com>

On 15/03/2013 08:14, Andreas Dilger wrote:
> That's exactly what e2scan does. I'm pretty sure that is in upstream e2fsprogs now (not just our Lustre version), but I'm on a plane and cannot check.
>
> It will scan the inode table directly and can generate the pathnames of files efficiently. It can filter on timestamps.

  I did not find it in git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git

  There are some references to it in the Lustre doc [1] and I could find a github repo with some code [2]. Would you know where the e2scan upstream lives ?

[1] http://wiki.lustre.org/manual/LustreManual20_HTML/SystemConfigurationUtilities_HTML.html#50438219_55923
[2] https://github.com/morrone/e2fsprogs

From aragonx at dcsnow.com Tue Mar 26 19:51:24 2013
From: aragonx at dcsnow.com (aragonx at dcsnow.com)
Date: Tue, 26 Mar 2013 19:51:24 -0000
Subject: e2freefrag says filesystem too large
Message-ID:

Can someone tell me if this will be fixed?

# e2freefrag /dev/sdl1
Device: /dev/sdl1
Blocksize: 4096 bytes
/dev/sdl1: Filesystem too large to use legacy bitmaps while reading block bitmap

# rpm -qa|grep e2fsprogs
e2fsprogs-libs-1.42.5-1.fc18.x86_64
e2fsprogs-1.42.5-1.fc18.x86_64

# df|grep /dev/sdl1
/dev/sdl1    35143869536 30265426892 4175317880 88% /mnt/backup

Thanks!

--- Will Y.

From tytso at mit.edu Wed Mar 27 02:14:43 2013
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 26 Mar 2013 22:14:43 -0400
Subject: e2freefrag says filesystem too large
In-Reply-To:
References:
Message-ID: <20130327021443.GA2697@thunk.org>

On Tue, Mar 26, 2013 at 07:51:24PM -0000, aragonx at dcsnow.com wrote:
> Can someone tell me if this will be fixed?
>
> # e2freefrag
> /dev/sdl1: Filesystem too large to use legacy bitmaps while reading block bitmap
>
> # rpm -qa|grep e2fsprogs
> e2fsprogs-1.42.5-1.fc18.x86_64

Yes, it's fixed in e2fsprogs 1.42.7. And if you're using 64-bit file systems, you really, really want to upgrade to 1.42.7 --- especially if you are ever thinking of using resize2fs; we fixed a number of very serious bugs, some of which could destroy file systems if you try to do off-line resizes. See the release notes for more details:

http://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.42.7

Regards,

- Ted