From tt at it-austria.net  Fri Aug 1 09:43:40 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Fri, 01 Aug 2008 11:43:40 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
Message-ID: <1217583820.12454.20.camel@kannnix.a2x.lan.at>

Hello,

I have a problem with directories that contain more than 10000 entries
(Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
readdir(3) or readdir64(3) you get one entry twice, with same name and
inode.

Some analysis showed that disabling dir_index solves this problem, but I
think that this is a bug in the ext3 code, as no other filesystem shows
this behavior.

I've found the following regarding this bug, but nothing about whether it
is fixed or whether a back-port for older 2.6 kernels exists.

and

On linux-fsdevel I've found the following, but they delete directory
entries in between multiple readdir calls.

Does anyone know where I could find more information or report this bug?

Thanks in advance!

Regards.

Tom Trauner

From tytso at mit.edu  Fri Aug 1 12:16:58 2008
From: tytso at mit.edu (Theodore Tso)
Date: Fri, 1 Aug 2008 08:16:58 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
Message-ID: <20080801121658.GG8736@mit.edu>

On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
>
> I have a problem with directories that contain more than 10000 entries
> (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> readdir(3) or readdir64(3) you get one entry twice, with same name and
> inode.
>

How reproducible is this; can you reproduce it on this one filesystem?
Can you reproduce it on multiple filesystems?  What sort of file names
are you using?

Also, are you testing by using "ls", or do you have your own program
getting the names of the files?  If the latter, are you using
telldir()/seekdir() in any way?

						- Ted

From tt at it-austria.net  Fri Aug 1 14:00:31 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Fri, 01 Aug 2008 16:00:31 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080801121658.GG8736@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
Message-ID: <1217599231.14552.13.camel@kannnix.a2x.lan.at>

On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> >
> > I have a problem with directories that contain more than 10000 entries
> > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > inode.
> >
>
> How reproducible is this; can you reproduce it on this one filesystem?
> Can you reproduce it on multiple filesystems?  What sort of file names
> are you using?

Every time I tried. It is reproducible on the same filesystem, and also on
other systems with different filesystem sizes and usage patterns. It showed
up when one of our own scripts working through a Subversion directory
failed.

File names are numbers, starting with "0" counting up.

> Also, are you testing by using "ls", or do you have your own program
> getting the names of the files.  If the latter, are you using
> telldir()/seekdir() in any way?

I'm testing with 'ls|sort -n|uniq -d' and also with a simple program that
simply counts how often readdir can be called.
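The counting program itself does not survive in the archived message. The
following is a minimal sketch of such a counter; the file name and
command-line interface are illustrative, not the original program. It
produces output in the same form quoted later in the thread ("expected N
files, but readdir reports N+1"):

/*
 * count_readdir.c - minimal sketch of a readdir() counting test.
 *
 * Compile: cc -o count_readdir count_readdir.c
 * Usage:   ./count_readdir <directory> <expected-entry-count>
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	DIR *dir;
	struct dirent *de;
	long count = 0, expected;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dir> <expected-entries>\n", argv[0]);
		return 1;
	}
	expected = strtol(argv[2], NULL, 0);

	dir = opendir(argv[1]);
	if (!dir) {
		perror("opendir");
		return 1;
	}
	/* Count everything readdir() hands back, except "." and "..";
	 * on an affected directory the count comes out one too high. */
	while ((de = readdir(dir)) != NULL) {
		if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
			count++;
	}
	closedir(dir);

	if (count != expected)
		printf("expected %ld files, but readdir reports %ld\n",
		       expected, count);
	return 0;
}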
> - Ted > Tom From sandeen at redhat.com Fri Aug 1 14:47:07 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 01 Aug 2008 09:47:07 -0500 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <1217599231.14552.13.camel@kannnix.a2x.lan.at> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217599231.14552.13.camel@kannnix.a2x.lan.at> Message-ID: <489321EB.3070009@redhat.com> Thomas Trauner wrote: > On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote: >> On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote: >>> I have a problem with directories that contain more than 10000 entries >>> (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use >>> readdir(3) or readdir64(3) you get one entry twice, with same name and >>> inode. >>> >> How reproducible is this; can you reproduce it on this one filesystem? >> Can you reproduce it on multiple filesystems? What sort of file names >> are you using? > > Every time I tried. It is reproducible on the same filesystem, and also on other > systems with different filesystem sizes and usage patterns. > It showed up when on of our own script working through a Subversion directory failed. > > File names are numbers, starting with "0" counting up. > >> Also, are you testing by using "ls", or do you have your own program >> getting the names of the files. If the latter, are you using >> telldir()/seekdir() in any way? > > I'm testing with 'ls|sort -n|uniq -d' and also with a simple program > that simply counts how often readdir can be called. > Hm, a bog-simple test here doesn't show any trouble: [root at inode dirtest]# for I in `seq 0 10500`; do touch $I; done [root at inode dirtest]# ls | sort -n | uniq -d [root at inode dirtest]# ls | wc -l 10501 does that reflect what you're doing? Do you have a testcase you can share? -Eric From tt at it-austria.net Mon Aug 4 07:48:39 2008 From: tt at it-austria.net (Thomas Trauner) Date: Mon, 04 Aug 2008 09:48:39 +0200 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <489321EB.3070009@redhat.com> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217599231.14552.13.camel@kannnix.a2x.lan.at> <489321EB.3070009@redhat.com> Message-ID: <1217836119.14552.27.camel@kannnix.a2x.lan.at> On Fri, 2008-08-01 at 09:47 -0500, Eric Sandeen wrote: > Hm, a bog-simple test here doesn't show any trouble: > > [root at inode dirtest]# for I in `seq 0 10500`; do touch $I; done > [root at inode dirtest]# ls | sort -n | uniq -d > [root at inode dirtest]# ls | wc -l > 10501 > > does that reflect what you're doing? Do you have a testcase you can share? Yes, but I've written incorrect values, sorry. 
It's a little bit higher; a run of my program outputs this on
2.6.24-19-generic (Ubuntu 8.04.1):

expected 11778 files, but readdir reports 11779
expected 11862 files, but readdir64 reports 11863

And on 2.6.18-92.1.6.el5 (RHEL 5.2):

expected 72922 files, but readdir reports 72923
expected 73131 files, but readdir64 reports 73132

The testcase is here:

> -Eric

Tom

From tt at it-austria.net  Tue Aug 5 10:53:51 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Tue, 05 Aug 2008 12:53:51 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080801121658.GG8736@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
Message-ID: <1217933631.14552.45.camel@kannnix.a2x.lan.at>

On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> >
> > I have a problem with directories that contain more than 10000 entries
> > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > inode.
> >
>
> How reproducible is this; can you reproduce it on this one filesystem?
> Can you reproduce it on multiple filesystems?  What sort of file names
> are you using?

I made new tests with the code under
on a bunch of freshly generated and empty filesystems, each about 38GB in
size, of type fat (aborted after about 22000 entries because it took too
long), ext2, xfs, jfs and again ext3. All tests were made with
2.6.24-19-generic (Ubuntu 8.04.1).

I also tried minix fs, just for fun, but I could only create 126 files.

Ext3 shows the same effect as before, but at 103033 entries (readdir) and
104136 entries (readdir64).

'ls|sort -n|uniq -d' output (ls uses getdents64, so I assume it uses
readdir64, but I haven't checked the ls source):

root at darfnix:/readdir/ext3/testdir# ls|sort -n|uniq -d
102456
root at darfnix:/readdir/ext3/testdir#

Can I do anything else?

Regards
Tom

From tytso at mit.edu  Wed Aug 6 04:46:09 2008
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 6 Aug 2008 00:46:09 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217933631.14552.45.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
Message-ID: <20080806044609.GA9277@mit.edu>

On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote:
> > How reproducible is this; can you reproduce it on this one filesystem?
> > Can you reproduce it on multiple filesystems?  What sort of file names
> > are you using?
>
> I made new tests with the code under
> on a bunch of freshly
> generated and empty filesystems, every about 38GB large, of type fat
> (aborted after about 22000 entries because it took to long), ext2, xfs,
> jfs and again ext3. All tests made with 2.6.24-19-generic (ubuntu
> 8.04.1).

I was able to reproduce using ext3.  It looks like it's caused by a
hash collision; but ext3 has code that's supposed to avoid returning a
directory entry doubled in this fashion.  I'll have to look into it.
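The collision explanation can be checked directly against an affected
directory with libext2fs, which exports the same ext2fs_dirhash() helper
used by the test program posted later in this thread. A rough sketch,
assuming the e2fsprogs development headers are installed, that the
filesystem still uses the TEA hash (the default under discussion), and
that the seed passed on the command line is the "Directory Hash Seed"
printed by dumpe2fs -h; names and structure here are illustrative:

/*
 * dirhash_check.c - hash every name in an existing directory the way
 * ext3's htree code would, and report (hash, minor_hash) collisions.
 *
 * Compile: cc -o dirhash_check dirhash_check.c -lext2fs -lcom_err -luuid
 * Usage:   ./dirhash_check <directory> <directory-hash-seed-uuid>
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ext2fs/ext2fs.h"
#include "uuid/uuid.h"

struct ent {
	char		name[256];
	ext2_dirhash_t	hash, minor;
};

static int cmp(const void *a, const void *b)
{
	const struct ent *x = a, *y = b;

	if (x->hash != y->hash)
		return x->hash < y->hash ? -1 : 1;
	if (x->minor != y->minor)
		return x->minor < y->minor ? -1 : 1;
	return 0;
}

int main(int argc, char **argv)
{
	unsigned char	seed[16];
	struct ent	*tab = NULL;
	size_t		n = 0, max = 0, i;
	DIR		*dir;
	struct dirent	*de;

	if (argc != 3 || uuid_parse(argv[2], seed)) {
		fprintf(stderr, "usage: %s <dir> <hash-seed-uuid>\n", argv[0]);
		return 1;
	}
	dir = opendir(argv[1]);
	if (!dir) {
		perror("opendir");
		return 1;
	}
	while ((de = readdir(dir)) != NULL) {
		if (n == max) {
			max = max ? max * 2 : 1024;
			tab = realloc(tab, max * sizeof(*tab));
			if (!tab) {
				perror("realloc");
				return 1;
			}
		}
		strncpy(tab[n].name, de->d_name, sizeof(tab[n].name) - 1);
		tab[n].name[sizeof(tab[n].name) - 1] = 0;
		/* EXT2_HASH_TEA is assumed here; a filesystem switched to
		 * half_md4 would use EXT2_HASH_HALF_MD4 instead. */
		ext2fs_dirhash(EXT2_HASH_TEA, de->d_name, strlen(de->d_name),
			       (__u32 *) seed, &tab[n].hash, &tab[n].minor);
		n++;
	}
	closedir(dir);

	qsort(tab, n, sizeof(*tab), cmp);
	for (i = 0; n && i < n - 1; i++)
		if (tab[i].hash == tab[i + 1].hash &&
		    tab[i].minor == tab[i + 1].minor)
			printf("collision: %s and %s (%08x:%08x)\n",
			       tab[i].name, tab[i + 1].name,
			       tab[i].hash, tab[i].minor);
	free(tab);
	return 0;
}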
						- Ted

From tt at it-austria.net  Wed Aug 6 13:33:17 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Wed, 06 Aug 2008 15:33:17 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080806044609.GA9277@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
	<20080806044609.GA9277@mit.edu>
Message-ID: <1218029597.14552.54.camel@kannnix.a2x.lan.at>

On Wed, 2008-08-06 at 00:46 -0400, Theodore Tso wrote:
> I was able to reproduce using ext3.  It looks like it's caused by a
> hash collision; but ext3 has code that's supposed to avoid returning a
> directory entry doubled in this fashion.  I'll have to look into it.
>
> 						- Ted

Thank you.

Tom

From tytso at MIT.EDU  Wed Aug 6 14:07:23 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Wed, 6 Aug 2008 10:07:23 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217933631.14552.45.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
Message-ID: <20080806140722.GA14109@mit.edu>

On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote:
> On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> > On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> > >
> > > I have a problem with directories that contain more than 10000 entries
> > > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > > inode.
> > >
> I made new tests with the code under
> on a bunch of freshly
> generated and empty filesystems, every about 38GB large, of type fat
> (aborted after about 22000 entries because it took to long), ext2, xfs,
> jfs and again ext3....

OK, I have a workaround for you.  It appears there's a kernel bug
hiding here, since there shouldn't be duplicates returned by readdir()
even if we have hash collisions.

It turns out though that the TEA hash we are currently using as the
default is a really sucky hash.  I can't remember who suggested it; I
may go looking in the archives just out of curiosity.  My fault, though,
I should have tested it much more thoroughly, although it *looked* good,
and it was taken from the core of an encryption algorithm, so I thought
it would be OK.  The claim was that it was just as good for our purposes
as the cut-down md4 hash we were using, but it was faster (so it would
burn fewer CPU cycles).  Unfortunately, (a) at least on modern hardware
(I tested on an X61s laptop) the TEA hash is in fact a little slower,
and (b) for small filenames with small hamming distances between them,
such as what you are using in your test, it's generating lots of
collisions.

Anyway, the workaround is as follows:

	debugfs -w /dev/sdXXX
	debugfs: set_super_value def_hash_version half_md4
	debugfs: quit

Then completely delete any directories where you were having problems,
and recreate them.  (You can do the "mkdir foo.new; mv foo/* foo.new;
rmdir foo; mv foo.new foo" trick if you want to preserve the files in
that directory.)

In any case, here's the test case which shows the hash collision
problem much more quickly.
You can also use it for benchmarks, like so:

	time tst_hash -q -a tea -n 3000000
	time tst_hash -q -a half_md4 -n 3000000

With the following options we can also see that, with the right filename
lengths, the tea algorithm doesn't create any hash collisions, so maybe
whoever tested the algorithm before they suggested it just got unlucky
with the set of filenames that he/she chose:

	tst_hash -p 0000 -a tea -n 3000000

In any case, unless someone comes up with a really good reason, I
probably will change the default hash algorithm for mke2fs to half_md4,
since it is both faster and a better hash function.

This doesn't change the fact that the kernel should do the right thing
with hash collisions, at least in the simple case without
telldir/seekdir.  When I merged the htree code I had tested it with the
Douglas Adams hash (always returns a hash value of 0x00000042:00000000
no matter what its inputs), and it did the right thing, so we must have
regressed somewhere along the line...

						- Ted

/*
 * tst_htree.c
 *
 * Copyright (C) 2008 by Theodore Ts'o.
 *
 * This file may be redistributed under the terms of the GNU Public
 * License, Version 2
 *
 * Compile command:
 *	cc -g -O2 -o tst_hash tst_hash.c -lext2fs -lcom_err -luuid -le2p
 */
/* (The header names in the original #include lines were lost when this
 *  message was archived; the includes below are the headers this program
 *  needs in order to compile.) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <e2p/e2p.h>
#include "ext2fs/ext2fs.h"
#include "uuid/uuid.h"
#include "et/com_err.h"

#define SEED	"87fd5d61-4612-4147-8bf5-a21948e7e909"

struct hash {
	int		num;
	ext2_dirhash_t	hash, minor_hash;
};

static EXT2_QSORT_TYPE hash_cmp(const void *a, const void *b)
{
	const struct hash *db_a = (const struct hash *) a;
	const struct hash *db_b = (const struct hash *) b;

	if (db_a->hash != db_b->hash)
		return (int) (db_a->hash - db_b->hash);

	return (int) (db_a->minor_hash - db_b->minor_hash);
}

int main(int argc, char **argv)
{
	errcode_t	errcode;
	ext2_dirhash_t	hash, minor_hash;
	int		hash_alg = EXT2_HASH_TEA;
	char		name[200], *tmp, prefix[100];
	unsigned char	uuid[16];
	int		thislen, i, c, quiet = 0, num_hashes = 300000;
	struct hash	*hash_array;

	uuid_parse(SEED, uuid);
	prefix[0] = 0;
	while ((c = getopt(argc, argv, "s:a:n:qp:")) != EOF)
		switch (c) {
		case 's':
			uuid_parse(optarg, uuid);
			break;
		case 'a':
			hash_alg = e2p_string2hash(optarg);
			if (hash_alg < 0) {
				fprintf(stderr, "Invalid hash algorithm: %s\n",
					optarg);
				exit(1);
			}
			break;
		case 'n':
			num_hashes = strtoul (optarg, &tmp, 0);
			if (*tmp) {
				com_err (argv[0], 0, "count - %s", optarg);
				exit(1);
			}
			break;
		case 'p':
			if (strlen(optarg)+1 > sizeof(prefix)) {
				fprintf(stderr, "%s: prefix too large!\n",
					argv[0]);
				exit(1);
			}
			strcpy(prefix, optarg);
			break;
		case 'q':
			quiet = 1;
			break;
		default:
			fprintf(stderr, "Usage: %s [-q] [-s hash_seed] "
				"[-a hash_alg] [-n num_hashes]\n", argv[0]);
			exit(1);
		}

	hash_array = malloc(num_hashes * sizeof(struct hash));
	if (hash_array == NULL) {
		fprintf(stderr, "Couldn't allocate hash_array\n");
		exit(1);
	}

	for (i=0; i < num_hashes; i++) {
		sprintf(name, "%s%d", prefix, i);
		errcode = ext2fs_dirhash(hash_alg, name, strlen(name),
					 (__u32 *) uuid,
					 &hash_array[i].hash,
					 &hash_array[i].minor_hash);
		if (errcode) {
			com_err("ext2fs_dirhash", errcode,
				"while trying to hash '%s'", name);
			exit(1);
		}
		hash_array[i].num = i;
	}

	qsort(hash_array, (size_t) num_hashes, sizeof(struct hash), hash_cmp);

	for (c=0,i=0; i < num_hashes-1; i++) {
		if ((hash_array[i].hash == hash_array[i+1].hash) &&
		    (hash_array[i].minor_hash == hash_array[i+1].minor_hash)) {
			c++;
			if (quiet)
				continue;
			printf("hash collision: %d, %d: %08x:%08x\n",
			       hash_array[i].num, hash_array[i+1].num,
			       hash_array[i].hash, hash_array[i].minor_hash);
		}
	}
	printf("%d collisions\n", c);
	exit(0);
}

From tytso at MIT.EDU  Wed Aug 6 14:45:47 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Wed, 6 Aug 2008 10:45:47 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217836119.14552.27.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217599231.14552.13.camel@kannnix.a2x.lan.at>
	<489321EB.3070009@redhat.com>
	<1217836119.14552.27.camel@kannnix.a2x.lan.at>
Message-ID: <20080806144547.GB14109@mit.edu>

On Mon, Aug 04, 2008 at 09:48:39AM +0200, Thomas Trauner wrote:
> Yes, but I've written incorrect values, sorry. It's a little bit higher,
> a run of my program outputs this on 2.6.24-19-generic (ubuntu 8.04.1):
>
> expected 11778 files, but readdir reports 11779
> expected 11862 files, but readdir64 reports 11863
>
> And on 2.6.18-92.1.6.el5 (rhel 5.2):
> expected 72922 files, but readdir reports 72923
> expected 73131 files, but readdir64 reports 73132

BTW, I doubt the difference in what you had on your Ubuntu and RHEL
system has anything to do with the kernel version or the distribution,
but just the luck of the draw.  If you run "dumpe2fs -h /dev/sdXX |
grep 'Hash Seed'" from both systems, and then take that uuid and feed
it to the tst_hash program via the -s option, you'll probably see it
was simply the different directory hash seed which is changing when the
first collision happened:

%./tst_hash -s 27e0ed94-069c-44c0-bea0-044b1a8d7bcc
hash collision: 142886, 142987: 7104d654:131c0700
hash collision: 188030, 188131: aefe1dc2:f7517103
hash collision: 14020, 14031: fc717efa:87ce3eaa
hash collision: 120336, 120732: 34c3f1b6:cee72d50
4 collisions

vs.

% ./tst_hash -s 7089e459-07c2-43cc-b25f-bafdcce9cd05
hash collision: 167469, 167568: 4de08834:3fa2a17a
hash collision: 133356, 133752: ce1bfd8e:a1bce824
hash collision: 179218, 179319: ea71d5c8:43471df9
hash collision: 111503, 111701: fbfcea6c:760591e8
hash collision: 134034, 134135: 0ff24a86:f627f5a1
hash collision: 252452, 252553: 6631082a:43adb3f4
hash collision: 101107, 101305: a1a99e86:8d50e974
hash collision: 62302, 62313: 2689a56c:38ccd31d
hash collision: 60242, 60253: d9e3f444:f252b5f5
9 collisions

With the first hash seed, the first collision happened with the
filenames 14020 and 14031.  With the second hash seed, you don't get a
collision until 60242 and 60253.

Regards,

						- Ted

From tt at it-austria.net  Wed Aug 6 15:14:43 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Wed, 06 Aug 2008 17:14:43 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080806140722.GA14109@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
	<20080806140722.GA14109@mit.edu>
Message-ID: <1218035683.14552.61.camel@kannnix.a2x.lan.at>

On Wed, 2008-08-06 at 10:07 -0400, Theodore Tso wrote:
> On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote:
> > On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> > > On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> > > >
> > > > I have a problem with directories that contain more than 10000 entries
> > > > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > > > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > > > inode.
> > > > > > I made new tests with the code under > > on a bunch of freshly > > generated and empty filesystems, every about 38GB large, of type fat > > (aborted after about 22000 entries because it took to long), ext2, xfs, > > jfs and again ext3.... > > OK, I have a workaroud for you. It appears there's a kernel bug > hiding here, since there shouldn't be duplicates returned by readdir() > even if we have hash collisions. Thank you for your fast help and detailed explanation! Now I've something to read at home ;) Thanks! Tom From snitzer at gmail.com Wed Aug 13 21:21:20 2008 From: snitzer at gmail.com (Mike Snitzer) Date: Wed, 13 Aug 2008 17:21:20 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <20080806140722.GA14109@mit.edu> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> Message-ID: <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> On Wed, Aug 6, 2008 at 10:07 AM, Theodore Tso wrote: > On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote: >> On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote: >> > On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote: >> > > >> > > I have a problem with directories that contain more than 10000 entries >> > > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use >> > > readdir(3) or readdir64(3) you get one entry twice, with same name and >> > > inode. >> > > >> I made new tests with the code under >> on a bunch of freshly >> generated and empty filesystems, every about 38GB large, of type fat >> (aborted after about 22000 entries because it took to long), ext2, xfs, >> jfs and again ext3.... > > OK, I have a workaroud for you. It appears there's a kernel bug > hiding here, since there shouldn't be duplicates returned by readdir() > even if we have hash collisions. Ted, The attached patch has served my employer (IBRIX) well for 2.5 years. It was only recently, when I re-raised this issue internally based on this thread, that a co-worker recalled the fix. regards, Mike -------------- next part -------------- A non-text attachment was scrubbed... Name: ext3_dx_readdir_hash_collision_fix.patch Type: text/x-patch Size: 1400 bytes Desc: not available URL: From tytso at mit.edu Thu Aug 14 02:58:21 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 13 Aug 2008 22:58:21 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> Message-ID: <20080814025821.GA6469@mit.edu> On Wed, Aug 13, 2008 at 05:21:20PM -0400, Mike Snitzer wrote: > > The attached patch has served my employer (IBRIX) well for 2.5 years. > It was only recently, when I re-raised this issue internally based on > this thread, that a co-worker recalled the fix. > The patch looks good. Did someone raise it 2.5 years ago, and we somehow dropped the ball, or did no one think to submit the patch upstream? Also, can I get a Signed-off-by: line for this patch? Thanks!! 
- Ted From magawake at gmail.com Thu Aug 14 12:33:29 2008 From: magawake at gmail.com (Mag Gam) Date: Thu, 14 Aug 2008 08:33:29 -0400 Subject: small blocks Message-ID: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> I am trying to understand what the purpose of having small blocks per inode. I know you can cram more inodes per filesystem, but what is the downside? TIA From sandeen at redhat.com Thu Aug 14 13:49:48 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Thu, 14 Aug 2008 08:49:48 -0500 Subject: small blocks In-Reply-To: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> Message-ID: <48A437FC.6010706@redhat.com> Mag Gam wrote: > I am trying to understand what the purpose of having small blocks per > inode. I know you can cram more inodes per filesystem, the main result is that you waste less space per file, since for randomly-sized files you waste half a block(size) per file. > but what is the > downside? More overhead for management, and more importantly, I still think there is a bug lurking somewhere with block size < page size (rpm tends to hit it for some people). -Eric From tytso at mit.edu Thu Aug 14 14:52:29 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 14 Aug 2008 10:52:29 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> Message-ID: <20080814145229.GA8256@mit.edu> On Wed, Aug 13, 2008 at 05:21:20PM -0400, Mike Snitzer wrote: > > The attached patch has served my employer (IBRIX) well for 2.5 years. > It was only recently, when I re-raised this issue internally based on > this thread, that a co-worker recalled the fix. > I can confirm that this patch fixes things; I also have this patch ported to ext4. I need a Signed-off-by before I can push this to Linus, though. - Ted From snitzer at gmail.com Thu Aug 14 23:27:45 2008 From: snitzer at gmail.com (Mike Snitzer) Date: Thu, 14 Aug 2008 19:27:45 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <20080814025821.GA6469@mit.edu> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> <20080814025821.GA6469@mit.edu> Message-ID: <170fa0d20808141627i7f05fbcdnf043e090834940dd@mail.gmail.com> On Wed, Aug 13, 2008 at 10:58 PM, Theodore Tso wrote: > On Wed, Aug 13, 2008 at 05:21:20PM -0400, Mike Snitzer wrote: >> >> The attached patch has served my employer (IBRIX) well for 2.5 years. >> It was only recently, when I re-raised this issue internally based on >> this thread, that a co-worker recalled the fix. >> > > The patch looks good. Did someone raise it 2.5 years ago, and we > somehow dropped the ball, or did no one think to submit the patch > upstream? We intended to push this fix upstream but doing so got inadvertently overlooked as we put focus to new issues. I mentioned how long ago this patch was developed purely to help illustrate the stability of the fix. > Also, can I get a Signed-off-by: line for this patch? 
Eugene Dashevsky authored the patch; I refreshed it against 2.6.27-rc3: Signed-off-by: Eugene Dashevsky Signed-off-by: Mike Snitzer thanks, Mike From magawake at gmail.com Fri Aug 15 11:21:23 2008 From: magawake at gmail.com (Mag Gam) Date: Fri, 15 Aug 2008 07:21:23 -0400 Subject: small blocks In-Reply-To: <48A437FC.6010706@redhat.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> <48A437FC.6010706@redhat.com> Message-ID: <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> Hmm, I am wasting less space when I lower the ratio from 4096 to 1024 (the minimum). Why don't more people do this, since its frugal. I guess your point of more overhead, but what causes more overhead? Also, do you have the buzilla number I can investigate for this? Sorry for such a newbie question. TIA On Thu, Aug 14, 2008 at 9:49 AM, Eric Sandeen wrote: > Mag Gam wrote: >> I am trying to understand what the purpose of having small blocks per >> inode. I know you can cram more inodes per filesystem, > > the main result is that you waste less space per file, since for > randomly-sized files you waste half a block(size) per file. > >> but what is the >> downside? > > More overhead for management, and more importantly, I still think there > is a bug lurking somewhere with block size < page size (rpm tends to hit > it for some people). > > -Eric > From sandeen at redhat.com Fri Aug 15 13:53:33 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 15 Aug 2008 08:53:33 -0500 Subject: small blocks In-Reply-To: <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> <48A437FC.6010706@redhat.com> <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> Message-ID: <48A58A5D.4030309@redhat.com> Mag Gam wrote: > Hmm, I am wasting less space when I lower the ratio from 4096 to 1024 > (the minimum). Why don't more people do this, since its frugal. > I guess your point of more overhead, but what causes more overhead? > Also, do you have the buzilla number I can investigate for this? > > Sorry for such a newbie question. Now that I reread, perhaps I gave you the wrong answer anyway. Are you talking about the -i or the -b option? -Eric From magawake at gmail.com Fri Aug 15 23:07:42 2008 From: magawake at gmail.com (Mag Gam) Date: Fri, 15 Aug 2008 19:07:42 -0400 Subject: small blocks In-Reply-To: <48A58A5D.4030309@redhat.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> <48A437FC.6010706@redhat.com> <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> <48A58A5D.4030309@redhat.com> Message-ID: <1cbd6f830808151607u51bd366cg1148b5278a852b86@mail.gmail.com> Asking about the -i options. -i bytes-per-inode The man page states, "This value generally shouldn't be smaller than the blocksize of the filesystem, since then too many inodes will be made." So, whats the problem of having too many inodes On Fri, Aug 15, 2008 at 9:53 AM, Eric Sandeen wrote: > Mag Gam wrote: >> Hmm, I am wasting less space when I lower the ratio from 4096 to 1024 >> (the minimum). Why don't more people do this, since its frugal. >> I guess your point of more overhead, but what causes more overhead? >> Also, do you have the buzilla number I can investigate for this? >> >> Sorry for such a newbie question. > > Now that I reread, perhaps I gave you the wrong answer anyway. > > Are you talking about the -i or the -b option? 
>
> -Eric
>

From sandeen at redhat.com  Fri Aug 15 23:11:01 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 15 Aug 2008 18:11:01 -0500
Subject: small blocks
In-Reply-To: <1cbd6f830808151607u51bd366cg1148b5278a852b86@mail.gmail.com>
References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com>
	<48A437FC.6010706@redhat.com>
	<1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com>
	<48A58A5D.4030309@redhat.com>
	<1cbd6f830808151607u51bd366cg1148b5278a852b86@mail.gmail.com>
Message-ID: <48A60D05.9030602@redhat.com>

Mag Gam wrote:
> Asking about the -i options.
>
> -i bytes-per-inode
>
> The man page states, "This value generally shouldn't be smaller than
> the blocksize of the filesystem, since then too many inodes will be
> made."
>
> So, whats the problem of having too many inodes

You waste space on unused inodes.

And the problem of not having _enough_ is, you can't make new files even
when you have lots of blocks free, and you can't change that after the
fact.  It's one of the drawbacks of not dynamically allocating inodes.

-Eric

From pegasus at nerv.eu.org  Thu Aug 21 11:07:52 2008
From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=)
Date: Thu, 21 Aug 2008 13:07:52 +0200
Subject: ext2online with 1k blocks not working
Message-ID: <20080821130752.98ce8b6b.pegasus@nerv.eu.org>

Hello,

As Virtuozzo users, we have the majority of our disk space formatted with
-i 1024 -b 1024.

Lately I discovered that on CentOS 4.6 ext2online barfs when I try to grow
such a filesystem. Running it with -v -d, it prints lots of lines like:

ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
ext2online: 873646830 is a bad size for an ext2 fs! rounding down to 873644033
...
group NNN inode table has offset 2, not 2475
...
checking for group block NNNN in Bond
found 2218 not 2474 at 3513[168]

ext2online: unable to resize /dev/cciss/c0d0p3

And the exit error code is 3.

I verified on a test system that ext2online works perfectly well in the
same situation with 4k blocks.

Any ideas?

--
Jure Pečar
http://jure.pecar.org

From tytso at mit.edu  Thu Aug 21 13:47:33 2008
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 21 Aug 2008 09:47:33 -0400
Subject: ext2online with 1k blocks not working
In-Reply-To: <20080821130752.98ce8b6b.pegasus@nerv.eu.org>
References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org>
Message-ID: <20080821134733.GH16634@mit.edu>

On Thu, Aug 21, 2008 at 01:07:52PM +0200, Jure Pečar wrote:
>
> As a Virtuozzo users we have majority of our diskspace formatted with -i 1024 -b 1024.
>
> Lately I discovered that on CentOS 4.6 ext2online barfs when I try to grow such filesystem. Running it with -v -d, it prints lots of lines like:
>
> ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
> ext2online: 873646830 is a bad size for an ext2 fs! rounding down to 873644033
> ...
> group NNN inode table has offset 2, not 2475
> ...
> checking for group block NNNN in Bond
> found 2218 not 2474 at 3513[168]
>
> ext2online: unable to resize /dev/cciss/c0d0p3

Can you replicate the problem using resize2fs from e2fsprogs version
1.41.0?  Resize2fs has supported online resize for quite some time, and
I'm not sure the ext2online tool is being actively maintained at this
point.

Out of curiosity, why are you using a 1k blocksize?  Does Virtuozzo
require it?  Especially for a filesystem as big as what you are
apparently using, there will be some significant performance downsides
with using a 1k blocksize.  And the -i 1024; are you storing huge
numbers of small files?
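For a sense of scale on that last question: the inode table is a fixed
cost chosen at mke2fs time, which is the overhead Eric alludes to above.
A back-of-the-envelope sketch, assuming 128-byte on-disk inodes (the
long-standing ext3 default) and a filesystem roughly the size of the one
described above; real mke2fs figures will differ a little because of
group metadata:

/* inode_overhead.c - rough arithmetic only; not tied to any real device. */
#include <stdio.h>

int main(void)
{
	unsigned long long fs_bytes = 870ULL << 30;	/* ~870 GB filesystem */
	unsigned long long bytes_per_inode = 1024;	/* mke2fs -i 1024 */
	unsigned long long inode_size = 128;		/* assumed ext3 default */

	unsigned long long inodes = fs_bytes / bytes_per_inode;
	unsigned long long table_bytes = inodes * inode_size;

	printf("%llu inodes; inode tables use ~%llu GB (%.1f%% of the fs)\n",
	       inodes, table_bytes >> 30, 100.0 * table_bytes / fs_bytes);
	return 0;
}

With -i 1024 and 128-byte inodes, roughly one eighth of the filesystem is
committed to inode tables whether the inodes are ever used or not.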
- Ted From pegasus at nerv.eu.org Thu Aug 21 18:32:17 2008 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Thu, 21 Aug 2008 20:32:17 +0200 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821134733.GH16634@mit.edu> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> Message-ID: <20080821203217.684c2424.pegasus@nerv.eu.org> On Thu, 21 Aug 2008 09:47:33 -0400 Theodore Tso wrote: > Can you replicate the problem using resize2fs from e2fsprogs version > 1.41.0? Resize2fs has supported online resize for quite sometime, and > I'm not sure the ext2online tool is being actively maintained at this > point. Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot its name. [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 /var/log/messages show: localhost kernel: JBD: resize2fs wants too many credits (3498 > 2048) And filesystem grew from 485G to only 534G and not 800 and something G. > Out of curiosity, why are you using a 1k blocksize? Does Virtuozzo > require it? Especially for a filesystem as big what you are > apparently using, there will be some significant performance downsides > with using a 1k blocksize. And the -i 1024; are you storing huge > numbers of small files? Commercial version of Virtuozzo (unlike free OpenVZ) offers "vzfs" which adds some kind of CoW symlink on top of ext3. From the host point of view, every new virtual environment is just a bunch of symlinks pointing to an OS template. So yes, there are many files and many of them are just symlinks. We haven't met any performance issues (yet), only the upper file size limit (16GB). There's potential for unacceptably long fsck times and we're rethinking our setup to avoid that. -- Jure Pe?ar http://jure.pecar.org/ From tytso at mit.edu Thu Aug 21 19:56:53 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 21 Aug 2008 15:56:53 -0400 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821203217.684c2424.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> Message-ID: <20080821195653.GB9791@mit.edu> On Thu, Aug 21, 2008 at 08:32:17PM +0200, Jure Pe?ar wrote: > > Can you replicate the problem using resize2fs from e2fsprogs version > > 1.41.0? Resize2fs has supported online resize for quite sometime, and > > I'm not sure the ext2online tool is being actively maintained at this > > point. > > Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot > its name. > > [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 > Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. > ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 Hmm... can you send me the output of "dumpe2fs -h /dev/cciss/c0d0p3"? 
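The kernel line quoted above, "JBD: resize2fs wants too many credits
(3498 > 2048)", has a simple reading if one assumes JBD's usual policy of
capping a single transaction at one quarter of the journal length (an
assumption about the kernel, not something stated in this thread): a
2048-credit cap implies a journal of 8192 blocks, which at a 1k blocksize
is only 8 MB, while adding one block group during this resize needed 3498
credits. A sketch of the arithmetic:

/* journal_credit_math.c - illustrative arithmetic only. */
#include <stdio.h>

int main(void)
{
	unsigned int blocksize = 1024;		/* the -b 1024 filesystem above */
	unsigned int journal_blocks = 8192;	/* implied by the 2048-credit cap */
	unsigned int max_credits = journal_blocks / 4;	/* assumed JBD cap */
	unsigned int wanted = 3498;		/* from the quoted kernel message */

	printf("journal: %u blocks (%u MB), per-transaction cap: %u credits\n",
	       journal_blocks, journal_blocks * blocksize >> 20, max_credits);
	printf("resize2fs asked for %u credits, so the group add is refused\n",
	       wanted);
	return 0;
}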
- Ted From adilger at sun.com Sat Aug 23 11:32:05 2008 From: adilger at sun.com (Andreas Dilger) Date: Sat, 23 Aug 2008 05:32:05 -0600 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821203217.684c2424.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> Message-ID: <20080823113205.GO3392@webber.adilger.int> On Aug 21, 2008 20:32 +0200, Jure Pe?ar wrote: > On Thu, 21 Aug 2008 09:47:33 -0400 > Theodore Tso wrote: > > Can you replicate the problem using resize2fs from e2fsprogs version > > 1.41.0? Resize2fs has supported online resize for quite sometime, and > > I'm not sure the ext2online tool is being actively maintained at this > > point. > > Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot > its name. > > [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 > Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. > ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 > > /var/log/messages show: > localhost kernel: JBD: resize2fs wants too many credits (3498 > 2048) > > And filesystem grew from 485G to only 534G and not 800 and something G. How big is your journal? It seems it is only 8MB, which isn't large enough to a resize 870GB filesystem. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From sandeen at redhat.com Sat Aug 23 15:36:07 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Sat, 23 Aug 2008 10:36:07 -0500 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821203217.684c2424.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> Message-ID: <48B02E67.9010804@redhat.com> Jure Pe?ar wrote: > On Thu, 21 Aug 2008 09:47:33 -0400 > Theodore Tso wrote: > >> Can you replicate the problem using resize2fs from e2fsprogs version >> 1.41.0? Resize2fs has supported online resize for quite sometime, and >> I'm not sure the ext2online tool is being actively maintained at this >> point. > > Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot > its name. > > [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 > Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. > ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 > > /var/log/messages show: > localhost kernel: JBD: resize2fs wants too many credits (3498 > 2048) > > And filesystem grew from 485G to only 534G and not 800 and something G. You didn't say exactly which kernel version this is, but this might be fixed in newer RHEL (er, CentOS) kernels: * Fri Mar 28 2008 Vivek Goyal [2.6.9-68.28] ... -ext3: lighten up resize transaction requirements (Eric Sandeen) [166038] Although usually I got -ENOSPC back to userspace .. -Eric From pegasus at nerv.eu.org Sun Aug 24 15:31:43 2008 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Sun, 24 Aug 2008 17:31:43 +0200 Subject: ext2online with 1k blocks not working In-Reply-To: <20080823113205.GO3392@webber.adilger.int> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> <20080823113205.GO3392@webber.adilger.int> Message-ID: <20080824173143.d7f4c79d.pegasus@nerv.eu.org> On Sat, 23 Aug 2008 05:32:05 -0600 Andreas Dilger wrote: > How big is your journal? 
It seems it is only 8MB, which isn't large > enough to a resize 870GB filesystem. > > Cheers, Andreas Yes, that's the conclusion Ted came up with. Still, offline resizing works, so I'll just have to schedule more downtime for the resize to finish. -- Jure Pe?ar http://jure.pecar.org/ From pegasus at nerv.eu.org Sun Aug 24 15:34:01 2008 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Sun, 24 Aug 2008 17:34:01 +0200 Subject: ext2online with 1k blocks not working In-Reply-To: <48B02E67.9010804@redhat.com> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> <48B02E67.9010804@redhat.com> Message-ID: <20080824173401.f2cc4909.pegasus@nerv.eu.org> On Sat, 23 Aug 2008 10:36:07 -0500 Eric Sandeen wrote: > You didn't say exactly which kernel version this is, but this might be > fixed in newer RHEL (er, CentOS) kernels: > > * Fri Mar 28 2008 Vivek Goyal [2.6.9-68.28] > ... > -ext3: lighten up resize transaction requirements (Eric Sandeen) [166038] > > Although usually I got -ENOSPC back to userspace .. 2.6.9-67.0.22.ELsmp ... almost there ;) Thanks for info, but it wouldn't make any difference for us, since we're limited with virtuozzo kernels (which are based on rhel kernels). -- Jure Pe?ar http://jure.pecar.org/ From adilger at sun.com Sun Aug 24 23:37:10 2008 From: adilger at sun.com (Andreas Dilger) Date: Sun, 24 Aug 2008 17:37:10 -0600 Subject: ext2online with 1k blocks not working In-Reply-To: <20080824173143.d7f4c79d.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> <20080823113205.GO3392@webber.adilger.int> <20080824173143.d7f4c79d.pegasus@nerv.eu.org> Message-ID: <20080824233710.GO3392@webber.adilger.int> On Aug 24, 2008 17:31 +0200, Jure Pe?ar wrote: > On Sat, 23 Aug 2008 05:32:05 -0600 > Andreas Dilger wrote: > > How big is your journal? It seems it is only 8MB, which isn't large > > enough to a resize 870GB filesystem. > > Yes, that's the conclusion Ted came up with. > > Still, offline resizing works, so I'll just have to schedule more downtime > for the resize to finish. You may also consider resizing your journal while it is offline: tune2fs -O ^has_journal $dev {maybe e2fsck -f needed here} tune2fs -j $dev should create a journal with at least 32MB. You can check with: debugfs -c -R "stat <8>" $dev Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From ross at biostat.ucsf.edu Mon Aug 25 18:40:06 2008 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Mon, 25 Aug 2008 11:40:06 -0700 Subject: Problem in HTREE directory node Message-ID: <1219689606.12088.50.camel@corn.betterworld.us> Short version: fsck said "invalid HTREE directory inode 635113 (mail/r/user/ross/comp/admin-wheat) clear HTREE index?" To which I replied Yes. What exactly does this mean was corrupted? In particular, does it mean the list of files in the directory .../comp/admin-wheat was damaged? Or is the trouble in the comp directory? Is fsck likely to have fixed up things as good as new, or might something be lost or corrupted? I don't know what clearing the HTREE index does. How can I check if things are OK? I have backups. Longer version: After an ugly but should have been clean shutdown I got reports that most of my partitions were unclean. There were a lot of logs replayed. The partitions are almost all LVM volumes. 
My mail spool is ext3, and fsck showed

"Problem in HTREE directory inode 635112 node (627) not referenced.
Problem in HTREE directory inode 635112 node (628) has invalid depths
Problem in HTRE
Fsck died with exit status 4."

The message continued with info on a possible log (which wasn't
there--maybe because I use an initrd?) and the need for a manual check.
There were too many messages, about consecutively numbered nodes, to see
them all (always in pairs as above). 628 was the last.

Manual fsck tells me I can't use auto mode. A full manual run gives
"invalid HTREE directory inode 635113 (mail/r/user/ross/comp/admin-wheat)
clear HTREE index?"

I said Yes, and the run completed. Reboot. Still "FS not clean" messages
for most, and "cyrspool primary superblock features different from backup,
check forced." Finally it starts.

I'm running a Linux 2.6.25 kernel on a P4; that particular partition was
on a SATA disk. I recently added another SATA disk and added it to the
volume group that included my mail spool. I have some IDE disks too.

When LVM starts up it gives the error
Parse error at byte 3306 (line 253): unexpected token
9 times. I think it's been doing this for a long time. It seems to
discover and activate all the volume groups.

So, my main questions are up above ("short version"). I also wonder why,
even after my manual fsck, I got the error about the primary superblock
features differing.

My more general, and probably harder, question is how things could have
gotten into this state.

Thanks for any insight.

Ross Boylan