From tt at it-austria.net  Fri Aug 1 09:43:40 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Fri, 01 Aug 2008 11:43:40 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
Message-ID: <1217583820.12454.20.camel@kannnix.a2x.lan.at>

Hello,

I have a problem with directories that contain more than 10000 entries
(Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
readdir(3) or readdir64(3) you get one entry twice, with same name and
inode.

Some analysis showed that disabling dir_index solves this problem, but I
think that this is a bug in the ext3 code, as no other filesystem shows
this behavior.

I've found the following regarding this bug, but nothing about whether it
is fixed or whether a back-port for older 2.6 kernels exists.

and

On linux-fsdevel I've found the following, but they delete directory
entries in between multiple readdir calls.

Does anyone know where I could find more information or report this bug?

Thanks in advance!

Regards.

Tom Trauner

From tytso at mit.edu  Fri Aug 1 12:16:58 2008
From: tytso at mit.edu (Theodore Tso)
Date: Fri, 1 Aug 2008 08:16:58 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
Message-ID: <20080801121658.GG8736@mit.edu>

On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
>
> I have a problem with directories that contain more than 10000 entries
> (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> readdir(3) or readdir64(3) you get one entry twice, with same name and
> inode.
>

How reproducible is this; can you reproduce it on this one filesystem?
Can you reproduce it on multiple filesystems?  What sort of file names
are you using?

Also, are you testing by using "ls", or do you have your own program
getting the names of the files?  If the latter, are you using
telldir()/seekdir() in any way?

						- Ted

From tt at it-austria.net  Fri Aug 1 14:00:31 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Fri, 01 Aug 2008 16:00:31 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080801121658.GG8736@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
Message-ID: <1217599231.14552.13.camel@kannnix.a2x.lan.at>

On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> >
> > I have a problem with directories that contain more than 10000 entries
> > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > inode.
> >
>
> How reproducible is this; can you reproduce it on this one filesystem?
> Can you reproduce it on multiple filesystems?  What sort of file names
> are you using?

Every time I tried. It is reproducible on the same filesystem, and also on
other systems with different filesystem sizes and usage patterns. It showed
up when one of our own scripts working through a Subversion directory
failed.

File names are numbers, starting with "0" counting up.

> Also, are you testing by using "ls", or do you have your own program
> getting the names of the files.  If the latter, are you using
> telldir()/seekdir() in any way?

I'm testing with 'ls|sort -n|uniq -d' and also with a simple program that
simply counts how often readdir can be called.
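The counting program itself does not survive in the archived message. The
following is a minimal sketch of such a counter; the file name and
command-line interface are illustrative, not the original program. It
produces output in the same form quoted later in the thread ("expected N
files, but readdir reports N+1"):

/*
 * count_readdir.c - minimal sketch of a readdir() counting test.
 *
 * Compile: cc -o count_readdir count_readdir.c
 * Usage:   ./count_readdir <directory> <expected-entry-count>
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	DIR *dir;
	struct dirent *de;
	long count = 0, expected;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dir> <expected-entries>\n", argv[0]);
		return 1;
	}
	expected = strtol(argv[2], NULL, 0);

	dir = opendir(argv[1]);
	if (!dir) {
		perror("opendir");
		return 1;
	}
	/* Count everything readdir() hands back, except "." and "..";
	 * on an affected directory the count comes out one too high. */
	while ((de = readdir(dir)) != NULL) {
		if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
			count++;
	}
	closedir(dir);

	if (count != expected)
		printf("expected %ld files, but readdir reports %ld\n",
		       expected, count);
	return 0;
}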
> - Ted > Tom From sandeen at redhat.com Fri Aug 1 14:47:07 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 01 Aug 2008 09:47:07 -0500 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <1217599231.14552.13.camel@kannnix.a2x.lan.at> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217599231.14552.13.camel@kannnix.a2x.lan.at> Message-ID: <489321EB.3070009@redhat.com> Thomas Trauner wrote: > On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote: >> On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote: >>> I have a problem with directories that contain more than 10000 entries >>> (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use >>> readdir(3) or readdir64(3) you get one entry twice, with same name and >>> inode. >>> >> How reproducible is this; can you reproduce it on this one filesystem? >> Can you reproduce it on multiple filesystems? What sort of file names >> are you using? > > Every time I tried. It is reproducible on the same filesystem, and also on other > systems with different filesystem sizes and usage patterns. > It showed up when on of our own script working through a Subversion directory failed. > > File names are numbers, starting with "0" counting up. > >> Also, are you testing by using "ls", or do you have your own program >> getting the names of the files. If the latter, are you using >> telldir()/seekdir() in any way? > > I'm testing with 'ls|sort -n|uniq -d' and also with a simple program > that simply counts how often readdir can be called. > Hm, a bog-simple test here doesn't show any trouble: [root at inode dirtest]# for I in `seq 0 10500`; do touch $I; done [root at inode dirtest]# ls | sort -n | uniq -d [root at inode dirtest]# ls | wc -l 10501 does that reflect what you're doing? Do you have a testcase you can share? -Eric From tt at it-austria.net Mon Aug 4 07:48:39 2008 From: tt at it-austria.net (Thomas Trauner) Date: Mon, 04 Aug 2008 09:48:39 +0200 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <489321EB.3070009@redhat.com> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217599231.14552.13.camel@kannnix.a2x.lan.at> <489321EB.3070009@redhat.com> Message-ID: <1217836119.14552.27.camel@kannnix.a2x.lan.at> On Fri, 2008-08-01 at 09:47 -0500, Eric Sandeen wrote: > Hm, a bog-simple test here doesn't show any trouble: > > [root at inode dirtest]# for I in `seq 0 10500`; do touch $I; done > [root at inode dirtest]# ls | sort -n | uniq -d > [root at inode dirtest]# ls | wc -l > 10501 > > does that reflect what you're doing? Do you have a testcase you can share? Yes, but I've written incorrect values, sorry. 
It's a little bit higher; a run of my program outputs this on
2.6.24-19-generic (Ubuntu 8.04.1):

expected 11778 files, but readdir reports 11779
expected 11862 files, but readdir64 reports 11863

And on 2.6.18-92.1.6.el5 (RHEL 5.2):

expected 72922 files, but readdir reports 72923
expected 73131 files, but readdir64 reports 73132

The testcase is here:

> -Eric

Tom

From tt at it-austria.net  Tue Aug 5 10:53:51 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Tue, 05 Aug 2008 12:53:51 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080801121658.GG8736@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
Message-ID: <1217933631.14552.45.camel@kannnix.a2x.lan.at>

On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> >
> > I have a problem with directories that contain more than 10000 entries
> > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > inode.
> >
>
> How reproducible is this; can you reproduce it on this one filesystem?
> Can you reproduce it on multiple filesystems?  What sort of file names
> are you using?

I made new tests with the code under
on a bunch of freshly generated and empty filesystems, each about 38GB in
size, of type fat (aborted after about 22000 entries because it took too
long), ext2, xfs, jfs and again ext3. All tests were made with
2.6.24-19-generic (Ubuntu 8.04.1).

I also tried minix fs, just for fun, but I could only create 126 files.

Ext3 shows the same effect as before, but at 103033 entries (readdir) and
104136 entries (readdir64).

'ls|sort -n|uniq -d' output (ls uses getdents64, so I assume it uses
readdir64, but I haven't checked the ls source):

root at darfnix:/readdir/ext3/testdir# ls|sort -n|uniq -d
102456
root at darfnix:/readdir/ext3/testdir#

Can I do anything else?

Regards
Tom

From tytso at mit.edu  Wed Aug 6 04:46:09 2008
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 6 Aug 2008 00:46:09 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217933631.14552.45.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
Message-ID: <20080806044609.GA9277@mit.edu>

On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote:
> > How reproducible is this; can you reproduce it on this one filesystem?
> > Can you reproduce it on multiple filesystems?  What sort of file names
> > are you using?
>
> I made new tests with the code under
> on a bunch of freshly
> generated and empty filesystems, every about 38GB large, of type fat
> (aborted after about 22000 entries because it took to long), ext2, xfs,
> jfs and again ext3. All tests made with 2.6.24-19-generic (ubuntu
> 8.04.1).

I was able to reproduce using ext3.  It looks like it's caused by a
hash collision; but ext3 has code that's supposed to avoid returning a
directory entry doubled in this fashion.  I'll have to look into it.
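The collision explanation can be checked directly against an affected
directory with libext2fs, which exports the same ext2fs_dirhash() helper
used by the test program posted later in this thread. A rough sketch,
assuming the e2fsprogs development headers are installed, that the
filesystem still uses the TEA hash (the default under discussion), and
that the seed passed on the command line is the "Directory Hash Seed"
printed by dumpe2fs -h; names and structure here are illustrative:

/*
 * dirhash_check.c - hash every name in an existing directory the way
 * ext3's htree code would, and report (hash, minor_hash) collisions.
 *
 * Compile: cc -o dirhash_check dirhash_check.c -lext2fs -lcom_err -luuid
 * Usage:   ./dirhash_check <directory> <directory-hash-seed-uuid>
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ext2fs/ext2fs.h"
#include "uuid/uuid.h"

struct ent {
	char		name[256];
	ext2_dirhash_t	hash, minor;
};

static int cmp(const void *a, const void *b)
{
	const struct ent *x = a, *y = b;

	if (x->hash != y->hash)
		return x->hash < y->hash ? -1 : 1;
	if (x->minor != y->minor)
		return x->minor < y->minor ? -1 : 1;
	return 0;
}

int main(int argc, char **argv)
{
	unsigned char	seed[16];
	struct ent	*tab = NULL;
	size_t		n = 0, max = 0, i;
	DIR		*dir;
	struct dirent	*de;

	if (argc != 3 || uuid_parse(argv[2], seed)) {
		fprintf(stderr, "usage: %s <dir> <hash-seed-uuid>\n", argv[0]);
		return 1;
	}
	dir = opendir(argv[1]);
	if (!dir) {
		perror("opendir");
		return 1;
	}
	while ((de = readdir(dir)) != NULL) {
		if (n == max) {
			max = max ? max * 2 : 1024;
			tab = realloc(tab, max * sizeof(*tab));
			if (!tab) {
				perror("realloc");
				return 1;
			}
		}
		strncpy(tab[n].name, de->d_name, sizeof(tab[n].name) - 1);
		tab[n].name[sizeof(tab[n].name) - 1] = 0;
		/* EXT2_HASH_TEA is assumed here; a filesystem switched to
		 * half_md4 would use EXT2_HASH_HALF_MD4 instead. */
		ext2fs_dirhash(EXT2_HASH_TEA, de->d_name, strlen(de->d_name),
			       (__u32 *) seed, &tab[n].hash, &tab[n].minor);
		n++;
	}
	closedir(dir);

	qsort(tab, n, sizeof(*tab), cmp);
	for (i = 0; n && i < n - 1; i++)
		if (tab[i].hash == tab[i + 1].hash &&
		    tab[i].minor == tab[i + 1].minor)
			printf("collision: %s and %s (%08x:%08x)\n",
			       tab[i].name, tab[i + 1].name,
			       tab[i].hash, tab[i].minor);
	free(tab);
	return 0;
}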
						- Ted

From tt at it-austria.net  Wed Aug 6 13:33:17 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Wed, 06 Aug 2008 15:33:17 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080806044609.GA9277@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
	<20080806044609.GA9277@mit.edu>
Message-ID: <1218029597.14552.54.camel@kannnix.a2x.lan.at>

On Wed, 2008-08-06 at 00:46 -0400, Theodore Tso wrote:
> I was able to reproduce using ext3.  It looks like it's caused by a
> hash collision; but ext3 has code that's supposed to avoid returning a
> directory entry doubled in this fashion.  I'll have to look into it.
>
> 						- Ted

Thank you.

Tom

From tytso at MIT.EDU  Wed Aug 6 14:07:23 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Wed, 6 Aug 2008 10:07:23 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217933631.14552.45.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
Message-ID: <20080806140722.GA14109@mit.edu>

On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote:
> On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> > On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> > >
> > > I have a problem with directories that contain more than 10000 entries
> > > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > > inode.
> > >
> I made new tests with the code under
> on a bunch of freshly
> generated and empty filesystems, every about 38GB large, of type fat
> (aborted after about 22000 entries because it took to long), ext2, xfs,
> jfs and again ext3....

OK, I have a workaround for you.  It appears there's a kernel bug
hiding here, since there shouldn't be duplicates returned by readdir()
even if we have hash collisions.

It turns out though that the TEA hash we are currently using as the
default is a really sucky hash.  I can't remember who suggested it; I
may go looking in the archives just out of curiosity.  My fault, though,
I should have tested it much more thoroughly, although it *looked* good,
and it was taken from the core of an encryption algorithm, so I thought
it would be OK.  The claim was that it was just as good for our purposes
as the cut-down md4 hash we were using, but it was faster (so it would
burn fewer CPU cycles).  Unfortunately, (a) at least on modern hardware
(I tested on an X61s laptop) the TEA hash is in fact a little slower,
and (b) for small filenames with small hamming distances between them,
such as what you are using in your test, it's generating lots of
collisions.

Anyway, the workaround is as follows:

	debugfs -w /dev/sdXXX
	debugfs: set_super_value def_hash_version half_md4
	debugfs: quit

Then completely delete any directories where you were having problems,
and recreate them.  (You can do the "mkdir foo.new; mv foo/* foo.new;
rmdir foo; mv foo.new foo" trick if you want to preserve the files in
that directory.)

In any case, here's the test case which shows the hash collision
problem much more quickly.
You can also use it for benchmarks, like so:

	time tst_hash -q -a tea -n 3000000
	time tst_hash -q -a half_md4 -n 3000000

With the following options we can also see that, with the right filename
lengths, the tea algorithm doesn't create any hash collisions, so maybe
whoever tested the algorithm before they suggested it just got unlucky
with the set of filenames that he/she chose:

	tst_hash -p 0000 -a tea -n 3000000

In any case, unless someone comes up with a really good reason, I
probably will change the default hash algorithm for mke2fs to half_md4,
since it is both faster and a better hash function.

This doesn't change the fact that the kernel should do the right thing
with hash collisions, at least in the simple case without
telldir/seekdir.  When I merged the htree code I had tested it with the
Douglas Adams hash (always returns a hash value of 0x00000042:00000000
no matter what its inputs), and it did the right thing, so we must have
regressed somewhere along the line...

						- Ted

/*
 * tst_htree.c
 *
 * Copyright (C) 2008 by Theodore Ts'o.
 *
 * This file may be redistributed under the terms of the GNU Public
 * License, Version 2
 *
 * Compile command:
 *	cc -g -O2 -o tst_hash tst_hash.c -lext2fs -lcom_err -luuid -le2p
 */
/* (The header names in the original #include lines were lost when this
 *  message was archived; the includes below are the headers this program
 *  needs in order to compile.) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <e2p/e2p.h>
#include "ext2fs/ext2fs.h"
#include "uuid/uuid.h"
#include "et/com_err.h"

#define SEED	"87fd5d61-4612-4147-8bf5-a21948e7e909"

struct hash {
	int		num;
	ext2_dirhash_t	hash, minor_hash;
};

static EXT2_QSORT_TYPE hash_cmp(const void *a, const void *b)
{
	const struct hash *db_a = (const struct hash *) a;
	const struct hash *db_b = (const struct hash *) b;

	if (db_a->hash != db_b->hash)
		return (int) (db_a->hash - db_b->hash);

	return (int) (db_a->minor_hash - db_b->minor_hash);
}

int main(int argc, char **argv)
{
	errcode_t	errcode;
	ext2_dirhash_t	hash, minor_hash;
	int		hash_alg = EXT2_HASH_TEA;
	char		name[200], *tmp, prefix[100];
	unsigned char	uuid[16];
	int		thislen, i, c, quiet = 0, num_hashes = 300000;
	struct hash	*hash_array;

	uuid_parse(SEED, uuid);
	prefix[0] = 0;
	while ((c = getopt(argc, argv, "s:a:n:qp:")) != EOF)
		switch (c) {
		case 's':
			uuid_parse(optarg, uuid);
			break;
		case 'a':
			hash_alg = e2p_string2hash(optarg);
			if (hash_alg < 0) {
				fprintf(stderr, "Invalid hash algorithm: %s\n",
					optarg);
				exit(1);
			}
			break;
		case 'n':
			num_hashes = strtoul (optarg, &tmp, 0);
			if (*tmp) {
				com_err (argv[0], 0, "count - %s", optarg);
				exit(1);
			}
			break;
		case 'p':
			if (strlen(optarg)+1 > sizeof(prefix)) {
				fprintf(stderr, "%s: prefix too large!\n",
					argv[0]);
				exit(1);
			}
			strcpy(prefix, optarg);
			break;
		case 'q':
			quiet = 1;
			break;
		default:
			fprintf(stderr, "Usage: %s [-q] [-s hash_seed] "
				"[-a hash_alg] [-n num_hashes]\n", argv[0]);
			exit(1);
		}

	hash_array = malloc(num_hashes * sizeof(struct hash));
	if (hash_array == NULL) {
		fprintf(stderr, "Couldn't allocate hash_array\n");
		exit(1);
	}

	for (i=0; i < num_hashes; i++) {
		sprintf(name, "%s%d", prefix, i);
		errcode = ext2fs_dirhash(hash_alg, name, strlen(name),
					 (__u32 *) uuid,
					 &hash_array[i].hash,
					 &hash_array[i].minor_hash);
		if (errcode) {
			com_err("ext2fs_dirhash", errcode,
				"while trying to hash '%s'", name);
			exit(1);
		}
		hash_array[i].num = i;
	}

	qsort(hash_array, (size_t) num_hashes, sizeof(struct hash), hash_cmp);

	for (c=0,i=0; i < num_hashes-1; i++) {
		if ((hash_array[i].hash == hash_array[i+1].hash) &&
		    (hash_array[i].minor_hash == hash_array[i+1].minor_hash)) {
			c++;
			if (quiet)
				continue;
			printf("hash collision: %d, %d: %08x:%08x\n",
			       hash_array[i].num, hash_array[i+1].num,
			       hash_array[i].hash, hash_array[i].minor_hash);
		}
	}
	printf("%d collisions\n", c);
	exit(0);
}

From tytso at MIT.EDU  Wed Aug 6 14:45:47 2008
From: tytso at MIT.EDU (Theodore Tso)
Date: Wed, 6 Aug 2008 10:45:47 -0400
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <1217836119.14552.27.camel@kannnix.a2x.lan.at>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217599231.14552.13.camel@kannnix.a2x.lan.at>
	<489321EB.3070009@redhat.com>
	<1217836119.14552.27.camel@kannnix.a2x.lan.at>
Message-ID: <20080806144547.GB14109@mit.edu>

On Mon, Aug 04, 2008 at 09:48:39AM +0200, Thomas Trauner wrote:
> Yes, but I've written incorrect values, sorry. It's a little bit higher,
> a run of my program outputs this on 2.6.24-19-generic (ubuntu 8.04.1):
>
> expected 11778 files, but readdir reports 11779
> expected 11862 files, but readdir64 reports 11863
>
> And on 2.6.18-92.1.6.el5 (rhel 5.2):
> expected 72922 files, but readdir reports 72923
> expected 73131 files, but readdir64 reports 73132

BTW, I doubt the difference in what you had on your Ubuntu and RHEL
system has anything to do with the kernel version or the distribution,
but just the luck of the draw.  If you run "dumpe2fs -h /dev/sdXX |
grep 'Hash Seed'" from both systems, and then take that uuid and feed
it to the tst_hash program via the -s option, you'll probably see it
was simply the different directory hash seed which is changing when the
first collision happened:

%./tst_hash -s 27e0ed94-069c-44c0-bea0-044b1a8d7bcc
hash collision: 142886, 142987: 7104d654:131c0700
hash collision: 188030, 188131: aefe1dc2:f7517103
hash collision: 14020, 14031: fc717efa:87ce3eaa
hash collision: 120336, 120732: 34c3f1b6:cee72d50
4 collisions

vs.

% ./tst_hash -s 7089e459-07c2-43cc-b25f-bafdcce9cd05
hash collision: 167469, 167568: 4de08834:3fa2a17a
hash collision: 133356, 133752: ce1bfd8e:a1bce824
hash collision: 179218, 179319: ea71d5c8:43471df9
hash collision: 111503, 111701: fbfcea6c:760591e8
hash collision: 134034, 134135: 0ff24a86:f627f5a1
hash collision: 252452, 252553: 6631082a:43adb3f4
hash collision: 101107, 101305: a1a99e86:8d50e974
hash collision: 62302, 62313: 2689a56c:38ccd31d
hash collision: 60242, 60253: d9e3f444:f252b5f5
9 collisions

With the first hash seed, the first collision happened with the
filenames 14020 and 14031.  With the second hash seed, you don't get a
collision until 60242 and 60253.

Regards,

						- Ted

From tt at it-austria.net  Wed Aug 6 15:14:43 2008
From: tt at it-austria.net (Thomas Trauner)
Date: Wed, 06 Aug 2008 17:14:43 +0200
Subject: duplicate entries on ext3 when using readdir/readdir64
In-Reply-To: <20080806140722.GA14109@mit.edu>
References: <1217583820.12454.20.camel@kannnix.a2x.lan.at>
	<20080801121658.GG8736@mit.edu>
	<1217933631.14552.45.camel@kannnix.a2x.lan.at>
	<20080806140722.GA14109@mit.edu>
Message-ID: <1218035683.14552.61.camel@kannnix.a2x.lan.at>

On Wed, 2008-08-06 at 10:07 -0400, Theodore Tso wrote:
> On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote:
> > On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote:
> > > On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote:
> > > >
> > > > I have a problem with directories that contain more than 10000 entries
> > > > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use
> > > > readdir(3) or readdir64(3) you get one entry twice, with same name and
> > > > inode.
> > > > > > I made new tests with the code under > > on a bunch of freshly > > generated and empty filesystems, every about 38GB large, of type fat > > (aborted after about 22000 entries because it took to long), ext2, xfs, > > jfs and again ext3.... > > OK, I have a workaroud for you. It appears there's a kernel bug > hiding here, since there shouldn't be duplicates returned by readdir() > even if we have hash collisions. Thank you for your fast help and detailed explanation! Now I've something to read at home ;) Thanks! Tom From snitzer at gmail.com Wed Aug 13 21:21:20 2008 From: snitzer at gmail.com (Mike Snitzer) Date: Wed, 13 Aug 2008 17:21:20 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <20080806140722.GA14109@mit.edu> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> Message-ID: <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> On Wed, Aug 6, 2008 at 10:07 AM, Theodore Tso wrote: > On Tue, Aug 05, 2008 at 12:53:51PM +0200, Thomas Trauner wrote: >> On Fri, 2008-08-01 at 08:16 -0400, Theodore Tso wrote: >> > On Fri, Aug 01, 2008 at 11:43:40AM +0200, Thomas Trauner wrote: >> > > >> > > I have a problem with directories that contain more than 10000 entries >> > > (Ubuntu 8.04.1) or with more than 70000 entries (RHEL 5.2). If you use >> > > readdir(3) or readdir64(3) you get one entry twice, with same name and >> > > inode. >> > > >> I made new tests with the code under >> on a bunch of freshly >> generated and empty filesystems, every about 38GB large, of type fat >> (aborted after about 22000 entries because it took to long), ext2, xfs, >> jfs and again ext3.... > > OK, I have a workaroud for you. It appears there's a kernel bug > hiding here, since there shouldn't be duplicates returned by readdir() > even if we have hash collisions. Ted, The attached patch has served my employer (IBRIX) well for 2.5 years. It was only recently, when I re-raised this issue internally based on this thread, that a co-worker recalled the fix. regards, Mike -------------- next part -------------- A non-text attachment was scrubbed... Name: ext3_dx_readdir_hash_collision_fix.patch Type: text/x-patch Size: 1400 bytes Desc: not available URL: From tytso at mit.edu Thu Aug 14 02:58:21 2008 From: tytso at mit.edu (Theodore Tso) Date: Wed, 13 Aug 2008 22:58:21 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> Message-ID: <20080814025821.GA6469@mit.edu> On Wed, Aug 13, 2008 at 05:21:20PM -0400, Mike Snitzer wrote: > > The attached patch has served my employer (IBRIX) well for 2.5 years. > It was only recently, when I re-raised this issue internally based on > this thread, that a co-worker recalled the fix. > The patch looks good. Did someone raise it 2.5 years ago, and we somehow dropped the ball, or did no one think to submit the patch upstream? Also, can I get a Signed-off-by: line for this patch? Thanks!! 
- Ted From magawake at gmail.com Thu Aug 14 12:33:29 2008 From: magawake at gmail.com (Mag Gam) Date: Thu, 14 Aug 2008 08:33:29 -0400 Subject: small blocks Message-ID: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> I am trying to understand what the purpose of having small blocks per inode. I know you can cram more inodes per filesystem, but what is the downside? TIA From sandeen at redhat.com Thu Aug 14 13:49:48 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Thu, 14 Aug 2008 08:49:48 -0500 Subject: small blocks In-Reply-To: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> Message-ID: <48A437FC.6010706@redhat.com> Mag Gam wrote: > I am trying to understand what the purpose of having small blocks per > inode. I know you can cram more inodes per filesystem, the main result is that you waste less space per file, since for randomly-sized files you waste half a block(size) per file. > but what is the > downside? More overhead for management, and more importantly, I still think there is a bug lurking somewhere with block size < page size (rpm tends to hit it for some people). -Eric From tytso at mit.edu Thu Aug 14 14:52:29 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 14 Aug 2008 10:52:29 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> Message-ID: <20080814145229.GA8256@mit.edu> On Wed, Aug 13, 2008 at 05:21:20PM -0400, Mike Snitzer wrote: > > The attached patch has served my employer (IBRIX) well for 2.5 years. > It was only recently, when I re-raised this issue internally based on > this thread, that a co-worker recalled the fix. > I can confirm that this patch fixes things; I also have this patch ported to ext4. I need a Signed-off-by before I can push this to Linus, though. - Ted From snitzer at gmail.com Thu Aug 14 23:27:45 2008 From: snitzer at gmail.com (Mike Snitzer) Date: Thu, 14 Aug 2008 19:27:45 -0400 Subject: duplicate entries on ext3 when using readdir/readdir64 In-Reply-To: <20080814025821.GA6469@mit.edu> References: <1217583820.12454.20.camel@kannnix.a2x.lan.at> <20080801121658.GG8736@mit.edu> <1217933631.14552.45.camel@kannnix.a2x.lan.at> <20080806140722.GA14109@mit.edu> <170fa0d20808131421j3e4955dcra611509f1a094547@mail.gmail.com> <20080814025821.GA6469@mit.edu> Message-ID: <170fa0d20808141627i7f05fbcdnf043e090834940dd@mail.gmail.com> On Wed, Aug 13, 2008 at 10:58 PM, Theodore Tso wrote: > On Wed, Aug 13, 2008 at 05:21:20PM -0400, Mike Snitzer wrote: >> >> The attached patch has served my employer (IBRIX) well for 2.5 years. >> It was only recently, when I re-raised this issue internally based on >> this thread, that a co-worker recalled the fix. >> > > The patch looks good. Did someone raise it 2.5 years ago, and we > somehow dropped the ball, or did no one think to submit the patch > upstream? We intended to push this fix upstream but doing so got inadvertently overlooked as we put focus to new issues. I mentioned how long ago this patch was developed purely to help illustrate the stability of the fix. > Also, can I get a Signed-off-by: line for this patch? 
Eugene Dashevsky authored the patch; I refreshed it against 2.6.27-rc3: Signed-off-by: Eugene Dashevsky Signed-off-by: Mike Snitzer thanks, Mike From magawake at gmail.com Fri Aug 15 11:21:23 2008 From: magawake at gmail.com (Mag Gam) Date: Fri, 15 Aug 2008 07:21:23 -0400 Subject: small blocks In-Reply-To: <48A437FC.6010706@redhat.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> <48A437FC.6010706@redhat.com> Message-ID: <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> Hmm, I am wasting less space when I lower the ratio from 4096 to 1024 (the minimum). Why don't more people do this, since its frugal. I guess your point of more overhead, but what causes more overhead? Also, do you have the buzilla number I can investigate for this? Sorry for such a newbie question. TIA On Thu, Aug 14, 2008 at 9:49 AM, Eric Sandeen wrote: > Mag Gam wrote: >> I am trying to understand what the purpose of having small blocks per >> inode. I know you can cram more inodes per filesystem, > > the main result is that you waste less space per file, since for > randomly-sized files you waste half a block(size) per file. > >> but what is the >> downside? > > More overhead for management, and more importantly, I still think there > is a bug lurking somewhere with block size < page size (rpm tends to hit > it for some people). > > -Eric > From sandeen at redhat.com Fri Aug 15 13:53:33 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 15 Aug 2008 08:53:33 -0500 Subject: small blocks In-Reply-To: <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> <48A437FC.6010706@redhat.com> <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> Message-ID: <48A58A5D.4030309@redhat.com> Mag Gam wrote: > Hmm, I am wasting less space when I lower the ratio from 4096 to 1024 > (the minimum). Why don't more people do this, since its frugal. > I guess your point of more overhead, but what causes more overhead? > Also, do you have the buzilla number I can investigate for this? > > Sorry for such a newbie question. Now that I reread, perhaps I gave you the wrong answer anyway. Are you talking about the -i or the -b option? -Eric From magawake at gmail.com Fri Aug 15 23:07:42 2008 From: magawake at gmail.com (Mag Gam) Date: Fri, 15 Aug 2008 19:07:42 -0400 Subject: small blocks In-Reply-To: <48A58A5D.4030309@redhat.com> References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com> <48A437FC.6010706@redhat.com> <1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com> <48A58A5D.4030309@redhat.com> Message-ID: <1cbd6f830808151607u51bd366cg1148b5278a852b86@mail.gmail.com> Asking about the -i options. -i bytes-per-inode The man page states, "This value generally shouldn't be smaller than the blocksize of the filesystem, since then too many inodes will be made." So, whats the problem of having too many inodes On Fri, Aug 15, 2008 at 9:53 AM, Eric Sandeen wrote: > Mag Gam wrote: >> Hmm, I am wasting less space when I lower the ratio from 4096 to 1024 >> (the minimum). Why don't more people do this, since its frugal. >> I guess your point of more overhead, but what causes more overhead? >> Also, do you have the buzilla number I can investigate for this? >> >> Sorry for such a newbie question. > > Now that I reread, perhaps I gave you the wrong answer anyway. > > Are you talking about the -i or the -b option? 
>
> -Eric
>

From sandeen at redhat.com  Fri Aug 15 23:11:01 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 15 Aug 2008 18:11:01 -0500
Subject: small blocks
In-Reply-To: <1cbd6f830808151607u51bd366cg1148b5278a852b86@mail.gmail.com>
References: <1cbd6f830808140533j704bdc5bncbd6288dba3b5543@mail.gmail.com>
	<48A437FC.6010706@redhat.com>
	<1cbd6f830808150421l47afd774ofb104f37558c71a5@mail.gmail.com>
	<48A58A5D.4030309@redhat.com>
	<1cbd6f830808151607u51bd366cg1148b5278a852b86@mail.gmail.com>
Message-ID: <48A60D05.9030602@redhat.com>

Mag Gam wrote:
> Asking about the -i options.
>
> -i bytes-per-inode
>
> The man page states, "This value generally shouldn't be smaller than
> the blocksize of the filesystem, since then too many inodes will be
> made."
>
> So, whats the problem of having too many inodes

You waste space on unused inodes.

And the problem of not having _enough_ is, you can't make new files even
when you have lots of blocks free, and you can't change that after the
fact.  It's one of the drawbacks of not dynamically allocating inodes.

-Eric

From pegasus at nerv.eu.org  Thu Aug 21 11:07:52 2008
From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=)
Date: Thu, 21 Aug 2008 13:07:52 +0200
Subject: ext2online with 1k blocks not working
Message-ID: <20080821130752.98ce8b6b.pegasus@nerv.eu.org>

Hello,

As Virtuozzo users, we have the majority of our disk space formatted with
-i 1024 -b 1024.

Lately I discovered that on CentOS 4.6 ext2online barfs when I try to grow
such a filesystem. Running it with -v -d, it prints lots of lines like:

ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
ext2online: 873646830 is a bad size for an ext2 fs! rounding down to 873644033
...
group NNN inode table has offset 2, not 2475
...
checking for group block NNNN in Bond
found 2218 not 2474 at 3513[168]

ext2online: unable to resize /dev/cciss/c0d0p3

And the exit error code is 3.

I verified on a test system that ext2online works perfectly well in the
same situation with 4k blocks.

Any ideas?

--
Jure Pečar
http://jure.pecar.org

From tytso at mit.edu  Thu Aug 21 13:47:33 2008
From: tytso at mit.edu (Theodore Tso)
Date: Thu, 21 Aug 2008 09:47:33 -0400
Subject: ext2online with 1k blocks not working
In-Reply-To: <20080821130752.98ce8b6b.pegasus@nerv.eu.org>
References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org>
Message-ID: <20080821134733.GH16634@mit.edu>

On Thu, Aug 21, 2008 at 01:07:52PM +0200, Jure Pečar wrote:
>
> As a Virtuozzo users we have majority of our diskspace formatted with -i 1024 -b 1024.
>
> Lately I discovered that on CentOS 4.6 ext2online barfs when I try to grow such filesystem. Running it with -v -d, it prints lots of lines like:
>
> ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
> ext2online: 873646830 is a bad size for an ext2 fs! rounding down to 873644033
> ...
> group NNN inode table has offset 2, not 2475
> ...
> checking for group block NNNN in Bond
> found 2218 not 2474 at 3513[168]
>
> ext2online: unable to resize /dev/cciss/c0d0p3

Can you replicate the problem using resize2fs from e2fsprogs version
1.41.0?  Resize2fs has supported online resize for quite some time, and
I'm not sure the ext2online tool is being actively maintained at this
point.

Out of curiosity, why are you using a 1k blocksize?  Does Virtuozzo
require it?  Especially for a filesystem as big as what you are
apparently using, there will be some significant performance downsides
with using a 1k blocksize.  And the -i 1024; are you storing huge
numbers of small files?
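For a sense of scale on that last question: the inode table is a fixed
cost chosen at mke2fs time, which is the overhead Eric alludes to above.
A back-of-the-envelope sketch, assuming 128-byte on-disk inodes (the
long-standing ext3 default) and a filesystem roughly the size of the one
described above; real mke2fs figures will differ a little because of
group metadata:

/* inode_overhead.c - rough arithmetic only; not tied to any real device. */
#include <stdio.h>

int main(void)
{
	unsigned long long fs_bytes = 870ULL << 30;	/* ~870 GB filesystem */
	unsigned long long bytes_per_inode = 1024;	/* mke2fs -i 1024 */
	unsigned long long inode_size = 128;		/* assumed ext3 default */

	unsigned long long inodes = fs_bytes / bytes_per_inode;
	unsigned long long table_bytes = inodes * inode_size;

	printf("%llu inodes; inode tables use ~%llu GB (%.1f%% of the fs)\n",
	       inodes, table_bytes >> 30, 100.0 * table_bytes / fs_bytes);
	return 0;
}

With -i 1024 and 128-byte inodes, roughly one eighth of the filesystem is
committed to inode tables whether the inodes are ever used or not.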
- Ted From pegasus at nerv.eu.org Thu Aug 21 18:32:17 2008 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Thu, 21 Aug 2008 20:32:17 +0200 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821134733.GH16634@mit.edu> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> Message-ID: <20080821203217.684c2424.pegasus@nerv.eu.org> On Thu, 21 Aug 2008 09:47:33 -0400 Theodore Tso wrote: > Can you replicate the problem using resize2fs from e2fsprogs version > 1.41.0? Resize2fs has supported online resize for quite sometime, and > I'm not sure the ext2online tool is being actively maintained at this > point. Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot its name. [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 /var/log/messages show: localhost kernel: JBD: resize2fs wants too many credits (3498 > 2048) And filesystem grew from 485G to only 534G and not 800 and something G. > Out of curiosity, why are you using a 1k blocksize? Does Virtuozzo > require it? Especially for a filesystem as big what you are > apparently using, there will be some significant performance downsides > with using a 1k blocksize. And the -i 1024; are you storing huge > numbers of small files? Commercial version of Virtuozzo (unlike free OpenVZ) offers "vzfs" which adds some kind of CoW symlink on top of ext3. From the host point of view, every new virtual environment is just a bunch of symlinks pointing to an OS template. So yes, there are many files and many of them are just symlinks. We haven't met any performance issues (yet), only the upper file size limit (16GB). There's potential for unacceptably long fsck times and we're rethinking our setup to avoid that. -- Jure Pe?ar http://jure.pecar.org/ From tytso at mit.edu Thu Aug 21 19:56:53 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 21 Aug 2008 15:56:53 -0400 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821203217.684c2424.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> Message-ID: <20080821195653.GB9791@mit.edu> On Thu, Aug 21, 2008 at 08:32:17PM +0200, Jure Pe?ar wrote: > > Can you replicate the problem using resize2fs from e2fsprogs version > > 1.41.0? Resize2fs has supported online resize for quite sometime, and > > I'm not sure the ext2online tool is being actively maintained at this > > point. > > Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot > its name. > > [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 > Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. > ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 Hmm... can you send me the output of "dumpe2fs -h /dev/cciss/c0d0p3"? 
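The kernel line quoted above, "JBD: resize2fs wants too many credits
(3498 > 2048)", has a simple reading if one assumes JBD's usual policy of
capping a single transaction at one quarter of the journal length (an
assumption about the kernel, not something stated in this thread): a
2048-credit cap implies a journal of 8192 blocks, which at a 1k blocksize
is only 8 MB, while adding one block group during this resize needed 3498
credits. A sketch of the arithmetic:

/* journal_credit_math.c - illustrative arithmetic only. */
#include <stdio.h>

int main(void)
{
	unsigned int blocksize = 1024;		/* the -b 1024 filesystem above */
	unsigned int journal_blocks = 8192;	/* implied by the 2048-credit cap */
	unsigned int max_credits = journal_blocks / 4;	/* assumed JBD cap */
	unsigned int wanted = 3498;		/* from the quoted kernel message */

	printf("journal: %u blocks (%u MB), per-transaction cap: %u credits\n",
	       journal_blocks, journal_blocks * blocksize >> 20, max_credits);
	printf("resize2fs asked for %u credits, so the group add is refused\n",
	       wanted);
	return 0;
}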
- Ted From adilger at sun.com Sat Aug 23 11:32:05 2008 From: adilger at sun.com (Andreas Dilger) Date: Sat, 23 Aug 2008 05:32:05 -0600 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821203217.684c2424.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> Message-ID: <20080823113205.GO3392@webber.adilger.int> On Aug 21, 2008 20:32 +0200, Jure Pe?ar wrote: > On Thu, 21 Aug 2008 09:47:33 -0400 > Theodore Tso wrote: > > Can you replicate the problem using resize2fs from e2fsprogs version > > 1.41.0? Resize2fs has supported online resize for quite sometime, and > > I'm not sure the ext2online tool is being actively maintained at this > > point. > > Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot > its name. > > [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 > Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. > ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 > > /var/log/messages show: > localhost kernel: JBD: resize2fs wants too many credits (3498 > 2048) > > And filesystem grew from 485G to only 534G and not 800 and something G. How big is your journal? It seems it is only 8MB, which isn't large enough to a resize 870GB filesystem. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From sandeen at redhat.com Sat Aug 23 15:36:07 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Sat, 23 Aug 2008 10:36:07 -0500 Subject: ext2online with 1k blocks not working In-Reply-To: <20080821203217.684c2424.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> Message-ID: <48B02E67.9010804@redhat.com> Jure Pe?ar wrote: > On Thu, 21 Aug 2008 09:47:33 -0400 > Theodore Tso wrote: > >> Can you replicate the problem using resize2fs from e2fsprogs version >> 1.41.0? Resize2fs has supported online resize for quite sometime, and >> I'm not sure the ext2online tool is being actively maintained at this >> point. > > Ah yes, resize2fs ... I knew there's another tool for resizing, just forgot > its name. > > [root at localhost resize]# ./resize2fs /dev/cciss/c0d0p3 > Performing an on-line resize of /dev/cciss/c0d0p3 to 873646828 (1k) blocks. > ./resize2fs: Inappropriate ioctl for device While trying to add group #78125 > > /var/log/messages show: > localhost kernel: JBD: resize2fs wants too many credits (3498 > 2048) > > And filesystem grew from 485G to only 534G and not 800 and something G. You didn't say exactly which kernel version this is, but this might be fixed in newer RHEL (er, CentOS) kernels: * Fri Mar 28 2008 Vivek Goyal [2.6.9-68.28] ... -ext3: lighten up resize transaction requirements (Eric Sandeen) [166038] Although usually I got -ENOSPC back to userspace .. -Eric From pegasus at nerv.eu.org Sun Aug 24 15:31:43 2008 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Sun, 24 Aug 2008 17:31:43 +0200 Subject: ext2online with 1k blocks not working In-Reply-To: <20080823113205.GO3392@webber.adilger.int> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> <20080823113205.GO3392@webber.adilger.int> Message-ID: <20080824173143.d7f4c79d.pegasus@nerv.eu.org> On Sat, 23 Aug 2008 05:32:05 -0600 Andreas Dilger wrote: > How big is your journal? 
It seems it is only 8MB, which isn't large > enough to a resize 870GB filesystem. > > Cheers, Andreas Yes, that's the conclusion Ted came up with. Still, offline resizing works, so I'll just have to schedule more downtime for the resize to finish. -- Jure Pe?ar http://jure.pecar.org/ From pegasus at nerv.eu.org Sun Aug 24 15:34:01 2008 From: pegasus at nerv.eu.org (Jure =?UTF-8?B?UGXEjWFy?=) Date: Sun, 24 Aug 2008 17:34:01 +0200 Subject: ext2online with 1k blocks not working In-Reply-To: <48B02E67.9010804@redhat.com> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> <48B02E67.9010804@redhat.com> Message-ID: <20080824173401.f2cc4909.pegasus@nerv.eu.org> On Sat, 23 Aug 2008 10:36:07 -0500 Eric Sandeen wrote: > You didn't say exactly which kernel version this is, but this might be > fixed in newer RHEL (er, CentOS) kernels: > > * Fri Mar 28 2008 Vivek Goyal [2.6.9-68.28] > ... > -ext3: lighten up resize transaction requirements (Eric Sandeen) [166038] > > Although usually I got -ENOSPC back to userspace .. 2.6.9-67.0.22.ELsmp ... almost there ;) Thanks for info, but it wouldn't make any difference for us, since we're limited with virtuozzo kernels (which are based on rhel kernels). -- Jure Pe?ar http://jure.pecar.org/ From adilger at sun.com Sun Aug 24 23:37:10 2008 From: adilger at sun.com (Andreas Dilger) Date: Sun, 24 Aug 2008 17:37:10 -0600 Subject: ext2online with 1k blocks not working In-Reply-To: <20080824173143.d7f4c79d.pegasus@nerv.eu.org> References: <20080821130752.98ce8b6b.pegasus@nerv.eu.org> <20080821134733.GH16634@mit.edu> <20080821203217.684c2424.pegasus@nerv.eu.org> <20080823113205.GO3392@webber.adilger.int> <20080824173143.d7f4c79d.pegasus@nerv.eu.org> Message-ID: <20080824233710.GO3392@webber.adilger.int> On Aug 24, 2008 17:31 +0200, Jure Pe?ar wrote: > On Sat, 23 Aug 2008 05:32:05 -0600 > Andreas Dilger wrote: > > How big is your journal? It seems it is only 8MB, which isn't large > > enough to a resize 870GB filesystem. > > Yes, that's the conclusion Ted came up with. > > Still, offline resizing works, so I'll just have to schedule more downtime > for the resize to finish. You may also consider resizing your journal while it is offline: tune2fs -O ^has_journal $dev {maybe e2fsck -f needed here} tune2fs -j $dev should create a journal with at least 32MB. You can check with: debugfs -c -R "stat <8>" $dev Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From ross at biostat.ucsf.edu Mon Aug 25 18:40:06 2008 From: ross at biostat.ucsf.edu (Ross Boylan) Date: Mon, 25 Aug 2008 11:40:06 -0700 Subject: Problem in HTREE directory node Message-ID: <1219689606.12088.50.camel@corn.betterworld.us> Short version: fsck said "invalid HTREE directory inode 635113 (mail/r/user/ross/comp/admin-wheat) clear HTREE index?" To which I replied Yes. What exactly does this mean was corrupted? In particular, does it mean the list of files in the directory .../comp/admin-wheat was damaged? Or is the trouble in the comp directory? Is fsck likely to have fixed up things as good as new, or might something be lost or corrupted? I don't know what clearing the HTREE index does. How can I check if things are OK? I have backups. Longer version: After an ugly but should have been clean shutdown I got reports that most of my partitions were unclean. There were a lot of logs replayed. The partitions are almost all LVM volumes. 
My mail spool is ext3, and fsck showed

"Problem in HTREE directory inode 635112 node (627) not referenced.
Problem in HTREE directory inode 635112 node (628) has invalid depths
Problem in HTRE
Fsck died with exit status 4."

The message continued with info on a possible log (which wasn't
there--maybe because I use an initrd?) and the need for a manual check.
There were too many messages, about consecutively numbered nodes, to see
them all (always in pairs as above). 628 was the last.

Manual fsck tells me I can't use auto mode. A full manual run gives
"invalid HTREE directory inode 635113 (mail/r/user/ross/comp/admin-wheat)
clear HTREE index?"

I said Yes, and the run completed. Reboot. Still "FS not clean" messages
for most, and "cyrspool primary superblock features different from backup,
check forced." Finally it starts.

I'm running a Linux 2.6.25 kernel on a P4; that particular partition was
on a SATA disk. I recently added another SATA disk and added it to the
volume group that included my mail spool. I have some IDE disks too.

When LVM starts up it gives the error
Parse error at byte 3306 (line 253): unexpected token
9 times. I think it's been doing this for a long time. It seems to
discover and activate all the volume groups.

So, my main questions are up above ("short version"). I also wonder why,
even after my manual fsck, I got the error about the primary superblock
features differing.

My more general, and probably harder, question is how things could have
gotten into this state.

Thanks for any insight.

Ross Boylan