From Vince.McIntyre at atnf.csiro.au  Tue Nov  1 01:48:10 2005
From: Vince.McIntyre at atnf.csiro.au (Vincent McIntyre)
Date: Tue, 1 Nov 2005 12:48:10 +1100 (EST)
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <20051031220648.GC31368@schatzie.adilger.int>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
Message-ID: <Pine.LNX.4.62.0511011247390.1768@bedlam.atnf.CSIRO.AU>

thanks for your response, Andreas.

> It sounds like you have overflowed the end of the 2TB device limit and
> clobbered the beginning of your filesystem.  This can happen if the
> SCSI driver, kernel, or even ext3 isn't handling offsets > 2^31 properly.
> I know RH has only recently started supporting ext3 filesystems > 2TB,
> and it isn't clear that all drivers handle this properly yet.

This box is using the fusion mpt drivers as in 2.6.7 - mptbase,mptscsih
etc. Do you recall any >2Tb issue being fixed in later kernels?

When the machine was last in a good state, the filesystem had 1.5Tbyte
used, ie as far as I can tell nothing would have written past 2Tb,
although I suppose there is no guarantee the space is used up in order
of increasing offset.

The filesystem was exported over NFS, and was being written to by
client machines. It is using NFSv3 (nfs-kernel-server 1.0-2woody3).
Worked great for several months.

> Please update your e2fsprogs to the latest.  You also need to use
> "e2fsck -b 32768" (or multiple thereof) for such large filesystems.
> I think newer e2fsprogs will print this message properly in that case.
>
I downloaded 1.38 from sourceforge and built it. No change in behaviour.
I tried e2fsck with block offsets from 1025 to 4194305 in steps of 1024.
I also tried dumpe2fs with the same range of offsets, also nothing.
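
A quick arithmetic check on that scan, as a sketch rather than a diagnosis: offsets of the form 1025 + 1024*k include the 1k-blocksize backup at 8193, but they can never land on 16384 or 32768 or on any multiple of 32768. The 4 KiB block size below is an assumption; the thread does not state the block size of the damaged filesystem.

```python
# Offsets passed to "e2fsck -b" in the scan described above (inclusive range).
scanned = range(1025, 4194305 + 1, 1024)

# First backup superblock per block size, as listed in the e2fsck(8) manpage.
manpage_backup = {1024: 8193, 2048: 16384, 4096: 32768}

# Which of the manpage values does the scan actually reach?
covered = {bs: blk in scanned for bs, blk in manpage_backup.items()}

# ASSUMPTION: 4 KiB blocks, so every backup sits at a multiple of 32768.
multiples_hit = [b for b in range(32768, 4194305 + 1, 32768) if b in scanned]

for bs, hit in covered.items():
    print(f"blocksize {bs}: first backup {manpage_backup[bs]} scanned: {hit}")
print("multiples of 32768 reached by the scan:", multiples_hit)
```

If the filesystem does use 4 KiB blocks, a scan in steps of 32768 starting at 32768 would be the one to try.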

I've attached an strace of dumpe2fs, perhaps it is helpful?


Another question. The e2fsck(8) manpage says the superblocks are at -
   Blocksize     -b
   1k          8193
   2k         16384
   4k         32768
Why is the superblock offset for 1k at 8193, not 8192?
Is that an error in the manpage?
Or should it be that the 2k, 4k block offsets should be odd,
ie 16385, 32769? This article suggests the latter -
    http://www2.linuxjournal.com/article/0193

-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.dumpe2fs.gz
Type: application/octet-stream
Size: 1027 bytes
Desc: 
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20051101/2379eab4/attachment.obj>

From tytso at mit.edu  Tue Nov  1 04:46:58 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 31 Oct 2005 23:46:58 -0500
Subject: What is the history of CONFIG_EXT{2,3}_CHECK?
In-Reply-To: <20051031212503.GY31368@schatzie.adilger.int>
References: <20051031001334.GP4180@stusta.de>
	<20051031212503.GY31368@schatzie.adilger.int>
Message-ID: <20051101044658.GA7500@thunk.org>

On Mon, Oct 31, 2005 at 02:25:03PM -0700, Andreas Dilger wrote:
> On Oct 31, 2005  01:13 +0100, Adrian Bunk wrote:
> > Can anyone tell me the history of CONFIG_EXT{2,3}_CHECK?
> > 
> > There is code for a "check" option for mount if these options are 
> > enabled, but there's no way to enable them.
> 
> These are expensive debugging options, which walk the inode/block bitmaps
> for getting the group inode/block usage instead of using the group
> summary data.  Not used very often but I suspect occasionally useful for
> developers mucking with ext[23] internals.  Since it is developer-only
> code it needs to be enabled with #define CONFIG_EXT[23]_CHECK in a
> header or compile option.

It's basically a stripped down version of e2fsck pass #5, though.  Is
there any reason why this needs to be in the kernel?  If it would be
useful I could easily make a userspace implementation of these checks.

						- Ted


From adilger at clusterfs.com  Tue Nov  1 06:08:32 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 31 Oct 2005 23:08:32 -0700
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
Message-ID: <20051101060832.GK31368@schatzie.adilger.int>

On Nov 01, 2005  12:45 +1100, Vincent.McIntyre at csiro.au wrote:
> >It sounds like you have overflowed the end of the 2TB device limit and
> >clobbered the beginning of your filesystem.  This can happen if the
> >SCSI driver, kernel, or even ext3 isn't handling offsets > 2^31 properly.
> >I know RH has only recently started supporting ext3 filesystems > 2TB,
> >and it isn't clear that all drivers handle this properly yet.
> 
> This box is using the fusion mpt drivers as in 2.6.7 - mptbase,mptscsih
> etc. Do you recall any >2Tb issue being fixed in later kernels?

Sorry, I don't know, I've just heard of occasional problems in this area
and very few people reporting success.

> When the machine was last in a good state, the filesystem had 1.5Tbyte
> used, ie as far as I can tell nothing would have written past 2Tb,
> although I suppose there is no guarantee the space is used up in order
> of increasing offset.

No, it is "kind of" used in increasing offset, but not strictly so.

> >Please update your e2fsprogs to the latest.  You also need to use
> >"e2fsck -b 32768" (or multiple thereof) for such large filesystems.
> >I think newer e2fsprogs will print this message properly in that case.

You might also need to add "-B 4096".

> I downloaded 1.38 from sourceforge and built it. No change in behaviour.
> I tried e2fsck with block offsets from 1025 to 4194305 in steps of 1024.
> I also tried dumpe2fs with the same range of offsets, also nothing.
> 
> 
> Another question. The e2fsck(8) manpage says the superblocks are at -
>  Blocksize     -b
>  1k          8193
>  2k         16384
>  4k         32768
> Why is the superblock offset for 1k at 8193, not 8192?

Because the ext[23] superblock is at 1024 bytes offset from the
beginning of the device.  For 1kB blocksize this is a whole block
so the filesystem starts at block 1, while for larger blocksize
this is still in block 0.  Backup superblocks are at block offsets:

(blocksize * 8) * {3,5,7}^n, n={0,1,2,3...}
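
The rule above can be sketched in a few lines. This is an illustration under stated assumptions, not e2fsprogs code: it assumes the sparse layout (backups only in group 1 and in groups that are powers of 3, 5 and 7), blocksize * 8 blocks per group, and a first data block of 1 only for 1 kB blocks. The function name and the one-million-block example size are made up for the example.

```python
def backup_superblock_blocks(block_size, total_blocks):
    """Block numbers of backup superblocks (groups 1, 3^n, 5^n, 7^n)."""
    blocks_per_group = block_size * 8        # one bitmap block maps block_size*8 blocks
    first_data_block = 1 if block_size == 1024 else 0
    data_blocks = total_blocks - first_data_block
    n_groups = -(-data_blocks // blocks_per_group)   # ceiling division
    groups = {1}
    for base in (3, 5, 7):
        g = base
        while g < n_groups:
            groups.add(g)
            g *= base
    return [g * blocks_per_group + first_data_block
            for g in sorted(groups) if g < n_groups]


# Hypothetical filesystem of one million blocks, for each block size.
for bs in (1024, 2048, 4096):
    print(bs, backup_superblock_blocks(bs, 1_000_000)[:4])
```

The first value for each block size reproduces the manpage table (8193, 16384, 32768); the extra 1 for 1 kB blocks is the first data block.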

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From Vince.McIntyre at atnf.csiro.au  Tue Nov  1 13:38:45 2005
From: Vince.McIntyre at atnf.csiro.au (Vincent McIntyre)
Date: Wed, 2 Nov 2005 00:38:45 +1100 (EST)
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <Pine.LNX.4.62.0511012320040.9437@bedlam.atnf.CSIRO.AU>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051101060832.GK31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511012320040.9437@bedlam.atnf.CSIRO.AU>
Message-ID: <Pine.LNX.4.62.0511020038210.9437@bedlam.atnf.CSIRO.AU>


>>> Please update your e2fsprogs to the latest.  You also need to use
>>> "e2fsck -b 32768" (or multiple thereof) for such large filesystems.
>>> I think newer e2fsprogs will print this message properly in that case.
>
> You might also need to add "-B 4096".

I gave that a try as well (and -B 8192), with the same results.

I tried to make a copy of the first part of the filesystem with dd;

    # dd if=/dev/sdb1 of=/tmp/sdb1.dd bs=1 count=16384 \
        conv=noerror,sync,notrunc

This returned a file supposedly 16384 bytes long, but it didn't make
much sense - looking at it with 'od' or 'hexdump' I get only 17 lines
of output, not the roughly 178 I get for the same exercise with a good
ext3 filesystem. (The /tmp filesystem has 128-byte inodes.)

The output appears to be just the EFI GPT partition label.

I'm starting to suspect something in the raid device is in a strange
state. Or that the whole filesystem has just totally disappeared. :(

A bit more digging in the logs found this, from the first boot when
power was reapplied
   sdb : very big device. try to use READ CAPACITY(16).
   kernel: SCSI device sdb: 4688461824 512-byte hdwr sectors (2400492 MB)
   kernel: SCSI device sdb: drive cache: write back
   kernel:  /dev/scsi/host2/bus0/target0/lun0: p1
   kernel: Attached scsi disk sdb at scsi2, channel 0, id 0, lun 0
so far so good - and then (eek)
   kernel: VFS: Can't find ext3 filesystem on dev sdb1.
when kjournald attempts to take a peek at the journal.
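
For reference, the numbers in that log line can be checked by hand. The sketch below only redoes the arithmetic; the one outside fact it relies on is that READ CAPACITY(10) reports the last sector number in a 32-bit field, which is why the kernel falls back to READ CAPACITY(16) here.

```python
SECTOR_BYTES = 512
sectors = 4688461824                 # from the kernel log above

size_bytes = sectors * SECTOR_BYTES
size_mb = size_bytes // 10**6        # decimal megabytes, as the kernel prints

# READ CAPACITY(10) returns the last LBA in 32 bits.
needs_read_capacity_16 = (sectors - 1) > 0xFFFFFFFF

# How far the device extends past the 2^32-sector (2 TiB) boundary.
sectors_beyond = sectors - 2**32
bytes_beyond = sectors_beyond * SECTOR_BYTES

print(size_mb, needs_read_capacity_16, sectors_beyond, bytes_beyond)
```

So about 201 GB of the device lies beyond the point where a 32-bit sector number would wrap.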


>> I downloaded 1.38 from sourceforge and built it. No change in behaviour.
>> I tried e2fsck with block offsets from 1025 to 4194305 in steps of 1024.
>> I also tried dumpe2fs with the same range of offsets, also nothing.
>>
>>
>> Another question. The e2fsck(8) manpage says the superblocks are at -
>>  Blocksize     -b
>>  1k          8193
>>  2k         16384
>>  4k         32768
>> Why is the superblock offset for 1k at 8193, not 8192?
>
> Because the ext[23] superblock is at 1024 bytes offset from the
> beginning of the device.  For 1kB blocksize this is a whole block
> so the filesystem starts at block 1, while for larger blocksize
> this is still in block 0.  Backup superblocks are at block offsets:
>
> (blocksize * 8) * {3,5,7}^n, n={0,1,2,3...}

I'm starting to get this, thanks for your patience.
I tried all the feasible values of -b less than 2147483647,
as I mention above. I did not try larger block sizes than 8192.

I since found these links which fill out the picture a bit more.
    http://web.mit.edu/tytso/www/linux/ext2intro.html
    http://homepage.smc.edu/morgan_david/cs40/analyze-ext2.htm
    http://uranus.it.swin.edu.au/~jn/explore2fs/es2fs.htm
    http://www.unixwiz.net/techtips/recovering-ext2.html
    http://nepto.atomicpile.sk/mix/articles/ext2-superblock/ext2-superblock-notes.txt


Any further thoughts appreciated.

Cheers
Vince


From bloch at verdurin.com  Tue Nov  1 15:59:31 2005
From: bloch at verdurin.com (bloch at verdurin.com)
Date: Tue, 1 Nov 2005 15:59:31 +0000
Subject: Recover original superblock on corrupted filesystem?
In-Reply-To: <20051025220521.GB17476@bloch.smith.man.ac.uk>
References: <20051021145114.GA432@bloch.smith.man.ac.uk>
	<1130265842.4965.21.camel@orbit.scot.redhat.com>
	<20051025220521.GB17476@bloch.smith.man.ac.uk>
Message-ID: <20051101155931.GA1256@bloch.smith.man.ac.uk>

On Tue, 25 Oct 2005, bloch at verdurin.com wrote:

> On Tue, 25 Oct 2005, Stephen C. Tweedie wrote:
> 
> > Hi,
> > 
> > On Fri, 2005-10-21 at 15:51 +0100, bloch at verdurin.com wrote:
> > 
> > > It appears the original superblock is corrupted too, as it has an inode
> > > count of 0.  When I start fsck with -b 32760, it uses the alternate
> > > superblock and proceeds.  However, it restarts from the beginning a
> > > couple of times and after the second restart it doesn't use the
> > > alternate superblock, stopping instead as it can't find the original
> > > one.
> > 
> > Do you have a log of the fsck output, and which e2fsprogs version is
> > this?  Sounds like it may be an e2fsck bug if we don't honour the backup
> > superblock flag on subsequent passes.
> > 
> 
> I do have a log, yes.  It's rather large...
> 
> It's version 1.38
> 
> > > Is there a way around this, such as using one of the alternate
> > > superblocks to replace the broken one
> > 
> > Yes, "dd" of the appropriate block should work... but do this with
> > extreme care, as getting it slightly wrong will cause major havoc.
> > 
> > "debugfs" may be a better bet.  
> > 
> > 	# debugfs -w -b$BLOCKSIZE -s$SUPERBLOCK /dev/$DEV
> > 
> > will tell debugfs to read the specified superblock.  If you dirty the
> > superblock (eg. with the "dirty" command) then quit, it will write back
> > the backup superblock to the home location too.
> > 
> 

As an update to this, the problem seems to have re-occurred.  Here are
the relevant error messages:

EXT3-fs error (device sdb1): ext3_new_block: Allocating block in system zone - block = 41484288
Aborting journal on device sdb1.
EXT3-fs error (device sdb1) in ext3_new_block: Journal has aborted
ext3_abort called.
EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
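
One thing that can be worked out from the message alone is where block 41484288 sits. The sketch below assumes 4 kB blocks (32768 blocks per group, first data block 0) and the classic group layout; neither is stated in this thread, and the helper names are made up for illustration.

```python
def locate_block(block, block_size=4096):
    """Return (group number, offset within group) for a block number."""
    blocks_per_group = block_size * 8
    first_data_block = 1 if block_size == 1024 else 0
    return divmod(block - first_data_block, blocks_per_group)


def is_power_of(n, base):
    if n < 1:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def has_backup_superblock(group):
    """Sparse layout: backups in groups 0, 1 and powers of 3, 5, 7."""
    return group in (0, 1) or any(is_power_of(group, b) for b in (3, 5, 7))


group, offset = locate_block(41484288)
print(group, offset, has_backup_superblock(group))
```

Under those assumptions the block is the very first block of group 1266, a group with no superblock backup, where one would normally expect the block bitmap. That would be consistent with the allocator being handed a metadata block, though it says nothing about why.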

Is there anything you can suggest to look at before I run fsck on this?

Thanks,
Adam


From adilger at clusterfs.com  Tue Nov  1 18:09:12 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 1 Nov 2005 11:09:12 -0700
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <Pine.LNX.4.62.0511012320040.9437@bedlam.atnf.CSIRO.AU>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051101060832.GK31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511012320040.9437@bedlam.atnf.CSIRO.AU>
Message-ID: <20051101180912.GN31368@schatzie.adilger.int>

On Nov 02, 2005  00:37 +1100, Vincent.McIntyre at csiro.au wrote:
> I tried to make a copy of the first part of the filesystem with dd;
> 
>   # dd if=/dev/sdb1 of=/tmp/sdb1.dd bs=1 count=16384 \
>       conv=noerror,sync,notrunc
> 
> This returned a file supposedly 16384 bytes long, but it didn't make
> much sense - looking at it with 'od' or 'hexdump' I get only 17 lines
> of output, not the roughly 178 I get for the same exercise with a good
> ext3 filesystem. (The /tmp filesystem has 128-byte inodes.)

"od" will compress lines that are identical (usually all-zero) as "*".
If you want all the output, use -v.

> The output appears to be just the EFI GPT partition label.

The EFI GPT label can be restored from the backup (which is located
at the end of the device) so that might have happened.

> I'm starting to suspect something in the raid device is in a strange
> state. Or that the whole filesystem has just totally disappeared. :(

od -Ax -tx4 /dev/sdb1 | grep "^[0-9a-f]*30 [0-9a-f]* [0-9a-f]* 000[1-3]ef53 "

should locate the ext2 superblock magic number(s) eventually.  There is
also a utility in e2fsprogs source (misc/findsuper) that is not installed
that you could build that does this more efficiently.
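
For anyone without the e2fsprogs source at hand, the same search can be sketched in Python. This is not findsuper, just a minimal stand-in: it steps through the device in 512-byte increments and looks for the 0xEF53 magic at byte 56 of a candidate superblock. The field offsets (blocks count at 4, log block size at 24, magic at 56) are the standard ext2 superblock layout; the function name and the in-memory demo image are made up for illustration.

```python
import io
import struct

EXT2_MAGIC = 0xEF53


def find_superblocks(stream, step=512):
    """Yield (offset, blocks_count, block_size) for each candidate superblock."""
    offset = 0
    while True:
        stream.seek(offset)
        head = stream.read(60)              # enough to cover s_magic and s_state
        if len(head) < 60:
            break
        magic = struct.unpack_from("<H", head, 56)[0]
        if magic == EXT2_MAGIC:
            blocks_count = struct.unpack_from("<I", head, 4)[0]
            log_block_size = struct.unpack_from("<I", head, 24)[0]
            if log_block_size < 8:          # skip implausible block sizes
                yield offset, blocks_count, 1024 << log_block_size
        offset += step


# Demo on a synthetic 4 KiB image with one fake superblock at byte 1024.
image = bytearray(4096)
struct.pack_into("<I", image, 1024 + 4, 1000)        # s_blocks_count
struct.pack_into("<I", image, 1024 + 24, 2)          # s_log_block_size (4096)
struct.pack_into("<H", image, 1024 + 56, EXT2_MAGIC)  # s_magic
hits = list(find_superblocks(io.BytesIO(image)))
print(hits)
```

On a real device one would open it read-only in binary mode, e.g. open("/dev/sdb1", "rb"), and expect some false positives, since any data that happens to contain the magic at the right offset will match.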

If those don't appear anywhere, then something dramatically bad has
happened to your filesystem.  Aliasing would only damage at most (if
you did "dd if=/dev/zero" into a file at the end of the filesystem)
the first 300GB of your device, and there _should_ be a backup super
somewhere beyond that (haven't done math to confirm).
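
A rough version of that math, for what it is worth. The sketch assumes the aliasing comes from sector numbers being truncated to 32 bits, uses the whole-device sector count from the kernel log quoted earlier in the thread (the partition is slightly smaller), and assumes 4 kB blocks with backups in group 1 and groups that are powers of 3, 5 and 7.

```python
SECTOR = 512
device_sectors = 4688461824     # whole device, from the earlier kernel log
wrap_sectors = 2**32            # ASSUMPTION: sector numbers truncated to 32 bits
block_size = 4096               # ASSUMPTION: not stated in the thread
blocks_per_group = block_size * 8
group_bytes = blocks_per_group * block_size          # 128 MiB per group

# Writes past the wrap point would land in this many bytes at the start.
aliased_bytes = (device_sectors - wrap_sectors) * SECTOR

total_blocks = device_sectors * SECTOR // block_size
total_groups = -(-total_blocks // blocks_per_group)  # ceiling division
first_safe_group = -(-aliased_bytes // group_bytes)  # first group fully past the damage

backup_groups = {1}
for base in (3, 5, 7):
    g = base
    while g < total_groups:
        backup_groups.add(g)
        g *= base

safe_backups = sorted(g for g in backup_groups if g >= first_safe_group)
print(aliased_bytes, total_groups, first_safe_group, safe_backups)
```

Under these assumptions the exposed region comes out at roughly 201 GB, inside the 300GB bound above, and six backup groups lie beyond it, the nearest at group 2187 (block 2187 * 32768).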

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From bloch at verdurin.com  Wed Nov  2 13:09:57 2005
From: bloch at verdurin.com (bloch at verdurin.com)
Date: Wed, 2 Nov 2005 13:09:57 +0000
Subject: Recover original superblock on corrupted filesystem?
In-Reply-To: <20051101155931.GA1256@bloch.smith.man.ac.uk>
References: <20051021145114.GA432@bloch.smith.man.ac.uk>
	<1130265842.4965.21.camel@orbit.scot.redhat.com>
	<20051025220521.GB17476@bloch.smith.man.ac.uk>
	<20051101155931.GA1256@bloch.smith.man.ac.uk>
Message-ID: <20051102130956.GA16564@bloch.smith.man.ac.uk>

On Tue, 01 Nov 2005, bloch at verdurin.com wrote:

> 
> As an update to this, the problem seems to have re-occurred.  Here are
> the relevant error messages:
> 
> EXT3-fs error (device sdb1): ext3_new_block: Allocating block in system zone - block = 41484288
> Aborting journal on device sdb1.
> EXT3-fs error (device sdb1) in ext3_new_block: Journal has aborted
> ext3_abort called.
> EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
> EXT3-fs error (device sdb1) in ext3_reserve_inode_write: Journal has aborted
> __journal_remove_journal_head: freeing b_committed_data
> 

Another update - exactly the same problem has occurred on an identical
machine.  The disks are on a Megaraid RAID1 array.

Two other machines which only differ from the problem ones in that they
have 4G RAM instead of 8G have not shown any such symptoms.


Adam


From kent at cpttm.org.mo  Thu Nov  3 01:35:04 2005
From: kent at cpttm.org.mo (Kent Tong)
Date: Thu, 3 Nov 2005 01:35:04 +0000 (UTC)
Subject: filesystem remounted as read only
Message-ID: <loom.20051103T023009-740@post.gmane.org>

Hi,

I'm running kernel 2.6.8-15, lvm2 v2.01.04-5 and acl v2.2.23-1 on a
Sunblade 100 (sparc). Over the past few months an ext3 filesystem has
been remounted read-only several times (this is due to the option
"errors=remount-ro" in /etc/fstab). Sometimes there is no error in the
log files, but sometimes we see:

kernel: init_special_inode: bogus i_mode (3016)
kernel: init_special_inode: bogus i_mode (3125)
kernel: init_special_inode: bogus i_mode (3144)
kernel: init_special_inode: bogus i_mode (3231)
kernel: init_special_inode: bogus i_mode (3423)
kernel: init_special_inode: bogus i_mode (3452)

In the former case (no error in the logs), then running fsck will find 
no error. In the latter case, it may find some errors and fix them.

I've run smartmontools to check the disks but no errors are found.

I've run "fsck -c" to look up bad blocks but nothing is found.

What else can I do to troubleshoot the problem? In particular, the
strangest part is that when it is remounted read-only, there is
sometimes no error in the logs. Could remounting read-only prevent it
from writing to the logs?

Thanks!


From Vincent.McIntyre at csiro.au  Tue Nov  1 01:45:27 2005
From: Vincent.McIntyre at csiro.au (Vincent.McIntyre at csiro.au)
Date: Tue, 1 Nov 2005 12:45:27 +1100 (EST)
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <20051031220648.GC31368@schatzie.adilger.int>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
Message-ID: <Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>

thanks for your response, Andreas.

> It sounds like you have overflowed the end of the 2TB device limit and
> clobbered the beginning of your filesystem.  This can happen if the
> SCSI driver, kernel, or even ext3 isn't handling offsets > 2^31 properly.
> I know RH has only recently started supporting ext3 filesystems > 2TB,
> and it isn't clear that all drivers handle this properly yet.

This box is using the fusion mpt drivers as in 2.6.7 - mptbase,mptscsih
etc. Do you recall any >2Tb issue being fixed in later kernels?

When the machine was last in a good state, the filesystem had 1.5Tbyte
used, ie as far as I can tell nothing would have written past 2Tb,
although I suppose there is no guarantee the space is used up in order
of increasing offset.

The filesystem was exported over NFS, and was being written to by
client machines. It is using NFSv3 (nfs-kernel-server 1.0-2woody3).
Worked great for several months.

> Please update your e2fsprogs to the latest.  You also need to use
> "e2fsck -b 32768" (or multiple thereof) for such large filesystems.
> I think newer e2fsprogs will print this message properly in that case.
>
I downloaded 1.38 from sourceforge and built it. No change in behaviour.
I tried e2fsck with block offsets from 1025 to 4194305 in steps of 1024.
I also tried dumpe2fs with the same range of offsets, also nothing.

I've attached an strace of dumpe2fs, perhaps it is helpful?


Another question. The e2fsck(8) manpage says the superblocks are at -
  Blocksize     -b
  1k          8193
  2k         16384
  4k         32768
Why is the superblock offset for 1k at 8193, not 8192?
Is that an error in the manpage?
Or should it be that the 2k, 4k block offsets should be odd,
ie 16385, 32769? This article suggests the latter -
   http://www2.linuxjournal.com/article/0193
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.dumpe2fs.gz
Type: application/octet-stream
Size: 1027 bytes
Desc: 
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20051101/6923e9bb/attachment.obj>

From Vincent.McIntyre at csiro.au  Tue Nov  1 13:37:06 2005
From: Vincent.McIntyre at csiro.au (Vincent.McIntyre at csiro.au)
Date: Wed, 2 Nov 2005 00:37:06 +1100 (EST)
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <20051101060832.GK31368@schatzie.adilger.int>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051101060832.GK31368@schatzie.adilger.int>
Message-ID: <Pine.LNX.4.62.0511012320040.9437@bedlam.atnf.CSIRO.AU>

>>> Please update your e2fsprogs to the latest.  You also need to use
>>> "e2fsck -b 32768" (or multiple thereof) for such large filesystems.
>>> I think newer e2fsprogs will print this message properly in that case.
>
> You might also need to add "-B 4096".

I gave that a try as well (and -B 8192), with the same results.

I tried to make a copy of the first part of the filesystem with dd;

   # dd if=/dev/sdb1 of=/tmp/sdb1.dd bs=1 count=16384 \
       conv=noerror,sync,notrunc

This returned a file supposedly 16384 bytes long, but it didn't make
much sense - looking at it with 'od' or 'hexdump' I get only 17 lines
of output, not the roughly 178 I get for the same exercise with a good
ext3 filesystem. (The /tmp filesystem has 128-byte inodes.)

The output appears to be just the EFI GPT partition label.

I'm starting to suspect something in the raid device is in a strange
state. Or that the whole filesystem has just totally disappeared. :(

A bit more digging in the logs found this, from the first boot when
power was reapplied
  sdb : very big device. try to use READ CAPACITY(16).
  kernel: SCSI device sdb: 4688461824 512-byte hdwr sectors (2400492 MB)
  kernel: SCSI device sdb: drive cache: write back
  kernel:  /dev/scsi/host2/bus0/target0/lun0: p1
  kernel: Attached scsi disk sdb at scsi2, channel 0, id 0, lun 0
so far so good - and then (eek)
  kernel: VFS: Can't find ext3 filesystem on dev sdb1.
when kjournald attempts to take a peek at the journal.


>> I downloaded 1.38 from sourceforge and built it. No change in behaviour.
>> I tried e2fsck with block offsets from 1025 to 4194305 in steps of 1024.
>> I also tried dumpe2fs with the same range of offsets, also nothing.
>>
>>
>> Another question. The e2fsck(8) manpage says the superblocks are at -
>>  Blocksize     -b
>>  1k          8193
>>  2k         16384
>>  4k         32768
>> Why is the superblock offset for 1k at 8193, not 8192?
>
> Because the ext[23] superblock is at 1024 bytes offset from the
> beginning of the device.  For 1kB blocksize this is a whole block
> so the filesystem starts at block 1, while for larger blocksize
> this is still in block 0.  Backup superblocks are at block offsets:
>
> (blocksize * 8) * {3,5,7}^n, n={0,1,2,3...}

I'm starting to get this, thanks for your patience.
I tried all the feasible values of -b less than 2147483647,
as I mention above. I did not try larger block sizes than 8192.

I since found these links which fill out the picture a bit more.
   http://web.mit.edu/tytso/www/linux/ext2intro.html
   http://homepage.smc.edu/morgan_david/cs40/analyze-ext2.htm
   http://uranus.it.swin.edu.au/~jn/explore2fs/es2fs.htm
   http://www.unixwiz.net/techtips/recovering-ext2.html
   http://nepto.atomicpile.sk/mix/articles/ext2-superblock/ext2-superblock-notes.txt


Any further thoughts appreciated.

Cheers
Vince


From adilger at clusterfs.com  Thu Nov  3 17:40:40 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 3 Nov 2005 10:40:40 -0700
Subject: filesystem remounted as read only
In-Reply-To: <loom.20051103T023009-740@post.gmane.org>
References: <loom.20051103T023009-740@post.gmane.org>
Message-ID: <20051103174040.GM31368@schatzie.adilger.int>

On Nov 03, 2005  01:35 +0000, Kent Tong wrote:
> I'm running kernel 2.6.8-15, lvm2 v2.01.04-5 and acl v2.2.23-1 on a
> Sunblade 100 (sparc). Over the past few months an ext3 filesystem has
> been remounted read-only several times (this is due to the option
> "errors=remount-ro" in /etc/fstab). Sometimes there is no error in the
> log files, but sometimes we see:
> 
> kernel: init_special_inode: bogus i_mode (3016)
> kernel: init_special_inode: bogus i_mode (3125)
> kernel: init_special_inode: bogus i_mode (3144)
> kernel: init_special_inode: bogus i_mode (3231)
> kernel: init_special_inode: bogus i_mode (3423)
> kernel: init_special_inode: bogus i_mode (3452)
> 
> In the former case (no error in the logs), then running fsck will find 
> no error. In the latter case, it may find some errors and fix them.
> 
> I've run smartmontools to check the disks but no errors are found.
> 
> I've run "fsck -c" to look up bad blocks but nothing is found.
> 
> What else can I do to troubleshoot the problem? In particular, the
> strangest part is that when it is remounted read-only, there is
> sometimes no error in the logs. Could remounting read-only prevent it
> from writing to the logs?

Remounting read-only should only happen in the context of "ext3_error".
The init_special_inode() code does not return an error to the caller
so in some cases this error may go unnoticed.
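
As an aside on the logged values themselves, here is a small sketch of what they decode to. It assumes the kernel prints i_mode in octal (my recollection of the printk format, not something stated in this thread); since every logged value is below 4096 the conclusion is the same if they are read as decimal.

```python
import stat

logged = ["3016", "3125", "3144", "3231", "3423", "3452"]


def file_type_bits(text, base):
    """File-type portion of an i_mode value parsed in the given base."""
    return stat.S_IFMT(int(text, base))


# ASSUMPTION: octal, as printed by the kernel; decimal shown for comparison.
as_octal = [file_type_bits(v, 8) for v in logged]
as_decimal = [file_type_bits(v, 10) for v in logged]

for v in logged:
    mode = int(v, 8)
    print(v, "type bits:", oct(stat.S_IFMT(mode)),
          "permission bits:", oct(stat.S_IMODE(mode)))
```

In both readings the file-type bits are zero, so these inodes are not regular files, directories, symlinks or any of the special types, which is what one would expect init_special_inode() to complain about. That points at inode contents being garbage at the time they were read, in memory or on disk.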

In cases where there is a runtime error but no problem is found on
disk, it is usually a memory error.  It is also possible there is
a cable error or similar.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From jeff at jettis.com  Thu Nov  3 21:18:40 2005
From: jeff at jettis.com (Jeff Dinisco)
Date: Thu, 3 Nov 2005 13:18:40 -0800
Subject: mount r/w and r/o
Message-ID: <B6A0A04D59978745A68272143BE55BD42CA58D@laxmsex01.corp.jettis.com>

I have an ext3 filesystem mounted r/w on 1 host and r/o on multiple
hosts.  Dangerous but cost effective.  I recently implemented some
protection through a fc switch that restricts some hosts to r/o access
to the data luns.  So if someone types mount -o rw or something, all is
not lost.

The issue occurs when it's mounted r/w on 1 host and another host
attempts to mount it r/o.  The mount command takes about a minute to
complete, it successfully mounts, and several error messages are
reported...

Nov  3 12:52:26 lax kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
Nov  3 12:52:26 lax kernel: EXT3-fs: write access will be enabled during recovery.
Nov  3 12:52:27 lax kernel: cfq: depth 4 reached, tagging now on

...reports this for about 260 different sectors (makes sense, fc switch
is preventing write access)...

Nov  3 12:52:27 lax kernel: SCSI error : <494 0 0 1> return code = 0x8000002
Nov  3 12:52:27 lax kernel: sdl: Current: sense key: Data Protect
Nov  3 12:52:27 lax kernel:     Additional sense: Logical unit software write protected
Nov  3 12:52:27 lax kernel: end_request: I/O error, dev sdl, sector 496
Nov  3 12:52:27 lax kernel: Buffer I/O error on device sdl, logical block 62
Nov  3 12:52:27 lax kernel: lost page write due to I/O error on sdl
then completes...

Nov  3 12:52:44 laxl kernel: EXT3-fs: recovery complete. (how???)
Nov  3 12:52:44 laxl kernel: EXT3-fs: mounted filesystem with ordered data mode.

This also happens on other filesystems and other devices under the same
circumstances.

When the filesystem is umounted from the r/w host, it mounts w/ out
error on r/o host.  It's interesting to note that after that's done, you
can remount the filesystem on the r/w host, and then mount it on the r/o
w/ just a few errors and w/ in seconds.

My questions are...
Should I be concerned by this?
Is there a way to automatically skip the recovery attempt, and if so,
should I use it?
Am I going about this all wrong, is there a better way to do this (other
than GFS)?

Thanks.

 - Jeff

From menscher at uiuc.edu  Thu Nov  3 21:37:37 2005
From: menscher at uiuc.edu (Damian Menscher)
Date: Thu, 3 Nov 2005 15:37:37 -0600 (CST)
Subject: mount r/w and r/o
In-Reply-To: <B6A0A04D59978745A68272143BE55BD42CA58D@laxmsex01.corp.jettis.com>
References: <B6A0A04D59978745A68272143BE55BD42CA58D@laxmsex01.corp.jettis.com>
Message-ID: <Pine.LNX.4.63.0511031536550.3063@zeus.itg.uiuc.edu>

On Thu, 3 Nov 2005, Jeff Dinisco wrote:

> I have an ext3 filesystem mounted r/w on 1 host and r/o on multiple
> hosts.  Dangerous but cost effective.
>
> My questions are...
> Should I be concerned by this?
> Is there a way to automatically skip the recovery attempt, and if so,
> should I use it?
> Am I going about this all wrong, is there a better way to do this (other
> than GFS)?

Sorry to ask the obvious question, but why not just use NFS?

Damian Menscher
-- 
-=#| <menscher at uiuc.edu> www.uiuc.edu/~menscher/ Ofc:(650)273-2757 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-


From jeff at jettis.com  Thu Nov  3 21:58:24 2005
From: jeff at jettis.com (Jeff Dinisco)
Date: Thu, 3 Nov 2005 13:58:24 -0800
Subject: mount r/w and r/o
Message-ID: <B6A0A04D59978745A68272143BE55BD4A431E4@laxmsex01.corp.jettis.com>

Performance is the answer.  This is streaming media and the throughput
is very high.   

-----Original Message-----
From: Wolber, Richard C [mailto:richard.c.wolber at boeing.com] 
Sent: Thursday, November 03, 2005 5:01 PM
To: Damian Menscher; Jeff Dinisco
Cc: ext3-users at redhat.com
Subject: RE: mount r/w and r/o

> > My questions are...
> > Should I be concerned by this?
> > Is there a way to automatically skip the recovery attempt, and if so,
> > should I use it?
> > Am I going about this all wrong, is there a better way to do this 
> > (other than GFS)?
> 
> Sorry to ask the obvious question, but why not just use NFS?

Performance? NFS is a lot of overhead to consider using on something
like FC. Mounting r/o seems (and I await the experts' opinion) at first
glance to be a very effective way of doing this.

..Chuck..


From adilger at clusterfs.com  Thu Nov  3 22:07:35 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 3 Nov 2005 15:07:35 -0700
Subject: mount r/w and r/o
In-Reply-To: <B6A0A04D59978745A68272143BE55BD42CA58D@laxmsex01.corp.jettis.com>
References: <B6A0A04D59978745A68272143BE55BD42CA58D@laxmsex01.corp.jettis.com>
Message-ID: <20051103220735.GW31368@schatzie.adilger.int>

On Nov 03, 2005  13:18 -0800, Jeff Dinisco wrote:
> I have an ext3 filesystem mounted r/w on 1 host and r/o on multiple
> hosts.  Dangerous but cost effective.  I recently implemented some
> protection through a fc switch that restricts some hosts to r/o access
> to the data luns.  So if someone types mount -o rw or something, all is
> not lost.

This is completely dangerous and should not be done.  The FC switch is
preventing potentially serious corruption to your filesystem, but is
not preventing the r/o clients from getting corrupt/stale data and
possibly crashing.  There is nothing on those clients to keep their
cache up-to-date with what is happening on the r/w server.

> Is there a way to automatically skip the recovery attempt, and if so,
> should I use it?

No.

> Am I going about this all wrong, is there a better way to do this (other
> than GFS)?

As another person suggested, NFS is fine for small-scale usage like
this.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From Vince.McIntyre at atnf.csiro.au  Fri Nov  4 01:17:16 2005
From: Vince.McIntyre at atnf.csiro.au (Vincent McIntyre)
Date: Fri, 4 Nov 2005 12:17:16 +1100 (EST)
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <20051101180912.GN31368@schatzie.adilger.int>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051101060832.GK31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511012320040.9437@bedlam.atnf.CSIRO.AU>
	<20051101180912.GN31368@schatzie.adilger.int>
Message-ID: <Pine.LNX.4.62.0511041123180.17129@bedlam.atnf.CSIRO.AU>

Hi again

I unplugged the original xraid and did some tests on a non-production one,
building larger and larger filesystems, mounting, & dismounting.
I can reproduce the problem with this sequence:

* boot with xraid device plugged in, kernel 2.6.7-1-686-smp
     (packaged as 2.6.7-1.backports.org.1)
* install a gpt disklabel with parted (-1.6.24 rather than 1.6.19)
* make an ext2 filesystem as big as the disk with parted
* mount - it mounts ok
* umount
* tune2fs -j (-1.38)
* mount - it mounts ok (-2.12)
* umount (-2.12)
* reboot
* try to mount - it fails.
     (the filesystem is not mentioned in /etc/fstab, the system should
      not be attempting to mount it or fsck it at boot time)

No files were written to the filesystem during the test sequence.

I have not yet tried filesystems smaller than 2Tb across reboots.
I expect it will work, but I will try that shortly to check.


findsuper tells me there are superblocks, but fs_blk_sz changes (!?)
# /root/e2fsprogs-1.38/misc/findsuper /dev/sdb1
starting at 0, with 512 byte increments
        thisoff     block fs_blk_sz  blksz grp last_mount
          17920        17 586057719  4096    0 Thu Jan  1 10:00:00 1970
      134234624    131088 586057719  4096    1 Thu Jan  1 10:00:00 1970
      134235648    131089 586057719  4096    1 Thu Jan  1 10:00:00 1970
      209733120    204817   1023983  1024   25 Thu Jan  1 10:00:00 1970
      226510336    221201   1023983  1024   27 Thu Jan  1 10:00:00 1970
      402670080    393232 586057719  4096    3 Thu Jan  1 10:00:00 1970
      402671104    393233 586057719  4096    3 Thu Jan  1 10:00:00 1970
      411059712    401425   1023983  1024   49 Thu Jan  1 10:00:00 1970
      671105536    655376 586057719  4096    5 Thu Jan  1 10:00:00 1970
      671106560    655377 586057719  4096    5 Thu Jan  1 10:00:00 1970
      679495168    663569   1023983  1024   81 Thu Jan  1 10:00:00 1970
      939540992    917520 586057719  4096    7 Thu Jan  1 10:00:00 1970
      939542016    917521 586057719  4096    7 Thu Jan  1 10:00:00 1970
     1207976448   1179664 586057719  4096    9 Thu Jan  1 10:00:00 1970
     1207977472   1179665 586057719  4096    9 Thu Jan  1 10:00:00 1970
     3355460096   3276816 586057719  4096   25 Thu Jan  1 10:00:00 1970
     3355461120   3276817 586057719  4096   25 Thu Jan  1 10:00:00 1970
     3623895552   3538960 586057719  4096   27 Thu Jan  1 10:00:00 1970
     3623896576   3538961 586057719  4096   27 Thu Jan  1 10:00:00 1970
     6576685568   6422544 586057719  4096   49 Thu Jan  1 10:00:00 1970
     6576686592   6422545 586057719  4096   49 Thu Jan  1 10:00:00 1970
    10871652864  10616848 586057719  4096   81 Thu Jan  1 10:00:00 1970
    10871653888  10616849 586057719  4096   81 Thu Jan  1 10:00:00 1970
    16777232896  16384016 586057719  4096  125 Thu Jan  1 10:00:00 1970
    16777233920  16384017 586057719  4096  125 Thu Jan  1 10:00:00 1970
^C
This is not looking good...

Your nice od trick tells me slightly different locations for the
superblock signatures -
# od -Ax -tx4 /dev/sdb1 | \
   grep "^[0-9a-f]*30 [0-9a-f]* [0-9a-f]* 000[1-3]ef53 "
004630 436a93dd 001e0000 0001ef53 00000001
8004630 00000000 001e0000 0001ef53 00000001
c804630 00000000 001e0000 0001ef53 00000001
d804630 00000000 001e0000 0001ef53 00000001
18004630 00000000 001e0000 0001ef53 00000001
18804630 00000000 001e0000 0001ef53 00000001
28004630 00000000 001e0000 0001ef53 00000001
28804630 00000000 001e0000 0001ef53 00000001
38004630 00000000 001e0000 0001ef53 00000001
48004630 00000000 001e0000 0001ef53 00000001
c8004630 00000000 001e0000 0001ef53 00000001
d8004630 00000000 001e0000 0001ef53 00000001
88004630 00000000 001e0000 0001ef53 00000001
^C

0x004630 corresponds to byte offset 17968, 48 bytes away.
Is this explainable by the position of the superblock signature within
the disk block?
0x8004630 corresponds to 134220222, delta=14400. This is confusing me.

So I tried a few e2fsck runs. I know I'm probably being dense but none
of these worked:
e2fsck -n -b 16        -B 4096 /dev/sdb1
e2fsck -n -b 17        -B 4096 /dev/sdb1
e2fsck -n -b 18        -B 4096 /dev/sdb1
e2fsck -n -b 204816    -B 1024 /dev/sdb1
e2fsck -n -b 204817    -B 1024 /dev/sdb1
e2fsck -n -b 204818    -B 1024 /dev/sdb1
e2fsck -n -b 221200    -B 1024 /dev/sdb1
e2fsck -n -b 221201    -B 1024 /dev/sdb1
e2fsck -n -b 221202    -B 1024 /dev/sdb1
e2fsck -n -b 1179664   -B 4096 /dev/sdb1
e2fsck -n -b 1179665   -B 4096 /dev/sdb1
e2fsck -n -b 6422544   -B 4096 /dev/sdb1
e2fsck -n -b 6422545   -B 4096 /dev/sdb1
e2fsck -n -b 10616848  -B 4096 /dev/sdb1
e2fsck -n -b 10616849  -B 4096 /dev/sdb1

(The e2fsck manpage could be a tiny bit clearer in that - I think -
  it means you should use -b <blocknumber>, not -b <offset_to_superblock>)

oh, and just trying to mount does not work, as one might expect.
# mount -text2 /dev/sdb1 /tmp/a
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
        or too many mounted file systems
        (aren't you trying to mount an extended partition,
        instead of some logical partition inside?)

I did straces of the e2fsck before and after the reboot; would it help
to send those?

Thanks again
Vince


From adilger at clusterfs.com  Fri Nov  4 02:35:47 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 3 Nov 2005 19:35:47 -0700
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <Pine.LNX.4.62.0511041123180.17129@bedlam.atnf.CSIRO.AU>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051101060832.GK31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511012320040.9437@bedlam.atnf.CSIRO.AU>
	<20051101180912.GN31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511041123180.17129@bedlam.atnf.CSIRO.AU>
Message-ID: <20051104023547.GY31368@schatzie.adilger.int>

On Nov 04, 2005  12:17 +1100, Vincent McIntyre wrote:
> * boot with xraid device plugged in, kernel 2.6.7-1-686-smp
>     (packaged as 2.6.7-1.backports.org.1)
> * install a gpt disklabel with parted (-1.6.24 rather than 1.6.19)
> * make an ext2 filesystem as big as the disk with parted
> * mount - it mounts ok
> * umount
> * tune2fs -j (-1.38)
> * mount - it mounts ok (-2.12)
> * umount (-2.12)
> * reboot
> * try to mount - it fails.
>     (the filesystem is not mentioned in /etc/fstab, the system should
>      not be attempting to mount it or fsck it at boot time)
> 
> No files were written to the filesystem during the test sequence.

Hmm, I would expect that you'd at least need to write something to the
filesystem, unless you are unlucky enough that the last group(s) alias
exactly over the first superblock on disk, and the superblock stays in the
cache long enough for you to remount it before you reboot.

If you just do the mke2fs + reboot + mount, does that fail?  Same with
just the tune2fs -j + reboot + remount?  Do you only use the parted
"mkfs" or do you actually use the mke2fs from e2fsprogs?

> I have not yet tried filesystems smaller than 2Tb across reboots.
> I expect it will work, but I will try that shortly to check.
> 
> 
> findsuper tells me there are superblocks, but fs_blk_sz changes (!?)

These are remnants of previous filesystems on the device, each with
slightly different offsets (maybe with and without a partition table,
or with different partition types).  In one case there was a small
1kB block filesystem on the disk in the past.

> # /root/e2fsprogs-1.38/misc/findsuper /dev/sdb1
> starting at 0, with 512 byte increments
>        thisoff     block fs_blk_sz  blksz grp last_mount
>          17920        17 586057719  4096    0 Thu Jan  1 10:00:00 1970

What is missing is the superblock at offset "1024".  What this tool
_should_ also print out is part of the superblock UUID so it is possible
to say which superblocks belong to a single filesystem.

With an ext3 filesystem you will also find copies of the superblock in
the journal, they will all be marked "grp 0" and are not valid backups.

>      134234624    131088 586057719  4096    1 Thu Jan  1 10:00:00 1970
>      134235648    131089 586057719  4096    1 Thu Jan  1 10:00:00 1970
>      209733120    204817   1023983  1024   25 Thu Jan  1 10:00:00 1970
>      226510336    221201   1023983  1024   27 Thu Jan  1 10:00:00 1970
>      402670080    393232 586057719  4096    3 Thu Jan  1 10:00:00 1970
>      402671104    393233 586057719  4096    3 Thu Jan  1 10:00:00 1970
>      411059712    401425   1023983  1024   49 Thu Jan  1 10:00:00 1970
>      671105536    655376 586057719  4096    5 Thu Jan  1 10:00:00 1970
>      671106560    655377 586057719  4096    5 Thu Jan  1 10:00:00 1970
>      679495168    663569   1023983  1024   81 Thu Jan  1 10:00:00 1970
>      939540992    917520 586057719  4096    7 Thu Jan  1 10:00:00 1970
>      939542016    917521 586057719  4096    7 Thu Jan  1 10:00:00 1970
>     1207976448   1179664 586057719  4096    9 Thu Jan  1 10:00:00 1970
>     1207977472   1179665 586057719  4096    9 Thu Jan  1 10:00:00 1970
>     3355460096   3276816 586057719  4096   25 Thu Jan  1 10:00:00 1970
>     3355461120   3276817 586057719  4096   25 Thu Jan  1 10:00:00 1970
>     3623895552   3538960 586057719  4096   27 Thu Jan  1 10:00:00 1970
>     3623896576   3538961 586057719  4096   27 Thu Jan  1 10:00:00 1970
>     6576685568   6422544 586057719  4096   49 Thu Jan  1 10:00:00 1970
>     6576686592   6422545 586057719  4096   49 Thu Jan  1 10:00:00 1970
>    10871652864  10616848 586057719  4096   81 Thu Jan  1 10:00:00 1970
>    10871653888  10616849 586057719  4096   81 Thu Jan  1 10:00:00 1970
>    16777232896  16384016 586057719  4096  125 Thu Jan  1 10:00:00 1970
>    16777233920  16384017 586057719  4096  125 Thu Jan  1 10:00:00 1970
> ^C
> This is not looking good...

There appear to be 2 filesystems of interest.  One has offset 0x4200 = 16896,
but is missing the primary superblock.  The other has offset 0x4600 = 17920.
Neither of these would allow you to mount the filesystem as-is, because the
superblock is not aligned at 1024 bytes from the start of the device.

I would suspect something wacky with the partitioning and/or the way that
parted is making the filesystem.

> Your nice od trick tells me slightly different locations for the
> superblock signatures -
> # od -Ax -tx4 /dev/sdb1 | \
>   grep "^[0-9a-f]*30 [0-9a-f]* [0-9a-f]* 000[1-3]ef53 "
> 004630 436a93dd 001e0000 0001ef53 00000001
> 8004630 00000000 001e0000 0001ef53 00000001
> c804630 00000000 001e0000 0001ef53 00000001
> d804630 00000000 001e0000 0001ef53 00000001
> 18004630 00000000 001e0000 0001ef53 00000001
> ^C
> 
> 0x004630 corresponds to byte offset 17968, 48 bytes away.
> Is this explainable by the position of the superblock signature within
> the disk block?

Yes, this hack is only looking for the ext[23] magic number, which is not
at the start of the superblock (0x30 = 48 bytes offset).
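
As a rough check of that arithmetic, here is a small sketch. It assumes only
what the grep pattern above implies, namely that each matched od line starts
0x30 bytes into a superblock; the helper name is made up for illustration.

```python
# Hypothetical helper: map an od line address (hex, as printed by "od -Ax")
# to the byte offset at which the enclosing superblock starts.
# Assumption: the matched od line begins 0x30 bytes into the superblock.
OD_LINE_OFFSET = 0x30


def superblock_start(od_address_hex):
    """Return the superblock's starting byte offset for one od address."""
    return int(od_address_hex, 16) - OD_LINE_OFFSET


# The first two addresses from the od output quoted above.
for addr in ("004630", "8004630"):
    print(addr, "->", superblock_start(addr))
```

Both results (17920 and 134235648) appear in the thisoff column of the
findsuper listing, so the two tools seem to agree once the 48 bytes are
taken into account.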

> So I tried a few e2fsck runs. I know I'm probably being dense but none
> of these worked:
> e2fsck -n -b 16        -B 4096 /dev/sdb1
> e2fsck -n -b 17        -B 4096 /dev/sdb1
> e2fsck -n -b 18        -B 4096 /dev/sdb1
> e2fsck -n -b 204816    -B 1024 /dev/sdb1
> e2fsck -n -b 204817    -B 1024 /dev/sdb1
> e2fsck -n -b 204818    -B 1024 /dev/sdb1
> e2fsck -n -b 221200    -B 1024 /dev/sdb1
> e2fsck -n -b 221201    -B 1024 /dev/sdb1
> e2fsck -n -b 221202    -B 1024 /dev/sdb1
> e2fsck -n -b 1179664   -B 4096 /dev/sdb1
> e2fsck -n -b 1179665   -B 4096 /dev/sdb1
> e2fsck -n -b 6422544   -B 4096 /dev/sdb1
> e2fsck -n -b 6422545   -B 4096 /dev/sdb1
> e2fsck -n -b 10616848  -B 4096 /dev/sdb1
> e2fsck -n -b 10616849  -B 4096 /dev/sdb1

No, I'd expect you need to do something with the device partitioning
to get the filesystem aligned properly.  They aren't even aligned on
a block boundary, there is a 512-byte offset.
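
A quick way to see the misalignment, as a sketch only: take the two offsets
named earlier in this message (0x4200 and 0x4600) and look at the remainder
against the two block sizes findsuper reported. The variable names are
arbitrary.

```python
# Offsets named earlier in this message; block sizes are the two values
# that findsuper printed in its blksz column.
offsets = {"0x4200": 0x4200, "0x4600": 0x4600}
block_sizes = (1024, 4096)

# Remainder of each offset against each block size; a value of 0 would
# mean the offset sits exactly on a block boundary.
remainders = {
    (name, blksz): off % blksz
    for name, off in offsets.items()
    for blksz in block_sizes
}

for (name, blksz), rem in remainders.items():
    print(f"{name} mod {blksz} = {rem}")
```

Neither offset gives a zero remainder for either block size, and against
1024 both leave 512 bytes over.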

> (The e2fsck manpage could be a tiny bit clearer in that - I think -
>  it means you to use -b <blocknumber>, not -b <offset_to_superblock>)

Send a patch to Ted.

I would recommend to do the following:
- make a partition
- reboot the system
- use mke2fs -j to make the filesystem
- test mount, unmount, reboot at this point

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From Vincent.McIntyre at csiro.au  Fri Nov  4 05:19:00 2005
From: Vincent.McIntyre at csiro.au (Vincent.McIntyre at csiro.au)
Date: Fri, 4 Nov 2005 16:19:00 +1100 (EST)
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <20051104023547.GY31368@schatzie.adilger.int>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051101060832.GK31368@schatzi
	<20051104023547.GY31368@schatzie.adilger.int>
Message-ID: <Pine.LNX.4.62.0511041418530.17129@bedlam.atnf.CSIRO.AU>

>> No files were written to the filesystem during the test sequence.
>
> Hmm, I would expect at least the need to write something to the filesystem,
> unless you are unlucky enough that the last group(s) aliases exactly over
> the first superblock on disk, but is kept in the cache enough to remount
> it before you reboot.

ok, I can add that to the scripts in my next round of tests.


> Do you only use the parted "mkfs" or do you actually use the mke2fs 
> from e2fsprogs? 
The script does this
   parted -s /dev/sdb1 print
   parted -s /dev/sdb1 mklabel gpt
   parted -s /dev/sdb1 print
   parted -s /dev/sdb1 mkpart primary 0 10
   parted -s /dev/sdb1 print
   parted -s /dev/sdb1 mke2fs 1 ext2
   parted -s /dev/sdb1 print

I did not try mke2fs before now because I don't think it worked when
I was trying to make FS larger than 2Tb. Can't recall now.


> If you just do the mke2fs + reboot + mount, does that fail?

Yes. While you were typing,
  * I made a teeny 10 Mbyte filesystem (using parted, as above)
  * mounted
  * umounted
  * ran findsuper and od
  * reboot
  * ran parted /dev/sdb1 print
    (repeated, using strace)
  * ran an straced e2fsck /dev/sdb1
and got the same error.

I couldn't quite believe this so I tried it again. Same result.
Post reboot, I did things in slightly different order:

  * strace e2fsck -n /dev/sdb1
  e2fsck 1.38 (30-Jun-2005)
  Couldn't find ext2 superblock, trying backup blocks...
  /local/sbin/e2fsck: Bad magic number in super-block while trying to open
  /dev/sdb1

  The superblock could not be read or does not describe a correct ext2
  filesystem.  If the device is valid and it really contains an ext2
  filesystem (and not swap or ufs or something else), then the superblock
  is corrupt, and you might try running e2fsck with an alternate
  superblock:
     e2fsck -b 8193 <device>

  * /local/sbin/parted /dev/sdb print
  Disk geometry for /dev/sdb: 0.000-2289288.000 megabytes
  Disk label type: gpt
  Minor    Start       End     Filesystem  Name                  Flags
  1          0.017     10.000  ext2
  Information: Don't forget to update /etc/fstab, if necessary.


> Same with just the tune2fs -j + reboot + remount?

I switched to using mke2fs to create the filesystem, ie
  * I made a teeny 10 Mbyte partition (using parted)
  * mke2fs /dev/sdb1
  * mounted
  * umounted
  * ran findsuper and od
  * reboot
  * strace -o strace.e2fsck.postboot /local/sbin/e2fsck -n /dev/sdb1
  e2fsck 1.38 (30-Jun-2005)
  Couldn't find ext2 superblock, trying backup blocks...
  /local/sbin/e2fsck: Bad magic number in super-block while trying to open
  /dev/sdb1

  The superblock could not be read or does not describe a correct ext2
  filesystem.  If the device is valid and it really contains an ext2
  filesystem (and not swap or ufs or something else), then the superblock
  is corrupt, and you might try running e2fsck with an alternate
  superblock:
     e2fsck -b 8193 <device>

So it is starting to look like the GPT disklabel is causing a problem.

I switched to having parted make a msdos disklabel but kept everything
else the same - it worked fine.
  # strace -o strace.e2fsck.postboot /local/sbin/e2fsck -n /dev/sdb1
  e2fsck 1.38 (30-Jun-2005)
  /dev/sdb1: clean, 11/2000 files, 268/8000 blocks
  #


>> findsuper tells me there are superblocks, but fs_blk_sz changes (!?)
>
> These are remnants of previous filesystems on the device, each with
> slightly different offsets (maybe with and without a partition table,
> or with different partition types).  In one case there was a small
> 1kB block filesystem on the disk in the past.

ah, of course. I thought findsuper would respect the partition boundaries
and stop at the "end" of the filesystem. It did that pre-reboot, e.g. my
10Mbyte test above
   starting at 0, with 512 byte increments
        thisoff     block fs_blk_sz  blksz grp last_mount
           1024         1     10223  1024    0 Thu Jan  1 10:00:00 1970
        8389632      8193     10223  1024    1 Thu Jan  1 10:00:00 1970

       10468864: finished with errno 0

Post-reboot, I get this:
   starting at 0, with 512 byte increments
        thisoff     block fs_blk_sz  blksz grp last_mount
          17920        17     10223  1024    0 Thu Jan  1 10:00:00 1970
        8406528      8209     10223  1024    1 Thu Jan  1 10:00:00 1970
      134235648    131089 511999995  4096    1 Thu Jan  1 10:00:00 1970
      209733120    204817   1023983  1024   25 Thu Jan  1 10:00:00 1970
      226510336    221201   1023983  1024   27 Thu Jan  1 10:00:00 1970

To clean things up, I suppose I could dd /dev/zero into /dev/sdb?
It'll only take about 10 hours..


>> # /root/e2fsprogs-1.38/misc/findsuper /dev/sdb1
>> starting at 0, with 512 byte increments
>>        thisoff     block fs_blk_sz  blksz grp last_mount
>>          17920        17 586057719  4096    0 Thu Jan  1 10:00:00 1970
>
> What is missing is the superblock at offset "1024".  What this tool
> _should_ also print out is part of the superblock UUID so it is possible
> to say which superblocks belong to a single filesystem.
>
> With an ext3 filesystem you will also find copies of the superblock in
> the journal, they will all be marked "grp 0" and are not valid backups.

ok, thanks for explaining this.


> There appear to be 2 filesystems of interest.  One has offset 0x4200 = 16896,
> but is missing the primary superblock.  The other has offset 0x4600 = 17920.
> Neither of these would allow you to mount the filesystem as-is, because the
> superblock is not aligned at 1024 bytes from the start of the device.
>
> I would suspect something wacky with the partitioning and/or the way that
> parted is making the filesystem.

Most of this is just the history of the fs creation tests I did, I guess.
Remember, all these are just test filesystems on separate hardware.
I have not dared to run findsuper on the filesystem of interest yet,
I want to make sure I can actually recover a test FS first.


>> So I tried a few e2fsck runs. I know I'm probably being dense but none
>> of these worked:
>> e2fsck -n -b 16        -B 4096 /dev/sdb1
>> e2fsck -n -b 17        -B 4096 /dev/sdb1
....
>
> No, I'd expect you need to do something with the device partitioning
> to get the filesystem aligned properly.  They aren't even aligned on
> a block boundary, there is a 512-byte offset.

I noticed that when computing thisoff/blksz, but didn't make much of it.
Thanks for clearing that up.
I'll take a look at the manuals to see if I can force things to be
on a block boundary.

> I would recommend to do the following:
> - make a partition
> - reboot the system
> - use mke2fs -j to make the filesystem
> - test mount, unmount, reboot at this point

This reboot-after-partition thing is foreign to me (coming from solaris); 
it seems quite a poor design to need this. But let's run with it.

   parted -s /dev/sdb1 print
   parted -s /dev/sdb1 mklabel gpt
   parted -s /dev/sdb1 print
   parted -s /dev/sdb1 mkpart primary 0 10
   parted -s /dev/sdb1 print
   sleep 60
   reboot
   parted -s /dev/sdb1 print
   mke2fs -n -v /dev/sdb1
   mke2fs -q /dev/sdb1
     mke2fs gets stuck...
     I have to ^C it.

   # fdisk -l /dev/sdb
   You must set cylinders.
   You can do this from the extra functions menu.

   Disk /dev/sdb: 0 MB, 0 bytes
   255 heads, 63 sectors/track, 0 cylinders
   Units = cylinders of 16065 * 512 = 8225280 bytes

      Device Boot      Start         End      Blocks   Id  System
   /dev/sdb1               1      267350  2147483647+  ee  EFI GPT
   Partition 1 has different physical/logical beginnings (non-Linux?):
      phys=(0, 0, 1) logical=(0, 0, 2)
   Partition 1 has different physical/logical endings:
      phys=(1023, 254, 63) logical=(267349, 89, 4)

   # /local/sbin/parted /dev/sdb print
   Error: The primary GPT table is corrupt, but the backup appears ok, so
   that will be used.
   OK/Cancel? C
   Information: Don't forget to update /etc/fstab, if necessary.

   # /local/sbin/parted /dev/sdb print
   Error: The primary GPT table is corrupt, but the backup appears ok, so
   that will be used.
   OK/Cancel? OK
   Disk geometry for /dev/sdb: 0.000-2289288.000 megabytes
   Disk label type: gpt
   Minor    Start       End     Filesystem  Name                  Flags
   1          0.017     10.000  ext2
   Information: Don't forget to update /etc/fstab, if necessary.


   # strace -o strace.e2fsck.post-parted /local/sbin/e2fsck -n /dev/sdb1
   e2fsck 1.38 (30-Jun-2005)
   Couldn't find ext2 superblock, trying backup blocks...
   /local/sbin/e2fsck: Bad magic number in super-block while trying to open
   /dev/sdb1

   The superblock could not be read or does not describe a correct ext2
   filesystem.  If the device is valid and it really contains an ext2
   filesystem (and not swap or ufs or something else), then the superblock
   is corrupt, and you might try running e2fsck with an alternate
   superblock:
     e2fsck -b 8193 <device>

So it appears that support is lacking for GPT disklabels in e2fsprogs
and possibly the kernel as well.

I ran one more time,
   partition with parted, gpt label.
   reboot
   make 10Mbyte ext2 fs with parted
   mount, umount, findsuper, od - all this seems to work ok.
   reboot
   attempt to mount
    mount -text2 /dev/sdb1 /tmp/a
    mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
        or too many mounted file systems
        (aren't you trying to mount an extended partition,
        instead of some logical partition inside?)

I think this says there is something funky with the GPT disklabelling.

Thanks for your help,
Vince


From adilger at clusterfs.com  Fri Nov  4 07:37:44 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 4 Nov 2005 00:37:44 -0700
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <Pine.LNX.4.62.0511041418530.17129@bedlam.atnf.CSIRO.AU>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051104023547.GY31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511041418530.17129@bedlam.atnf.CSIRO.AU>
Message-ID: <20051104073744.GZ31368@schatzie.adilger.int>

On Nov 04, 2005  16:19 +1100, Vincent.McIntyre at csiro.au wrote:
> >Do you only use the parted "mkfs" or do you actually use the mke2fs 
> >from e2fsprogs? 
> The script does this
>   parted -s /dev/sdb1 print
>   parted -s /dev/sdb1 mklabel gpt
>   parted -s /dev/sdb1 print
>   parted -s /dev/sdb1 mkpart primary 0 10
>   parted -s /dev/sdb1 print
>   parted -s /dev/sdb1 mke2fs 1 ext2
>   parted -s /dev/sdb1 print

Hmm, I don't use parted often, but does it make sense to be making a GPT
disklabel on /dev/sdb1 instead of making it on /dev/sdb?

Note also that there is actually no need to make a partition at all if
you are just going to use the whole device for the filesystem.  This
is particularly interesting with some RAID hardware, since the partition
table adds a 512-byte offset to every single IO, and this can cause
some noticeable performance problems.

Just do "mke2fs -j /dev/sdb" and be happy.

> Yes. While you were typing,
>  * I made a teeny 10 Mbyte filesystem (using parted, as above)
>  * mounted
>  * umounted
>  * ran findsuper and od
>  * reboot
>  * ran parted /dev/sdb1 print
>    (repeated, using strace)
>  * ran an straced e2fsck /dev/sdb1
> and got the same error.
> 
> I couldn't quite believe this so I tried it again. Same result.

This sounds like parted isn't doing what you want, and ext3 is not
the source of the problem at all.

> So it is starting to look like the GPT disklabel is causing a problem.

I agree.

> ah, of course. I thought findsuper would respect the partition boundaries
> and stop at the "end" of the filesystem. It did that pre-reboot, e.g. my
> 10Mbyte test above

It DOES respect the partition boundaries, actually.  In fact, if you
point it at a partition (instead of the parent device) it should not
be POSSIBLE for it to read beyond the end of the partition, and the
kernel should prevent it.

>   starting at 0, with 512 byte increments
>        thisoff     block fs_blk_sz  blksz grp last_mount
>           1024         1     10223  1024    0 Thu Jan  1 10:00:00 1970
>        8389632      8193     10223  1024    1 Thu Jan  1 10:00:00 1970
> 
>       10468864: finished with errno 0
> 
> Post-reboot, I get this:
>   starting at 0, with 512 byte increments
>        thisoff     block fs_blk_sz  blksz grp last_mount
>          17920        17     10223  1024    0 Thu Jan  1 10:00:00 1970
>        8406528      8209     10223  1024    1 Thu Jan  1 10:00:00 1970
>      134235648    131089 511999995  4096    1 Thu Jan  1 10:00:00 1970
>      209733120    204817   1023983  1024   25 Thu Jan  1 10:00:00 1970
>      226510336    221201   1023983  1024   27 Thu Jan  1 10:00:00 1970

This would seem to indicate your partition table is being corrupted.

>   # /local/sbin/parted /dev/sdb print
>   Error: The primary GPT table is corrupt, but the backup appears ok, so
>   that will be used.
>   OK/Cancel? OK
>   Disk geometry for /dev/sdb: 0.000-2289288.000 megabytes
>   Disk label type: gpt
>   Minor    Start       End     Filesystem  Name                  Flags
>   1          0.017     10.000  ext2
>   Information: Don't forget to update /etc/fstab, if necessary.

I suspect this is part of the problem.  The GPT disk label is being
written into /dev/sdb1 (which isn't really valid) and upon reboot the
"backup" is being found at the end of the device and doesn't match
the existing partition table on /dev/sdb.

>   # strace -o strace.e2fsck.post-parted /local/sbin/e2fsck -n /dev/sdb1
>   e2fsck 1.38 (30-Jun-2005)
>   Couldn't find ext2 superblock, trying backup blocks...
>   /local/sbin/e2fsck: Bad magic number in super-block while trying to open
>   /dev/sdb1

At this point, you are trying to access a filesystem with an offset from
the start of the partition.  If you want to recover from this (your real
filesystem), what you should probably do is locate the expected start of
the filesystem using findsuper and then copy it onto your backup device:

dd if=/dev/orig of=/dev/backup bs=offset skip=1

The backup superblocks should have a byte offset of {1,3,5,...} * 32768 * 4096
from the start of the device, so subtracting this from the actual offsets
found will tell you where the filesystem is supposed to start.  Checking the
first few (non group = 0) backup superblocks should make it pretty clear
where the filesystem is supposed to start.
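
The subtraction can be sketched like this. It assumes a 4kB block size and
32768 blocks per group, as in the formula above; the function name is
hypothetical, and the sample rows are (thisoff, grp) pairs copied from the
findsuper output posted earlier in the thread.

```python
# Sketch: estimate where a 4kB-block filesystem starts from the byte
# offsets of its backup superblocks, as reported by findsuper.
BLOCK_SIZE = 4096          # assumed filesystem block size
BLOCKS_PER_GROUP = 32768   # assumed blocks per group


def filesystem_start(thisoff, group):
    """Byte offset of the filesystem start implied by one backup superblock."""
    return thisoff - group * BLOCKS_PER_GROUP * BLOCK_SIZE


# (thisoff, grp) pairs from the earlier findsuper listing.
rows = [
    (134234624, 1),
    (134235648, 1),
    (402670080, 3),
    (402671104, 3),
    (671105536, 5),
]

for thisoff, grp in rows:
    start = filesystem_start(thisoff, grp)
    print(f"grp {grp} at {thisoff} -> start {start}")
```

Rows that agree on the same start value presumably belong to the same
filesystem, and that value would then be the "offset" to feed to dd.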


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From Vincent.McIntyre at csiro.au  Fri Nov  4 11:30:26 2005
From: Vincent.McIntyre at csiro.au (Vincent.McIntyre at csiro.au)
Date: Fri, 4 Nov 2005 22:30:26 +1100 (EST)
Subject: ext3 + fs > 2Tbyte
In-Reply-To: <20051104073744.GZ31368@schatzie.adilger.int>
References: <mailman.5281.1130746677.1909.ext3-users@redhat.com>
	<Pine.LNX.4.62.0510311918440.20154@bedlam.atnf.CSIRO.AU>
	<20051031220648.GC31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511011119570.1768@bedlam.atnf.CSIRO.AU>
	<20051104023547.GY31368@schatzie.adilger.int>
	<Pine.LNX.4.62.0511041418530.17129@bedlam.atnf.CSIRO.AU>
	<20051104073744.GZ31368@schatzie.adilger.int>
Message-ID: <Pine.LNX.4.62.0511042213290.24536@bedlam.atnf.CSIRO.AU>

>>> Do you only use the parted "mkfs" or do you actually use the mke2fs
>>> from e2fsprogs?
>> The script does this
>>   parted -s /dev/sdb1 print
>>   parted -s /dev/sdb1 mklabel gpt
>>   parted -s /dev/sdb1 print
>>   parted -s /dev/sdb1 mkpart primary 0 10
>>   parted -s /dev/sdb1 print
>>   parted -s /dev/sdb1 mke2fs 1 ext2
>>   parted -s /dev/sdb1 print
>
> Hmm, I don't use parted often, but does it make sense to be making a GPT
> disklabel on /dev/sdb1 instead of making it on /dev/sdb?

ooops - misquote on my part. I was indeed using /dev/sdb for this.
I was translating from a shell script that uses a variable for the
disk device and the partition, and confused the two when translating.

> Note also that there is actually no need to make a partition at all if
> you are just going to use the whole device for the filesystem.  This
> is particularly interesting with some RAID hardware, since the partition
> table adds a 512-byte offset to every single IO, and this can cause
> some noticeable performance problems.
>
> Just do "mke2fs -j /dev/sdb" and be happy.

ok, I'll give that a whirl.


>> ah, of course. I thought findsuper would respect the partition boundaries
>> and stop at the "end" of the filesystem. It did that pre-reboot, e.g. my
>> 10Mbyte test above
>
> It DOES respect the partition boundaries, actually.  In fact, if you
> point it at a partition (instead of the parent device) it should not
> be POSSIBLE for it to read beyond the end of the partition, and the
> kernel should prevent it.
>
>>   starting at 0, with 512 byte increments
>>        thisoff     block fs_blk_sz  blksz grp last_mount
>>           1024         1     10223  1024    0 Thu Jan  1 10:00:00 1970
>>        8389632      8193     10223  1024    1 Thu Jan  1 10:00:00 1970
>>
>>       10468864: finished with errno 0
>>
>> Post-reboot, I get this:
>>   starting at 0, with 512 byte increments
>>        thisoff     block fs_blk_sz  blksz grp last_mount
>>          17920        17     10223  1024    0 Thu Jan  1 10:00:00 1970
>>        8406528      8209     10223  1024    1 Thu Jan  1 10:00:00 1970
>>      134235648    131089 511999995  4096    1 Thu Jan  1 10:00:00 1970
>>      209733120    204817   1023983  1024   25 Thu Jan  1 10:00:00 1970
>>      226510336    221201   1023983  1024   27 Thu Jan  1 10:00:00 1970
>
> This would seem to indicate your partition table is being corrupted.

right.

>
>>   # /local/sbin/parted /dev/sdb print
>>   Error: The primary GPT table is corrupt, but the backup appears ok, so
>>   that will be used.
>>   OK/Cancel? OK
>>   Disk geometry for /dev/sdb: 0.000-2289288.000 megabytes
>>   Disk label type: gpt
>>   Minor    Start       End     Filesystem  Name                  Flags
>>   1          0.017     10.000  ext2
>>   Information: Don't forget to update /etc/fstab, if necessary.
>
> I suspect this is part of the problem.  The GPT disk label is being
> written into /dev/sdb1 (which isn't really valid) and upon reboot the
> "backup" is being found at the end of the device and doesn't match
> the existing partition table on /dev/sdb.

Does your reasoning change given my silly mistake above,
i.e. that I was running parted on /dev/sdb, not /dev/sdb1?


>>   # strace -o strace.e2fsck.post-parted /local/sbin/e2fsck -n /dev/sdb1
>>   e2fsck 1.38 (30-Jun-2005)
>>   Couldn't find ext2 superblock, trying backup blocks...
>>   /local/sbin/e2fsck: Bad magic number in super-block while trying to open
>>   /dev/sdb1
>
> At this point, you are trying to access a filesystem with an offset from
> the start of the partition.  If you want to recover from this (your real
> filesystem), what you should probably do is locate the expected start of
> the filesystem using findsuper and then copy it onto your backup device:
>
> dd if=/dev/orig of=/dev/backup bs=offset skip=1
>
> The backup superblocks should have a byte offset of {1,3,5,...} * 32768 * 4096
> from the start of the device, so subtracting this from the actual offsets
> found will tell you where the filesystem is supposed to start.  Checking the
> first few (non group = 0) backup superblocks should make it pretty clear
> where the filesystem is supposed to start.

I'll take a poke at this.
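
A rough sketch of the subtraction Andreas describes, for anyone following along. It takes one line of findsuper output (byte offset, group number, block size) and works back to where the filesystem would have to start. The function name is made up for illustration, and it assumes default mke2fs geometry (blocks per group = 8 * block size, which gives the 32768 * 4096 figure above for 4k blocks; first data block 1 for 1k blocks, 0 otherwise). A filesystem made with non-default options would need different constants.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the byte offset at which a superblock copy was
 * found, its group number and the block size, return the byte offset at
 * which the filesystem would have to start.
 *
 * Assumption: default mke2fs geometry, i.e. blocks_per_group = 8 * block_size.
 */
int64_t fs_start_from_superblock(int64_t found_off, uint32_t group,
                                 uint32_t block_size)
{
    assert(block_size >= 1024);

    int64_t blocks_per_group = 8LL * block_size;
    /* With 1k blocks the first data block is block 1, otherwise block 0. */
    int64_t first_data_block = (block_size == 1024) ? 1 : 0;
    int64_t sb_off;

    if (group == 0)
        sb_off = 1024;  /* primary superblock: always 1024 bytes in */
    else
        sb_off = ((int64_t)group * blocks_per_group + first_data_block)
                 * block_size;

    return found_off - sb_off;
}
```

Fed the post-reboot table above, the two 1k-block entries (17920 in group 0, 8406528 in group 1) both come out at a start of 16896 bytes, and the 4k-block entry (134235648, group 1) comes out at 17920 bytes.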

Assuming there is a problem with GPT labels, can you advise where to
report this? parted-bug, or bugzilla.kernel.org? Or both?

Cheers
Vince


From jp at enix.org  Fri Nov  4 17:11:04 2005
From: jp at enix.org (Jérôme Petazzoni)
Date: Fri, 04 Nov 2005 18:11:04 +0100
Subject: mount r/w and r/o
In-Reply-To: <B6A0A04D59978745A68272143BE55BD4A431E4@laxmsex01.corp.jettis.com>
References: <B6A0A04D59978745A68272143BE55BD4A431E4@laxmsex01.corp.jettis.com>
Message-ID: <436B9628.1010102@enix.org>

[one r-w mount, multiple r-o mounts shared thru FC switch]

>>>should I use it?
>>>Am I going about this all wrong, is there a better way to do this 
>>>(other than GFS)?
>>>      
>>>
I once heard about someone doing something like that for a video farm,
intermixing Solaris and FreeBSD servers (so as far as he, and I, knew,
there was no easy sharing solution). He did the following:
- create the filesystem on the Solaris box
- create many 1 GB files, with a specific byte pattern (512-byte
sectors, IIRC)
- the FreeBSD box would read the raw device, detect the byte patterns
and build an internal lookup table, to know that file F, offset O was
located on physical sector S
- the Solaris box would then write data to the 1 GB files, and the
FreeBSD box could then read back the data, thanks to the previously
built lookup table (the 1 GB files would only be overwritten in place,
never truncated or extended, AFAIK)

IIRC, there were 2 Solaris boxen using some HA solution, and many FreeBSD
boxen accessing the data. This worked because the files were smaller
than 1 GB (to be honest, I don't know the exact size he used), and the
very impressive performance of the solution balanced the hassle involved
in setting up the whole thing.

Now, I would not ask "why not NFS?", but "why not GFS?" (and my
apologies if the answer is obvious...)


From jeff at jettis.com  Fri Nov  4 17:40:20 2005
From: jeff at jettis.com (Jeff Dinisco)
Date: Fri, 4 Nov 2005 09:40:20 -0800
Subject: mount r/w and r/o
Message-ID: <B6A0A04D59978745A68272143BE55BD4A431EA@laxmsex01.corp.jettis.com>

Thanks for the reply.  Very interesting.  Could you explain how the bsd box read the raw device and built the internal lookup table?

The main reason I wrote "not GFS" is because I'm aware of it and that it would take a bit of work to implement.  I'm currently looking for a quick fix to give me some time to implement a more robust solution.  Also, realizing I had some definite issues w/ my current config, I researched GFS a little while back.  It's my understanding that total storage in a GFS cluster cannot exceed 8TB and we have > 12TB.  I didn't investigate too much further for a work-around. 

Andreas suggested lustre which on the surface appears to be viable.  



From jp at enix.org  Fri Nov  4 18:03:57 2005
From: jp at enix.org (Jérôme Petazzoni)
Date: Fri, 04 Nov 2005 19:03:57 +0100
Subject: mount r/w and r/o
In-Reply-To: <B6A0A04D59978745A68272143BE55BD4A431EA@laxmsex01.corp.jettis.com>
References: <B6A0A04D59978745A68272143BE55BD4A431EA@laxmsex01.corp.jettis.com>
Message-ID: <436BA28D.9070808@enix.org>


>Thanks for the reply.  Very interesting.  Could you explain how the bsd box read the raw device and built the internal lookup table?
>  
>
I suppose the BSD box was just accessing the device "raw" (like in 
"/dev/sdX" ; I don't know the exact syntax for BSD, tho), bypassing even 
the partition scheme.

I also guess that the big files created through the Solaris box were a
succession of 512-byte records, each with 4 bytes for the file number,
then 4 bytes for the sector number, the rest being some magic padding. The
BSD box just had to scan all the sectors and build a kind of hash map.
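
To make that guess concrete, here is a minimal sketch of such a scan. Everything specific in it is an assumption for illustration only: the padding byte value, the little-endian field order, the table sizes and the function names are invented, and a plain array stands in for the hash map. It works on an in-memory image rather than a real raw device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512
#define PAD_BYTE    0xA5        /* hypothetical "magic padding" value */
#define MAX_FILES   4           /* hypothetical table dimensions */
#define MAX_RECORDS 8
#define UNMAPPED    UINT32_MAX

static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = v & 0xff; p[1] = (v >> 8) & 0xff;
    p[2] = (v >> 16) & 0xff; p[3] = (v >> 24) & 0xff;
}

/* Writer side: fill one 512-byte record with file number, record number
 * and padding. */
void make_record(uint8_t *sec, uint32_t file, uint32_t rec)
{
    store_le32(sec, file);
    store_le32(sec + 4, rec);
    for (size_t i = 8; i < SECTOR_SIZE; i++)
        sec[i] = PAD_BYTE;
}

/* A sector counts as tagged only if all of its padding matches. */
static int is_tagged(const uint8_t *sec)
{
    for (size_t i = 8; i < SECTOR_SIZE; i++)
        if (sec[i] != PAD_BYTE)
            return 0;
    return 1;
}

/* Reader side: scan every sector of the image and remember which physical
 * sector holds record `rec` of file `file`. */
void build_lookup(const uint8_t *dev, size_t nsectors,
                  uint32_t table[MAX_FILES][MAX_RECORDS])
{
    assert(dev != NULL);

    for (size_t f = 0; f < MAX_FILES; f++)
        for (size_t r = 0; r < MAX_RECORDS; r++)
            table[f][r] = UNMAPPED;

    for (size_t s = 0; s < nsectors; s++) {
        const uint8_t *sec = dev + s * SECTOR_SIZE;
        if (!is_tagged(sec))
            continue;
        uint32_t file = load_le32(sec);
        uint32_t rec = load_le32(sec + 4);
        if (file < MAX_FILES && rec < MAX_RECORDS)
            table[file][rec] = (uint32_t)s;
    }
}
```

Once the table is built, reading offset O of file F is a matter of looking up record O / 512 and reading that physical sector directly, which only stays valid as long as the files are never reallocated.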

Sorry for the lack of details and accuracy, but this was more of an
"around a beer" discussion than a formal report... And this
was, as I understood, a long-term solution, which required a bit of
hacking before being ready for production (modifying the code of the
streaming video server running on the BSD boxen, I assume).

>The main reason I wrote "not GFS" is because I'm aware of it and that it would take a bit of work to implement.  I'm currently looking for a quick fix to give me some time to implement a more robust solution.  Also, realizing I had some definite issues w/ my current config, I researched GFS a little while back.  It's my understanding that total storage in a GFS cluster cannot exceed 8TB and we have > 12TB.  I didn't investigate too much further for a work-around. 
>
>Andreas suggested lustre which on the surface appears to be viable.  
>  
>
Let us know your findings ;-)


From adilger at clusterfs.com  Fri Nov  4 21:27:51 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Fri, 4 Nov 2005 14:27:51 -0700
Subject: mount r/w and r/o
In-Reply-To: <436B9628.1010102@enix.org>
References: <B6A0A04D59978745A68272143BE55BD4A431E4@laxmsex01.corp.jettis.com>
	<436B9628.1010102@enix.org>
Message-ID: <20051104212751.GD31368@schatzie.adilger.int>

On Nov 04, 2005  18:11 +0100, Jérôme Petazzoni wrote:
> [one r-w mount, multiple r-o mounts shared thru FC switch]
> 
> I once heard about someone doing something like that for a video farm,
> intermixing Solaris and FreeBSD servers (so as far as he, and I, knew,
> there was no easy sharing solution). He did the following:
> - create the filesystem on the Solaris box
> - create many 1 GB files, with a specific byte pattern (512-byte
> sectors, IIRC)

Actually, if this was the case (files were never extended or truncated)
and the clients always used O_DIRECT to prevent caching of the data
then this would also work with an ext2 mount (or a modified ext3 that
had a mount option to disable journal recovery, only in conjunction
with read-only mounting).

It wouldn't really be a "normal" filesystem but for a specialized app
environment it might work.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From richard.c.wolber at boeing.com  Thu Nov  3 22:01:05 2005
From: richard.c.wolber at boeing.com (Wolber, Richard C)
Date: Thu, 3 Nov 2005 14:01:05 -0800
Subject: mount r/w and r/o
Message-ID: <8C7C41A176AC0B468BEFB2EFD9BDAB992005B4@XCH-NW-5V2.nw.nos.boeing.com>

> > My questions are...
> > Should I be concerned by this?
> > Is there a way to automatically skip the recovery attempt, and if so,
> > should I use it?
> > Am I going about this all wrong, is there a better way to do this 
> > (other than GFS)?
> 
> Sorry to ask the obvious question, but why not just use NFS?

Performance? NFS is a lot of overhead to consider using on something
like FC. Mounting r/o seems (and I await the experts' opinion) at first
glance to be a very effective way of doing this.

..Chuck..


From arnd at arndb.de  Sat Nov  5 16:27:00 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Sat, 05 Nov 2005 17:27:00 +0100
Subject: [PATCH 10/25] fs: move ext2 ioctl32 handlers into file systems
References: <20051105162650.620266000@b551138y.boeblingen.de.ibm.com>
Message-ID: <20051105162714.555612000@b551138y.boeblingen.de.ibm.com>

An embedded and charset-unspecified text was scrubbed...
Name: ext2-ioctl.diff
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20051105/e949b2d4/attachment.ksh>

From arnd at arndb.de  Sat Nov  5 16:26:50 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Sat, 05 Nov 2005 17:26:50 +0100
Subject: [PATCH 00/25] reduce code in fs/compat_ioctl.c
Message-ID: <20051105162650.620266000@b551138y.boeblingen.de.ibm.com>

On Sünnavend 05 November 2005 00:51, Christoph Hellwig wrote:
> On Sat, Nov 05, 2005 at 12:10:46AM +0100, Arnd Bergmann wrote:
> >
> > BTW, I now have a set of 25 patches that moves all handlers from
> > fs/compat_ioctl.c over to the respective drivers and subsystems,
> > but I'm not sure how to best test that.
> > I intend to at least give it a test run on my Opteron for the whatever
> > ioctls I normally use, but the rest is just guesswork. Christoph,
> > can you review those patches?
> 
> I'm not sure moving everything from fs/compat_ioctl.c is a good idea.
> Everything that is just in a single driver or subsystem that has
> common ioctl code - sure.  else it doesn't make a lot of sense.

Ok, here is my full set of patches, let's see which ones are
sensible and which ones we are better off without.

Getting rid of fs/compat_ioctl.c completely could at least simplify
the compat_sys_ioctl() code a bit and would also make sure that
we only build into the kernel those handlers that can actually be
used, which reduces the binary size.

The patch set is still largely untested, except for a single
compile test, but at least some of the patches are very simple,
so maybe I can get a quick ack or nack on them.

In general, I'm just moving over the handlers to the respective
subsystem without changing the logic, so the patch should not
have any effect on the ioctl operation itself, but it also
means that the handlers still use compat_alloc_user_space
or get_fs/set_fs when it's not really necessary.

	Arnd <><

 drivers/block/ioctl.c                       |  549 +++++
 drivers/block/loop.c                        |   76
 drivers/block/paride/pcd.c                  |    1
 drivers/block/paride/pd.c                   |    1
 drivers/block/paride/pt.c                   |    1
 drivers/block/pktcdvd.c                     |   20
 drivers/bluetooth/hci_ldisc.c               |   22
 drivers/cdrom/Makefile                      |    2
 drivers/cdrom/aztcd.c                       |    1
 drivers/cdrom/cdu31a.c                      |    1
 drivers/cdrom/cm206.c                       |    1
 drivers/cdrom/compat.c                      |  163 +
 drivers/cdrom/gscd.c                        |    1
 drivers/cdrom/mcdx.c                        |    1
 drivers/cdrom/optcd.c                       |    1
 drivers/cdrom/sbpcd.c                       |    1
 drivers/cdrom/sjcd.c                        |    1
 drivers/cdrom/sonycd535.c                   |    2
 drivers/char/Makefile                       |    1
 drivers/char/compat_mtio.c                  |   81
 drivers/char/ftape/zftape/zftape-init.c     |    1
 drivers/char/n_tty.c                        |    1
 drivers/char/raw.c                          |   91
 drivers/char/tty_io.c                       |  191 +
 drivers/char/viotape.c                      |    1
 drivers/char/vt.c                           |    3
 drivers/char/vt_ioctl.c                     |  195 +
 drivers/i2c/i2c-dev.c                       |  141 +
 drivers/ide/ide-cd.c                        |    1
 drivers/ide/ide-floppy.c                    |    1
 drivers/ide/ide-tape.c                      |    1
 drivers/media/radio/miropcm20-radio.c       |    1
 drivers/media/radio/radio-aimslab.c         |    1
 drivers/media/radio/radio-aztech.c          |    1
 drivers/media/radio/radio-cadet.c           |    1
 drivers/media/radio/radio-gemtek-pci.c      |    1
 drivers/media/radio/radio-gemtek.c          |    1
 drivers/media/radio/radio-maestro.c         |    1
 drivers/media/radio/radio-maxiradio.c       |    1
 drivers/media/radio/radio-rtrack2.c         |    1
 drivers/media/radio/radio-sf16fmi.c         |    1
 drivers/media/radio/radio-sf16fmr2.c        |    1
 drivers/media/radio/radio-terratec.c        |    1
 drivers/media/radio/radio-trust.c           |    1
 drivers/media/radio/radio-typhoon.c         |    1
 drivers/media/radio/radio-zoltrix.c         |    1
 drivers/media/video/Makefile                |    2
 drivers/media/video/arv.c                   |    1
 drivers/media/video/bttv-driver.c           |    1
 drivers/media/video/bw-qcam.c               |    1
 drivers/media/video/c-qcam.c                |    1
 drivers/media/video/compat_ioctl.c          |  318 +++
 drivers/media/video/cpia.c                  |    1
 drivers/media/video/cx88/cx88-video.c       |    2
 drivers/media/video/meye.c                  |    1
 drivers/media/video/pms.c                   |    1
 drivers/media/video/saa5249.c               |    1
 drivers/media/video/saa7134/saa7134-video.c |    2
 drivers/media/video/stradis.c               |    1
 drivers/media/video/w9966.c                 |    1
 drivers/media/video/zoran_driver.c          |    1
 drivers/media/video/zr36120.c               |    1
 drivers/mtd/mtdchar.c                       |   94
 drivers/net/ppp_generic.c                   |  179 +
 drivers/s390/char/tape_char.c               |    1
 drivers/scsi/osst.c                         |    2
 drivers/scsi/sg.c                           |  154 +
 drivers/scsi/sr.c                           |    1
 drivers/scsi/st.c                           |    2
 drivers/usb/core/devio.c                    |  139 +
 drivers/usb/media/dsbr100.c                 |    1
 drivers/usb/media/ov511.c                   |    1
 drivers/usb/media/pwc/pwc-if.c              |    1
 drivers/usb/media/se401.c                   |    1
 drivers/usb/media/stv680.c                  |    1
 drivers/usb/media/usbvideo.c                |    1
 drivers/usb/media/vicam.c                   |    1
 drivers/usb/media/w9968cf.c                 |    1
 drivers/video/fbmem.c                       |  147 +
 fs/autofs/root.c                            |   35
 fs/autofs4/root.c                           |   41
 fs/block_dev.c                              |   10
 fs/cifs/cifsfs.c                            |   10
 fs/cifs/cifsfs.h                            |    2
 fs/cifs/ioctl.c                             |   29
 fs/compat.c                                 |   27
 fs/compat_ioctl.c                           | 2918 ----------------------------
 fs/ext2/dir.c                               |    3
 fs/ext2/ext2.h                              |    1
 fs/ext2/file.c                              |    6
 fs/ext2/ioctl.c                             |   31
 fs/ext3/dir.c                               |    3
 fs/ext3/file.c                              |    3
 fs/ext3/ioctl.c                             |   66
 fs/fat/dir.c                                |   54
 fs/hfsplus/dir.c                            |    4
 fs/hfsplus/hfsplus_fs.h                     |    4
 fs/hfsplus/inode.c                          |    4
 fs/hfsplus/ioctl.c                          |   29
 fs/ncpfs/dir.c                              |    3
 fs/ncpfs/file.c                             |    4
 fs/ncpfs/ioctl.c                            |  241 ++
 fs/reiserfs/dir.c                           |    3
 fs/reiserfs/file.c                          |    4
 fs/reiserfs/ioctl.c                         |   36
 fs/smbfs/dir.c                              |    4
 fs/smbfs/file.c                             |    4
 fs/smbfs/ioctl.c                            |   16
 fs/smbfs/proto.h                            |    1
 fs/xfs/linux-2.6/xfs_ioctl32.c              |   15
 include/linux/cdrom.h                       |    2
 include/linux/compat_ioctl.h                |  387 ---
 include/linux/ext2_fs.h                     |    7
 include/linux/ext3_fs.h                     |    1
 include/linux/fs.h                          |    3
 include/linux/ioctl32.h                     |    2
 include/linux/mtio.h                        |   12
 include/linux/ncp_fs.h                      |    1
 include/linux/net.h                         |    2
 include/linux/reiserfs_fs.h                 |    9
 include/linux/socket.h                      |    4
 include/linux/tty.h                         |    2
 include/linux/tty_driver.h                  |    4
 include/linux/tty_ldisc.h                   |    2
 include/linux/videodev.h                    |    2
 include/net/sock.h                          |    9
 net/atm/common.h                            |    1
 net/atm/ioctl.c                             |  167 +
 net/atm/pvc.c                               |    3
 net/atm/svc.c                               |    3
 net/bluetooth/bnep/sock.c                   |    1
 net/bluetooth/cmtp/sock.c                   |    1
 net/bluetooth/hci_sock.c                    |    1
 net/bluetooth/hidp/sock.c                   |    1
 net/bluetooth/rfcomm/sock.c                 |    1
 net/compat.c                                | 1456 +++++++++----
 net/socket.c                                |    7
 137 files changed, 4527 insertions(+), 3807 deletions(-)


From hch at lst.de  Sun Nov  6 04:39:42 2005
From: hch at lst.de (Christoph Hellwig)
Date: Sun, 6 Nov 2005 05:39:42 +0100
Subject: [PATCH 10/25] fs: move ext2 ioctl32 handlers into file systems
In-Reply-To: <20051105162714.555612000@b551138y.boeblingen.de.ibm.com>
References: <20051105162650.620266000@b551138y.boeblingen.de.ibm.com>
	<20051105162714.555612000@b551138y.boeblingen.de.ibm.com>
Message-ID: <20051106043942.GA31343@lst.de>

On Sat, Nov 05, 2005 at 05:27:00PM +0100, Arnd Bergmann wrote:
> The same ioctls (originally from ext2) are used on ext2, ext3,
> hfsplus, cifs, reiserfs and xfs. Since they are really compatible
> between 32 and 64 bit except for the ioctl number, the conversion
> handler is trivial and I copy it to each of these file systems
> in order to eventually get rid of fs/compat_ioctl.c completely.

NACK, this is completely idiotic.  Duplicating handlers is the very
last thing we want.  I actually have patches to move handling some
of those ioctls into generic code, but that's a different story.


From arnd at arndb.de  Mon Nov  7 10:24:47 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Mon, 7 Nov 2005 11:24:47 +0100
Subject: [PATCH 10/25] fs: move ext2 ioctl32 handlers into file systems
In-Reply-To: <20051106043942.GA31343@lst.de>
References: <20051105162650.620266000@b551138y.boeblingen.de.ibm.com>
	<20051105162714.555612000@b551138y.boeblingen.de.ibm.com>
	<20051106043942.GA31343@lst.de>
Message-ID: <200511071124.49467.arnd@arndb.de>

On Sünndag 06 November 2005 05:39, Christoph Hellwig wrote:
> NACK, this is completely idiotic.  Duplicating handlers is the very
> last thing we want.  I actually have patches to move handling some
> of those ioctls into generic code, but that's a different story.

Ok, I'll drop this patch then, except for the ext3 parts that fix
an actual problem of missing conversion handlers.

What is your opinion on the xfs bit? The current code is somewhat
broken, since XFS_IOC_{GET,SET}{VERSION,XFLAGS} are not really
compatible. Should those three lines simply be removed?

	Arnd <><

--- linux-cg.orig/fs/xfs/linux-2.6/xfs_ioctl32.c        2005-11-05 02:44:55.000000000 +0100
+++ linux-cg/fs/xfs/linux-2.6/xfs_ioctl32.c     2005-11-05 02:45:35.000000000 +0100
@@ -34,6 +34,11 @@
 #define  _NATIVE_IOC(cmd, type) \
          _IOC(_IOC_DIR(cmd), _IOC_TYPE(cmd), _IOC_NR(cmd), sizeof(type))
 
+/* broken ext2 ioctl numbers */
+#define XFS_IOC_GETVERSION32 _IOR('v', 1, int)
+#define XFS_IOC_GETXFLAGS32 _IOR('f', 1, int)
+#define XFS_IOC_SETXFLAGS32 _IOW('f', 2, int)
+
 #if defined(CONFIG_IA64) || defined(CONFIG_X86_64)
 #define BROKEN_X86_ALIGNMENT
 /* on ia32 l_start is on a 32-bit boundary */
@@ -115,12 +120,16 @@
        vnode_t         *vp = LINVFS_GET_VP(inode);
 
        switch (cmd) {
+       /* these take an int as their argument, not a long */
+       case XFS_IOC_GETVERSION32:
+       case XFS_IOC_GETXFLAGS32:
+       case XFS_IOC_SETXFLAGS32:
+               cmd = _NATIVE_IOC(cmd, long);
+               break;
+
        case XFS_IOC_DIOINFO:
        case XFS_IOC_FSGEOMETRY_V1:
        case XFS_IOC_FSGEOMETRY:
-       case XFS_IOC_GETVERSION:
-       case XFS_IOC_GETXFLAGS:
-       case XFS_IOC_SETXFLAGS:
        case XFS_IOC_FSGETXATTR:
        case XFS_IOC_FSSETXATTR:
        case XFS_IOC_FSGETXATTRA:
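
For readers wondering why the hunk above rewrites the command number: an ioctl number encodes the size of its argument type, so a command declared with `long` has a different number on a 64-bit kernel than the one a 32-bit process passes in. The userspace sketch below shows the same size-field rewrite that _NATIVE_IOC performs. The names GETXFLAGS_COMPAT, NATIVE_IOC and to_native are invented for the illustration, and using int32_t to stand for the 32-bit caller's argument is an assumption, not what the kernel source spells out.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

/* Same idea as the _NATIVE_IOC macro in the patch: keep direction, type
 * and number, but replace the encoded argument size. */
#define NATIVE_IOC(cmd, type) \
    _IOC(_IOC_DIR(cmd), _IOC_TYPE(cmd), _IOC_NR(cmd), sizeof(type))

/* What a 32-bit caller would pass: the size field says 4 bytes. */
#define GETXFLAGS_COMPAT _IOR('f', 1, int32_t)

/* Convert the 32-bit command number to the one declared with `long`. */
unsigned int to_native(unsigned int cmd)
{
    assert(_IOC_SIZE(cmd) == sizeof(int32_t));
    return (unsigned int)NATIVE_IOC(cmd, long);
}
```

On an LP64 build, to_native(GETXFLAGS_COMPAT) yields the same value as _IOR('f', 1, long), with the size field changed from 4 to 8 and the other fields left untouched.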


From dev at sw.ru  Mon Nov  7 13:41:40 2005
From: dev at sw.ru (Kirill Korotaev)
Date: Mon, 07 Nov 2005 16:41:40 +0300
Subject: [PATCH] ext3: journal handling on error path in
	ext3_journalled_writepage()
Message-ID: <436F5994.2070703@sw.ru>

Forwarded original patch from Denis Lunev:

This patch fixes a lost reference to the current ext3 handle in
ext3_journalled_writepage().

Signed-Off-By: Denis Lunev <den at sw.ru>

P.S. against 2.6.14
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff-ms-ext3handle-20051031
URL: <http://listman.redhat.com/archives/ext3-users/attachments/20051107/b2243c4c/attachment.ksh>

From bunk at stusta.de  Mon Nov  7 21:18:50 2005
From: bunk at stusta.de (Adrian Bunk)
Date: Mon, 7 Nov 2005 22:18:50 +0100
Subject: [2.6 patch] remove CONFIG_EXT{2,3}_CHECK
In-Reply-To: <20051101044658.GA7500@thunk.org>
References: <20051031001334.GP4180@stusta.de>
	<20051031212503.GY31368@schatzie.adilger.int>
	<20051101044658.GA7500@thunk.org>
Message-ID: <20051107211850.GZ3847@stusta.de>

On Mon, Oct 31, 2005 at 11:46:58PM -0500, Theodore Ts'o wrote:
> On Mon, Oct 31, 2005 at 02:25:03PM -0700, Andreas Dilger wrote:
> > On Oct 31, 2005  01:13 +0100, Adrian Bunk wrote:
> > > Can anyone tell me the history of CONFIG_EXT{2,3}_CHECK?
> > > 
> > > There is code for a "check" option for mount if these options are 
> > > enabled, but there's no way to enable them.
> > 
> > These are expensive debugging options, which walk the inode/block bitmaps
> > for getting the group inode/block usage instead of using the group
> > summary data.  Not used very often but I suspect occasionally useful for
> > developers mucking with ext[23] internals.  Since it is developer-only
> > code it needs to be enabled with #define CONFIG_EXT[23]_CHECK in a
> > header or compile option.
> 
> It's basically a stripped down version of e2fsck pass #5, though.  Is
> there any reason why this needs to be in the kernel?  If it would be
> useful I could easily make a userspace implementation of these checks.
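
As an illustration of the check under discussion, a userspace version could look roughly like the sketch below: count the clear bits in a group's bitmap and compare the result with the free count stored in the group descriptor. This is only a sketch under assumptions, not the e2fsck code; the function names are hypothetical and the bitmap is taken as an in-memory buffer in the ext2 bit order (bit i lives in byte i / 8, mask 1 << (i % 8)).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the clear (free) bits among the first nbits of the bitmap. */
unsigned long count_free(const uint8_t *bitmap, size_t nbits)
{
    unsigned long free_bits = 0;

    for (size_t i = 0; i < nbits; i++)
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            free_bits++;
    return free_bits;
}

/* Return 1 if the count walked from the bitmap agrees with the stored
 * per-group summary value, 0 otherwise. */
int group_count_matches(const uint8_t *bitmap, size_t nbits,
                        uint16_t stored_free)
{
    assert(bitmap != NULL);
    return count_free(bitmap, nbits) == stored_free;
}
```

Summing count_free() over all groups and comparing against the superblock total gives the second half of the check.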

This code was introduced with kernel 2.4, but as far as I can see there 
was never an option for enabling it.

Unless someone can give a strong reason for keeping it, I'd suggest the 
patch below.

> 						- Ted

cu
Adrian


<--  snip  -->


The CONFIG_EXT{2,3}_CHECK options were never available, and all
they did was to implement a subset of e2fsck in the kernel.


Signed-off-by: Adrian Bunk <bunk at stusta.de>

---

 Documentation/filesystems/ext2.txt |    2 
 fs/ext2/balloc.c                   |   73 -----------------------------
 fs/ext2/ialloc.c                   |   40 ---------------
 fs/ext2/super.c                    |   16 ------
 fs/ext3/balloc.c                   |   73 -----------------------------
 fs/ext3/ialloc.c                   |   41 ----------------
 fs/ext3/super.c                    |   17 ------
 7 files changed, 2 insertions(+), 260 deletions(-)

--- linux-2.6.14-mm1-full/Documentation/filesystems/ext2.txt.old	2005-11-07 21:22:25.000000000 +0100
+++ linux-2.6.14-mm1-full/Documentation/filesystems/ext2.txt	2005-11-07 21:22:36.000000000 +0100
@@ -17,8 +17,6 @@
 bsddf			(*)	Makes `df' act like BSD.
 minixdf				Makes `df' act like Minix.
 
-check				Check block and inode bitmaps at mount time
-				(requires CONFIG_EXT2_CHECK).
 check=none, nocheck	(*)	Don't do extra checking of bitmaps on mount
 				(check=normal and check=strict options removed)
 
--- linux-2.6.14-mm1-full/fs/ext2/balloc.c.old	2005-11-07 21:22:43.000000000 +0100
+++ linux-2.6.14-mm1-full/fs/ext2/balloc.c	2005-11-07 21:22:56.000000000 +0100
@@ -624,76 +624,3 @@
 	return EXT2_SB(sb)->s_gdb_count;
 }
 
-#ifdef CONFIG_EXT2_CHECK
-/* Called at mount-time, super-block is locked */
-void ext2_check_blocks_bitmap (struct super_block * sb)
-{
-	struct buffer_head *bitmap_bh = NULL;
-	struct ext2_super_block * es;
-	unsigned long desc_count, bitmap_count, x, j;
-	unsigned long desc_blocks;
-	struct ext2_group_desc * desc;
-	int i;
-
-	es = EXT2_SB(sb)->s_es;
-	desc_count = 0;
-	bitmap_count = 0;
-	desc = NULL;
-	for (i = 0; i < EXT2_SB(sb)->s_groups_count; i++) {
-		desc = ext2_get_group_desc (sb, i, NULL);
-		if (!desc)
-			continue;
-		desc_count += le16_to_cpu(desc->bg_free_blocks_count);
-		brelse(bitmap_bh);
-		bitmap_bh = read_block_bitmap(sb, i);
-		if (!bitmap_bh)
-			continue;
-
-		if (ext2_bg_has_super(sb, i) &&
-				!ext2_test_bit(0, bitmap_bh->b_data))
-			ext2_error(sb, __FUNCTION__,
-				   "Superblock in group %d is marked free", i);
-
-		desc_blocks = ext2_bg_num_gdb(sb, i);
-		for (j = 0; j < desc_blocks; j++)
-			if (!ext2_test_bit(j + 1, bitmap_bh->b_data))
-				ext2_error(sb, __FUNCTION__,
-					   "Descriptor block #%ld in group "
-					   "%d is marked free", j, i);
-
-		if (!block_in_use(le32_to_cpu(desc->bg_block_bitmap),
-					sb, bitmap_bh->b_data))
-			ext2_error(sb, "ext2_check_blocks_bitmap",
-				    "Block bitmap for group %d is marked free",
-				    i);
-
-		if (!block_in_use(le32_to_cpu(desc->bg_inode_bitmap),
-					sb, bitmap_bh->b_data))
-			ext2_error(sb, "ext2_check_blocks_bitmap",
-				    "Inode bitmap for group %d is marked free",
-				    i);
-
-		for (j = 0; j < EXT2_SB(sb)->s_itb_per_group; j++)
-			if (!block_in_use(le32_to_cpu(desc->bg_inode_table) + j,
-						sb, bitmap_bh->b_data))
-				ext2_error (sb, "ext2_check_blocks_bitmap",
-					    "Block #%ld of the inode table in "
-					    "group %d is marked free", j, i);
-
-		x = ext2_count_free(bitmap_bh, sb->s_blocksize);
-		if (le16_to_cpu(desc->bg_free_blocks_count) != x)
-			ext2_error (sb, "ext2_check_blocks_bitmap",
-				    "Wrong free blocks count for group %d, "
-				    "stored = %d, counted = %lu", i,
-				    le16_to_cpu(desc->bg_free_blocks_count), x);
-		bitmap_count += x;
-	}
-	if (le32_to_cpu(es->s_free_blocks_count) != bitmap_count)
-		ext2_error (sb, "ext2_check_blocks_bitmap",
-			"Wrong free blocks count in super block, "
-			"stored = %lu, counted = %lu",
-			(unsigned long)le32_to_cpu(es->s_free_blocks_count),
-			bitmap_count);
-	brelse(bitmap_bh);
-}
-#endif
--- linux-2.6.14-mm1-full/fs/ext2/ialloc.c.old	2005-11-07 21:23:04.000000000 +0100
+++ linux-2.6.14-mm1-full/fs/ext2/ialloc.c	2005-11-07 21:23:13.000000000 +0100
@@ -700,43 +700,3 @@
 	return count;
 }
 
-#ifdef CONFIG_EXT2_CHECK
-/* Called at mount-time, super-block is locked */
-void ext2_check_inodes_bitmap (struct super_block * sb)
-{
-	struct ext2_super_block * es = EXT2_SB(sb)->s_es;
-	unsigned long desc_count = 0, bitmap_count = 0;
-	struct buffer_head *bitmap_bh = NULL;
-	int i;
-
-	for (i = 0; i < EXT2_SB(sb)->s_groups_count; i++) {
-		struct ext2_group_desc *desc;
-		unsigned x;
-
-		desc = ext2_get_group_desc(sb, i, NULL);
-		if (!desc)
-			continue;
-		desc_count += le16_to_cpu(desc->bg_free_inodes_count);
-		brelse(bitmap_bh);
-		bitmap_bh = read_inode_bitmap(sb, i);
-		if (!bitmap_bh)
-			continue;
-		
-		x = ext2_count_free(bitmap_bh, EXT2_INODES_PER_GROUP(sb) / 8);
-		if (le16_to_cpu(desc->bg_free_inodes_count) != x)
-			ext2_error (sb, "ext2_check_inodes_bitmap",
-				    "Wrong free inodes count in group %d, "
-				    "stored = %d, counted = %lu", i,
-				    le16_to_cpu(desc->bg_free_inodes_count), x);
-		bitmap_count += x;
-	}
-	brelse(bitmap_bh);
-	if (percpu_counter_read(&EXT2_SB(sb)->s_freeinodes_counter) !=
-				bitmap_count)
-		ext2_error(sb, "ext2_check_inodes_bitmap",
-			    "Wrong free inodes count in super block, "
-			    "stored = %lu, counted = %lu",
-			    (unsigned long)le32_to_cpu(es->s_free_inodes_count),
-			    bitmap_count);
-}
-#endif
--- linux-2.6.14-mm1-full/fs/ext2/super.c.old	2005-11-07 21:23:21.000000000 +0100
+++ linux-2.6.14-mm1-full/fs/ext2/super.c	2005-11-07 21:23:56.000000000 +0100
@@ -281,7 +281,7 @@
 enum {
 	Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid,
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
-	Opt_err_ro, Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug,
+	Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
 	Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
 	Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
 	Opt_usrquota, Opt_grpquota
@@ -303,7 +303,6 @@
 	{Opt_nouid32, "nouid32"},
 	{Opt_nocheck, "check=none"},
 	{Opt_nocheck, "nocheck"},
-	{Opt_check, "check"},
 	{Opt_debug, "debug"},
 	{Opt_oldalloc, "oldalloc"},
 	{Opt_orlov, "orlov"},
@@ -376,13 +375,6 @@
 		case Opt_nouid32:
 			set_opt (sbi->s_mount_opt, NO_UID32);
 			break;
-		case Opt_check:
-#ifdef CONFIG_EXT2_CHECK
-			set_opt (sbi->s_mount_opt, CHECK);
-#else
-			printk("EXT2 Check option not supported\n");
-#endif
-			break;
 		case Opt_nocheck:
 			clear_opt (sbi->s_mount_opt, CHECK);
 			break;
@@ -503,12 +495,6 @@
 			EXT2_BLOCKS_PER_GROUP(sb),
 			EXT2_INODES_PER_GROUP(sb),
 			sbi->s_mount_opt);
-#ifdef CONFIG_EXT2_CHECK
-	if (test_opt (sb, CHECK)) {
-		ext2_check_blocks_bitmap (sb);
-		ext2_check_inodes_bitmap (sb);
-	}
-#endif
 	return res;
 }
 
--- linux-2.6.14-mm1-full/fs/ext3/balloc.c.old	2005-11-07 21:24:04.000000000 +0100
+++ linux-2.6.14-mm1-full/fs/ext3/balloc.c	2005-11-07 21:26:53.000000000 +0100
@@ -1517,76 +1517,3 @@
 	return EXT3_SB(sb)->s_gdb_count;
 }
 
-#ifdef CONFIG_EXT3_CHECK
-/* Called at mount-time, super-block is locked */
-void ext3_check_blocks_bitmap (struct super_block * sb)
-{
-	struct ext3_super_block *es;
-	unsigned long desc_count, bitmap_count, x, j;
-	unsigned long desc_blocks;
-	struct buffer_head *bitmap_bh = NULL;
-	struct ext3_group_desc *gdp;
-	int i;
-
-	es = EXT3_SB(sb)->s_es;
-	desc_count = 0;
-	bitmap_count = 0;
-	gdp = NULL;
-	for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) {
-		gdp = ext3_get_group_desc (sb, i, NULL);
-		if (!gdp)
-			continue;
-		desc_count += le16_to_cpu(gdp->bg_free_blocks_count);
-		brelse(bitmap_bh);
-		bitmap_bh = read_block_bitmap(sb, i);
-		if (bitmap_bh == NULL)
-			continue;
-
-		if (ext3_bg_has_super(sb, i) &&
-				!ext3_test_bit(0, bitmap_bh->b_data))
-			ext3_error(sb, __FUNCTION__,
-				   "Superblock in group %d is marked free", i);
-
-		desc_blocks = ext3_bg_num_gdb(sb, i);
-		for (j = 0; j < desc_blocks; j++)
-			if (!ext3_test_bit(j + 1, bitmap_bh->b_data))
-				ext3_error(sb, __FUNCTION__,
-					   "Descriptor block #%ld in group "
-					   "%d is marked free", j, i);
-
-		if (!block_in_use (le32_to_cpu(gdp->bg_block_bitmap),
-						sb, bitmap_bh->b_data))
-			ext3_error (sb, "ext3_check_blocks_bitmap",
-				    "Block bitmap for group %d is marked free",
-				    i);
-
-		if (!block_in_use (le32_to_cpu(gdp->bg_inode_bitmap),
-						sb, bitmap_bh->b_data))
-			ext3_error (sb, "ext3_check_blocks_bitmap",
-				    "Inode bitmap for group %d is marked free",
-				    i);
-
-		for (j = 0; j < EXT3_SB(sb)->s_itb_per_group; j++)
-			if (!block_in_use (le32_to_cpu(gdp->bg_inode_table) + j,
-							sb, bitmap_bh->b_data))
-				ext3_error (sb, "ext3_check_blocks_bitmap",
-					    "Block #%d of the inode table in "
-					    "group %d is marked free", j, i);
-
-		x = ext3_count_free(bitmap_bh, sb->s_blocksize);
-		if (le16_to_cpu(gdp->bg_free_blocks_count) != x)
-			ext3_error (sb, "ext3_check_blocks_bitmap",
-				    "Wrong free blocks count for group %d, "
-				    "stored = %d, counted = %lu", i,
-				    le16_to_cpu(gdp->bg_free_blocks_count), x);
-		bitmap_count += x;
-	}
-	brelse(bitmap_bh);
-	if (le32_to_cpu(es->s_free_blocks_count) != bitmap_count)
-		ext3_error (sb, "ext3_check_blocks_bitmap",
-			"Wrong free blocks count in super block, "
-			"stored = %lu, counted = %lu",
-			(unsigned long)le32_to_cpu(es->s_free_blocks_count),
-			bitmap_count);
-}
-#endif
--- linux-2.6.14-mm1-full/fs/ext3/ialloc.c.old	2005-11-07 21:27:02.000000000 +0100
+++ linux-2.6.14-mm1-full/fs/ext3/ialloc.c	2005-11-07 21:27:09.000000000 +0100
@@ -756,44 +756,3 @@
 	return count;
 }
 
-#ifdef CONFIG_EXT3_CHECK
-/* Called at mount-time, super-block is locked */
-void ext3_check_inodes_bitmap (struct super_block * sb)
-{
-	struct ext3_super_block * es;
-	unsigned long desc_count, bitmap_count, x;
-	struct buffer_head *bitmap_bh = NULL;
-	struct ext3_group_desc * gdp;
-	int i;
-
-	es = EXT3_SB(sb)->s_es;
-	desc_count = 0;
-	bitmap_count = 0;
-	gdp = NULL;
-	for (i = 0; i < EXT3_SB(sb)->s_groups_count; i++) {
-		gdp = ext3_get_group_desc (sb, i, NULL);
-		if (!gdp)
-			continue;
-		desc_count += le16_to_cpu(gdp->bg_free_inodes_count);
-		brelse(bitmap_bh);
-		bitmap_bh = read_inode_bitmap(sb, i);
-		if (!bitmap_bh)
-			continue;
-
-		x = ext3_count_free(bitmap_bh, EXT3_INODES_PER_GROUP(sb) / 8);
-		if (le16_to_cpu(gdp->bg_free_inodes_count) != x)
-			ext3_error (sb, "ext3_check_inodes_bitmap",
-				    "Wrong free inodes count in group %d, "
-				    "stored = %d, counted = %lu", i,
-				    le16_to_cpu(gdp->bg_free_inodes_count), x);
-		bitmap_count += x;
-	}
-	brelse(bitmap_bh);
-	if (le32_to_cpu(es->s_free_inodes_count) != bitmap_count)
-		ext3_error (sb, "ext3_check_inodes_bitmap",
-			    "Wrong free inodes count in super block, "
-			    "stored = %lu, counted = %lu",
-			    (unsigned long)le32_to_cpu(es->s_free_inodes_count),
-			    bitmap_count);
-}
-#endif
--- linux-2.6.14-mm1-full/fs/ext3/super.c.old	2005-11-07 21:27:17.000000000 +0100
+++ linux-2.6.14-mm1-full/fs/ext3/super.c	2005-11-07 21:27:48.000000000 +0100
@@ -625,7 +625,7 @@
 enum {
 	Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid,
 	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
-	Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov,
+	Opt_nouid32, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov,
 	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
 	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh,
 	Opt_commit, Opt_journal_update, Opt_journal_inum,
@@ -652,7 +652,6 @@
 	{Opt_nouid32, "nouid32"},
 	{Opt_nocheck, "nocheck"},
 	{Opt_nocheck, "check=none"},
-	{Opt_check, "check"},
 	{Opt_debug, "debug"},
 	{Opt_oldalloc, "oldalloc"},
 	{Opt_orlov, "orlov"},
@@ -773,14 +772,6 @@
 		case Opt_nouid32:
 			set_opt (sbi->s_mount_opt, NO_UID32);
 			break;
-		case Opt_check:
-#ifdef CONFIG_EXT3_CHECK
-			set_opt (sbi->s_mount_opt, CHECK);
-#else
-			printk(KERN_ERR
-			       "EXT3 Check option not supported\n");
-#endif
-			break;
 		case Opt_nocheck:
 			clear_opt (sbi->s_mount_opt, CHECK);
 			break;
@@ -1115,12 +1106,6 @@
 	} else {
 		printk("internal journal\n");
 	}
-#ifdef CONFIG_EXT3_CHECK
-	if (test_opt (sb, CHECK)) {
-		ext3_check_blocks_bitmap (sb);
-		ext3_check_inodes_bitmap (sb);
-	}
-#endif
 	return res;
 }
 

From brice+ext3 at daysofwonder.com  Tue Nov  8 15:14:44 2005
From: brice+ext3 at daysofwonder.com (Brice Figureau)
Date: Tue, 08 Nov 2005 16:14:44 +0100
Subject: EXT3-fs error (device md2): ext3_journal_start_sb: Detected
	aborted journal...
Message-ID: <1131462884.7659.31.camel@localhost.localdomain>

Hi,

I'm running a production server (Debian Sarge install) whose root
filesystem (a software RAID 1 array of two IDE partitions)
exhibited the following problem:

Oct 28 06:00:06 server2 kernel: attempt to access beyond end of device
Oct 28 06:00:06 server2 kernel: md2: rw=1, want=3050401328, limit=16353920
[...] a few of the above lines snipped; want is different each time

Oct 28 06:00:06 server2 kernel: md2: rw=1, want=2323778952, limit=16353920
Oct 28 06:00:06 server2 kernel: printk: 2 messages suppressed.
Oct 28 06:00:06 server2 kernel: Buffer I/O error on device md2, logical block 3511697840
Oct 28 06:00:06 server2 kernel: lost page write due to I/O error on md2
Oct 28 06:00:06 server2 kernel: Aborting journal on device md2.
Oct 28 06:05:01 server2 kernel: ext3_abort called.
Oct 28 06:05:01 server2 kernel: EXT3-fs error (device md2): ext3_journal_start_sb: Detected aborted journal
Oct 28 06:05:01 server2 kernel: Remounting filesystem read-only

Now the root filesystem is remounted read-only.
Running fsck on it produces the following:

server2:~# e2fsck /dev/md2
e2fsck 1.37 (21-Mar-2005)
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s).  Clear<y>? yes

Illegal block #2371 (3939553560) in inode 8.  CLEARED.
Illegal block #2372 (2534662274) in inode 8.  CLEARED.
Illegal block #2373 (860109200) in inode 8.  CLEARED.
Illegal block #2374 (3289467369) in inode 8.  CLEARED.
Illegal block #2375 (3883044785) in inode 8.  CLEARED.
Illegal block #2376 (819724782) in inode 8.  CLEARED.
Illegal block #2377 (2957378758) in inode 8.  CLEARED.
Illegal block #2378 (1131441392) in inode 8.  CLEARED.
Illegal block #2379 (1473257247) in inode 8.  CLEARED.
Illegal block #2380 (2359314433) in inode 8.  CLEARED.
Illegal block #2381 (448867375) in inode 8.  CLEARED.
Too many illegal blocks in inode 8.
Clear inode<y>? yes

Restarting e2fsck from the beginning...
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s).  Clear<y>? 

and loops forever.

I know inode 8 is the journal inode.

I didn't try to reboot the server as I fear the recovery process would
not work and would need a human presence to force the fsck (see question
#1)

What can I do to remotely repair this root filesystem (as the server is
in a datacenter from which I'm far at the moment), and remount it rw ?

Thank you,
-- 
Brice Figureau


From adilger at clusterfs.com  Tue Nov  8 17:31:50 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 8 Nov 2005 10:31:50 -0700
Subject: EXT3-fs error (device md2): ext3_journal_start_sb: Detected
	aborted journal...
In-Reply-To: <1131462884.7659.31.camel@localhost.localdomain>
References: <1131462884.7659.31.camel@localhost.localdomain>
Message-ID: <20051108173150.GF12862@schatzie.adilger.int>

On Nov 08, 2005  16:14 +0100, Brice Figureau wrote:
> Restarting e2fsck from the beginning...
> Pass 1: Checking inodes, blocks, and sizes
> Inode 8 has illegal block(s).  Clear<y>? 
> 
> and loops forever.
> 
> I know inode 8 is the journal inode.
> 
> What can I do to remotely repair this root filesystem (as the server is
> in a datacenter from which I'm far at the moment), and remount it rw ?

Running 'debugfs -w -R "feature ^has_journal,^needs_recovery" /dev/md2'
should remove the journal from the filesystem, and then your e2fsck
may work.  Don't forget to add it back afterward with "tune2fs -j /dev/md2".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From rmunk at quake.Stanford.EDU  Wed Nov  9 02:09:47 2005
From: rmunk at quake.Stanford.EDU (Rasmus Munk Larsen)
Date: Tue, 08 Nov 2005 18:09:47 -0800
Subject: smarter sparse files?
Message-ID: <1131502187.12313.70.camel@akhenaten.Stanford.EDU>

Question: Does ext2/3 (or any other filesystem you know of) support
a system call turning blocks within a file back into "sparse zeros", 
i.e. giving the blocks back to the filesystem? 


Background:

I am working on a slotted file format where internal fragmentation 
occurs. One such occurrence is growth of the data in a given slot, 
which currently requires me to handle the fragmentation explicitly.
For example:

...==|== slot 1 ==|=== slot 2 ===|==...

Now assume that the contents of slot 1 are replaced with a larger chunk of 
data. I must either append additional data, e.g. at the end of the file

...==|== slot 1a =|=== slot2 ===|==...==|=== slot1b ==|

(and add my own data structures and code infrastructure to read 
fragmented slots) or leave the old (defunct) slot 1 data in place 
and garbage collect it later:

...==|= deadbeef =|=== slot2 ===|==...==|====== slot1' =====|

It's my impression that mechanisms for handling similar types of 
fragmentation are already implemented quite well in most modern 
filesystems, and hence I was wondering:

Question: Does ext2/3 (or any other filesystem you know of) support
turning blocks within a file back into "sparse zeros", i.e. giving 
the blocks back to the filesystem? 

If that was the case I could simply free the disk blocks belonging 
entirely to slot1 (as in turning it into zeros in a sparse file) and
append the new data at the end:

...==|XXXXXXXXXXXX|=== slot2 ===|==...==|===== slot1' =====|

and in effect having the file system do my garbage collection for me.
I would (probably very naively) think that this should be possible and
cheap since it only involves manipulating trees/freelists/whatever 
and perhaps "massaging" a small number of actual data blocks. 
(I should mention that my slots are typically much larger than a 
disk block: 100kB-1MB, say.) 

Simple example: On a file system supporting sparse files, the following

fd = open("sparse1", O_WRONLY | O_CREAT, 0644);
write(fd, buf, 10);
lseek(fd,1000000,SEEK_CUR);
write(fd, buf, 10);

will create a file occupying a small number of 
disk blocks. Ideally I would like to be able to do 
the following

fd = open("sparse2", O_WRONLY | O_CREAT, 0644);
write(fd, buf, 1000020);
lseek(fd,10,SEEK_SET);
giveback(fd, 1000000, SEEK_CUR);   
lseek(fd,1000010,SEEK_SET);
write(fd, buf, 10);

and end up with a sparse2 not much larger than 
sparse1. "giveback" is my imaginary system call
that tells the file system that n bytes starting 
at a given offset should no longer be considered part 
of the file and the associated blocks given back to the 
file system's freelist. I realize that this might not be
possible currently if the chunk you wish to free is 
not aligned with the start of a block etc. etc. 

A "cruder" interface, just giving whole disk blocks back 
would be acceptable, though.
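
The "sparse1" case above can be checked from user space. What follows is
only a minimal sketch, assuming a POSIX system; the function name
make_sparse, the CHUNK constant and the parameters are made up for
illustration and do not come from the original post. It writes 10 bytes,
seeks over a gap, writes 10 more, and hands back the stat data so the
allocated size can be compared with the apparent size.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 10	/* bytes written on each side of the hole */

/*
 * Create `path` holding CHUNK bytes of data, a hole of `gap` bytes and
 * CHUNK more bytes of data, like the "sparse1" example. On success the
 * file's metadata is stored in *st and 0 is returned; -1 on any error.
 */
int make_sparse(const char *path, off_t gap, struct stat *st)
{
	char buf[CHUNK];
	int fd, ok;

	assert(path != NULL && st != NULL && gap >= 0);
	memset(buf, 'x', sizeof(buf));

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* Seeking past EOF and then writing leaves a hole in between. */
	ok = write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf) &&
	     lseek(fd, gap, SEEK_CUR) != (off_t)-1 &&
	     write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf) &&
	     fstat(fd, st) == 0;

	if (close(fd) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

After make_sparse("sparse1", 1000000, &st), st.st_size should be 1000020.
On Linux st.st_blocks counts 512-byte units, so on a filesystem that
supports holes st.st_blocks * 512 is typically far below st.st_size; how
far depends on the block size and the filesystem.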

Your comments would be much appreciated,

Rasmus Munk Larsen, Stanford University.


From adilger at clusterfs.com  Tue Nov 15 19:28:59 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 15 Nov 2005 12:28:59 -0700
Subject: smarter sparse files?
In-Reply-To: <1131502187.12313.70.camel@akhenaten.Stanford.EDU>
References: <1131502187.12313.70.camel@akhenaten.Stanford.EDU>
Message-ID: <20051115192859.GB5831@schatzie.adilger.int>

On Nov 08, 2005  18:09 -0800, Rasmus Munk Larsen wrote:
> Question: Does ext2/3 (or any other filesystem you know of) support
> a system call turning blocks within a file back into "sparse zeros", 
> i.e. giving the blocks back to the filesystem? 

This is something that was implemented a long time ago, called "punch"
but never integrated into the core kernel.  It is essentially a form
of truncate that has an "end" parameter instead of removing all blocks
until EOF.

Implementing this is quite complex and I imagine it is much more complex
now than when we did it (maybe 1.2.x kernel days).  However, I believe
it is a useful interface and I think it would be used if it were available.
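
Note for later readers of this archive: an interface of this kind did
eventually reach mainline. Linux 2.6.38 and later accept the
FALLOC_FL_PUNCH_HOLE flag in fallocate(2), which must be combined with
FALLOC_FL_KEEP_SIZE. The sketch below assumes Linux with glibc and a
filesystem that implements the flag (ext4, XFS, Btrfs and tmpfs do in
sufficiently recent kernels; as far as I know ext3 never did). The name
punch_hole is chosen here for illustration and is not an existing API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Deallocate the byte range [offset, offset + len) of an open file while
 * leaving the file size unchanged; the range reads back as zeros.
 * Returns 0 on success, or -1 with errno set (EOPNOTSUPP when the
 * filesystem cannot punch holes).
 */
int punch_hole(int fd, off_t offset, off_t len)
{
	assert(offset >= 0 && len > 0);
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}
```

Only whole filesystem blocks inside the range are actually freed; partial
blocks at either end are zeroed in place, which matches the "cruder"
block-granular behaviour asked for in the original question.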

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From mb/ext3 at dcs.qmul.ac.uk  Wed Nov 16 08:52:44 2005
From: mb/ext3 at dcs.qmul.ac.uk (Matt Bernstein)
Date: Wed, 16 Nov 2005 08:52:44 +0000
Subject: (large, external) data journal BUG (Assertion failure in
 __journal_drop_transaction()
 at fs/jbd/checkpoint.c:626: "transaction->t_forget == NULL")
Message-ID: <437AF35C.3010106@dcs.qmul.ac.uk>

Hi,

A couple of our important servers, both running FC4 but one i386 and one 
x86_64, have been crashing recently. They both are running ext3 
data=journal with large external journals and high commit intervals. 
Both machines use the gdth driver for their hardware RAID sets, if 
that's of any use. I think the hardware is good in both cases.

I hope someone finds this data useful enough to be able to fix the bug.

IMAP server crash (once only, thus far):

Assertion failure in __journal_drop_transaction() at 
fs/jbd/checkpoint.c:626: "transaction->t_forget == NULL"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at "fs/jbd/checkpoint.c":626
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: loop iptable_nat ip_conntrack_amanda ipt_ULOG 
ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables w83627hf 
eeprom lm85 i2c_sensor i2c_isa md5 ipv6 video button battery ac ohci_hcd 
i2c_amd8111 i2c_amd756 i2c_core shpchp e100 mii tg3 floppy sg 
dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod gdth sata_sil libata 
sd_mod scsi_mod
Pid: 1485, comm: kjournald Not tainted 2.6.12-1.1398_FC4smp
RIP: 0010:[<ffffffff8807d56f>] 
<ffffffff8807d56f>{:jbd:__journal_drop_transaction+319}
RSP: 0018:ffff8100fade9de8  EFLAGS: 00010292
RAX: 0000000000000074 RBX: ffff8100c5f0ea80 RCX: ffffffff8042d908
RDX: ffffffff8042d908 RSI: 0000000000000296 RDI: ffffffff8042d900
RBP: ffff8100f8b55000 R08: ffff81008234c040 R09: 0000000000000030
R10: 0000000000000000 R11: ffffffff8011d680 R12: ffff81003b333080
R13: ffff8100c5f0ea80 R14: ffff8100f8b55000 R15: 0000000000000000
FS:  00002aaaaadfcf00(0000) GS:ffffffff8050d780(0000) knlGS:00000000f7ff16c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaab51a0000 CR3: 00000000e2980000 CR4: 00000000000006e0
Process kjournald (pid: 1485, threadinfo ffff8100fade8000, task 
ffff8100fb9be880)
Stack: ffff8100020ba898 ffff81008caebce8 0000000000000000 ffffffff8807c9d2
        ffff8100f8b55024 0000000000000cf7 ffff8100f8b5515c 0000000000000000
        0000000000000000 0000000000000000
Call Trace:<ffffffff8807c9d2>{:jbd:journal_commit_transaction+4194}
        <ffffffff801439f1>{del_timer+113} 
<ffffffff8807f4d3>{:jbd:kjournald+275}
        <ffffffff8807eba0>{:jbd:commit_timeout+0} 
<ffffffff801506e0>{autoremove_wake_function+0}
        <ffffffff8010f76b>{child_rip+8} <ffffffff8807f3c0>{:jbd:kjournald+0}
        <ffffffff8010f763>{child_rip+0}

Code: 0f 0b fe 15 08 88 ff ff ff ff 72 02 48 83 7b 50 00 74 34 49
RIP <ffffffff8807d56f>{:jbd:__journal_drop_transaction+319} RSP 
<ffff8100fade9de8>
  <3>Debug: sleeping function called from invalid context at 
include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1

Call Trace:<ffffffff8013abd5>{profile_task_exit+21} 
<ffffffff8013bff2>{do_exit+34}
        <ffffffff8022178d>{vgacon_cursor+221} <ffffffff8011066d>{die+77}
        <ffffffff80111203>{do_invalid_op+163} 
<ffffffff8807d56f>{:jbd:__journal_drop_transaction+319}
        <ffffffff8010f5b5>{error_exit+0} 
<ffffffff8011d680>{flat_send_IPI_mask+0}
        <ffffffff8807d56f>{:jbd:__journal_drop_transaction+319}
        <ffffffff8807d56f>{:jbd:__journal_drop_transaction+319}
        <ffffffff8807c9d2>{:jbd:journal_commit_transaction+4194}
        <ffffffff801439f1>{del_timer+113} 
<ffffffff8807f4d3>{:jbd:kjournald+275}
        <ffffffff8807eba0>{:jbd:commit_timeout+0} 
<ffffffff801506e0>{autoremove_wake_function+0}
        <ffffffff8010f76b>{child_rip+8} <ffffffff8807f3c0>{:jbd:kjournald+0}
        <ffffffff8010f763>{child_rip+0}

File server crash (has happened a few times now):

Assertion failure in __journal_drop_transaction() at 
fs/jbd/checkpoint.c:626: "transaction->t_forget == NULL"
------------[ cut here ]------------
kernel BUG at fs/jbd/checkpoint.c:626!
invalid operand: 0000 [#1]
SMP
Modules linked in: loop nfsd exportfs lockd nfs_acl sunrpc autofs4 ipv6 
ip_conntrack_amanda ipt_REJECT ipt_state ip_conntrack iptable_filter 
ip_tables dm_mod video button battery ac ohci_hcd i2c_amd756 i2c_core 
3c59x mii ns83820 floppy sg ext3 jbd gdth sd_mod scsi_mod
CPU:    0
EIP:    0060:[<f88a997c>]    Not tainted VLI
EFLAGS: 00010296   (2.6.13-1.1526_FC4smp)
EIP is at __journal_drop_transaction+0x117/0x2fa [jbd]
eax: 00000074   ebx: f064d2e0   ecx: c036fbf4   edx: 00000286
esi: f699a200   edi: c2f50000   ebp: e775df84   esp: c2f50ec4
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 1168, threadinfo=c2f50000 task=c2e64020)
Stack: f88acfa8 f88b2e92 f88ada14 00000272 f88ada7c f064d2e0 f699a200 
f88a9781
        c2f50000 d142414c e775df84 f88a8f61 e775df84 f88a9700 c2f50000 
ecb98e60
        f064d2e0 000000f5 e85cc160 defc4598 f699a200 00000000 defc4560 
f88a7846
Call Trace:
  [<f88a9781>] __journal_remove_checkpoint+0x56/0x75 [jbd]
  [<f88a8f61>] __try_to_free_cp_buf+0x31/0x68 [jbd]
  [<f88a9700>] __journal_clean_checkpoint_list+0x6f/0x9a [jbd]
  [<f88a7846>] journal_commit_transaction+0x147/0xff1 [jbd]
  [<c01295f7>] lock_timer_base+0x15/0x2f
  [<c0129803>] try_to_del_timer_sync+0x45/0x4d
  [<f88aa68b>] kjournald+0xc5/0x20d [jbd]
  [<f88aa5c0>] commit_timeout+0x0/0x5 [jbd]
  [<c01347c2>] autoremove_wake_function+0x0/0x37
  [<f88aa5c6>] kjournald+0x0/0x20d [jbd]
  [<c0101ca1>] kernel_thread_helper+0x5/0xb
Code: 44 24 10 7c da 8a f8 c7 44 24 0c 72 02 00 00 c7 44 24 08 14 da 8a 
f8 c7 44 24 04 92 2e 8b f8 c7 04 24 a8 cf 8a f8 e8 cb 7c 87 c7 <0f> 0b 
72 02 14 da 8a f8 8b 4b 2c 85 c9 74 34 c7 44 24 10 c4 d0


From tobias.orlamuende at googlemail.com  Thu Nov 17 15:27:53 2005
From: tobias.orlamuende at googlemail.com (=?ISO-8859-1?Q?Tobias_Orlam=FCnde?=)
Date: Thu, 17 Nov 2005 16:27:53 +0100
Subject: ext3-image doesn't mount anymore and reports errors
Message-ID: <b5b21f0511170727k2f7318a2x@mail.gmail.com>

Hi folks,

we made an image of a partition by using dd. Original filesystem is
ext3 (4k block-size).
My colleague was able to mount this image once (using mount with "-o loop").
Since then anytime we try to mount it, it ends in the following error-message:

ioctl: LOOP_CLR_FD: Device or resource busy
mount: you must specify the filesystem type

We also tried to mount it on another system than our backup-machine -
without success but with the same error.

Fsck.ext3 ends in lots of inode-errors and the following one:

Error while iterating over blocks in inode 131736: Illegal triply
indirect block found
e2fsck: aborted

Using an alternative superblock (32768) results in the same error.

Due to the fact that this is our only backup of this machine it is
really important for us at least to read this image and get some data
off it.
We are also willing to pay a fair amount of money for recovery.

Is somebody able to help out quickly in this situation?

Regards

Tobias

PS: Please don't blame me for this backup-strategy! :-)


From tobias.orlamuende at googlemail.com  Thu Nov 17 15:38:42 2005
From: tobias.orlamuende at googlemail.com (=?ISO-8859-1?Q?Tobias_Orlam=FCnde?=)
Date: Thu, 17 Nov 2005 16:38:42 +0100
Subject: ext3-image doesn't mount anymore and reports errors
Message-ID: <b5b21f0511170738x4557afb7y@mail.gmail.com>

Hi folks,

Please excuse me if this message comes through twice. It seems I am having
some trouble with this Gmail account.

We made an image of a partition by using dd. Original filesystem is
ext3 (4k block-size).
My colleague was able to mount this image once (using mount with "-o loop").
Since then anytime we try to mount it, it ends in the following error-message:

ioctl: LOOP_CLR_FD: Device or resource busy
mount: you must specify the filesystem type

We also tried to mount it on another system than our backup-machine -
without success but with the same error.

Fsck.ext3 ends in lots of inode-errors and the following one:

Error while iterating over blocks in inode 131736: Illegal triply
indirect block found
e2fsck: aborted

Using an alternative superblock (32768) results in the same error.

Due to the fact that this is our only backup of this machine it is
really important for us at least to read this image and get some data
off it.
We are also willing to pay a fair amount of money for recovery.

Is somebody able to help out quickly in this situation?

Regards

Tobias

PS: Please don't blame me for this backup-strategy! :-)


From evilninja at gmx.net  Thu Nov 17 16:26:13 2005
From: evilninja at gmx.net (evilninja at gmx.net)
Date: Thu, 17 Nov 2005 17:26:13 +0100
Subject: ext3-image doesn't mount anymore and reports errors
In-Reply-To: <b5b21f0511170738x4557afb7y@mail.gmail.com>
References: <b5b21f0511170738x4557afb7y@mail.gmail.com>
Message-ID: <437CAF25.8060606@gmx.net>

Tobias Orlamünde wrote:
> We made an image of a partition by using dd. Original filesystem is
> ext3 (4k block-size).

was it really a backup of a partition or did you backup a whole disk?
(dd if=/dev/hda1 vs. dd if=/dev/hda)

> Fsck.ext3 ends in lots of inode-errors and the following one:
> 
> Error while iterating over blocks in inode 131736: Illegal triply
> indirect block found
> e2fsck: aborted

please make sure to use a current version of e2fsprogs and a current kernel.

> We are also willing to pay a fair amount of money for recovery.

hm, a couple of weeks ago i too was in the need of ext3-recovery and 
some *really* famous recovery-specialist told me: "ext3? no, not 
possible at all." (except for grep'ing through the fs as "usual")

Christian.
-- 
BOFH excuse #416:

We're out of slots on the server


From dahernemtallah at hotmail.com  Thu Nov 17 21:43:53 2005
From: dahernemtallah at hotmail.com (Nemtallah Daher)
Date: Thu, 17 Nov 2005 16:43:53 -0500
Subject: Ext3 bad magic after upgrage FC1 to FC4
Message-ID: <BAY104-F3365A239E391AD4A7DC2F1CF5F0@phx.gbl>

Dear All,

I have a server with a 160G disk with one partition, /dev/hde1.  The PC had FC1 
and everything was working fine.  I decided to do an upgrade to FC4.  Now I 
can no longer mount that partition.  I don't think anything happened to the 
file system, but changes in the kernel and modules due to the upgrade are now 
making it inaccessible.

I tried:

mke2fs -n /dev/hde1
  got a list of superblocks and tried all with no luck

e2fsck -b xxxxxx /dev/hde1
  bad magic or superblock

I am sick over this and would appreciate any advice and guidance.  Thank 
you.


From evilninja at gmx.net  Fri Nov 18 01:16:44 2005
From: evilninja at gmx.net (evilninja at gmx.net)
Date: Fri, 18 Nov 2005 02:16:44 +0100
Subject: Ext3 bad magic after upgrage FC1 to FC4
In-Reply-To: <BAY104-F3365A239E391AD4A7DC2F1CF5F0@phx.gbl>
References: <BAY104-F3365A239E391AD4A7DC2F1CF5F0@phx.gbl>
Message-ID: <437D2B7C.2070603@gmx.net>

Nemtallah Daher wrote:
> I have server with a 160G disk with one partition /dev/hde1.  The PC had 
> FC1 and everything was working fine.  I decided to do an upgrade to 
> FC4.  Now I can no longer mount that partition.  I don't think anything 

if it really is a kernel problem: can you downgrade to a previous kernel 
then? eg. FC1's or FC3's kernel.

are there any errors in the syslog? with a new kernel, the disk-driver 
probably got upgraded too...

-- 
BOFH excuse #313:

your process is not ISO 9000 compliant


From bunk at stusta.de  Fri Nov 18 03:34:00 2005
From: bunk at stusta.de (Adrian Bunk)
Date: Fri, 18 Nov 2005 04:34:00 +0100
Subject: [2.6 patch] fs/ext3/: small cleanups
Message-ID: <20051118033359.GX11494@stusta.de>

This patch contains the following cleanups:
- there's no need for ext3_count_free() #ifndef EXT3FS_DEBUG
- having prototypes for ext3_count_free() in two different headers is
  nonsense


Signed-off-by: Adrian Bunk <bunk at stusta.de>

---

 fs/ext3/balloc.c |    2 --
 fs/ext3/bitmap.c |    8 +++++++-
 fs/ext3/bitmap.h |    8 --------
 fs/ext3/ialloc.c |    1 -
 4 files changed, 7 insertions(+), 12 deletions(-)

--- linux-2.6.15-rc1-mm1-full/fs/ext3/bitmap.c.old	2005-11-18 02:52:02.000000000 +0100
+++ linux-2.6.15-rc1-mm1-full/fs/ext3/bitmap.c	2005-11-18 02:54:14.000000000 +0100
@@ -7,8 +7,11 @@
  * Universite Pierre et Marie Curie (Paris VI)
  */
 
+#ifdef EXT3FS_DEBUG
+
 #include <linux/buffer_head.h>
-#include "bitmap.h"
+
+#include "ext3_fs.h"
 
 static int nibblemap[] = {4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0};
 
@@ -24,3 +27,6 @@
 			nibblemap[(map->b_data[i] >> 4) & 0xf];
 	return (sum);
 }
+
+#endif  /*  EXT3FS_DEBUG  */
+
--- linux-2.6.15-rc1-mm1-full/fs/ext3/balloc.c.old	2005-11-18 02:52:55.000000000 +0100
+++ linux-2.6.15-rc1-mm1-full/fs/ext3/balloc.c	2005-11-18 02:53:02.000000000 +0100
@@ -20,8 +20,6 @@
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
 
-#include "bitmap.h"
-
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
--- linux-2.6.15-rc1-mm1-full/fs/ext3/ialloc.c.old	2005-11-18 02:53:26.000000000 +0100
+++ linux-2.6.15-rc1-mm1-full/fs/ext3/ialloc.c	2005-11-18 02:53:31.000000000 +0100
@@ -26,7 +26,6 @@
 
 #include <asm/byteorder.h>
 
-#include "bitmap.h"
 #include "xattr.h"
 #include "acl.h"
 
--- linux-2.6.15-rc1-mm1-full/fs/ext3/bitmap.h	2005-11-17 21:30:48.000000000 +0100
+++ /dev/null	2005-11-08 19:07:57.000000000 +0100
@@ -1,8 +0,0 @@
-/*  linux/fs/ext3/bitmap.c
- *
- * Copyright (C) 2005 Simtec Electronics
- *	Ben Dooks <ben at simtec.co.uk>
- *
-*/
-
-extern unsigned long ext3_count_free (struct buffer_head *, unsigned int );
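
For readers unfamiliar with the nibblemap table in bitmap.c: each entry
is the number of zero (free) bits in the corresponding 4-bit value, so a
byte is counted with two table lookups. The following is a user-space
sketch of that technique only, not the kernel function; count_free and
its plain byte-array argument are stand-ins chosen here in place of
ext3_count_free() and struct buffer_head.

```c
#include <assert.h>
#include <stddef.h>

/* Number of zero bits in each 4-bit value, same table as in bitmap.c. */
static const int nibblemap[16] = {4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0};

/*
 * Count the zero bits in the first `numchars` bytes of `bitmap` by
 * looking up the low and the high nibble of every byte.
 */
unsigned long count_free(const unsigned char *bitmap, size_t numchars)
{
	unsigned long sum = 0;
	size_t i;

	assert(bitmap != NULL || numchars == 0);
	for (i = 0; i < numchars; i++)
		sum += nibblemap[bitmap[i] & 0xf] +
		       nibblemap[(bitmap[i] >> 4) & 0xf];
	return sum;
}
```

In a block or inode bitmap a zero bit means "free", which is why the
table counts zeros rather than ones.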


From jt at domainfactory.de  Fri Nov 18 16:03:25 2005
From: jt at domainfactory.de (Jochen Tuchbreiter)
Date: Fri, 18 Nov 2005 17:03:25 +0100
Subject: e2fsck not detecting corrupt file?
Message-ID: <02a801c5ec59$9b4a27e0$6e0aa8c0@buero3>

Hello,

on my ext3 fs I have a file that I can not modify anymore:

$ who am i
root     pts/0        Nov 18 19:42 (192.168.10.110)
$ ls -al /mnt/path/usage_200306.html
-rw-r-xrw-  1 50946 nobody 99935 Jul  1  2003 /mnt/path/usage_200306.html
$ rm /mnt/path/usage_200306.html
rm: remove regular file `/mnt/path/usage_200306.html'? y
rm: cannot remove `/mnt/path/usage_200306.html': Operation not permitted
$ chmod a+x /mnt/path/usage_200306.html
chmod: changing permissions of `/mnt/path/usage_200306.html': Operation not
permitted

The fs is NOT mounted readonly, I can move around / change other files on
the partition.


However running a full e2fsck -f on the fs does not find any problem.

$ ./e2fsck/e2fsck -f /dev/sdb6
e2fsck 1.38 (30-Jun-2005)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
$ ./e2fsck/e2fsck -V
e2fsck 1.38 (30-Jun-2005)
        Using EXT2FS Library version 1.38, 30-Jun-2005
$


$ stat /mnt/path/usage_200306.html
  File: `/mnt/path/usage_200306.html'
  Size: 99935           Blocks: 160        IO Block: 4096   regular file
Device: 816h/2070d      Inode: 2149222     Links: 1
Access: (0656/-rw-r-xrw-)  Uid: (50946/ UNKNOWN)   Gid: (   99/  nobody)
Access: 2005-11-19 00:36:35.000000000 +0100
Modify: 2003-07-01 04:55:16.000000000 +0200
Change: 2004-02-14 02:35:36.000000000 +0100


$ uname -a
Linux machine 2.4.29-grsec #10 SMP Mon Jul 4 14:26:46 CEST 2005 i686
Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel GNU/Linux


This is not a hard disk problem: dd'ed the whole partition from one disk to
a brand new one and had the same problem on the new disk.


Do you guys have any suggestions on how to further diagnose or fix this? I
also tried it on a 2.6 kernel without grsec, same result.


best regards,
Jochen


From alex at alex.org.uk  Fri Nov 18 16:25:44 2005
From: alex at alex.org.uk (Alex Bligh)
Date: Fri, 18 Nov 2005 16:25:44 +0000
Subject: e2fsck not detecting corrupt file?
In-Reply-To: <02a801c5ec59$9b4a27e0$6e0aa8c0@buero3>
References: <02a801c5ec59$9b4a27e0$6e0aa8c0@buero3>
Message-ID: <EBD03AE5A5532FEBA87D92AE@[192.168.100.25]>


--On 18 November 2005 17:03 +0100 Jochen Tuchbreiter <jt at domainfactory.de> 
wrote:

> $ chmod a+x /mnt/path/usage_200306.html
> chmod: changing permissions of `/mnt/path/usage_200306.html': Operation
> not permitted
>
> The fs is NOT mounted readonly, I can move around / change other files on
> the partition.

Is it marked as immutable (somehow)? As root, try:
 chattr -i <file>

Alex


From jt at domainfactory.de  Fri Nov 18 16:33:00 2005
From: jt at domainfactory.de (Jochen Tuchbreiter)
Date: Fri, 18 Nov 2005 17:33:00 +0100
Subject: e2fsck not detecting corrupt file?
In-Reply-To: <EBD03AE5A5532FEBA87D92AE@[192.168.100.25]>
Message-ID: <02b801c5ec5d$bd4ed120$6e0aa8c0@buero3>

Hello,

> > The fs is NOT mounted readonly, I can move around / change 
> other files on
> > the partition.
> 
> Is it marked as immutable (somehow)? As root, try:
>  chattr -i <file>

That's it: It had really strange attr-settings (+a +c). After removing the
"append only" it now works. I wonder how this happened, maybe I had a
hardware problem on the disk some time ago.
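
The attributes that chattr sets live in the inode's flags word, which is
why e2fsck sees nothing wrong: the file is consistent, it is just marked
append-only. A minimal sketch of reading those flags from C on Linux
follows, assuming the FS_IOC_GETFLAGS ioctl from <linux/fs.h> is
supported by the filesystem; the function names are made up for this
example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Non-zero if the flags word has append-only (chattr +a) or immutable
 * (chattr +i) set. Either one makes rm and chmod fail with
 * "Operation not permitted", even for root.
 */
int flags_block_changes(int flags)
{
	return (flags & (FS_APPEND_FL | FS_IMMUTABLE_FL)) != 0;
}

/*
 * Read the inode flags of `path` (the bits lsattr displays) into *flags.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int read_inode_flags(const char *path, int *flags)
{
	int fd, rc, saved_errno;

	assert(path != NULL && flags != NULL);
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd < 0)
		return -1;
	rc = ioctl(fd, FS_IOC_GETFLAGS, flags);
	saved_errno = errno;
	close(fd);
	errno = saved_errno;	/* report the ioctl's error, not close()'s */
	return rc;
}
```

For the file in this thread, read_inode_flags() would be expected to
return a word with FS_APPEND_FL and FS_COMPR_FL set (+a +c), and
flags_block_changes() would then be non-zero.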

Thank you very much Alex!

regards,
Jochen


From puhuri at iki.fi  Tue Nov 22 11:08:00 2005
From: puhuri at iki.fi (Markus Peuhkuri)
Date: Tue, 22 Nov 2005 13:08:00 +0200
Subject: Doing fsck on shutdown
Message-ID: <4382FC10.2080606@iki.fi>

I usually shut down my computer for the night (I could probably use software
shutdown, but have not yet studied it).  In that case, the disk is
checked quite often with default settings, usually at times when I am
in a hurry and want the computer to start up fast :-).

One alternative would be to run fsck at shutdown if an fsck is due
within a few mounts.  One could abort it if one wants the computer to shut down
fast, but in the normal case one could just let it run and the computer would
shut itself down afterwards.

Has anyone designed initscripts for that?

ps. another issue regarding mount counts is automounting USB disks
with an ext3 file system.  If one uses an automounter, one rapidly
accumulates mount counts towards the next fsck.  Of course, it is possible to set
the counter to zero and run fsck based only on time.  Any opinions on that?


From tytso at mit.edu  Tue Nov 22 15:21:19 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 22 Nov 2005 10:21:19 -0500
Subject: Doing fsck on shutdown
In-Reply-To: <4382FC10.2080606@iki.fi>
References: <4382FC10.2080606@iki.fi>
Message-ID: <20051122152119.GD29179@thunk.org>

On Tue, Nov 22, 2005 at 01:08:00PM +0200, Markus Peuhkuri wrote:
> I usually shutdown computer for night (could probably use software
> shutdown, but have not yet studied it).  In that case, the disk is
> checked quite often with default settings, usually in those cases I am
> in hurry and want computer to start up fast :-).
> 
> One alternative would be trying to run fsck at shutdown if fsck is due
> in a few mounts.  One could abort that if one wants computer to shutdown
> fast, but in normal case one could just allow it and then computer would
> later shut itself down.
> 
> Has anyone designed initscripts for that?

That's a good/interesting idea.  One suggestion; if you do this, make
sure you check to see if you are running on batteries; if you are,
it's likely that you might be in a situation such as a laptop on an
airplane and the airline attendant has just told you to shut down all
electronics in preparation for landing --- or the laptop has just
reported that you only have 3% battery life left, and please shut down
now.  

Doing a 3-5 minute fsck run at shutdown isn't always the
right thing....
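
A minimal sketch of such a battery check, assuming a kernel that exposes
power supplies under /sys/class/power_supply with "type" and "online"
files (older kernels report AC state elsewhere, so treat the path as an
assumption).  The function name on_mains is made up for illustration.

```shell
#!/bin/sh
# on_mains [SYSFS_DIR]
# Succeeds if a supply of type "Mains" reports online=1, or if no
# "Mains" supply is listed at all (assumed here to mean a desktop).
# SYSFS_DIR defaults to /sys/class/power_supply (an assumption).
on_mains() {
    dir="${1:-/sys/class/power_supply}"
    found=0
    for supply in "$dir"/*; do
        [ -r "$supply/type" ] && [ -r "$supply/online" ] || continue
        [ "$(cat "$supply/type")" = "Mains" ] || continue
        found=1
        [ "$(cat "$supply/online")" = "1" ] && return 0
    done
    [ "$found" -eq 0 ]
}

# A shutdown script could then skip the check when on battery:
#   on_mains || echo "on battery, skipping fsck at shutdown"
```

This says nothing about how much charge is left; it only tells mains
from battery.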

						- Ted


From bryan at kadzban.is-a-geek.net  Thu Nov 17 17:53:22 2005
From: bryan at kadzban.is-a-geek.net (Bryan Kadzban)
Date: Thu, 17 Nov 2005 12:53:22 -0500
Subject: ext3-image doesn't mount anymore and reports errors
In-Reply-To: <b5b21f0511170727k2f7318a2x@mail.gmail.com>
References: <b5b21f0511170727k2f7318a2x@mail.gmail.com>
Message-ID: <20051117175322.GA5511@kadzban.is-a-geek.net>

On Thu, Nov 17, 2005 at 04:27:53PM +0100, Tobias Orlamünde wrote:
> My colleague was able to mount this image once (using mount with "-o loop").
> Since then anytime we try to mount it, it ends in the following error-message:
> 
> ioctl: LOOP_CLR_FD: Device or resource busy
> mount: you must specify the filesystem type

Looking at the loop driver (drivers/block/loop.c), the handler for
LOOP_CLR_FD checks a ref-count on the loop device.  If the ref-count is
bigger than 1 (the ioctl call holds a reference), it returns -EBUSY,
which corresponds to the error you're getting (device or resource busy).

Does anything else on the system have a handle open to the loop device
file?  What about the image file?  Do you have any other loopback-mounts
running at the time?  Does it help to manually do the losetup operation
on a known-free loop device, then mount the loop device itself (without
-o loop)?  (You will have to "losetup -d" the device after you unmount
it, also -- normally umount handles that.)

What does "losetup -f" say?
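
The manual sequence suggested above could look roughly like this.  The
helper names (run, loop_mount, loop_umount), the DRY_RUN switch and the
example paths are made up for illustration, and "-t ext3" is added only
because the image is said to be ext3.

```shell
#!/bin/sh
# With DRY_RUN set, commands are printed instead of executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# loop_mount IMAGE LOOPDEV MOUNTPOINT
# Attach IMAGE to a loop device known to be free, then mount the
# device itself (no "-o loop", so mount sets up no loop device).
loop_mount() {
    run losetup "$2" "$1" &&
    run mount -t ext3 "$2" "$3"
}

# loop_umount LOOPDEV MOUNTPOINT
# Unmount, then detach by hand, since mount did not set the device up.
loop_umount() {
    run umount "$2" &&
    run losetup -d "$1"
}

# Example (as root; image, device and mount point are placeholders):
#   loop_mount disk.img /dev/loop0 /mnt
#   loop_umount /dev/loop0 /mnt
```

If the losetup step itself fails with "Device or resource busy", that
points at the loop device rather than at the filesystem in the image.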


From evil at g-house.de  Thu Nov 24 03:00:52 2005
From: evil at g-house.de (Christian)
Date: Thu, 24 Nov 2005 04:00:52 +0100
Subject: Doing fsck on shutdown
In-Reply-To: <4382FC10.2080606@iki.fi>
References: <4382FC10.2080606@iki.fi>
Message-ID: <43852CE4.6040202@g-house.de>

Markus Peuhkuri wrote:
> I usually shut down the computer for the night (could probably use
> software shutdown, but have not yet studied it).  In that case, the disk
> is checked quite often with default settings; usually in those cases I am
> in a hurry and want the computer to start up fast :-).

For frequently rebooted desktop systems I'd just use tune2fs(8) to make
fsck run at a given time interval rather than after a given number of
mounts:

% tune2fs -i 1m /dev/sda1
   ....will check sda1 every month.
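
Note that "-i" alone leaves the mount-count check enabled.  A sketch of a
time-only setup, using "-c 0", which tune2fs(8) documents as disregarding
the mount count.  The wrapper name time_based_fsck and the DRY_RUN switch
are made up for illustration.

```shell
#!/bin/sh
# time_based_fsck DEVICE [INTERVAL]
# Turn off the mount-count trigger (-c 0) and check every INTERVAL
# (default 1m = one month).  With DRY_RUN set, only print the command.
# Hypothetical wrapper around tune2fs, which needs root to run.
time_based_fsck() {
    set -- tune2fs -c 0 -i "${2:-1m}" "$1"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example: DRY_RUN=1 time_based_fsck /dev/sda1
```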

Christian.
-- 
BOFH excuse #341:

HTTPD Error 666 : BOFH was here


From linux at horizon.com  Thu Nov 24 21:42:57 2005
From: linux at horizon.com (linux at horizon.com)
Date: 24 Nov 2005 16:42:57 -0500
Subject: Assertion failure in ext3_sync_file() at fs/ext3/fsync.c:50:
	"ext3_journal_current_handle() == 0"
Message-ID: <20051124214257.4673.qmail@science.horizon.com>

------------[ cut here ]------------
kernel BUG at fs/ext3/fsync.c:50!
invalid operand: 0000 [#1]
CPU:    0
EIP:    0060:[<b0187d38>]    Not tainted VLI
EFLAGS: 00010296   (2.6.13.1)
EIP is at ext3_sync_file+0x58/0xf0
eax: 00000068   ebx: bf4a479c   ecx: b03cffac   edx: b03cffac
esi: b0398cfc   edi: b2b8f1c8   ebp: c13bcf60   esp: c13bcf18
ds: 007b   es: 007b   ss: 0068
Process aptitude (pid: 26952, threadinfo=c13bc000 task=d99cca80)
Stack: b0398afc b0383f40 b0395746 00000032 b0398cfc 00000000 00000000 e84824c0
       ca281dc0 bf4a483c c13bcf60 b01317c2 bf4a483c 00000000 00000000 e84824c0
       ffffffe4 bf4a483c c13bcf80 b01458ce e84824c0 dce74d9c 00000001 a7004000
Call Trace:
 [<b01032cb>] show_stack+0xab/0xf0
 [<b0103494>] show_registers+0x164/0x200
 [<b01036a8>] die+0xc8/0x150
 [<b01037b9>] do_trap+0x89/0xd0
 [<b0103b1a>] do_invalid_op+0xaa/0xc0
 [<b0102eef>] error_code+0x4f/0x54
 [<b01458ce>] msync_interval+0x8e/0xd0
 [<b0145a6f>] sys_msync+0x15f/0x171
 [<b0102c69>] syscall_call+0x7/0xb
Code: ba 46 57 39 b0 be fc 8c 39 b0 b8 40 3f 38 b0 89 74 24 10 89 4c 24 0c 89 54 24 08 89 44 24 04 c7 04 24 fc 8a 39 b0 e8 08 10 f9 ff <0f> 0b 32 00 46 57 39 b0 0f b7 43 28 25 00 f0 00 00 3d 00 80 00


x86, uniprocessor, 2.6.13.1, ext3 file system, data=ordered, 6-way RAID-1.
Kernel is stock except for ppskit-lite patches.

This is the usually-not-mounted emergency rescue partition which
contains disaster recovery tools.  Thus, the somewhat paranoid
data integrity settings.

The FS just filled up as I was doing the every-few-months update.

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md1                432312    425732         0 100% /boot

I'm currently copying a raw device snapshot which I can make available to
anyone who promises not to go grepping for secrets on it.  I don't think
there are any, but hunting through the whole image and maybe zeroing a
few data blocks is a bit of a PITA.


Anyway, thanks for what has usually been a very reliable file system!
I hope there's enough info here to find the problem.


Here's the tune2fs -l output.  No idea why it says "clean"; it is still
mounted read/write.

tune2fs 1.38 (30-Jun-2005)
Filesystem volume name:   <none>
Last mounted on:          /boot
Filesystem UUID:          ad036960-f1df-4c5e-9240-4e917527f20c
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal filetype needs_recovery sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Remount read-only
Filesystem OS type:       Linux
Inode count:              55296
Block count:              439360
Reserved block count:     21968
Free blocks:              91538
Free inodes:              30975
First block:              1
Block size:               1024
Fragment size:            1024
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         1024
Inode blocks per group:   128
Last mount time:          Thu Nov 24 20:52:16 2005
Last write time:          Thu Nov 24 20:52:16 2005
Mount count:              6
Maximum mount count:      34
Last checked:             Sat Aug  6 04:39:08 2005
Check interval:           15552000 (6 months)
Next check after:         Thu Feb  2 04:39:08 2006
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Journal backup:           inode blocks


From evilninja at gmx.net  Sat Nov 26 22:44:42 2005
From: evilninja at gmx.net (Christian)
Date: Sat, 26 Nov 2005 23:44:42 +0100 (CET)
Subject: Assertion failure in ext3_sync_file() at fs/ext3/fsync.c:50: 
 "ext3_journal_current_handle() == 0"
In-Reply-To: <20051124214257.4673.qmail@science.horizon.com>
References: <20051124214257.4673.qmail@science.horizon.com>
Message-ID: <26063.195.126.66.126.1133045082.squirrel@housecafe.dyndns.org>

On Thu, November 24, 2005 22:42, linux at horizon.com wrote:
> ------------[ cut here ]------------
> kernel BUG at fs/ext3/fsync.c:50!
> invalid operand: 0000 [#1]
> CPU:    0

is this error reproducible?

> Here's the tune2fs -l output.  No idea why it says "clean"; it is still
> mounted read/write.

i don't know if this is "ok" (don't have the docs atm) but does e2fsck
report anything to worry about?

thanks,
Christian
-- 
make bzImage, not war


From linux at horizon.com  Sun Nov 27 01:26:51 2005
From: linux at horizon.com (linux at horizon.com)
Date: 26 Nov 2005 20:26:51 -0500
Subject: Assertion failure in ext3_sync_file() at fs/ext3/fsync.c:50:
	"ext3_journal_current_handle() == 0"
In-Reply-To: <26063.195.126.66.126.1133045082.squirrel@housecafe.dyndns.org>
Message-ID: <20051127012651.30628.qmail@science.horizon.com>

> is this error reproducible?

Sorry, I didn't try; it was a production server I already had one
short-notice reboot on, and I didn't feel like trying for two.
(Although you're right, I should have thought of leaving it like that
until a good maintenance window instead of immediately e2fscking and
cleaning up.)

>> Here's the tune2fs -l output.  No idea why it says "clean"; it is still
>> mounted read/write.

> i don't know if this is "ok" (don't have the docs atm) but does e2fsck
> report anything to worry about?

No, it didn't.  I moved half of the .deb files and completed the update
by halves, and all was well.


From adilger at clusterfs.com  Sun Nov 27 09:00:12 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Sun, 27 Nov 2005 02:00:12 -0700
Subject: Assertion failure in ext3_sync_file() at fs/ext3/fsync.c:50:
	"ext3_journal_current_handle() == 0"
In-Reply-To: <20051124214257.4673.qmail@science.horizon.com>
References: <20051124214257.4673.qmail@science.horizon.com>
Message-ID: <20051127090012.GU14509@schatzie.adilger.int>

On Nov 24, 2005  16:42 -0500, linux at horizon.com wrote:
> ------------[ cut here ]------------
> kernel BUG at fs/ext3/fsync.c:50!
> Process aptitude (pid: 26952, threadinfo=c13bc000 task=d99cca80)
> Call Trace:
>  [<b01458ce>] msync_interval+0x8e/0xd0
>  [<b0145a6f>] sys_msync+0x15f/0x171
>  [<b0102c69>] syscall_call+0x7/0xb

This BUG is:
J_ASSERT(ext3_journal_current_handle() == 0);

which means that somehow the aptitude process struct had a journal handle
still active when it shouldn't have.  Are there any console messages
before the BUG, or just ENOSPC from the program?  Either way, I'd suspect
a bug in the error handling code not doing a journal_stop() before exiting
a function somewhere...

> Here's the tune2fs -l output.  No idea why it says "clean"; it is still
> mounted read/write.
> 
> Filesystem features:      has_journal filetype needs_recovery sparse_super
> Filesystem state:         clean

FYI - all ext3 filesystems say "clean" all the time, because when the
journal replay is completed (note "needs_recovery" flag above) the
filesystem will in fact be clean (i.e. not needing an e2fsck).  If this
were "error" (after the kernel detected some on-disk error) then you'd
get a full e2fsck on boot whether or not the journal was recovered.
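
To look at both fields at once, something like the following could be
used.  It assumes the "Filesystem features:" and "Filesystem state:"
lines of "tune2fs -l" as quoted above; the function name ext3_status and
its output words are made up for illustration.

```shell
#!/bin/sh
# ext3_status: read `tune2fs -l` output on stdin and print one line,
# the recorded state followed by whether a journal replay is pending
# (i.e. whether the needs_recovery feature flag is set).
ext3_status() {
    awk '
        /^Filesystem features:/ {
            pending = ($0 ~ /[ \t]needs_recovery([ \t]|$)/)
        }
        /^Filesystem state:/ {
            sub(/^Filesystem state:[ \t]*/, "")
            state = $0
        }
        END {
            print state, (pending ? "journal-replay-pending" : "journal-idle")
        }'
}

# Example (device name is a placeholder):
#   tune2fs -l /dev/md1 | ext3_status
```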

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From linux at horizon.com  Tue Nov 29 02:36:30 2005
From: linux at horizon.com (linux at horizon.com)
Date: 28 Nov 2005 21:36:30 -0500
Subject: Assertion failure in ext3_sync_file() at fs/ext3/fsync.c:50:
	"ext3_journal_current_handle() == 0"
In-Reply-To: <20051127090012.GU14509@schatzie.adilger.int>
Message-ID: <20051129023630.8145.qmail@science.horizon.com>

> which means that somehow the aptitude process struct had a journal handle
> still active when it shouldn't have.  Are there any console messages
> before the BUG, or just ENOSPC from the program?  Either way, I'd suspect
> a bug in the error handling code not doing a journal_stop() before exiting
> a function somewhere...

Sorry, nothing in the 5 minutes before it, and even that is just a
martian packet. :-(

>> Filesystem state:         clean

> FYI - all ext3 filesystems say "clean" all the time, because when the
> journal replay is completed (note "needs_recovery" flag above) the
> filesystem will in fact be clean (i.e. not needing an e2fsck).  If this
> were "error" (after the kernel detected some on-disk error) then you'd
> get a full e2fsck on boot whether or not the journal was recovered.

Neat, thanks!