From worleys at gmail.com Wed Oct 1 18:18:21 2008
From: worleys at gmail.com (Chris Worley)
Date: Wed, 1 Oct 2008 12:18:21 -0600
Subject: When is a block free?
In-Reply-To: <20080929163917.GB10831@mit.edu>
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu>
Message-ID:

On Mon, Sep 29, 2008 at 10:39 AM, Theodore Tso wrote:
> On Mon, Sep 29, 2008 at 09:24:33AM -0600, Chris Worley wrote:
>> On Tue, Sep 16, 2008 at 3:32 PM, Chris Worley wrote:
>> > For example, in balloc.c I'm seeing ext3_free_blocks_sb
>> > calls ext3_clear_bit_atomic at the bottom... is that when the block is
>> > freed?  Are all blocks freed here?
>>
>> David Woodhouse, in an article at http://lwn.net/Articles/293658/, is
>> implementing the T10/T13 committees' "Trim" request in 2.6.28 kernels.
>>
>> Would it be appropriate to call "blkdev_issue_discard" at the bottom
>> of ext3_free_blocks_sb where ext3_clear_bit_atomic is being called?
>
> Unfortunately, it's not as simple as that.  The problem is that as
> soon as you call trim, the drive is allowed to discard the contents of
> that block so that future attempts to read from that block return all
> zeros.  Therefore we can't call Trim until after the transaction has
> committed.  That means we have to keep a linked list of block extents
> that are to be trimmed attached to the commit object, and only send
> the trim requests once the commit block has been written to disk.
>
> It's on the ext4 developers' TODO list to add Trim support to ext3 and
> ext4.

I was perusing David Woodhouse's 2.6.27-rc2 kernel at
git://git.infradead.org/users/drzeus/discard-2.6.git, and noticed he
has the discard built in to where I was talking about for ext2... so I
coded our driver to handle discards, and it works very nicely!!!

The journaling issue you raise is not a show-stopper on the block
device side: if the block device has to maintain a couple of blocks
that are not really in use, it's no big deal (eventually the blocks
will be re-written and the universe will be in order again)... for the
users, I can understand if the discard is preserved on the block
device, while the fs still thinks there's good data in there (we'll
give you back all zeros on read).

Chris

From tytso at mit.edu Wed Oct 1 18:59:09 2008
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 1 Oct 2008 14:59:09 -0400
Subject: When is a block free?
In-Reply-To:
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu>
Message-ID: <20081001185908.GC10080@mit.edu>

On Wed, Oct 01, 2008 at 12:18:21PM -0600, Chris Worley wrote:
>
> I was perusing David Woodhouse's 2.6.27-rc2 kernel at
> git://git.infradead.org/users/drzeus/discard-2.6.git, and noticed he
> has the discard built in to where I was talking about for ext2... so I
> coded our driver to handle discards, and it works very nicely!!!

I'm not sure what you mean by "our driver"?

> The journaling issue you raise is not a show-stopper on the block
> device side: if the block device has to maintain a couple of blocks
> that are not really in use, it's no big deal (eventually the blocks
> will be re-written and the universe will be in order again)... for the
> users, I can understand if the discard is preserved on the block
> device, while the fs still thinks there's good data in there (we'll
> give you back all zeros on read).

It's no issue on the block device side at all, but from the user's
point of view it can be quite disastrous.
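Concretely, the deferral I described in my last mail would look
something like the sketch below.  To be clear, this is only a sketch:
the t_discard_list field and both helpers are hypothetical names that
don't exist in today's jbd, and blkdev_issue_discard() only exists in
trees carrying David's discard patches (its signature may differ).
The point is purely the ordering: remember the freed extents on the
running transaction, and only issue the discards after the commit
block is known to be on disk.

	/*
	 * Sketch only: hypothetical structure, field, and helper
	 * names -- not current ext3/jbd code.
	 */
	#include <linux/list.h>
	#include <linux/slab.h>
	#include <linux/blkdev.h>
	#include <linux/jbd.h>

	struct discard_extent {
		struct list_head list;
		sector_t sector;	/* first sector to discard */
		sector_t count;		/* number of sectors */
	};

	/* Called where ext3_free_blocks_sb() clears the bitmap bits
	 * today, instead of issuing the discard immediately. */
	static void ext3_remember_discard(transaction_t *tx,
					  sector_t sector, sector_t count)
	{
		struct discard_extent *de = kmalloc(sizeof(*de), GFP_NOFS);

		if (!de)
			return;		/* dropping a discard is always safe */
		de->sector = sector;
		de->count = count;
		/* t_discard_list is a hypothetical field */
		list_add_tail(&de->list, &tx->t_discard_list);
	}

	/* Called from the commit path, strictly after the commit
	 * block has been written to disk. */
	static void ext3_issue_discards(struct block_device *bdev,
					transaction_t *tx)
	{
		struct discard_extent *de, *next;

		list_for_each_entry_safe(de, next, &tx->t_discard_list, list) {
			blkdev_issue_discard(bdev, de->sector, de->count);
			list_del(&de->list);
			kfree(de);
		}
	}

Until something along those lines is in place, issuing the discard
right where the bitmap bit is cleared opens a window in which data the
filesystem still considers valid can silently turn into zeros.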
Consider the following shell script:

	cp /etc/passwd /etc/passwd.vipw
	vi /etc/passwd.vipw
	# atomically update /etc/passwd
	mv /etc/passwd.vipw /etc/passwd

Now assume that we crash right after the "mv" command, but before the
transaction has committed.  The net result will be that the contents
of the /etc/passwd file will be all zeros, which some might
consider.... unfortunate.

This is exactly the same reason why we can't just zero data blocks on
the unlink command, but instead have to wait until the unlink
operation has actually been committed in the journal.

						- Ted

From worleys at gmail.com Wed Oct 1 19:46:00 2008
From: worleys at gmail.com (Chris Worley)
Date: Wed, 1 Oct 2008 13:46:00 -0600
Subject: When is a block free?
In-Reply-To: <20081001185908.GC10080@mit.edu>
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu>
Message-ID:

On Wed, Oct 1, 2008 at 12:59 PM, Theodore Tso wrote:
> On Wed, Oct 01, 2008 at 12:18:21PM -0600, Chris Worley wrote:
>>
>> I was perusing David Woodhouse's 2.6.27-rc2 kernel at
>> git://git.infradead.org/users/drzeus/discard-2.6.git, and noticed he
>> has the discard built in to where I was talking about for ext2... so I
>> coded our driver to handle discards, and it works very nicely!!!
>
> I'm not sure what you mean by "our driver"?

Our driver for the ioDrive:

http://fusionio.com/Products.aspx

So far, all I've implemented is the "discard" in the read/write
callback; no barrier, no ioctl.

>
>> The journaling issue you raise is not a show-stopper on the block
>> device side: if the block device has to maintain a couple of blocks
>> that are not really in use, it's no big deal (eventually the blocks
>> will be re-written and the universe will be in order again)... for the
>> users, I can understand if the discard is preserved on the block
>> device, while the fs still thinks there's good data in there (we'll
>> give you back all zeros on read).
>
> It's no issue on the block device side at all, but from the user's
> point of view it can be quite disastrous.

Maybe that should affect the priority of implementation for ext[34]?

Chris

From tytso at mit.edu Wed Oct 1 21:29:40 2008
From: tytso at mit.edu (Theodore Tso)
Date: Wed, 1 Oct 2008 17:29:40 -0400
Subject: When is a block free?
In-Reply-To:
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu>
Message-ID: <20081001212940.GI10080@mit.edu>

On Wed, Oct 01, 2008 at 01:46:00PM -0600, Chris Worley wrote:
>
> Maybe that should affect the priority of implementation for ext[34]?
>

It's on our todo list, but at the moment you can't even *get* SSD's
that have the trim command, apparently for love or money.  So that
affects the priority as well.

If someone wants to ship me an SSD that has trim support, ideally in a
2.5" 9mm hard drive SATA form factor with at least 128gigs, I promise
you that would affect priority of that feature, at least for me.  :-)

						- Ted

From balu.manyam at gmail.com Thu Oct 2 05:36:38 2008
From: balu.manyam at gmail.com (Balu manyam)
Date: Thu, 2 Oct 2008 11:06:38 +0530
Subject: When is a block free?
In-Reply-To: <20081001212940.GI10080@mit.edu>
References: <48D01448.4050107@redhat.com> <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu> <20081001212940.GI10080@mit.edu>
Message-ID: <995392220810012236i756ca53at112ead6b03d8f8c1@mail.gmail.com>

On Thu, Oct 2, 2008 at 2:59 AM, Theodore Tso wrote:
> On Wed, Oct 01, 2008 at 01:46:00PM -0600, Chris Worley wrote:
> >
> > Maybe that should affect the priority of implementation for ext[34]?
> >
>

Also, am I inferring correctly that the SAN array vendors who are now
implementing thin provisioning (i.e. allocating space on writes) can
benefit from this?  Now that the array can know which blocks are free
and update its own list of free blocks?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From worleys at gmail.com Thu Oct 2 13:40:30 2008
From: worleys at gmail.com (Chris Worley)
Date: Thu, 2 Oct 2008 07:40:30 -0600
Subject: When is a block free?
In-Reply-To: <995392220810012236i756ca53at112ead6b03d8f8c1@mail.gmail.com>
References: <20080929163917.GB10831@mit.edu> <20081001185908.GC10080@mit.edu> <20081001212940.GI10080@mit.edu> <995392220810012236i756ca53at112ead6b03d8f8c1@mail.gmail.com>
Message-ID:

On Wed, Oct 1, 2008 at 11:36 PM, Balu manyam wrote:
>
>
> On Thu, Oct 2, 2008 at 2:59 AM, Theodore Tso wrote:
>>
>> On Wed, Oct 01, 2008 at 01:46:00PM -0600, Chris Worley wrote:
>> >
>> > Maybe that should affect the priority of implementation for ext[34]?
>> >
>> >
>
> Also, am I inferring correctly that the SAN array vendors who are now
> implementing thin provisioning (i.e. allocating space on writes) can
> benefit from this?

Absolutely.

Chris

> Now that the array can know which blocks are free and update its
> own list of free blocks?
>

From articpenguin3800 at gmail.com Sat Oct 11 21:01:16 2008
From: articpenguin3800 at gmail.com (John Nelson)
Date: Sat, 11 Oct 2008 17:01:16 -0400
Subject: Backup Superblocks
Message-ID: <48F1141C.4040604@gmail.com>

Where does ext3 store the backup superblock?  Does it have one at the
very beginning of the partition and one at the very end?

From samuel at bcgreen.com Sat Oct 11 21:41:09 2008
From: samuel at bcgreen.com (Stephen Samuel)
Date: Sat, 11 Oct 2008 14:41:09 -0700
Subject: Backup Superblocks
In-Reply-To: <48F1141C.4040604@gmail.com>
References: <48F1141C.4040604@gmail.com>
Message-ID: <6cd50f9f0810111441h5b4d5235g2104ef5a337f8bcd@mail.gmail.com>

It stores them in various places, depending on the size of your
filesystem.  If your filesystem is large enough (>~ 1/2 GB) you'll
probably find it at block #32768.  For smaller filesystems, it appears
to put the first backup at block #8193.

You can get more details by using the -n option to mkfs.  If you used
nonstandard options in your original mkfs, you might want to provide
those details here, as well.  (( -n has mkfs.ext[23] not actually
write to the partition but simply say what it *WOULD* do if it did. ))

mkfs -t ext2 -n /dev/mydevice

On Sat, Oct 11, 2008 at 2:01 PM, John Nelson wrote:
> Where does ext3 store the backup superblock?  Does it have one at the very
> beginning of the partition and one at the very end?
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

--
Stephen Samuel http://www.bcgreen.com  778-861-7641

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carlo at alinoe.com Wed Oct 15 01:43:10 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Wed, 15 Oct 2008 03:43:10 +0200
Subject: How are 'files with holes' stored?
Message-ID: <20081015014310.GA1649@alinoe.com>

Hi, I don't know what to call them, but it seems
that ext3 allows files to be stored that
have a very large size (when doing an 'ls -l')
but do not actually allocate all blocks.

I assume this is achieved by using 0 as block number
for indirect blocks.

What are the exact requirements for such files?
Is it allowed to have a double indirect block
that consists entirely of zeroes?  Is it possible
that there are 0 entries in the triple indirect
block?  Etc.

--
Carlo Wood

From lm at bitmover.com Wed Oct 15 01:47:55 2008
From: lm at bitmover.com (Larry McVoy)
Date: Tue, 14 Oct 2008 18:47:55 -0700
Subject: How are 'files with holes' stored?
In-Reply-To: <20081015014310.GA1649@alinoe.com>
References: <20081015014310.GA1649@alinoe.com>
Message-ID: <20081015014755.GB32378@bitmover.com>

I don't remember how UFS did this but I could go figure it out in 10
or 20 minutes if that helped.  ext* - no idea.  I'd expect that your
"block number is 0" is a darn good guess, that's what I would do.
That or -1.

On Wed, Oct 15, 2008 at 03:43:10AM +0200, Carlo Wood wrote:
> Hi, I don't know what to call them, but it seems
> that ext3 allows files to be stored that
> have a very large size (when doing an 'ls -l')
> but do not actually allocate all blocks.
>
> I assume this is achieved by using 0 as block number
> for indirect blocks.
>
> What are the exact requirements for such files?
> Is it allowed to have a double indirect block
> that consists entirely of zeroes?  Is it possible
> that there are 0 entries in the triple indirect
> block?  Etc.
>
> --
> Carlo Wood
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users

--
---
Larry McVoy                lm at bitmover.com           http://www.bitkeeper.com

From ling at fnal.gov Wed Oct 15 16:56:06 2008
From: ling at fnal.gov (Ling C. Ho)
Date: Wed, 15 Oct 2008 11:56:06 -0500
Subject: Need help recovering files.
Message-ID: <48F620A6.6060508@fnal.gov>

Hello,

I am trying to recover a huge ext3 filesystem (5.5TB) and fsck has
been running for almost a week.  It's still at PASS 1D at this point,
showing messages like

File ... (inode #138235018, mod time Tue Sep 23 03:04:23 2008)
has 1016 multiply-claimed block(s), shared with 327 file(s):
... (inode #375491526, mod time Wed Jun 4 17:05:37 2008)
...

I am wondering if it is possible for me to use debugfs to dump the
contents of inodes, if I write a script to go through all the inodes
that are used.  But when I try using ncheck to find out the path name
(the filename would be enough) I get these messages:

ncheck: EXT2 directory corrupted while calling ext2_dir_iterate

The root inode, according to fsck and debugfs, is gone.  If I were to
do an ls, it says "Ext2 inode is not a directory".

The tools I am using are from e2fsprogs 1.41.2.  The file systems were
originally created and mounted on a Fermi Linux SLF4.5 system (similar
to RHEL 4.5).

Is there any way for me to dump individual files, or to search for
valid directory inodes and use rdump?

Thanks,
...
ling From Curtis at GreenKey.net Sat Oct 18 19:55:56 2008 From: Curtis at GreenKey.net (Curtis Doty) Date: Sat, 18 Oct 2008 12:55:56 -0700 (PDT) Subject: recovering failed resize2fs Message-ID: <20081018195556.EB4016F064@alopias.GreenKey.net> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel deadlocked. (I have photo of screen/oops if anybody's interested.) Now after recovery, the filesystem won't mount EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in group (block 0)!<3>EXT4-fs: group descriptors corrupted! and fsck won't run: fsck.ext4: Group descriptors look bad... trying backup blocks... inst: recovering journal fsck.ext4: unable to set superblock flags on inst I peeked at all backup superblocks, but they all appear the same--the larger/newer 2.18T geometry. :-( What is the best way to recover? I know exactly how the original filesystem was created. Is there a way to just replay the old superblocks and trick it into thinking it never resized? ../C dumpe2fs 1.41.0 (10-Jul-2008) Filesystem volume name: inst Last mounted on: Filesystem UUID: ddabbf0c-bf3f-495a-9777-e832cc14e9df Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent sparse_super large_file Filesystem flags: signed_directory_hash test_filesystem Default mount options: journal_data_writeback Filesystem state: clean Errors behavior: Remount read-only Filesystem OS type: Linux Inode count: 9080832 Block count: 581173248 Reserved block count: 5808792 Free blocks: 6282771 Free inodes: 4331111 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 885 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 512 Inode blocks per group: 16 RAID stride: 32 RAID stripe width: 64 Filesystem created: Sun Jul 27 21:02:11 2008 Last mount time: Wed Aug 13 18:28:35 2008 Last write time: Sat Oct 18 12:39:13 2008 Mount count: 2 Maximum mount count: -1 Last checked: Sun Jul 27 21:02:11 2008 Check interval: 0 () Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode: 8 Default directory hash: tea Directory Hash Seed: c6e2cfa3-0545-46a4-8240-ccb987191b88 Journal backup: inode blocks Journal size: 256M From tytso at mit.edu Sat Oct 18 20:29:36 2008 From: tytso at mit.edu (Theodore Tso) Date: Sat, 18 Oct 2008 16:29:36 -0400 Subject: recovering failed resize2fs In-Reply-To: <20081018195556.EB4016F064@alopias.GreenKey.net> References: <20081018195556.EB4016F064@alopias.GreenKey.net> Message-ID: <20081018202936.GC8383@mit.edu> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote: > While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel > deadlocked. (I have photo of screen/oops if anybody's interested.) Yes, that would be useful, thanks. > Now after recovery, the filesystem won't mount > > EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in > group (block 0)!<3>EXT4-fs: group descriptors corrupted! > > and fsck won't run: > > fsck.ext4: Group descriptors look bad... trying backup blocks... > inst: recovering journal > fsck.ext4: unable to set superblock flags on inst Hmm... This sounds like the needs recovery flag was set on the backup superblock, which should never happen. Before we try something more extreme, see if this helps you: e2fsck -b 32768 -B 4096 /dev/where-inst-is-located That forces the use of the backup superblock right away, and might help you get past the initial error. 
- Ted

From Curtis at GreenKey.net Sat Oct 18 23:20:13 2008
From: Curtis at GreenKey.net (Curtis Doty)
Date: Sat, 18 Oct 2008 16:20:13 -0700 (PDT)
Subject: recovering failed resize2fs
In-Reply-To: <20081018202936.GC8383@mit.edu>
References: <20081018195556.EB4016F064@alopias.GreenKey.net> <20081018202936.GC8383@mit.edu>
Message-ID: <20081018232013.591E26F064@alopias.GreenKey.net>

4:29pm Theodore Tso said:

> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
>> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
>> deadlocked. (I have photo of screen/oops if anybody's interested.)
>
> Yes, that would be useful, thanks.

Three photos of same: http://www.greenkey.net/~curtis/linux/

The rest had scrolled off, so maybe that soft lockup was a secondary
effect rather than the true cause?  It was re-appearing every minute.

>
>> Now after recovery, the filesystem won't mount
>>
>> EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in
>> group (block 0)!<3>EXT4-fs: group descriptors corrupted!
>>
>> and fsck won't run:
>>
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> inst: recovering journal
>> fsck.ext4: unable to set superblock flags on inst
>
> Hmm... This sounds like the needs recovery flag was set on the backup
> superblock, which should never happen.  Before we try something more
> extreme, see if this helps you:
>
> e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
>
> That forces the use of the backup superblock right away, and might
> help you get past the initial error.

Same as before. :-(

# e2fsck -b32768 -B4096 -C0 /dev/dat/inst
e2fsck 1.41.0 (10-Jul-2008)
inst: recovering journal
e2fsck: unable to set superblock flags on inst

It appears *all* superblocks are the same as that first one at 32768;
iterating over all the superblock locations shown in the mkfs -n
output confirms it.

I'm inclined to just force reduce the underlying lvm.  It was 100%
full before I extended and tried to resize.  And I know the only
writes on the new lvm extent would have been from resize2fs.  Is that
wise?

../C

From tytso at mit.edu Mon Oct 20 01:53:09 2008
From: tytso at mit.edu (Theodore Tso)
Date: Sun, 19 Oct 2008 21:53:09 -0400
Subject: recovering failed resize2fs
In-Reply-To: <20081018232013.591E26F064@alopias.GreenKey.net>
References: <20081018195556.EB4016F064@alopias.GreenKey.net> <20081018202936.GC8383@mit.edu> <20081018232013.591E26F064@alopias.GreenKey.net>
Message-ID: <20081020015309.GB8162@mit.edu>

On Sat, Oct 18, 2008 at 04:20:13PM -0700, Curtis Doty wrote:
> 4:29pm Theodore Tso said:
>
>> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
>>> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
>>> deadlocked. (I have photo of screen/oops if anybody's interested.)
>>
>> Yes, that would be useful, thanks.
>
> Three photos of same: http://www.greenkey.net/~curtis/linux/
>
> The rest had scrolled off, so maybe that soft lockup was a secondary
> effect rather than the true cause?  It was re-appearing every minute.

Looks like the kernel wedged due to running out of memory.  The calls
to shrink_zone(), shrink_inactive_list(), try_to_release_page(),
etc. tend to indicate that the system was frantically trying to find
free physical memory at the time.  It may or may not have been caused
by the online resize; how much memory does your system have, and what
else was going on at the time?  It may have been that something *else*
had been leaking memory at the time, and this pushed it over the line.
It's also the case that the online resize is journaled, so it should
have been safe; but I'm guessing that the system was thrashing so
hard, and you didn't have barriers enabled, and this resulted in the
filesystem getting corrupted.

>> Hmm... This sounds like the needs recovery flag was set on the backup
>> superblock, which should never happen.  Before we try something more
>> extreme, see if this helps you:
>>
>> e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
>>
>> That forces the use of the backup superblock right away, and might
>> help you get past the initial error.
>
> Same as before. :-(
>
> # e2fsck -b32768 -B4096 -C0 /dev/dat/inst
> e2fsck 1.41.0 (10-Jul-2008)
> inst: recovering journal
> e2fsck: unable to set superblock flags on inst
>
> It appears *all* superblocks are the same as that first one at 32768;
> iterating over all the superblock locations shown in the mkfs -n
> output confirms it.
>
> I'm inclined to just force reduce the underlying lvm.  It was 100%
> full before I extended and tried to resize.  And I know the only
> writes on the new lvm extent would have been from resize2fs.  Is that
> wise?

No, force reducing the underlying LVM is only going to make things
worse, since it doesn't fix the filesystem.

So this is what I would do.  Create a snapshot and try this on the
snapshot first:

% lvcreate -s -L 10G -n inst-snapshot /dev/dat/inst
% debugfs -w /dev/dat/inst-snapshot
debugfs: features ^needs_recovery
debugfs: quit
% e2fsck -C 0 /dev/dat/inst

This will skip running the journal, but there's no guarantee the
journal is valid anyway.

If this turns into a mess, you can throw away the snapshot and try
something else.  (The something else would require writing a C program
that removes the needs_recovery from all the backup superblocks, but
keeping it set on the master superblock.  That's more work, so let's
try this way first.)

						- Ted

From rdavidson at obsidian.com.au Mon Oct 20 02:35:54 2008
From: rdavidson at obsidian.com.au (Robert Davidson)
Date: Mon, 20 Oct 2008 13:35:54 +1100
Subject: ext3 file system I/O blocks until reboot
Message-ID: <48FBEE8A.80608@obsidian.com.au>

Hi all,

We have a server that has a 580GB ext3 file system on it.  Until
recently we ran around 15 virtual servers from this file system.  It
was fine for at least a few months, then the file system would
periodically become inaccessible, getting more frequent as time went
on.  Eventually we wouldn't even get through a 15-hour period without
having to reboot the server.

When the I/O got blocked, all processes accessing files on
/var/lib/vservers (its mount point) would get stuck waiting for I/O to
complete ("D" state) and I couldn't find any way to revive it apart
from rebooting the server.  I tried sending various signals (TERM and
KILL) to some kernel threads but that didn't help at all.  The
"kjournald" process also got stuck in the "D" state.

The server is running kernel 2.6.22.19 with the Linux-Vserver patch
vs2.2.0.7, DRBD 8.2.6 and the Areca RAID driver updated to
1.20.0X.15-80603 which was the latest available from Areca at the
time.  The OS is Debian etch.

As part of troubleshooting the problem I'd taken DRBD out of the mix,
tried updating the RAID driver in the kernel, replaced the RAID card
with another one with slightly later firmware, and also replaced the
power supply with a known-good one at the same time and disabled the
swap space.  None of that helped.  What did help was copying the files
from the existing file system to a newly formatted ext3 file system.
The newly formatted file system is only around 320GB, but is also set
up the same as the existing one (both are hardware RAID-6, running on
the same host, same controller, same physical disks, etc).

When the file system would become inaccessible, there were no notices
from the kernel about any issue at all.  We have a serial console on
this server and nothing was captured by the serial console when this
happened, nor is there anything in the system logs (which should have
been writable all this time as they are not on the broken file
system).

I used 'dd' to check if I could read from the underlying device files
that the file system was on (/dev/sdc1 and /dev/drbd1), and there was
no problem doing that.  I didn't test writes to these devices though
since I don't know of any safe way to do so, but using the SysRq
feature, an emergency sync would not complete, nor would an emergency
umount, so I assume writes were out of the question.  Doing an 'ls' on
/var/lib/vservers just left me with yet another process stuck in the
"D" state.

A forced fsck of the file system (using a fresh build of e2fsprogs
1.41.3 with the matching libraries) provides no hint of any problems.

The root file system is an ext3 file system as well, and there were no
problems reading/writing to that file system while the ext3 file
system on /var/lib/vservers was inaccessible.  The filesystem is also
on the same RAID card, physical disks, etc.

One reason I've not moved to a newer kernel yet is because there isn't
a stable linux-vserver patch for anything newer than 2.6.22.19, so I'm
kind of stuck with that kernel until there is.  I made a start on
backporting the ext3 code from 2.6.26.5 to 2.6.22.19 but it's not
something I trust myself to get right, so I'd rather avoid that
approach unless there is another way of doing that.

So my questions are:

Are there any further diagnostics I can perform on the old file system
to try and track down the problem?  If so, what are they?

Is this a known bug/problem with ext3 or something related to it?

Is it likely that one of the 3 or so deadlocks that have been fixed in
kernels since 2.6.22.19 would have cured this problem, or would these
deadlocks have taken down the whole box and not just affected the one
file system?  Or even this bug:
http://bugzilla.kernel.org/show_bug.cgi?id=10882 (the softlockup part;
I think not, though, because I was able to copy everything off that
file system and on to a new one without having any lockups or any
other complaints from the kernel).

Thanks.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support at obsidian.com.au

From bruno at wolff.to Mon Oct 20 13:34:49 2008
From: bruno at wolff.to (Bruno Wolff III)
Date: Mon, 20 Oct 2008 08:34:49 -0500
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <48FBEE8A.80608@obsidian.com.au>
References: <48FBEE8A.80608@obsidian.com.au>
Message-ID: <20081020133449.GB26855@wolff.to>

On Mon, Oct 20, 2008 at 13:35:54 +1100, Robert Davidson wrote:
>
> So my questions are:
>
> Are there any further diagnostics I can perform on the old file system
> to try and track down the problem?  If so, what are they?
>
> Is this a known bug/problem with ext3 or something related to it?

I saw stuff like this happening starting with later 2.6.20 kernels that
wasn't fixed until the 2.6.24 kernels.  (See bug 235043.)  I wasn't using
VM's, so it might not be the same as the bug you are seeing.
I do remember seeing
some other similar problems people were having that didn't appear
to be the same bug as I had when I did bugzilla searches.  So you might
want to do your own bugzilla search to see what you can find.

I have also been getting disk IO lockups in F10, but in a more limited set
of circumstances.  (Memory pressure on an X86_64 system.)

From rdavidson at obsidian.com.au Tue Oct 21 00:40:06 2008
From: rdavidson at obsidian.com.au (Robert Davidson)
Date: Tue, 21 Oct 2008 11:40:06 +1100
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <20081020133449.GB26855@wolff.to>
References: <48FBEE8A.80608@obsidian.com.au> <20081020133449.GB26855@wolff.to>
Message-ID: <48FD24E6.1060106@obsidian.com.au>

Bruno Wolff III wrote:
> I saw stuff like this happening starting with later 2.6.20 kernels that
> wasn't fixed until the 2.6.24 kernels.  (See bug 235043.)  I wasn't using
> VM's, so it might not be the same as the bug you are seeing.  I do remember
> seeing some other similar problems people were having that didn't appear
> to be the same bug as I had when I did bugzilla searches.  So you might
> want to do your own bugzilla search to see what you can find.
>
> I have also been getting disk IO lockups in F10, but in a more limited set
> of circumstances.  (Memory pressure on an X86_64 system.)
>

Hi Bruno,

I've had a look through bugzilla but couldn't find any similar bugs (the
closest I can find is 439548 but I doubt very much that that's it).  Your
bug 235043 does sound rather different since it sounds like new
processes would be able to access the file system without a problem,
whereas on my system any new attempt to read (writing wasn't tested)
just resulted in one more process stuck in the "D" state.

I might try taking a byte-for-byte copy of the FS and see if I can find
a way to reliably reproduce the problem on a similar server.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support at obsidian.com.au

From bruno at wolff.to Tue Oct 21 03:37:22 2008
From: bruno at wolff.to (Bruno Wolff III)
Date: Mon, 20 Oct 2008 22:37:22 -0500
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <48FD24E6.1060106@obsidian.com.au>
References: <48FBEE8A.80608@obsidian.com.au> <20081020133449.GB26855@wolff.to> <48FD24E6.1060106@obsidian.com.au>
Message-ID: <20081021033722.GA24998@wolff.to>

On Tue, Oct 21, 2008 at 11:40:06 +1100, Robert Davidson wrote:
>
> I've had a look through bugzilla but couldn't find any similar bugs (the
> closest I can find is 439548 but I doubt very much that that's it).  Your
> bug 235043 does sound rather different since it sounds like new
> processes would be able to access the file system without a problem,
> whereas on my system any new attempt to read (writing wasn't tested)
> just resulted in one more process stuck in the "D" state.

For a while.  Eventually everything would lock up.
From Curtis at GreenKey.net Tue Oct 21 21:44:33 2008
From: Curtis at GreenKey.net (Curtis Doty)
Date: Tue, 21 Oct 2008 14:44:33 -0700 (PDT)
Subject: recovering failed resize2fs
In-Reply-To: <20081020015309.GB8162@mit.edu>
References: <20081018195556.EB4016F064@alopias.GreenKey.net> <20081018202936.GC8383@mit.edu> <20081018232013.591E26F064@alopias.GreenKey.net> <20081020015309.GB8162@mit.edu>
Message-ID: <20081021214433.DA4416F064@alopias.GreenKey.net>

Sunday Theodore Tso said:

> On Sat, Oct 18, 2008 at 04:20:13PM -0700, Curtis Doty wrote:
>> 4:29pm Theodore Tso said:
>>
>>> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
>>>> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
>>>> deadlocked. (I have photo of screen/oops if anybody's interested.)
>>>
>>> Yes, that would be useful, thanks.
>>
>> Three photos of same: http://www.greenkey.net/~curtis/linux/
>>
>> The rest had scrolled off, so maybe that soft lockup was a secondary
>> effect rather than the true cause?  It was re-appearing every minute.
>
> Looks like the kernel wedged due to running out of memory.  The calls
> to shrink_zone(), shrink_inactive_list(), try_to_release_page(),
> etc. tend to indicate that the system was frantically trying to find
> free physical memory at the time.  It may or may not have been caused
> by the online resize; how much memory does your system have, and what
> else was going on at the time?  It may have been that something *else*
> had been leaking memory at the time, and this pushed it over the line.
>

The system had been up a couple of months, doing significant I/O on
the ext4 volume.  And indeed it had been having periodic memory/swap
issues:

http://www.greenkey.net/~curtis/linux/cracker-kernel.2008-10-21

> It's also the case that the online resize is journaled, so it should
> have been safe; but I'm guessing that the system was thrashing so
> hard, and you didn't have barriers enabled, and this resulted in the
> filesystem getting corrupted.

Some other observations...

- a snapshot in a different vg blew up a few days prior; it was deleted
- ran vgs a few times in another vty during resize2fs *immediately*
  before crash

>
>>> Hmm... This sounds like the needs recovery flag was set on the backup
>>> superblock, which should never happen.  Before we try something more
>>> extreme, see if this helps you:
>>>
>>> e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
>>>
>>> That forces the use of the backup superblock right away, and might
>>> help you get past the initial error.
>>
>> Same as before. :-(
>>
>> # e2fsck -b32768 -B4096 -C0 /dev/dat/inst
>> e2fsck 1.41.0 (10-Jul-2008)
>> inst: recovering journal
>> e2fsck: unable to set superblock flags on inst
>>
>> It appears *all* superblocks are the same as that first one at 32768;
>> iterating over all the superblock locations shown in the mkfs -n
>> output confirms it.
>>
>> I'm inclined to just force reduce the underlying lvm.  It was 100%
>> full before I extended and tried to resize.  And I know the only
>> writes on the new lvm extent would have been from resize2fs.  Is that
>> wise?
>
> No, force reducing the underlying LVM is only going to make things
> worse, since it doesn't fix the filesystem.
>
> So this is what I would do.  Create a snapshot and try this on the
> snapshot first:
>
> % lvcreate -s -L 10G -n inst-snapshot /dev/dat/inst
> % debugfs -w /dev/dat/inst-snapshot
> debugfs: features ^needs_recovery
> debugfs: quit
> % e2fsck -C 0 /dev/dat/inst

Done, but no change.
:-(

EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in
group (block 0)!<3>EXT4-fs: group descriptors corrupted!

>
> This will skip running the journal, but there's no guarantee the
> journal is valid anyway.
>
> If this turns into a mess, you can throw away the snapshot and try
> something else.  (The something else would require writing a C program
> that removes the needs_recovery from all the backup superblocks, but
> keeping it set on the master superblock.  That's more work, so let's
> try this way first.)

How does that something else work?

../C

From rbock at eudoxos.de Thu Oct 23 08:00:44 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Thu, 23 Oct 2008 10:00:44 +0200
Subject: Block bitmap differences
Message-ID: <49002F2C.9040209@eudoxos.de>

Hi,

a few weeks ago, an unhealthy combination of firmware in an Adaptec
Raid controller and Seagate disks damaged my Raid6 filesystem.  A
bunch of files were damaged or lost at that time after the firmware
was updated and I had run e2fsck.  Luckily, I was able to restore
everything from a backup.  A subsequent check with e2fsck reported no
errors.

Yesterday, I ran e2fsck -n again, to see if the system is still OK.
It isn't and I have no idea how to interpret the messages (see
attachment).

What is the meaning and severity of
- Block bitmap differences?
- Free blocks count wrong for group?

Thanks and regards,

Roland

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: e2fsck-n-2008-10-22
URL:

From rbock at eudoxos.de Thu Oct 23 08:18:31 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Thu, 23 Oct 2008 10:18:31 +0200
Subject: Undeletable files
Message-ID: <49003357.8040704@eudoxos.de>

Hi,

an e2fsck-run left a few files sprinkled over the file system which
seem to be undeletable.  Although the FS is mounted RW, even root does
not seem to be able to delete them.

#: ls -l
total 4888
-rwxrwx--t 1 18416192 21168618 45056 2007-03-08 17:57 00000000_0000_AAM_TUI_002.txt
-r-xr-x-wT 1 2617499625 1426397418 45056 1920-07-13 22:52 00000000_0000_AAM_TUI_007.txt
-rw-rwSrw- 1 51446267 130941264 49152 2006-11-03 07:37 00000000_0000_AAM_TUI_015.txt
-rwxrwxrwt 1 33686018 59768993 49152 1909-11-25 08:49 00000000_0000_AAM_TUI_021.txt
--w----rwt 1 64588982 2634154654 49152 2007-08-14 19:50 00000000_0000_AAM_TUI_034.txt
-r-----r-T 1 66500841 4060231152 49152 2007-09-19 01:13 19991001_0000_AAM_TUI_000.txt
------xrw- 1 2885846505 33621835 4243456 2005-11-10 06:47 20011112_0000_AAM_TUI_000.txt
-r-x--xrwt 1 2214740202 33685997 49152 2004-09-10 00:51 20040116_0000_AAM_TUI_000.txt
---x-w---t 1 18200553 19138330 49152 2022-08-08 13:51 20051109_0000_AAM_TUI_000.txt
--w-rw---x 1 93782446 2533491176 45056 2004-08-22 20:46 20060609_0000_AAM_TUI_000.txt
--wxrw-r-- 1 38929139 26715113 49152 2007-09-29 07:38 20061220_0000_AAM_TUI_000.txt
---xr-xrwx 1 33661673 26673902 49152 2007-03-10 04:59 20061221_0000_AAM_TUI_001.txt
---xrw---t 1 30977769 989954793 49152 2004-09-06 05:47 20070117_0000_AAM_TUI_000.txt
-rw-rw-rwt 1 80150873 3204594410 49152 2006-12-19 14:22 20070308_0000_AAM_TUI_000.txt
--w-r-x--T 1 37617711 58786132 49152 2007-04-07 21:02 20070308_0000_AAM_TUI_002.txt
-rwxr-xrwt 1 3137470985 16843449 49152 2012-03-25 11:06 20070419_0000_AAM_TUI_000.txt
----r-Srwt 1 159806442 268563177 49152 2007-08-07 23:15 20070607_0000_AAM_TUI_000.txt

None of these files can be deleted or modified.  None of the
user/group IDs is valid.  Root cannot change any of the attributes.
For example: #: rm 20011112_0000_AAM_TUI_000.txt rm: cannot remove `20011112_0000_AAM_TUI_000.txt': Operation not permitted #: chown root:root 20051109_0000_AAM_TUI_000.txt chown: changing ownership of `20051109_0000_AAM_TUI_000.txt': Operation not permitted #: chmod a+w 20061221_0000_AAM_TUI_001.txt chmod: changing permissions of `20061221_0000_AAM_TUI_001.txt': Operation not permitted Any idea of how to get rid of these files? I have about a 100 million files on that file system. "About" 30.000 are in such a state as described above. The rest behaves normally (can be modified, deleted, etc). Thanks in advance, Roland From jpiszcz at lucidpixels.com Thu Oct 23 11:30:39 2008 From: jpiszcz at lucidpixels.com (Justin Piszcz) Date: Thu, 23 Oct 2008 07:30:39 -0400 (EDT) Subject: Undeletable files In-Reply-To: <49003357.8040704@eudoxos.de> References: <49003357.8040704@eudoxos.de> Message-ID: On Thu, 23 Oct 2008, Roland Bock wrote: > Hi, > > an e2fsck-run left a few files sprinkled over the file system which seem to > be undeletable. Although the FS is mounted RW, even root does not seem to be > able to delete them. > [ .. ] > For example: > #: rm 20011112_0000_AAM_TUI_000.txt > rm: cannot remove `20011112_0000_AAM_TUI_000.txt': Operation not permitted > > #: chown root:root 20051109_0000_AAM_TUI_000.txt > chown: changing ownership of `20051109_0000_AAM_TUI_000.txt': Operation not > permitted > > #: chmod a+w 20061221_0000_AAM_TUI_001.txt > chmod: changing permissions of `20061221_0000_AAM_TUI_001.txt': Operation not > permitted > > > Any idea of how to get rid of these files? I have about a 100 million files > on that file system. "About" 30.000 are in such a state as described above. > The rest behaves normally (can be modified, deleted, etc). Either the FS is damaged or the files are chattr'd +i, lsattr -l filename. Are they immutable by chance? Justin. From rbock at eudoxos.de Thu Oct 23 11:51:53 2008 From: rbock at eudoxos.de (Roland Bock) Date: Thu, 23 Oct 2008 13:51:53 +0200 Subject: Undeletable files In-Reply-To: References: <49003357.8040704@eudoxos.de> Message-ID: <49006559.9020803@eudoxos.de> Justin, thanks for the hint! Yes, some of them are immutable, e.g. 00000000_0000_AAM_TUI_007.txt Synchronous_Directory_Updates, Immutable, No_Atime, Compression_Raw_Access Others aren't, e.g. 00000000_0000_AAM_TUI_002.txt Secure_Deletion, Append_Only, No_Atime, Compression_Raw_Access, Top_of_Directory_Hierarchie Got rid of them by: chattr -i -a *; rm * Thanks again, Roland Justin Piszcz wrote: > > > On Thu, 23 Oct 2008, Roland Bock wrote: > >> Hi, >> >> an e2fsck-run left a few files sprinkled over the file system which >> seem to be undeletable. Although the FS is mounted RW, even root does >> not seem to be able to delete them. >> > > [ .. ] > >> For example: >> #: rm 20011112_0000_AAM_TUI_000.txt >> rm: cannot remove `20011112_0000_AAM_TUI_000.txt': Operation not >> permitted >> >> #: chown root:root 20051109_0000_AAM_TUI_000.txt >> chown: changing ownership of `20051109_0000_AAM_TUI_000.txt': >> Operation not permitted >> >> #: chmod a+w 20061221_0000_AAM_TUI_001.txt >> chmod: changing permissions of `20061221_0000_AAM_TUI_001.txt': >> Operation not permitted >> >> >> Any idea of how to get rid of these files? I have about a 100 million >> files on that file system. "About" 30.000 are in such a state as >> described above. The rest behaves normally (can be modified, deleted, >> etc). > > Either the FS is damaged or the files are chattr'd +i, lsattr -l filename. 
> > Are they immutable by chance? > > Justin. > From tytso at mit.edu Thu Oct 23 14:08:33 2008 From: tytso at mit.edu (Theodore Tso) Date: Thu, 23 Oct 2008 10:08:33 -0400 Subject: Block bitmap differences In-Reply-To: <49002F2C.9040209@eudoxos.de> References: <49002F2C.9040209@eudoxos.de> Message-ID: <20081023140833.GB5529@mit.edu> On Thu, Oct 23, 2008 at 10:00:44AM +0200, Roland Bock wrote: > Hi, > > a few weeks ago, an unhealthy combination of firmware in an Adaptec Raid > controller and Seagate disks damaged my Raid6 filesystem. A bunch of > files were damaged or lost at that time after the firmaware was updated > and I had run e2fsck. Luckily, I was able to restore everything from a > backup. A subsequent check with e2fsck reported no errors. > > Yesterday, I ran e2fsck -n again, to see if the system is still OK. It > isn't and I have no idea how to interpret the messages (see attachment). You ran the e2fsck while the filesystem is mounted. So the output reported is not trustworthy, and block allocation bitmap differences and free block/inode accounting information being wrong is normal when running e2fsck -n on a mounted filesystem. This message, however, is cause for concern: > /dev/sdb1 contains a file system with errors, check forced. This means the filesystem noticed some discrepancy (for example, when freeing a block, it noticed that the block bitmap already showed the block as being not in use, which should never happen and indicates filesystem corruption). I would recommend that you schedule downtime so you can run e2fsck on the filesystem while it is unmounted. Given the errors that you saw when running e2fsck while it was mounted, it's unlikely that you will see anything serious, but it is still something that you should do. Regards, - Ted From rbock at eudoxos.de Thu Oct 23 16:05:57 2008 From: rbock at eudoxos.de (Roland Bock) Date: Thu, 23 Oct 2008 18:05:57 +0200 Subject: Block bitmap differences In-Reply-To: <20081023140833.GB5529@mit.edu> References: <49002F2C.9040209@eudoxos.de> <20081023140833.GB5529@mit.edu> Message-ID: <4900A0E5.1000403@eudoxos.de> Ted, thank you for your answers. Is it normal to encounter file systems with minor errors? We run 8 systems with Ubuntu 8.04 64bit and e2fsck reports " contains file system with errors" for at least one partition on every machine. Since there are 4 different types of hardware configurations, I tend to say that hardware is rather not to be blamed... If it is not normal, what could be the reasons? Are there any options to turn on logging which could give more insight (what would be the performance impact)? Thanks and regards, Roland Theodore Tso wrote: > On Thu, Oct 23, 2008 at 10:00:44AM +0200, Roland Bock wrote: >> Hi, >> >> a few weeks ago, an unhealthy combination of firmware in an Adaptec Raid >> controller and Seagate disks damaged my Raid6 filesystem. A bunch of >> files were damaged or lost at that time after the firmaware was updated >> and I had run e2fsck. Luckily, I was able to restore everything from a >> backup. A subsequent check with e2fsck reported no errors. >> >> Yesterday, I ran e2fsck -n again, to see if the system is still OK. It >> isn't and I have no idea how to interpret the messages (see attachment). > > You ran the e2fsck while the filesystem is mounted. So the output > reported is not trustworthy, and block allocation bitmap differences > and free block/inode accounting information being wrong is normal when > running e2fsck -n on a mounted filesystem. 
> > This message, however, is cause for concern: > >> /dev/sdb1 contains a file system with errors, check forced. > > This means the filesystem noticed some discrepancy (for example, when > freeing a block, it noticed that the block bitmap already showed the > block as being not in use, which should never happen and indicates > filesystem corruption). > > I would recommend that you schedule downtime so you can run e2fsck on > the filesystem while it is unmounted. Given the errors that you saw > when running e2fsck while it was mounted, it's unlikely that you will > see anything serious, but it is still something that you should do. > > Regards, > > - Ted From sandeen at redhat.com Thu Oct 23 16:07:54 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Thu, 23 Oct 2008 11:07:54 -0500 Subject: Block bitmap differences In-Reply-To: <4900A0E5.1000403@eudoxos.de> References: <49002F2C.9040209@eudoxos.de> <20081023140833.GB5529@mit.edu> <4900A0E5.1000403@eudoxos.de> Message-ID: <4900A15A.9080602@redhat.com> Roland Bock wrote: > Ted, > > thank you for your answers. > > Is it normal to encounter file systems with minor errors? We run 8 > systems with Ubuntu 8.04 64bit and e2fsck reports " contains > file system with errors" for at least one partition on every machine. > > Since there are 4 different types of hardware configurations, I tend to > say that hardware is rather not to be blamed... > > If it is not normal, what could be the reasons? Look in your system logs; if the fs is flagged with errors, it should have issued a message when the error occurred. -Eric > Are there any options to turn on logging which could give more insight > (what would be the performance impact)? > > > Thanks and regards, > > Roland From rbock at eudoxos.de Thu Oct 23 17:53:26 2008 From: rbock at eudoxos.de (Roland Bock) Date: Thu, 23 Oct 2008 19:53:26 +0200 Subject: Block bitmap differences In-Reply-To: <4900A15A.9080602@redhat.com> References: <49002F2C.9040209@eudoxos.de> <20081023140833.GB5529@mit.edu> <4900A0E5.1000403@eudoxos.de> <4900A15A.9080602@redhat.com> Message-ID: <4900BA16.3040905@eudoxos.de> Eric, what should I be looking for? In /var/log I grep'ed for ext and fs (case insensitively) in all syslog, messages and kern.log files. I found nothing which indicated an error to me. Just occasional mount/umount messages and the like. Well, to be exact: I did find some error messages from the time when we had hardware issues on one machine. But nothing since these were resolved two weeks ago. e2fsck was happy then. Thanks and regards, Roland Eric Sandeen wrote: > Roland Bock wrote: >> Ted, >> >> thank you for your answers. >> >> Is it normal to encounter file systems with minor errors? We run 8 >> systems with Ubuntu 8.04 64bit and e2fsck reports " contains >> file system with errors" for at least one partition on every machine. >> >> Since there are 4 different types of hardware configurations, I tend to >> say that hardware is rather not to be blamed... >> >> If it is not normal, what could be the reasons? > > Look in your system logs; if the fs is flagged with errors, it should > have issued a message when the error occurred. > > -Eric > >> Are there any options to turn on logging which could give more insight >> (what would be the performance impact)? 
>> >> >> Thanks and regards, >> >> Roland > > From carlo at alinoe.com Fri Oct 24 01:16:30 2008 From: carlo at alinoe.com (Carlo Wood) Date: Fri, 24 Oct 2008 03:16:30 +0200 Subject: System crash during mke2fs Message-ID: <20081024011630.GA14432@alinoe.com> Hiya, don't know where else to report this. Please correct me if this isn't the right place. I just ran into a serious bug :(( We were trying to create a virtual filesystem in an image (file) of around 238 GB. Let the files name be foo.img, then we did: losetup /dev/loop0 foo.img and then used fdisk /dev/loop0 to create this partition table: uxley:~>fdisk -lu /dev/loop0 Disk /dev/loop0: 238.3 GB, 238370684928 bytes 255 heads, 63 sectors/track, 28980 cylinders, total 465567744 sectors Units = sectors of 1 * 512 = 512 bytes Device Boot Start End Blocks Id System /dev/loop0p1 * 63 401624 200781 83 Linux /dev/loop0p2 401625 16048934 7823655 83 Linux /dev/loop0p3 16048935 21928724 2939895 82 Linux swap / Solaris /dev/loop0p4 21928725 465563699 221817487+ 5 Extended /dev/loop0p5 21928788 27808514 2939863+ 83 Linux /dev/loop0p6 27808578 47359619 9775521 83 Linux /dev/loop0p7 47359683 57143204 4891761 83 Linux /dev/loop0p8 57143268 465563699 204210216 83 Linux Next we did: losetup -o $((512 * 63)) /dev/loop1 /dev/loop0 which should make the first partition available under /dev/loop1 (this certainly works if that partition already contains a fs, we then can mount it). Finally, I wanted to create a filesystem and ran the following command: uxley:~>mke2fs -j -L "/boot" /dev/loop1 mke2fs 1.40-WIP (14-Nov-2006) Filesystem label=/boot OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) 29097984 inodes, 58195960 blocks 2909798 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=0 1776 block groups 32768 blocks per group, 32768 fragments per group 16384 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872 Writing inode tables: 306/1776 Here the machine completely halted/crashed. I don't know what happened, because it's a remote machine. The writing of the inode table started very fast, but it was already slowing down the last few - and completely stopped at 306, which was 12 minutes ago (my ssh connection to the machine still didn't time out, weird enough). I can still ping the machine I see. Note that mke2fs says: 29097984 inodes, 58195960 blocks That is 58195960 * 4096 = 238370652160 the full size of the image file?!? This partition is only 200MB though! Did I do something very stupid, or is this a bug in mke2fs ? 
-- Carlo Wood From jordi.prats at gmail.com Fri Oct 24 06:47:31 2008 From: jordi.prats at gmail.com (Jordi Prats) Date: Fri, 24 Oct 2008 08:47:31 +0200 Subject: System crash during mke2fs In-Reply-To: <20081024011630.GA14432@alinoe.com> References: <20081024011630.GA14432@alinoe.com> Message-ID: <1908f30810232347x7378a6d0o2476e794153b7a68@mail.gmail.com> I don't know how this can hang your system, but instead of doing this: losetup -o $((512 * 63)) /dev/loop1 /dev/loop0 You could use kpartx: kpartx -a /dev/loop0 You are going to find in /dev/mapper your loop0p1: Here you can find an example: [root at shuVak ~]# dd if=/dev/zero of=caca bs=1024k count=100 100+0 records in 100+0 records out 104857600 bytes (105 MB) copied, 0.656977 seconds, 160 MB/s [root at shuVak ~]# losetup /dev/loop0 caca [root at shuVak ~]# fdisk /dev/loop0 Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel Building a new DOS disklabel. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable. Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite) Command (m for help): p Disk /dev/loop0: 104 MB, 104857600 bytes 255 heads, 63 sectors/track, 12 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System Command (m for help): n Command action e extended p primary partition (1-4) p Partition number (1-4): 1 First cylinder (1-12, default 1): Using default value 1 Last cylinder or +size or +sizeM or +sizeK (1-12, default 12): Using default value 12 Command (m for help): p Disk /dev/loop0: 104 MB, 104857600 bytes 255 heads, 63 sectors/track, 12 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/loop0p1 1 12 96358+ 83 Linux Command (m for help): t Selected partition 1 Hex code (type L to list codes): 8e Changed system type of partition 1 to 8e (Linux LVM) Command (m for help): w The partition table has been altered! Calling ioctl() to re-read partition table. WARNING: Re-reading the partition table failed with error 22: Invalid argument. The kernel still uses the old table. The new table will be used at the next reboot. Syncing disks. [root at shuVak ~]# ls /dev/loop* loop0 loop1 loop2 loop3 loop4 loop5 loop6 loop7 [root at shuVak ~]# kpartx -a /dev/loop0 [root at shuVak ~]# ls /dev/mapper/loop0p1 /dev/mapper/loop0p1 regards, Jordi On Fri, Oct 24, 2008 at 3:16 AM, Carlo Wood wrote: > Hiya, don't know where else to report this. Please > correct me if this isn't the right place. > > I just ran into a serious bug :(( > > We were trying to create a virtual filesystem > in an image (file) of around 238 GB. 
> > Let the files name be foo.img, then we did: > > losetup /dev/loop0 foo.img > > and then used fdisk /dev/loop0 to create this partition > table: > > uxley:~>fdisk -lu /dev/loop0 > > Disk /dev/loop0: 238.3 GB, 238370684928 bytes > 255 heads, 63 sectors/track, 28980 cylinders, total 465567744 sectors > Units = sectors of 1 * 512 = 512 bytes > > Device Boot Start End Blocks Id System > /dev/loop0p1 * 63 401624 200781 83 Linux > /dev/loop0p2 401625 16048934 7823655 83 Linux > /dev/loop0p3 16048935 21928724 2939895 82 Linux swap / Solaris > /dev/loop0p4 21928725 465563699 221817487+ 5 Extended > /dev/loop0p5 21928788 27808514 2939863+ 83 Linux > /dev/loop0p6 27808578 47359619 9775521 83 Linux > /dev/loop0p7 47359683 57143204 4891761 83 Linux > /dev/loop0p8 57143268 465563699 204210216 83 Linux > > Next we did: > > losetup -o $((512 * 63)) /dev/loop1 /dev/loop0 > > which should make the first partition available under /dev/loop1 > (this certainly works if that partition already contains a fs, > we then can mount it). > > Finally, I wanted to create a filesystem and ran the following > command: > > uxley:~>mke2fs -j -L "/boot" /dev/loop1 > mke2fs 1.40-WIP (14-Nov-2006) > Filesystem label=/boot > OS type: Linux > Block size=4096 (log=2) > Fragment size=4096 (log=2) > 29097984 inodes, 58195960 blocks > 2909798 blocks (5.00%) reserved for the super user > First data block=0 > Maximum filesystem blocks=0 > 1776 block groups > 32768 blocks per group, 32768 fragments per group > 16384 inodes per group > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000, 7962624, 11239424, 20480000, 23887872 > > Writing inode tables: 306/1776 > > > Here the machine completely halted/crashed. I don't know what > happened, because it's a remote machine. > > The writing of the inode table started very fast, but it was > already slowing down the last few - and completely stopped > at 306, which was 12 minutes ago (my ssh connection to the > machine still didn't time out, weird enough). > > I can still ping the machine I see. > > Note that mke2fs says: 29097984 inodes, 58195960 blocks > That is 58195960 * 4096 = 238370652160 the full size of > the image file?!? > > This partition is only 200MB though! > > Did I do something very stupid, or is this a bug in mke2fs ? > > -- > Carlo Wood > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > -- Jordi From tytso at mit.edu Fri Oct 24 10:54:40 2008 From: tytso at mit.edu (Theodore Tso) Date: Fri, 24 Oct 2008 06:54:40 -0400 Subject: System crash during mke2fs In-Reply-To: <20081024011630.GA14432@alinoe.com> References: <20081024011630.GA14432@alinoe.com> Message-ID: <20081024105440.GC8658@mit.edu> On Fri, Oct 24, 2008 at 03:16:30AM +0200, Carlo Wood wrote: > Hiya, don't know where else to report this. Please > correct me if this isn't the right place. > > I just ran into a serious bug :(( > > We were trying to create a virtual filesystem > in an image (file) of around 238 GB. [Using double losetup configuration] > > Here the machine completely halted/crashed. I don't know what > happened, because it's a remote machine. > > The writing of the inode table started very fast, but it was > already slowing down the last few - and completely stopped > at 306, which was 12 minutes ago (my ssh connection to the > machine still didn't time out, weird enough). That's a classic case of mke2fs tickling a VM bug. 
The VM should be able to do proper write throttling, but mke2fs writes a blocks very quickly, and so it's a great test of the kernel virtual memory subsystem. :-) So the fact that your system hung is a kernel bug, probably caued by the double /dev/loop configuration. What version of the kernel are you using? There is a workaround that might help: "export MKE2FS_SYNC=10". This will force an explicit sync system call every 10 blockgroups, which tends to work around the kernel VM bug. It's not the default mainly because mke2fs is such a great kernel test tool, and the VM really needs to be able to handle this case. > Note that mke2fs says: 29097984 inodes, 58195960 blocks > That is 58195960 * 4096 = 238370652160 the full size of > the image file?!? > > This partition is only 200MB though! That's because you created /dev/loop1 as a loop device with an offset of 512*63 bytes from the beginning of /dev/loop0. There is no way to set the maximum size of a loop device (it's not something which is currently defined as part of the interface of the LOOP_SET_STATUS ioctl. If you want to do things manually like this, you'll need to explicitly specify the size of the desired filesystem to mke2fs; it's a shortcoming in the loop device. The other way to do things would be to create an image file of the desired partition length, and then assemble it by hand afterwards; sorry, the loop device wasn't designed to be used to emulate a partitioned disk. It could be, but kernel patches would be required to extend its functionality. Regards, - Ted From rbock at eudoxos.de Fri Oct 24 15:19:25 2008 From: rbock at eudoxos.de (Roland Bock) Date: Fri, 24 Oct 2008 17:19:25 +0200 Subject: e2fsck discrepancies Message-ID: <4901E77D.6010602@eudoxos.de> Hi, yesterday I ran e2fsck -n on a mounted file system and got: /dev/sdb1 contains a file system with errors, check forced. According to Ted, the lines that followed were not to be trusted due to the fact that the file system was mounted. But this error statement suggests to run a check with the fs unmounted. Today, we scheduled a downtime and ran the check. It came of completely clean: ~: e2fsck -fy /dev/sdb1 e2fsck 1.40.8 (13-Mar-2008) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/sdb1: 32028520/536870912 files (0.5% non-contiguous), 802465197/2147460933 blocks Does this mean that read-only checks are generally not trustworthy, even the statement that the filesystem has errors? Or something like Read-only reports clean: fine Read-only reports error: not necessarily really an error Thanks and regards, Roland From sandeen at redhat.com Fri Oct 24 15:20:33 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Fri, 24 Oct 2008 10:20:33 -0500 Subject: System crash during mke2fs In-Reply-To: <20081024011630.GA14432@alinoe.com> References: <20081024011630.GA14432@alinoe.com> Message-ID: <4901E7C1.4050201@redhat.com> Carlo Wood wrote: > Hiya, don't know where else to report this. Please > correct me if this isn't the right place. > > I just ran into a serious bug :(( ... > Finally, I wanted to create a filesystem and ran the following > command: > > uxley:~>mke2fs -j -L "/boot" /dev/loop1 ... > Here the machine completely halted/crashed. I don't know what > happened, because it's a remote machine. It'd be very good to have a console so you can see what really truly happened. 
From rbock at eudoxos.de  Fri Oct 24 15:19:25 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Fri, 24 Oct 2008 17:19:25 +0200
Subject: e2fsck discrepancies
Message-ID: <4901E77D.6010602@eudoxos.de>

Hi,

yesterday I ran e2fsck -n on a mounted file system and got:

/dev/sdb1 contains a file system with errors, check forced.

According to Ted, the lines that followed were not to be trusted due
to the fact that the file system was mounted. But this error statement
suggests running a check with the fs unmounted.

Today, we scheduled a downtime and ran the check. It came off
completely clean:

~: e2fsck -fy /dev/sdb1

e2fsck 1.40.8 (13-Mar-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb1: 32028520/536870912 files (0.5% non-contiguous),
802465197/2147460933 blocks

Does this mean that read-only checks are generally not trustworthy,
even the statement that the filesystem has errors? Or something like:

Read-only reports clean: fine
Read-only reports error: not necessarily really an error

Thanks and regards,

Roland

From sandeen at redhat.com  Fri Oct 24 15:20:33 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 24 Oct 2008 10:20:33 -0500
Subject: System crash during mke2fs
In-Reply-To: <20081024011630.GA14432@alinoe.com>
References: <20081024011630.GA14432@alinoe.com>
Message-ID: <4901E7C1.4050201@redhat.com>

Carlo Wood wrote:
> Hiya, don't know where else to report this. Please
> correct me if this isn't the right place.
>
> I just ran into a serious bug :((

...

> Finally, I wanted to create a filesystem and ran the following
> command:
>
> uxley:~>mke2fs -j -L "/boot" /dev/loop1

...

> Here the machine completely halted/crashed. I don't know what
> happened, because it's a remote machine.

It'd be very good to have a console so you can see what really, truly
happened. A remote machine w/o a console would scare me in any case. :)

Is the image file sparse, or is it filled in with zeros? Is it hosted
on ext3?

Especially if it's sparse, but in either case, I'd be curious to know
if it works out any better or worse with other filesystems hosting the
image file - trying ext4 and/or xfs just as an experiment might be
interesting...

-Eric

From sandeen at redhat.com  Fri Oct 24 15:30:40 2008
From: sandeen at redhat.com (Eric Sandeen)
Date: Fri, 24 Oct 2008 10:30:40 -0500
Subject: e2fsck discrepancies
In-Reply-To: <4901E77D.6010602@eudoxos.de>
References: <4901E77D.6010602@eudoxos.de>
Message-ID: <4901EA20.8070800@redhat.com>

Roland Bock wrote:
> Hi,
>
> yesterday I ran e2fsck -n on a mounted file system and got:
>
> /dev/sdb1 contains a file system with errors, check forced.
>
> According to Ted, the lines that followed were not to be trusted due
> to the fact that the file system was mounted. But this error statement
> suggests running a check with the fs unmounted.
>
> Today, we scheduled a downtime and ran the check. It came off
> completely clean:
>
> ~: e2fsck -fy /dev/sdb1
>
> e2fsck 1.40.8 (13-Mar-2008)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> /dev/sdb1: 32028520/536870912 files (0.5% non-contiguous),
> 802465197/2147460933 blocks
>
> Does this mean that read-only checks are generally not trustworthy,
> even the statement that the filesystem has errors? Or something like:
>
> Read-only reports clean: fine
> Read-only reports error: not necessarily really an error

I think that's possible. When e2fsck starts off, main() does:

main()
  check_super_block()
    if some sanity tests fail
      ext2fs_unmark_valid()
  check_if_skip()
    if EXT2_ERROR_FS || !ext2fs_test_valid()
      "... contains a file system with errors"

check_if_skip is what issues the "contains a file system with errors"
message, and it may do so if the filesystem is marked with errors, OR
if a call to ext2fs_test_valid() fails.

Prior to this, check_super_block() may call ext2fs_unmark_valid() for
a variety of reasons, some of which could, I think, be caused by the
filesystem being live and not necessarily consistent when viewed by
e2fsck.

So I think that the message is a bit misleading; "filesystem with
errors" sounds to me like EXT2_ERROR_FS, which should always issue
some sort of message to the syslog when set - but you may also get the
"filesystem with errors" message due to some inconsistencies that may
be wholly due to the filesystem being mounted and in flux as fsck
tries to read it.

-Eric
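One way to get a read-only check whose results *can* be trusted - not raised in this thread, so treat it as an aside - is to run e2fsck against a point-in-time snapshot instead of the live device. A sketch, assuming the filesystem lives on an LVM volume (the name /dev/vg0/data and the 1G snapshot size are illustrative):

    # Take a snapshot; it sees a frozen, crash-consistent image of the fs.
    lvcreate -s -L 1G -n data-snap /dev/vg0/data

    # A forced read-only check of the snapshot; complaints here are real,
    # not artifacts of the filesystem changing underneath e2fsck.  (With
    # ext3, the snapshot looks like a crashed fs, so expect it to report
    # that the journal needs recovery.)
    e2fsck -fn /dev/vg0/data-snap

    # Drop the snapshot when done.
    lvremove -f /dev/vg0/data-snap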
From lll+ext3 at m4x.org  Fri Oct 24 15:59:22 2008
From: lll+ext3 at m4x.org (Loic Le Loarer)
Date: Fri, 24 Oct 2008 17:59:22 +0200
Subject: see current superblock information
Message-ID: <20081024155922.GF24933@pavuc.le-loarer.org>

Hi all,

I would like to get the current used/free inode count of a mounted
ext3 fs. It is very useful for debugging a situation where you cannot
create new files even though the fs isn't full according to "df"
(i.e. when the free inode count is zero).

My first idea was to use "tune2fs -l /dev/device"; it gives all the
information I need, but it reflects only the on-disk superblock, which
seems to never be written while the fs is mounted.

So I'm looking for a way to either force a flush of the superblock or
to just get the current used/free inode count.

I hope I have contacted the correct mailing list.

Thank you in advance for your answers.

Best regards.

--
Loïc

"heaven is not a place, it's a feeling"

From carlo at alinoe.com  Fri Oct 24 17:42:27 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Fri, 24 Oct 2008 19:42:27 +0200
Subject: System crash during mke2fs
In-Reply-To: <20081024105440.GC8658@mit.edu>
References: <20081024011630.GA14432@alinoe.com> <20081024105440.GC8658@mit.edu>
Message-ID: <20081024174227.GA24607@alinoe.com>

On Fri, Oct 24, 2008 at 06:54:40AM -0400, Theodore Tso wrote:
> probably caused by the double /dev/loop configuration. What version
> of the kernel are you using?

It's running 2.6.18-6-686.

We rebooted the machine and nothing seemed corrupted or wrong, except
the virtual machine file; it took another 7 hours to recreate that
(it's a vmware thing).

--
Carlo Wood

From carlo at alinoe.com  Fri Oct 24 17:46:33 2008
From: carlo at alinoe.com (Carlo Wood)
Date: Fri, 24 Oct 2008 19:46:33 +0200
Subject: System crash during mke2fs
In-Reply-To: <4901E7C1.4050201@redhat.com>
References: <20081024011630.GA14432@alinoe.com> <4901E7C1.4050201@redhat.com>
Message-ID: <20081024174633.GC24607@alinoe.com>

On Fri, Oct 24, 2008 at 10:20:33AM -0500, Eric Sandeen wrote:
> Is the image file sparse, or is it filled in with zeros? Is it hosted
> on ext3?

Not sparse, but probably filled with zeroes. And yes, it is hosted on
ext3.

> Especially if it's sparse, but in either case, I'd be curious to know
> if it works out any better or worse with other filesystems hosting the
> image file - trying ext4 and/or xfs just as an experiment might be
> interesting...

We're just trying to save this company that has been down for three
weeks now ;)  No time for experiments :p

Anyway, thanks for your comments. In the meantime we're back on track,
fortunately.

--
Carlo Wood

From Curtis at GreenKey.net  Fri Oct 24 17:47:55 2008
From: Curtis at GreenKey.net (Curtis Doty)
Date: Fri, 24 Oct 2008 10:47:55 -0700 (PDT)
Subject: see current superblock information
In-Reply-To: <20081024155922.GF24933@pavuc.le-loarer.org>
References: <20081024155922.GF24933@pavuc.le-loarer.org>
Message-ID: <20081024174756.49A5E6F064@alopias.GreenKey.net>

5:59pm Loic Le Loarer said:
>
> So I'm looking for a way to either force a flush of the superblock or
> to just get the current used/free inode count.

df -i

Is that what you seek?

../C
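The reason df -i works where tune2fs -l does not: df asks the kernel via statfs(2), which reports the live in-memory counters rather than the stale on-disk superblock. A quick illustration (mount point and figures are made up):

    # Live inode usage; IFree hitting 0 explains ENOSPC with space left.
    $ df -i /mnt/data
    Filesystem      Inodes   IUsed    IFree IUse% Mounted on
    /dev/sdb1     36700160 36700160        0  100% /mnt/data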
From rbock at eudoxos.de  Fri Oct 24 17:49:57 2008
From: rbock at eudoxos.de (Roland Bock)
Date: Fri, 24 Oct 2008 19:49:57 +0200
Subject: e2fsck discrepancies
In-Reply-To: <4901EA20.8070800@redhat.com>
References: <4901E77D.6010602@eudoxos.de> <4901EA20.8070800@redhat.com>
Message-ID: <49020AC5.9000703@eudoxos.de>

Eric:

thanks for the confirmation. Now that I read the man page again, I
wonder how I could miss that part:

"[...] However, even if it is safe to do so, the results printed by
e2fsck are not valid if the filesystem is mounted."

Blessed is he who can read :-)

Best regards,

Roland


Eric Sandeen wrote:
> Roland Bock wrote:
>> Hi,
>>
>> yesterday I ran e2fsck -n on a mounted file system and got:
>>
>> /dev/sdb1 contains a file system with errors, check forced.
>>
>> According to Ted, the lines that followed were not to be trusted due
>> to the fact that the file system was mounted. But this error statement
>> suggests running a check with the fs unmounted.
>>
>> Today, we scheduled a downtime and ran the check. It came off
>> completely clean:
>>
>> ~: e2fsck -fy /dev/sdb1
>>
>> e2fsck 1.40.8 (13-Mar-2008)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>> /dev/sdb1: 32028520/536870912 files (0.5% non-contiguous),
>> 802465197/2147460933 blocks
>>
>> Does this mean that read-only checks are generally not trustworthy,
>> even the statement that the filesystem has errors? Or something like:
>>
>> Read-only reports clean: fine
>> Read-only reports error: not necessarily really an error
>
> I think that's possible. When e2fsck starts off, main() does:
>
> main()
>   check_super_block()
>     if some sanity tests fail
>       ext2fs_unmark_valid()
>   check_if_skip()
>     if EXT2_ERROR_FS || !ext2fs_test_valid()
>       "... contains a file system with errors"
>
> check_if_skip is what issues the "contains a file system with errors"
> message, and it may do so if the filesystem is marked with errors, OR
> if a call to ext2fs_test_valid() fails.
>
> Prior to this, check_super_block() may call ext2fs_unmark_valid() for
> a variety of reasons, some of which could, I think, be caused by the
> filesystem being live and not necessarily consistent when viewed by
> e2fsck.
>
> So I think that the message is a bit misleading; "filesystem with
> errors" sounds to me like EXT2_ERROR_FS, which should always issue
> some sort of message to the syslog when set - but you may also get the
> "filesystem with errors" message due to some inconsistencies that may
> be wholly due to the filesystem being mounted and in flux as fsck
> tries to read it.
>
> -Eric

From lll+ext3 at m4x.org  Fri Oct 24 22:48:01 2008
From: lll+ext3 at m4x.org (Loic Le Loarer)
Date: Sat, 25 Oct 2008 00:48:01 +0200
Subject: see current superblock information
In-Reply-To: <20081024174756.49A5E6F064@alopias.GreenKey.net>
References: <20081024155922.GF24933@pavuc.le-loarer.org> <20081024174756.49A5E6F064@alopias.GreenKey.net>
Message-ID: <20081024224801.GH24933@pavuc.le-loarer.org>

On Friday, 24 October 2008 at 10:47:55 -0700, Curtis Doty wrote:
> 5:59pm Loic Le Loarer said:
>>
>> So I'm looking for a way to either force a flush of the superblock or
>> to just get the current used/free inode count.
>
> df -i
>
> Is that what you seek?

Exactly - it's so obvious now that you say it.

Thank you for the help!

--
Loïc

From lists at nerdbynature.de  Sat Oct 25 23:22:02 2008
From: lists at nerdbynature.de (Christian Kujau)
Date: Sat, 25 Oct 2008 16:22:02 -0700 (PDT)
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: <48FBEE8A.80608@obsidian.com.au>
References: <48FBEE8A.80608@obsidian.com.au>
Message-ID: 

Probably too late anyway, but:

On Mon, 20 Oct 2008, Robert Davidson wrote:
> The "kjournald" process also got stuck in the "D" state.

Did you try a SysRq-w to show all blocked tasks? Or even -d or -t. You
mentioned /var/log was on a different filesystem, so this information
might make it to the disks. If not, your serial console should catch
it. Maybe then we'll find out *why* these processes are in "D" state.

Christian.

--
BOFH excuse #25:

Decreasing electron flux
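For anyone wanting to follow this advice on a remote box, a sketch of driving SysRq without a keyboard (the loglevel step matters, as the follow-up below shows - output that never reaches the console is easy to mistake for no output):

    # Make sure the magic SysRq key is enabled at all.
    echo 1 > /proc/sys/kernel/sysrq

    # Raise the console loglevel so SysRq output reaches the (serial) console.
    echo 8 > /proc/sys/kernel/printk

    # SysRq-w: dump blocked (D-state) tasks.  -d and -t work the same way.
    echo w > /proc/sysrq-trigger
    echo t > /proc/sysrq-trigger   # full task dump; can be very long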
From rdavidson at obsidian.com.au  Mon Oct 27 01:10:12 2008
From: rdavidson at obsidian.com.au (Robert Davidson)
Date: Mon, 27 Oct 2008 12:10:12 +1100
Subject: ext3 file system I/O blocks until reboot
In-Reply-To: 
References: <48FBEE8A.80608@obsidian.com.au>
Message-ID: <490514F4.4060801@obsidian.com.au>

Christian Kujau wrote:
> Probably too late anyway, but:
>
> On Mon, 20 Oct 2008, Robert Davidson wrote:
>> The "kjournald" process also got stuck in the "D" state.
>
> Did you try a SysRq-w to show all blocked tasks? Or even -d or -t. You
> mentioned /var/log was on a different filesystem, so this information
> might make it to the disks. If not, your serial console should catch
> it. Maybe then we'll find out *why* these processes are in "D" state.

Hi Christian,

Not too late - this is still an ongoing problem. I'm currently trying
to get some newer vserver patches so I can build a newer kernel and
try that. Currently I'm stuck with 2.6.22.19.

I've tried doing various SysRq requests; none of them would give me
anything back on the serial console, but it seems that may have been
my own fault for having the console logging set too low. I've fixed
that up now. In any case, the responses you'd expect to see from the
kernel for the various SysRq commands never made it into the logs.

About a month ago, when the server last had problems, I made a new
ext3 filesystem and copied everything from the old filesystem to the
new one. I thought that had worked, but then last night we lost the
same filesystem again and had to reboot. After copying everything off
the original filesystem (also ext3), I ran a forced fsck.ext3 on it
and it didn't find any problems.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support at obsidian.com.au

From puhuri at iki.fi  Mon Oct 27 09:40:21 2008
From: puhuri at iki.fi (Markus Peuhkuri)
Date: Mon, 27 Oct 2008 11:40:21 +0200
Subject: Unlink performance
Message-ID: <49058C85.8060901@iki.fi>

Hi,

I have problems with ext3 deletes blocking filesystem access or
slowing down write speeds. My system is the following:

* One process reads real-time data (with a few seconds of buffering)
  and, after processing, writes it out at a top speed of 2x10 Mbyte/s
  (two streams to different disks).
* Two further processes read data from those same disks, process it
  further, and copy it to yet another pair of disks.
* Yet another process deletes older files to keep disk usage below 85%.

The reason for this kind of processing is that the second step is too
slow to happen in real time; the incoming data is bursty in nature,
and at peak load the processors are not fast enough to process the
data. On average (given the 2x900 GB disk buffer) the system is,
however, fast enough to post-process the data.

However, my delete script malfunctioned, and at one point it had
2x100 GB of files to delete; it thus ran 'rm file' one after another
for those 400 files, about 500 MB each. What resulted was that the
real-time data processing became too slow and the buffers overflowed.

Of course, I could force the delete script to sleep a few seconds
between file deletes to let the write process recover, but this still
feels like an unreliable patch.

I looked at IO schedulers, but while I'm quite familiar with
networking queues, IO scheduling is largely unknown territory for me.
I assume that you cannot assign per-process priorities with IO
schedulers? If that were possible, I would max out the priority for
the real-time process and give the delete function the lowest one.

Any ideas how I could make sure that the system does its best to
provide good service for the real-time processing? The secondary
processing is niced but, if I recall right, the delete was running
with nice 0.

I had a few ideas for improving things, but have not yet had time to
implement them:

* I could use a tee-like program for post-processing. At first it
  would try to process the data in real time (reading from the raw
  stream after it has been written to disk, so the data could still be
  in the buffer cache), but if it could not keep up, it would just
  queue the post-processing and continue later, when load allows.
* Smaller files would of course make the blocking time shorter.

If it matters, the systems use sata disks (both native and scsi-raid)
and run kernel 2.6.26 (Debian Lenny).

Markus
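On the scheduler question: on 2.6.26 the default CFQ I/O scheduler does in fact support per-process I/O priorities; they are set with ionice(1) from util-linux rather than with nice(1). A sketch of how the competing processes could be classed (paths and PIDs are illustrative):

    # Run the cleanup in the "idle" class: it gets disk time only when
    # nothing else wants it (requires the CFQ scheduler on that disk).
    ionice -c3 rm /buffer/disk1/old-capture-0001.dat

    # Or demote an already-running delete script by PID:
    ionice -c3 -p 4242

    # Best-effort class, highest priority, for the real-time writer:
    ionice -c2 -n0 -p 4241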
From alex at alex.org.uk  Mon Oct 27 09:30:18 2008
From: alex at alex.org.uk (Alex Bligh)
Date: Mon, 27 Oct 2008 10:30:18 +0100
Subject: Unlink performance
In-Reply-To: <49058C85.8060901@iki.fi>
References: <49058C85.8060901@iki.fi>
Message-ID: <52F49968757FFFD36073E072@Ximines.local>

--On 27 October 2008 11:40:21 +0200 Markus Peuhkuri wrote:

> However, my delete script malfunctioned, and at one point it had
> 2x100 GB of files to delete; it thus ran 'rm file' one after another
> for those 400 files, about 500 MB each. What resulted was that the
> real-time data processing became too slow and the buffers overflowed.

Are all the files in the same directory? Even with HTREE there seem
to be cases where this is surprisingly slow. Look into using nested
directories (e.g. A/B/C/D/foo where A, B, C, D are truncated hashes
of the file name).

Or, if you don't mind losing data in a power-off and the job suits,
unlink the file name immediately after your processor has opened it.
Then it will be deleted on close.

Alex
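Sketches of both suggestions (a bash sketch; the file name, the four-level hash depth, and the hypothetical postprocess command are illustrative):

    # 1) Nested hash directories: spread files so no single directory
    #    grows huge.
    name="capture-20081027-1140.dat"
    h=$(printf '%s' "$name" | md5sum | cut -c1-4)
    dir="/buffer/${h:0:1}/${h:1:1}/${h:2:1}/${h:3:1}"
    mkdir -p "$dir"
    mv "/incoming/$name" "$dir/$name"

    # 2) Unlink-after-open: the name disappears at once, but the blocks
    #    are freed only when fd 3 closes - and the file is simply gone
    #    after a power-off, as Alex notes.
    exec 3< "$dir/$name"
    rm -- "$dir/$name"
    postprocess <&3     # hypothetical consumer reading from fd 3
    exec 3<&-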
Doing the "unlink; sleep 1" will keep the traffic to the journal lower, as would deleting fewer files more often to ensure you don't delete 200GB of data at one time if you have real-time requirements. If you are not creating files faster than 1/s unlinks should be able to keep up. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.