
Re: recovering failed resize2fs

On Sunday, Theodore Tso said:

On Sat, Oct 18, 2008 at 04:20:13PM -0700, Curtis Doty wrote:
4:29pm Theodore Tso said:

On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
deadlocked. (I have photo of screen/oops if anybody's interested.)

Yes, that would be useful, thanks.

Three photos of same: http://www.greenkey.net/~curtis/linux/

The rest had scrolled off, so maybe that soft lockup was a secondary
effect rather than the true cause? It was reappearing every minute.

Looks like the kernel wedged due to running out of memory.  The calls
to shrink_zone(), shrink_inactive_list(), try_to_release_page(),
etc. tend to indicate that the system was frantically trying to find
free physical memory at the time.  It may or may not have been caused
by the online resize; how much memory does your system have, and what
else was going on at the time?  It may have been that something *else*
had been leaking memory, and this pushed it over the line.

The system had been up a couple months and doing significant I/O on the ext4 volume. And indeed it had been having periodic memory/swap issues:


It's also the case that the online resize is journaled, so it should
have been safe; but I'm guessing that the system was thrashing so
hard, and you didn't have barriers enabled, so the filesystem ended
up getting corrupted.

Some other observations...

 - a snapshot in a different vg blew up a few days prior; it was deleted
- ran vgs a few times in another vty during resize2fs *immediately* before crash

Hmm... This sounds like the needs_recovery flag was set on the backup
superblock, which should never happen.  Before we try something more
extreme, see if this helps you:

e2fsck -b 32768 -B 4096 /dev/where-inst-is-located

That forces the use of the backup superblock right away, and might
help you get past the initial error.

Same as before. :-(

# e2fsck -b32768 -B4096 -C0 /dev/dat/inst
e2fsck 1.41.0 (10-Jul-2008)
inst: recovering journal
e2fsck: unable to set superblock flags on inst

It appears *all* superblocks behave the same as that first one at
32768; iterating over all the superblock locations shown in the
mkfs -n output confirms it.
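For context, that 32768 figure is simply the first backup location on a 4K-block filesystem: with the sparse_super feature, backup superblocks sit at the start of block group 1 and of groups that are powers of 3, 5, and 7, and each group spans blocksize * 8 blocks. A rough sketch of the calculation (illustrative only, function name is mine; this assumes a 4K block size, where the first data block is 0):

```python
def backup_sb_blocks(blocksize, total_groups):
    """Block numbers of the backup superblocks (sparse_super layout)."""
    # Each block group spans blocksize * 8 blocks: one bit per block
    # in the group's block bitmap, which occupies a single block.
    blocks_per_group = blocksize * 8
    groups = {1} if total_groups > 1 else set()
    # Beyond group 1, backups live in groups that are powers of 3, 5, or 7.
    for base in (3, 5, 7):
        g = base
        while g < total_groups:
            groups.add(g)
            g *= base
    return sorted(g * blocks_per_group for g in groups)
```

For a 4096-byte block size the first entry is 1 * 32768 = 32768, which is the superblock e2fsck -b 32768 was pointed at above.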

I'm inclined to just force reduce the underlying LVM. It was 100% full
before I extended it and tried to resize. And I know the only writes on
the new LVM extent would have been from resize2fs. Is that wise?

No, force reducing the underlying LVM is only going to make things
worse, since it doesn't fix the filesystem.

So this is what I would do.  Create a snapshot and try this on the
snapshot first:

% lvcreate -s -L 10G -n inst-snapshot /dev/dat/inst
% debugfs -w /dev/dat/inst-snapshot
debugfs: features ^needs_recovery
debugfs: quit
% e2fsck -C 0 /dev/dat/inst-snapshot

Done, but no change. :-(

EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in group (block 0)!
EXT4-fs: group descriptors corrupted!

This will skip running the journal, but there's no guarantee the
journal is valid anyway.

If this turns into a mess, you can throw away the snapshot and try
something else.  (The something else would require writing a C program
that removes the needs_recovery flag from all the backup superblocks,
while keeping it set on the master superblock.  That's more work, so
let's try this way first.)

How does that something else work?
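For what it's worth, the core of such a program is small. A hypothetical sketch (not Ted's actual tool, and in Python rather than C for brevity): per the ext2/3/4 on-disk layout, the superblock magic 0xEF53 sits at byte offset 56 within the superblock, and the s_feature_incompat word at offset 96 carries the needs_recovery bit (0x0004). The function name and structure here are my own:

```python
import struct

EXT_MAGIC = 0xEF53         # s_magic, little-endian, at offset 56
MAGIC_OFF = 56
INCOMPAT_OFF = 96          # s_feature_incompat
INCOMPAT_RECOVER = 0x0004  # the needs_recovery journal flag

def clear_needs_recovery(sb):
    """Return a copy of a raw 1024-byte superblock with needs_recovery cleared."""
    magic, = struct.unpack_from('<H', sb, MAGIC_OFF)
    if magic != EXT_MAGIC:
        raise ValueError('not an ext2/3/4 superblock')
    incompat, = struct.unpack_from('<I', sb, INCOMPAT_OFF)
    out = bytearray(sb)
    struct.pack_into('<I', out, INCOMPAT_OFF, incompat & ~INCOMPAT_RECOVER)
    return bytes(out)
```

To apply it on a device, one would presumably seek to each backup location reported by mkfs -n (group number times blocks-per-group times blocksize), read the first 1024 bytes, write the result back, and leave the primary superblock at byte offset 1024 untouched so the journal can still be replayed from the master copy.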

