recovering failed resize2fs

Mon Oct 20 01:53:09 UTC 2008

On Sat, Oct 18, 2008 at 04:20:13PM -0700, Curtis Doty wrote:
> 4:29pm Theodore Tso said:
>
>> On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
>>> While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
>>> deadlocked. (I have photo of screen/oops if anybody's interested.)
>>
>> Yes, that would be useful, thanks.
>
> Three photos of same: http://www.greenkey.net/~curtis/linux/
>
> The rest had scrolled off, so maybe that soft lockup was a secondary  
> effect rather than true cause? It was re-appearing every minute.

Looks like the kernel wedged because it ran out of memory.  The calls
to shrink_zone(), shrink_inactive_list(), try_to_release_page(),
etc. indicate that the system was frantically trying to find free
physical memory at the time.  That may or may not have been caused
by the online resize; how much memory does your system have, and what
else was running?  It may be that something *else* had been leaking
memory, and the resize merely pushed it over the line.
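
For the memory question, a minimal sketch that summarizes the relevant
fields of /proc/meminfo.  The function name mem_summary is made up for
this example; it only reads the file and changes nothing.

```shell
# mem_summary [FILE]: print total/free RAM and swap in MiB.
# FILE defaults to /proc/meminfo; values there are reported in kB.
# (mem_summary is a hypothetical helper name, not a standard tool.)
mem_summary() {
    awk '/^(MemTotal|MemFree|SwapTotal|SwapFree):/ {
             printf "%s %d MiB\n", $1, $2 / 1024
         }' "${1:-/proc/meminfo}"
}
```

Running mem_summary with no argument on the affected machine, ideally
while the workload that was active during the resize is running again,
gives a rough idea of how close the box sits to the edge.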

The online resize is journaled, so it should have been safe; my guess
is that the system was thrashing hard and you didn't have barriers
enabled, and that combination corrupted the filesystem.

>> Hmm... This sounds like the needs recovery flag was set on the backup
>> superblock, which should never happen.  Before we try something more
>> extreme, see if this helps you:
>>
>> e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
>>
>> That forces the use of the backup superblock right away, and might
>> help you get past the initial error.
>
> Same as before. :-(
>
> # e2fsck -b32768 -B4096 -C0 /dev/dat/inst
> e2fsck 1.41.0 (10-Jul-2008)
> inst: recovering journal
> e2fsck: unable to set superblock flags on inst
>
> It appears *all* superblocks are the same as that first one at 32768;
> iterating over every superblock shown in the mkfs -n output says so.
>
> I'm inclined to just force reduce the underlying lvm. It was 100% full  
> before I extended and tried to resize. And I know the only writes on the  
> new lvm extent would have been from resize2fs. Is that wise?

No, force reducing the underlying LVM is only going to make things
worse, since it doesn't fix the filesystem.

So this is what I would do.  Create a snapshot and try this on the
snapshot first:

% lvcreate -s -L 10G -n inst-snapshot /dev/dat/inst
% debugfs -w /dev/dat/inst-snapshot
debugfs: features ^needs_recovery
debugfs: quit
% e2fsck -C 0 /dev/dat/inst-snapshot

This will skip running the journal, but there's no guarantee the
journal is valid anyway.
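
Before running e2fsck it may be worth confirming that debugfs really
cleared the flag on the snapshot.  A small sketch, assuming the usual
"Filesystem features:" line printed by dumpe2fs -h; the helper name
has_needs_recovery is invented for this example.

```shell
# has_needs_recovery: read `dumpe2fs -h` output on stdin and exit 0 if
# the "Filesystem features:" line still lists needs_recovery, else
# exit non-zero.  (Hypothetical helper, not part of e2fsprogs.)
has_needs_recovery() {
    grep '^Filesystem features:' | grep -qw 'needs_recovery'
}

# Example use against the snapshot (read-only):
#   if dumpe2fs -h /dev/dat/inst-snapshot 2>/dev/null | has_needs_recovery
#   then echo "needs_recovery is still set"
#   else echo "needs_recovery is clear"
#   fi
```

This only inspects the primary superblock of the snapshot and writes
nothing to the device.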

If this turns into a mess, you can throw away the snapshot and try
something else.  (The something else would require writing a C program
that clears the needs_recovery flag from all the backup superblocks
while keeping it set on the master superblock.  That's more work, so
let's try this way first.)
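
To get a feel for which blocks such a program would have to touch,
here is a rough sketch that lists where the backup superblocks should
live.  It assumes the sparse_super layout (backups in group 1 and in
groups that are powers of 3, 5 and 7), 4k blocks, and 32768 blocks per
group, which matches the -b 32768 -B 4096 used above.  The function
name backup_sb_blocks is made up; the authoritative list is still
whatever mke2fs -n or dumpe2fs reports for the device.

```shell
# backup_sb_blocks NGROUPS [BLOCKS_PER_GROUP]
# Print the block number of each backup superblock, one per line,
# under the sparse_super layout.  Assumes the first data block is 0
# (true for 4k blocks), so group g starts at block g * BLOCKS_PER_GROUP.
# (Hypothetical helper, not part of e2fsprogs; it touches no device.)
backup_sb_blocks() {
    ngroups=$1
    bpg=${2:-32768}
    {
        # Group 1 always carries a backup if it exists.
        if [ "$ngroups" -gt 1 ]; then
            echo 1
        fi
        # Remaining backups sit in groups 3^n, 5^n and 7^n.
        for base in 3 5 7; do
            g=$base
            while [ "$g" -lt "$ngroups" ]; do
                echo "$g"
                g=$((g * base))
            done
        done
    } | sort -n | while read -r g; do
        echo $((g * bpg))
    done
}

# Example (read-only), with NGROUPS taken from the dumpe2fs -h output:
#   for b in $(backup_sb_blocks NGROUPS); do
#       dumpe2fs -h -o superblock=$b -o blocksize=4096 /dev/dat/inst-snapshot
#   done
```

The loop in the comment only reads each backup, so it is safe to run
on the snapshot to see which copies carry needs_recovery.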

						- Ted