
Re: another seriously corrupt ext3 -- pesky journal

On Mon, Aug 18, 2003 at 12:39:46PM -0400, Erez Zadok wrote:
> The power failure on Thursday did something evil to my ext3 file system (box
> running RH9+patches, ext3, /dev/md0, raid5 driver, 400GB f/s using 3x200GB
> IDE drives and one hot-spare).  The f/s got corrupt badly and the symptoms
> are very similar to what Eddy described here:
> 	https://www.redhat.com/archives/ext3-users/2003-July/msg00015.html
> That is, nearly everything I try results in an error such as
> 	"Invalid argument while checking ext3 journal for /dev/md0"

What probably happened is that the power failed while you were writing
out the inode table, and the memory failed before the DMA engine and
hard drive did, since DRAM tends to be more sensitive to voltage drops
than other parts of the system.  As a result, random garbage got
scribbled all over the disk.  (Ted's observation: PC-class hardware
is sh*t.)

Normally, this isn't a problem, since the ext3 journal contains full
copies of recently written blocks.  (As opposed to filesystems that
use soft updates or logical journaling, which are even more fragile
in the face of cheap hardware that scribbles random garbage on power
failure.)  However, the journal can't help when the first part of the
inode table is scribbled upon, such that the journal inode cannot be
found.
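
To see why damage to the very start of the inode table is enough to
lose the journal, here is a minimal sketch of where an inode sits in
its group's table.  It assumes the internal journal is at inode 8 (the
usual reserved number), 128-byte inodes, and 4096-byte blocks; the
inodes_per_group value is purely illustrative, and none of these
figures were reported in this thread.

```python
# Sketch only: inode size, block size and inodes_per_group are assumptions.

def inode_location(ino, inodes_per_group, inode_size=128, block_size=4096):
    """Return (group, block index in that group's inode table, byte offset)."""
    index = ino - 1                       # inode numbers start at 1
    group = index // inodes_per_group
    byte_in_table = (index % inodes_per_group) * inode_size
    return group, byte_in_table // block_size, byte_in_table % block_size

JOURNAL_INO = 8                           # assumed: usual internal journal inode

print(inode_location(JOURNAL_INO, inodes_per_group=16384))
# -> (0, 0, 896)
```

Under these assumptions the journal inode sits in the first block of
the first group's inode table, so garbage written there is enough to
make the journal unreachable.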

Given that this sort of failure has been reported at least two or
three times now, it's clear we need to address this vulnerability,
probably by keeping a backup copy of the journal inode (or at least
the list of the journal's data blocks) in the superblock, so it can
survive this particular lossage mode.

> Ted answered here:
> 	https://www.redhat.com/archives/ext3-users/2003-July/msg00035.html
> and suggested the last ditch approach using mke2fs -S to reinitialize the
> superblock and group descriptors.  After trying all sorts of "safe" methods
> to recover the files, I have tried the -S option as follows:
> # mke2fs -j -b 4096 -S /dev/md0
> Creating journal (8192 blocks): mke2fs: File exists 
>  while trying to create journal
> ----------------------------------------------------------------------------

Yeah, what happened here is that the -S option does not clear the
inode table.  So when it tried to create the journal inode, it found
that there was something there already (but probably garbaged) and
then bombed out.

> And once again got this error wrt the journal.  Note that before I even
> tried this -S procedure, I tried to simply turn off the has_journal bit
> using tune2fs: didn't help.  (I'm willing to lose the info in the journal,
> as long as I can get the rest of my large f/s.)  But tune2fs and friends
> gave me a chicken-and-egg error about the invalid arg wrt the journal, while
> trying to turn it off (duhh).

You could have turned it off using debugfs, but up until now it's not
something that I've encouraged because of concerns that there might be
real data loss if it was too easy for users to disable the journal.

> Now I was able to start "e2fsck -b 71663616 -B 4096 /dev/md0".  It's been
> running for a couple of hours already.  Of course, it's discovering all
> sorts of wonderful new events and spewing messages I've never even seen
> before. 1/2 a :-)

Yup.  Some of the damage was caused by not replaying the journal
before running e2fsck, and some was probably done by the power failure
causing garbage to be scribbled on the disk.
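
The -b value quoted above can be sanity-checked with a little
arithmetic.  This sketch assumes the sparse_super layout (backup
superblocks in groups 0 and 1 and in groups that are powers of 3, 5
and 7), the mke2fs default of 8 * blocksize blocks per group, and a
first data block of 0, as is usual for 4096-byte blocks; the thread
confirms none of these for this filesystem.

```python
# Sketch only: layout parameters are assumed defaults, not values from the thread.

def is_power_of(n, base):
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1

def has_backup_superblock(group):
    """sparse_super: groups 0 and 1, plus powers of 3, 5 and 7."""
    return group in (0, 1) or any(is_power_of(group, b) for b in (3, 5, 7))

def backup_group(block, blocks_per_group, first_data_block=0):
    """Return the group whose superblock copy starts at `block`, else None."""
    group, rem = divmod(block - first_data_block, blocks_per_group)
    return group if rem == 0 and has_backup_superblock(group) else None

BLOCK_SIZE = 4096
print(backup_group(71663616, blocks_per_group=8 * BLOCK_SIZE))
# -> 2187
```

Group 2187 is 3**7, so under these assumptions 71663616 is a plausible
backup superblock location for a filesystem with 4096-byte blocks.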

> Anyway, my hypothesis now is that the f/s in question may have just had a
> really really bad journal inode on it that was preventing anything else from
> happening, and that perhaps I shouldn't have tried "mke2fs -S" above had I
> been able to just nuke the pesky journal (it might have prevented further
> corruption that fsck is now "fixing").

Your hypothesis was right.  Whether you nuked the journal by using
debugfs or by using mke2fs -S probably wouldn't have made any
difference, however.

> The good news is that prior to experimentation, I have made a dd backup of
> /dev/md0 (400GB) onto a file on another file server (1.5T), so I can dd it
> back onto my real /dev/md0 if need be.  Alternatively, I can make a second
> copy of that backup file, use losetup on the second copy, and then
> experiment.
> Questions:
> 1. Is there any reason why I couldn't experiment with e2fsprogs binaries on
>    a f/s dd image mounted over /dev/loopN?  I.e., will it behave the same as
>    a disk device as far as e2fsprogs are concerned?

No reason.  The e2fsprogs binaries don't need to operate on a block
device.  You can just point them at a dd image directly.

> 2. If my assertion is correct that most of my f/s is intact but the journal
>    is FUBAR, I need to find a way to force fsck to ignore the journal no
>    matter what.  Is there such a tool or option to some tool?  Is there a
>    way I could simply scan the disk and truncate the journal file, or turn
>    off the has_journal bit w/o touching the rest of the f/s?

You can use debugfs's feature command to turn off the has_journal bit
as follows:

debugfs -w /dev/hdaXX
debugfs: features ^has_journal
debugfs: features ^needs_recovery 
debugfs: quit

Hmm.... this will work unless the group descriptors are so badly
damaged that debugfs refuses to touch the superblock.  You can open it
in catastrophic mode, but right now, as a safety precaution, you're
not allowed to open the filesystem in read/write mode when in
catastrophic mode.  I can remove this restriction if we add some more
safety checks that will prevent debugfs from doing more damage when
opened in read/write catastrophic mode.  At the moment, debugfs has
been written with a "first do no harm" principle.
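
The two flags cleared by the debugfs commands above live in fixed
fields of the superblock, so they can be inspected read-only before
changing anything.  A minimal sketch, assuming the standard ext2/ext3
on-disk layout (primary superblock 1024 bytes into the device or
image, little-endian fields); the function names are made up for
illustration, and the demo uses a synthetic superblock rather than
real data.

```python
import struct

# Assumed standard ext2/ext3 layout constants.
SUPERBLOCK_OFFSET = 1024        # primary superblock starts here
SUPERBLOCK_SIZE = 1024
EXT2_MAGIC = 0xEF53             # s_magic, at offset 0x38
COMPAT_HAS_JOURNAL = 0x0004     # bit in s_feature_compat (0x5C)
INCOMPAT_RECOVER = 0x0004       # needs_recovery, in s_feature_incompat (0x60)

def read_superblock(path):
    """Read the raw primary superblock from a device or a dd image file."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        return f.read(SUPERBLOCK_SIZE)

def journal_flags(sb):
    """Decode the journal-related fields of a raw superblock (read-only)."""
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2/ext3 superblock (bad magic)")
    compat, incompat = struct.unpack_from("<II", sb, 0x5C)
    (journal_inum,) = struct.unpack_from("<I", sb, 0xE0)
    return {
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
        "journal_inum": journal_inum,
    }

# Demo on a synthetic superblock; a real run would instead call
# journal_flags(read_superblock(<path to device or image>)).
demo = bytearray(SUPERBLOCK_SIZE)
struct.pack_into("<H", demo, 0x38, EXT2_MAGIC)
struct.pack_into("<II", demo, 0x5C, COMPAT_HAS_JOURNAL, INCOMPAT_RECOVER)
struct.pack_into("<I", demo, 0xE0, 8)
print(journal_flags(bytes(demo)))
```

Because this only reads, it is safe to run against a copy of the dd
image to confirm which flags are set before and after the debugfs
session.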

Ultimately, though, it's probably more important to add a backup copy
of the journal inode to avoid needing to play games like this in the
first place, and to allow e2fsck to recover from these situations.

						- Ted
