
Re: ext3 filesystem corruption - more info

Andreas Dilger wrote:
On Apr 13, 2006  10:40 -0400, Sev Binello wrote:
[ still HTML-only email, extracting text from HTML is getting dull ]
Since it seemed to mount okay only 3 minutes earlier,
can we assume that it was initially uncorrupted?
Or is that not a valid assumption?

No, at mount time there is only very cursory checking done of the group
descriptors and superblock.  The corruption reported appears to be from
bad indirect blocks.

Is there anything that we can check, test, etc.?
Any advice or action at this point is better than waiting for the next
filesystem disaster to occur.

Do you run with write cache enabled on your device?  That can potentially
cause filesystem corruption even in the face of ext3 journaling, because
the journal atomicity guarantees are lost when the device reports a write
is complete on disk when it really isn't.
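As a rough illustration of checking this on the host side, here is a minimal Python sketch. It assumes a recent Linux kernel that exposes `/sys/block/<disk>/queue/write_cache` (a kernel of the vintage discussed in this thread would not have it), and the helper names `is_write_back` and `disk_write_cache` are made up for this example. It only reports what the kernel believes about the block device; it says nothing about a cache inside an external RAID controller, which has to be checked with the controller's own tools.

```python
from pathlib import Path


def is_write_back(sysfs_text: str) -> bool:
    """True if the sysfs value says the device has a volatile write-back cache."""
    return sysfs_text.strip() == "write back"


def disk_write_cache(disk: str):
    """Return the kernel's write_cache setting for a disk such as 'sda',
    or None if the attribute is missing (older kernel, unknown disk)."""
    path = Path("/sys/block") / disk / "queue" / "write_cache"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # "sda" is only a placeholder; substitute the disk you care about.
    mode = disk_write_cache("sda")
    if mode is None:
        print("write_cache attribute not available for this disk")
    elif is_write_back(mode):
        print("write-back cache reported: completed writes may still be volatile")
    else:
        print(f"cache mode reported as: {mode}")
```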
The raid system does run with write back cache enabled.
I don't believe the actual drives have this enabled,  but I'd have to check.

But we didn't actually lose power on the raid or hosts
just the connecting switches, so we lost all communication.
Presumably, in this situation the controller cache should have been emptied.
Is my reasoning correct here?

Either way, you are saying it is best to avoid write caching in the future.

Also, in looking at and comparing the error messages in the log files,
I noticed that on the host where the corruption occurred,
the call to abort the journal didn't seem to actually happen for an hour.
Does that have any significance?
    Mar 25 14:38:52 acnlin83 kernel: Error (-5) on journal on device 08:21
    Mar 25 14:38:52 acnlin83 kernel: Aborting journal on device sd(8,33).
        (1hr gap)
    Mar 25 15:39:19 acnlin83 kernel: ext3_abort called.
    Mar 25 15:39:19 acnlin83 kernel: EXT3-fs abort (device sd(8,33)): ext3_journal_start: Detected aborted journal
    Mar 25 15:39:19 acnlin83 kernel: Remounting filesystem read-only
    Mar 25 15:39:19 acnlin83 kernel: EXT3-fs error (device sd(8,33)) in start_transaction: Journal has aborted
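
To put a number on that gap, here is a small Python sketch that parses the timestamps from the excerpt above. It also checks that "08:21" and "sd(8,33)" name the same device, on the assumption that the first form is major:minor in hexadecimal. The year 2006 is assumed from the date of this thread, since syslog lines do not carry one, and the helper names `stamp` and `decode_dev` are made up for this example.

```python
from datetime import datetime

# Lines copied from the log excerpt above.
LOG = """\
Mar 25 14:38:52 acnlin83 kernel: Error (-5) on journal on device 08:21
Mar 25 14:38:52 acnlin83 kernel: Aborting journal on device sd(8,33).
Mar 25 15:39:19 acnlin83 kernel: ext3_abort called.
Mar 25 15:39:19 acnlin83 kernel: Remounting filesystem read-only
"""


def stamp(line, year=2006):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line (year is assumed)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def decode_dev(hexpair):
    """Split a 'MM:mm' pair, read as hexadecimal, into (major, minor)."""
    major, minor = (int(part, 16) for part in hexpair.split(":"))
    return major, minor


lines = LOG.splitlines()
abort_logged = next(l for l in lines if "Aborting journal" in l)
abort_called = next(l for l in lines if "ext3_abort called" in l)

gap = stamp(abort_called) - stamp(abort_logged)
print(gap)                  # 1:00:27
print(decode_dev("08:21"))  # (8, 33)
```

So the two messages are 1 hour and 27 seconds apart, and both refer to major 8, minor 33.
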
Thanks again
Cheers, Andreas
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Sev Binello
Brookhaven National Laboratory
Upton, New York
sev bnl gov
