<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
Andreas Dilger wrote:
<blockquote cite="mid20060413192909.GV17364@schatzie.adilger.int" type="cite">
<pre wrap="">On Apr 13, 2006 10:40 -0400, Sev Binello wrote: [ still HTML-only email, extracting text from HTML is getting dull ] </pre>
<blockquote type="cite">
<pre wrap="">Since it seemed to mount okay only 3mins earlier, can we assume that it was initially uncorrupted ? Or, is that not valid assumption ? </pre>
</blockquote>
<pre wrap=""> No, at mount time there is only very cursory checking done of the group descriptors and superblock. The corruption reported appears to be from bad indirect blocks. </pre>
<blockquote type="cite">
<pre wrap="">Is there anything that we can check, test etc... any advice, action at this point is better than waiting for the next filesystem disaster to occur. </pre>
</blockquote>
<pre wrap=""> Do you run with write cache enabled on your device? That can potentially cause filesystem corruption even in the face of ext3 journaling, because the journal atomicity guarantees are lost when the device reports a write is complete on disk when it really isn't. </pre>
</blockquote>
The RAID system does run with write-back cache enabled. I don't believe the actual drives have it enabled, but I'd have to check.<br>
However, we didn't actually lose power on the RAID or the hosts, just on the connecting switches, so we lost all communication. Presumably, in this situation the controller cache should still have been flushed to disk. Is my reasoning correct here?<br>
Either way, you are saying it is best to avoid write caching in the future.<br>
<br>
Also, in comparing the error messages in the log files, I noticed that on the host where the corruption occurred, the call to abort the journal didn't seem to actually happen for an hour. Does that have any significance?
<blockquote>Mar 25 14:38:52 acnlin83 kernel: Error (-5) on journal on device 08:21<br>
Mar 25 14:38:52 acnlin83 kernel: Aborting journal on device sd(8,33).<br>
<br>
[1 hr gap]<br>
<br>
Mar 25 15:39:19 acnlin83 kernel: ext3_abort called.<br>
Mar 25 15:39:19 acnlin83 kernel: EXT3-fs abort (device sd(8,33)): ext3_journal_start: Detected aborted journal<br>
Mar 25 15:39:19 acnlin83 kernel: Remounting filesystem read-only<br>
Mar 25 15:39:19 acnlin83 kernel: EXT3-fs error (device sd(8,33)) in start_transaction: Journal has aborted
</blockquote>
Thanks again,<br>
-Sev
<blockquote cite="mid20060413192909.GV17364@schatzie.adilger.int" type="cite">
<pre wrap=""> Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. </pre>
</blockquote>
<pre class="moz-signature" cols="100">-- Sev Binello Brookhaven National Laboratory Upton, New York 631-344-5647 <a class="moz-txt-link-abbreviated" href="mailto:sev@bnl.gov">sev@bnl.gov</a> </pre>
</body>
</html>