Intermittent ext3 corruption on external FireWire Micronet 1.5TB RAID on FC3

Sun May 15 13:56:53 UTC 2005

Hi

I have a FireWire-connected Micronet 1.5TB RAID with a single
large ext3 filesystem on one partition, attached to a dual
Xeon system.

I am checking out an extremely large CVS repository
(don't ask) to this drive over the course of many days, and
intermittently I get bad-block errors and the filesystem goes
read-only. This is not related to any power failure or
anything similar. The RAID is currently about 40% full;
as I recall, this started to happen around the 15% mark.

I checked the RAID firmware setup, found that caching was
set to write-back, and changed it to write-through to
see if that would help (since I gather the Linux kernel
presumes write-through, though why it should make a
difference in the absence of a reboot or power failure
I don't understand).

This reduced the frequency of the error from once a night
to once every couple of nights. Interestingly, it mostly
happens at about 04:03. Looking at cron.daily, only mrtg
and sa seem to be starting up at about that time.

I suspect the timing is related to a change in the pattern
of disk activity rather than anything else.
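
One way to test that hunch would be to tally when the journal
aborts actually occur. The following is only a minimal sketch:
it assumes syslog-style lines like the one quoted at the end of
this message, the year has to be supplied because syslog does
not record it, and the function names are made up for
illustration.

```python
from collections import Counter
from datetime import datetime

# One real line from the log quoted at the end of this message.
# In practice the lines would come from the system log file.
SAMPLE = [
    "May 15 04:03:30 localhost kernel: Aborting journal on device sdd1.",
]


def journal_abort_times(lines, year=2005):
    """Return a datetime for every 'Aborting journal' line.

    Assumes the classic syslog prefix 'Mon DD HH:MM:SS'; the year is
    not in the log, so it is passed in (2005 here is an assumption).
    """
    times = []
    for line in lines:
        if "Aborting journal" in line:
            # split() also copes with the double space syslog uses
            # for single-digit days ("May  5 ...").
            stamp = " ".join(line.split()[:3])
            times.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            )
    return times


def count_by_hour(times):
    """Count journal aborts per hour of the day."""
    return Counter(t.hour for t in times)


if __name__ == "__main__":
    for hour, n in sorted(count_by_hour(journal_abort_times(SAMPLE)).items()):
        print(f"{hour:02d}:00  {n} abort(s)")
```

If nearly all the aborts fall in the same hour across several
weeks of logs, that would support the idea that something
scheduled is changing the disk activity pattern.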

I have no reason to suspect that there is anything actually
wrong with the RAID itself, which just appears as a really
big firewire external disk. It is new however, so this
can't be ruled out.

My next step is to turn off journaling and see whether
running the filesystem as plain ext2 works OK. Journaling
doesn't seem to be doing much good, as I am stuck regularly
running ordinary fscks after all these errors anyway!

I just thought I would ask whether anyone else has had a similar
experience, and whether such problems are known to lie with ext3,
the FireWire interface, or the two together.

PS. I actually created the partition and ran mkfs on
an AMD64 FC3 system at a different site, not on the
system to which the RAID is currently connected. I mention
this in case it makes a difference, but I presume an fsck
would have noticed and fixed anything fundamentally wrong in
this regard.

David

May 15 04:03:30 localhost kernel: Aborting journal on device sdd1.
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343526: bad block 165510584
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1) in start_transaction: Journal has aborted
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1) in start_transaction: Journal has aborted
May 15 04:03:30 localhost kernel: inode_doinit_with_dentry:  getxattr returned 5 for dev=sdd1 ino=63343526
May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343381: bad block 141623810
May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63947123: bad block 203323361

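
To see which inodes and blocks are involved, the messages above
can be pulled apart with a short script. This is a rough sketch
only: the regular expression is inferred from the lines shown
here, the helper names are hypothetical, and the 4096-byte block
size is an assumption that has not been confirmed for this
filesystem.

```python
import re

# Lines copied from the kernel log above.
LOG = """\
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343526: bad block 165510584
May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343381: bad block 141623810
May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63947123: bad block 203323361
"""

# Pattern inferred from the messages above; other kernel versions
# may word these errors differently.
BAD_BLOCK = re.compile(
    r"EXT3-fs error \(device (?P<dev>\w+)\): (?P<func>\w+): "
    r"inode (?P<inode>\d+): bad block (?P<block>\d+)"
)

ASSUMED_BLOCK_SIZE = 4096  # assumption, not read from the superblock


def parse_bad_blocks(text):
    """Return (device, inode, block) tuples for each bad-block message."""
    found = []
    for line in text.splitlines():
        m = BAD_BLOCK.search(line)
        if m:
            found.append(
                (m.group("dev"), int(m.group("inode")), int(m.group("block")))
            )
    return found


def block_offset_gib(block, block_size=ASSUMED_BLOCK_SIZE):
    """Approximate offset of a block from the start of the partition, in GiB."""
    return block * block_size / 2**30


if __name__ == "__main__":
    for dev, inode, block in parse_bad_blocks(LOG):
        print(f"{dev}: inode {inode}, block {block} "
              f"(~{block_offset_gib(block):.0f} GiB into the partition)")
```

Comparing the resulting offsets against the partition size is a
quick way to check that the reported block numbers are at least
within range.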
Linux localhost.localdomain 2.6.9-1.667smp #1 SMP Tue Nov 2 14:59:52 EST 2004 i686 i686 i386 GNU/Linux