SCSI disk errors, but disk diagnostics say disk is OK

Tom Haws trh at timberline.ca
Thu Apr 14 17:10:15 UTC 2005


Hi all,

Sorry for posting here, but I have a RH9 machine running the 2.4.20-8 
kernel that's having problems, and the shrike-list is just about dead, 
so I thought I might have a better chance of getting responses here...

I have a brand new (3 months old) Seagate ST3146807LW 146GB U320 SCSI 
drive that is having write problems:

SCSI disk error : host 0 channel 0 id 12 lun 0 return code = 8000002
Info fld=0x7df6bbf, Deferred sd08:11: sense key Hardware Error
Additional sense indicates Internal target failure
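
For reference, the return code in that message can be split into its component bytes. This is only a minimal sketch: it assumes the kernel prints the value in hex and that it follows the usual Linux SCSI mid-layer layout (driver byte, host byte, message byte, status byte, from high to low). The function name decode_scsi_result is made up for illustration.

```python
# Split a Linux SCSI result word into its four byte fields.
# Assumed layout: driver << 24 | host << 16 | msg << 8 | status
def decode_scsi_result(result):
    return {
        "status": result & 0xFF,          # raw SCSI status byte from the target
        "msg": (result >> 8) & 0xFF,      # SCSI message byte
        "host": (result >> 16) & 0xFF,    # host adapter / transport status
        "driver": (result >> 24) & 0xFF,  # mid-layer driver status
    }

# "return code = 8000002", read as hexadecimal
fields = decode_scsi_result(0x08000002)
for name, value in fields.items():
    print(f"{name:6s} = 0x{value:02x}")
```

Under those assumptions, status 0x02 is CHECK CONDITION and driver 0x08 means sense data is available, with the host byte at zero. That would suggest the error is being reported by the drive itself rather than by the adapter, though I may be misreading it.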

Eventually the filesystem hits enough of these errors that the kernel 
does this:

journal_bmap_Rsmp_e68c71a3: journal block not found at offset 6156 on sd(8,17)
Aborting journal on device sd(8,17).
ext3_abort called.
EXT3-fs abort (device sd(8,17)): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
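
To tie the two log excerpts together: "sd08:11" in the SCSI error is a hex major:minor pair, and sd(8,17) in the ext3 messages is the same pair in decimal. A small sketch of the mapping to a device name, assuming the traditional layout for SCSI disk major 8 (16 minors per disk, minor 0 of each group being the whole disk); the helper sd_device_name is hypothetical.

```python
# Map a (major, minor) pair for SCSI disk major 8 to a /dev name.
# Assumption: 16 minors per disk; partition 0 is the whole disk.
def sd_device_name(major, minor):
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk = minor // 16       # 0 -> sda, 1 -> sdb, ...
    partition = minor % 16   # 0 -> whole disk, 1..15 -> partition number
    name = "sd" + chr(ord("a") + disk)
    return name if partition == 0 else f"{name}{partition}"

print(sd_device_name(8, 17))        # from "sd(8,17)"
print(sd_device_name(0x08, 0x11))   # from "sd08:11"
```

Both calls give sdb1, which is the partition I have been rebuilding with mke2fs.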

It sounds like a simple bad disk, especially with the references to 
hardware errors, but I have shut down, booted the Seagate SeaTools 
diagnostics CD, and run both the quick diagnostics and the complete 
surface scan of the disk (4 hours!), and both come back clean.  The 
vendor won't take the disk back unless SeaTools reports errors, and I 
tend to believe the tool, since I have used it before and it 
definitely catches things...

So my conclusion is that the problem is OS-related.  Last night I ran 
"mke2fs -c -j /dev/sdb1" to rebuild the filesystem, and I still get the 
same errors when trying to restore from tape.  Tonight I am going to 
run "mke2fs -c -c -j /dev/sdb1", which does a complete destructive 
read/write test as it rebuilds the filesystem.  I tried that last night 
but gave up at 2AM, just as it started the second pass with 0x55555555, 
after 2 hours of writing 0xaaaaaaaa and verifying it, seemingly without 
problems (though I'm not sure whether it reports problems as it 
encounters them or only at the end; I had to ^C out).  The disk I/O 
from the "-c -c" option slows the machine right down, and it's a main 
file server, so I can't do that during the day.

There are other disks on that SCSI chain, including another ST3146807LW 
in the same cabinet as the one that is having problems.  The only thing 
unique about this one is that it holds a single 140GB ext3 partition, 
whereas the other ST3146807LW is split into a ~90GB and a ~50GB 
partition.

Would anyone have any idea where to start looking for this problem?  
This machine is badly in need of patching, but because it is RH9 and not 
RHEL or Fedora, I'm not sure how to do that.  Any thoughts would be 
appreciated.

-Tom

-- 
_______________________________________________________________________
Tom Haws               Manager, Systems Administration
trh at timberline.ca      Timberline Forest Inventory Consultants
Tel: (250) 562-2628    1579 9th Ave, Prince George, B.C. Canada V2L 3R8
Fax: (250) 562-6942    http://www.timberline.ca
_______________________________________________________________________



