SCSI disk errors, but disk diagnostics say disk is OK

Thu Apr 14 18:04:37 UTC 2005

Tom Haws wrote:
> Hi all,
> 
> Sorry for posting here, but I have a RH9 machine running the 2.4.20-8 
> kernel that's having problems, and the shrike-list is just about dead, 
> so I thought I might have a better chance of getting responses here...
> 
> I have a brand new (3 months old) Seagate ST3146807LW 146GB U320 SCSI 
> drive that is having write problems:
> 
> SCSI disk error : host 0 channel 0 id 12 lun 0 return code = 8000002
> Info fld=0x7df6bbf, Deferred sd08:11: sense key Hardware Error
> Additional sense indicates Internal target failure
> 
> Eventually, the OS has enough problems, and it does this:
> 
> journal_bmap_Rsmp_e68c71a3: journal block not found at offset 6156 on 
> sd(8,17)
> Aborting journal on device sd(8,17).
> ext3_abort called.
> EXT3-fs abort (device sd(8,17)): ext3_journal_start: Detected aborted 
> journal
> Remounting filesystem read-only
> 
> It sounds like a simple bad disk, especially with the references to 
> hardware errors, but I have shut down and rebooted with the Seagate 
> SeaTools diagnostics CD, and run both the quick diagnostics and the 
> complete surface scan of the disk, and it comes back clean (4 hours!).  
> The vendor won't take the disk back unless the SeaTools reports errors, 
> and I tend to believe the tool, since I have used it before and it 
> definitely catches things...
> 
> So my conclusion is that the problem is OS-related.  I did an "mke2fs -c 
> -j /dev/sdb1" to rebuild the filesystem last night, and I still get the 
> same problems when trying to restore from tape.  I am going to run 
> "mke2fs -c -c -j /dev/sdb1" tonight on the disk, to get it to do a 
> complete destructive read/write test as it rebuilds the filesystem.  I 
> tried to do that last night, but gave up at 2AM as it started to do the 
> second pass with 0x55555555, after 2 hours of writing 0xaaaaaaaa and 
> verifying that seemingly without problems (though I'm not sure if it 
> reports problems as it encounters them or at the end; I had to ^C out).  
> The disk I/O with the "-c -c" option slows the machine right down, and 
> it's a main file server, so I can't do that during the day.
> 
> There are other disks on that SCSI chain, including another ST3146807LW 
> in the same cabinet as the one that is having problems.  The only thing 
> unique about this one is that it is a single 140GB ext3 partition, 
> whereas the other ST3146807LW is partitioned into a ~90GB and a ~50GB 
> partition.
> 
> Would anyone have any idea where to start looking for this problem?  
> This machine is badly in need of patching, but because it is RH9 and not 
> RHEL or Fedora, I'm not sure how to do that.  Any thoughts would be 
> appreciated.
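
For reference, the numbers in the quoted log can be unpacked by hand.
The sketch below assumes the "return code" is printed in hexadecimal and
follows the usual Linux SCSI result layout (driver, host, message and
status bytes, high to low), and that sd(8,17) is a (major, minor) pair
with 16 minors per disk.  The function names are made up for this
illustration; they are not kernel or library APIs.

```python
# Sketch: unpack a Linux SCSI result word and an sd (major, minor) pair.
# Assumption: the logged "return code" is hexadecimal and laid out as
# driver byte | host byte | message byte | status byte (high to low).

DRIVER_SENSE = 0x08      # driver byte flag: sense data is available
CHECK_CONDITION = 0x02   # SCSI status: target reports an error, see sense


def decode_result(result: int) -> dict:
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def sd_name(major: int, minor: int) -> str:
    """Map an sd (major, minor) pair to a device name, 16 minors per disk."""
    if major != 8:
        raise ValueError("only the first sd major (8) is handled here")
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    # Partition 0 is the whole disk, so it gets no numeric suffix.
    return name + (str(part) if part else "")


if __name__ == "__main__":
    print(decode_result(int("8000002", 16)))
    print(sd_name(8, 17))
```

Under those assumptions, 8000002 splits into a driver byte of 0x08
(sense data present) and a status byte of 0x02 (CHECK CONDITION), which
matches the sense lines that follow it in the log, and sd(8,17) maps to
sdb1, the partition named in the mke2fs commands.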

Whenever you have SCSI problems, the very first things to check are the
cabling and the termination.  SCSI only guarantees 3 m of cable length
(about 10 feet), and that total includes the cable inside the cabinet,
so a drive at the end of a much longer cable is very likely to have
issues.

As for the terminator, make sure that the terminators are enabled ONLY
on the units at the ENDS of the cable.

Controllers are typically at one end of a cable and should have the
terminator enabled.  The last drive on the cable should also have its
terminator enabled.  No other terminators should be enabled.  Multiple
terminators will cause lots of problems.

So, measure your SCSI cable.  Look at every drive on the cable and make
sure that ONLY the drive at the end has a terminator.
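
The two checks above can be written down as a small sketch.  The chain
description is typed in by hand after inspecting the hardware; nothing
here reads real devices, and the Device class, check_chain function and
the example lengths are hypothetical.  The limit is a parameter because
the permitted length depends on the bus type; 3 m is simply the figure
used above.

```python
# Sketch: sanity-check a hand-written description of one SCSI chain.
# All names, lengths and the default limit are illustrative inputs,
# not values read from hardware.
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    name: str
    terminated: bool
    cable_to_next_m: float = 0.0  # cable from this device to the next


def check_chain(chain: List[Device], max_length_m: float = 3.0) -> List[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    # Total length counts every segment, including cable inside the cabinet.
    total = sum(d.cable_to_next_m for d in chain[:-1])
    if total > max_length_m:
        problems.append(f"cable is {total:.1f} m, limit is {max_length_m:.1f} m")
    # Only the first and last device on the cable may be terminated.
    for i, dev in enumerate(chain):
        at_end = i in (0, len(chain) - 1)
        if at_end and not dev.terminated:
            problems.append(f"{dev.name}: end of chain but not terminated")
        if not at_end and dev.terminated:
            problems.append(f"{dev.name}: terminated but not at an end")
    return problems


if __name__ == "__main__":
    chain = [
        Device("controller", terminated=True, cable_to_next_m=1.0),
        Device("id 11", terminated=True, cable_to_next_m=0.5),
        Device("id 12", terminated=False),
    ]
    for problem in check_chain(chain):
        print(problem)
```

In this made-up chain, the drive in the middle is flagged for having its
terminator enabled and the last drive is flagged for lacking one.
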
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer     rstevens at vitalstream.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-      To err is human, to forgive, beyond the scope of the OS       -
----------------------------------------------------------------------