[rhelv6-list] RAID/SCSI error combined with core dump of rrdtool

Thu Feb 5 14:49:42 UTC 2015

Greetings,

I have a server with a four disk RAID5 set.  A few days ago two disks went 
offline at the same time.  The ext4 filesystem went read-only and although 
you could still SSH in, the system was totally hosed.  It certainly wouldn't
shut down via the command line.

I powered the box off, and via the RAID BIOS brought both disks back online 
manually.  I then initiated a consistency check.  About 75% of the way 
through one of the two disks screamed (literally) and quit for good.  The 
other disk remained online and hasn't been a problem since (but I didn't 
trust it).  The filesystem was a mess come fsck time.  Many things ended up 
in /lost+found, clusters had to be cloned during clean-up, orphaned inodes
were deleted, etc.  It wasn't pretty.  I managed to move all but three of
the files that landed in /lost+found back where they were supposed to be.
As far as I can see, the three remaining files don't appear to be important
to the operation of the server.  As and when I discover what they're for
I'll put them back in their original locations, but they seem to be related
to SSL/CA certificates.

The hosted MySQL DB was also totally hosed (InnoDB table corruption).  Not
only that, but the MySQL software itself was damaged too.  I reinstalled it and
recreated the DB, and now the software which relies on it (Observium, 
http://www.observium.org/) is operating normally again.  I also got Nessus 
(http://www.tenable.com/) working again (with help from Tenable Support) 
after many of its files went away.

However, I now have the following issue.  There appears to be a SCSI fault
such that every time /usr/bin/rrdtool (RH-supplied) runs it core dumps:

Feb  5 14:27:28 mole2 kernel: sd 0:2:0:0: [sda]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Feb  5 14:27:28 mole2 kernel: sd 0:2:0:0: [sda] CDB: Read(10): 28 00 0b 33 d7 e0 00 00 08 00
Feb  5 14:27:28 mole2 kernel: end_request: I/O error, dev sda, sector 187946976
Feb  5 14:27:28 mole2 abrt[5346]: Saved core dump of pid 4491 (/usr/bin/rrdtool) to /var/spool/abrt/ccpp-2015-02-05-14:27:28-4491 (761856 bytes)
Feb  5 14:27:28 mole2 abrtd: Directory 'ccpp-2015-02-05-14:27:28-4491' creation detected
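
As a cross-check on the log above, the CDB bytes can be decoded to confirm
that the failing Read(10) targets the same block the kernel reports.  The
sketch below is in Python; the function name decode_read10 is made up for
illustration, and it assumes the standard READ(10) layout (opcode 0x28,
big-endian 4-byte LBA in bytes 2-5, 2-byte transfer length in bytes 7-8).

```python
from typing import Tuple


def decode_read10(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, block_count) from a READ(10) CDB given as spaced hex bytes.

    decode_read10 is a hypothetical helper, not part of any library.
    """
    cdb = bytes.fromhex(cdb_hex)
    # READ(10) is exactly 10 bytes long and starts with opcode 0x28.
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# CDB copied from the kernel log line above.
lba, count = decode_read10("28 00 0b 33 d7 e0 00 00 08 00")
print(lba, count)  # 187946976 8
```

For the CDB in the log this gives LBA 187946976 and a length of 8 blocks,
which matches the sector in the end_request line.  With 512-byte logical
sectors (an assumption, the log doesn't state the sector size) that is a
single 4 KiB read.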

It's always the same sector of the virtual disk (the one the RAID hardware
presents to the OS).  I've replaced both of the underlying broken/suspect
physical disks, but this error refuses to go away.  I've also reinstalled 
the rrdtool software in the hope that this would place it on another part of 
the disk.  The only thing rrdtool runs on/over is the data collected by 
Observium.  I've been through all of the graphs it generates and deleted and 
recreated any RRDs that were producing errors rather than graphs (on the 
assumption that they were corrupt files), and I'm still getting the SCSI 
errors and core dumps.
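
To back up the claim that it's always the same sector, the I/O error lines
can be tallied per device and sector.  A minimal Python sketch follows; the
function name tally_failed_sectors is hypothetical, and the regular
expression assumes the exact "end_request: I/O error, dev ..., sector ..."
wording shown in the log excerpt.

```python
import re
from collections import Counter

# Matches the kernel's end_request error line as it appears in the excerpt.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def tally_failed_sectors(lines):
    """Count I/O errors per (device, sector) across an iterable of log lines.

    tally_failed_sectors is a hypothetical helper written for this example.
    """
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two lines copied from the log excerpt; only the second is an I/O error line.
sample = [
    "Feb  5 14:27:28 mole2 kernel: sd 0:2:0:0: [sda] CDB: Read(10): 28 00 0b 33 d7 e0 00 00 08 00",
    "Feb  5 14:27:28 mole2 kernel: end_request: I/O error, dev sda, sector 187946976",
]
for (dev, sector), n in tally_failed_sectors(sample).items():
    print(dev, sector, n)
```

To run it over a real log, pass an open file object instead of the sample
list, for example /var/log/messages if that is where kernel messages end up
on the box in question (an assumption, it depends on the syslog
configuration).  A single (device, sector) key in the result would confirm
that every error hits the same place.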

What should I try next?  Eventually I imagine I will have to reinstall the 
OS, but I'd rather not just yet.  Does anyone have any suggestions?

Chassis: Dell PowerEdge 610
OS: RHEL6.6 fully patched, kernel: 2.6.32-504.8.1.el6.x86_64
RAID: PERC 6/i Integrated, F/W: 6.3.3.0002, Driver: 06.803.01.00-rh1
rrdtool: rrdtool-perl-1.3.8-7.el6.x86_64
         rrdtool-1.3.8-7.el6.x86_64
         rrdtool-php-1.3.8-7.el6.x86_64
         rrdtool-devel-1.3.8-7.el6.x86_64

With thanks,

Ben
-- 
Unix Support, UIS, University of Cambridge, England