[rhelv6-list] RAID/SCSI error combined with core dump of rrdtool
Ben
bda20 at cam.ac.uk
Thu Feb 5 14:49:42 UTC 2015
Greetings,
I have a server with a four disk RAID5 set. A few days ago two disks went
offline at the same time. The ext4 filesystem went read-only and although
you could still SSH in, the system was totally hosed. It certainly wouldn't
shut down via the command line.
I powered the box off, and via the RAID BIOS brought both disks back online
manually. I then initiated a consistency check. About 75% of the way
through one of the two disks screamed (literally) and quit for good. The
other disk remained online and hasn't been a problem since (but I didn't
trust it). The filesystem was a mess come fsck time. Many things ended up
in /lost+found, clusters had to be cloned during cleanup, orphaned inodes
were deleted, etc. It wasn't pretty. I managed to put all but three of the
files that landed in /lost+found back where they were supposed to be.
As far as I can see, the three remaining files don't appear to be important
to the operation of the server. As and when I discover what they're for,
I'll put them back in their original locations, but they seem to be to do
with SSL/CA certificates.
The hosted MySQL DB was also totally hosed (InnoDB table corruption). Not
only that, but the MySQL software itself was damaged too. I reinstalled that and
recreated the DB, and now the software which relies on it (Observium,
http://www.observium.org/) is operating normally again. I also got Nessus
(http://www.tenable.com/) working again (with help from Tenable Support)
after many of its files went away.
However, I now have the following issue. There appears to be a SCSI fault
such that every time /usr/bin/rrdtool (RH-supplied) runs it core dumps:
Feb 5 14:27:28 mole2 kernel: sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Feb 5 14:27:28 mole2 kernel: sd 0:2:0:0: [sda] CDB: Read(10): 28 00 0b 33 d7 e0 00 00 08 00
Feb 5 14:27:28 mole2 kernel: end_request: I/O error, dev sda, sector 187946976
Feb 5 14:27:28 mole2 abrt[5346]: Saved core dump of pid 4491 (/usr/bin/rrdtool) to /var/spool/abrt/ccpp-2015-02-05-14:27:28-4491 (761856 bytes)
Feb 5 14:27:28 mole2 abrtd: Directory 'ccpp-2015-02-05-14:27:28-4491' creation detected
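As a cross-check on the log above, the Read(10) CDB can be decoded by hand to confirm that the failing read targets the same block the kernel reports. This is only a minimal Python sketch: the function name decode_read10 is made up for illustration, and it relies on the standard Read(10) layout (opcode 0x28, bytes 2-5 the big-endian logical block address, bytes 7-8 the big-endian transfer length).

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB given as space-separated hex bytes.

    Returns (logical block address, transfer length in blocks).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


# CDB copied from the kernel log line above.
lba, blocks = decode_read10("28 00 0b 33 d7 e0 00 00 08 00")
print(lba, blocks)  # 187946976 8
```

The decoded address matches the sector in the end_request line. The transfer length of 8 blocks would be 4 KiB if the logical block size is 512 bytes, which is an assumption; the log does not state it.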
It's always the same sector of the (virtual, presented by the RAID hardware
to the OS) disk. I've replaced both of the underlying broken/suspect
physical disks, but this error refuses to go away. I've also reinstalled
the rrdtool software in the hope that this would place it on another part of
the disk. The only thing rrdtool runs on/over is the data collected by
Observium. I've been through all of the graphs it generates and deleted and
recreated any RRDs that were producing errors rather than graphs (on the
assumption that they were corrupt files), and I'm still getting the SCSI
errors and core dumps.
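The "always the same sector" observation can be checked mechanically by tallying the I/O error lines in the system log. The following is a rough Python sketch, not a tested tool: the function name count_failing_sectors is hypothetical, and the regular expression assumes the kernel message format shown in the log excerpt above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   end_request: I/O error, dev sda, sector 187946976
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def count_failing_sectors(lines):
    """Count I/O errors per (device, sector) across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Sample input taken from the log excerpt in this post.
sample = [
    "Feb 5 14:27:28 mole2 kernel: sd 0:2:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK",
    "Feb 5 14:27:28 mole2 kernel: end_request: I/O error, dev sda, sector 187946976",
]
print(count_failing_sectors(sample))
```

On a real system the same function could be fed an open log file, for example /var/log/messages, assuming kernel messages are written there. A single (device, sector) key in the result would confirm that every error hits the same block.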
What should I try next? Eventually I imagine I will have to reinstall the
OS, but I'd rather not just yet. Does anyone have any suggestions?
Chassis: Dell PowerEdge 610
OS: RHEL6.6 fully patched, kernel: 2.6.32-504.8.1.el6.x86_64
RAID: PERC 6/i Integrated, F/W: 6.3.3.0002, Driver: 06.803.01.00-rh1
rrdtool: rrdtool-perl-1.3.8-7.el6.x86_64
         rrdtool-1.3.8-7.el6.x86_64
         rrdtool-php-1.3.8-7.el6.x86_64
         rrdtool-devel-1.3.8-7.el6.x86_64
With thanks,
Ben
--
Unix Support, UIS, University of Cambridge, England