RAID drive failed, but SMART shows no errors?
Sam Varshavchik
mrsam at courier-mta.com
Tue Mar 13 03:18:59 UTC 2007
One of my FC6 machines just claimed that one of two RAID-1 SCSI drives had
an error:
Mar 12 21:44:33 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 21:44:33 headache kernel: sda: Current: sense key: Hardware Error
Mar 12 21:44:33 headache kernel: Additional sense: Defect list error
Mar 12 21:44:33 headache kernel: end_request: I/O error, dev sda, sector 143363856
Mar 12 21:44:33 headache kernel: md: super_written gets error=-5, uptodate=0
Mar 12 21:44:33 headache kernel: raid1: Disk failure on sda3, disabling device.
Mar 12 21:44:33 headache kernel: Operation continuing on 1 devices
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel: --- wd:1 rd:2
Mar 12 21:44:33 headache kernel: disk 0, wo:1, o:0, dev:sda3
Mar 12 21:44:33 headache kernel: disk 1, wo:0, o:1, dev:sdb3
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel: --- wd:1 rd:2
Mar 12 21:44:33 headache kernel: disk 1, wo:0, o:1, dev:sdb3
I have two SCSI drives off an Adaptec AIC-7902B U320 (rev 10) controller.
But smartctl gives this drive a clean bill of health:
[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
SMART Health Status: OK
I have three RAID-1 partitions on these disks. The one that reported an
error was the largest one. I dropped the degraded partition, and hot-added
it back. Immediately, another error was logged to /var/log/messages, for
the same block, but despite the error, the kernel started resyncing the
array:
Mar 12 22:37:33 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 22:37:41 headache kernel: sda: Current: sense key: Medium Error
Mar 12 22:37:41 headache kernel: Additional sense: Unrecovered read error
Mar 12 22:37:41 headache kernel: Info fld=0x88b8f16
Mar 12 22:37:41 headache kernel: end_request: I/O error, dev sda, sector 143363862
Mar 12 22:37:41 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: md: bind<sda3>
Mar 12 22:37:42 headache kernel: RAID1 conf printout:
Mar 12 22:37:42 headache kernel: --- wd:1 rd:2
Mar 12 22:37:42 headache kernel: disk 0, wo:1, o:1, dev:sda3
Mar 12 22:37:42 headache kernel: disk 1, wo:0, o:1, dev:sdb3
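For anyone reading the hex values in these log lines: as far as I understand the Linux SCSI layer, the "return code" word packs four bytes (driver, host, message, status), and the "Info fld" is simply the failing sector in hexadecimal. A minimal sketch using plain shell arithmetic; the function names are made up for illustration:

```shell
#!/bin/sh
# Split a Linux SCSI result word into its four bytes:
# driver byte (bits 24-31), host byte (16-23), message byte (8-15), status byte (0-7).
decode_scsi_result() {
    rc=$(( $1 ))
    printf 'driver=0x%02x host=0x%02x msg=0x%02x status=0x%02x\n' \
        $(( (rc >> 24) & 0xff )) $(( (rc >> 16) & 0xff )) \
        $(( (rc >> 8) & 0xff ))  $(( rc & 0xff ))
}

# Convert a hexadecimal "Info fld" value to a decimal sector number.
info_fld_to_sector() {
    printf '%d\n' "$(( $1 ))"
}

decode_scsi_result 0x08000002   # prints driver=0x08 host=0x00 msg=0x00 status=0x02
info_fld_to_sector 0x88b8f16    # prints 143363862
```

Driver byte 0x08 should mean that sense data is available, and status 0x02 is CHECK CONDITION, which is why the kernel prints the sense key next. The decoded Info fld matches the sector in the end_request line of the second error.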
Despite the second error, the resync of the failed partition completed
successfully.
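For reference, the drop and hot-add described above would look roughly like this with mdadm. This is only a sketch: /dev/md2 is an assumed array name (check /proc/mdstat for the real one), and the run wrapper is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Hypothetical names: /dev/md2 is assumed; /dev/sda3 is the member from the log.
MD_DEV=${MD_DEV:-/dev/md2}
MEMBER=${MEMBER:-/dev/sda3}

# Print each command by default; set DRY_RUN=0 to actually run it (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

readd_member() {
    run mdadm "$MD_DEV" --remove "$MEMBER"   # drop the member already marked faulty
    run mdadm "$MD_DEV" --add "$MEMBER"      # hot-add it back; md then starts a resync
    run cat /proc/mdstat                     # shows the state of the resync
}

readd_member
```

Running it as is only echoes the three commands, so it is safe to try before touching the array.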
smartctl -a shows 80000+ read errors corrected by ECC/fast, no rereads,
and 6 rewrites. My knowledge of SMART is limited. The other drive in this
array shows 50000+ read errors corrected by ECC/fast, no rereads and no
rewrites.
So, are the 6 rewrites on this drive an indication of a looming failure?
My second question is that the two drives are in a hot-swappable bay, and
connected to the Adaptec AIC-7902B U320 controller. Hardware-wise, the
drives are hot-swappable, but what about software-wise? If I take this
drive entirely off RAID-1, cut the power to the hot-swap bay, pull the drive
out, replace it, plug it back in, and reenable power, will the FC6 kernel be
able to deal with this?
If I cannot do this, my third question is what do I need to do, grub-wise,
to be able to swap sdb with sda? sda is the one that's failing the RAID-1
array. If I can't hot-swap it, I'll need to replace it with the sdb drive,
but right now grub is installed only on sda, so how do I install a copy of
all the grub boot-related stuff on sdb?