RAID drive failed, but SMART shows no errors?

Tue Mar 13 08:06:25 UTC 2007

Sam Varshavchik wrote:
...
> But smartctl gives this drive a clean bill of health:
> 
> [root at headache ~]# smartctl -H /dev/sda
> smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce 
> Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> SMART Health Status: OK

Try running a SMART test on the drive:

smartctl -t long /dev/sda

It will tell you how long the test takes to run;
you'll have to check once in a while with

smartctl -a /dev/sda

to get the result of the test. It will be at the end:

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   12641                 - [-   -    -]
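
If you don't want to check by hand, the polling can be scripted. This is
only a rough sketch: it assumes that a running test shows up in the
smartctl output with the words "in progress", which may differ between
smartctl versions and between SCSI and ATA drives, so look at your own
output first. The function names and the 10 minute interval are made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if the smartctl output read from stdin
# still mentions a self-test "in progress".
# ASSUMPTION: the exact status wording; verify it against your own drive.
selftest_running() {
    grep -qi 'in progress'
}

# Hypothetical helper: start a long self-test, then poll until it is done.
# Usage: wait_for_selftest /dev/sda [seconds-between-polls]
wait_for_selftest() {
    dev=$1
    interval=${2:-600}
    # smartctl prints the estimated duration when the test is started.
    smartctl -t long "$dev"
    while smartctl -a "$dev" | selftest_running; do
        sleep "$interval"
    done
    # The self-test log is printed near the end of the report.
    smartctl -a "$dev" | tail -n 8
}
```

Run it as root with "wait_for_selftest /dev/sda"; the last lines printed
should contain the self-test log entry for the test that just finished.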

> 
> I have three RAID-1 partitions on these disks.  The one that reported an 
> error was the largest one.  I dropped the degraded partition, and 
> hot-added it back.  Immediately, another error was logged to 
> /var/log/messages, for the same block, but despite the error, the kernel 
> started resyncing the array:
...

If it were me, I would replace this disk. The next time you
run into this read error could be when sdb fails and you try
to resync a new sdb :-(

...
> My second question is that the two drives are in a 
> hot-swappable bay, and connected to the Adaptec AIC-7902B U320 
> controller.  Hardware-wise, the drives are hot-swappable, but what about 
> software-wise?  If I take this drive entirely off RAID-1, cut the power 
> to the hot-swap bay, pull the drive out, replace it, plug in back in, 
> and reenable power, will the FC6 kernel be able to deal with this?

On my system (with 8 146G hotswap SCSI drives on a dual
channel Adaptec AHA-3960D / AIC-7899A), I would:

0. Keep a window open with a "tail -10f /var/log/messages"
1. Take all partitions from the failing drive out of the array with mdadm
2. Remove the drive from the kernel:

echo "scsi remove-single-device 0 0 0 0" >/proc/scsi/scsi

The four zeros are: controller#, channel, SCSI id, LUN.
Try "cat /proc/scsi/scsi" to see these numbers. If
you make a mistake and remove the wrong drive, you'll
have a problem...

3. Physically remove the drive
4. Insert a new drive
5. Tell the kernel that a new drive exists:

echo "scsi add-single-device 0 0 0 0" >/proc/scsi/scsi

This can take a while; the drive has to spin up.

6. Partition the drive
7. Add the partitions with mdadm, follow the sync in /proc/mdstat
8. After the sync, run grub to reinstall the boot loader

We have done this several times when drives have failed.
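
Steps 1 to 7 above can be sketched as a small script that only prints the
commands, so you can read them over before typing anything. Everything
passed in is a placeholder for your own setup: the device names, the md
arrays, and the "controller channel id lun" address. Copying the partition
table from the good drive with sfdisk is my assumption for step 6 (it only
makes sense if both drives are the same size), and the script also assumes
the new drive comes back under the same device name. Step 8 (the boot
loader) is not covered.

```shell
#!/bin/sh
# Hypothetical helper: PRINT (never run) the commands for swapping one drive.
# Usage: print_plan BAD_DRIVE GOOD_DRIVE "CTRL CHAN ID LUN" MD:PARTNUM...
print_plan() {
    bad=$1      # drive being replaced, e.g. /dev/sda
    good=$2     # surviving mirror, e.g. /dev/sdb
    addr=$3     # SCSI address as shown by "cat /proc/scsi/scsi"
    shift 3

    # Step 1: fail and remove every member partition on the bad drive.
    for pair in "$@"; do
        md=${pair%%:*}
        n=${pair##*:}
        echo "mdadm $md --fail $bad$n --remove $bad$n"
    done

    # Steps 2-5: detach the drive, swap it, attach the new one.
    echo "echo \"scsi remove-single-device $addr\" > /proc/scsi/scsi"
    echo "# ... swap the physical drive now ..."
    echo "echo \"scsi add-single-device $addr\" > /proc/scsi/scsi"

    # Step 6 (ASSUMPTION): clone the partition table from the good drive.
    echo "sfdisk -d $good | sfdisk $bad"

    # Step 7: add the new partitions back and watch the resync.
    for pair in "$@"; do
        md=${pair%%:*}
        n=${pair##*:}
        echo "mdadm $md --add $bad$n"
    done
    echo "cat /proc/mdstat"
}
```

For example, "print_plan /dev/sda /dev/sdb "0 0 0 0" /dev/md0:1 /dev/md1:2"
prints the plan for a box where sda1 belongs to md0 and sda2 to md1.
Double-check the SCSI address before running the remove line.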

> 
> If I cannot do this, my third question is what do I need to do, 
> grub-wise, to be able to swap sdb with sda?  sda is the one that's 
> failing the RAID-1 array.  If I can't hot-swap it, I'll need to replace 
> it with the sdb drive, but right now grub is installed only on sda, so 
> how do I install a copy of all the grub boot-related stuff on sdb?

Hm? If you used the GUI to create the RAID partitions during
installation, GRUB should be on both drives.
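
If it turns out not to be, the grub shell can put it there. A sketch,
assuming the GRUB that comes with FC6 (0.9x, the old "legacy" command set)
and assuming /boot is the first partition on sdb; adjust the partition
index (counted from 0) if it is not. The function name is made up. Mapping
sdb to (hd0) is deliberate: once sda is gone, sdb is the drive the BIOS
boots from.

```shell
#!/bin/sh
# Hypothetical helper: print the GRUB legacy shell commands that install
# the boot loader on one drive.
# Usage: grub_setup_cmds DRIVE [BOOT_PARTITION_INDEX]
# ASSUMPTION: index 0 means the first partition (e.g. /dev/sdb1) holds /boot.
grub_setup_cmds() {
    dev=$1
    part=${2:-0}
    cat <<EOF
device (hd0) $dev
root (hd0,$part)
setup (hd0)
quit
EOF
}
```

Look at the output of "grub_setup_cmds /dev/sdb" first, then either type
the lines into an interactive grub shell or pipe them in with
"grub_setup_cmds /dev/sdb | grub --batch".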

Mogens

-- 
Mogens Kjaer, Carlsberg A/S, Computer Department
Gamle Carlsberg Vej 10, DK-2500 Valby, Denmark
Phone: +45 33 27 53 25, Fax: +45 33 27 47 08
Email: mk at crc.dk Homepage: http://www.crc.dk