Replacing a failed RAID (boot) disk

Wed Jan 18 23:54:12 UTC 2006

Hi everybody,

I got this log output a few days ago:
Jan 11 15:34:24 webserv1 kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Jan 11 15:34:24 webserv1 kernel: ata1: error=0x10 { SectorIdNotFound }
Jan 11 15:34:29 webserv1 kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Jan 11 15:34:29 webserv1 kernel: ata1: error=0x10 { SectorIdNotFound }
Jan 11 15:34:59 webserv1 kernel: ata1: command 0xc8 timeout, stat 0x51 host_stat 0x61
Jan 11 15:34:59 webserv1 kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Jan 11 15:34:59 webserv1 kernel: ata1: error=0x10 { SectorIdNotFound }
Jan 11 15:34:59 webserv1 kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Jan 11 15:34:59 webserv1 kernel: sda: Current: sense key: Aborted Command
Jan 11 15:34:59 webserv1 kernel:     Additional sense: Recorded entity not found
Jan 11 15:34:59 webserv1 kernel: end_request: I/O error, dev sda, sector 11217554
Jan 11 15:34:59 webserv1 kernel: raid1: Disk failure on sda3, disabling device.
Jan 11 15:34:59 webserv1 kernel:        Operation continuing on 1 devices
Jan 11 15:34:59 webserv1 kernel: raid1: sda3: rescheduling sector 6815744
Jan 11 15:34:59 webserv1 kernel: raid1: sdb2: redirecting sector 6815744 to another mirror
Jan 11 15:34:59 webserv1 kernel: RAID1 conf printout:
Jan 11 15:34:59 webserv1 kernel:  --- wd:1 rd:2
Jan 11 15:34:59 webserv1 kernel:  disk 0, wo:1, o:0, dev:sda3
Jan 11 15:34:59 webserv1 kernel:  disk 1, wo:0, o:1, dev:sdb2
Jan 11 15:34:59 webserv1 kernel: RAID1 conf printout:
Jan 11 15:34:59 webserv1 kernel:  --- wd:1 rd:2
Jan 11 15:34:59 webserv1 kernel:  disk 1, wo:0, o:1, dev:sdb2

This is on a server with a non-mirrored /boot on sda1 and a software RAID1 mirror (sda3 and sdb2) for the / partition.

Dell says the hard disk needs to be replaced, and I now have the replacement.
The problem is that the failed disk is the one I boot from, and the boot partition is not mirrored.
So I cannot copy the contents of the boot partition, nor read the fdisk information needed to partition the new disk the same way as
the old one.
What is the easiest way to get the system up and running again?

I have a second machine with an identical setup, so I guess I could get the info from that box.

I am thinking I need to:
1. Plug the new disk in and boot from the rescue CD.
2. Look up the partition info on the identical box and partition the new disk accordingly.
3. Copy the contents of the boot partition over from the identical box.
4. Install GRUB on sda (how?).
5. Boot the machine with the replaced disk and hope that mdadm automatically starts syncing the RAID from the good disk (sdb).
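
In case the resync in step 5 does not start on its own, my understanding is that the new partition can be added to the array by hand. A minimal sketch, assuming the array is /dev/md0 and the new partition ends up as /dev/sda3 again (I have not verified either name). The helper name degraded_arrays is made up for illustration; it only reads /proc/mdstat-style text and lists arrays that are missing a member.

```shell
#!/bin/sh
# List md arrays that are running with a missing member.
# Reads /proc/mdstat-style text on stdin; a "_" in the [UU] status
# field marks a missing or failed member.
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }'
}

# Usage on the live system:
#   degraded_arrays < /proc/mdstat
# Then, assuming the array is md0 and the new partition is sda3
# (both names are assumptions, check them first):
#   mdadm /dev/md0 --add /dev/sda3
#   cat /proc/mdstat        # shows resync progress
```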

The problem is mainly step 4: I am not sure what I picked as the boot loader location on the "Advanced Boot Loader Configuration"
screen (MBR vs. first sector of the boot partition).
So I need to figure out
 a) what the location was, and
 b) how to get the boot loader installed there manually (I've always just used the automated install for the boot loader).
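
For (a), one thing I could try on the identical box is to look at the first sector of the disk and of the boot partition: as far as I know, GRUB legacy's stage1 contains the literal string "GRUB", so the sector that has it should be where the boot loader lives. For (b), the GRUB shell commands root and setup look like the manual route. A rough sketch, assuming GRUB legacy, that /boot is the first partition of the first BIOS disk, i.e. (hd0,0), and that the loader went to the MBR; has_grub_stage1 is a name I made up.

```shell
#!/bin/sh
# Return success if the first 512-byte sector of a device or image
# file contains the string "GRUB" (present in GRUB legacy's stage1).
has_grub_stage1() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -aq GRUB
}

# On the identical box (as root):
#   has_grub_stage1 /dev/sda  && echo "boot loader is in the MBR"
#   has_grub_stage1 /dev/sda1 && echo "boot loader is in the boot partition"
#
# Reinstalling to the MBR from the rescue environment, assuming
# /boot is (hd0,0); these are typed at the "grub>" prompt:
#   grub
#   grub> root (hd0,0)
#   grub> setup (hd0)
#   grub> quit
```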

Is my assumption about steps 1-5 correct?
Does anybody have any hints regarding how to do step 4?

And then for the future: how can I be better prepared for this next time? Is there a way to capture the partition and boot loader
information (at a point before the disk actually goes bad) and then restore it to an identical drive in a more automated fashion?
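
One approach I have seen suggested is to dump the partition table with sfdisk -d, save the first sector with dd, and archive /boot while the disk is still healthy. A minimal sketch; it prints the commands instead of running them so they can be reviewed first. The function name and the /root/disk-backup directory are made up for illustration, and the backup would of course have to be copied to a different disk or machine.

```shell
#!/bin/sh
# Print (not run) the commands that save a disk's partition table,
# first sector, and /boot contents. Usage: print_layout_backup DISK DIR
print_layout_backup() {
    disk=$1
    dir=$2
    name=$(basename "$disk")
    echo "mkdir -p $dir"
    # sfdisk -d writes a dump that sfdisk can read back on stdin
    echo "sfdisk -d $disk > $dir/$name.sfdisk"
    # first sector = boot code (446 bytes) + partition table + signature
    echo "dd if=$disk of=$dir/$name.mbr bs=512 count=1"
    echo "tar -czf $dir/boot.tar.gz -C / boot"
}

print_layout_backup /dev/sda /root/disk-backup

# Restoring onto an identical replacement drive would then be roughly:
#   sfdisk /dev/sda < /root/disk-backup/sda.sfdisk
#   dd if=/root/disk-backup/sda.mbr of=/dev/sda bs=446 count=1
#   (then mkfs on the boot partition and unpack boot.tar.gz into it)
```

I suspect the saved first sector alone is not enough for GRUB, since the later stages live outside it, so rerunning the GRUB setup after restoring /boot is probably still needed.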

Thanks,

MARK