input/output errors

Benjamin Hornberger bho at gmx.net
Wed Nov 24 20:12:58 UTC 2004


Thanks for your help so far! Unfortunately I'm still a little lost... see 
below.

At 11:36 AM 11/23/2004 -0800, Rick Stevens wrote:
>Benjamin Hornberger wrote:
>>Hi all,
>>we have a machine (RHEL AS 3) on which last night suddenly a lot of input/
>>output errors occurred. Basically, two partitions are unusable.
>>Both of these partitions are RAID 1 devices which share the same two IDE
>>hard disks (/dev/hda and /dev/hdc are two 250 GB drives, and /dev/md0 and
>>/dev/md1 are two RAID 1 devices which take 70 and 180 GB from each drive,
>>respectively).
>>Any hints? I looked into fsck, but I am not sure what is the right thing to
>>do.
>
>Ben,
>
>fsck is a tool that will (hopefully) fix filesystem inconsistencies.
>You should boot up in single user mode and run fsck against the two
>filesystems that have issues.  Note that you may lose some data when
>you do that.  Data that can't be reattached to their files will end up
>in the "lost+found" directory of the filesystem being fsck'd and be given
>filenames that refer to their inode number.  You may be able to rebuild
>the file by looking at those files, but it's a tedious, error-fraught
>process.

What can I do with these files? I can't cat or more or tail them. Some of them
seem to be directories (starting with a "d" in ls -l), but when I try to cd
into them, I end up at /root.
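
(Would something like this be the right approach for inspecting them? The
mount point is just an example from my setup, and the #NNNNN names are
whatever fsck assigned from the inode numbers:

# cd /mnt/md1/lost+found
# ls -li                 # list the recovered entries with their inode numbers
# file './#12345'        # let file(1) guess the content type of one entry

I realize file(1) can only guess at what the data used to be.)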


>Since you set up RAID 1, you should first split the RAIDs into two disks
>and see if either disk has clean versions of the data.  If so, you may
>be able to purge the bad drive and recreate the RAID.

In the meantime I had done an fsck -cy already on /dev/md0 and /dev/md1.
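
(Concretely, what I ran was roughly the following, with the filesystems
unmounted -- as I understand it, -c makes e2fsck scan for bad blocks using
badblocks(8), and -y answers all of its repair prompts with yes:

# umount /dev/md0 /dev/md1
# fsck -cy /dev/md0
# fsck -cy /dev/md1

In hindsight, maybe -y was too aggressive?)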

If I mount the partitions by themselves (/dev/hda1,2 and /dev/hdc1,2 rather
than /dev/md0,1), it looks like /dev/hda1,2 are missing data compared to
/dev/hdc1,2. But from what I list below, it seems clear that /dev/hdc has
problems. Did fsck remove (corrupted) data from /dev/hda1,2?
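
(In case it matters: for this comparison, is mounting the component
partitions read-only the safe way to look at them, so nothing gets written to
either half of the mirror? Something like (mount points are just examples):

# mount -o ro /dev/hda1 /mnt/hda1
# mount -o ro /dev/hdc1 /mnt/hdc1

As I understand it, this works because the persistent RAID superblock sits at
the end of the partition, so each half still looks like a plain filesystem.)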


>The most important thing to figure out is why you started getting I/O
>errors in the first place.  Is one of the drives dying?  Did you have
>a power glitch?  Did a RAM stick start acting weird?  You must fix the
>underlying issue or you're just going to get a repeat of this event.

I am trying to figure that out. The machine is connected to a UPS, so no power
glitch. How can I check my RAM?
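
(For the RAM, the suggestion I keep finding is memtest86, booted from its own
floppy or CD image and left to run several complete passes, e.g. writing the
boot floppy with something like

# dd if=memtest86.img of=/dev/fd0

where the image name is whatever the memtest86 download is called. Is that
the standard way, or is there something that can run on the live system?)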

What is the best way to check the hard drives (besides fsck -c)? Following the
Software RAID How-to, I did the following:

# cat /var/log/messages | grep hda
[tons of blocks like:]
kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hda: dma_intr: error=0x40 { UncorrectableError }, ...
kernel: end_request: I/O error, dev 03:01 (hda), sector ...
kernel: raid1: hda1: rescheduling block ...
kernel: raid1: hda1: unrecoverable I/O read error for block ...
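
(Would smartctl from the smartmontools package be the right tool to check the
drives directly? Assuming it's installed, something like:

# smartctl -a /dev/hda       # dump the drive's SMART health data and error log
# smartctl -t long /dev/hda  # start the drive's extended offline self-test

The UncorrectableError lines above look to me like hda itself has unreadable
sectors.)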

# cat /var/log/messages | grep hdc
...
kernel: md: kicking non-fresh hdc2 from array!
...
kernel: md: kicking non-fresh hdc2 from array!
...
kernel: md: md1 already running, cannot run hdc2
...
kernel: md: md0 already running, cannot run hdc1

# more /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 2
md1: active raid1 hda2[0]
         173429632 blocks [2/1] [U_]

md0: active raid1 hda1[0]
          71581920 blocks [2/1] [U_]

unused devices: <none>
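
(If I read the notation right, [2/1] means two devices are configured in the
array but only one is running, and [U_] means the first mirror half (hda) is
up while the second (hdc) is missing.)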

# lsraid -a /dev/md0
[dev   9,   0] /dev/md0 (...cryptic numbers...) online
[dev   3,   1] /dev/hda1 (... cryptic numbers...) good
[dev   ?,   ?] (unknown) (zeroes) missing

same for /dev/md1

# mdadm --detail /dev/md0
...
Raid Devices: 2
Total Devices: 1
...
State: dirty, no-errors
Active devices: 1
Working devices: 1
Failed devices: 0
Spare devices: 0

    Number   Major   Minor   RaidDevice   State
       0       3       1        0         active sync   /dev/hda1
       1       0       0        1         faulty removed
...


same for /dev/md1


I don't really understand what's going on. Part of it looks to me as if
/dev/hda has a problem, and the greater part of it looks as if /dev/hdc has a
problem.

So if I pop in a replacement drive for /dev/hdc and do raidhotadd (is that the
way to go?), you think the RAID devices might be reconstructed completely? But
why did I get I/O errors in the first place then -- isn't RAID supposed to
avoid that? I mean, I thought that even if one disk fails, the RAID array
should still work OK, and I would just have to replace the broken drive.
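
(My tentative plan, pieced together from the How-to, would be: power down,
swap in the new drive as /dev/hdc, copy the partition table from the good
drive, then hot-add the partitions back so the kernel resyncs them from the
surviving half:

# sfdisk -d /dev/hda | sfdisk /dev/hdc   # duplicate hda's partition table
# raidhotadd /dev/md0 /dev/hdc1
# raidhotadd /dev/md1 /dev/hdc2
# cat /proc/mdstat                       # watch the reconstruction progress

or, with mdadm instead of the old raidtools:

# mdadm /dev/md0 --add /dev/hdc1
# mdadm /dev/md1 --add /dev/hdc2

Does that look right?)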

Thanks for any input. Is there more online documentation for software RAID?
Can anybody recommend a system administration book which is good for the
part-time research group admin?

Thanks for your help,
Benjamin