input/output errors

Benjamin Hornberger bho at gmx.net
Wed Nov 24 22:01:20 UTC 2004


At 01:41 PM 11/24/2004 -0800, you wrote:
>Benjamin Hornberger wrote:
>>Thanks for your help so far! Unfortunately I'm still a little lost... see 
>>below.
>>At 11:36 AM 11/23/2004 -0800, Rick Stevens wrote:
>>
>>>Benjamin Hornberger wrote:
>>>
>>>>Hi all,
>>>>we have a machine (RHEL AS 3) on which a lot of input/output errors
>>>>suddenly occurred last night. Basically, two partitions are unusable.
>>>>Both of these partitions are RAID 1 devices which share the same two IDE
>>>>hard disks (/dev/hda and /dev/hdc are two 250 GB drives, and /dev/md0 and
>>>>/dev/md1 are two RAID 1 devices which take 70 and 180 GB from each drive,
>>>>respectively).
>>>>Any hints? I looked into fsck, but I am not sure what is the right thing to
>>>>do.
>>>
>>>
>>>Ben,
>>>
>>>fsck is a tool that will (hopefully) fix filesystem inconsistencies.
>>>You should boot up in single user mode and run fsck against the two
>>>filesystems that have issues.  Note that you may lose some data when
>>>you do that.  Data that can't be reattached to their files will end up
>>>in the "lost+found" directory of the filesystem being fsck'd and given
>>>filenames that refer to their inode number.  You may be able to rebuild
>>>the file by looking at those files, but it's a tedious, error-fraught
>>>process.
>>
>>What can I do with these files? I can't cat or more or tail them. Some of 
>>them
>>seem to be directories (starting with a "d" on ls -l), but when I try to 
>>cd into
>>them, I end up at /root.
>
>That's the danger of them.  If they're directories, they don't have any
>parents anymore and their "back link" will probably take you back to /
>or your home directory.  You'd need to "ls" them to see which files are
>contained in them--you may then sort out where they belong.
>
>As far as the regular files are concerned, you need to look at their 
>contents to see if maybe you can concatenate them together to
>reconstruct the original file.  As I said before, it's tedious and very
>error-prone.
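(Sketching that inspection for the archives: fsck names recovered entries
after their inode number, so the name below is made up, and a stand-in file
is created so nothing real is touched. stat(1) tells files from directories
without cd'ing into them, and file(1) guesses a regular file's content type.)

```shell
# fsck names recovered entries after their inode number, e.g. "#49153".
# stat(1) shows the entry type; file(1) guesses the content of regular
# files. The entry here is a hypothetical stand-in, not real recovered data.
entry='#49153'
printf 'contents of a recovered file\n' > "$entry"
stat -c '%F' "$entry"    # prints the entry type, e.g. "regular file"
file "$entry"            # guesses the content, e.g. "ASCII text"
rm -f "$entry"
```

For an entry that stat reports as a directory, `ls -la '#49153'` lists what
it contains, which is usually the best clue to where it belonged.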
>
>>>Since you set up RAID 1, you should first split the RAIDs into two disks
>>>and see if either disk has clean versions of the data.  If so, you may
>>>be able to purge the bad drive and recreate the RAID.
>>
>>In the meantime I had done an fsck -cy already on /dev/md0 and /dev/md1.
>
>Uh, ok.
>
>>If I mount the partitions by themselves (/dev/hda1,2 and /dev/hdc1,2 
>>rather than
>>/dev/md0,1), it looks like /dev/hda1,2 are missing data compared to 
>>/dev/hdc1,2.
>>But from what I list below, it seems clear that /dev/hdc has problems. 
>>Did fsck
>>remove (corrupted) data from /dev/hda1,2?
>
>If you did the fsck on md0 and md1 before splitting the RAID1, yes, it's
>very possible.
>
>>>The most important thing to figure out is why you started getting I/O
>>>errors in the first place.  Is one of the drives dying?  Did you have
>>>a power glitch?  Did a RAM stick start acting weird?  You must fix the
>>>underlying issue or you're just going to get a repeat of this event.
>>
>>I am trying to figure that out. The machine is connected to a UPS, so no 
>>power
>>glitch. How can I check my RAM?
>
>You can run memtest86 on it.  If you are running Fedora Core, boot the
>first CD and at the "boot:" prompt, enter "memtest86".  If not, you can
>download a floppy image of it from "http://www.memtest86.com", put it
>on a floppy and boot that.  You can also get a couple of CDs that I keep
>handy:
>
>The Ultimate Boot CD
>     http://www.ultimatebootcd.com
>
>RIP (Recovery Is Possible)
>     http://www.tux.org/pub/people/kent-robotti/looplinux/rip/
>
>They're both bootable and have lots of diagnostics and such on them.  I
>keep current copies in my laptop case at all times--just in case I have
>to bail out a buddy.
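(For the archive: dd(1) is the usual way to get the downloaded image onto a
floppy. The image filename below is assumed; since dd just copies blocks, the
identical invocation works on ordinary files, which is how it is shown here
so it can be tried without a floppy drive.)

```shell
# Real usage would be (filename assumed):
#   dd if=memtest86.img of=/dev/fd0 bs=512
# dd only copies blocks, so the same command works against plain files:
printf 'stand-in for the boot image' > memtest86.img
dd if=memtest86.img of=floppy.out bs=512 2>/dev/null
cmp -s memtest86.img floppy.out && echo "copy verified"
```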
>
>>What is the best way to check the hard drives (besides fsck -c)? 
>>Following the
>>Software RAID How-to, I did the following:
>># cat /var/log/messages | grep hda
>>[tons of blocks like:]
>>kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>>kernel: hda: dma_intr: error=0x40 { UncorrectableError }, ...
>>kernel: end_request: I/O error, dev 03:01 (hda), sector ...
>>kernel: raid1: hda1: rescheduling block ...
>>kernel: raid1: hda1: unrecoverable I/O read error for block ...
>># cat /var/log/messages | grep hdc
>>...
>>kernel: md: kicking non-fresh hdc2 from array!
>>...
>>kernel: md: kicking non-fresh hdc2 from array!
>>...
>>kernel: md: md1 already running, cannot run hdc2
>>...
>>kernel: md: md0 already running, cannot run hdc1
>># more /proc/mdstat
>>Personalities : [raid1]
>>read_ahead 1024 sectors
>>Event: 2
>>md1: active raid1 hda2[0]
>>         173429632 blocks [2/1] [U_]
>>md0: active raid1 hda1[0]
>>          71581920 blocks [2/1] [U_]
>>unused devices: <none>
>># lsraid -a /dev/md0
>>[dev   9,   0] /dev/md0 (...cryptic numbers...) online
>>[dev   3,   1] /dev/hda1 (... cryptic numbers...) good
>>[dev   ?,   ?] (unknown) (zeroes) missing
>>same for /dev/md1
>># mdadm --detail /dev/md0
>>...
>>Raid Devices: 2
>>Total Devices: 1
>>...
>>State: dirty, no-errors
>>Active devices: 1
>>Working devices: 1
>>Failed devices: 0
>>Spare devices: 0
>>Number Major Minor RaidDevice State
>>    0           3        1        0       active sync /dev/hda1
>>    1           0        0        1       faulty removed
>>...
>>
>>same for /dev/md1
>>
>>I don't really understand what's going on. Part of it looks to me as if 
>>/dev/hda has
>>a problem, (the greater) part of it looks to me as if /dev/hdc has a problem.
>>So if I pop in a replacement drive for /dev/hdc and do raidhotadd (is 
>>that the
>>way to go?), you think the RAID device might be reconstructed completely?
>>But why did I get I/O errors in the first place then -- isn't RAID supposed
>>to avoid
>>that? I mean, I thought even if one disk fails, the RAID array should 
>>still work
>>ok, and I just have to replace the broken drive??
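(Reading the /proc/mdstat output above: "[2/1] [U_]" means two mirrors are
configured but only one is active; "U" is an up member and "_" marks the
missing one, which matches hdc being kicked from the array. A degraded array
can be spotted mechanically; the sample line is copied from the output above:)

```shell
# "[2/1] [U_]": 2 configured members, 1 active; "_" marks a missing mirror.
line='173429632 blocks [2/1] [U_]'
case "$line" in
    *_*) echo "array degraded" ;;
    *)   echo "array healthy"  ;;
esac
```

After replacing the bad disk and partitioning it identically, the hot-add the
HOWTO describes would be `raidhotadd /dev/md0 /dev/hdc1` (or, with mdadm,
`mdadm /dev/md0 --add /dev/hdc1`), then watching the resync progress in
/proc/mdstat; the device names here are the ones from this thread.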
>
>At this point, you may very well be sunk.  Had you run the fsck on the
>drives as individuals, you may have had a chance.  Once you ran it on
>the RAID volumes, all bets are off.
>
>This is the inherent danger in using software RAID--you're depending on
>the computer to be healthy to keep the RAID going.  If the computer is
>healthy and one of the drives fails, the system will keep running.  If,
>however, the computer gets sick (and this seems to be what happened),
>the RAID is compromised.  Who knows what evil things it did?
>
>This is why I NEVER recommend software RAID.  If you must have
>redundancy or high-availability, spend the extra $200 or so and use
>hardware RAID.  It really is cheap insurance (as you have unfortunately
>found out).

I actually tried hardware RAID, but I couldn't get RHEL AS 3 to recognize 
the Promise FastTrak TX 2000 RAID controller. Now that one is collecting 
dust on a shelf.

So say I want to wipe the RAID devices and set them up from scratch. They 
only hold home and data partitions, and I have a backup (which hopefully 
didn't get corrupted), so it shouldn't be too much work. How do I really 
make sure my hard drives are OK? If I run fsck with a bad-block check and 
it comes back clean, can I trust them?
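(The mechanics of that check can be rehearsed safely before touching the real
disks. This sketch assumes e2fsprogs and builds a small scratch ext2 image;
on a real drive the equivalent would be `fsck -c /dev/hda1` from single-user
mode, where -c runs badblocks(8) over the device and records what it finds.)

```shell
# Build a small ext2 image and force a full check of it. On a real disk
# this would be e.g. "fsck -c /dev/hda1", never a file.
dd if=/dev/zero of=scratch.img bs=1024 count=4096 2>/dev/null
mke2fs -F -q scratch.img
e2fsck -f -n scratch.img     # -f: force check, -n: open read-only
echo "exit status: $?"       # 0 means the filesystem is clean
rm -f scratch.img
```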

Say I do the memtest -- what else is there to check the system? Should I run 
fsck on all partitions? Also, does fsck leave a report anywhere? In the man 
pages I read that the exit code tells me something. How do I read the exit 
code?
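(For the record, fsck's exit status is a bit mask, documented in fsck(8):
1 = errors corrected, 2 = system should be rebooted, 4 = errors left
uncorrected, 8 = operational error, 16 = usage error, 128 = shared-library
error. In the shell it is read from $? immediately after the command. A small
decoder, as a sketch:)

```shell
# Decode fsck's bit-mask exit status (values from fsck(8)).
# Real usage would be:  fsck /dev/md0; decode_fsck $?
decode_fsck() {
    code=$1
    [ "$code" -eq 0 ] && echo "0: no errors"
    [ $((code & 1))   -ne 0 ] && echo "1: filesystem errors corrected"
    [ $((code & 2))   -ne 0 ] && echo "2: system should be rebooted"
    [ $((code & 4))   -ne 0 ] && echo "4: errors left uncorrected"
    [ $((code & 8))   -ne 0 ] && echo "8: operational error"
    [ $((code & 16))  -ne 0 ] && echo "16: usage or syntax error"
    [ $((code & 128)) -ne 0 ] && echo "128: shared-library error"
    return 0
}
decode_fsck 5    # prints the "1: ..." and "4: ..." lines
```

fsck itself only writes its summary to the terminal, not to a log file; to
keep a report, redirect it, e.g. `fsck /dev/md0 2>&1 | tee fsck.log`.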

Thanks,
Benjamin





More information about the Redhat-install-list mailing list