cvs update

Matthew Galgoci mgalgoci at redhat.com
Mon Nov 20 19:41:08 UTC 2006


Some time on Friday, cvs-int.fedora.phx.redhat.com sustained undetermined
storage problems and resulting filesystem corruption. As best I can figure,
we had one drive in a raid6 array drop offline, and another disk in that
array emit SCSI errors. Now, you're probably thinking, this is raid6, it
should have been able to sustain losing two disks and keep on going.

Well, you're right and you're wrong. If two disks had simply dropped out of
the array, we'd be fine. That wasn't the case, however. Somewhere in the
equation is data corruption. RAID is great right up until your hardware
corrupts the data. The evidence for this: we sustained numerous ext3
errors, the journal aborted, and the root fs went read-only.
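
To illustrate the difference between a disk dropping out and a disk
returning bad data, here is a minimal Python sketch. It is a simplification:
it uses a single XOR parity block (RAID 5 style) rather than the dual parity
that raid6 actually uses, and the block contents and helper names
(xor_blocks, rebuild) are made up for illustration. The point it tries to
show is only that parity reconstruction trusts whatever the surviving disks
hand back.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(stripe, missing):
    """Rebuild the block at index `missing` from every other block in the stripe."""
    return xor_blocks([b for i, b in enumerate(stripe) if i != missing])

# Three hypothetical data blocks plus one parity block form a stripe.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

# Case 1: disk 1 cleanly drops out. The survivors are intact, so the
# missing block is reconstructed exactly.
clean_rebuild = rebuild(stripe, 1)

# Case 2: disk 1 drops out AND disk 0 silently returns a flipped byte.
# Reconstruction still "succeeds", but the result is wrong and nothing
# at this layer raises an error.
corrupted = list(stripe)
corrupted[0] = b"AAAZ"
bad_rebuild = rebuild(corrupted, 1)

print(clean_rebuild == data[1])   # reconstruction matches the original
print(bad_rebuild == data[1])     # reconstruction does not match
```

In the second case the array layer has no way to tell the rebuilt block is
wrong; the damage only surfaces higher up, for example as filesystem errors.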

I did my level best to revive the system on Friday and Saturday. I was able
to get it pxe booted onto rescue media, which helped recovery immensely. I
took numerous screenshots to chronicle what I went through as I attempted
to recover the raid6 arrays and the logical volumes.

http://people.redhat.com/~mgalgoci/cvs-int.jpg
http://people.redhat.com/~mgalgoci/cvs-int2.jpg
http://people.redhat.com/~mgalgoci/cvs-int3.jpg
http://people.redhat.com/~mgalgoci/cvs-int4.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs5.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs6.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs8.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs9.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs10.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs11.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs12.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs13.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs14.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs15.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs18.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs17.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs16.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs19.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs20.jpg

After #20, I said the hell with it, time to move on.

We've installed one of the new Dell 2950 machines that Dell was kind enough
to donate to the Fedora Project. Mike McGrath is in the process of updatifying
and restorifying the data from backups.

I have a Dell tech coming on site again today to do some more work on the
old new cvs-int server. I think we know what the issues are on it and we'll
have it usable again in the next day or so.

In the meantime, I think we need to take a look at all the Dell Fedora boxes
and check the SCSI drives in them. There are known issues with certain drive
firmware that cause drives to go offline and report spurious errors.
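
As a starting point for that check, here is a rough Python sketch that lists
the vendor, model, and firmware revision of each sd* disk as Linux exposes
them under /sys/block/<dev>/device/. The function name scsi_drive_inventory
is hypothetical, and this is an assumption-laden sketch: drives hidden behind
a hardware RAID controller may not show up this way, and the output still has
to be compared by hand against the affected firmware listed in the Dell
update.

```python
from pathlib import Path

def scsi_drive_inventory(sys_block="/sys/block"):
    """Return (device, vendor, model, firmware_rev) for each sd* disk found.

    Reads the vendor, model, and rev attribute files that the Linux SCSI
    layer exposes in sysfs; attributes that are absent come back as "".
    """
    rows = []
    for dev in sorted(Path(sys_block).glob("sd*")):
        attrs = dev / "device"

        def read(name, attrs=attrs):
            path = attrs / name
            return path.read_text().strip() if path.is_file() else ""

        rows.append((dev.name, read("vendor"), read("model"), read("rev")))
    return rows

if __name__ == "__main__":
    for name, vendor, model, rev in scsi_drive_inventory():
        print(f"{name}: {vendor} {model} firmware {rev}")
```

Running it on each box and collecting the output in one place would at least
tell us which machines need the downtime.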

The relevant Dell update is here:

http://support.us.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R123859&formatcnt=1&libid=0&fileid=164751

We'll need downtime and hands on site to do this update. I'm sure Stacy will
be able to assist.

-- 
Matthew Galgoci
GIS Production Operations
Red Hat, Inc
919.754.3700 x44155



