[Linux-cluster] Can't get past Segmentation Faults: Solved

Wed Jan 24 19:28:08 UTC 2007

Sorry, talking to myself in case it ever helps someone :). 

I guess I am surprised that some of the heavy techs on here didn't ask me a 
few questions that would have quickly led to the answer "Oh yeah, your 
partitions are trashed" :).

The reason for the storage errors in my earlier post today now makes sense... 
the storage is still connected but is being reformatted, so it's not ready 
for use.

What started this?

A UPS decided to pop its breakers, taking out one of the power supplies on a 
number of my storage devices. The storage devices all have dual power 
supplies, but both power supplies on the Brocade switches were accidentally 
connected to that same UPS, so both of them failed on the switches.

Somehow, all of the storage was trashed when this happened. Why??? All of the 
partitions on the FC network were corrupted. 
I'm very confused about it but certainly willing to answer any questions 
anyone has, so that my problems might turn into future solutions.

Once I figured that out, I had to get to work on it. The easiest fix in my 
mind, since I'd have to reformat everything anyhow, was to clean it all up 
and start over from scratch.

I logged into one node. 
I tried to clean things up but got errors from some of the other nodes. 
I turned off clvmd on all of the other nodes then got back to it.
I then logged into every node and did the same there, nuking everything in 
each /etc/lvm directory, because otherwise there would end up being 
conflicts, and there were.
Once everything was clean, I nuked the messed up partitions.
I then pvcreated, vgcreated and lvcreated a new device.
I am now making the GFS filesystems and have recovered two of them so far.
Sad thing is that it takes days to reformat, so I will have learned a 
valuable lesson about storage, yet again. Multipath is also next on my list.
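
For anyone following along, here is a rough dry-run sketch of the sequence 
above. It only prints the commands, it runs nothing. Every argument (device, 
volume group, logical volume, size, cluster name, filesystem name, journal 
count) is a made-up placeholder, not something from my setup. Two things in 
it are assumptions rather than a record of what I did: that only the 
archive and backup copies under /etc/lvm get removed, and that lock_dlm is 
the lock protocol. Clustered-VG options are left out on purpose.

```shell
#!/bin/sh
# Dry-run sketch: prints the rebuild steps, executes nothing.
# All arguments are hypothetical placeholders, not values from the post.
print_rebuild_plan() {
    dev=$1; vg=$2; lv=$3; size=$4; cluster=$5; fsname=$6; journals=$7

    echo "# on every other node:"
    echo "service clvmd stop"
    # Assumption: only stale metadata copies are removed, not lvm.conf.
    echo "# on every node:"
    echo "rm -rf /etc/lvm/archive/* /etc/lvm/backup/*"
    echo "# on one node, after removing the damaged partitions:"
    echo "pvcreate ${dev}"
    echo "vgcreate ${vg} ${dev}"
    echo "lvcreate -L ${size} -n ${lv} ${vg}"
    # Assumption: lock_dlm as the lock protocol.
    echo "gfs_mkfs -p lock_dlm -t ${cluster}:${fsname} -j ${journals} /dev/${vg}/${lv}"
}

print_rebuild_plan /dev/sdX vg_example lv_example 100G mycluster gfs_example 3
```

Read the printed plan over before typing any of it by hand; every one of 
these steps is destructive on real storage.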

So... why did this happen?

1: The partitions were trashed on EVERY device connected to the FC network. 
Was it GFS or was it the FC network that trashed the storage? I would go with 
the FC network unless someone has other ideas.

2: After that happened, the nodes and the storage all got out of sync. 
The only fix has been to nuke ALL of the clvmd information so that I could 
rebuild. Before that, I was not able to get anything done because nodes would 
be out of sync.

I wish there were some cluster-wide command to clear information from certain 
services, especially node-wide services.

Anyhow, hope this helps someone and puts my huge thread to rest.

Mike