[Linux-cluster] Instability troubles

James Chamberlain jamesc at exa.com
Wed Jan 2 22:35:23 UTC 2008


Hi all,

I'm having some major stability problems with my three-node CS/GFS cluster. 
Every two or three days, one of the nodes fences another, and I have to 
hard-reboot the entire cluster to recover.  I have had this happen twice 
today.  I don't know what's triggering the fencing, since all the nodes 
appear to me to be up and running when it happens.  In fact, I was logged 
on to node3 just now, running 'top', when node2 fenced it.

When they come up, the nodes don't automatically mount their GFS 
filesystems, even with "_netdev" specified as a mount option; however, the 
node that comes up first mounts them all as part of bringing the services 
up.
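
In case it's relevant, my fstab entries look roughly like this (the device 
and mount point below are placeholders, not my real paths):

    # /etc/fstab -- GFS filesystem; device and mount point are placeholders
    /dev/cluster_vg/data_lv  /mnt/data  gfs  _netdev,defaults  0 0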

I did notice a couple of disconcerting things earlier today.  First, I was 
running "watch clustat".  (I prefer to see the time updating, which I 
can't do with "clustat -i".)  At one point, "clustat" crashed as follows:

Jan  2 15:19:54 node2 kernel: clustat[17720]: segfault at 0000000000000024 
rip 0000003629e75bc0 rsp 00007fff18827178 error 4
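
For what it's worth, the polling setup was just this (the two-second 
interval is arbitrary):

    # refresh cluster status every 2 seconds; watch prints the current
    # time in its header line
    watch -n 2 clustat

    # the built-in refresh, which doesn't display the time
    clustat -i 2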

Fairly shortly thereafter, clustat reported node3 as "Online, Estranged, 
rgmanager".  Can anyone shed light on what that means?  Google's not 
telling me much.

At the moment, all three nodes are running CentOS 5.1, with kernel 
2.6.18-53.1.4.el5.  Can anyone point me in the right direction to resolve 
these problems?  I wasn't having trouble like this when I was running a 
CentOS 4 CS/GFS cluster.  Is it possible to downgrade from CentOS 5 CS/GFS 
to 4, presumably via a full rebuild of all the nodes?  Should I instead 
consider setting up a single node to mount the GFS filesystems and serve 
them out, to get around these fencing issues?
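
For completeness, this is how I pulled the version details above (package 
names assumed from the stock CentOS 5 cluster suite):

    # kernel version
    uname -r

    # cluster suite and GFS package versions
    rpm -q cman rgmanager gfs-utils kmod-gfs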

Thanks,

James



