[Linux-cluster] Instability troubles

Lon Hohberger lhh at redhat.com
Thu Jan 3 15:38:40 UTC 2008


On Wed, 2008-01-02 at 17:35 -0500, James Chamberlain wrote:
> Hi all,
> 
> I'm having some major stability problems with my three-node CS/GFS cluster. 
> Every two or three days, one of the nodes fences another, and I have to 
> hard-reboot the entire cluster to recover.  I have had this happen twice 
> today.  I don't know what's triggering the fencing, since all the nodes 
> appear to me to be up and running when it happens.  In fact, I was logged 
> on to node3 just now, running 'top', when node2 fenced it.
> 
> When they come up, they don't automatically mount their GFS filesystems, 
> even with "_netdev" specified as a mount option; however, the node which 
> comes up first mounts them all as part of bringing all the services up.
> 
> I did notice a couple of disconcerting things earlier today.  First, I was 
> running "watch clustat".  (I prefer to see the time updating, where I 
> can't with "clustat -i")

The time is displayed in the RHEL5 CVS version, and will go out with 5.2.


>   At one point, "clustat" crashed as follows:
> 
> Jan  2 15:19:54 node2 kernel: clustat[17720]: segfault at 0000000000000024 
> rip 0000003629e75bc0 rsp 00007fff18827178 error 4

A clustat crash is not a cause for a fence operation.  That is, this
might be related, but is definitely not the cause of a node being
evicted.


> Fairly shortly thereafter, clustat reported node3 as "Online, 
> Estranged, rgmanager".  Can anyone shed light on what that means? 
> Google's not telling me much.

Ordinarily, this happens when you have a node join the cluster manually
w/o giving it the configuration file.  CMAN would assign it a node ID -
but the node is not in the cluster configuration - so clustat would
display the node as 'Estranged'.

In your case, I'm not sure what the problem would be.


> At the moment, all three nodes are running CentOS 5.1, with kernel 
> 2.6.18-53.1.4.el5.  Can anyone point me in the right direction to resolve 
> these problems?  I wasn't having trouble like this when I was running a 
> CentOS 4 CS/GFS cluster.  Is it possible to downgrade, likely via a full 
> rebuild of all the nodes, from CentOS 5 CS/GFS to 4?  Should I instead 
> consider setting up a single node to mount the GFS filesystems and serve 
> them out, to get around these fencing issues?

I'd be interested in a core file.  Try to reproduce your clustat crash with
'ulimit -c unlimited' set before running clustat.  I haven't seen
clustat crash in a very long time, so I'm interested in the cause.
(Also, after the crash, check to see if ccsd is running...)

Maybe it will uncover some other hints as to the cause of the behavior
you saw.
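
Something along these lines should do it (just a sketch; the core file
name and location depend on your kernel.core_pattern setting, and the
gdb step assumes gdb is installed):

   # allow core dumps in this shell, then reproduce the crash
   ulimit -c unlimited
   clustat

   # after the segfault, a core / core.<pid> file should appear in the
   # current directory; also check whether ccsd is still alive
   ls core*
   pidof ccsd || echo "ccsd is not running"

   # a backtrace from the core is the most useful part
   gdb "$(which clustat)" core*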

If ccsd indeed failed for some reason, it would cause fencing to fail as
well because the fence daemon would be unable to read fencing actions.

Even given all of this, it doesn't explain why the node needed to be
fenced in the first place.  Were there any log messages indicating the
reason for the fence?
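
If the logs are still around, it's worth grepping syslog on all three
nodes for the minutes leading up to the fence.  This is just a sketch,
assuming the default syslog target on CentOS 5 (/var/log/messages):

   # look for cluster-related messages around the time of the fence
   grep -iE 'fence|openais|cman|ccsd|rgmanager' /var/log/messages

Token loss / membership changes logged by openais right before the
fence are usually the best clue.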

The RHEL5 / CentOS5 release of Cluster Suite has a fairly aggressive
node death timeout (5 seconds); maybe increasing it would help.

<cluster ...>
   <cman .../>
   <totem token="21000"/> <!-- add this -->
   ...
</cluster>
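
(The totem token value is in milliseconds, so token="21000" is a
21-second timeout.  That goes in /etc/cluster/cluster.conf; bump
config_version when you change it and make sure all three nodes end up
with the same file.)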

-- Lon



