[Linux-cluster] Instability troubles

James Chamberlain jamesc at exa.com
Wed Jan 9 22:16:11 UTC 2008


On Thu, 3 Jan 2008, Lon Hohberger wrote:

> On Wed, 2008-01-02 at 17:35 -0500, James Chamberlain wrote:
>> Hi all,
>>
>> I'm having some major stability problems with my three-node CS/GFS cluster.
>> Every two or three days, one of the nodes fences another, and I have to
>> hard-reboot the entire cluster to recover.  I have had this happen twice
>> today.  I don't know what's triggering the fencing, since all the nodes
>> appear to me to be up and running when it happens.  In fact, I was logged
>> on to node3 just now, running 'top', when node2 fenced it.
>>
>> When they come up, they don't automatically mount their GFS filesystems,
>> even with "_netdev" specified as a mount option; however, the node which
>> comes up first mounts them all as part of bringing all the services up.
>>
>> I did notice a couple of disconcerting things earlier today.  First, I was
>> running "watch clustat".  (I prefer to see the time updating, where I
>> can't with "clustat -i")
>
> The time is displayed in RHEL5 CVS version, and will go out with 5.2.
>
>
>>   At one point, "clustat" crashed as follows:
>>
>> Jan  2 15:19:54 node2 kernel: clustat[17720]: segfault at 0000000000000024
>> rip 0000003629e75bc0 rsp 00007fff18827178 error 4
>
> A clustat crash is not a cause for a fence operation.  That is, this
> might be related, but is definitely not the cause of a node being
> evicted.
>
>
>> Fairly shortly thereafter, clustat reported node3 as "Online,
>> Estranged, rgmanager".  Can anyone shed light on what that means?
>> Google's not telling me much.
>
> Ordinarily, this happens when you have a node join the cluster manually
> w/o giving it the configuration file.  CMAN would assign it a node ID -
> but the node is not in the cluster configuration - so clustat would
> display the node as 'Estranged'.
>
> In your case, I'm not sure what the problem would be.

I have a theory (see below).  Does it give you any ideas what might have 
happened here?

>> At the moment, all three nodes are running CentOS 5.1, with kernel
>> 2.6.18-53.1.4.el5.  Can anyone point me in the right direction to resolve
>> these problems?  I wasn't having trouble like this when I was running a
>> CentOS 4 CS/GFS cluster.  Is it possible to downgrade, likely via a full
>> rebuild of all the nodes, from CentOS 5 CS/GFS to 4?  Should I instead
>> consider setting up a single node to mount the GFS filesystems and serve
>> them out, to get around these fencing issues?
>
> I'd be interested in a core file.  Try to reproduce your clustat crash with
> 'ulimit -c unlimited' set before running clustat.  I haven't seen
> clustat crash in a very long time, so I'm interested in the cause.
> (Also, after the crash, check to see if ccsd is running...)

I'll see what I can do for you.
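
In case it helps, my plan is roughly the following (just a sketch; the exact
name and location of the core file will depend on the kernel.core_pattern
setting on these boxes):

    # enable core dumps in this shell, then try to reproduce the crash
    ulimit -c unlimited
    clustat

    # afterwards, check whether ccsd is still running
    ps -C ccsd

    # look for the core file in the current directory (name may vary)
    ls -l core*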

> Maybe it will uncover some other hints as to the cause of the behavior
> you saw.
>
> If ccsd indeed failed for some reason, it would cause fencing to fail as
> well because the fence daemon would be unable to read fencing actions.
>
> Even given all of this, it doesn't explain why the node needed to be
> fenced in the first place.  Were there any log messages indicating why
> the node needed to be fenced?
>
> The RHEL5 / CentOS5 release of Cluster Suite has a fairly aggressive
> node death timeout (5 seconds); maybe increasing it would help.
>
> <cluster ...>
>   <cman .../>
>   <totem token="21000"/> <!-- add this -->
>   ...
> </cluster>

I've come up with a theory on what's been going on, and so far, that theory 
appears to be panning out.  At the very least, I haven't had any further 
crashes (yet).  I'm hoping someone can validate it or tell me I need to 
keep looking.

On each of the three nodes in my cluster, eth0 is used for cluster services 
(NFS) and the cluster's multicast group, and eth1 is used for iSCSI.  I 
noticed that two of the three nodes were using DHCP on eth0, and that the 
problems always seemed to happen when the cluster was under a heavy load. 
My DHCP server was configured to give these nodes the same address every 
time, so they essentially had static addresses - they just used DHCP to get 
them.  I also noticed that a DHCP renewal appeared to be happening at or
just before each fencing event.  My theory is that, under heavy load, this
DHCP renewal process was somehow interfering with either the primary IP
address on eth0 or with the cluster's multicast traffic, and that this was
causing the affected node(s) to get booted from the cluster.  I have since
switched all the nodes to truly static addressing, and have not had a
problem in the intervening week.  I have not yet tried the
<totem token="21000"/> trick that Lon mentioned, but I'm keeping that handy
should problems crop up again.
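
For the record, the change on each node was just switching eth0 from DHCP to
a static definition, roughly like this (the addresses below are placeholders,
not our real ones):

    # /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=192.168.1.11
    NETMASK=255.255.255.0
    GATEWAY=192.168.1.1

followed by a "service network restart" (or a reboot) to pick up the new
configuration.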

Thanks,

James
