[Linux-cluster] Nodes leaving and re-joining intermittently

Digimer linux at alteeve.com
Sat Dec 10 20:55:38 UTC 2011


On 12/10/2011 03:32 PM, Matthew Painter wrote:
> Hi all,
> 
> We are trying to get to the bottom of some odd intermittent behavior on
> a cluster. We are intermittently seeing nodes leave and rejoin the
> cluster without being fenced. Further, the gap between leaving and
> re-joining is 8 minutes. We are monitoring the latency between boxes,
> and it is acceptable (<5ms).
> 
> How can nodes exhibit this behavior? There seems to be no impact on the
> services running on the box, just this leaving and re-joining. The SNMP
> messages are below.
> 
> All help decoding this gratefully received! :)
> 
> Thanks,
> 
> Matt
> 
> 
> Sat Dec 10 15:22:00 GMT 2011: cluster3.localdomain
> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:2:52:23.35,
> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"
> 
> Sat Dec 10 15:30:25 GMT 2011: cluster3.localdomain
> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:3:00:48.75,
> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "joined"

My first instinct is to point to multicast issues in your switch, but
then I'd expect the node to get fenced. Any unexpected disconnect
should trigger a fence, so it looks like the node is cleanly stopping
and restarting corosync.
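
For reference, the two traps pin the gap down more precisely than
"8 minutes". A minimal Python sketch, using only the values quoted
above (the function names are mine, purely for illustration, and I am
assuming the sysUpTime field reads days:hours:minutes:seconds):

```python
from datetime import datetime, timedelta

# Values copied from the "left" and "joined" traps quoted above.
LEFT_WALL = "Sat Dec 10 15:22:00 2011"
JOINED_WALL = "Sat Dec 10 15:30:25 2011"
LEFT_UPTIME = "3:2:52:23.35"
JOINED_UPTIME = "3:3:00:48.75"

# "GMT" is dropped from the stamps; both traps use the same zone.
WALL_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_uptime(text):
    """Convert a 'days:hours:minutes:seconds' string to a timedelta.

    Assumption: the sysUpTimeInstance value uses that field order.
    """
    days, hours, minutes, seconds = text.split(":")
    return timedelta(days=int(days), hours=int(hours),
                     minutes=int(minutes), seconds=float(seconds))


def wall_gap(left, joined):
    """Gap between the two wall-clock stamps logged with the traps."""
    return (datetime.strptime(joined, WALL_FORMAT)
            - datetime.strptime(left, WALL_FORMAT))


def uptime_gap(left, joined):
    """Gap between the two sysUpTime values carried in the traps."""
    return parse_uptime(joined) - parse_uptime(left)


if __name__ == "__main__":
    print("wall-clock gap:", wall_gap(LEFT_WALL, JOINED_WALL))
    print("sysUpTime gap: ", uptime_gap(LEFT_UPTIME, JOINED_UPTIME))
```

Both figures work out to roughly 8 minutes 25 seconds, so the two
clocks agree with each other.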

Can you share your configuration and, ideally, anything in syslog from
all involved nodes starting from just before the disconnect and
continuing through to after the node rejoins?
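
If it helps, here is a rough Python sketch for cutting that window out
of a log on each node. It assumes classic syslog lines that start with
a stamp like "Dec 10 15:22:00"; the log path, the five-minute margin
and the function name are my own assumptions, so adjust to taste:

```python
import sys
from datetime import datetime, timedelta

# Window taken from the traps: "left" at 15:22:00, "joined" at 15:30:25.
# The five-minute margin on each side is an arbitrary choice.
# Note: the traps are stamped in GMT, while syslog normally uses local
# time, so shift START and END if the nodes are not running on GMT.
YEAR = 2011  # classic syslog stamps carry no year
MARGIN = timedelta(minutes=5)
START = datetime(2011, 12, 10, 15, 22, 0) - MARGIN
END = datetime(2011, 12, 10, 15, 30, 25) + MARGIN


def in_window(line, start=START, end=END, year=YEAR):
    """Return True if the line's leading syslog stamp is inside the window."""
    try:
        # The first 15 characters hold "Mon DD HH:MM:SS".
        stamp = datetime.strptime(f"{year} {line[:15]}",
                                  "%Y %b %d %H:%M:%S")
    except ValueError:
        return False  # continuation line or some other format
    return start <= stamp <= end


if __name__ == "__main__":
    # Example (path is an assumption):
    #   python window.py < /var/log/messages > window.txt
    for line in sys.stdin:
        if in_window(line):
            sys.stdout.write(line)
```

Running it on every node and sending the results keeps the logs short
while still covering the lead-up to the disconnect and the rejoin.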

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron



