[Linux-cluster] Fencing of node
Neale Ferguson
neale at sinenomine.net
Thu Oct 2 19:30:03 UTC 2014
After creating a simple two-node cluster, one node is being fenced continually. I'm running pacemaker (1.1.10-29) on two nodes with the following corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: rh7cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: rh7cn1.devlab.sinenomine.net
        nodeid: 1
    }

    node {
        ring0_addr: rh7cn2.devlab.sinenomine.net
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_syslog: yes
}
Starting the cluster shows:
Oct 2 15:17:47 rh7cn1 kernel: dlm: connect from non cluster node
This appears in the logs of both nodes. Both nodes then try to bring up their resources (dlm, clvmd, and a cluster filesystem).
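From what I've read, "dlm: connect from non cluster node" usually means dlm received a connection from an IP address that doesn't match any ring0_addr in the nodelist, e.g. when a node name resolves to a different address (or interface) on each host. A quick sketch I used to compare resolution on both nodes (the helper name is mine, not part of any cluster tooling):

```python
import socket

def node_addrs(name):
    """Return every address a ring0_addr hostname resolves to.
    If the two cluster nodes disagree on any of these, dlm can
    see a connection as coming from a non-cluster IP."""
    try:
        return sorted({ai[4][0] for ai in socket.getaddrinfo(name, None)})
    except socket.gaierror:
        return []  # name does not resolve at all

# Run on both nodes and compare (names from the corosync.conf nodelist):
for n in ("rh7cn1.devlab.sinenomine.net", "rh7cn2.devlab.sinenomine.net"):
    print(n, node_addrs(n))
```

If the lists differ between the nodes (or include an interface corosync isn't bound to), that would explain the kernel message.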
Just prior to a node being fenced, both nodes show the following:
# pcs resource show
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd (ocf::heartbeat:clvm): FAILED
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
Shortly afterwards there is a clvmd timeout message in one of the logs, and then that node gets fenced. I had added the high-availability firewalld service to both nodes.
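For what it's worth, this is how I checked the firewall on each node (assuming the stock high-availability service definition shipped with RHEL 7):

```shell
# Confirm the high-availability service is active in the current zone
firewall-cmd --list-services

# Inspect the service definition itself; as I understand it, it should
# cover corosync (5404-5405/udp), pcsd (2224/tcp),
# pacemaker_remote (3121/tcp) and dlm (21064/tcp)
firewall-cmd --info-service=high-availability
```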
Running crm_simulate -SL -VV shows:
warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)
Current cluster status:
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER (stonith:fence_zvm): Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd (ocf::heartbeat:clvm): FAILED rh7cn1.devlab.sinenomine.net
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)
warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)
Transition Summary:
* Stop clvmd:1 (rh7cn1.devlab.sinenomine.net)
Executing cluster transition:
* Pseudo action: clvmd-clone_stop_0
* Resource action: clvmd stop on rh7cn1.devlab.sinenomine.net
* Pseudo action: clvmd-clone_stopped_0
* Pseudo action: all_stopped
Revised cluster status:
warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER (stonith:fence_zvm): Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
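If I understand the "after 1000000 failures (max=1000000)" warnings correctly, 1000000 is pacemaker's INFINITY: a failed start pushes the fail-count straight to INFINITY by default, which meets migration-threshold and bans the resource from that node. A minimal sketch of that logic as I understand it (my own function names, not pacemaker code):

```python
INFINITY = 1000000  # pacemaker's internal representation of "infinite"

def allowed_on_node(fail_count, migration_threshold=INFINITY):
    """A resource is banned from a node once its fail-count reaches
    migration-threshold. A failed *start* jumps the fail-count straight
    to INFINITY by default, while a failed monitor only adds one."""
    return fail_count < migration_threshold

# After the failed clvmd start, fail-count is INFINITY, so the node is banned:
print(allowed_on_node(INFINITY))                        # False
# An ordinary monitor failure only counts once against the threshold:
print(allowed_on_node(1, migration_threshold=3))        # True
```

Which would explain why clvmd never retries on rh7cn1 even though the node stays online.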
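I believe the fail-count has to be cleared before rh7cn1 will even attempt clvmd again, once the underlying problem is fixed; the pcs commands I've found for that are:

```shell
# Show the accumulated fail-count for clvmd on each node
pcs resource failcount show clvmd

# Clear the failure history so the policy engine retries the start
pcs resource cleanup clvmd
```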
With RHEL 6 I would have used a qdisk, but that has been replaced by corosync_votequorum.
This is my first RHEL 7 HA cluster, so I'm at the beginning of the learning curve. Any pointers as to what I should look at or what I need to read?
Neale