[Linux-cluster] Fencing of node
Neale Ferguson
neale at sinenomine.net
Thu Oct 2 19:30:03 UTC 2014
After creating a simple two-node cluster, one node is being fenced continually. I'm running pacemaker (1.1.10-29) on two nodes with the following corosync.conf:
totem {
    version: 2
    secauth: off
    cluster_name: rh7cluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: rh7cn1.devlab.sinenomine.net
        nodeid: 1
    }

    node {
        ring0_addr: rh7cn2.devlab.sinenomine.net
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_syslog: yes
}
Starting the cluster shows:
Oct 2 15:17:47 rh7cn1 kernel: dlm: connect from non cluster node
This appears in the logs of both nodes. Both nodes then try to bring up their resources (dlm, clvmd, and a cluster filesystem).
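From what I've read, "dlm: connect from non cluster node" usually means dlm received a connection from an IP address that doesn't match any ring0_addr in the nodelist, e.g. when a node name resolves to a different address (or interface) on each host. A quick sketch I used to compare resolution on both nodes (the helper name is mine, not part of any cluster tooling):

```python
import socket

def node_addrs(name):
    """Return every address a ring0_addr hostname resolves to.
    If the two cluster nodes disagree on any of these, dlm can
    see a connection as coming from a non-cluster IP."""
    try:
        return sorted({ai[4][0] for ai in socket.getaddrinfo(name, None)})
    except socket.gaierror:
        return []  # name does not resolve at all

# Run on both nodes and compare (names from the corosync.conf nodelist):
for n in ("rh7cn1.devlab.sinenomine.net", "rh7cn2.devlab.sinenomine.net"):
    print(n, node_addrs(n))
```

If the lists differ between the nodes (or include an interface corosync isn't bound to), that would explain the kernel message.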
Just prior to a node being fenced, both nodes show the following:
# pcs resource show
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd (ocf::heartbeat:clvm): FAILED
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
Shortly afterwards there is a clvmd timeout message in one of the logs, and then that node gets fenced. I had added the high-availability firewalld service to both nodes.
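For what it's worth, this is how I checked the firewall on each node (assuming the stock high-availability service definition shipped with RHEL 7):

```shell
# Confirm the high-availability service is active in the current zone
firewall-cmd --list-services

# Inspect the service definition itself; as I understand it, it should
# cover corosync (5404-5405/udp), pcsd (2224/tcp),
# pacemaker_remote (3121/tcp) and dlm (21064/tcp)
firewall-cmd --info-service=high-availability
```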
Running crm_simulate -SL -VV shows:
warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)
Current cluster status:
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER (stonith:fence_zvm): Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd (ocf::heartbeat:clvm): FAILED rh7cn1.devlab.sinenomine.net
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)
warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)
Transition Summary:
* Stop clvmd:1 (rh7cn1.devlab.sinenomine.net)
Executing cluster transition:
* Pseudo action: clvmd-clone_stop_0
* Resource action: clvmd stop on rh7cn1.devlab.sinenomine.net
* Pseudo action: clvmd-clone_stopped_0
* Pseudo action: all_stopped
Revised cluster status:
warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER (stonith:fence_zvm): Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
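If I understand the "after 1000000 failures (max=1000000)" warnings correctly, 1000000 is pacemaker's INFINITY: a failed start pushes the fail-count straight to INFINITY by default, which meets migration-threshold and bans the resource from that node. A minimal sketch of that logic as I understand it (my own function names, not pacemaker code):

```python
INFINITY = 1000000  # pacemaker's internal representation of "infinite"

def allowed_on_node(fail_count, migration_threshold=INFINITY):
    """A resource is banned from a node once its fail-count reaches
    migration-threshold. A failed *start* jumps the fail-count straight
    to INFINITY by default, while a failed monitor only adds one."""
    return fail_count < migration_threshold

# After the failed clvmd start, fail-count is INFINITY, so the node is banned:
print(allowed_on_node(INFINITY))                        # False
# An ordinary monitor failure only counts once against the threshold:
print(allowed_on_node(1, migration_threshold=3))        # True
```

Which would explain why clvmd never retries on rh7cn1 even though the node stays online.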
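I believe the fail-count has to be cleared before rh7cn1 will even attempt clvmd again, once the underlying problem is fixed; the pcs commands I've found for that are:

```shell
# Show the accumulated fail-count for clvmd on each node
pcs resource failcount show clvmd

# Clear the failure history so the policy engine retries the start
pcs resource cleanup clvmd
```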
With RHEL 6 I would have used a qdisk, but that has been replaced by corosync_votequorum.
This is my first RHEL 7 HA cluster, so I'm at the beginning of the learning curve. Any pointers as to what I should look at or what I need to read?
Neale