Thank you for that, excellent tip. Yesterday evening I forced a re-install of all cluster-associated RPMs in case of some sort of binary corruption. Still getting the same result. This log is from yesterday, after increasing the log level of rgmanager, and it comes from the node that did the fencing. The "spare" machine did not pick up the service until after all the other nodes had noticed the "failed" node with a "clurgmgrd[5234]: <info> State change: 192.168.1.101 UP", which is, of course, after the node was fenced, had rebooted, and had rejoined the cluster. Really weird issue.
<div style="margin-left: 40px;">
May 19 16:19:13 c1n2 root: MARK I fail c1n1 running core1 by ifconfigging its ethernet ports off
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] The token was lost in the OPERATIONAL state.
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] entering GATHER state from 2.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering GATHER state from 11.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] Saving state aru 8b high seq received 8b
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] Storing new sequence id for ring d0c
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering COMMIT state.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering RECOVERY state.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [0] member 192.168.1.103:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep 192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [1] member 192.168.1.104:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep 192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [2] member 192.168.1.105:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep 192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [3] member 192.168.1.102:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep 192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] Did not need to originate any messages in recovery.
May 19 16:19:40 c1n2 openais[4660]: [CLM ] CLM CONFIGURATION CHANGE
May 19 16:19:40 c1n2 openais[4660]: [CLM ] New Configuration:
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.103)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.104)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.105)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.102)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] Members Left:
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.101)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] Members Joined:
May 19 16:19:40 c1n2 openais[4660]: [CLM ] CLM CONFIGURATION CHANGE
May 19 16:19:40 c1n2 openais[4660]: [CLM ] New Configuration:
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.103)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.104)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.105)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.102)
May 19 16:19:40 c1n2 openais[4660]: [CLM ] Members Left:
May 19 16:19:40 c1n2 openais[4660]: [CLM ] Members Joined:
May 19 16:19:40 c1n2 openais[4660]: [SYNC ] This node is within the primary component and will provide service.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering OPERATIONAL state.
May 19 16:19:40 c1n2 kernel: dlm: closing connection to node 1
May 19 16:19:40 c1n2 clurgmgrd[5234]: <info> State change: 192.168.1.101 DOWN
May 19 16:19:40 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.104
May 19 16:19:40 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.105
May 19 16:19:40 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.102
May 19 16:19:40 c1n2 openais[4660]: [CPG ] got joinlist message from node 5
May 19 16:19:40 c1n2 openais[4660]: [CPG ] got joinlist message from node 2
May 19 16:19:40 c1n2 openais[4660]: [CPG ] got joinlist message from node 3
May 19 16:19:40 c1n2 openais[4660]: [CPG ] got joinlist message from node 4
May 19 16:19:43 c1n2 fenced[4680]: 192.168.1.101 not a cluster member after 3 sec post_fail_delay
May 19 16:19:43 c1n2 fenced[4680]: fencing node "192.168.1.101"
May 19 16:19:45 c1n2 clurgmgrd[5234]: <info> Waiting for node #1 to be fenced
May 19 16:19:47 c1n2 fenced[4680]: fence "192.168.1.101" success
May 19 16:19:47 c1n2 clurgmgrd[5234]: <info> Node #1 fenced; continuing
May 19 16:20:05 c1n2 clurgmgrd: [5234]: <info> Executing /ha/bin/ha-hpss-mover1 status
May 19 16:22:37 c1n2 clurgmgrd: [5234]: <info> Executing /ha/bin/ha-hpss-mover1 status
May 19 16:23:27 c1n2 last message repeated 3 times
May 19 16:24:57 c1n2 last message repeated 3 times
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering GATHER state from 11.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] Saving state aru 3e high seq received 3e
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] Storing new sequence id for ring d10
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering COMMIT state.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering RECOVERY state.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [0] member 192.168.1.103:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep 192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [1] member 192.168.1.104:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep 192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [2] member 192.168.1.105:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep 192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [3] member 192.168.1.101:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep 192.168.1.101
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru a high delivered a received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [4] member 192.168.1.102:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep 192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] Did not need to originate any messages in recovery.
May 19 16:25:17 c1n2 openais[4660]: [CLM ] CLM CONFIGURATION CHANGE
May 19 16:25:17 c1n2 openais[4660]: [CLM ] New Configuration:
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.103)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.104)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.105)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.102)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] Members Left:
May 19 16:25:17 c1n2 openais[4660]: [CLM ] Members Joined:
May 19 16:25:17 c1n2 openais[4660]: [CLM ] CLM CONFIGURATION CHANGE
May 19 16:25:17 c1n2 openais[4660]: [CLM ] New Configuration:
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.103)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.104)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.105)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.101)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.102)
May 19 16:25:17 c1n2 openais[4660]: [CLM ] Members Left:
May 19 16:25:17 c1n2 openais[4660]: [CLM ] Members Joined:
May 19 16:25:17 c1n2 openais[4660]: [CLM ] r(0) ip(192.168.1.101)
May 19 16:25:17 c1n2 openais[4660]: [SYNC ] This node is within the primary component and will provide service.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering OPERATIONAL state.
May 19 16:25:17 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.104
May 19 16:25:17 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.105
May 19 16:25:17 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.101
May 19 16:25:17 c1n2 openais[4660]: [CLM ] got nodejoin message 192.168.1.102
May 19 16:25:17 c1n2 openais[4660]: [CPG ] got joinlist message from node 2
May 19 16:25:17 c1n2 openais[4660]: [CPG ] got joinlist message from node 3
May 19 16:25:17 c1n2 openais[4660]: [CPG ] got joinlist message from node 4
May 19 16:25:17 c1n2 openais[4660]: [CPG ] got joinlist message from node 5
May 19 16:25:24 c1n2 kernel: dlm: connecting to 1
May 19 16:25:27 c1n2 clurgmgrd: [5234]: <info> Executing /ha/bin/ha-hpss-mover1 status
May 19 16:25:57 c1n2 clurgmgrd: [5234]: <info> Executing /ha/bin/ha-hpss-mover1 status
May 19 16:26:00 c1n2 clurgmgrd[5234]: <info> State change: 192.168.1.101 UP
May 19 16:26:27 c1n2 clurgmgrd: [5234]: <info> Executing /ha/bin/ha-hpss-mover1 status
May 19 16:26:56 c1n2 xinetd[9002]: Exiting...
May 19 16:26:56 c1n2 xinetd[2236]: xinetd Version 2.3.14 started with libwrap loadavg labeled-networking options compiled in.
May 19 16:26:56 c1n2 xinetd[2236]: Started working: 1 available service
May 19 16:26:57 c1n2 clurgmgrd: [5234]: <info> Executing /ha/bin/ha-hpss-mover1 status
May 19 16:28:57 c1n2 last message repeated 2 times
May 19 16:28:58 c1n2 root: MARK II - end of test
</div>
<div class="gmail_quote">On Wed, May 19, 2010 at 2:42 PM, Alfredo Moralejo <<a href="mailto:amoralej@redhat.com">amoralej@redhat.com</a>> wrote: <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> <div bgcolor="#ffffff" text="#000000">
What is the state of the service that was running on the node after pulling the power cables? Stopped, failed?
Set rgmanager in verbose mode with <rm log_level="7" log_facility="local4">

Regards

Alfredo
<div><div class="h5"> On 05/19/2010 07:08 PM, Dusty wrote: </div></div>
<blockquote type="cite"><div><div class="h5">In the interest of troubleshooting, I've taken all the failover domains out of the configuration. This resulted in no change: the service on a failed node does not relocate until the failed node reboots. To reiterate: a similar cluster configuration on similar hardware worked perfectly on RHEL5U3. </div></div>
<pre> -- Linux-cluster mailing list <div class="im"><a href="mailto:Linux-cluster@redhat.com" target="_blank">Linux-cluster@redhat.com</a> <a href="https://www.redhat.com/mailman/listinfo/linux-cluster" target="_blank">https://www.redhat.com/mailman/listinfo/linux-cluster</a></div></pre> </blockquote>
<div>-- Alfredo Moralejo Red Hat - Senior consultant Office: +34 914148838 Cell: +34 607909535 Email: <a href="mailto:alfredo.moralejo@redhat.com" target="_blank">alfredo.moralejo@redhat.com</a> Business address: C/Jose Bardasano Baos, 9, Edif. Gorbea 3, planta 3ºD, 28016 Madrid, Spain Registered address: Red Hat S.L., C/ Velazquez 63, Madrid 28001, Spain Registered in the Madrid Mercantile Registry, C.I.F. B82657941 </div> </div>
</blockquote></div>
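To put a number on the delay, here is a minimal Python sketch that pulls the fencing and rgmanager state-change lines out of a syslog excerpt and measures the time from "fence ... success" to the "State change ... UP" line, which is roughly when the service finally relocated according to the description above. The function names, the keyword list, and the hard-coded year are assumptions made for illustration; the sample lines are copied from the log in this message, and in practice you would feed in the real syslog file with one entry per line.

```python
from datetime import datetime

# Sample entries copied from the log in this message (one entry per line).
SAMPLE = """\
May 19 16:19:40 c1n2 clurgmgrd[5234]: <info> State change: 192.168.1.101 DOWN
May 19 16:19:43 c1n2 fenced[4680]: fencing node "192.168.1.101"
May 19 16:19:47 c1n2 fenced[4680]: fence "192.168.1.101" success
May 19 16:19:47 c1n2 clurgmgrd[5234]: <info> Node #1 fenced; continuing
May 19 16:26:00 c1n2 clurgmgrd[5234]: <info> State change: 192.168.1.101 UP
"""

# Hypothetical keyword list: substrings that mark the events of interest.
KEYWORDS = ("State change:", "fencing node", "fence ", "fenced;")


def parse_time(line, year=2010):
    """Parse the 15-character syslog timestamp; syslog omits the year, so it is passed in."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def timeline(text):
    """Return (timestamp, message) pairs for fencing and state-change entries."""
    events = []
    for line in text.splitlines():
        if any(keyword in line for keyword in KEYWORDS):
            events.append((parse_time(line), line[16:]))
    return events


def seconds_between(events, first, second):
    """Seconds from the first event containing `first` to the next event containing `second`."""
    start = next(t for t, msg in events if first in msg)
    end = next(t for t, msg in events if second in msg and t >= start)
    return (end - start).total_seconds()


events = timeline(SAMPLE)
for stamp, message in events:
    print(stamp.time(), message)

gap = seconds_between(events, "success", " UP")
print(f"seconds from fence success to node UP: {gap}")
```

On the sample lines this shows the fence succeeding at 16:19:47 and the node reported UP at 16:26:00, so the service sat unrelocated for a bit over six minutes even though fencing completed within seconds.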