<div class="gmail_quote">On Mon, May 17, 2010 at 4:56 PM, Corey Kovacs <<a href="mailto:corey.kovacs@gmail.com">corey.kovacs@gmail.com</a>> wrote: <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> The service scripts you have in the config above look made up. Are those scripts you wrote, or are you actually using SysV init scripts? </blockquote><div> I wrote the resource scripts. They all respond to {start|status|stop} as necessary. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Also, can you include a complete log segment? It's quite hard to debug someone's problem with only partial information. </blockquote><div> Here's a segment with the APC PDU as the fencing device: <div style="margin-left: 40px;">May 12 10:50:00 c1n2 openais[26524]: [TOTEM] The token was lost in the OPERATIONAL state. May 12 10:50:00 c1n2 openais[26524]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). May 12 10:50:00 c1n2 openais[26524]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). May 12 10:50:00 c1n2 openais[26524]: [TOTEM] entering GATHER state from 2. May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering GATHER state from 0. May 12 10:50:05 c1n2 openais[26524]: [TOTEM] Saving state aru 2c3 high seq received 2c3 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] Storing new sequence id for ring ae4 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering COMMIT state. May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering RECOVERY state. 
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [0] member <a href="http://192.168.1.103">192.168.1.103</a>: May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [1] member <a href="http://192.168.1.104">192.168.1.104</a>: May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [2] member <a href="http://192.168.1.105">192.168.1.105</a>: May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [3] member <a href="http://192.168.1.102">192.168.1.102</a>: May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1 May 12 10:50:05 c1n2 openais[26524]: [TOTEM] Did not need to originate any messages in recovery. 
May 12 10:50:05 c1n2 openais[26524]: [CLM ] CLM CONFIGURATION CHANGE May 12 10:50:05 c1n2 openais[26524]: [CLM ] New Configuration: May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.103) May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.104) May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.105) May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.102) May 12 10:50:05 c1n2 openais[26524]: [CLM ] Members Left: May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.101) May 12 10:50:05 c1n2 openais[26524]: [CLM ] Members Joined: May 12 10:50:05 c1n2 openais[26524]: [CLM ] CLM CONFIGURATION CHANGE May 12 10:50:05 c1n2 openais[26524]: [CLM ] New Configuration: May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.103) May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.104) May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.105) May 12 10:50:05 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.102) May 12 10:50:05 c1n2 openais[26524]: [CLM ] Members Left: May 12 10:50:05 c1n2 openais[26524]: [CLM ] Members Joined: May 12 10:50:05 c1n2 openais[26524]: [SYNC ] This node is within the primary component and will provide service. May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering OPERATIONAL state. 
May 12 10:50:05 c1n2 kernel: dlm: closing connection to node 1 May 12 10:50:05 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.103 May 12 10:50:05 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.104 May 12 10:50:05 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.105 May 12 10:50:05 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.102 May 12 10:50:05 c1n2 openais[26524]: [CPG ] got joinlist message from node 5 May 12 10:50:05 c1n2 openais[26524]: [CPG ] got joinlist message from node 2 May 12 10:50:05 c1n2 openais[26524]: [CPG ] got joinlist message from node 3 May 12 10:50:05 c1n2 openais[26524]: [CPG ] got joinlist message from node 4 May 12 10:50:08 c1n2 fenced[26544]: 192.168.1.101 not a cluster member after 3 sec post_fail_delay May 12 10:50:08 c1n2 fenced[26544]: fencing node "192.168.1.101" May 12 10:50:12 c1n2 fenced[26544]: fence "192.168.1.101" success </div><div style="margin-left: 40px;">May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering GATHER state from 11. May 12 10:54:04 c1n2 openais[26524]: [TOTEM] Saving state aru 3e high seq received 3e May 12 10:54:04 c1n2 openais[26524]: [TOTEM] Storing new sequence id for ring ae8 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering COMMIT state. May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering RECOVERY state. 
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [0] member <a href="http://192.168.1.103">192.168.1.103</a>: May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [1] member <a href="http://192.168.1.104">192.168.1.104</a>: May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [2] member <a href="http://192.168.1.105">192.168.1.105</a>: May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [3] member <a href="http://192.168.1.101">192.168.1.101</a>: May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.101 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru a high delivered a received flag 1 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [4] member <a href="http://192.168.1.102">192.168.1.102</a>: May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1 May 12 10:54:04 c1n2 openais[26524]: [TOTEM] Did not need to originate any messages in recovery. 
May 12 10:54:04 c1n2 openais[26524]: [CLM ] CLM CONFIGURATION CHANGE May 12 10:54:04 c1n2 openais[26524]: [CLM ] New Configuration: May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.103) May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.104) May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.105) May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.102) May 12 10:54:04 c1n2 openais[26524]: [CLM ] Members Left: May 12 10:54:04 c1n2 openais[26524]: [CLM ] Members Joined: May 12 10:54:04 c1n2 openais[26524]: [CLM ] CLM CONFIGURATION CHANGE May 12 10:54:04 c1n2 openais[26524]: [CLM ] New Configuration: May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.103) May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.104) May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.105) May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.101) May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.102) May 12 10:54:04 c1n2 openais[26524]: [CLM ] Members Left: May 12 10:54:04 c1n2 openais[26524]: [CLM ] Members Joined: May 12 10:54:04 c1n2 openais[26524]: [CLM ] r(0) ip(192.168.1.101) May 12 10:54:04 c1n2 openais[26524]: [SYNC ] This node is within the primary component and will provide service. May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering OPERATIONAL state. 
May 12 10:54:04 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.103 May 12 10:54:04 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.104 May 12 10:54:04 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.105 May 12 10:54:04 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.101 May 12 10:54:04 c1n2 openais[26524]: [CLM ] got nodejoin message 192.168.1.102 May 12 10:54:04 c1n2 openais[26524]: [CPG ] got joinlist message from node 5 May 12 10:54:04 c1n2 openais[26524]: [CPG ] got joinlist message from node 2 May 12 10:54:04 c1n2 openais[26524]: [CPG ] got joinlist message from node 3 May 12 10:54:04 c1n2 openais[26524]: [CPG ] got joinlist message from node 4 </div><div style="margin-left: 40px;"> </div>Please notice in the above log that the APC PDU reported to node2 (192.168.1.102), and node2 reported in its log, that fencing was successful. Also please note that no service relocation occurred for the service node1 was running for the four minutes it took for node1 to come back online. Here's another log segment after taking out the APC PDU and inserting manual_fencing as the fencing device: <div style="margin-left: 40px;">May 18 11:34:12 c1n2 root: MARK I begin test. doing ifcfg eth0 down && ifcfg eth1 down on node c1n1 May 18 11:35:03 c1n2 openais[25546]: [TOTEM] The token was lost in the OPERATIONAL state. May 18 11:35:03 c1n2 openais[25546]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). May 18 11:35:03 c1n2 openais[25546]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). May 18 11:35:03 c1n2 openais[25546]: [TOTEM] entering GATHER state from 2. May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering GATHER state from 0. May 18 11:35:08 c1n2 openais[25546]: [TOTEM] Saving state aru 1ec high seq received 1ec May 18 11:35:08 c1n2 openais[25546]: [TOTEM] Storing new sequence id for ring c5c May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering COMMIT state. 
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering RECOVERY state. May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [0] member <a href="http://192.168.1.103">192.168.1.103</a>: May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [1] member <a href="http://192.168.1.104">192.168.1.104</a>: May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [2] member <a href="http://192.168.1.105">192.168.1.105</a>: May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [3] member <a href="http://192.168.1.102">192.168.1.102</a>: May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1 May 18 11:35:08 c1n2 openais[25546]: [TOTEM] Did not need to originate any messages in recovery. 
May 18 11:35:08 c1n2 openais[25546]: [CLM ] CLM CONFIGURATION CHANGE May 18 11:35:08 c1n2 openais[25546]: [CLM ] New Configuration: May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.103) May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.104) May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.105) May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.102) May 18 11:35:08 c1n2 openais[25546]: [CLM ] Members Left: May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.101) May 18 11:35:08 c1n2 openais[25546]: [CLM ] Members Joined: May 18 11:35:08 c1n2 openais[25546]: [CLM ] CLM CONFIGURATION CHANGE May 18 11:35:08 c1n2 openais[25546]: [CLM ] New Configuration: May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.103) May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.104) May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.105) May 18 11:35:08 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.102) May 18 11:35:08 c1n2 openais[25546]: [CLM ] Members Left: May 18 11:35:08 c1n2 openais[25546]: [CLM ] Members Joined: May 18 11:35:08 c1n2 openais[25546]: [SYNC ] This node is within the primary component and will provide service. May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering OPERATIONAL state. 
May 18 11:35:08 c1n2 kernel: dlm: closing connection to node 1 May 18 11:35:08 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.103 May 18 11:35:08 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.104 May 18 11:35:08 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.105 May 18 11:35:08 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.102 May 18 11:35:08 c1n2 openais[25546]: [CPG ] got joinlist message from node 4 May 18 11:35:08 c1n2 openais[25546]: [CPG ] got joinlist message from node 5 May 18 11:35:08 c1n2 openais[25546]: [CPG ] got joinlist message from node 2 May 18 11:35:08 c1n2 openais[25546]: [CPG ] got joinlist message from node 3 May 18 11:35:11 c1n2 fenced[25566]: 192.168.1.101 not a cluster member after 3 sec post_fail_delay May 18 11:35:11 c1n2 fenced[25566]: fencing node "192.168.1.101" May 18 11:35:11 c1n2 fence_manual: Node 192.168.1.101 needs to be reset before recovery can procede. Waiting for 192.168.1.101 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 192.168.1.101) May 18 11:37:30 c1n2 ccsd[25540]: Attempt to close an unopened CCS descriptor (5280). May 18 11:37:30 c1n2 ccsd[25540]: Error while processing disconnect: Invalid request descriptor May 18 11:37:30 c1n2 fenced[25566]: fence "192.168.1.101" success May 18 11:41:31 c1n2 root: MARK II node c1n1 up now, no service relocation of service core1 occurred May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering GATHER state from 11. May 18 11:41:41 c1n2 openais[25546]: [TOTEM] Saving state aru 3f high seq received 3f May 18 11:41:41 c1n2 openais[25546]: [TOTEM] Storing new sequence id for ring c60 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering COMMIT state. May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering RECOVERY state. 
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [0] member <a href="http://192.168.1.103">192.168.1.103</a>: May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [1] member <a href="http://192.168.1.104">192.168.1.104</a>: May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [2] member <a href="http://192.168.1.105">192.168.1.105</a>: May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [3] member <a href="http://192.168.1.101">192.168.1.101</a>: May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.101 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru a high delivered a received flag 1 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [4] member <a href="http://192.168.1.102">192.168.1.102</a>: May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1 May 18 11:41:41 c1n2 openais[25546]: [TOTEM] Did not need to originate any messages in recovery. 
May 18 11:41:41 c1n2 openais[25546]: [CLM ] CLM CONFIGURATION CHANGE May 18 11:41:41 c1n2 openais[25546]: [CLM ] New Configuration: May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.103) May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.104) May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.105) May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.102) May 18 11:41:41 c1n2 openais[25546]: [CLM ] Members Left: May 18 11:41:41 c1n2 openais[25546]: [CLM ] Members Joined: May 18 11:41:41 c1n2 openais[25546]: [CLM ] CLM CONFIGURATION CHANGE May 18 11:41:41 c1n2 openais[25546]: [CLM ] New Configuration: May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.103) May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.104) May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.105) May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.101) May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.102) May 18 11:41:41 c1n2 openais[25546]: [CLM ] Members Left: May 18 11:41:41 c1n2 openais[25546]: [CLM ] Members Joined: May 18 11:41:41 c1n2 openais[25546]: [CLM ] r(0) ip(192.168.1.101) May 18 11:41:41 c1n2 openais[25546]: [SYNC ] This node is within the primary component and will provide service. May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering OPERATIONAL state. 
May 18 11:41:41 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.103 May 18 11:41:41 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.104 May 18 11:41:41 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.105 May 18 11:41:41 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.101 May 18 11:41:41 c1n2 openais[25546]: [CLM ] got nodejoin message 192.168.1.102 May 18 11:41:41 c1n2 openais[25546]: [CPG ] got joinlist message from node 5 May 18 11:41:41 c1n2 openais[25546]: [CPG ] got joinlist message from node 2 May 18 11:41:41 c1n2 openais[25546]: [CPG ] got joinlist message from node 3 May 18 11:41:41 c1n2 openais[25546]: [CPG ] got joinlist message from node 4 May 18 11:41:48 c1n2 kernel: dlm: connecting to 1 </div><div style="margin-left: 40px;"> </div>Wonder what these indicate from that segment: <div style="margin-left: 40px;">May 18 11:37:30 c1n2 ccsd[25540]: Attempt to close an unopened CCS descriptor (5280). May 18 11:37:30 c1n2 ccsd[25540]: Error while processing disconnect: Invalid request descriptor </div></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Really, I don't want to hear "It should work, I've done everything right" since clearly something is wrong. I as have many people here built several if not dozens of these clusters and we are making suggestions where we have seen the most problems. </blockquote><div> Ok. This is my seventh cluster. All previously built clusters, on RHEL5U3 (this one is the first I've built on RHEL5U4) functioned perfectly. I appreciate you making suggestions - I'm just saying that I'd stated three times that the APC unit is fencing properly. Have tested with fence_tool across all nodes. 
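For comparing the two tests, the fencing timings in the log segments above can be pulled out mechanically. What follows is only a minimal sketch, not part of the cluster tooling: the function name summarize, the regular expressions, and the assumed year 2010 (syslog timestamps carry none; it is taken from the dates on these emails) are all illustrative choices. It pairs each fenced "fencing node" line with the matching "success" line and counts any clurgmgrd (rgmanager) lines, since their absence is the symptom being reported.

```python
import re
from datetime import datetime

# fenced entries copied from the two log segments above.
SAMPLE_LOG = """\
May 12 10:50:08 c1n2 fenced[26544]: fencing node "192.168.1.101"
May 12 10:50:12 c1n2 fenced[26544]: fence "192.168.1.101" success
May 18 11:35:11 c1n2 fenced[25566]: fencing node "192.168.1.101"
May 18 11:37:30 c1n2 fenced[25566]: fence "192.168.1.101" success
"""

# group 1: timestamp, group 2: process name, group 3: message text
LINE_RE = re.compile(r'^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ ([\w-]+)(?:\[\d+\])?: (.*)$')
START_RE = re.compile(r'^fencing node "([^"]+)"')
DONE_RE = re.compile(r'^fence "([^"]+)" success')


def parse_ts(text, year=2010):
    # Assumption: the year is not in the log, so it is supplied by the caller.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def summarize(log_text):
    """Return ([(node, fence_start, seconds_until_success)], rgmanager_line_count)."""
    pending = {}          # node -> time its fence operation started
    results = []
    rgmanager_lines = 0
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ts = parse_ts(match.group(1))
        proc, msg = match.group(2), match.group(3)
        if proc == "clurgmgrd":
            rgmanager_lines += 1
        elif proc == "fenced":
            start = START_RE.match(msg)
            done = DONE_RE.match(msg)
            if start:
                pending[start.group(1)] = ts
            elif done and done.group(1) in pending:
                begun = pending.pop(done.group(1))
                results.append((done.group(1), begun, (ts - begun).total_seconds()))
    return results, rgmanager_lines


if __name__ == "__main__":
    fences, rg_count = summarize(SAMPLE_LOG)
    for node, begun, seconds in fences:
        print(f"{begun:%b %d %H:%M:%S} fence of {node} took {seconds:.0f}s")
    print(f"rgmanager (clurgmgrd) lines seen: {rg_count}")
```

Fed a full /var/log/messages excerpt instead of the sample, the same function would show whether any rgmanager recovery message follows a successful fence, which is the comparison the two tests above are making by eye.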
</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> It's always funny how the software engineers are never to blame for their software not working until I prove to them that it's their fault. Not trying to be a jerk, or point fingers, but open your mind a bit. </blockquote><div> My mind is open. I keep saying the APC PDU is successfully fencing, keep getting asked whether the APC PDU is successfully fencing, and keep reporting that the APC PDU is successfully fencing. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Sometimes fencing can be hindered by simply being logged into the device while the cluster is trying to talk to it. First-generation iLOs and some APC firmware have problems with this. </blockquote><div> Understood - I've been making sure to log out when testing. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> What's the output of ... 
cman_tool status </blockquote><div><div style="margin-left: 40px;"># cman_tool status<br>
Version: 6.2.0<br>
Config Version: 102<br>
Cluster Name: hpss1<br>
Cluster Id: 3299<br>
Cluster Member: Yes<br>
Cluster Generation: 3168<br>
Membership state: Cluster-Member<br>
Nodes: 5<br>
Expected votes: 5<br>
Total votes: 5<br>
Quorum: 3<br>
Active subsystems: 9<br>
Flags: Dirty<br>
Ports Bound: 0 11 177<br>
Node name: 192.168.222.86<br>
Node ID: 2<br>
Multicast addresses: 239.192.12.239<br>
Node addresses: 192.168.2.86 </div> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> cman_tool nodes </blockquote><div style="margin-left: 40px;"> # cman_tool nodes<br>
Node Sts Inc Joined Name<br>
1 M 3168 2010-05-18 11:41:41 192.168.1.101<br>
2 M 3140 2010-05-18 11:32:12 192.168.1.102<br>
3 M 3156 2010-05-18 11:32:12 192.168.1.103<br>
4 M 3156 2010-05-18 11:32:12 192.168.1.104<br>
5 M 3160 2010-05-18 11:32:12 192.168.1.105 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> cman_tool services </blockquote><div style="margin-left: 40px;"> # cman_tool services<br>
type level name id state<br>
fence 0 default 00010004 none<br>
[1 2 3 4 5]<br>
dlm 1 clvmd 00010005 none<br>
[1 2 3 4 5]<br>
dlm 1 rgmanager 00020004 none<br>
[1 2 3 4 5]<br>
dlm 1 m1_hpssSource 00020002 none<br>
[2]<br>
dlm 1 m1_varHpss 00040002 none<br>
[2]<br>
dlm 1 m1_varHpssAdmCor 00060002 none<br>
[2]<br>
gfs 2 m1_hpssSource 00010002 none<br>
[2]<br>
gfs 2 m1_varHpss 00030002 none<br>
[2]<br>
gfs 2 m1_varHpssAdmCor 00050002 none<br>
[2] </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Are you sure that all the nodes have the right cluster config? ccs_tool update /etc/cluster/cluster.conf ? </blockquote><div> </div><div>Yes, absolutely positive. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> What are you using to manage the config? 
ricci/luci, system-config-cluster, vi? </blockquote><div> Sometimes luci, sometimes vi. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Can we see a "real" cluster.conf? </blockquote><div> The IPs have been changed throughout this email (as thoroughly and consistently as possible) per security directives. <?xml version="1.0"?> <cluster config_version="102" name="hpss7"> <fence_daemon clean_start="0" post_fail_delay="3" post_join_delay="60"/> <clusternodes> <clusternode name="192.168.1.101" nodeid="1" votes="1"> <fence> <method name="1"> <device name="manual_fence_c1n1" nodename="192.168.1.101"/> </method> </fence> </clusternode> <clusternode name="192.168.1.102" nodeid="2" votes="1"> <fence> <method name="1"> <device name="manual_fence_c1n2" nodename="192.168.1.102"/> </method> </fence> </clusternode> <clusternode name="192.168.1.103" nodeid="3" votes="1"> <fence> <method name="1"> <device name="manual_fence_c1n3" nodename="192.168.1.103"/> </method> </fence> </clusternode> <clusternode name="192.168.1.104" nodeid="4" votes="1"> <fence> <method name="1"> <device name="manual_fence_c1n4" nodename="192.168.1.104"/> </method> </fence> </clusternode> <clusternode name="192.168.1.105" nodeid="5" votes="1"> <fence> <method name="1"> <device name="manual_fence_c1n5" nodename="192.168.1.105"/> </method> </fence> </clusternode> </clusternodes> <cman/> <fencedevices> <fencedevice agent="fence_manual" name="manual_fence_c1n3"/> <fencedevice agent="fence_manual" name="manual_fence_c1n1"/> <fencedevice agent="fence_manual" name="manual_fence_c1n2"/> <fencedevice agent="fence_manual" name="manual_fence_c1n4"/> <fencedevice agent="fence_manual" name="manual_fence_c1n5"/> </fencedevices> <rm> <failoverdomains> <failoverdomain name="fd_core1" nofailback="1" ordered="1" restricted="1"> <failoverdomainnode name="192.168.1.101" priority="1"/> <failoverdomainnode name="192.168.1.102" 
priority="2"/> <failoverdomainnode name="192.168.1.103" priority="3"/> <failoverdomainnode name="192.168.1.104" priority="4"/> <failoverdomainnode name="192.168.1.105" priority="5"/> </failoverdomain> <failoverdomain name="fd_mover1" nofailback="1" ordered="1" restricted="1"> <failoverdomainnode name="192.168.1.101" priority="5"/> <failoverdomainnode name="192.168.1.102" priority="1"/> <failoverdomainnode name="192.168.1.103" priority="2"/> <failoverdomainnode name="192.168.1.104" priority="3"/> <failoverdomainnode name="192.168.1.105" priority="4"/> </failoverdomain> <failoverdomain name="fd_mover2" nofailback="1" ordered="1" restricted="1"> <failoverdomainnode name="192.168.1.101" priority="4"/> <failoverdomainnode name="192.168.1.102" priority="5"/> <failoverdomainnode name="192.168.1.103" priority="1"/> <failoverdomainnode name="192.168.1.104" priority="2"/> <failoverdomainnode name="192.168.1.105" priority="3"/> </failoverdomain> <failoverdomain name="fd_vfs1" nofailback="1" ordered="1" restricted="1"> <failoverdomainnode name="192.168.1.101" priority="3"/> <failoverdomainnode name="192.168.1.102" priority="4"/> <failoverdomainnode name="192.168.1.103" priority="5"/> <failoverdomainnode name="192.168.1.104" priority="1"/> <failoverdomainnode name="192.168.1.105" priority="2"/> </failoverdomain> </failoverdomains> <resources> <ip address="192.168.2.40" monitor_link="1"/> <ip address="10.10.1.74" monitor_link="1"/> <ip address="192.168.2.41" monitor_link="1"/> <ip address="10.10.1.75" monitor_link="1"/> <ip address="192.168.2.42" monitor_link="1"/> <ip address="10.10.1.76" monitor_link="1"/> <ip address="192.168.2.43" monitor_link="1"/> <ip address="10.10.1.77" monitor_link="1"/> <script file="/ha/bin/ha-hpss-core1" name="ha-hpss-core1"/> <script file="/ha/bin/ha-hpss-mover1" name="ha-hpss-mover1"/> <script file="/ha/bin/ha-hpss-mover2" name="ha-hpss-mover2"/> <script file="/ha/bin/ha-hpss-vfs1" name="ha-hpss-vfs1"/> <clusterfs 
device="/dev/mapper/c1_hpss_vg-c1_db2Backup" force_unmount="1" fsid="34722" fstype="gfs2" mountpoint="/ha/c1_db2Backup" name="c1_db2Backup" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpss_vg-c1_hpssSource" force_unmount="1" fsid="41961" fstype="gfs2" mountpoint="/ha/c1_hpssSource" name="c1_hpssSource" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpss_vg-c1_varHpss" force_unmount="1" fsid="31374" fstype="gfs2" mountpoint="/ha/c1_varHpss" name="c1_varHpss" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpss_vg-c1_varHpssAdmCor" force_unmount="1" fsid="46145" fstype="gfs2" mountpoint="/ha/c1_varHpssAdmCor" name="c1_varHpssAdmCor" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpssdb2v95_vg-c1_optIbmDb2V95" force_unmount="1" fsid="38858" fstype="gfs2" mountpoint="/ha/c1_optIbmDb2V95" name="c1_optIbmDb2V95" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpssdb_vg-c1_hpssdbUserSp1" force_unmount="1" fsid="17090" fstype="gfs2" mountpoint="/ha/c1_hpssdbUserSp1" name="c1_hpssdbUserSp1" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpssdb_vg-c1_varHpssHpssdb" force_unmount="1" fsid="55384" fstype="gfs2" mountpoint="/ha/c1_varHpssHpssdb" name="c1_varHpssHpssdb" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpssdblog_vg-c1_db2LogCfg" force_unmount="1" fsid="7401" fstype="gfs2" mountpoint="/ha/c1_db2LogCfg" name="c1_db2LogCfg" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpssdblog_vg-c1_db2LogSubs1" force_unmount="1" fsid="65529" fstype="gfs2" mountpoint="/ha/c1_db2LogSubs1" name="c1_db2LogSubs1" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpssdblogm_vg-c1_db2LogMCfg" force_unmount="1" fsid="30927" fstype="gfs2" mountpoint="/ha/c1_db2LogMCfg" name="c1_db2LogMCfg" self_fence="1"/> <clusterfs device="/dev/mapper/c1_hpssdblogm_vg-c1_db2LogMSubs1" force_unmount="1" fsid="47193" fstype="gfs2" mountpoint="/ha/c1_db2LogMSubs1" name="c1_db2LogMSubs1" self_fence="1"/> <clusterfs device="/dev/mapper/m1_hpss_vg-m1_varHpss" force_unmount="1" 
fsid="19005" fstype="gfs2" mountpoint="/ha/m1_varHpss" name="m1_varHpss" self_fence="1"/> <clusterfs device="/dev/mapper/m1_hpss_vg-m1_varHpssAdmCor" force_unmount="1" fsid="17130" fstype="gfs2" mountpoint="/ha/m1_varHpssAdmCor" name="m1_varHpssAdmCor" self_fence="1"/> <clusterfs device="/dev/mapper/m2_hpss_vg-m2_hpssSource" force_unmount="1" fsid="32169" fstype="gfs2" mountpoint="/ha/m2_hpssSource" name="m2_hpssSource" self_fence="1"/> <clusterfs device="/dev/mapper/m2_hpss_vg-m2_varHpss" force_unmount="1" fsid="30456" fstype="gfs2" mountpoint="/ha/m2_varHpss" name="m2_varHpss" self_fence="1"/> <clusterfs device="/dev/mapper/m2_hpss_vg-m2_varHpssAdmCor" force_unmount="1" fsid="10387" fstype="gfs2" mountpoint="/ha/m2_varHpssAdmCor" name="m2_varHpssAdmCor" self_fence="1"/> <clusterfs device="/dev/mapper/v1_hpss_vg-v1_hpssSource" force_unmount="1" fsid="46624" fstype="gfs2" mountpoint="/ha/v1_hpssSource" name="v1_hpssSource" self_fence="1"/> <clusterfs device="/dev/mapper/v1_hpss_vg-v1_varHpss" force_unmount="1" fsid="46980" fstype="gfs2" mountpoint="/ha/v1_varHpss" name="v1_varHpss" self_fence="1"/> <clusterfs device="/dev/mapper/v1_hpss_vg-v1_varHpssAdmCor" force_unmount="1" fsid="22473" fstype="gfs2" mountpoint="/ha/v1_varHpssAdmCor" name="v1_varHpssAdmCor" self_fence="1"/> <clusterfs device="/dev/mapper/m1_hpss_vg-m1_hpssSource" force_unmount="1" fsid="44889" fstype="gfs2" mountpoint="/ha/m1_hpssSource" name="m1_hpssSource" self_fence="1"/> </resources> <service autostart="1" domain="fd_core1" exclusive="1" name="core1" recovery="relocate"> <ip ref="192.168.2.40"/> <ip ref="10.10.1.74"/> <script ref="ha-hpss-core1"/> <clusterfs fstype="gfs" ref="c1_db2Backup"/> <clusterfs fstype="gfs" ref="c1_hpssSource"/> <clusterfs fstype="gfs" ref="c1_varHpss"/> <clusterfs fstype="gfs" ref="c1_varHpssAdmCor"/> <clusterfs fstype="gfs" ref="c1_optIbmDb2V95"/> <clusterfs fstype="gfs" ref="c1_hpssdbUserSp1"/> <clusterfs fstype="gfs" ref="c1_varHpssHpssdb"/> <clusterfs fstype="gfs" 
ref="c1_db2LogCfg"/> <clusterfs fstype="gfs" ref="c1_db2LogSubs1"/> <clusterfs fstype="gfs" ref="c1_db2LogMCfg"/> <clusterfs fstype="gfs" ref="c1_db2LogMSubs1"/> </service> <service autostart="1" domain="fd_mover1" exclusive="1" name="mover1" recovery="relocate"> <ip ref="192.168.2.41"/> <ip ref="10.10.1.75"/> <script ref="ha-hpss-mover1"/> <clusterfs fstype="gfs" ref="m1_hpssSource"/> <clusterfs fstype="gfs" ref="m1_varHpss"/> <clusterfs fstype="gfs" ref="m1_varHpssAdmCor"/> </service> <service autostart="1" domain="fd_mover2" exclusive="1" name="mover2" recovery="relocate"> <ip ref="192.168.2.42"/> <ip ref="10.10.1.76"/> <script ref="ha-hpss-mover2"/> <clusterfs fstype="gfs" ref="m2_hpssSource"/> <clusterfs fstype="gfs" ref="m2_varHpss"/> <clusterfs fstype="gfs" ref="m2_varHpssAdmCor"/> </service> <service autostart="1" domain="fd_vfs1" exclusive="1" name="vfs1" recovery="relocate"> <ip ref="192.168.2.43"/> <ip ref="10.10.1.77"/> <script ref="ha-hpss-vfs1"/> <clusterfs fstype="gfs" ref="v1_hpssSource"/> <clusterfs fstype="gfs" ref="v1_varHpss"/> <clusterfs fstype="gfs" ref="v1_varHpssAdmCor"/> </service> </rm> </cluster> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Finally, are you by chance using the HA-LVM stuff to manage disks across nodes or are you using GFS? </blockquote><div> The filesystems are all GFS2. </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> As I said above, and you clearly agree, something is not right and the more information you can share, the better. Occam was right for the most part. </blockquote><div> For the most part, yes. Thanks man. 
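One detail in the cluster.conf quoted earlier in this message can be checked mechanically: every clusterfs resource is defined with fstype="gfs2", while the service entries that reference those resources carry fstype="gfs". Nothing in this thread establishes whether rgmanager acts on an attribute set on a ref, so this is not offered as the cause of the relocation problem, only as a consistency check. The sketch below is a minimal example under that assumption; the function name fstype_mismatches and the trimmed SAMPLE_CONF (one resource and one service taken from the config above) are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the cluster.conf above: one resource, one service reference.
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster config_version="102" name="hpss7">
  <rm>
    <resources>
      <clusterfs device="/dev/mapper/m1_hpss_vg-m1_varHpss" force_unmount="1"
                 fsid="19005" fstype="gfs2" mountpoint="/ha/m1_varHpss"
                 name="m1_varHpss" self_fence="1"/>
    </resources>
    <service autostart="1" domain="fd_mover1" exclusive="1" name="mover1"
             recovery="relocate">
      <clusterfs fstype="gfs" ref="m1_varHpss"/>
    </service>
  </rm>
</cluster>
"""


def fstype_mismatches(conf_text):
    """List (service, resource, defined_fstype, ref_fstype) where the two differ."""
    root = ET.fromstring(conf_text)
    # fstype as declared on each shared clusterfs resource definition
    defined = {
        fs.get("name"): fs.get("fstype")
        for fs in root.findall("./rm/resources/clusterfs")
    }
    mismatches = []
    for service in root.findall("./rm/service"):
        for ref in service.iter("clusterfs"):
            name = ref.get("ref")
            override = ref.get("fstype")
            if name in defined and override and override != defined[name]:
                mismatches.append((service.get("name"), name, defined[name], override))
    return mismatches


if __name__ == "__main__":
    for service, resource, declared, referenced in fstype_mismatches(SAMPLE_CONF):
        print(f"service {service}: {resource} defined as {declared}, referenced as {referenced}")
```

To run it against the real file, read /etc/cluster/cluster.conf into a string and pass that to fstype_mismatches in place of SAMPLE_CONF.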
</div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> Corey <div><div></div><div class="h5"> On Mon, May 17, 2010 at 10:00 PM, Dusty <<a href="mailto:dhoffutt@gmail.com">dhoffutt@gmail.com</a>> wrote: > Addendum, and a symptom of this issue is that because node1 does not reboot > and rejoin the cluster, the service it was running never relocates. > > I left it in this state over the weekend. Came back Monday morning, the > service had still not relocated. > > It is not fencing. It is not fencing. Fencing works. Fencing works. > > On Mon, May 17, 2010 at 3:58 PM, Dusty <<a href="mailto:dhoffutt@gmail.com">dhoffutt@gmail.com</a>> wrote: >> >> I appreciate the help - but I'm saying the same thing for like the fourth >> or fifth time now. >> >> Fencing is working well. All cluster nodes are able to communicate to the >> fence device (APC PDU). >> >> Another example. The cluster is quorate with five nodes and four running >> services. Am pulling the plug on node1. I need service1, that happens to be >> running on node1 right now, to relocate ASAP - not after node1 has rebooted. >> Node3 is a member of the cluster and is available to accept a service >> relocation. >> >> I have cluster ssh logged into all nodes and am tailing their >> /var/log/messages file. >> >> 14:57 "Pulling the plug" on node1 now (really just turning off the >> electrical port on the APC). >> About five seconds later.... >> 14:58:00 node2 fenced[7584] fencing node "192.168.1.1" >> 14:58:05 node2 fenced[7584] fence "192.168.1.1" success >> --- Right now the service SHOULD be being relocated - but it doesn't! --- >> -- a few minutes later, node1 has rebooted after being successfully fenced >> via node2 operating the APC PDU. >> 15:03:43 node3 clurgmgrd[4813]: <notice> Recovering failed service >> service:service1 >> >> Second test now - doing the exact same thing, but this time really pulling >> the plug on node1. 
>> >> Everything happens the same except node2 fencing node1 has no effect >> because I've simulated a complete node failure on node1. It is not going to >> boot. >> >> >> On Sat, May 15, 2010 at 1:25 PM, Corey Kovacs <<a href="mailto:corey.kovacs@gmail.com">corey.kovacs@gmail.com</a>> >> wrote: >>> >>> The reason I was pointing at the fencing config is that the service >>> will only relocate when fenced is able to confirm that the offending >>> node has been fenced. If this can't happen, then services will not >>> relocate since the cluster doesn't know the state of all the nodes. If >>> a node gets an anvil dropped on it, then it should stop responding >>> and the cluster should then try to invoke the fence on that node to >>> make sure that it is indeed dead, even if it only cycles the power >>> port for an already dead node. >>> >>> Given your description you should experience the same "problem" if you >>> simply turn the node off. Normally, when you turn the power off (not >>> pull the plug) then boot the node, the cluster either should have >>> already fenced the node, or it will fence as it's booting. Looks odd >>> but it's correct since the cluster has to get things to a known state. >>> >>> After the fence and before the node boots, services should start >>> migrating. All of this you probably know but it's worth saying anyway. >>> >>> Basically, if your services only migrate after the node boots up, then >>> I believe fencing is not working properly. The services should migrate >>> while the node is booting or even before. >>> >>> So it appears to me that when you power the APC yourself, or pull the >>> plug on the node, you have the same condition. >>> >>> The way to really test fencing is to watch the logs on a node and >>> issue >>> >>> cman_tool kill <cluster member> and tell cman to fence the node. >>> >>> One thought, can all your cluster nodes talk to the APC at all times? 
>>> >>> >>> -Corey >>> >>> >>> >>> >>> On Sat, May 15, 2010 at 5:50 PM, Dusty <<a href="mailto:dhoffutt@gmail.com">dhoffutt@gmail.com</a>> wrote: >>> > Fencing works great - no problems there. The APC PDU responds >>> > beautifully to >>> > any node's attempt to fence. >>> > >>> > The issue is this: >>> > >>> > The service only relocates after the fenced node reboots and rejoins >>> > the >>> > cluster. Then the service relocates to another node. This happens well >>> > and >>> > without fail. >>> > >>> > But what if the node that was fenced refuses to boot back up because, >>> > say an >>> > anvil fell out of the sky and smashed it, or its motherboard fried? >>> > >>> > This is what I am simulating by pulling the plug on a node that happens >>> > to >>> > be running a service. The service will not relocate until the failed >>> > node >>> > has rebooted. >>> > >>> > I don't want that. I want the service to relocate ASAP regardless of if >>> > the >>> > failed node reboots or not. >>> > >>> > Thank you so much for your consideration. 
>>> > >>> > -- >>> > Linux-cluster mailing list >>> > <a href="mailto:Linux-cluster@redhat.com">Linux-cluster@redhat.com</a> >>> > <a href="https://www.redhat.com/mailman/listinfo/linux-cluster" target="_blank">https://www.redhat.com/mailman/listinfo/linux-cluster</a> </div></div></blockquote></div>
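The timeline quoted in the thread shows the fence succeeding at 14:58:05 and rgmanager only starting recovery at 15:03:43. A minimal Python sketch for putting a number on that gap is below. The two log lines are taken from the quoted /var/log/messages excerpt (with the severity field left out); the helper names log_time and recovery_delay_seconds are hypothetical, and both timestamps are assumed to fall on the same day.

```python
from datetime import datetime

# Two lines from the /var/log/messages timeline quoted in the thread
# (severity field omitted from the clurgmgrd line).
FENCE_OK = '14:58:05 node2 fenced[7584] fence "192.168.1.1" success'
RECOVERY = "15:03:43 node3 clurgmgrd[4813]: Recovering failed service service:service1"


def log_time(line):
    """Parse the leading HH:MM:SS field of a log line (hypothetical helper)."""
    return datetime.strptime(line.split()[0], "%H:%M:%S")


def recovery_delay_seconds(fence_line, recovery_line):
    """Seconds from fence success to the start of service recovery.

    Assumes both lines were logged on the same day.
    """
    return int((log_time(recovery_line) - log_time(fence_line)).total_seconds())


delay = recovery_delay_seconds(FENCE_OK, RECOVERY)
print(f"service recovery began {delay} s after fence success")
```

Run against the tailed logs from each test, a gap measured in minutes rather than seconds would match the symptom described: recovery waits for the fenced node to rejoin instead of starting once the fence is confirmed.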