[Linux-cluster] Re: Fencing test

Mon Jan 5 18:11:24 UTC 2009

hi,

On Mon, Jan 5, 2009 at 8:23 AM, Rajagopal Swaminathan
<raju.rajsand at gmail.com> wrote:
> Greetings,
>
> On Sat, Jan 3, 2009 at 4:18 AM, Paras pradhan <pradhanparas at gmail.com> wrote:
>>
>> Here I am using 4 nodes.
>>
>> Node 1) That runs luci
>> Node 2) This is my iSCSI shared storage, where my virtual machine(s) reside
>> Node 3) First node in my two node cluster
>> Node 4) Second node in my two node cluster
>>
>> All of them are connected simply to an unmanaged 16 port switch.
>
> Luci does not require a separate node to run. It can run on one of the
> member nodes (node 3 | 4).

OK.

>
> what does clustat say?

Here is my clustat output:

-----------

[root@ha1lx ~]# clustat
Cluster Status for ipmicluster @ Mon Jan  5 12:00:10 2009
Member Status: Quorate

 Member Name        ID   Status
 ------ ----        ---- ------
 10.42.21.29           1 Online, rgmanager
 10.42.21.27           2 Online, Local, rgmanager

 Service Name       Owner (Last)       State
 ------- ----       ----- ------       -----
 vm:linux64         10.42.21.27        started
[root@ha1lx ~]#
------------------------

10.42.21.27 is node3 and 10.42.21.29 is node4

>
> Can you post your cluster.conf here?

Here is my cluster.conf

--
[root@ha1lx cluster]# more cluster.conf
<?xml version="1.0"?>
<cluster alias="ipmicluster" config_version="8" name="ipmicluster">
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="10.42.21.29" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="fence2"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="10.42.21.27" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="fence1"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="10.42.21.28" login="admin" name="fence1" passwd="admin"/>
		<fencedevice agent="fence_ipmilan" ipaddr="10.42.21.30" login="admin" name="fence2" passwd="admin"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="myfd" nofailback="0" ordered="1" restricted="0">
				<failoverdomainnode name="10.42.21.29" priority="2"/>
				<failoverdomainnode name="10.42.21.27" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources/>
		<vm autostart="1" domain="myfd" exclusive="0" migrate="live" name="linux64" path="/guest_roots" recovery="restart"/>
	</rm>
</cluster>
------

Here:

10.42.21.28 is the IPMI interface on node3
10.42.21.30 is the IPMI interface on node4
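
To double-check that each node is fenced through its own IPMI interface and not the other node's, the mapping can be read straight out of cluster.conf. The following is only a rough sketch: it works on a trimmed copy of the config pasted above, and the function name fence_map is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above (only the fencing-related parts).
CLUSTER_CONF = """<cluster alias="ipmicluster" config_version="8" name="ipmicluster">
  <clusternodes>
    <clusternode name="10.42.21.29" nodeid="1" votes="1">
      <fence><method name="1"><device name="fence2"/></method></fence>
    </clusternode>
    <clusternode name="10.42.21.27" nodeid="2" votes="1">
      <fence><method name="1"><device name="fence1"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.28" login="admin" name="fence1" passwd="admin"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.30" login="admin" name="fence2" passwd="admin"/>
  </fencedevices>
</cluster>"""


def fence_map(conf_xml):
    """Return {node name: [(fence device name, device ipaddr), ...]}."""
    root = ET.fromstring(conf_xml)
    # Look-up table: fence device name -> IP address of the device.
    devices = {d.get("name"): d.get("ipaddr") for d in root.iter("fencedevice")}
    result = {}
    for node in root.iter("clusternode"):
        names = [d.get("name") for d in node.iter("device")]
        result[node.get("name")] = [(n, devices.get(n)) for n in names]
    return result


for node, devs in fence_map(CLUSTER_CONF).items():
    for dev_name, ipaddr in devs:
        print(f"node {node} is fenced via {dev_name} at {ipaddr}")
```

With the config above this pairs 10.42.21.27 (node3) with 10.42.21.28 and 10.42.21.29 (node4) with 10.42.21.30, which matches the IPMI addresses listed.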

>
> When you pull out the network cable *and* plug it back in on, say, node 3,
> what messages (if any) appear in /var/log/messages on node 4?
> (Sorry for the repetition, but the messages are necessary to make any
> sense of the situation.)
>

OK, here is the log on node 4 after I disconnect the network cable on node 3.

-----------

Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] The token was lost in the OPERATIONAL state.
Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] entering GATHER state from 2.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering GATHER state from 0.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Creating commit token because I am the rep.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Saving state aru 76 high seq received 76
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Storing new sequence id for ring ac
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.29:
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] previous ring seq 168 rep 10.42.21.27
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] aru 76 high delivered 76 received flag 1
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Did not need to originate any messages in recovery.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Sending initial ORF token
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] 	r(0) ip(10.42.21.29)
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] 	r(0) ip(10.42.21.27)
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:05:28 ha2lx kernel: dlm: closing connection to node 2
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:05:28 ha2lx fenced[5004]: 10.42.21.27 not a cluster member after 0 sec post_fail_delay
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] 	r(0) ip(10.42.21.29)
Jan  5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Trying to acquire journal lock...
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:05:28 ha2lx openais[4988]: [SYNC ] This node is within the primary component and will provide service.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] got nodejoin message 10.42.21.29
Jan  5 12:05:28 ha2lx openais[4988]: [CPG  ] got joinlist message from node 1
Jan  5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Looking at journal...
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Acquiring the transaction lock...
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Replaying journal...
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Replayed 0 of 0 blocks
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Found 0 revoke tags
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Journal replayed in 1s
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Done
------------------

Now when I plug the cable back into node 3, node 4 reboots. Here is the
log I quickly grabbed on node 4:

--
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering GATHER state from 11.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Saving state aru 1d high seq received 1d
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Storing new sequence id for ring b0
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.27:
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep 10.42.21.27
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 16 high delivered 16 received flag 1
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] position [1] member 10.42.21.29:
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep 10.42.21.29
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 1d high delivered 1d received flag 1
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Did not need to originate any messages in recovery.
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] 	r(0) ip(10.42.21.29)
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] 	r(0) ip(10.42.21.27)
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] 	r(0) ip(10.42.21.29)
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] 	r(0) ip(10.42.21.27)
Jan  5 12:07:12 ha2lx openais[4988]: [SYNC ] This node is within the primary component and will provide service.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
Jan  5 12:07:12 ha2lx openais[4988]: [MAIN ] Killing node 10.42.21.27 because it has rejoined the cluster with existing state
Jan  5 12:07:12 ha2lx openais[4988]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
Jan  5 12:07:12 ha2lx gfs_controld[5016]: groupd_dispatch error -1 errno 11
Jan  5 12:07:12 ha2lx gfs_controld[5016]: groupd connection died
Jan  5 12:07:12 ha2lx gfs_controld[5016]: cluster is down, exiting
Jan  5 12:07:12 ha2lx dlm_controld[5010]: cluster is down, exiting
Jan  5 12:07:12 ha2lx kernel: dlm: closing connection to node 1
Jan  5 12:07:12 ha2lx fenced[5004]: cluster is down, exiting
-------

Also here is the log of node3:

--
[root@ha1lx ~]# tail -f /var/log/messages
Jan  5 12:07:24 ha1lx openais[26029]: [TOTEM] entering OPERATIONAL state.
Jan  5 12:07:24 ha1lx openais[26029]: [CLM  ] got nodejoin message 10.42.21.27
Jan  5 12:07:24 ha1lx openais[26029]: [CLM  ] got nodejoin message 10.42.21.27
Jan  5 12:07:24 ha1lx openais[26029]: [CPG  ] got joinlist message from node 2
Jan  5 12:07:27 ha1lx ccsd[26019]: Attempt to close an unopened CCS descriptor (4520670).
Jan  5 12:07:27 ha1lx ccsd[26019]: Error while processing disconnect: Invalid request descriptor
Jan  5 12:07:27 ha1lx fenced[26045]: fence "10.42.21.29" success
Jan  5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Trying to acquire journal lock...
Jan  5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Looking at journal...
Jan  5 12:07:28 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Done
----------------
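
Since the interesting lines are buried among the TOTEM/CLM chatter, a small filter makes the order of the fencing events on the two nodes easier to see. This is only a sketch: the sample lines are copied from the logs above, while the keyword list and the function name key_events are my own choice, not anything defined by the cluster tools.

```python
import re

# A few lines copied from the node 4 (ha2lx) and node 3 (ha1lx) logs above.
SAMPLE = [
    'Jan  5 12:05:28 ha2lx fenced[5004]: 10.42.21.27 not a cluster member after 0 sec post_fail_delay',
    'Jan  5 12:05:28 ha2lx kernel: dlm: closing connection to node 2',
    'Jan  5 12:07:12 ha2lx openais[4988]: [MAIN ] Killing node 10.42.21.27 because it has rejoined the cluster with existing state',
    'Jan  5 12:07:12 ha2lx openais[4988]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart',
    'Jan  5 12:07:24 ha1lx openais[26029]: [TOTEM] entering OPERATIONAL state.',
    'Jan  5 12:07:27 ha1lx fenced[26045]: fence "10.42.21.29" success',
]

# Syslog layout: timestamp, host, process with optional [pid], then the message.
LOG_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<proc>[\w-]+)(?:\[\d+\])?: (?P<msg>.*)$"
)

# Hypothetical keyword list, chosen by hand from the messages seen above.
KEYWORDS = ("fence", "Killing node", "killed by", "not a cluster member")


def key_events(lines):
    """Return (timestamp, host, process, message) for fencing-related lines."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and any(k in m.group("msg") for k in KEYWORDS):
            events.append((m.group("ts"), m.group("host"), m.group("proc"), m.group("msg")))
    return events


for ts, host, proc, msg in key_events(SAMPLE):
    print(f"{ts} {host} {proc}: {msg}")
```

To run it against the real files, read /var/log/messages from both nodes with each entry on a single line; the pattern does not handle the wrapped lines that a mail client produces.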

> HTH
>
> With warm regards
>
> Rajagopal
>

Thanks a lot

Paras.