[Linux-cluster] two node cluster with IP tiebreaker failed.

Wed Feb 25 09:45:59 UTC 2009

ext Kein He wrote:
> I think there is a problem. "cman_tool status" shows:
>
> Nodes: 2
> Expected votes: 3
> Total votes: 2
>
>
> According to your cluster.conf, if all nodes and the qdisk are online,
> "Total votes" should be 3. Probably qdiskd is not running; you can use
> "cman_tool nodes" to check whether qdisk is working.
>
Yes, here is the "cman_tool nodes" output:
Node  Sts   Inc   Joined               Name
   1   M    112   2009-02-25 03:05:19  as-1.localdomain
   2   M    104   2009-02-25 03:05:19  as-2.localdomain

My question is: how can I check whether qdisk is running, and how do I start it?

Thanks.
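
P.S. To make the vote arithmetic from the quoted status concrete, here is a small Python sketch. It assumes cman derives the quorum threshold as floor(expected_votes / 2) + 1, which agrees with the "Quorum: 2" shown for "Expected votes: 3" but is my reading rather than something confirmed in this thread. The function and constant names are made up for illustration.

```python
def quorum_threshold(expected_votes):
    # Assumption: threshold is floor(expected_votes / 2) + 1.
    return expected_votes // 2 + 1

def is_quorate(total_votes, expected_votes):
    # The cluster keeps quorum while the votes present reach the threshold.
    return total_votes >= quorum_threshold(expected_votes)

EXPECTED = 3      # <cman expected_votes="3"/> in cluster.conf
NODE_VOTES = 1    # votes="1" on each clusternode
QDISK_VOTES = 1   # votes="1" on quorumd

print(quorum_threshold(EXPECTED))                        # 2
# Both nodes up, qdisk not contributing: 2 votes.
print(is_quorate(2 * NODE_VOTES, EXPECTED))              # True
# One node lost, qdisk not contributing: 1 vote.
print(is_quorate(NODE_VOTES, EXPECTED))                  # False
# One node lost, qdisk contributing its vote: 2 votes.
print(is_quorate(NODE_VOTES + QDISK_VOTES, EXPECTED))    # True
```

Under that assumption, a surviving node without the qdisk vote drops to 1 of 3 votes, which would match the "quorum lost, blocking activity" message in the as-2 log.
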
>
>
>
> Mockey Chen wrote:
>> ext Mockey Chen wrote:
>>  
>>> ext Kein He wrote:
>>>      
>>>> Hi Mockey,
>>>>
>>>> Could you please attach the output from "cman_tool status" and
>>>> "cman_tool nodes -f"?
>>>>
>>>>           
>>> Thanks for your response.
>>>
>>> I tried to run cman_tool status on as-2, but it hung without any output,
>>> and even Ctrl+C had no effect.
>>>       
>> I manually rebooted as-1, and that solved the problem.
>>
>> Here is the output of cman_tool:
>>
>> [root@as-1 ~]# cman_tool status
>> Version: 6.1.0
>> Config Version: 19
>> Cluster Name: azerothcluster
>> Cluster Id: 20148
>> Cluster Member: Yes
>> Cluster Generation: 76
>> Membership state: Cluster-Member
>> Nodes: 2
>> Expected votes: 3
>> Total votes: 2
>> Quorum: 2
>> Active subsystems: 8
>> Flags: Dirty
>> Ports Bound: 0 177
>> Node name: as-1.localdomain
>> Node ID: 1
>> Multicast addresses: 239.192.78.3
>> Node addresses: 10.56.150.3
>> [root@as-1 ~]# cman_tool status -f
>> Version: 6.1.0
>> Config Version: 19
>> Cluster Name: azerothcluster
>> Cluster Id: 20148
>> Cluster Member: Yes
>> Cluster Generation: 76
>> Membership state: Cluster-Member
>> Nodes: 2
>> Expected votes: 3
>> Total votes: 2
>> Quorum: 2
>> Active subsystems: 8
>> Flags: Dirty
>> Ports Bound: 0 177
>> Node name: as-1.localdomain
>> Node ID: 1
>> Multicast addresses: 239.192.78.3
>> Node addresses: 10.56.150.3
>>
>>
>> It seems the cluster cannot fence one of the nodes. How can I solve this?
>>
>>  
>>> I opened a new window and could ssh to as-2, but after logging in I could
>>> not do anything; even a simple 'ls' command hung.
>>>
>>> It seems the system stays alive but does not provide any service. Really
>>> bad.
>>>
>>> Is there any way to debug this issue?
>>>      
>>>> Mockey Chen wrote:
>>>>          
>>>>> Hi,
>>>>>
>>>>> I have a two-node cluster. To avoid split-brain, I use iLO as the fence
>>>>> device and an IP tiebreaker. Here is my /etc/cluster/cluster.conf:
>>>>> <?xml version="1.0"?>
>>>>> <cluster alias="azerothcluster" config_version="19" name="azerothcluster">
>>>>>     <cman expected_votes="3" two_node="0"/>
>>>>>     <clusternodes>
>>>>>         <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>>>>             <fence>
>>>>>                 <method name="1">
>>>>>                     <device name="ilo1"/>
>>>>>                 </method>
>>>>>             </fence>
>>>>>         </clusternode>
>>>>>         <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>>>>             <fence>
>>>>>                 <method name="1">
>>>>>                     <device name="ilo2"/>
>>>>>                 </method>
>>>>>             </fence>
>>>>>         </clusternode>
>>>>>     </clusternodes>
>>>>>     <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>>>>         <heuristic program="ping 10.56.150.1 -c1 -t1" score="1" interval="2" tko="3"/>
>>>>>     </quorumd>
>>>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>>>     <fencedevices>
>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18" login="power" name="ilo1" passwd="pass"/>
>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19" login="power" name="ilo2" passwd="pass"/>
>>>>>     </fencedevices>
>>>>> ...
>>>>> ...
>>>>>
>>>>> To test the case where one node loses its heartbeat, I disabled the
>>>>> Ethernet card (eth0) on as-1. I expected as-2 to take over the services
>>>>> from as-1 and as-1 to reboot. What actually happened is that as-1 lost
>>>>> its connection to as-2; as-2 detected this and tried to reconstruct the
>>>>> cluster, but failed. Here is the syslog from as-2:
>>>>>
>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
>>>>> OPERATIONAL state.
>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
>>>>> recv buffer size (288000 bytes).
>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
>>>>> send buffer size (262142 bytes).
>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>> from 2.
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>> from 0.
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>>>>> because I am the rep.
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
>>>>> seq received 1f4
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence
>>>>> id for
>>>>> ring 2c
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>>>>> 10.56.150.4:
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
>>>>> 10.56.150.3
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered
>>>>> 1f4
>>>>> received flag 1
>>>>>
>>>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to originate
>>>>> any messages in recovery.
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3)
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>>>>> activity
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
>>>>> primary component and will provide service.
>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate.  Refusing
>>>>> connection.
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL
>>>>> state.
>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>>>>> Connection refused
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
>>>>> 10.56.150.4
>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from
>>>>> node 2
>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
>>>>> evil.
>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>> request descriptor
>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>> evil.
>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>> request descriptor
>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>> evil.
>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
>>>>> Invalid request descriptor
>>>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address
>>>>> record for
>>>>> 10.56.150.144 on eth0.
>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
>>>>> Address already in use
>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I also found some errors in as-1's syslog:
>>>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
>>>>> status
>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
>>>>> detected
>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>>>>> ...
>>>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>>>>> infrastructure after 30 seconds.
>>>>> ...
>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>> infrastructure after 60 seconds.
>>>>> ...
>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>> infrastructure after 90 seconds.
>>>>>
>>>>>
>>>>> Any comments are appreciated!
>>>>>
>>>>> -- 
>>>>> Linux-cluster mailing list
>>>>> Linux-cluster at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>                 