[Linux-cluster] two node cluster with IP tiebreaker failed.

Kein He kein.he at gmail.com
Thu Feb 26 02:17:44 UTC 2009


Unfortunately, you need a shared disk to run qdisk; it cannot work in
"diskless" mode right now.


> ext Brett Cave wrote:
>   
>> On Wed, Feb 25, 2009 at 11:45 AM, Mockey Chen <mockey.chen at nsn.com> wrote:
>>> ext Kein He wrote:
>>>> I think there is a problem, from "cman_tool status" shows:
>>>>
>>>> Nodes: 2
>>>> Expected votes: 3
>>>> Total votes: 2
>>>>
>>>>
>>>> According to your cluster.conf, if all nodes and the qdisk are online,
>>>> "Total votes" should be "3". Probably qdiskd is not running; you can
>>>> use "cman_tool nodes" to check whether the qdisk is working.
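The vote arithmetic behind this advice can be sketched as follows — a hedged illustration of the simple-majority rule (quorum = expected_votes // 2 + 1), which matches the "Quorum: 2" shown later in this thread:

```python
def quorum(expected_votes: int) -> int:
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1

EXPECTED = 3                      # two nodes (1 vote each) + qdisk (1 vote)
assert quorum(EXPECTED) == 2      # matches "Quorum: 2" in cman_tool status

# Without the qdisk vote, losing either node leaves 1 < 2: quorum dissolves.
assert not (1 >= quorum(EXPECTED))
# With qdiskd contributing its vote, the survivor keeps 2 >= 2: still quorate.
assert 2 >= quorum(EXPECTED)
print("quorum =", quorum(EXPECTED))
```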
>>>>
>>> Yes, here is "cman_tool nodes" output:
>>> Node  Sts   Inc   Joined               Name
>>>   1   M    112   2009-02-25 03:05:19  as-1.localdomain
>>>   2   M    104   2009-02-25 03:05:19  as-2.localdomain
>>>
>>> A question: how do I check whether qdiskd is running, and how do I start it?
>> [root at blade3 ~]# service qdiskd status
>> qdiskd (pid 2832) is running...
>> [root at blade3 ~]# pgrep qdisk -l
>> 2832 qdiskd
>> [root at blade3 ~]# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>    0   M      0   2009-02-19 16:11:55  /dev/sda5     ## This is qdisk.
>>    1   M   1524   2009-02-20 22:27:32  blade1
>>    2   M   1552   2009-02-24 04:39:24  blade2
>>    3   M   1500   2009-02-19 16:11:03  blade3
>>    4   M   1516   2009-02-19 16:11:22  blade4
>>
>> You can use "service qdiskd start" to start it, or run it with
>> /usr/sbin/qdiskd -Q if you don't have the init script. If you installed
>> from an RPM on a Red Hat-type distro, the script should be there.
>>
>> Regards,
>> brett
> I tried to run "service qdiskd start", but it failed:
> [root at as-2 ~]# service qdiskd start
> Starting the Quorum Disk Daemon:                           [FAILED]
> [root at as-2 ~]# tail /var/log/messages
> ...
> Feb 26 09:19:40 as-2 qdiskd[14707]: <crit> Unable to match label
> 'testing' to any device
> Feb 26 09:19:46 as-2 clurgmgrd[4032]: <notice> Reconfiguring
>
> Here is my qdisk configuration, which I copied from "man qdisk":
>         <quorumd interval="1" tko="10" votes="1" label="testing">
>                 <heuristic program="ping 10.56.150.1 -c1 -t1" score="1"
> interval="2" tko="3"/>
>         </quorumd>
>
> How do I map the label to a device? Note: I do not have any shared storage.
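For the record, when a shared LUN does exist, the label-to-device mapping is made by writing a qdisk header onto the partition with mkqdisk; qdiskd then scans block devices for that label. A sketch (the device path is illustrative):

```
# Illustrative only: needs root and a real shared block partition.
mkqdisk -c /dev/sdb1 -l testing   # initialize the partition with label "testing"
mkqdisk -L                        # list qdisk-labeled devices qdiskd can find
```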
>
>>> Thanks.
>>>     
>>>       
>>>> Mockey Chen wrote:
>>>>> ext Mockey Chen wrote:
>>>>>
>>>>>> ext Kein He wrote:
>>>>>>
>>>>>>> Hi Mockey,
>>>>>>>
>>>>>>> Could you please attach the output from " cman_tool status " and "
>>>>>>> cman_tool nodes -f" ?
>>>>>>>
>>>>>>>
>>>>>> Thanks for your response.
>>>>>>
>>>>>> I tried to run cman_tool status on as-2, but it hung with no output,
>>>>>> and even Ctrl+C had no effect.
>>>>>>
>>>>> I manually rebooted as-1, and the problem was solved.
>>>>>
>>>>> Here is the output of cman_tool:
>>>>>
>>>>> [root at as-1 ~]# cman_tool status
>>>>> Version: 6.1.0
>>>>> Config Version: 19
>>>>> Cluster Name: azerothcluster
>>>>> Cluster Id: 20148
>>>>> Cluster Member: Yes
>>>>> Cluster Generation: 76
>>>>> Membership state: Cluster-Member
>>>>> Nodes: 2
>>>>> Expected votes: 3
>>>>> Total votes: 2
>>>>> Quorum: 2
>>>>> Active subsystems: 8
>>>>> Flags: Dirty
>>>>> Ports Bound: 0 177
>>>>> Node name: as-1.localdomain
>>>>> Node ID: 1
>>>>> Multicast addresses: 239.192.78.3
>>>>> Node addresses: 10.56.150.3
>>>>> [root at as-1 ~]# cman_tool status -f
>>>>> Version: 6.1.0
>>>>> Config Version: 19
>>>>> Cluster Name: azerothcluster
>>>>> Cluster Id: 20148
>>>>> Cluster Member: Yes
>>>>> Cluster Generation: 76
>>>>> Membership state: Cluster-Member
>>>>> Nodes: 2
>>>>> Expected votes: 3
>>>>> Total votes: 2
>>>>> Quorum: 2
>>>>> Active subsystems: 8
>>>>> Flags: Dirty
>>>>> Ports Bound: 0 177
>>>>> Node name: as-1.localdomain
>>>>> Node ID: 1
>>>>> Multicast addresses: 239.192.78.3
>>>>> Node addresses: 10.56.150.3
>>>>>
>>>>>
>>>>> It seems the cluster cannot fence one of the nodes. How can I solve this?
>>>>>
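One way to narrow down a fencing failure like this is to drive the fence agent by hand from the surviving node, using the parameters from the <fencedevice> entries in this thread's cluster.conf. A hedged sketch (standard fence_ilo options; credentials as configured below):

```
# Manual fence-agent test (run as root on the surviving node); the
# address/login/password come from the fence_ilo <fencedevice> entries.
fence_ilo -a 10.56.154.18 -l power -p pass -o status
# If "status" works but cluster fencing still fails, compare these values
# against the <fencedevice> attributes in /etc/cluster/cluster.conf.
```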
>>>>>
>>>>>> I opened a new window and could ssh to as-2, but after logging in I
>>>>>> could not do anything; even a simple 'ls' command hung.
>>>>>>
>>>>>> It seems the system stays alive but does not provide any service.
>>>>>> Really bad.
>>>>>>
>>>>>> Is there any way to debug this issue?
>>>>>>
>>>>>>> Mockey Chen wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a two-node cluster. To avoid split-brain, I use iLO as the
>>>>>>>> fence device with an IP tiebreaker. Here is my /etc/cluster/cluster.conf:
>>>>>>>> <?xml version="1.0"?>
>>>>>>>> <cluster alias="azerothcluster" config_version="19"
>>>>>>>> name="azerothcluster">
>>>>>>>>     <cman expected_votes="3" two_node="0"/>
>>>>>>>>     <clusternodes>
>>>>>>>>         <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>>>>>>>             <fence>
>>>>>>>>                 <method name="1">
>>>>>>>>                     <device name="ilo1"/>
>>>>>>>>                 </method>
>>>>>>>>             </fence>
>>>>>>>>         </clusternode>
>>>>>>>>         <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>>>>>>>             <fence>
>>>>>>>>                 <method name="1">
>>>>>>>>                     <device name="ilo2"/>
>>>>>>>>                 </method>
>>>>>>>>             </fence>
>>>>>>>>         </clusternode>
>>>>>>>>     </clusternodes>
>>>>>>>>         <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>>>>>>>                 <heuristic program="ping 10.56.150.1 -c1 -t1"
>>>>>>>> score="1"
>>>>>>>> interval="2" tko="3"/>
>>>>>>>>         </quorumd>
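A side note on the heuristic above: per the qdisk(5) man page, when min_score is omitted it defaults to floor((n+1)/2), where n is the sum of all heuristic scores. A small sketch of that arithmetic:

```python
import math

def default_min_score(total_heuristic_score: int) -> int:
    """qdisk(5) default: floor((n+1)/2) over the summed heuristic scores."""
    return math.floor((total_heuristic_score + 1) / 2)

# This cluster.conf declares a single heuristic with score="1", so the
# minimum passing score is 1: if the ping fails, the node loses its
# fitness for the qdisk vote entirely.
assert default_min_score(1) == 1
print(default_min_score(1))
```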
>>>>>>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>>>>>>     <fencedevices>
>>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18"
>>>>>>>> login="power" name="ilo1" passwd="pass"/>
>>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19"
>>>>>>>> login="power" name="ilo2" passwd="pass"/>
>>>>>>>>     </fencedevices>
>>>>>>>> ...
>>>>>>>> ...
>>>>>>>>
>>>>>>>> To test the case where one node loses heartbeat, I disabled the
>>>>>>>> Ethernet card (eth0) on as-1. I expected as-2 to take over as-1's
>>>>>>>> services and as-1 to reboot. What actually happened is that as-1
>>>>>>>> lost its connection to as-2; as-2 detected this and tried to
>>>>>>>> re-form the cluster, but failed. Here is the syslog from as-2:
>>>>>>>>
>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
>>>>>>>> OPERATIONAL state.
>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
>>>>>>>> recv buffer size (288000 bytes).
>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
>>>>>>>> send buffer size (262142 bytes).
>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>>> from 2.
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>>> from 0.
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>>>>>>>> because I am the rep.
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
>>>>>>>> seq received 1f4
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence
>>>>>>>> id for
>>>>>>>> ring 2c
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>>>>>>>> 10.56.150.4:
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
>>>>>>>> 10.56.150.3
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered
>>>>>>>> 1f4
>>>>>>>> received flag 1
>>>>>>>>
>>>>>>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>>>>>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>>>
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to
>>>>>>>> originate any messages in recovery.
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3)
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>>>>>>>> activity
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
>>>>>>>> primary component and will provide service.
>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate.  Refusing
>>>>>>>> connection.
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL
>>>>>>>> state.
>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>>>>>>>> Connection refused
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
>>>>>>>> 10.56.150.4
>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from
>>>>>>>> node 2
>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>>> evil.
>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>>>>> request descriptor
>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>>> evil.
>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
>>>>>>>> request descriptor
>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>>>>>> evil.
>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
>>>>>>>> Invalid request descriptor
>>>>>>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address
>>>>>>>> record for
>>>>>>>> 10.56.150.144 on eth0.
>>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
>>>>>>>> Address already in use
>>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I also found there are some errors in as-1's syslog
>>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
>>>>>>>> status
>>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
>>>>>>>> detected
>>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>>>>>>>> ...
>>>>>>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>>> infrastructure after 30 seconds.
>>>>>>>> ...
>>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>>> infrastructure after 60 seconds.
>>>>>>>> ...
>>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>>> infrastructure after 90 seconds.
>>>>>>>>
>>>>>>>>
>>>>>>>> any comment is appreciated!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Linux-cluster mailing list
>>>>>>>> Linux-cluster at redhat.com
>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>>>>
>



