[Linux-cluster] two node cluster with IP tiebreaker failed.

Brett Cave brettcave at gmail.com
Wed Feb 25 11:19:48 UTC 2009


On Wed, Feb 25, 2009 at 11:45 AM, Mockey Chen <mockey.chen at nsn.com> wrote:
> ext Kein He wrote:
>> I think there is a problem; your "cman_tool status" output shows:
>>
>> Nodes: 2
>> Expected votes: 3
>> Total votes: 2
>>
>>
>> According to your cluster.conf, if all nodes and the qdisk are online,
>> "Total votes" must be "3". Probably qdiskd is not running; you can
>> use "cman_tool nodes" to check whether the qdisk is working.
>>
> Yes, here is "cman_tool nodes" output:
> Node  Sts   Inc   Joined               Name
>   1   M    112   2009-02-25 03:05:19  as-1.localdomain
>   2   M    104   2009-02-25 03:05:19  as-2.localdomain
>
> A question: how do I check whether qdisk is running, and how do I run it?

[root@blade3 ~]# service qdiskd status
qdiskd (pid 2832) is running...
[root@blade3 ~]# pgrep qdisk -l
2832 qdiskd
[root@blade3 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2009-02-19 16:11:55  /dev/sda5     ## This is qdisk.
   1   M   1524   2009-02-20 22:27:32  blade1
   2   M   1552   2009-02-24 04:39:24  blade2
   3   M   1500   2009-02-19 16:11:03  blade3
   4   M   1516   2009-02-19 16:11:22  blade4
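
If the daemon is registered, the qdisk vote should also show up in the
totals. A quick way to cross-check (the grep is just for brevity):

[root@blade3 ~]# cman_tool status | grep -i votes

On your two-node cluster that should report "Expected votes: 3" and
"Total votes: 3" once the qdisk is online.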

You can use "service qdiskd start" to start it, or run
/usr/sbin/qdiskd -Q directly if you don't have the init script. If you
installed from RPM on a Red Hat-type distro, the init script should be
there.
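
If qdiskd has never been set up on this cluster, the quorum partition
itself must also be initialised before the daemon can register its
vote. A minimal sketch, assuming your shared block device is /dev/sdX
(a hypothetical path; substitute your real shared LUN) and using the
"pingtest" label from your quorumd stanza:

# Initialise the quorum partition (this overwrites data on /dev/sdX!)
mkqdisk -c /dev/sdX -l pingtest
# Confirm the label is visible from every node
mkqdisk -L
# Start qdiskd now and enable it at boot
service qdiskd start
chkconfig qdiskd on

Once qdiskd registers, "cman_tool nodes" should list the device as
node 0 (as in my output above) and "Total votes" should rise to 3.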

Regards,
brett
>
> Thanks.
>>
>>
>>
>> Mockey Chen wrote:
>>> ext Mockey Chen wrote:
>>>
>>>> ext Kein He wrote:
>>>>
>>>>> Hi Mockey,
>>>>>
>>>>> Could you please attach the output from "cman_tool status" and
>>>>> "cman_tool nodes -f"?
>>>>>
>>>>>
>>>> Thanks for your response.
>>>>
>>>> I tried to run cman_tool status on as-2, but it hung with no
>>>> output, and even Ctrl+C had no effect.
>>>>
>>> I manually rebooted as-1, and the problem was solved.
>>>
>>> Here is the output of cman_tool:
>>>
>>> [root@as-1 ~]# cman_tool status
>>> Version: 6.1.0
>>> Config Version: 19
>>> Cluster Name: azerothcluster
>>> Cluster Id: 20148
>>> Cluster Member: Yes
>>> Cluster Generation: 76
>>> Membership state: Cluster-Member
>>> Nodes: 2
>>> Expected votes: 3
>>> Total votes: 2
>>> Quorum: 2
>>> Active subsystems: 8
>>> Flags: Dirty
>>> Ports Bound: 0 177
>>> Node name: as-1.localdomain
>>> Node ID: 1
>>> Multicast addresses: 239.192.78.3
>>> Node addresses: 10.56.150.3
>>> [root@as-1 ~]# cman_tool status -f
>>> Version: 6.1.0
>>> Config Version: 19
>>> Cluster Name: azerothcluster
>>> Cluster Id: 20148
>>> Cluster Member: Yes
>>> Cluster Generation: 76
>>> Membership state: Cluster-Member
>>> Nodes: 2
>>> Expected votes: 3
>>> Total votes: 2
>>> Quorum: 2
>>> Active subsystems: 8
>>> Flags: Dirty
>>> Ports Bound: 0 177
>>> Node name: as-1.localdomain
>>> Node ID: 1
>>> Multicast addresses: 239.192.78.3
>>> Node addresses: 10.56.150.3
>>>
>>>
>>> It seems the cluster cannot fence one of the nodes. How can I solve this?
>>>
>>>
>>>> I opened a new window and could ssh to as-2, but after logging in
>>>> I could not do anything; even a simple 'ls' command hung.
>>>>
>>>> It seems the system stays alive but does not provide any service.
>>>> Really bad.
>>>>
>>>> Is there any way to debug this issue?
>>>>
>>>>> Mockey Chen wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a two-node cluster. To avoid split-brain, I use iLO as the
>>>>>> fence device and an IP tiebreaker. Here is my
>>>>>> /etc/cluster/cluster.conf:
>>>>>> <?xml version="1.0"?>
>>>>>> <cluster alias="azerothcluster" config_version="19"
>>>>>> name="azerothcluster">
>>>>>>     <cman expected_votes="3" two_node="0"/>
>>>>>>     <clusternodes>
>>>>>>         <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="1">
>>>>>>                     <device name="ilo1"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>         <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="1">
>>>>>>                     <device name="ilo2"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>     </clusternodes>
>>>>>>         <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>>>>>                 <heuristic program="ping 10.56.150.1 -c1 -t1"
>>>>>>                            score="1" interval="2" tko="3"/>
>>>>>>         </quorumd>
>>>>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>>>>     <fencedevices>
>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18"
>>>>>> login="power" name="ilo1" passwd="pass"/>
>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19"
>>>>>> login="power" name="ilo2" passwd="pass"/>
>>>>>>     </fencedevices>
>>>>>> ...
>>>>>> ...
>>>>>>
>>>>>> To test the case where one node loses its heartbeat, I disabled
>>>>>> the Ethernet card (eth0) on as-1. I expected as-2 to take over
>>>>>> as-1's services and as-1 to reboot. What actually happened is
>>>>>> that as-1 lost its connection to as-2; as-2 detected this and
>>>>>> tried to reconstruct the cluster, but failed. Here is the syslog
>>>>>> from as-2:
>>>>>>
>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the OPERATIONAL state.
>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state from 2.
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state from 0.
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token because I am the rep.
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high seq received 1f4
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence id for ring 2c
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member 10.56.150.4:
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep 10.56.150.3
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered 1f4 received flag 1
>>>>>>
>>>>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>>>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to originate any messages in recovery.
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3)
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking activity
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the primary component and will provide service.
>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate.  Refusing connection.
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL state.
>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect: Connection refused
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message 10.56.150.4
>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from node 2
>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something evil.
>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid request descriptor
>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something evil.
>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid request descriptor
>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something evil.
>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect: Invalid request descriptor
>>>>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address record for 10.56.150.144 on eth0.
>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I also found some errors in as-1's syslog:
>>>>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG status
>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not detected
>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>>>>>> ...
>>>>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster infrastructure after 30 seconds.
>>>>>> ...
>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster infrastructure after 60 seconds.
>>>>>> ...
>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster infrastructure after 90 seconds.
>>>>>>
>>>>>>
>>>>>> Any comment is appreciated!