[Linux-cluster] two node cluster with IP tiebreaker failed.

Wed Feb 25 07:12:43 UTC 2009

ext Kein He wrote:
> Hi Mockey,
>
> Could you please attach the output from "cman_tool status" and
> "cman_tool nodes -f"?
>
Thanks for your response.

I tried to run cman_tool status on as-2, but it hung without any output,
and even Ctrl+C had no effect.
I opened a new window and could ssh to as-2, but after logging in I could
not do anything; even a simple 'ls' command hung.

It seems the system stays alive but does not provide any service. Really bad.

Is there any way to debug this issue?
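
For reference, the vote arithmetic implied by the quoted cluster.conf can
be sketched in Python. This is only a minimal sketch under stated
assumptions: it assumes quorum is a simple majority of expected_votes
(expected_votes // 2 + 1), the embedded XML is an abbreviated copy of the
quoted config (fence sections left out, closing tag added so it parses),
and the function names are made up for illustration. Whether the quorumd
vote is really counted depends on qdiskd running and being registered
with cman, which this sketch cannot check.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the quoted cluster.conf; fencing sections are omitted
# and the closing </cluster> tag is added so the fragment parses.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster alias="azerothcluster" config_version="19" name="azerothcluster">
    <cman expected_votes="3" two_node="0"/>
    <clusternodes>
        <clusternode name="as-1.localdomain" nodeid="1" votes="1"/>
        <clusternode name="as-2.localdomain" nodeid="2" votes="1"/>
    </clusternodes>
    <quorumd interval="1" tko="10" votes="1" label="pingtest"/>
</cluster>
"""


def quorum_threshold(expected_votes):
    # Assumption: quorum is a simple majority of expected_votes.
    return expected_votes // 2 + 1


def read_votes(conf_text):
    """Return (expected_votes, {node name: votes}, quorum device votes)."""
    root = ET.fromstring(conf_text)
    expected = int(root.find("cman").get("expected_votes"))
    node_votes = {
        node.get("name"): int(node.get("votes", "1"))
        for node in root.iter("clusternode")
    }
    quorumd = root.find("quorumd")
    qdisk_votes = int(quorumd.get("votes", "0")) if quorumd is not None else 0
    return expected, node_votes, qdisk_votes


def is_quorate(expected_votes, live_votes):
    # A partition is quorate when the votes it holds reach the threshold.
    return live_votes >= quorum_threshold(expected_votes)


expected, node_votes, qdisk_votes = read_votes(CLUSTER_CONF)
as2 = node_votes["as-2.localdomain"]

print("quorum threshold:", quorum_threshold(expected))
print("as-2 alone quorate:", is_quorate(expected, as2))
print("as-2 plus quorum device quorate:", is_quorate(expected, as2 + qdisk_votes))
```

Under these assumptions as-2 needs the quorum device vote in addition to
its own vote to stay quorate after as-1 drops out. The "quorum lost"
message in the as-2 log would be consistent with that extra vote not
being counted at that moment, though the log alone does not prove it.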
>
>
> Mockey Chen wrote:
>> Hi,
>>
>> I have a two-node cluster. To avoid split-brain, I use iLO as the fence
>> device and an IP tiebreaker. Here is my /etc/cluster/cluster.conf:
>> <?xml version="1.0"?>
>> <cluster alias="azerothcluster" config_version="19"
>> name="azerothcluster">
>>     <cman expected_votes="3" two_node="0"/>
>>     <clusternodes>
>>         <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>             <fence>
>>                 <method name="1">
>>                     <device name="ilo1"/>
>>                 </method>
>>             </fence>
>>         </clusternode>
>>         <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>             <fence>
>>                 <method name="1">
>>                     <device name="ilo2"/>
>>                 </method>
>>             </fence>
>>         </clusternode>
>>     </clusternodes>
>>         <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>                 <heuristic program="ping 10.56.150.1 -c1 -t1" score="1"
>> interval="2" tko="3"/>
>>         </quorumd>
>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>     <fencedevices>
>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18"
>> login="power" name="ilo1" passwd="pass"/>
>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19"
>> login="power" name="ilo2" passwd="pass"/>
>>     </fencedevices>
>> ...
>> ...
>>
>> To test the case where one node loses its heartbeat, I disabled the
>> Ethernet card (eth0) on as-1. I expected as-2 to take over the services
>> from as-1 and as-1 to reboot.
>> What actually happened: as-1 lost its connection to as-2; as-2 detected
>> this and tried to re-form the cluster, but failed. Here is the syslog
>> from as-2:
>>
>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
>> OPERATIONAL state.
>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
>> recv buffer size (288000 bytes).
>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
>> send buffer size (262142 bytes).
>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>> from 2.
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>> from 0.
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>> because I am the rep.
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
>> seq received 1f4
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence id for
>> ring 2c
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>> 10.56.150.4:
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
>> 10.56.150.3
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered 1f4
>> received flag 1
>>
>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved Feb 24 21:25:40 as-2
>> openais[4139]: [TOTEM] Did not need to originate any messages in
>> recovery.
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3)
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>> activity
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
>> primary component and will provide service.
>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate.  Refusing
>> connection.
>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL state.
>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>> Connection refused
>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
>> 10.56.150.4
>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from
>> node 2
>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
>> evil.
>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
>> request descriptor
>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>> evil.
>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
>> request descriptor
>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>> evil.
>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
>> Invalid request descriptor
>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address record for
>> 10.56.150.144 on eth0.
>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
>> Address already in use
>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>
>>
>>
>>
>> I also found some errors in as-1's syslog:
>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
>> status
>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
>> detected
>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>> ...
>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>> infrastructure after 30 seconds.
>> ...
>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>> infrastructure after 60 seconds.
>> ...
>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>> infrastructure after 90 seconds.
>>
>>
>> Any comments are appreciated!
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>   
>
>