[Linux-cluster] two node cluster with IP tiebreaker failed.

Mockey Chen mockey.chen at nsn.com
Wed Feb 25 04:39:53 UTC 2009


Hi,

I have a two-node cluster. To avoid split-brain, I use iLO as the fence
device and an IP tiebreaker. Here is my /etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster alias="azerothcluster" config_version="19" name="azerothcluster">
    <cman expected_votes="3" two_node="0"/>
    <clusternodes>
        <clusternode name="as-1.localdomain" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="ilo1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="as-2.localdomain" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="ilo2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <quorumd interval="1" tko="10" votes="1" label="pingtest">
        <heuristic program="ping 10.56.150.1 -c1 -t1" score="1" interval="2" tko="3"/>
    </quorumd>
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <fencedevices>
        <fencedevice agent="fence_ilo" hostname="10.56.154.18" login="power" name="ilo1" passwd="pass"/>
        <fencedevice agent="fence_ilo" hostname="10.56.154.19" login="power" name="ilo2" passwd="pass"/>
    </fencedevices>
...
...
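
As a sanity check on the vote counts in this configuration, here is a
minimal sketch of the arithmetic as I understand it. It assumes cman uses
the usual majority rule (quorum = expected_votes // 2 + 1) and that the
quorumd vote is only counted while qdiskd is contributing it. The function
and constant names are purely illustrative and are not part of any cluster
tool.

```python
# Illustrative vote arithmetic for the cluster.conf above.
# Assumption: quorum is a simple majority of expected_votes.

EXPECTED_VOTES = 3   # <cman expected_votes="3">
NODE_VOTES = 1       # each <clusternode votes="1">
QDISK_VOTES = 1      # <quorumd votes="1">


def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that forms a majority."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int = EXPECTED_VOTES) -> bool:
    """True if the votes currently visible reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    print("quorum threshold:", quorum_threshold(EXPECTED_VOTES))
    # Surviving node plus the tiebreaker vote.
    print("one node + quorumd vote:", is_quorate(NODE_VOTES + QDISK_VOTES))
    # Surviving node alone, if the tiebreaker vote is not being counted.
    print("one node alone:", is_quorate(NODE_VOTES))
```

Under these assumptions, a single surviving node stays quorate only if the
quorumd vote is actually added to its own vote.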

To test the case where one node loses its heartbeat, I disabled the
Ethernet card (eth0) on as-1. I expected as-2 to take over the services
running on as-1 and as-1 to be rebooted. What actually happened is that
as-1 lost its connection to as-2; as-2 detected this and tried to re-form
the cluster, but failed. Here is the syslog from as-2:

Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
OPERATIONAL state.
Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
send buffer size (262142 bytes).
Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state from 2.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state from 0.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
because I am the rep.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
seq received 1f4
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence id for
ring 2c
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
10.56.150.4:
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
10.56.150.3
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered 1f4
received flag 1

Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved

Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to originate
any messages in recovery.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4) 
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3) 
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking activity
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4) 
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
primary component and will provide service.
Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate.  Refusing
connection.
Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL state.
Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
Connection refused
Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
10.56.150.4
Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from
node 2
Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something evil.
Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
request descriptor
Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something evil.
Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
request descriptor
Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something evil.
Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
Invalid request descriptor
Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address record for
10.56.150.144 on eth0.
Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
Address already in use
Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse




I also found some errors in as-1's syslog:
Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG status
Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
detected
Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
...
Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
infrastructure after 30 seconds.
...
Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
infrastructure after 60 seconds.
...
Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
infrastructure after 90 seconds.


Any comments are appreciated!



