[Linux-cluster] Two-Node Cluster Problem

Marco Nietz m.nietz-redhat at iplabs.de
Thu May 28 08:06:03 UTC 2009


I gave the whole configuration another try and set up the cluster from
scratch.

This is the resulting cluster.conf with a very basic configuration.

[root@ipsdb01 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="ips_database" config_version="7" name="ips_database">
        <fence_daemon clean_start="1" post_fail_delay="10" post_join_delay="30"/>
        <clusternodes>
                <clusternode name="10.102.10.51" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ipsdb01.drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="10.102.10.28" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ips08.drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="10.102.10.128" login="root" name="ips08.drac" passwd="xxx"/>
                <fencedevice agent="fence_drac" ipaddr="10.102.10.151" login="root" name="ipsdb01.drac" passwd="xxx"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="10.209.170.55" monitor_link="1"/>
                </resources>
                <service autostart="1" exclusive="0" name="ips_database" recovery="relocate">
                        <ip ref="10.209.170.55"/>
                </service>
        </rm>
</cluster>
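
As a sanity check I also want to compare this config against what the
cluster stack actually reports on both nodes, roughly along these lines
(standard cman/groupd tools from the cluster suite; this is just what I
plan to run, not captured output):

# membership and votes as cman sees them (two_node="1" should keep us quorate with one vote)
cman_tool status
cman_tool nodes

# state of the fence domain and the other groups
cman_tool services
group_tool ls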

The service was running on 10.102.10.28. I did a 'powerdown' via the
DRAC interface, but the service was not taken over by the second node.

clustat on the remaining node gave an interesting output:

[root@ipsdb01 ~]# clustat
Cluster Status for ips_database @ Thu May 28 09:31:30 2009
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 10.102.10.51                    1 Online, Local, rgmanager
 10.102.10.28                    2 Offline

 Service Name                Owner (Last)                State
 ------- ----                ----- ------                -----
 service:ips_database        10.102.10.28                started

The service is 'started' but the Owner (10.102.10.28) is offline.
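
If I need the service IP back in the meantime, my plan is to move the
service by hand with clusvcadm (syntax as I understand it;
10.102.10.51 is the clusternode name of the surviving node from the
config above):

# relocate the service to the surviving node
clusvcadm -r ips_database -m 10.102.10.51

# if rgmanager still thinks the service is owned by the dead node, disable and re-enable it
clusvcadm -d ips_database
clusvcadm -e ips_database -m 10.102.10.51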

These are the last lines from /var/log/messages

May 28 09:27:03 ipsdb01 kernel: dlm: closing connection to node 2
May 28 09:27:03 ipsdb01 openais[5295]: [CLM  ] Members Joined:
May 28 09:27:03 ipsdb01 fenced[5315]: 10.102.10.28 not a cluster member after 0 sec post_fail_delay
May 28 09:27:03 ipsdb01 openais[5295]: [SYNC ] This node is within the
primary component and will provide service.
May 28 09:27:03 ipsdb01 openais[5295]: [TOTEM] entering OPERATIONAL state.
May 28 09:27:03 ipsdb01 openais[5295]: [CLM  ] got nodejoin message 10.102.10.51
May 28 09:27:03 ipsdb01 openais[5295]: [CPG  ] got joinlist message from node 1

The remaining system recognizes the failure but doesn't start any
takeover action.
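
My guess is that rgmanager waits for a successful fence of node 2
before it recovers the service, so the next thing I want to test is
the fencing path itself, roughly like this (fence_node uses the agent
and credentials from cluster.conf; the fence_drac options below are
from memory and may need checking):

# fence the failed node by hand and watch /var/log/messages for the fenced result
fence_node 10.102.10.28

# or call the DRAC agent directly with the values from cluster.conf
fence_drac -a 10.102.10.128 -l root -p xxx -o reboot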

Does anyone have an idea what could cause such a problem?


Marco Nietz wrote:
> Tiago Cruz wrote:
>> Did you have:
>>
>> <cman two_node="1" expected_votes="1"/>
>>
>> ?
>>
> 
> yes, I have this in my config.
> 
> 
> 



