[Linux-cluster] Cluster Suite 4 failover problem

Thu Oct 19 15:45:44 UTC 2006

Hi,

What is being logged to /var/log/messages on each node? That 
should provide a clue as to what the problem is.  Also, did you install 
the 'fence' RPM and any Clustered LVM / GFS RPMs?
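
If the logs are large, a small filter can narrow them down to the cluster daemons. Below is a minimal Python sketch, not a tool that ships with Cluster Suite. The function name is made up for illustration, and the tag list (fenced, fence_manual, cman, ccsd, clurgmgrd, rgmanager) is only my assumption about which names are worth searching for; adjust it to whatever actually appears in your logs.

```python
import re

# Assumed tags: names commonly associated with Cluster Suite 4 daemons.
# This list is a guess for illustration, not an official one.
CLUSTER_TAGS = ("fenced", "fence_manual", "cman", "ccsd", "clurgmgrd", "rgmanager")


def cluster_lines(lines, tags=CLUSTER_TAGS):
    """Return the lines that mention any of the tags (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(tag) for tag in tags), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

To use it, open /var/log/messages on each node (usually as root), pass the file object to cluster_lines(), and compare what the two nodes logged around the time eth0 went down.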

You might also consider rebooting the "downed" node. Fencing devices 
normally take care of this automatically and, as I understand it, 
"manual fencing" means you have to reboot it yourself :), the 
assumption being that a failed node won't be allowed back into the 
cluster until it has been restarted.
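
To see which nodes depend on manual fencing, you can read it straight out of cluster.conf. The following is a rough Python sketch under stated assumptions: the function name is hypothetical, and it assumes the layout used in the config quoted below (fencedevice entries under fencedevices, device entries under clusternode/fence/method).

```python
import xml.etree.ElementTree as ET


def manually_fenced_nodes(conf_xml):
    """Return the names of cluster nodes that reference a fence_manual device."""
    root = ET.fromstring(conf_xml)

    # Collect the names of all fence devices backed by the fence_manual agent.
    manual_devices = {
        dev.get("name")
        for dev in root.findall("./fencedevices/fencedevice")
        if dev.get("agent") == "fence_manual"
    }

    nodes = []
    for node in root.findall("./clusternodes/clusternode"):
        devices = node.findall("./fence/method/device")
        # A node counts as manually fenced if any of its methods uses such a device.
        if any(dev.get("name") in manual_devices for dev in devices):
            nodes.append(node.get("name"))
    return nodes
```

Feed it the contents of the config file (normally /etc/cluster/cluster.conf, which is an assumption about your install). Any node it lists will not be fenced automatically, so recovery waits until someone resets that node and acknowledges it; as far as I know that acknowledgement is done with fence_ack_manual.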

Thanks,
Jon

Dicky wrote:

> Hi All,
>
> I have two machines (node1 --> 192.168.0.27 and node2 
> --> 192.168.0.28) running Red Hat Cluster Suite 4 with DLM, each with 
> one NIC. I have created a manual fence device, a failover domain, and 
> two services: "www", listening on 192.168.0.111, and "ftp", 
> listening on 192.168.0.112.
>
> After the initial setup, everything seemed to work fine: I can 
> relocate the services from node1 to node2 or vice versa manually, and 
> stop and start them.
>
> But when I tested the failover capability by shutting down the 
> network on one node (e.g. taking down eth0 on node1), the services 
> failed to recover most of the time. This is the scenario I 
> tested:
>
> Scenario: Services running on node1, then I shut down eth0 
> on node1.
>
> Result: The services did not fail over to node2, and clustat on 
> node1 shows:
>
> Member Status: Quorate
>
>  Member Name                      Status
>  ------ ----                              ------
>  node1                                    Offline
>  node2                                    Online, Local, rgmanager
>
>  Service Name     Owner (Last)                   State
>  ------- ----         ----- ------                       -----
>  ftp                       unknown                          started
>  www                   unknown                          started
>
> Neither service was working any longer. Bringing eth0 back up on 
> node1 and restarting the cman service on node1 did not help. 
> When I tried to restart rgmanager on node1, it only showed 
> "Waiting for services to stop: " and waited forever. Even killing 
> the rgmanager process did not work. In the end I had 
> to reset both machines to get the cluster services back to normal.
>
> I would appreciate any help, or hearing from anyone who has 
> run into this before.
> I have also attached the cluster.conf below for reference.
>
> ======cluster.conf=========
> <?xml version="1.0"?>
> <cluster config_version="34" name="alpha_cluster">
>        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>        <clusternodes>
>                <clusternode name="node1" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="Fence" nodename="node1"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="node2" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="Fence" nodename="node2"/>
>                                </method>
>                        </fence>
>                </clusternode>
>        </clusternodes>
>        <cman expected_votes="1" two_node="1"/>
>        <fencedevices>
>                <fencedevice agent="fence_manual" name="Fence"/>
>        </fencedevices>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="aaa" ordered="0" restricted="0">
>                                <failoverdomainnode name="node1" priority="1"/>
>                                <failoverdomainnode name="node2" priority="1"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources>
>                        <ip address="192.168.0.111" monitor_link="0"/>
>                        <script file="/etc/rc.d/init.d/httpd" name="www"/>
>                        <script file="/etc/rc.d/init.d/vsftpd" name="ftp"/>
>                        <ip address="192.168.0.112" monitor_link="0"/>
>                </resources>
>                <service autostart="1" domain="aaa" name="ftp" recovery="relocate">
>                        <ip ref="192.168.0.112"/>
>                        <script ref="ftp"/>
>                </service>
>                <service autostart="1" domain="aaa" name="www" recovery="relocate">
>                        <ip ref="192.168.0.111"/>
>                        <script ref="www"/>
>                </service>
>        </rm>
> </cluster>
> ==========END==========
>
> Many Thanks,
> Dicky