[Linux-cluster] Cluster is failed.

Shalom Klemer sklemer at gmail.com
Mon Mar 8 11:37:22 UTC 2010


Hi,

You have a few errors in your cluster.conf file:

1. Verify that fence_rsa works before starting the cluster.
2. If you are using quorumd, change the cman line to: <cman expected_votes="3"
two_node="0"/>
3. Set quorumd to votes="1" and min_score="1".
4. Change your heuristic program to something like a ping to your router — your
current heuristic runs /usr/sbin/qdiskd, which is the quorum daemon itself, not
a health check. (It's better to add more heuristics.)
5. Install the most up-to-date cman, openais and rgmanager RPMs.
6. clustat should show the qdisk as online; cman should start qdiskd.
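Combining points 2-4, the quorum-related part of the config would look roughly like this (a sketch, not your exact config; 172.28.104.1 is a hypothetical gateway address — substitute your own router's IP):

```xml
<!-- Sketch of points 2-4: quorumd carries 1 vote with min_score=1,
     and the heuristic pings the router instead of running qdiskd.
     172.28.104.1 is a hypothetical gateway IP. -->
<quorumd device="/dev/vg1quorom/lv1quorom" interval="1" label="quorum"
         min_score="1" tko="10" votes="1">
        <heuristic interval="2" program="ping -c1 -w1 172.28.104.1" score="1"/>
</quorumd>
<cman expected_votes="3" two_node="0"/>
```

With two node votes plus one qdisk vote, expected_votes="3" lets a single node plus the quorum disk stay quorate.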

For more information you can read:

http://sources.redhat.com/cluster/wiki/FAQ/CMAN  (it helped me).


Regards

Shalom.

On Mon, Mar 8, 2010 at 12:44 PM, <mogruith at free.fr> wrote:

>
> Hi all
>
> Here is my cluster.conf:
>
>
> <?xml version="1.0"?>
> <cluster config_version="6" name="TEST">
>        <quorumd device="/dev/vg1quorom/lv1quorom" interval="1"
> label="quorum"
> min_score="3" tko="10" votes="3">
>                <heuristic interval="2" program="/usr/sbin/qdiskd"
> score="1"/>
>        </quorumd>
>        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>        <cman expected_votes="1" two_node="1"/>
>        <clusternodes>
>                <clusternode name="node1" nodeid="1" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="RSA_node1"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="node2" nodeid="2" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="RSA_node2"/>
>                                </method>
>                        </fence>
>                </clusternode>
>        </clusternodes>
>        <cman/>
>        <fencedevices>
>                <fencedevice agent="fence_rsa" ipaddr="RSA_node1"
> login="USER"
> name="RSA_node1" passwd="PASSWORD"/>
>                <fencedevice agent="fence_rsa" ipaddr="RSA_node2"
> login="USER"
> name="RSA_node2" passwd="PASSWORD"/>
>        </fencedevices>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="TEST" ordered="1"
> restricted="1">
>                                <failoverdomainnode name="node1"
> priority="1"/>
>                                <failoverdomainnode name="node2"
> priority="2"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources>
>                        <ip address="172.28.104.80" monitor_link="1"/>
>                        <clusterfs device="/dev/vg1data/lv1data"
> force_unmount="0" fsid="30516" fstype="gfs2" mountpoint="/data" name="DATA"
> options=""/>
>                </resources>
>                <service autostart="1" domain="TEST" exclusive="1"
> name="TEST">
>                        <ip ref="172.28.104.80">
>                                <clusterfs ref="DATA"/>
>                        </ip>
>                </service>
>        </rm>
> </cluster>
>
>
> N.B.
> node1, node2, RSA_node1 and RSA_node2 are set in /etc/hosts
>
> When I move the service from node1 to node2 (by force-rebooting node1), it
> fails (probably because of a network problem), but is there a timeout? If
> node2 can't connect to node1's RSA adapter, why doesn't it consider node1
> "dead", and why doesn't the service move to node2?
>
> Here is the clustat output:
>
> [root at node2 ~]# clustat
> Cluster Status for TEST @ Mon Mar  8 11:33:32 2010
> Member Status: Quorate
>
>  Member Name                         ID   Status
>  ------ ----                         --   ------
>  node1                                1   Offline
>  node2                                2   Online, Local, rgmanager
>
>  Service Name                        Owner (Last)   State
>  ------- ----                        ------------   -----
>  service:TEST                        node1          stopping
>
> It has been stuck in "stopping" like that for 30 minutes!
>
> Here is the log:
>
> Mar  8 11:35:45 node2 fenced[7038]: agent "fence_rsa" reports: Unable to
> connect/login to fencing device
> Mar  8 11:35:45 node2 fenced[7038]: fence "node1" failed
> Mar  8 11:35:50 node2 fenced[7038]: fencing node "node1"
> Mar  8 11:35:56 node2 fenced[7038]: agent "fence_rsa" reports: Unable to
> connect/login to fencing device
> Mar  8 11:35:56 node2 fenced[7038]: fence "node1" failed
>
> Why is node2 still trying to fence node1?
>
> Here is something else:
>
> [root at node2 ~]# cman_tool services
> type             level name       id       state
> fence            0     default    00010001 FAIL_START_WAIT
> [2]
> dlm              1     rgmanager  00020001 FAIL_ALL_STOPPED
> [1 2]
>
> How can I verify that quorum is being used?
>
> Last question: I have 3 networks (6 NICs, 3 bonds), one of which is
> dedicated to heartbeat. Where do I set that in cluster.conf? I would like
> node1 and node2 to communicate over their own bond3.
>
> Thanks for your help.
>
> mog
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

