[Linux-cluster] Cluster is failed.

Tue Mar 9 20:43:11 UTC 2010

Hi all,

I fixed my cluster (I built a new one with luci).

Here is my new cluster.conf (my questions come after it :))

<?xml version="1.0"?>
<cluster alias="TEST" config_version="85" name="TEST">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa_node1"/>
                               </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa_node2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="5"/>
        <fencedevices>
                <fencedevice agent="fence_rsa" ipaddr="rsa_node1"
login="ADMIN" name="rsa_node1" passwd="password"/>
                <fencedevice agent="fence_rsa" ipaddr="rsa_node2"
login="ADMIN" name="rsa_node2" passwd="password"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="test" nofailback="0"
ordered="1" restricted="1">
                                <failoverdomainnode name="node1"
priority="1"/>
                                <failoverdomainnode name="node2"
priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.10.20" monitor_link="1"/>
                        <clusterfs device="/dev/vg1data/lv1data"
force_unmount="1" fsid="47478" fstype="gfs2" mountpoint="/data"
name="lv1data" self_fence="0"/>
                        <clusterfs device="/dev/vg1app/lv1app"
force_unmount="1" fsid="11699" fstype="gfs2" mountpoint="/app"
name="lv1app" self_fence="0"/>
                </resources>
                <service autostart="1" domain="TEST" exclusive="1"
name="TEST" recovery="disable">
                        <ip ref="172.28.104.80">
                                <clusterfs fstype="gfs" ref="lv1data"/>
                                <clusterfs fstype="gfs" ref="lv1app"/>
                        </ip>
                </service>
        </rm>
        <totem consensus="4800" join="60" token="10000"
token_retransmits_before_loss_const="20"/>
        <quorumd device="/dev/vg1quorum/lv1quorum" interval="1"
min_score="1" tko="10" votes="3">
                <heuristic interval="10" program="/usr/sbin/qdiskd"
score="1"/>
        </quorumd>
</cluster>

clustat gives:

Cluster Status for TEST @ Tue Mar  9 18:12:25 2010
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 node1                                 1 Online, Local, rgmanager
 node2                                 2 Online, rgmanager
 /dev/vg1quorum/lv1quorum              0 Online, Quorum Disk

 Service Name          Owner (Last)          State
 ------- ----          ----- ------          -----
 service:TEST          node1                 started

Now my questions (sorry, I have read several websites in English, but my
English is a bit poor ...):

- <cman expected_votes="5"/>   => what is this number 5? Where does it
come from?
- I don't understand why the quorum disk is visible now. I did the same
thing as before (mkqdisk, etc.).
- <quorumd device="/dev/vg1quorum/lv1quorum" interval="1" min_score="1"
tko="10" votes="3">  => Why 3? Is it for node1, node2 and the quorum disk?
- Luci offers to start cman and rgmanager automatically; is that a good
idea? qdiskd is started in runlevels 2345 by the system; same question,
is that a good thing?
- I still have a network where node1's IP is 10.0.0.10 and node2's IP is
10.0.0.20; could I add it to cluster.conf as a heartbeat network?
- Last one, about fencing (:) ). Node1 will use rsa_node2 to kill node2,
and node2 will use rsa_node1 to kill node1?
 =>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa_node1"/>
                                </method>
                        </fence>
                </clusternode>
Is that right?
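
The fencing direction asked about above can be annotated on the
configuration itself. This is only a sketch, assuming the usual reading
that the <fence> block inside a <clusternode> lists the devices that the
other members use against that node; the comments are illustrative
additions and are not part of the luci-generated file.

```xml
<!-- Annotated copy of the two clusternode entries; only the comments are new. -->
<clusternodes>
        <clusternode name="node1" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <!-- used by the surviving member (node2) to fence node1 -->
                                <device name="rsa_node1"/>
                        </method>
                </fence>
        </clusternode>
        <clusternode name="node2" nodeid="2" votes="1">
                <fence>
                        <method name="1">
                                <!-- used by the surviving member (node1) to fence node2 -->
                                <device name="rsa_node2"/>
                        </method>
                </fence>
        </clusternode>
</clusternodes>
```

Under that reading, the device name under each node points at that node's
own RSA card, which is what the posted cluster.conf already does.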

With this configuration, I shut down my first node (violently), and my
service came up on the second one, so it seems to be working fine, but I
don't really understand why ...
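
For reference, the vote arithmetic behind this configuration can be
sketched as follows. It assumes cman's usual rule that the cluster stays
quorate while the votes present reach a strict majority of
expected_votes. The fence blocks and the heuristic are omitted for
brevity, and the comments are annotations rather than luci output.

```xml
<!-- Annotated excerpt of the cluster.conf above; comments are illustrative. -->
<clusternodes>
        <clusternode name="node1" nodeid="1" votes="1"/>   <!-- 1 vote -->
        <clusternode name="node2" nodeid="2" votes="1"/>   <!-- 1 vote -->
</clusternodes>
<!-- quorum disk: 3 votes -->
<quorumd device="/dev/vg1quorum/lv1quorum" interval="1" min_score="1" tko="10" votes="3"/>
<!-- expected_votes = 1 + 1 + 3 = 5; quorum = floor(5 / 2) + 1 = 3 -->
<cman expected_votes="5"/>
<!-- node1 down: node2 (1) + quorum disk (3) = 4 votes, which is >= 3,
     so node2 stays quorate and can take over the service -->
```

With the values suggested in the quoted reply below (quorumd votes=1,
expected_votes="3"), the same rule would give a quorum of 2, so one node
plus the quorum disk would still be enough.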

Thanks for your help

mog

On Monday, 08 March 2010 at 13:37 +0200, שלום קלמר wrote:
> Hi.
> 
> You have some errors in your cluster.conf file.
> 
> 1. You must check that fence_rsa works before starting the cluster.
> 2. If you are using quorumd, change cman to: <cman
> expected_votes="3" two_node="0"/>
> 3. Set quorumd votes=1, min_score=1.
> 4. Change your heuristic program to something like a ping to your
> router. (It's better to add more heuristics.)
> 5. Install the most recent RPMs of cman, openais & rgmanager.
> 6. clustat should show that qdisk is online. cman should start qdiskd.
> 
> For more information you can read:
> 
> http://sources.redhat.com/cluster/wiki/FAQ/CMAN  (it helped me)
> 
> 
> Regards
> 
> Shalom.
> 
> On Mon, Mar 8, 2010 at 12:44 PM, <mogruith at free.fr> wrote:
>         
>         Hi all
>         
>         Here is my cluster.conf:
>         
>         
>         <?xml version="1.0"?>
>         <cluster config_version="6" name="TEST">
>                <quorumd device="/dev/vg1quorom/lv1quorom" interval="1"
>         label="quorum"
>         min_score="3" tko="10" votes="3">
>                        <heuristic interval="2"
>         program="/usr/sbin/qdiskd" score="1"/>
>                </quorumd>
>                <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>                <cman expected_votes="1" two_node="1"/>
>                <clusternodes>
>                        <clusternode name="node1" nodeid="1" votes="1">
>                                <fence>
>                                        <method name="1">
>                                                <device
>         name="RSA_node1"/>
>                                        </method>
>                                </fence>
>                        </clusternode>
>                        <clusternode name="node2" nodeid="2" votes="1">
>                                <fence>
>                                        <method name="1">
>                                                <device
>         name="RSA_node2"/>
>                                        </method>
>                                </fence>
>                        </clusternode>
>                </clusternodes>
>                <cman/>
>                <fencedevices>
>                        <fencedevice agent="fence_rsa"
>         ipaddr="RSA_node1" login="USER"
>         name="RSA_node1" passwd="PASSWORD"/>
>                        <fencedevice agent="fence_rsa"
>         ipaddr="RSA_node2" login="USER"
>         name="RSA_node2" passwd="PASSWORD"/>
>                </fencedevices>
>                <rm>
>                        <failoverdomains>
>                                <failoverdomain name="TEST" ordered="1"
>         restricted="1">
>                                        <failoverdomainnode
>         name="node1" priority="1"/>
>                                        <failoverdomainnode
>         name="node2" priority="2"/>
>                                </failoverdomain>
>                        </failoverdomains>
>                        <resources>
>                                <ip address="172.28.104.80"
>         monitor_link="1"/>
>                                <clusterfs
>         device="/dev/vg1data/lv1data"
>         force_unmount="0" fsid="30516" fstype="gfs2"
>         mountpoint="/data" name="DATA"
>         options=""/>
>                        </resources>
>                        <service autostart="1" domain="TEST"
>         exclusive="1" name="TEST">
>                                <ip ref="172.28.104.80">
>                                        <clusterfs ref="DATA"/>
>                                </ip>
>                        </service>
>                </rm>
>         </cluster>
>         
>         
>         N.B
>         node1, node2 , RSA_node1 and RSA_node2 are set in /etc/hosts
>         
>         When I move the service from node1 to node2 (by forcing a reboot
>         of node1), it fails (probably because of a network problem), but
>         is there a timeout? If node2 can't connect to node1's RSA, why
>         doesn't it consider node1 "dead", and why doesn't the service
>         move to node2?
>         
>         Here is the clustat
>         
>         [root at node2 ~]# clustat
>         Cluster Status for TEST @ Mon Mar  8 11:33:32 2010
>         Member Status: Quorate
>         
>          Member Name                        ID   Status
>          ------ ----                        ---- ------
>          node1                                 1 Offline
>          node2                                 2 Online, Local, rgmanager
>         
>          Service Name          Owner (Last)          State
>          ------- ----          ----- ------          -----
>          service:TEST          node1                 stopping
>         
>         It has been stuck in "stopping" like that for 30 minutes!
>         
>         Here is the log:
>         
>         Mar  8 11:35:45 node2 fenced[7038]: agent "fence_rsa" reports: Unable to connect/login to fencing device
>         Mar  8 11:35:45 node2 fenced[7038]: fence "node1" failed
>         Mar  8 11:35:50 node2 fenced[7038]: fencing node "node1"
>         Mar  8 11:35:56 node2 fenced[7038]: agent "fence_rsa" reports: Unable to connect/login to fencing device
>         Mar  8 11:35:56 node2 fenced[7038]: fence "node1" failed
>         
>         Why is node2 still trying to fence node1?
>         
>         Here is something else:
>         
>         [root at node2 ~]# cman_tool services
>         type             level name       id       state
>         fence            0     default    00010001 FAIL_START_WAIT
>         [2]
>         dlm              1     rgmanager  00020001 FAIL_ALL_STOPPED
>         [1 2]
>         
>         How can I verify that the quorum disk is being used?
>         
>         Last question: I have 3 networks (6 NICs, 3 bonds), one of which
>         is dedicated to the heartbeat. Where do I set it in
>         cluster.conf? I would like node1 and node2 to communicate over
>         their own bond3.
>         
>         Thanks for your help.
>         
>         mog
>         
>         --
>         Linux-cluster mailing list
>         Linux-cluster at redhat.com
>         https://www.redhat.com/mailman/listinfo/linux-cluster