[Linux-cluster] Node can't join already quorated cluster

Javier Vela jvdiago at gmail.com
Wed Jun 20 15:31:17 UTC 2012


OK. I'll try fence_manual and change clean_start to 1. I will
report the results ASAP.

Thank you for the feedback.

2012/6/20 emmanuel segura <emi2fast at gmail.com>

> Ok Javier
>
> So now I know you don't want fencing, and the reason :-)
>
> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>
> and use the fence_manual
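>
> For example, something like this (the method and device names here are
> just examples):
>
>     <fencedevices>
>         <fencedevice agent="fence_manual" name="manual_fence"/>
>     </fencedevices>
>
> and inside each <clusternode>:
>
>     <fence>
>         <method name="1">
>             <device name="manual_fence" nodename="node1-hb"/>
>         </method>
>     </fence>
>
> then when a node really dies you confirm it is down and acknowledge the
> fence by hand on the survivor, with something like:
>
>     fence_ack_manual -n node1-hb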
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
>> I don't use fencing because, with HA-LVM, I thought I didn't need it,
>> but also because both nodes are VMs in VMware. I know there is a fence
>> agent for VMware, but I prefer to avoid it: I'm not in control of the
>> VMware infrastructure, and the VMware admins probably won't give me the
>> access needed to use it.
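>>
>> For reference, I believe the fence stanza would look roughly like the
>> following; I haven't tested it, and the parameter names are assumptions
>> that would need checking against the fence_vmware man page:
>>
>>     <fencedevice agent="fence_vmware" name="vmware"
>>                  ipaddr="vcenter.example.com" login="fence_user"
>>                  passwd="secret"/>
>>
>> with a per-node <device name="vmware" port="node1-vm"/>, and it is
>> exactly that kind of vCenter account that I can't get.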
>>
>> Regards, Javi
>>
>>> Fencing is critical, and running a cluster without fencing, even with
>>> qdisk, is not supported. Manual fencing is also not supported. The
>>> *only* way to have a reliable cluster, testing or production, is to use
>>> fencing.
>>>
>>> Why do you not wish to use it?
>>>
>>> On 06/20/2012 09:43 AM, Javier Vela wrote:
>>> > As I read it, if you use HA-LVM you don't need fencing because of VG
>>> > tagging. Is it absolutely mandatory to use fencing with qdisk?
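>>> >
>>> > (For reference, my understanding of the VG tagging protection: each
>>> > node's lvm.conf only allows activation of VGs tagged with its own
>>> > hostname, roughly like this sketch, hostname assumed:
>>> >
>>> >     volume_list = [ "vg_root", "@node1-hb" ]
>>> >
>>> > so the rgmanager lvm agent moves the tag on failover, and the other
>>> > node can't activate the VG in the meantime.)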
>>> >
>>> > If it is, I suppose I can use fence_manual, but in production I also
>>> > won't use fencing.
>>> >
>>> > Regards, Javi.
>>> >
>>> > Date: Wed, 20 Jun 2012 14:45:28 +0200
>>> > From: emi2fast at gmail.com
>>> > To: linux-cluster at redhat.com
>>> > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>>> >
>>> > If you don't want to use a real fence device, because you are only
>>> > doing some tests, you have to use the fence_manual agent
>>> >
>>> > 2012/6/20 Javier Vela <jvdiago at gmail.com>
>>> >
>>> >     Hi, I have a very strange problem, and after searching through lots
>>> >     of forums, I haven't found the solution. This is the scenario:
>>> >
>>> >     Two-node cluster with Red Hat 5.7, HA-LVM, no fencing, and a quorum
>>> >     disk. I start qdiskd, cman and rgmanager on one node. After about 5
>>> >     minutes the fencing step finally finishes and the cluster gets
>>> >     quorate with 2 votes:
>>> >
>>> >     [root at node2 ~]# clustat
>>> >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>>> >     Member Status: Quorate
>>> >
>>> >      Member Name                                ID   Status
>>> >      ------ ----                                ---- ------
>>> >      node1-hb                                      1 Offline
>>> >      node2-hb                                      2 Online, Local, rgmanager
>>> >      /dev/mapper/vg_qdisk-lv_qdisk                 0 Online, Quorum Disk
>>> >
>>> >      Service Name                   Owner (Last)                   State
>>> >      ------- ----                   ----- ------                   -----
>>> >      service:postgres               node2                          started
>>> >
>>> >     Now I start the second node. When cman reaches the fencing step, it
>>> >     hangs for approximately 5 minutes and finally fails. clustat says:
>>> >
>>> >     [root at node1 ~]# clustat
>>> >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>>> >     Member Status: Inquorate
>>> >
>>> >      Member Name                                ID   Status
>>> >      ------ ----                                ---- ------
>>> >      node1-hb                                      1 Online, Local
>>> >      node2-hb                                      2 Offline
>>> >      /dev/mapper/vg_qdisk-lv_qdisk                 0 Offline
>>> >
>>> >     And in /var/log/messages I can see these errors:
>>> >
>>> >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>>> >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message 15.15.2.10
>>> >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check ccsd or cluster status
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>>> >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check ccsd or cluster status
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because I am the rep.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for ring 15c
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep 15.15.2.10
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e received flag 1
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any messages in recovery.
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>>> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect: Connection refused
>>> >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>>> >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>>> >
>>> >     And the quorum disk:
>>> >
>>> >     [root at node2 ~]# mkqdisk -L -d
>>> >     mkqdisk v0.6.0
>>> >     /dev/mapper/vg_qdisk-lv_qdisk:
>>> >     /dev/vg_qdisk/lv_qdisk:
>>> >              Magic:                eb7a62c2
>>> >              Label:                cluster_qdisk
>>> >              Created:              Thu Jun  7 09:23:34 2012
>>> >              Host:                 node1
>>> >              Kernel Sector Size:   512
>>> >              Recorded Sector Size: 512
>>> >
>>> >     Status block for node 1
>>> >              Last updated by node 2
>>> >              Last updated on Wed Jun 20 06:17:23 2012
>>> >              State: Evicted
>>> >              Flags: 0000
>>> >              Score: 0/0
>>> >              Average Cycle speed: 0.000500 seconds
>>> >              Last Cycle speed: 0.000000 seconds
>>> >              Incarnation: 4fe1a06c4fe1a06c
>>> >
>>> >     Status block for node 2
>>> >              Last updated by node 2
>>> >              Last updated on Wed Jun 20 07:09:38 2012
>>> >              State: Master
>>> >              Flags: 0000
>>> >              Score: 0/0
>>> >              Average Cycle speed: 0.001000 seconds
>>> >              Last Cycle speed: 0.000000 seconds
>>> >              Incarnation: 4fe1a06c4fe1a06c
>>> >
>>> >     On the other node I don't see any errors in /var/log/messages. One
>>> >     strange thing is that if I start cman on both nodes at the same
>>> >     time, everything works fine and both nodes quorate (until I reboot
>>> >     one node and the problem appears). I've checked that multicast is
>>> >     working properly: with iperf I can send and receive multicast
>>> >     packets, and with tcpdump I've seen the packets that openais sends
>>> >     when cman is trying to start. I've read about a bug in RH 5.3 with
>>> >     the same behaviour, but it was solved in RH 5.4.
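>>> >
>>> >     (For reference, the multicast check I did with iperf was something
>>> >     like the following; the group address is just an example, not
>>> >     necessarily the one cman uses:
>>> >
>>> >     # on node2: join the multicast group and listen
>>> >     iperf -s -u -B 239.192.21.1 -i 1
>>> >     # on node1: send UDP traffic to the group with a TTL above 1
>>> >     iperf -c 239.192.21.1 -u -T 32 -t 5
>>> >     )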
>>> >
>>> >     I don't have SELinux enabled, and iptables is also disabled. Here
>>> >     is the cluster.conf, simplified (with fewer services and resources).
>>> >     I want to point out one thing: I have allow_kill="0" in order to
>>> >     avoid fencing errors when qdiskd tries to fence a failed node. As
>>> >     <fence/> is empty, before adding this setting I got a lot of failed
>>> >     fencing messages in /var/log/messages.
>>> >
>>> >     <?xml version="1.0"?>
>>> >     <cluster alias="test_cluster" config_version="15" name="test_cluster">
>>> >         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="-1"/>
>>> >         <clusternodes>
>>> >             <clusternode name="node1-hb" nodeid="1" votes="1">
>>> >                 <fence/>
>>> >             </clusternode>
>>> >             <clusternode name="node2-hb" nodeid="2" votes="1">
>>> >                 <fence/>
>>> >             </clusternode>
>>> >         </clusternodes>
>>> >         <cman two_node="0" expected_votes="3"/>
>>> >         <fencedevices/>
>>> >         <rm log_facility="local4" log_level="7">
>>> >             <failoverdomains>
>>> >                 <failoverdomain name="test_cluster_fo" nofailback="1" ordered="1" restricted="1">
>>> >                     <failoverdomainnode name="node1-hb" priority="1"/>
>>> >                     <failoverdomainnode name="node2-hb" priority="2"/>
>>> >                 </failoverdomain>
>>> >             </failoverdomains>
>>> >             <resources/>
>>> >             <service autostart="1" domain="test_cluster_fo" exclusive="0" name="postgres" recovery="relocate">
>>> >                 <ip address="172.24.119.44" monitor_link="1"/>
>>> >                 <lvm name="vg_postgres" vg_name="vg_postgres" lv_name="postgres"/>
>>> >                 <fs device="/dev/vg_postgres/postgres" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>>> >                 <script file="/etc/init.d/postgresql" name="postgres"/>
>>> >             </service>
>>> >         </rm>
>>> >         <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
>>> >         <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10" votes="1">
>>> >             <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
>>> >         </quorumd>
>>> >     </cluster>
>>> >
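>>> >     (Whenever I change this file I bump config_version and push it to
>>> >     the other node; as far as I recall, on RHEL 5 that is:
>>> >
>>> >     ccs_tool update /etc/cluster/cluster.conf
>>> >     cman_tool version -r 16
>>> >
>>> >     with the new version number passed to cman_tool.)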
>>> >
>>> >     The /etc/hosts:
>>> >     172.24.119.10 node1
>>> >     172.24.119.34 node2
>>> >     15.15.2.10 node1-hb node1-hb.localdomain
>>> >     15.15.2.11 node2-hb node2-hb.localdomain
>>> >
>>> >     And the versions:
>>> >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>>> >     cman-2.0.115-85.el5
>>> >     rgmanager-2.0.52-21.el5
>>> >     openais-0.80.6-30.el5
>>> >
>>> >     I don't know what else I should try, so if you can give me some
>>> >     ideas, I would be very grateful.
>>> >
>>> >     Regards, Javi.
>>> >
>>> >     --
>>> >     Linux-cluster mailing list
>>> >     Linux-cluster at redhat.com
>>> >     https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>> >
>>> > --
>>> > this is my life and I live it for as long as God wills
>>> >
>>> > --
>>> > Linux-cluster mailing list
>>> > Linux-cluster at redhat.com
>>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>> >
>>> --
>>> Digimer
>>>
>>> Papers and Projects: https://alteeve.com
>>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> this is my life and I live it for as long as God wills
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>