[Linux-cluster] Node can't join already quorated cluster

Wed Jun 20 15:44:00 UTC 2012

It's worth restating:

You are running an unsupported configuration. Please try to have the 
VMware admins enable fence calls against your nodes and set up fencing. 
Until and unless you do, you will almost certainly run into problems, up 
to and including corrupting your data.

Please take a minute to read this:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing
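
To make the gap concrete, here is a small Python sketch that reads a
cluster.conf and lists the nodes that have no fence device behind them.
It is only an illustration: the function name nodes_without_fencing is
made up for this example, it assumes the usual
clusternode/fence/method/device layout, and it says nothing about
whether a configured device actually works.

```python
import xml.etree.ElementTree as ET


def nodes_without_fencing(conf_xml):
    """Return names of <clusternode> entries with no usable fence device."""
    root = ET.fromstring(conf_xml)
    # Names of devices declared under <fencedevices>.
    declared = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    unfenced = []
    for node in root.findall("./clusternodes/clusternode"):
        # A node counts as covered only if one of its methods
        # references a device that is actually declared.
        used = {d.get("name") for d in node.findall("./fence/method/device")}
        if not (used & declared):
            unfenced.append(node.get("name"))
    return unfenced


# Trimmed copy of the configuration from the original post.
SAMPLE = """<cluster name="test_cluster" config_version="15">
  <clusternodes>
    <clusternode name="node1-hb" nodeid="1" votes="1"><fence/></clusternode>
    <clusternode name="node2-hb" nodeid="2" votes="1"><fence/></clusternode>
  </clusternodes>
  <fencedevices/>
</cluster>"""

print(nodes_without_fencing(SAMPLE))
```

With the configuration from this thread, both nodes are reported,
because every <fence/> element is empty and <fencedevices/> declares
nothing.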

Digimer

On 06/20/2012 11:22 AM, emmanuel segura wrote:
> OK Javier,
>
> So now I know that you don't want fencing, and why :-)
>
> In that case, set:
>
> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
>
> and use fence_manual.
>
>
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
>     I don't use fencing because with HA-LVM I thought that I didn't need
>     it, and also because both nodes are VMs in VMware. I know that there
>     is a module to do fencing with VMware, but I prefer to avoid it. I'm
>     not in control of the VMware infrastructure, and the VMware
>     admins probably won't give me the tools to use this module.
>
>     Regards, Javi
>
>         Fencing is critical, and running a cluster without fencing, even with
>         qdisk, is not supported. Manual fencing is also not supported. The
>         *only* way to have a reliable cluster, testing or production, is to use
>         fencing.
>
>         Why do you not wish to use it?
>
>         On 06/20/2012 09:43 AM, Javier Vela wrote:
>
>
>         > As I read, if you use HA-LVM you don't need fencing because of VG
>         > tagging. Is it absolutely mandatory to use fencing with qdisk?
>         >
>         > If it is, I suppose I can use fence_manual, but in production I also
>         > won't use fencing.
>         >
>         > Regards, Javi.
>         >
>         > Date: Wed, 20 Jun 2012 14:45:28 +0200
>         > From: emi2fast at gmail.com
>         > To: linux-cluster at redhat.com
>         > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>         >
>         > If you don't want to use a real fence device because you are only
>         > testing, you have to use the fence_manual agent.
>         >
>         > 2012/6/20 Javier Vela <jvdiago at gmail.com>
>         >
>         >     Hi, I have a very strange problem, and after searching through a lot
>         >     of forums, I haven't found the solution. This is the scenario:
>         >
>         >     Two-node cluster with Red Hat 5.7, HA-LVM, no fencing, and a quorum
>         >     disk. I start qdiskd, cman and rgmanager on one node. After 5
>         >     minutes, the fencing finally finishes and the cluster becomes
>         >     quorate with 2 votes:
>         >
>         >     [root at node2 ~]# clustat
>         >     Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>         >     Member Status: Quorate
>         >
>         >       Member Name                             ID   Status
>         >       ------ ----                             ---- ------
>         >       node1-hb                                   1 Offline
>         >       node2-hb                                   2 Online, Local, rgmanager
>         >       /dev/mapper/vg_qdisk-lv_qdisk              0 Online, Quorum Disk
>         >
>         >       Service Name                   Owner (Last)                   State
>         >       ------- ----                   ----- ------                   -----
>         >       service:postgres               node2                          started
>         >
>         >     Now I start the second node. When cman reaches fencing, it hangs
>         >     for approximately 5 minutes and finally fails. clustat says:
>         >
>         >     [root at node1 ~]# clustat
>         >     Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>         >     Member Status: Inquorate
>         >
>         >       Member Name                             ID   Status
>         >       ------ ----                             ---- ------
>         >       node1-hb                                   1 Online, Local
>         >       node2-hb                                   2 Offline
>         >       /dev/mapper/vg_qdisk-lv_qdisk              0 Offline
>         >
>         >     And in /var/log/messages I can see these errors:
>         >
>         >     Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>         >     Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message 15.15.2.10
>         >     Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check ccsd or cluster status
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>         >     Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check ccsd or cluster status
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because I am the rep.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for ring 15c
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep 15.15.2.10
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e received flag 1
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any messages in recovery.
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>         >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >     Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect: Connection refused
>         >     Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>         >     Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
>         >
>         >     And the quorum disk:
>         >
>         >     [root at node2 ~]# mkqdisk -L -d
>         >     mkqdisk v0.6.0
>         >     /dev/mapper/vg_qdisk-lv_qdisk:
>         >     /dev/vg_qdisk/lv_qdisk:
>         >              Magic:                eb7a62c2
>         >              Label:                cluster_qdisk
>         >              Created:              Thu Jun  7 09:23:34 2012
>         >              Host:                 node1
>         >              Kernel Sector Size:   512
>         >              Recorded Sector Size: 512
>         >
>         >     Status block for node 1
>         >              Last updated by node 2
>         >              Last updated on Wed Jun 20 06:17:23 2012
>         >              State: Evicted
>         >              Flags: 0000
>         >              Score: 0/0
>         >              Average Cycle speed: 0.000500 seconds
>         >              Last Cycle speed: 0.000000 seconds
>         >              Incarnation: 4fe1a06c4fe1a06c
>         >     Status block for node 2
>         >              Last updated by node 2
>         >              Last updated on Wed Jun 20 07:09:38 2012
>         >              State: Master
>         >              Flags: 0000
>         >              Score: 0/0
>         >              Average Cycle speed: 0.001000 seconds
>         >              Last Cycle speed: 0.000000 seconds
>         >              Incarnation: 4fe1a06c4fe1a06c
>         >
>         >
>         >     On the other node I don't see any errors in /var/log/messages. One
>         >     strange thing is that if I start cman on both nodes at the same
>         >     time, everything works fine and both nodes become quorate (until I
>         >     reboot one node and the problem appears). I've checked that
>         >     multicast is working properly: with iperf I can send and receive
>         >     multicast packets, and with tcpdump I've seen the packets that
>         >     openais sends when cman is trying to start. I've read about a bug
>         >     in RH 5.3 with the same behaviour, but it was fixed in RH 5.4.
>         >
>         >     I don't have SELinux enabled, and iptables is also disabled. Here
>         >     is the cluster.conf, simplified (with fewer services and
>         >     resources). I want to point out one thing: I have allow_kill="0"
>         >     in order to avoid fencing errors when the quorum disk daemon tries
>         >     to fence a failed node. As <fence/> is empty, before adding this
>         >     setting I got a lot of failed-fencing messages in
>         >     /var/log/messages.
>         >
>         >     <?xml version="1.0"?>
>         >     <cluster alias="test_cluster" config_version="15" name="test_cluster">
>         >             <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="-1"/>
>         >             <clusternodes>
>         >                     <clusternode name="node1-hb" nodeid="1" votes="1">
>         >                             <fence/>
>         >                     </clusternode>
>         >                     <clusternode name="node2-hb" nodeid="2" votes="1">
>         >                             <fence/>
>         >                     </clusternode>
>         >             </clusternodes>
>         >             <cman two_node="0" expected_votes="3"/>
>         >             <fencedevices/>
>         >
>         >             <rm log_facility="local4" log_level="7">
>         >                     <failoverdomains>
>         >                             <failoverdomain name="etest_cluster_fo" nofailback="1" ordered="1" restricted="1">
>         >                                     <failoverdomainnode name="node1-hb" priority="1"/>
>         >                                     <failoverdomainnode name="node2-hb" priority="2"/>
>         >                             </failoverdomain>
>         >                     </failoverdomains>
>         >                     <resources/>
>         >                     <service autostart="1" domain="test_cluster_fo" exclusive="0" name="postgres" recovery="relocate">
>         >                             <ip address="172.24.119.44" monitor_link="1"/>
>         >                             <lvm name="vg_postgres" vg_name="vg_postgres" lv_name="postgres"/>
>         >
>         >                             <fs device="/dev/vg_postgres/postgres" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>         >
>         >                             <script file="/etc/init.d/postgresql" name="postgres">
>         >                             </script>
>         >                     </service>
>         >             </rm>
>         >             <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
>         >             <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10" votes="1">
>         >                     <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
>         >             </quorumd>
>         >     </cluster>
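
For reference, the vote arithmetic behind the two clustat outputs quoted
above can be worked through as follows. This is a minimal Python sketch,
assuming cman's usual majority rule (quorum = expected_votes / 2 + 1,
using integer division). The function and constant names are made up
for this example and are not part of any cluster tool.

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum under a simple majority rule."""
    return expected_votes // 2 + 1


def is_quorate(votes_present, expected_votes):
    """True when the votes currently seen reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)


EXPECTED_VOTES = 3  # <cman expected_votes="3"/>
NODE_VOTES = 1      # votes="1" on each <clusternode>
QDISK_VOTES = 1     # votes="1" on <quorumd>

# node2 alone with the quorum disk online: 1 + 1 = 2 votes.
print(is_quorate(NODE_VOTES + QDISK_VOTES, EXPECTED_VOTES))

# node1 alone with the quorum disk shown as Offline: 1 vote.
print(is_quorate(NODE_VOTES, EXPECTED_VOTES))
```

The first case is quorate and the second is not, which matches the
"Quorate" status on node2 and the "Inquorate" status on node1. With
these settings a node needs to see either its peer or the quorum disk
before it can become quorate.
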
>         >
>         >
>         >     The /etc/hosts:
>         >     172.24.119.10 node1
>         >     172.24.119.34 node2
>         >     15.15.2.10 node1-hb node1-hb.localdomain
>         >     15.15.2.11 node2-hb node2-hb.localdomain
>         >
>         >     And the versions:
>         >     Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>         >     cman-2.0.115-85.el5
>         >     rgmanager-2.0.52-21.el5
>         >     openais-0.80.6-30.el5
>         >
>         >     I don't know what else I should try, so if you can give me some
>         >     ideas, I would be very grateful.
>         >
>         >     Regards, Javi.
>         >
>         >     --
>         >     Linux-cluster mailing list
>         >     Linux-cluster at redhat.com
>         >     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>         >
>         >
>         >
>         >
>         > --
>         > this is my life and I live it for as long as God wills
>
>
>         --
>         Digimer
>
>         Papers and Projects: https://alteeve.com
>
>
>
>
>
>
> --
> this is my life and I live it for as long as God wills
>
>
>

-- 
Digimer
Papers and Projects: https://alteeve.com