[Linux-cluster] Node can't join already quorated cluster
emmanuel segura
emi2fast at gmail.com
Wed Jun 20 15:22:21 UTC 2012
Ok Javier,
So now I know that you don't want fencing, and why :-)
In that case, set clean_start="1" in fence_daemon:
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
and use the fence_manual agent.
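For reference, here is a rough sketch of how fence_manual could be wired into the cluster.conf quoted below. Node names come from Javier's config; the device name "manual" and the method name "1" are placeholders I picked, so adjust them as you like. This is only meant for testing, and as Digimer points out, manual fencing is not a supported setup.

```xml
<!-- Sketch only: fragment to merge into the existing <cluster> element.  -->
<!-- "manual" and method name "1" are placeholder names, not from the     -->
<!-- original config. Remember to increment config_version (15 -> 16).    -->
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
<clusternodes>
  <clusternode name="node1-hb" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device name="manual" nodename="node1-hb"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="node2-hb" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device name="manual" nodename="node2-hb"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice agent="fence_manual" name="manual"/>
</fencedevices>
```

With this in place, a failed node stays "being fenced" until someone confirms by hand that it is really down (on RHEL 5 that is done with fence_ack_manual; check its man page for the exact invocation).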
2012/6/20 Javier Vela <jvdiago at gmail.com>
> I don't use fencing because with HA-LVM I thought that I didn't need it.
> But also because both nodes are VMs in VMware. I know that there is a
> module to do fencing with VMware, but I prefer to avoid it. I'm not in
> control of the VMware infrastructure, and the VMware admins probably won't
> give me the tools to use this module.
>
> Regards, Javi
>
>
>> Fencing is critical, and running a cluster without fencing, even with
>> qdisk, is not supported. Manual fencing is also not supported. The
>> *only* way to have a reliable cluster, testing or production, is to use
>> fencing.
>>
>> Why do you not wish to use it?
>>
>> On 06/20/2012 09:43 AM, Javier Vela wrote:
>>
>> > As I read, if you use HA-LVM you don't need fencing because of VG
>> > tagging. Is it absolutely mandatory to use fencing with qdisk?
>> >
>> > If it is, I suppose I can use fence_manual, but in production I also
>> > won't use fencing.
>> >
>> > Regards, Javi.
>> >
>> > Date: Wed, 20 Jun 2012 14:45:28 +0200
>> > From: emi2fast at gmail.com
>> > To: linux-cluster at redhat.com
>> > Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>> >
>> > If you don't want to use a real fence device because you are only
>> > testing, you have to use the fence_manual agent.
>> >
>> > 2012/6/20 Javier Vela <jvdiago at gmail.com>
>> >
>> > Hi, I have a very strange problem, and after searching through a lot
>> > of forums, I haven't found the solution. This is the scenario:
>> >
>> > Two-node cluster with Red Hat 5.7, HA-LVM, no fencing and a quorum
>> > disk. I start qdiskd, cman and rgmanager on one node. After 5
>> > minutes, fencing finally finishes and the cluster becomes quorate
>> > with 2 votes:
>> >
>> > [root at node2 ~]# clustat
>> > Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
>> > Member Status: Quorate
>> >
>> >  Member Name                        ID   Status
>> >  ------ ----                        ---- ------
>> >  node1-hb                           1    Offline
>> >  node2-hb                           2    Online, Local, rgmanager
>> >  /dev/mapper/vg_qdisk-lv_qdisk      0    Online, Quorum Disk
>> >
>> >  Service Name          Owner (Last)          State
>> >  ------- ----          ----- ------          -----
>> >  service:postgres      node2                 started
>> >
>> > Now I start the second node. When cman reaches fencing, it hangs
>> > for approximately 5 minutes and finally fails. clustat says:
>> >
>> > [root at node1 ~]# clustat
>> > Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
>> > Member Status: Inquorate
>> >
>> >  Member Name                        ID   Status
>> >  ------ ----                        ---- ------
>> >  node1-hb                           1    Online, Local
>> >  node2-hb                           2    Offline
>> >  /dev/mapper/vg_qdisk-lv_qdisk      0    Offline
>> >
>> > And in /var/log/messages I can see these errors:
>> >
>> > Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>> > Jun 20 06:02:12 node1 openais[6098]: [CLM ] got nodejoin message 15.15.2.10
>> > Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check ccsd or cluster status
>> > Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
>> > Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check ccsd or cluster status
>> > Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>> > Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because I am the rep.
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for ring 15c
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep 15.15.2.10
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e received flag 1
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any messages in recovery.
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
>> > Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> > Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect: Connection refused
>> > Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
>> > Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
>> >
>> > And the quorum disk:
>> >
>> > [root at node2 ~]# mkqdisk -L -d
>> > mkqdisk v0.6.0
>> > /dev/mapper/vg_qdisk-lv_qdisk:
>> > /dev/vg_qdisk/lv_qdisk:
>> >     Magic: eb7a62c2
>> >     Label: cluster_qdisk
>> >     Created: Thu Jun 7 09:23:34 2012
>> >     Host: node1
>> >     Kernel Sector Size: 512
>> >     Recorded Sector Size: 512
>> >
>> >     Status block for node 1
>> >     Last updated by node 2
>> >     Last updated on Wed Jun 20 06:17:23 2012
>> >     State: Evicted
>> >     Flags: 0000
>> >     Score: 0/0
>> >     Average Cycle speed: 0.000500 seconds
>> >     Last Cycle speed: 0.000000 seconds
>> >     Incarnation: 4fe1a06c4fe1a06c
>> >
>> >     Status block for node 2
>> >     Last updated by node 2
>> >     Last updated on Wed Jun 20 07:09:38 2012
>> >     State: Master
>> >     Flags: 0000
>> >     Score: 0/0
>> >     Average Cycle speed: 0.001000 seconds
>> >     Last Cycle speed: 0.000000 seconds
>> >     Incarnation: 4fe1a06c4fe1a06c
>> >
>> >
>> > On the other node I don't see any errors in /var/log/messages. One
>> > strange thing is that if I start cman on both nodes at the same
>> > time, everything works fine and both nodes become quorate (until I
>> > reboot one node and the problem appears). I've checked that multicast
>> > is working properly: with iperf I can send and receive multicast
>> > packets. Moreover, with tcpdump I've seen the packets that openais
>> > sends when cman is trying to start. I've read about a bug in RH 5.3
>> > with the same behaviour, but it is fixed in RH 5.4.
>> >
>> > I don't have SELinux enabled, and iptables is also disabled. Here
>> > is the simplified cluster.conf (with fewer services and resources). I
>> > want to point out one thing: I have allow_kill="0" in order to avoid
>> > fencing errors when qdisk tries to fence a failed node. As <fence/>
>> > is empty, before adding this setting I got a lot of failed-fencing
>> > messages in /var/log/messages.
>> >
>> > <?xml version="1.0"?>
>> > <cluster alias="test_cluster" config_version="15" name="test_cluster">
>> >   <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="-1"/>
>> >   <clusternodes>
>> >     <clusternode name="node1-hb" nodeid="1" votes="1">
>> >       <fence/>
>> >     </clusternode>
>> >     <clusternode name="node2-hb" nodeid="2" votes="1">
>> >       <fence/>
>> >     </clusternode>
>> >   </clusternodes>
>> >   <cman two_node="0" expected_votes="3"/>
>> >   <fencedevices/>
>> >
>> >   <rm log_facility="local4" log_level="7">
>> >     <failoverdomains>
>> >       <failoverdomain name="etest_cluster_fo" nofailback="1" ordered="1" restricted="1">
>> >         <failoverdomainnode name="node1-hb" priority="1"/>
>> >         <failoverdomainnode name="node2-hb" priority="2"/>
>> >       </failoverdomain>
>> >     </failoverdomains>
>> >     <resources/>
>> >     <service autostart="1" domain="test_cluster_fo" exclusive="0" name="postgres" recovery="relocate">
>> >       <ip address="172.24.119.44" monitor_link="1"/>
>> >       <lvm name="vg_postgres" vg_name="vg_postgres" lv_name="postgres"/>
>> >       <fs device="/dev/vg_postgres/postgres" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
>> >       <script file="/etc/init.d/postgresql" name="postgres">
>> >       </script>
>> >     </service>
>> >   </rm>
>> >   <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
>> >   <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10" votes="1">
>> >     <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
>> >   </quorumd>
>> > </cluster>
>> >
>> >
>> > The /etc/hosts:
>> > 172.24.119.10 node1
>> > 172.24.119.34 node2
>> > 15.15.2.10 node1-hb node1-hb.localdomain
>> > 15.15.2.11 node2-hb node2-hb.localdomain
>> >
>> > And the versions:
>> > Red Hat Enterprise Linux Server release 5.7 (Tikanga)
>> > cman-2.0.115-85.el5
>> > rgmanager-2.0.52-21.el5
>> > openais-0.80.6-30.el5
>> >
>> > I don't know what else I should try, so if you can give me some
>> > ideas, I will be very pleased.
>> >
>> > Regards, Javi.
>> >
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster at redhat.com
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>> >
>> >
>> >
>> >
>> > --
>> > this is my life and I live it for as long as God wills
>> >
>> >
>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>>
>>
--
this is my life and I live it for as long as God wills