[Linux-cluster] Node can't join already quorated cluster
emmanuel segura
emi2fast at gmail.com
Wed Jun 20 13:59:18 UTC 2012
Fencing is a critical component of a cluster, and I think it is required.
A cluster without fencing is not a good idea, but as you know, that's your
choice
2012/6/20 Javier Vela <jvdiago at gmail.com>
> As I read, if you use HA-LVM you don't need fencing because of VG
> tagging. Is it absolutely mandatory to use fencing with qdisk?
>
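For context, the HA-LVM tagging I mean works by restricting which volume groups a host may activate in lvm.conf. A sketch, with illustrative VG and tag names:

```
# /etc/lvm/lvm.conf sketch: only VGs/LVs matching this list may be
# activated on this host; "@node1-hb" permits volume groups tagged
# with this node's name, which is how HA-LVM's tagging variant
# restricts access to the shared VG.
volume_list = [ "vg_root", "@node1-hb" ]
```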
> If it is, I suppose I can use fence_manual, but I won't use fencing in
> production either.
>
> Regards, Javi.
>
> Date: Wed, 20 Jun 2012 14:45:28 +0200
> From: emi2fast at gmail.com
> To: linux-cluster at redhat.com
> Subject: Re: [Linux-cluster] Node can't join already quorated cluster
>
>
> If you don't want to use a real fence device, because you are only doing
> some tests, you have to use the fence_manual agent
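For example, a minimal cluster.conf sketch for fence_manual could look like this (the method and device names here are just illustrative):

```xml
<clusternodes>
        <clusternode name="node1-hb" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <device name="human" nodename="node1-hb"/>
                        </method>
                </fence>
        </clusternode>
        <!-- same stanza for node2-hb -->
</clusternodes>
<fencedevices>
        <fencedevice agent="fence_manual" name="human"/>
</fencedevices>
```

After a node fails you then have to acknowledge the fence by hand on a surviving node with fence_ack_manual, which is why this agent is only suitable for testing.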
>
> 2012/6/20 Javier Vela <jvdiago at gmail.com>
>
> Hi, I have a very strange problem, and after searching through a lot of
> forums, I haven't found the solution. This is the scenario:
>
> Two-node cluster with Red Hat 5.7, HA-LVM, no fencing, and a quorum disk. I
> start qdiskd, cman and rgmanager on one node. After about 5 minutes the
> fencing step finally finishes and the cluster becomes quorate with 2 votes:
>
> [root@node2 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1-hb 1 Offline
> node2-hb 2 Online, Local, rgmanager
> /dev/mapper/vg_qdisk-lv_qdisk 0 Online, Quorum Disk
>
> Service Name Owner (Last) State
> ------- ---- ----- ------ -----
> service:postgres node2 started
>
> Now, I start the second node. When cman reaches the fencing step, it hangs
> for approximately 5 minutes and finally fails. clustat says:
>
> [root@node1 ~]# clustat
> Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
> Member Status: Inquorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1-hb 1 Online, Local
> node2-hb 2 Offline
> /dev/mapper/vg_qdisk-lv_qdisk 0 Offline
>
> And in /var/log/messages I can see these errors:
>
> Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> Jun 20 06:02:12 node1 openais[6098]: [CLM ] got nodejoin message
> 15.15.2.10
> Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check
> ccsd or cluster status
> Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
> Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check
> ccsd or cluster status
> Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
> Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because
> I am the rep.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for
> ring 15c
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member
> 15.15.2.10:
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep
> 15.15.2.10
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e
> received flag 1
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any
> messages in recovery.
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
> Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
> Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect:
> Connection refused
> Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
> Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing
> connection.
>
> And the quorum disk:
>
> [root@node2 ~]# mkqdisk -L -d
> mkqdisk v0.6.0
> /dev/mapper/vg_qdisk-lv_qdisk:
> /dev/vg_qdisk/lv_qdisk:
> Magic: eb7a62c2
> Label: cluster_qdisk
> Created: Thu Jun 7 09:23:34 2012
> Host: node1
> Kernel Sector Size: 512
> Recorded Sector Size: 512
>
> Status block for node 1
> Last updated by node 2
> Last updated on Wed Jun 20 06:17:23 2012
> State: Evicted
> Flags: 0000
> Score: 0/0
> Average Cycle speed: 0.000500 seconds
> Last Cycle speed: 0.000000 seconds
> Incarnation: 4fe1a06c4fe1a06c
> Status block for node 2
> Last updated by node 2
> Last updated on Wed Jun 20 07:09:38 2012
> State: Master
> Flags: 0000
> Score: 0/0
> Average Cycle speed: 0.001000 seconds
> Last Cycle speed: 0.000000 seconds
> Incarnation: 4fe1a06c4fe1a06c
>
>
> On the other node I don't see any errors in /var/log/messages. One strange
> thing is that if I start cman on both nodes at the same time, everything
> works fine and both nodes become quorate (until I reboot one node and the
> problem appears again). I've checked that multicast is working properly:
> with iperf I can send and receive multicast packets, and with tcpdump I've
> seen the packets that openais sends while cman is trying to start. I've
> read about a bug in RH 5.3 with the same behaviour, but it was solved in
> RH 5.4.
>
> I don't have SELinux enabled, and iptables is also disabled. Here is the
> cluster.conf, simplified (with fewer services and resources). I want to
> point out one thing: I have allow_kill="0" in order to avoid fencing errors
> when qdiskd tries to evict a failed node. As <fence/> is empty, before
> adding this setting I got a lot of messages in /var/log/messages about
> failed fencing.
>
> <?xml version="1.0"?>
> <cluster alias="test_cluster" config_version="15" name="test_cluster">
> <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="-1"/>
> <clusternodes>
> <clusternode name="node1-hb" nodeid="1" votes="1">
> <fence/>
> </clusternode>
> <clusternode name="node2-hb" nodeid="2" votes="1">
> <fence/>
> </clusternode>
> </clusternodes>
> <cman two_node="0" expected_votes="3"/>
> <fencedevices/>
>
> <rm log_facility="local4" log_level="7">
> <failoverdomains>
>                         <failoverdomain name="test_cluster_fo"
> nofailback="1" ordered="1" restricted="1">
> <failoverdomainnode name="node1-hb"
> priority="1"/>
> <failoverdomainnode name="node2-hb"
> priority="2"/>
> </failoverdomain>
> </failoverdomains>
> <resources/>
> <service autostart="1" domain="test_cluster_fo" exclusive="0"
> name="postgres" recovery="relocate">
> <ip address="172.24.119.44" monitor_link="1"/>
> <lvm name="vg_postgres" vg_name="vg_postgres"
> lv_name="postgres"/>
>
> <fs device="/dev/vg_postgres/postgres" force_fsck="1"
> force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres"
> self_fence="0"/>
>
> <script file="/etc/init.d/postgresql" name="postgres">
> </script>
> </service>
> </rm>
> <totem consensus="4000" join="60" token="20000"
> token_retransmits_before_loss_const="20"/>
> <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10"
> votes="1">
> <heuristic program="/usr/share/cluster/check_eth_link.sh
> eth0" score="1" interval="2" tko="3"/>
> </quorumd>
> </cluster>
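For what it's worth, the vote arithmetic I'm relying on in this config is cman's usual simple majority. A sketch of it (assuming the standard expected_votes // 2 + 1 rule):

```python
# Sketch of the vote arithmetic behind this cluster.conf: two 1-vote
# nodes plus a 1-vote quorum disk give expected_votes="3", and quorum
# is a simple majority of the expected votes.
def quorum_needed(expected_votes):
    # Simple majority: strictly more than half the expected votes.
    return expected_votes // 2 + 1

node_votes = 1 + 1        # node1-hb + node2-hb, votes="1" each
qdisk_votes = 1           # <quorumd ... votes="1">
expected = node_votes + qdisk_votes   # matches expected_votes="3"

print(quorum_needed(expected))  # 2: one node plus the qdisk can be quorate
```

So a single node holding the quorum disk should stay quorate, which is what I see on node2 while node1 refuses to join.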
>
>
> The /etc/hosts:
> 172.24.119.10 node1
> 172.24.119.34 node2
> 15.15.2.10 node1-hb node1-hb.localdomain
> 15.15.2.11 node2-hb node2-hb.localdomain
>
> And the versions:
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
>
> I don't know what else I should try, so if you can give me some ideas, I
> would be very grateful.
>
> Regards, Javi.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>
> --
> this is my life and I live it for as long as God wills
>
>
>
--
this is my life and I live it for as long as God wills