As I have read, if you use HA-LVM you don't need fencing because of VG tagging. Is it absolutely mandatory to use fencing with qdisk? If it is, I suppose I can use fence_manual, but in production I won't use fencing either.

Regards, Javi.

Date: Wed, 20 Jun 2012 14:45:28 +0200
From: <a href="mailto:emi2fast@gmail.com">emi2fast@gmail.com</a>
To: <a href="mailto:linux-cluster@redhat.com">linux-cluster@redhat.com</a>
Subject: Re: [Linux-cluster] Node can't join already quorated cluster

If you don't want to use a real fence device because you are only running some tests, you have to use the fence_manual agent.

<div class="ecxgmail_quote">2012/6/20 Javier Vela <<a href="mailto:jvdiago@gmail.com">jvdiago@gmail.com</a>>
<blockquote class="ecxgmail_quote" style="border-left:1px #ccc solid;padding-left:1ex">Hi,

I have a very strange problem, and after searching through a lot of forums, I haven't found the solution. This is the scenario:

Two-node cluster with Red Hat 5.7, HA-LVM, no fencing and a quorum disk.

I start qdiskd, cman and rgmanager on one node. After 5 minutes, the fencing finally finishes and the cluster becomes quorate with 2 votes:

[root@node2 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
Member Status: Quorate

 Member Name                       ID   Status
 ------ ----                       ---- ------
 node1-hb                          1    Offline
 node2-hb                          2    Online, Local, rgmanager
 /dev/mapper/vg_qdisk-lv_qdisk     0    Online, Quorum Disk

 Service Name         Owner (Last)         State
 ------- ----         ----- ------         -----
 service:postgres     node2                started

Now I start the second node. When cman reaches fencing, it hangs for approximately 5 minutes and finally fails. clustat says:

[root@node1 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
Member Status: Inquorate

 Member Name                       ID   Status
 ------ ----                       ---- ------
 node1-hb                          1    Online, Local
 node2-hb                          2    Offline
 /dev/mapper/vg_qdisk-lv_qdisk     0    Offline

And in /var/log/messages I can see these errors:

Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:12 node1 openais[6098]: [CLM ] got nodejoin message 15.15.2.10
Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because I am the rep.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for ring 15c
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep 15.15.2.10
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e received flag 1
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any messages in recovery.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.
Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate. Refusing connection.

And the quorum disk:

[root@node2 ~]# mkqdisk -L -d
mkqdisk v0.6.0
/dev/mapper/vg_qdisk-lv_qdisk:
/dev/vg_qdisk/lv_qdisk:
        Magic:                eb7a62c2
        Label:                cluster_qdisk
        Created:              Thu Jun 7 09:23:34 2012
        Host:                 node1
        Kernel Sector Size:   512
        Recorded Sector Size: 512

Status block for node 1
        Last updated by node 2
        Last updated on Wed Jun 20 06:17:23 2012
        State: Evicted
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.000500 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c
Status block for node 2
        Last updated by node 2
        Last updated on Wed Jun 20 07:09:38 2012
        State: Master
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.001000 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c

On the other node I don't see any errors in /var/log/messages. One strange thing is that if I start cman on both nodes at the same time, everything works fine and both nodes become quorate (until I reboot one node and the problem appears again). I've checked that multicast is working properly: with iperf I can send and receive multicast packets. Moreover, with tcpdump I've seen the packets that openais sends when cman is trying to start. I've read about a bug in RH 5.3 with the same behaviour, but it is fixed in RH 5.4.

I don't have SELinux enabled, and iptables is also disabled. Here is the cluster.conf, simplified (with fewer services and resources). I want to point out one thing: I have allow_kill="0" in order to avoid fencing errors when the quorum daemon tries to fence a failed node. As <fence/> is empty, before adding this setting I got a lot of failed-fencing messages in /var/log/messages.

<?xml version="1.0"?>
<cluster alias="test_cluster" config_version="15" name="test_cluster">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="-1"/>
    <clusternodes>
        <clusternode name="node1-hb" nodeid="1" votes="1">
            <fence/>
        </clusternode>
        <clusternode name="node2-hb" nodeid="2" votes="1">
            <fence/>
        </clusternode>
    </clusternodes>
    <cman two_node="0" expected_votes="3"/>
    <fencedevices/>
    <rm log_facility="local4" log_level="7">
        <failoverdomains>
            <failoverdomain name="etest_cluster_fo" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="node1-hb" priority="1"/>
                <failoverdomainnode name="node2-hb" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources/>
        <service autostart="1" domain="test_cluster_fo" exclusive="0" name="postgres" recovery="relocate">
            <ip address="172.24.119.44" monitor_link="1"/>
            <lvm name="vg_postgres" vg_name="vg_postgres" lv_name="postgres"/>
            <fs device="/dev/vg_postgres/postgres" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>
            <script file="/etc/init.d/postgresql" name="postgres"> </script>
        </service>
    </rm>
    <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
    <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10" votes="1">
        <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
    </quorumd>
</cluster>

The /etc/hosts:

172.24.119.10 node1
172.24.119.34 node2
15.15.2.10 node1-hb node1-hb.localdomain
15.15.2.11 node2-hb node2-hb.localdomain

And the versions:

Red Hat Enterprise Linux Server release 5.7 (Tikanga)
cman-2.0.115-85.el5
rgmanager-2.0.52-21.el5
openais-0.80.6-30.el5

I don't know what else I should try, so if you can give me some ideas, I would be very grateful.

Regards, Javi.

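For readers following the vote counts in the cluster.conf above, here is a minimal sketch of the quorum arithmetic. It assumes cman requires a simple majority of expected_votes, that is, floor(expected_votes / 2) + 1. The function names are hypothetical and only illustrate the calculation; the vote values come from the configuration (two nodes with one vote each, a quorum disk with one vote, expected_votes="3").

```python
# Sketch of the vote arithmetic, assuming a simple-majority quorum rule.
# Function names are illustrative, not part of cman or qdiskd.

def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum under a simple-majority rule."""
    return expected_votes // 2 + 1


def is_quorate(present_votes: list[int], expected_votes: int) -> bool:
    """True if the votes currently visible reach the quorum threshold."""
    return sum(present_votes) >= quorum_threshold(expected_votes)


EXPECTED_VOTES = 3  # <cman expected_votes="3"/>
NODE_VOTE = 1       # votes="1" on each clusternode
QDISK_VOTE = 1      # votes="1" on quorumd

if __name__ == "__main__":
    # node2 running alone but holding the quorum disk
    print("node2 + qdisk quorate:", is_quorate([NODE_VOTE, QDISK_VOTE], EXPECTED_VOTES))
    # node1 seeing neither node2 nor the quorum disk
    print("node1 alone quorate:", is_quorate([NODE_VOTE], EXPECTED_VOTES))
```

Under that assumption, one node plus the quorum disk reaches 2 votes and is quorate, which matches the clustat output on node2. A node that sees neither its peer nor the quorum disk has only 1 vote and stays inquorate, as on node1.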
--
Linux-cluster mailing list
<a href="mailto:Linux-cluster@redhat.com">Linux-cluster@redhat.com</a>
<a href="https://www.redhat.com/mailman/listinfo/linux-cluster" target="_blank">https://www.redhat.com/mailman/listinfo/linux-cluster</a>
</blockquote></div>

--
this is my life and I live it for as long as God wills
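
The quorumd and totem settings in the quoted cluster.conf can also be checked against each other. The sketch below assumes the guideline from the Red Hat cluster documentation that the CMAN membership timeout (totem token) should be at least twice the qdiskd timeout (interval * tko). The helper names are hypothetical, and this check is not presented as the cause of the join failure.

```python
# Sketch of a timeout sanity check, assuming the "token >= 2 x qdiskd timeout"
# guideline. Helper names are illustrative only.

def qdisk_timeout_seconds(interval_s: int, tko: int) -> int:
    """Seconds of missed updates before qdiskd declares a node dead."""
    return interval_s * tko


def token_is_long_enough(token_ms: int, interval_s: int, tko: int,
                         factor: float = 2.0) -> bool:
    """True if the totem token timeout is at least `factor` times the qdiskd timeout."""
    return token_ms / 1000.0 >= factor * qdisk_timeout_seconds(interval_s, tko)


if __name__ == "__main__":
    # Values from the posted config: <totem token="20000"/>,
    # <quorumd interval="1" tko="10"/>
    print("qdiskd timeout (s):", qdisk_timeout_seconds(1, 10))
    print("token long enough:", token_is_long_enough(20000, 1, 10))
```

With the posted values the qdiskd timeout is 10 seconds and the token timeout is 20 seconds, so the configuration sits exactly at the assumed minimum ratio.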