[Linux-cluster] RHCS 5.1 latest packages, 2-node cluster, doesn't come up with only 1 node
Celso K. Webber
celso at webbertek.com.br
Fri Feb 8 13:25:09 UTC 2008
Hello,
I forgot to add some versioning information from the Cluster packages, here
they are:
* Main cluster packages:
cman-2.0.73-1.el5_1.1.x86_64.rpm
openais-0.80.3-7.el5.x86_64.rpm
perl-Net-Telnet-3.03-5.noarch.rpm
* Admin tools packages:
Cluster_Administration-en-US-5.1.0-7.noarch.rpm
cluster-cim-0.10.0-5.el5_1.1.x86_64.rpm
cluster-snmp-0.10.0-5.el5_1.1.x86_64.rpm
luci-0.10.0-6.el5.x86_64.rpm
modcluster-0.10.0-5.el5_1.1.x86_64.rpm
rgmanager-2.0.31-1.el5.x86_64.rpm
ricci-0.10.0-6.el5.x86_64.rpm
system-config-cluster-1.0.50-1.3.noarch.rpm
tog-pegasus-2.6.1-2.el5_1.1.*.rpm
oddjob-*.rpm
Thank you,
Celso.
On Fri, 8 Feb 2008 11:18:20 -0200, Celso K. Webber wrote
> Hello all,
>
> I'm having a situation here that might be a bug, or maybe it's a mistake
> on my part.
>
> * Scenario: 2-node cluster on Dell PE-2950 servers, Dell MD-3000
> storage (SAS direct-attach), using IPMI-Lan as fencing devices, 2
> NICs on each server
> (public and heartbeat networks), using Qdisk in the shared storage
>
> * Problem: if I shut down one node and keep it shut down, and then
> reboot the other node, although CMAN comes up after 5 minutes or so,
> rgmanager does not start.
>
> I remember having this same problem with RHCS 4.4, but it was solved
> by upgrading to 4.5. With RHCS 4.4, CMAN didn't come up at all; with my
> setup in RHCS 5.1, CMAN comes up after giving up waiting for the other
> node, but rgmanager doesn't, so services do not get started. This is bad
> in an unattended situation.
>
> Here are some steps and details I've collected from the machine
> (sorry for such a long message):
>
> * Shutdown node1
>
> * Reboot node2
> - after boot, it sat for around 5 minutes at the "start fencing" message
> - reported a startup FAIL for the "cman" service after this period
> of time
>
> * Boot completed
>
> * Logged in:
> - clustat reported inquorate and quorum disk as "offline":
> [root at mrp02 ~]# clustat
> msg_open: No such file or directory
> Member Status: Inquorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1 1 Offline
> node2 2 Online, Local
> /dev/sdc1 0 Offline
>
> * After a few seconds, clustat reported quorate and quorum disk as "online":
> [root at mrp02 ~]# clustat
> msg_open: No such file or directory
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1 1 Offline
> node2 2 Online, Local
> /dev/sdc1 0 Online, Quorum Disk
>
> * Logs in /var/log/messages showed that after qdiskd assumed "master
> role", cman reported regaining quorum:
> Feb 7 20:06:59 mrp02 qdiskd[5854]: <info> Assuming master role
> Feb 7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate. Refusing connection.
>
> Feb 7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection
> refused
>
> Feb 7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate. Refusing connection.
>
> Feb 7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection
> refused
>
> Feb 7 20:07:00 mrp02 openais[5714]: [CMAN ] quorum regained,
> resuming activity
> Feb 7 20:07:01 mrp02 clurgmgrd[7523]: <notice> Quorum formed, starting
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> -> Note that rgmanager started after quorum was regained, but it
> seemed to stop working later on (please see below).
> Feb 7 20:07:01 mrp02 kernel: dlm: no local IP address has been set
> Feb 7 20:07:01 mrp02 kernel: dlm: cannot start dlm lowcomms -107
>
> * Noticed that in "clustat" there was an error message:
> -> msg_open: No such file or directory
>
> * Checked rgmanager to see if it was related:
> [root at mrp02 ~]# chkconfig --list rgmanager
> rgmanager 0:off 1:off 2:on 3:on 4:on 5:on 6:off
> [root at mrp02 ~]# service rgmanager status
> clurgmgrd dead but pid file exists
>
> * Since rgmanager did not come back by itself, restarted it manually:
> [root at mrp02 init.d]# service rgmanager restart
> Starting Cluster Service Manager: dlm: Using TCP for communications
> [ OK ]
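Since this cluster must run unattended, a small watchdog could paper over the problem until the root cause is fixed. A sketch along these lines (the "dead but pid file exists" string is the one seen in the session above; running it from cron and the restart policy are my assumptions, not something the cluster tools provide):

```shell
#!/bin/sh
# Watchdog sketch: if clurgmgrd has died but left its pid file behind
# (the failure mode observed above), restart rgmanager.

needs_restart() {
    # $1 = output of "service rgmanager status"
    case "$1" in
        *"dead but pid file exists"*) return 0 ;;  # dead -> restart needed
        *) return 1 ;;                             # running, or stopped on purpose
    esac
}

# In a real deployment this would run from cron, e.g. once a minute:
#   status=$(service rgmanager status 2>&1)
#   needs_restart "$status" && service rgmanager restart
```

This only masks the symptom; it does not explain why rgmanager is not woken up when quorum is regained.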
>
> * This time clustat did not show the "msg_open" error anymore:
> [root at mrp02 init.d]# clustat
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1 1 Offline
> node2 2 Online, Local
> /dev/sdc1 0 Online, Quorum Disk
>
> * It seems to me that when cman regains quorum after starting in an
> initial "no quorum" state, rgmanager is not "woken up"
>
> * This setup had no services configured, so I repeated the test
> configuring a simple start/stop/status service using the "crond"
> service as an example, same results
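For reference, a test service of that kind would sit inside the <rm> section roughly as follows (a reconstruction, not the exact config used; the service name and autostart attribute are illustrative):

```xml
<rm>
  <failoverdomains/>
  <resources/>
  <service autostart="1" name="test_crond">
    <script file="/etc/init.d/crond" name="crond"/>
  </service>
</rm>
```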
>
> * Copy of /etc/cluster/cluster.conf:
> -> Notice: I'm using Qdiskd with an "always ok" heuristic, since the
> customer does not have an "always-on" IP tiebreaker device to use with
> a "ping" command as the heuristic.
>
> <?xml version="1.0"?>
> <cluster config_version="4" name="clu_mrp">
>   <quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
>     <heuristic interval="2" program="/bin/true" score="1"/>
>   </quorumd>
>   <fence_daemon post_fail_delay="40" post_join_delay="3"/>
>   <clusternodes>
>     <clusternode name="node1" nodeid="1" votes="1">
>       <fence>
>         <method name="1">
>           <device lanplus="1" name="node1-ipmi"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="node2" nodeid="2" votes="1">
>       <fence>
>         <method name="1">
>           <device lanplus="1" name="node2-ipmi"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <cman deadnode_timer="38"/>
>   <fencedevices>
>     <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node1-ipmi"
>       login="root" name="node1-ipmi" passwd="xxx"/>
>     <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node2-ipmi"
>       login="root" name="node2-ipmi" passwd="xxx"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains/>
>     <resources/>
>   </rm>
> </cluster>
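For comparison, if a reliable IP tiebreaker (such as a gateway) had been available, the quorumd block could use a real heuristic instead of /bin/true, along these lines (the address is a placeholder and the exact ping flags are an assumption):

```xml
<quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
  <heuristic interval="2" program="ping -c1 -w1 192.168.0.1" score="1"/>
</quorumd>
```

With /bin/true the heuristic always passes, so qdisk scoring cannot distinguish a healthy node from an isolated one.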
>
> Could someone tell me whether this is expected behaviour? Shouldn't
> rgmanager start up automatically in this case?
>
> Thank you all,
>
> Celso.
>
> --
> *Celso Kopp Webber*
>
> celso at webbertek.com.br <mailto:celso at webbertek.com.br>
>
> *Webbertek - Opensource Knowledge*
> (41) 8813-1919 - mobile
> (41) 4063-8448, extension 102 - landline
>
> --
> This message was checked by the antivirus system
> and is believed to be safe.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
--
*Celso Kopp Webber*
celso at webbertek.com.br <mailto:celso at webbertek.com.br>
*Webbertek - Opensource Knowledge*
(41) 8813-1919 - mobile
(41) 4063-8448, extension 102 - landline
--
This message was checked by the antivirus system
and is believed to be safe.