[Linux-cluster] RHCS 5.1 latest packages, 2-node cluster, doesn't come up with only 1 node
Celso K. Webber
celso at webbertek.com.br
Fri Feb 8 13:25:09 UTC 2008
Hello,
I forgot to add some versioning information from the Cluster packages, here
they are:
* Main cluster packages:
cman-2.0.73-1.el5_1.1.x86_64.rpm
openais-0.80.3-7.el5.x86_64.rpm
perl-Net-Telnet-3.03-5.noarch.rpm
* Admin tools packages:
Cluster_Administration-en-US-5.1.0-7.noarch.rpm
cluster-cim-0.10.0-5.el5_1.1.x86_64.rpm
cluster-snmp-0.10.0-5.el5_1.1.x86_64.rpm
luci-0.10.0-6.el5.x86_64.rpm
modcluster-0.10.0-5.el5_1.1.x86_64.rpm
rgmanager-2.0.31-1.el5.x86_64.rpm
ricci-0.10.0-6.el5.x86_64.rpm
system-config-cluster-1.0.50-1.3.noarch.rpm
tog-pegasus-2.6.1-2.el5_1.1.*.rpm
oddjob-*.rpm
Thank you,
Celso.
On Fri, 8 Feb 2008 11:18:20 -0200, Celso K. Webber wrote
> Hello all,
>
> I'm having a situation here that might be a bug, or maybe it's a mistake
> on my part.
>
> * Scenario: 2-node cluster on Dell PE-2950 servers, Dell MD-3000
> storage (SAS direct-attach), using IPMI-Lan as fencing devices, 2
> NICs on each server
> (public and heartbeat networks), using Qdisk in the shared storage
>
> * Problem: if I shut down one node and keep it shut down, and then
> reboot the other node, although CMAN comes up after 5 minutes or so,
> rgmanager does not start.
>
> I remember having this same problem with RHCS 4.4, but it was solved
> by upgrading to 4.5. With RHCS 4.4, CMAN didn't come up at all; with my
> setup in RHCS 5.1, CMAN comes up after giving up waiting for the other
> node, but rgmanager doesn't, so services do not get started. This is bad
> in an unattended situation.
>
> Here are some steps and details I've collected from the machine
> (sorry for such a long message):
>
> * Shutdown node1
>
> * Reboot node2
> - after boot, it sat for around 5 minutes at the "start fencing" message
> - reported a startup FAIL for the "cman" service after this period
> of time
>
> * Boot completed
>
> * Logged in:
> - clustat reported inquorate and quorum disk as "offline":
> [root at mrp02 ~]# clustat
> msg_open: No such file or directory
> Member Status: Inquorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1 1 Offline
> node2 2 Online, Local
> /dev/sdc1 0 Offline
>
> * After a few seconds, clustat reported quorate and quorum disk as "online":
> [root at mrp02 ~]# clustat
> msg_open: No such file or directory
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1 1 Offline
> node2 2 Online, Local
> /dev/sdc1 0 Online, Quorum Disk
>
> * Logs in /var/log/messages showed that after qdiskd assumed "master
> role", cman reported regaining quorum:
> Feb 7 20:06:59 mrp02 qdiskd[5854]: <info> Assuming master role
> Feb 7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate. Refusing connection.
>
> Feb 7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection
> refused
>
> Feb 7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate. Refusing connection.
>
> Feb 7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection
> refused
>
> Feb 7 20:07:00 mrp02 openais[5714]: [CMAN ] quorum regained,
> resuming activity
> Feb 7 20:07:01 mrp02 clurgmgrd[7523]: <notice> Quorum formed, starting
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> -> Note that rgmanager started after quorum was regained, but it
> seemed to stop working later on (please see below).
> Feb 7 20:07:01 mrp02 kernel: dlm: no local IP address has been set
> Feb 7 20:07:01 mrp02 kernel: dlm: cannot start dlm lowcomms -107
>
> * Noticed that in "clustat" there was an error message:
> -> msg_open: No such file or directory
>
> * Checked rgmanager to see if it was related:
> [root at mrp02 ~]# chkconfig --list rgmanager
> rgmanager 0:off 1:off 2:on 3:on 4:on 5:on 6:off
> [root at mrp02 ~]# service rgmanager status
> clurgmgrd dead but pid file exists
>
> * Since rgmanager did not come back by itself, restarted it manually:
> [root at mrp02 init.d]# service rgmanager restart
> Starting Cluster Service Manager: dlm: Using TCP for communications
> [ OK ]
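Since this cluster must run unattended, a small watchdog could paper over the problem until the root cause is fixed. A sketch along these lines (the "dead but pid file exists" string is the one seen in the session above; running it from cron and the restart policy are my assumptions, not something the cluster tools provide):

```shell
#!/bin/sh
# Watchdog sketch: if clurgmgrd has died but left its pid file behind
# (the failure mode observed above), restart rgmanager.

needs_restart() {
    # $1 = output of "service rgmanager status"
    case "$1" in
        *"dead but pid file exists"*) return 0 ;;  # dead -> restart needed
        *) return 1 ;;                             # running, or stopped on purpose
    esac
}

# In a real deployment this would run from cron, e.g. once a minute:
#   status=$(service rgmanager status 2>&1)
#   needs_restart "$status" && service rgmanager restart
```

This only masks the symptom; it does not explain why rgmanager is not woken up when quorum is regained.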
>
> * This time clustat did not show the "msg_open" error anymore:
> [root at mrp02 init.d]# clustat
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> node1 1 Offline
> node2 2 Online, Local
> /dev/sdc1 0 Online, Quorum Disk
>
> * It seems to me that when cman regains quorum after starting in an
> initial "no quorum" state, rgmanager is not "woken up"
>
> * This setup had no services configured, so I repeated the test
> configuring a simple start/stop/status service using the "crond"
> service as an example, same results
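For reference, a test service of that kind would sit inside the <rm> section roughly as follows (a reconstruction, not the exact config used; the service name and autostart attribute are illustrative):

```xml
<rm>
  <failoverdomains/>
  <resources/>
  <service autostart="1" name="test_crond">
    <script file="/etc/init.d/crond" name="crond"/>
  </service>
</rm>
```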
>
> * Copy of /etc/cluster/cluster.conf:
> -> Notice: I'm using Qdiskd with an "always ok" heuristic, since the
> customer does not have an "always-on" IP tiebreaker device to use with
> a "ping" command as the heuristic.
>
> <?xml version="1.0"?>
> <cluster config_version="4" name="clu_mrp">
>   <quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
>     <heuristic interval="2" program="/bin/true" score="1"/>
>   </quorumd>
>   <fence_daemon post_fail_delay="40" post_join_delay="3"/>
>   <clusternodes>
>     <clusternode name="node1" nodeid="1" votes="1">
>       <fence>
>         <method name="1">
>           <device lanplus="1" name="node1-ipmi"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="node2" nodeid="2" votes="1">
>       <fence>
>         <method name="1">
>           <device lanplus="1" name="node2-ipmi"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <cman deadnode_timer="38"/>
>   <fencedevices>
>     <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node1-ipmi"
>       login="root" name="node1-ipmi" passwd="xxx"/>
>     <fencedevice agent="fence_ipmilan" auth="none" ipaddr="node2-ipmi"
>       login="root" name="node2-ipmi" passwd="xxx"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains/>
>     <resources/>
>   </rm>
> </cluster>
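For comparison, if a reliable IP tiebreaker (such as a gateway) had been available, the quorumd block could use a real heuristic instead of /bin/true, along these lines (the address is a placeholder and the exact ping flags are an assumption):

```xml
<quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
  <heuristic interval="2" program="ping -c1 -w1 192.168.0.1" score="1"/>
</quorumd>
```

With /bin/true the heuristic always passes, so qdisk scoring cannot distinguish a healthy node from an isolated one.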
>
> Could someone tell me whether this is expected behaviour? Shouldn't
> rgmanager start up automatically in this case?
>
> Thank you all,
>
> Celso.
>
> --
> *Celso Kopp Webber*
>
> celso at webbertek.com.br <mailto:celso at webbertek.com.br>
>
> *Webbertek - Opensource Knowledge*
> (41) 8813-1919 - mobile
> (41) 4063-8448, extension 102 - landline
>
> --
> This message was checked by the antivirus system
> and is believed to be safe.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
--
*Celso Kopp Webber*
celso at webbertek.com.br <mailto:celso at webbertek.com.br>
*Webbertek - Opensource Knowledge*
(41) 8813-1919 - mobile
(41) 4063-8448, extension 102 - landline
--
This message was checked by the antivirus system
and is believed to be safe.