[Linux-cluster] RHCS 5.1 latest packages, 2-node cluster, doesn't come up with only 1 node

Celso K. Webber celso at webbertek.com.br
Fri Feb 8 13:18:20 UTC 2008


Hello all,

I'm having a situation here that might be a bug, or maybe it's a mistake on
my part.

* Scenario: 2-node cluster on Dell PE-2950 servers, Dell MD-3000 storage (SAS
direct-attach), using IPMI-LAN as the fencing devices, 2 NICs on each server
(public and heartbeat networks), using a quorum disk (qdisk) on the shared
storage.

* Problem: if I shut down one node and keep it shut down, and then reboot the
other node, CMAN eventually comes up after 5 minutes or so, but rgmanager does
not start.

I remember having this same problem with RHCS 4.4, and it was solved by
upgrading to 4.5. With RHCS 4.4, though, CMAN itself didn't come up; with my
setup on RHCS 5.1 CMAN does come up after giving up waiting for the other
node, but rgmanager doesn't, so services are not started. This is bad in an
unattended situation.

Here are some steps and details I've collected from the machine (sorry for
such a long message):

* Shut down node1

* Reboot node2
  - after boot, it sat for around 5 minutes at the "start fencing" message
  - reported a startup FAIL for the "cman" service after this period of time

* Boot completed

* Logged in:
  - clustat reported inquorate and quorum disk as "offline":
[root@mrp02 ~]# clustat
msg_open: No such file or directory
Member Status: Inquorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  node1                                 1 Offline
  node2                                 2 Online, Local
  /dev/sdc1                             0 Offline
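
  -> Side note: the vote counts could be double-checked at this point (output
not captured in this run); given the votes in the cluster.conf below, it
should show 3 expected votes and a quorum of 2:
[root@mrp02 ~]# cman_tool status | egrep 'Expected votes|Total votes|Quorum'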

* After a few seconds, clustat reported quorate and quorum disk as "online":
[root@mrp02 ~]# clustat
msg_open: No such file or directory
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  node1                                 1 Offline
  node2                                 2 Online, Local
  /dev/sdc1                             0 Online, Quorum Disk
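
  -> For reference, the vote arithmetic behind this transition (taking the
votes from the cluster.conf below): each node has 1 vote and the quorum disk
has 1 vote, so expected votes = 3 and the quorum threshold is floor(3/2) + 1
= 2. node2 alone has only 1 vote (Inquorate); once qdiskd registers the
quorum disk, node2 + qdisk = 2 votes, and cman regains quorum.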

* Logs in /var/log/messages showed that after qdiskd assumed "master role",
cman reported regaining quorum:
Feb  7 20:06:59 mrp02 qdiskd[5854]: <info> Assuming master role
Feb  7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate.  Refusing connection.
Feb  7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection
refused
Feb  7 20:07:00 mrp02 ccsd[5694]: Cluster is not quorate.  Refusing connection.
Feb  7 20:07:00 mrp02 ccsd[5694]: Error while processing connect: Connection
refused
Feb  7 20:07:00 mrp02 openais[5714]: [CMAN ] quorum regained, resuming activity
Feb  7 20:07:01 mrp02 clurgmgrd[7523]: <notice> Quorum formed, starting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  -> Note that rgmanager was started after quorum was regained, but it did not
seem to be running anymore later on (please see below).
Feb  7 20:07:01 mrp02 kernel: dlm: no local IP address has been set
Feb  7 20:07:01 mrp02 kernel: dlm: cannot start dlm lowcomms -107

* Noticed that in "clustat" there was an error message:
  -> msg_open: No such file or directory

* Checked rgmanager to see if it was related:
[root@mrp02 ~]# chkconfig --list rgmanager
rgmanager       0:off   1:off   2:on    3:on    4:on    5:on    6:off
[root@mrp02 ~]# service rgmanager status
clurgmgrd dead but pid file exists

* Since rgmanager did not come back by itself, I restarted it manually:
[root@mrp02 init.d]# service rgmanager restart
Starting Cluster Service Manager: dlm: Using TCP for communications
                                                           [  OK  ]

* This time clustat did not show the "msg_open" error anymore:
[root@mrp02 init.d]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  node1                                 1 Offline
  node2                                 2 Online, Local
  /dev/sdc1                             0 Online, Quorum Disk

* It seems to me that when cman regains quorum after having lost it, at least
when it starts in an initial "no quorum" state, rgmanager is not woken up. The
"cannot start dlm lowcomms" error right after the "Quorum formed, starting"
message above suggests clurgmgrd did start at that point but then died.

* This setup had no services configured, so I repeated the test after
configuring a simple start/stop/status service using the "crond" init script
as an example; same results.
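
* As a temporary workaround (until I understand whether this is expected), I'm
thinking of a small watchdog run from cron that restarts rgmanager when
clurgmgrd has died while the cluster is quorate. A rough, untested sketch,
keying off the "Member Status: Quorate" and "dead" strings shown above:

#!/bin/bash
# Rough, untested sketch: restart rgmanager if clurgmgrd has died
# while the cluster reports quorum (strings matched are the ones above).
if clustat 2>/dev/null | grep -q 'Member Status: Quorate'; then
    if service rgmanager status 2>&1 | grep -q 'dead'; then
        logger -t rgmanager-watchdog "clurgmgrd dead while quorate, restarting"
        service rgmanager restart
    fi
fi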

* Copy of /etc/cluster/cluster.conf:
  -> Notice: I'm using qdiskd with an "always ok" heuristic, since the customer
does not have an always-on IP tiebreaker device to use with a "ping" command
as the heuristic (a ping-heuristic sketch follows the config below).
<?xml version="1.0"?>
<cluster config_version="4" name="clu_mrp">
	<quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
		<heuristic interval="2" program="/bin/true" score="1"/>
	</quorumd>
	<fence_daemon post_fail_delay="40" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="node1" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device lanplus="1" name="node1-ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="node2" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device lanplus="1" name="node2-ipmi"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman deadnode_timer="38"/>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" auth="none" ipaddr="node1-ipmi"
login="root" name="node1-ipmi" passwd="xxx"/>
		<fencedevice agent="fence_ipmilan" auth="none" ipaddr="node2-ipmi"
login="root" name="node2-ipmi" passwd="xxx"/>
	</fencedevices>
	<rm>
		<failoverdomains/>
		<resources/>
	</rm>
</cluster>
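
  -> As mentioned above, if an always-on IP (e.g. the default gateway) ever
becomes available, the quorumd section could use a ping heuristic instead of
/bin/true. A rough sketch, with 192.168.0.1 as a placeholder address:

	<quorumd interval="1" label="clu_mrp" min_score="1" tko="30" votes="1">
		<heuristic interval="2" program="ping -c1 -w1 192.168.0.1" score="1"/>
	</quorumd>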


Could someone tell me whether this is expected behaviour? Shouldn't rgmanager
start up automatically in this case?

Thank you all,

Celso.

-- 
*Celso Kopp Webber*

celso at webbertek.com.br <mailto:celso at webbertek.com.br>

*Webbertek - Opensource Knowledge*
(41) 8813-1919 - mobile
(41) 4063-8448, extension 102 - landline





