[Linux-cluster] Rebooting the Master Node in an RHCS Cluster

Tue Oct 26 12:52:41 UTC 2010

Hello,

Was wondering if anyone else has ever run into this.  We have a three-node RHCS cluster:

Three ProLiant DL380 G6s, 48G memory each
Dual network, power, QLogic HBAs for redundancy
EMC SAN
RHEL 5.5  kernel 2.6.18-194.el5

All three in an RHCS cluster, 12 Oracle database services.  The cluster itself runs fine under normal conditions, and all failovers function as expected.  There is only one failover domain configured, and all three nodes are members of that domain.  Four of the Oracle database services contain GFS2 file systems; the rest are ext3.

The problem is when we attempt a controlled shutdown of the current master node.  We have tested in the following situations:

1.  Node 1 is the current master and not running any services.  Node 2 is also not running any services.  Node 3 is running all 12 services.  We hard-fail node 1 (by logging into the iLO and clicking "Reset" in power management).  Node 2 immediately takes over the master role, and the services stay where they are and continue to function.  I believe this is the expected behavior.

2.  Node 1 is the current master and not running any services.  Three services are on node 2, and node 3 is running the rest.  Again, we hard-fail node 1 as described above and node 2 assumes the master role and the services stay where they are and continue to function.

3.  Same starting state as test 2: node 1 is the master and not running any services, three services are on node 2, and the rest are on node 3.  This time we perform a controlled shutdown of node 1 to "properly" remove it from the cluster (say we're doing a rolling patch of the OS on the nodes), with the following steps on the master node:
 - Unmount any GFS file systems.
 - service rgmanager stop; service gfs2 stop; service gfs stop  (clustat shows node1 Online but no rgmanager, as expected)
 - fence_tool leave    (this removes node 1 from the fence group in the hopes that the other nodes don't try to fence it as it is rebooting)
 - service clvmd stop
 - cman_tool leave remove
 - service qdiskd stop
 - shutdown
Everything appears normal until we execute the 'cman_tool leave remove'.  At that point the cluster log on node 2 and node 3 shows "Lost contact with quorum device" (we expect that) but also shows "Emergency stop of services" for all 12 services.  Access to the quorum device is restored almost immediately (node 2 takes over the master role), but rgmanager is temporarily unavailable on nodes 2 and 3 while the cluster basically reconfigures itself and restarts all 12 services.  Eventually all 12 services restart properly (not necessarily on the node they were originally on), and when node 1 finishes rebooting, it rejoins the cluster without problems.  Node 2 remains the master.
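
For anyone who wants to reproduce test 3, the steps listed above can be sketched as a small script.  This is only a transcription of the sequence in this message, not a recommended procedure.  The names DRY_RUN, run and leave_cluster are made up for illustration, and two details are assumptions because the steps do not spell them out: the exact umount invocation and the arguments to shutdown.  By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the test-3 shutdown sequence from this message.
# DRY_RUN, run() and leave_cluster() are hypothetical names, not RHCS tooling.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

leave_cluster() {
    # Unmount GFS/GFS2 file systems. The message does not say how this was
    # done; "umount -a -t" is an assumption.
    run umount -a -t gfs,gfs2   || return 1
    run service rgmanager stop  || return 1
    run service gfs2 stop       || return 1
    run service gfs stop        || return 1
    # Leave the fence domain so the other nodes do not try to fence this one.
    run fence_tool leave        || return 1
    run service clvmd stop      || return 1
    # The step after which nodes 2 and 3 logged "Emergency stop of services".
    run cman_tool leave remove  || return 1
    run service qdiskd stop     || return 1
    # The message only says "shutdown"; "-r now" is an assumption.
    run shutdown -r now
}
```

Calling leave_cluster with the default DRY_RUN=1 prints the nine commands in order; DRY_RUN=0 would actually run them on the node, stopping at the first command that fails.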

If I do the same tests as above and reboot a node that is NOT the master, the services remain where they are and the cluster does not reconfigure itself or restart any services.

My questions are: Why does the cluster reconfigure itself and restart ALL services, regardless of which node they are on, when I do a controlled shutdown of the current master node?  Do I have to hard-reset the master node in an RHCS cluster so the remaining services don't get restarted?  Why does the cluster completely reconfigure itself when the master node is 'properly' removed?

Thanks for your help, and any suggestions would be appreciated.

Greg Charles
Mid Range Systems

gcharles at ups.com
