[Linux-cluster] Rebooting the Master Node in an RHCS Cluster

Juan Ramon Martin Blanco robejrm at gmail.com
Tue Oct 26 13:27:15 UTC 2010

On Tue, Oct 26, 2010 at 2:52 PM,  <gcharles at ups.com> wrote:
> Hello,
> Was wondering if anyone else has ever run into this.  We have a three-node
> RHCS cluster:
> Three Proliant DL380-G6s, 48G memory
> Dual network, power, QLogic HBAs for redundancy
> RHEL 5.5  kernel 2.6.18-194.el5
> All three in an RHCS cluster, 12 Oracle database services.  The cluster
> itself runs fine under normal conditions, and all failovers function as
> expected.  There is only one failover domain configured, and all three nodes
> are members of that domain.  Four of the Oracle database services contain
> GFS2 file systems; the rest are ext3.
> The problem is when we attempt a controlled shutdown of the current master
> node.  We have tested in the following situations:
> 1.  Node 1 is the current master and not running any services.  Node 2 is
> also not running any services.  Node 3 is running all 12 services.  We
> hard-fail node 1 (by logging into the iLO and clicking on "Reset" in power
> management) and node 2 immediately takes over the master role and the
> services stay where they are and continue to function.  I believe this is
> the expected behavior.
> 2.  Node 1 is the current master and not running any services.  Three
> services are on node 2, and node 3 is running the rest.  Again, we hard-fail
> node 1 as described above and node 2 assumes the master role and the
> services stay where they are and continue to function.
> 3.  Same starting state as above: node 1 is the master and not running
> any services, three services are on node 2, and the rest are on node 3.  This
> time we perform a controlled shutdown of node 1 to "properly" remove it from
> the cluster (let's say we're doing a rolling patch of the OS on the nodes)
> with the following steps on the master node:
>  - Unmount any GFS file systems.
>  - service rgmanager stop; service gfs2 stop; service gfs stop  (clustat
> shows node1 Online but no rgmanager, as expected)
>  - fence_tool leave    (this removes node 1 from the fence group in the
> hopes that the other nodes don't try to fence it as it is rebooting)
>  - service clvmd stop
>  - cman_tool leave remove
>  - service qdiskd stop
>  - shutdown
> Everything appears normal until we execute the 'cman_tool leave remove'.  At
> that point the cluster log on node 2 and node 3 shows "Lost contact with
> quorum device" (we expect that) but also shows "Emergency stop of services"
> for all 12 services.  While access to the quorum device is restored almost
> immediately (node 2 takes over the master role), rgmanager is temporarily
> unavailable on nodes 2 and 3 while the cluster basically reconfigures
> itself, restarting all 12 services.  Eventually all 12 services restart
> properly (not necessarily on the node they were originally on), and when
> node 1 finishes rebooting, it rejoins the cluster properly.  Node 2
> remains the master.
> If I do the same tests as above and reboot a node that is NOT the master,
> the services remain where they are and the cluster does not reconfigure
> itself or restart any services.
> My questions are: Why does the cluster reconfigure itself and restart ALL
> services, regardless of which node they are on, when I do a controlled
> shutdown of the current master node?  Do I have to hard-reset the master
> node in an RHCS cluster so the remaining services don't get restarted?  Why
> does the cluster completely reconfigure itself when the master node is
> 'properly' removed?
Hi, could you please show us your cluster.conf?
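
For anyone trying to reproduce this, the leave sequence described in step 3
can be sketched as a small script. This is only a sketch of the steps as
quoted, in the quoted order, not a recommended procedure. The run() helper,
the DRY_RUN variable and the use of "umount -a -t gfs,gfs2" for the unmount
step are my assumptions and do not come from the original post. The final
shutdown is left out on purpose, since the post does not say which options
were used.

```shell
#!/bin/sh
# Sketch of the controlled-leave sequence quoted in step 3.
# DRY_RUN (hypothetical variable) defaults to 1, so nothing is executed
# unless the operator explicitly sets DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# run() is a hypothetical helper: print the command in dry-run mode,
# otherwise execute it and stop at the first failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        echo "running: $*"
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

leave_cluster() {
    # Unmount any GFS/GFS2 file systems (assumed form of the unmount step).
    run umount -a -t gfs,gfs2
    # Stop the resource manager and the GFS init scripts.
    run service rgmanager stop
    run service gfs2 stop
    run service gfs stop
    # Leave the fence domain before leaving the cluster.
    run fence_tool leave
    run service clvmd stop
    # The step after which the quoted post reports "Emergency stop".
    run cman_tool leave remove
    run service qdiskd stop
    # The shutdown itself is left to the operator.
}

leave_cluster
```

Run as is, it only prints the eight commands in order. Setting DRY_RUN=0
on a node would actually execute them, so that should only be done on a
test cluster.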

> Thanks for your help, and any suggestions would be appreciated.
> Greg Charles
> Mid Range Systems
> gcharles at ups.com
