[Linux-cluster] How to reboot the whole cluster

nick at javacat.f2s.com
Mon Nov 17 11:40:53 UTC 2008


Hi folks,

RHEL 5.2
cman-2.0.84-2.el5
gfs-utils-0.1.17-1.el5
rgmanager-2.0.38-2.el5
openais-0.80.3-15.el5
kmod-gfs-PAE-0.1.23-5.el5
kmod-gfs2-PAE-1.92-1.1.el5
gfs2-utils-0.1.44-1.el5_2.1

I came into work this morning and our 4-node cluster was down: all nodes had lost access to the GFS filesystem due to an iSCSI error.
Even though the iSCSI error corrected itself in the middle of the night, the cluster did not regain quorum.
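
For reference, quorum and group state can be checked with the usual status commands, something like:

    cman_tool status   # overall cluster state, including whether it is quorate
    cman_tool nodes    # membership status of each node
    group_tool ls      # fence/dlm/gfs group state; stuck joins show up here
    clustat            # service and member overview from rgmanager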

It took me 2 hours to fix the problem. Rebooting any node would fail to start fencing during boot.
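
For reference, the fence join that the init script attempts at boot can also be retried by hand, roughly like this (cman 2.0 commands; exact behaviour may vary):

    service cman start   # starts the cluster daemons and attempts the fence domain join
    fence_tool join      # retry just the fence domain join step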

I eventually got it working by powering off all nodes and rebooting them one at a time. Fencing did not start working until the fourth node was booted,
and even then the GFS filesystem was not mounted.

Here's what I did.
Power off node 4.
Power off node 3.
Power off node 2.
Reboot node 1.

Node 1 can join the fence domain.
Power on node 2. Node 2 can't join the fence domain.
Power on node 3. Node 3 can't join the fence domain.
Power on node 4. Node 4 joins the fence domain.
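
In case it helps, fence domain membership after each power-on can be confirmed with something like:

    cman_tool nodes   # the Sts column should show 'M' for joined members
    group_tool ls     # the 'fence' group line shows whether the node joined the fence domain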

I then had to run 'service gfs start' on nodes 1, 2 & 3, and the cluster was back up and running.
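
For reference, that step amounts to the following (the mount line is the manual equivalent; the device path is only an example):

    service gfs start                       # mounts the GFS filesystems listed in /etc/fstab
    # manual equivalent, with an example device path:
    mount -t gfs /dev/myvg/gfslv /mnt/gfs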

What is the correct way to get GFS filesystems running again after access to the GFS device has been temporarily lost and the cluster is blocking all
activity?

Thanks,
Nick.
