[Linux-cluster] Two-node clusters using GFS and shared storage

Jeroen van den Horn J.vandenHorn at xb.nl
Fri Jul 13 16:32:22 UTC 2007


José,

Fencing is not optional but mandatory for GFS. Once a node failure is 
detected, the remaining cluster nodes will *wait* until the failed node 
has been successfully fenced. Once it is fenced (i.e. power-cycled or 
disconnected from the SAN), one of the surviving nodes replays the 
failed node's journal and GFS operation continues. Without fencing, the 
cluster will hang on any lock held by the failed node - which is 
exactly what your hanging systems are showing.

Install a proper fencing agent for operational use (a network power 
switch or your FC switch, for example). For testing purposes you can 
get by with manual fencing (the fence_manual agent), but never rely on 
it in production.
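
As a sketch (untested, and the method/device names below are arbitrary 
labels I made up), a manual-fencing setup for your two nodes would look 
roughly like this:

-- example cluster.conf with manual fencing --

<?xml version="1.0"?>
<cluster name="correo" config_version="2">

<cman two_node="1" expected_votes="1">
</cman>

<clusternodes>

<clusternode name="node1" votes="1">
<fence>
<method name="single">
<device name="human" nodename="node1"/>
</method>
</fence>
</clusternode>

<clusternode name="node2" votes="1">
<fence>
<method name="single">
<device name="human" nodename="node2"/>
</method>
</fence>
</clusternode>

</clusternodes>

<fencedevices>
<fencedevice name="human" agent="fence_manual"/>
</fencedevices>

</cluster>

-- end example --

Remember to bump config_version whenever you change the file.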

PS: Plugging the cable back in without power-cycling is a NO-GO. The 
failed node is no longer in sync with the rest of the cluster (the 
surviving nodes assume the machine has been power-cycled after a manual 
fence) - you risk GFS filesystem corruption by reattaching it to the 
storage without a proper fencing procedure!
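
PPS: With manual fencing the fence never completes by itself - fenced 
just sits and waits for you. The rough procedure (assuming node1 is the 
failed node) is: physically power-cycle node1 yourself, and then, on 
the surviving node, acknowledge the fence with

  fence_ack_manual -n node1

Only after that acknowledgement is the journal replayed and GFS 
unblocked.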

Jeroen

José Miguel Parrella Romero wrote:
> Greetings,
>
> I've been trying to set up a two-node cluster using a shared SAN (via
> Fibre Channel) and GFS. I've previously tried OCFS2, and I don't want
> to use NFS yet. The cluster must be active-active, and it runs on
> Itanium2 machines with Debian 4.0. I'm using cman 1.03.00.
>
> I've set up the cluster using the Red Hat tools, and my
> /etc/cluster/cluster.conf looks like this:
>
> -- my cluster.conf --
>
> <?xml version="1.0"?>
> <cluster name="correo" config_version="1">
>
> <cman two_node="1" expected_votes="1">
> </cman>
>
> <clusternodes>
>
> <clusternode name="node1" votes="1">
> </clusternode>
>
> <clusternode name="node2" votes="1">
> </clusternode>
>
> </clusternodes>
>
> </cluster>
>
> -- end my cluster.conf --
>
> Note that I've removed the entries related to fencing, though I
> previously had a 'manual' fencing method. I have an LVM volume which
> contains a GFS filesystem, and I'm able to start ccsd, cman, fenced,
> clvmd and all the other related daemons.
>
> Syslog reports that the cluster is quorate, and I'm able to mount the
> filesystem on both of my nodes. They need to write to the shared
> storage in an active-active fashion.
>
> I expect that removing the network cable from node1 would do the
> following:
>
> a) node1 would be disabled (right, it doesn't have a network cable)
> b) node2 would notice node1 is gone and would keep writing to the
> shared storage
> c) eventually node1 would come back, and node2 would notice it, so
> node1 would hopefully start writing again
>
> And this is what happens when I unplug the network cable:
>
> a) node1 is disabled (no connectivity)
> b) node2 is also disabled! (trying to write to /home and /var/mail
> stalls the machine, and then logins and other processes stall as well)
> c) plugging the cable back in does nothing (both machines are hung by
> then, so I need to reboot them)
>
> I'm probably missing something, since an equivalent setup using OCFS2
> shows the same problem! Our last-resort solution is active-active NFS
> using Heartbeat, but then we wouldn't be writing to the SAN over FC
> (2Gbps) but over Ethernet (1Gbps), since we don't have any other media
> around ATM.
>
> Is this a configuration-related problem? Or is it a design feature of
> both GFS and OCFS2? Or maybe I'm just missing the whole picture...
>
> Thank you very much for any advice,
> Jose