[Linux-cluster] How to "reactivate" a fenced node?

Sebastian Kayser mls at skayser.de
Wed Jun 8 21:47:06 UTC 2005


Hi all,

I have a 3-node GFS lab setup up and running on Debian sarge with a
vanilla 2.6.11 kernel, using the FC4 CVS branch code from
    http://people.redhat.com/teigland/cluster-2.6.11.tar.bz2
Two of my nodes are connected via FC (sarge-fc1, sarge-fc2) and the
third via iSCSI (iscsi).

When I simulate a node failure on one of the FC nodes by unplugging
its network connection, the node gets fenced (fence_sanbox2) and the
other two nodes keep going. On the now-fenced node I see a lot of
I/O errors (expected, since the node is fenced), and shortly after
that the node becomes inquorate.
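
For the record, the membership and quorum state can be checked with
the cman userland tools (a sketch; this assumes the tools from the
tarball above are in the path):

    # membership and quorum state as seen by the cluster manager
    cman_tool status
    cman_tool nodes

    # the kernel's own view of the membership
    cat /proc/cluster/nodes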

Now I would like to reactivate the fenced node by
- stopping the processes that access the shared GFS volume
- unmounting the shared GFS volume
- stopping the cluster daemons
- re-enabling the FC ports
- starting the cluster daemons again (rejoining the cluster)
- mounting the shared GFS volume again
- starting whatever needs to be started
(the commands I have in mind are sketched below).
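
Roughly the following, with /mnt/gfs standing in for my actual mount
point and /dev/pool/gfs1 for the device (both names are just
placeholders):

    # stop/kill everything holding files open on the GFS mount
    fuser -km /mnt/gfs

    # unmount the shared GFS volume
    umount /mnt/gfs

    # stop the cluster daemons: leave the fence domain, then the cluster
    fence_tool leave
    cman_tool leave

    # ... re-enable the FC ports on the switch here ...

    # start the cluster daemons again and rejoin
    ccsd
    cman_tool join
    fence_tool join

    # remount the shared GFS volume, then restart the services
    mount -t gfs /dev/pool/gfs1 /mnt/gfs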

However, all processes on the fenced node that access the GFS volume
are blocked in a way I can't stop (not even with SIGKILL), so I can't
unmount the still "busy" GFS volume, and consequently I can't stop the
cluster daemons. The only way left to regain access to the GFS volume
is to reboot the fenced node.
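
The blocked processes appear to be stuck in uninterruptible sleep
(state "D" in ps), which would explain why even SIGKILL is ignored:

    # blocked processes show state "D" (uninterruptible I/O wait)
    ps -eo pid,stat,wchan:20,cmd | grep ' D'

    # list what still keeps the mount busy (/mnt/gfs is a placeholder)
    fuser -vm /mnt/gfs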

The last message that gets written to syslog on the fenced node is

Jun  8 21:29:05 sarge-fc2 kernel: GFS: fsid=cluster:gfs1.1: telling LM
to withdraw

but that doesn't seem to have any effect. I also tried a manual
'gfs_tool withdraw', to no avail.
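
For reference, the manual withdraw was of the form (again with
/mnt/gfs as a placeholder for my mount point):

    gfs_tool withdraw /mnt/gfs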

Is this behaviour by design (i.e. unkillable processes)? Is it possible
to avoid rebooting the node in order to regain GFS access?

Regards,

Sebastian
