[Linux-cluster] How to "reactivate" a fenced node?

Sebastian Kayser mls at skayser.de
Wed Jun 8 21:47:06 UTC 2005


Hi all,

I have a 3-node GFS lab setup up and running on Debian sarge with a
vanilla 2.6.11 kernel, using the FC4 CVS branch code from
    http://people.redhat.com/teigland/cluster-2.6.11.tar.bz2
Two of my nodes are connected via FC (sarge-fc1, sarge-fc2) and the
third via iSCSI (iscsi).

When I simulate a node failure on one of the FC nodes by unplugging
its network connection, the node gets fenced (fence_sanbox2) and the
other two nodes keep going. On the now-fenced node I see a lot of
I/O errors (expected, since the node is fenced), and shortly after
that the node becomes inquorate.
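
For the record, the membership and quorum state can be checked with
the cman userland tools (a sketch; this assumes the tools from the
tarball above are in the path):

    # membership and quorum state as seen by the cluster manager
    cman_tool status
    cman_tool nodes

    # the kernel's own view of the membership
    cat /proc/cluster/nodes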

Now I would like to reactivate the fenced node by
- stopping the processes that access the shared GFS volume
- unmounting the shared GFS volume
- stopping the cluster daemons
- re-enabling the FC ports
- starting the cluster daemons again (rejoining the cluster)
- mounting the shared GFS volume again
- starting whatever needs to be started
(the commands I have in mind are sketched below).
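
Roughly the following, with /mnt/gfs standing in for my actual mount
point and /dev/pool/gfs1 for the device (both names are just
placeholders):

    # stop/kill everything holding files open on the GFS mount
    fuser -km /mnt/gfs

    # unmount the shared GFS volume
    umount /mnt/gfs

    # stop the cluster daemons: leave the fence domain, then the cluster
    fence_tool leave
    cman_tool leave

    # ... re-enable the FC ports on the switch here ...

    # start the cluster daemons again and rejoin
    ccsd
    cman_tool join
    fence_tool join

    # remount the shared GFS volume, then restart the services
    mount -t gfs /dev/pool/gfs1 /mnt/gfs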

However, all processes on the fenced node that access the GFS volume
are blocked in a way I can't stop (not even with SIGKILL), so I can't
unmount the still "busy" GFS volume, and consequently I can't stop the
cluster daemons. The only way left to regain access to the GFS volume
is to reboot the fenced node.
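
The blocked processes appear to be stuck in uninterruptible sleep
(state "D" in ps), which would explain why even SIGKILL is ignored:

    # blocked processes show state "D" (uninterruptible I/O wait)
    ps -eo pid,stat,wchan:20,cmd | grep ' D'

    # list what still keeps the mount busy (/mnt/gfs is a placeholder)
    fuser -vm /mnt/gfs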

The last message that gets written to syslog on the fenced node is

Jun  8 21:29:05 sarge-fc2 kernel: GFS: fsid=cluster:gfs1.1: telling LM
to withdraw

but that doesn't seem to have any effect. I also tried a manual
'gfs_tool withdraw', to no avail.
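
For reference, the manual withdraw was of the form (again with
/mnt/gfs as a placeholder for my mount point):

    gfs_tool withdraw /mnt/gfs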

Is this behaviour by design (i.e. unkillable processes)? Is it possible
to avoid rebooting the node in order to regain GFS access?

Regards,

Sebastian
