[Linux-cluster] failed node causes all GFS systems to hang

Dan B. Phung phung at cs.columbia.edu
Wed Jun 8 21:46:26 UTC 2005


I think I'm doing something terribly wrong here: if one of my nodes goes
down, the rest of the nodes connected to GFS hang in some wait state.
Specifically, only the nodes running fenced are hosed. On these machines
not only is the GFS file system blocked, but local file system operations
hang as well, so I have to reboot everybody connected to GFS. I keep one
node that is not running fenced so I can reset the quorum status, so
quorum doesn't seem to be the problem.

I updated from the CVS sources (-rRHEL4) last Friday, so my code is up to
date. I'm running kernel 2.6.9 and fence_manual. I remember that a couple
of weeks back, when a node went down, I simply had to fence_ack_manual
the node, but the message prompting for that never comes up anymore...

help!

-dan
