[Linux-cluster] hung cluster question
Daniel McNeil
daniel at osdl.org
Sat Dec 4 00:58:12 UTC 2004
Another thing I am a bit confused by. After hitting the
rm hang describe before, I expected that reset one of the
nodes of the cluster would clear up the problem since
recovery should clean up the DLM lock state.
So I reset cl031. cl030 still had the gfs file system mounted
and cl032 was a member of the cluster, but did not a gfs
file system mounted.
When I reset cl031, both other nodes printed
CMAN: no HELLO from cl031a, removing from the cluster
Since I had configured manual fencing, I expected that I
would see a message on one of the nodes saying I needed
to ack the fencing, but I never saw any message.
After that running, cat /proc/cluster/services hung.
I reset cl031 and cl032 got:
CMAN: no HELLO from cl030a, removing from the cluster
CMAN: quorum lost, blocking activity
SM: 00000001 process_recovery_barrier status=-104
Does the SM: message mean anything.
After rebooting the other 2 nodes, they rejoined the cluster
ok, but there were message vi /var/log/messages :
Dec 3 16:53:44 cl032 fenced[17168]: fencing node "cl030a"
Dec 3 16:53:44 cl032 fenced[17168]: fence "cl030a" failed
So I'm not sure my manual fencing is working correctly.
Any suggestions?
Thanks,
Daniel
More information about the Linux-cluster
mailing list