[Linux-cluster] GFS problems!!!

Tue Oct 9 23:52:01 UTC 2007

OK. I'm trying to implement GFS on two different clusters: one with 9
nodes and one with 17 nodes.

I'm having nothing but trouble. The GFS volumes are freezing and
throwing the cluster into a bad state. Currently, this is the state of
my cluster:

[root@plxp01md-new log]# cman_tool services
type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5 6 7 8 9]
dlm              1     clvmd    00010003 none
[1 2 3 4 5 6 7 8 9]
dlm              1     mdi_log  00020001 FAIL_START_WAIT
[1 2 3 4 6 7 8 9]
dlm              1     deploy   00040001 FAIL_START_WAIT
[1 4 6 7 8 9]
gfs              2     mdi_log  00010001 FAIL_START_WAIT
[1 2 3 4 6 7 8 9]
gfs              2     deploy   00030001 FAIL_START_WAIT

I have no idea what happened. Users who are writing to a GFS volume
just reported to me that the volume is not responding.
/var/log/messages has been logging the following message about 50
times a second since Friday:

Oct  9 13:54:35 plxp01deploy kernel: dlm: recover_master_copy -53 401ce

Can someone tell me what FAIL_START_WAIT means and how to recover from
it? Also, does anyone know what the log message above means?

All my servers in the cluster are showing the same service states.
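
To make the listing above easier to compare across nodes, here is a
minimal sketch that parses `cman_tool services` text and reports which
groups are not in the "none" state and which node IDs are absent from
each group's member list. The function names (parse_services,
stuck_groups) are hypothetical helpers, not part of any cluster tool,
and the parser assumes the five-column layout shown above, with each
member list on the line after its group. The last group in the pasted
output has no member line, so its missing nodes are reported as unknown.

```python
# Sample text copied from the `cman_tool services` output in this post.
SAMPLE = """\
type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5 6 7 8 9]
dlm              1     clvmd    00010003 none
[1 2 3 4 5 6 7 8 9]
dlm              1     mdi_log  00020001 FAIL_START_WAIT
[1 2 3 4 6 7 8 9]
dlm              1     deploy   00040001 FAIL_START_WAIT
[1 4 6 7 8 9]
gfs              2     mdi_log  00010001 FAIL_START_WAIT
[1 2 3 4 6 7 8 9]
gfs              2     deploy   00030001 FAIL_START_WAIT
"""


def parse_services(text):
    """Parse the listing into a list of dicts, one per service group."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the column header
        if line.startswith("[") and line.endswith("]"):
            # A member list belongs to the most recently seen group.
            if groups:
                groups[-1]["members"] = [int(n) for n in line[1:-1].split()]
            continue
        fields = line.split()
        if len(fields) == 5:
            groups.append({
                "type": fields[0],
                "level": int(fields[1]),
                "name": fields[2],
                "id": fields[3],
                "state": fields[4],
                "members": None,  # stays None if no member line follows
            })
    return groups


def stuck_groups(groups, all_nodes):
    """Return (type, name, state, missing_nodes) for groups not in 'none'."""
    result = []
    for g in groups:
        if g["state"] == "none":
            continue
        if g["members"] is None:
            missing = None  # member list was not present in the output
        else:
            missing = sorted(set(all_nodes) - set(g["members"]))
        result.append((g["type"], g["name"], g["state"], missing))
    return result


if __name__ == "__main__":
    # Node IDs 1-9 are taken from the fence group's member list above.
    for gtype, name, state, missing in stuck_groups(parse_services(SAMPLE), range(1, 10)):
        print(gtype, name, state, "missing:", missing)
```

Running the same script against the output captured on each node would
show whether every node really reports the same groups and members.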

I'm running 64-bit RHEL 5.

Please help. I'm almost ready to give up on GFS. It seems way too
unstable.

James Fillman