[Linux-cluster] GFS problems!!!

Wed Oct 10 13:53:03 UTC 2007

On Tue, Oct 09, 2007 at 04:52:01PM -0700, James Fillman wrote:
> Ok. I'm trying to implement GFS on two different clusters: 9 nodes, 17
> nodes.
> 
> I'm having nothing but troubles. The gfs volumes are freezing and
> throwing the cluster into a bad state. Currently, this is the state of
> my cluster:
> 
> [root@plxp01md-new log]# cman_tool services
> type             level name     id       state
> fence            0     default  00010004 none
> [1 2 3 4 5 6 7 8 9]
> dlm              1     clvmd    00010003 none
> [1 2 3 4 5 6 7 8 9]
> dlm              1     mdi_log  00020001 FAIL_START_WAIT
> [1 2 3 4 6 7 8 9]
> dlm              1     deploy   00040001 FAIL_START_WAIT
> [1 4 6 7 8 9]
> gfs              2     mdi_log  00010001 FAIL_START_WAIT
> [1 2 3 4 6 7 8 9]
> gfs              2     deploy   00030001 FAIL_START_WAIT

You probably have nodes going down; if you can keep nodes from failing,
things will run much better.  The output of 'cman_tool nodes' and
/var/log/messages may give us some idea about the source of the node
failures, whether they are spurious, and whether your largish number of
nodes is contributing to the problems.
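
As a rough aid for spotting which nodes have dropped out, here is a small
Python sketch that parses 'cman_tool services' output like the listing
quoted above and reports, for each group that is not in state "none", the
node IDs missing from its member list.  The function names are made up
for illustration, and the full membership (nodes 1-9) is an assumption
taken from the fence group in the listing.  The last group in the quoted
output has no member line, so the sketch skips it.

```python
# Output of 'cman_tool services' as quoted in this thread.
SAMPLE = """\
type             level name     id       state
fence            0     default  00010004 none
[1 2 3 4 5 6 7 8 9]
dlm              1     clvmd    00010003 none
[1 2 3 4 5 6 7 8 9]
dlm              1     mdi_log  00020001 FAIL_START_WAIT
[1 2 3 4 6 7 8 9]
dlm              1     deploy   00040001 FAIL_START_WAIT
[1 4 6 7 8 9]
gfs              2     mdi_log  00010001 FAIL_START_WAIT
[1 2 3 4 6 7 8 9]
gfs              2     deploy   00030001 FAIL_START_WAIT
"""


def parse_services(text):
    """Return one dict per group; 'members' is None if no member line follows."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the column header
        if line.startswith("["):
            # A member list belongs to the group on the preceding line.
            if groups:
                groups[-1]["members"] = {int(n) for n in line.strip("[]").split()}
            continue
        fields = line.split()
        if len(fields) >= 5:
            groups.append({
                "type": fields[0],
                "level": int(fields[1]),
                "name": fields[2],
                "id": fields[3],
                "state": fields[4],
                "members": None,
            })
    return groups


def missing_nodes(groups, all_nodes):
    """Map (type, name) -> sorted node IDs absent from each non-idle group."""
    report = {}
    for g in groups:
        if g["state"] != "none" and g["members"] is not None:
            report[(g["type"], g["name"])] = sorted(all_nodes - g["members"])
    return report


if __name__ == "__main__":
    # Assumption: the cluster consists of node IDs 1 through 9.
    for key, absent in missing_nodes(parse_services(SAMPLE), set(range(1, 10))).items():
        print(key, "missing nodes:", absent)
```

The node IDs it reports are the ones to look up in 'cman_tool nodes' and
in /var/log/messages on the corresponding hosts.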

> I have no idea what happened. I've got users who are writing to a gfs
> volume and just came and reported to me that the volume's not responding.
> /var/log/messages has been outputting the following message, about 50
> times a second,  since Friday:
> 
> Oct  9 13:54:35 plxp01deploy kernel: dlm: recover_master_copy -53 401ce

Ignore those, they are debug messages that were mistakenly directed to
/var/log/messages.
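
With roughly 50 of these lines per second, any messages about node
failures will be hard to find in the log.  A minimal Python sketch for
dropping the debug lines while reading the log follows; the regular
expression is an assumption based on the single sample line quoted above,
and the function name is made up for illustration.

```python
import re

# Pattern guessed from the one sample line in this thread:
#   "kernel: dlm: recover_master_copy -53 401ce"
NOISE = re.compile(r"kernel: dlm: recover_master_copy -?\d+ [0-9a-f]+")


def drop_dlm_debug(lines):
    """Return the lines that are not dlm recover_master_copy debug output."""
    return [ln for ln in lines if not NOISE.search(ln)]


if __name__ == "__main__":
    # Reading this file normally requires root privileges.
    with open("/var/log/messages", errors="replace") as log:
        for ln in drop_dlm_debug(log):
            print(ln, end="")
```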

Dave