[Linux-cluster] Re: How to configure a cluster to remain up in the event of node failure

Brett Cave brettcave at gmail.com
Wed Aug 13 09:32:34 UTC 2008


I think I found a problem with the way it starts up...  See just below
the startup output for more info...

On Tue, Aug 12, 2008 at 4:59 PM, Brett Cave <brettcave at gmail.com> wrote:
> With a 3-node GFS1 cluster, if I hard-reset 1 node, it hangs on
> startup, although the cluster seems to return to normal.
> Nodes: node2, node3, node4
> Each node has 1 vote, and a qdisk has 2 votes.
>
> If I reset node3, GFS on node2 and node4 is blocked while node3
> restarts. First question: is there a configuration that will allow the
> cluster to continue operating while 1 node is down? My quorum is 3 and
> the total votes are 4 while node3 is restarting, but my GFS mountpoints
> are inaccessible until the cman services start up on node3.
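>
> For reference, the relevant parts of my cluster.conf look roughly like
> this (trimmed down - fence devices, qdisk heuristics and the real
> config_version are left out):
>
> <cluster name="jemdevcluster" config_version="1">
>   <cman expected_votes="5"/>
>   <quorumd votes="2" device="/dev/sda5"/>
>   <clusternodes>
>     <clusternode name="node2" nodeid="2" votes="1"/>
>     <clusternode name="node3" nodeid="3" votes="1"/>
>     <clusternode name="node4" nodeid="4" votes="1"/>
>   </clusternodes>
> </cluster>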
>
> Secondly, when node3 restarts, it hangs when trying to remount the GFS file systems.
> Starting cman
> Mounting configfs...done
> Starting ccsd...done
> Starting cman...done
> Starting daemons...done
> Starting fencing...done
>                   OK
> qdiskd        OK
>
> "Mounting other file systems..." OK
>
> Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
> Trying to join cluster "lock_dlm","jemdevcluster:cache1"
> dlm: Using TCP for communications
> dlm: connecting to 2
> dlm: got connection to 2
> dlm: connecting to 2
> dlm: got connection from 4

Could this be the problem?

When GFS is set to auto-start via chkconfig, it first tries to
connect to 2, gets a connection, and then tries to connect to 2 again.
It gets a connection from 4, and then hangs.
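
The dlm lines above also end up in the kernel log, so the two boots can
be compared after the fact with something like this (default syslog
setup assumed):

dmesg | grep -i dlm
grep dlm /var/log/messages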

However, if I run "chkconfig --level 3 gfs off" and then run "service gfs
start" once the system has booted, I get:
dlm: connecting to 2
dlm: got connection from 2
dlm: connecting to 4
dlm: got connection from 4
mounting gfs mountpoints.

This works exactly as expected - GFS mounts, and the cluster is back to
normal. This means that for some reason, when GFS is starting as an
automatic boot service, it doesn't connect to the nodes properly - it
tries to connect to node2 twice, rather than to node2 and then node4 as
it should.
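
So the workaround for the moment is just to keep gfs out of the boot
sequence and start it by hand once the node is back in the cluster:

# stop the gfs init script from running at boot in runlevel 3
chkconfig --level 3 gfs off

# once cman, fencing and qdiskd are up, mount the GFS filesystems manually
service gfs start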

Why would it be doing this? Where would I start troubleshooting
something like this?
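
Is the ordering of the init scripts in the runlevel 3 boot sequence
worth checking? i.e. something along the lines of:

ls /etc/rc.d/rc3.d/ | grep -E 'cman|qdisk|gfs'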

>
> After that, the system just hangs.
>
> From nodes 2 & 4, I can run cman_tool, and everything shows that the
> cluster is up, except for some services:
> [root at node2 cache1]# cman_tool services
> type             level name     id       state
> fence            0     default  00010004 none
> [2 3 4]
> dlm              1     cache1   00010003 none
> [2 3 4]
> dlm              1     storage  00030003 none
> [2 4]
> gfs              2     cache1   00000000 none
> [2 3 4]
> gfs              2     storage  00020003 none
> [2 4]
>
> [root at node2 cache1]# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>   0   M      0   2008-08-12 16:11:46  /dev/sda5
>   2   M    336   2008-08-12 16:11:12  node2
>   3   M    352   2008-08-12 16:44:31  node3
>   4   M    344   2008-08-12 16:11:12  node4
>
> I have 2 GFS partitions:
> [root at node4 CentOS]# grep gfs /etc/fstab
> /dev/sda1       /gfs/cache1                     gfs     defaults        0 0
> /dev/sda2       /gfs/storage                    gfs     defaults        0 0
>
>
> At this point, I am unable to unmount /gfs/cache1 from either of my
> remaining nodes (node2 or node4) - it just hangs. I can unmount
> /gfs/storage with no problem.
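>
> Is there anything beyond the obvious checks for processes holding the
> mountpoint open that would show what is blocking it? i.e.
>
> fuser -vm /gfs/cache1
> lsof /gfs/cache1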
>
> Is there something I am overlooking? Any and all advice welcome :)
>
> Regards,
> Brett
>



