[Linux-cluster] Re: How to configure a cluster to remain up in the event of node failure
Brett Cave
brettcave at gmail.com
Wed Aug 13 09:32:34 UTC 2008
I think I found a problem with the way it starts up. See just below
the startup output for more detail.
On Tue, Aug 12, 2008 at 4:59 PM, Brett Cave <brettcave at gmail.com> wrote:
> With a 3-node GFS1 cluster, if I hard-reset one node, it hangs on
> startup, although the cluster seems to return to normal.
> Nodes: node2, node3, node4
> each node has 1 vote, and a qdisk has 2 votes.
>
> If I reset node3, gfs on node2 and node4 is blocked while node3
> restarts. First question: is there a config that will allow the
> cluster to continue operating while 1 node is down? My quorum is 3 and
> total votes is 4 while node3 is restarting, but my gfs mountpoints are
> inaccessible until my cman services start up on node3.
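To sanity-check the quorum side of this, here is the vote arithmetic as I understand it. This is just a sketch using the simple-majority formula with the numbers from my cluster above; I have not verified that cman computes it exactly this way:

```shell
# Vote arithmetic for this cluster: node2/node3/node4 at 1 vote each,
# plus the qdisk at 2 votes.
node_votes=3
qdisk_votes=2
total=$((node_votes + qdisk_votes))   # 5 expected votes
quorum=$((total / 2 + 1))             # 3 (simple majority)
remaining=$((total - 1))              # 4 votes left with one node down
echo "total=$total quorum=$quorum remaining=$remaining"
```

If this is right, losing one node still leaves 4 votes against a quorum of 3, so the cluster should stay quorate and the GFS hang is presumably not a quorum loss.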
>
> Secondly, when node3 restarts, it hangs when trying to remount gfs file systems.
> Starting cman
> Mounting configfs...done
> Starting ccsd...done
> Starting cman...done
> Starting daemons...done
> Starting fencing...done
> OK
> qdiskd OK
>
> "Mounting other file systems..." OK
>
> Mounting GFS filesystems: GFS 0.1.1-7.el5 installed
> Trying to join cluster "lock_dlm","jemdevcluster:cache1"
> dlm: Using TCP for communications
> dlm: connecting to 2
> dlm: got connection to 2
> dlm: connecting to 2
> dlm: got connection from 4
Could this be the problem?
When GFS is set to auto-start via chkconfig, it first tries to
connect to 2, gets a connection, and then tries to connect to 2 again.
It gets a connection from 4, and hangs.
However, if I chkconfig --level 3 gfs off and then run service gfs
start once the system has booted, I get:
dlm: connecting to 2
dlm: got connection from 2
dlm: connecting to 4
dlm: got connection from 4
mounting gfs mountpoints.
This works exactly as expected - gfs mounts, and the cluster is back
to normal. This means that for some reason, when gfs starts as an
automatic boot service, it doesn't connect to the nodes properly -
trying to connect to node2 twice, rather than node2 and then node4 as
it should.
Why would it be doing this? Where would I start troubleshooting
something like this?
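For now, the workaround above can be wrapped up as a small script. This is only a sketch: it assumes the stock RHEL5/CentOS5 chkconfig and gfs initscripts, and DRY_RUN is a guard I added so the commands can be previewed before running them for real:

```shell
#!/bin/sh
# Workaround sketch: keep gfs out of the boot sequence, then mount the
# GFS file systems by hand once the node has rejoined the cluster.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run chkconfig --level 3 gfs off   # disable gfs auto-start in runlevel 3
run service gfs start             # mounts the gfs entries from /etc/fstab
```

Running it with DRY_RUN=0 (as root, after boot) should do the same thing as the manual steps described above.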
>
> After that, system just hangs.
>
> From nodes 2 & 4, I can run cman_tool, and everything shows that the
> cluster is up, except for some services:
> [root at node2 cache1]# cman_tool services
> type   level  name     id        state
> fence  0      default  00010004  none
> [2 3 4]
> dlm    1      cache1   00010003  none
> [2 3 4]
> dlm    1      storage  00030003  none
> [2 4]
> gfs    2      cache1   00000000  none
> [2 3 4]
> gfs    2      storage  00020003  none
> [2 4]
>
> [root at node2 cache1]# cman_tool nodes
> Node  Sts  Inc  Joined               Name
> 0     M    0    2008-08-12 16:11:46  /dev/sda5
> 2     M    336  2008-08-12 16:11:12  node2
> 3     M    352  2008-08-12 16:44:31  node3
> 4     M    344  2008-08-12 16:11:12  node4
>
> I have 2 gfs partitions
> [root at node4 CentOS]# grep gfs /etc/fstab
> /dev/sda1 /gfs/cache1 gfs defaults 0 0
> /dev/sda2 /gfs/storage gfs defaults 0 0
>
>
> At this point, I am unable to unmount /gfs/cache1 from any of my nodes
> (node2 or node4) - it just hangs. I can unmount storage with no
> problem.
>
> Is there something I am overlooking? Any and all advice welcome :)
>
> Regards,
> Brett
>