[Linux-cluster] 3rd node won't rejoin cluster

Arwin L Tugade arwin.tugade at csun.edu
Thu Sep 10 22:04:04 UTC 2009


I've just run into an odd problem on my production cluster.  One of the nodes got fenced (still digging through logs to find out why), and on its way back up it appears to join the cluster fine, but the node that fenced it starts spewing out tons of these in /var/log/messages:

Sep 10 14:25:34 redwing gfs_controld[6119]: cpg_mcast_joined retry 176200 unknown
Sep 10 14:25:35 redwing gfs_controld[6119]: cpg_mcast_joined retry 176300 unknown
Sep 10 14:25:35 redwing gfs_controld[6119]: cpg_mcast_joined retry 176400 unknown
Sep 10 14:25:35 redwing gfs_controld[6119]: cpg_mcast_joined retry 176500 unknown
Sep 10 14:25:35 redwing gfs_controld[6119]: cpg_mcast_joined retry 176600 unknown
Sep 10 14:25:35 redwing gfs_controld[6119]: cpg_mcast_joined retry 176700 unknown
....
...
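
To gauge how fast the retry counter is climbing, the lines above can be summarized with a small script. This is only a rough sketch in Python, assuming the syslog format shown above; the function name summarize_retries and the regular expression are illustrative and not part of cman or gfs_controld.

```python
import re

# Matches lines such as:
#   Sep 10 14:25:34 redwing gfs_controld[6119]: cpg_mcast_joined retry 176200 unknown
# Assumption: the syslog prefix is "Mon DD HH:MM:SS host", as in the excerpt above.
RETRY_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+gfs_controld\[\d+\]:\s+"
    r"cpg_mcast_joined retry (\d+)"
)


def summarize_retries(lines):
    """Summarize cpg_mcast_joined retry messages found in an iterable of log lines.

    Returns a dict with the first/last timestamp, the first/last retry
    counter value, and the number of matching lines, or None if no line matches.
    """
    hits = []
    for line in lines:
        match = RETRY_RE.match(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    if not hits:
        return None
    return {
        "first_time": hits[0][0],
        "last_time": hits[-1][0],
        "first_retry": hits[0][1],
        "last_retry": hits[-1][1],
        "matches": len(hits),
    }


# Example use against the real log (needs read access to the file):
#   with open("/var/log/messages") as log:
#       print(summarize_retries(log))
```

Comparing first_retry and last_retry against the two timestamps gives a rough idea of how quickly gfs_controld is retrying.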

The node that got fenced just hangs at the "Starting Fencing..." part of cman, while the load on redwing (the node that fenced it) climbs slowly but surely.  I ended up bringing down the fenced node, and I'm running fine off the 2 remaining nodes.  Has anyone run into this problem?

I'm running RHEL5.3 with these packages:

[a_arwin at redwing ~]$ rpm -qa | egrep 'cman|rgman|gfs|lvm'
lvm2-2.02.40-6.el5
kmod-gfs-0.1.31-3.el5
cman-2.0.98-1.el5_3.1
gfs-utils-0.1.18-1.el5
rgmanager-2.0.46-1.el5_3.3
gfs2-utils-0.1.53-1.el5_3.2
lvm2-cluster-2.02.40-7.el5
[a_arwin at redwing ~]$ uname -a
Linux redwing.csun.edu 2.6.18-128.1.6.el5 #1 SMP Tue Mar 24 12:05:57 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Thanks ahead of time,
Arwin