[Linux-cluster] Re: Cluster Suite 4 failover problem

Fri Oct 20 09:36:25 UTC 2006

I had a very similar set of problems just recently and found that 
uninstalling the "fence" RPM solved about 90% of them, including a 
hanging rgmanager which required me to "switch-off-and-switch-on" the 
servers many times. I suspect that the problem had more to do with my 
unfamiliarity with fencing, but I wonder whether there are issues when 
fencing is running with no fence devices in use, and with how the fenced 
daemon then interacts with rgmanager. I do know that there are similar 
(since fixed) lockup issues with rgmanager-1.9.46-0, 
cman-kernel-2.6.9-43.8 and kernel 2.6.9-34, which disappear with an 
upgrade to kernel 2.6.9-34.0.1 and cman-kernel-smp-2.6.9-43.8.3, but 
that upgrade introduced a new set of problems for me, so I rolled back 
and uninstalled fenced, et voilà!
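
For anyone wanting to check whether a node is hitting the same kind of 
lockup, here is a rough, unofficial Python sketch that tallies clurgmgrd 
messages by severity and message number. The pattern is only inferred 
from the log excerpt quoted further down, so treat the regex, the 
function names scan and scan_file, and the /var/log/messages default as 
assumptions rather than a documented log format:

```python
import re
from collections import Counter

# Assumed shape of an rgmanager daemon line, taken from the quoted excerpt:
#   clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: ...
# Both "clurgmgrd[PID]:" and "clurgmgrd: [PID]:" appear in that excerpt.
LINE_RE = re.compile(
    r"clurgmgrd:?\s*\[(?P<pid>\d+)\]:\s*<(?P<level>\w+)>\s*"
    r"(?:#(?P<code>\d+):\s*)?(?P<text>.*)"
)


def scan(lines):
    """Return (Counter keyed by (level, code), list of parsed records).

    `code` is None for messages that carry no "#NN:" number.
    """
    counts = Counter()
    records = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a clurgmgrd line
        record = match.groupdict()
        records.append(record)
        counts[(record["level"], record["code"])] += 1
    return counts, records


def scan_file(path="/var/log/messages"):
    """Convenience wrapper (hypothetical helper): scan a syslog file."""
    with open(path, errors="replace") as handle:
        return scan(handle)
```

Calling scan_file() on each node and looking at the count for 
("err", "50") would show how often the "Unable to obtain cluster lock" 
message from the quoted log turns up.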

I still find that on occasion I have to kill -9 the rgmanager process 
(sometimes I have to do it more than once) and I realise that an 
unfenced cluster is unsupported, but it solved the problems for me.
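
The repeated kill -9 could be scripted roughly as follows. This is only 
a sketch under stated assumptions: the SIGTERM-first step, the retry 
count and the wait time are illustrative choices, stop_process is a 
hypothetical helper, and the PID has to be looked up separately. Note 
that a process stuck in uninterruptible sleep does not react to SIGKILL 
until it leaves that state, so no script can guarantee success:

```python
import os
import signal
import time


def stop_process(pid, attempts=3, wait=5.0, send=os.kill, sleep=time.sleep):
    """Send SIGTERM once, then SIGKILL up to `attempts` times.

    Returns True once the PID no longer exists, False if it survives.
    `send` and `sleep` are injectable so the logic can be tried out
    without touching a real process.
    """
    def alive():
        try:
            send(pid, 0)  # signal 0 only checks that the PID exists
        except ProcessLookupError:
            return False
        except PermissionError:
            return True  # exists, but belongs to another user
        return True

    for sig in [signal.SIGTERM] + [signal.SIGKILL] * attempts:
        if not alive():
            return True
        try:
            send(pid, sig)
        except ProcessLookupError:
            return True  # exited between the check and the signal
        sleep(wait)  # give the process time to go away
    return not alive()
```

It would need to run as root to signal a daemon, and killing rgmanager 
this way is no more supported than doing it by hand.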

Hope this helps,
Jon

Dicky wrote:

> Hi,
>
> Thx for the reply. :)
>
> Yes, I have installed the 'fence' RPM, and others according to the 
> Red Hat Cluster Suite documentation's "RPM Selection Criteria: Red Hat 
> Cluster Suite with DLM". These are the RPMs I have installed:
>
> =====RPM Installed=====
>
> ccs, fence, gulm, iddev, magma, magma-plugins, perl-Net-Telnet,
> system-config-cluster, ipvsadm, piranha, ccs-devel, gulm-devel,
> iddev-devel, magma-devel
>
> ====END=======
>
> I didn't install GFS.
>
> Here is the /var/log/messages output from when I tried to restart the 
> rgmanager service on the failed node after I re-enabled eth0:
>
> ===/var/log/messages ==
>
> rgmanager: [1074]: <notice> Shutting down Cluster Service Manager...
> clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out
> clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out
> clurgmgrd[31777]: <warning> #67: Shutting down uncleanly
> clurgmgrd: [31777]: <info> Executing /etc/rc.d/init.d/vsftpd stop
> clurgmgrd: [31777]: <info> Executing /etc/rc.d/init.d/httpd stop
> vsftpd: vsftpd shutdown succeeded
> clurgmgrd: [31777]: <info> Removing IPv4 address 192.168.0.112 from eth0
> httpd: httpd shutdown succeeded
> clurgmgrd: [31777]: <info> Removing IPv4 address 192.168.0.111 from eth0
>
> =======END============
>
> Then it hung forever until I manually reset the machine.
>
> I would like to know whether the hang is caused by this line:
> "clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection 
> timed out"? If so, why, and how can I solve it?
>
> Also, even when I type "reboot", it hangs at this line: "Shutting down 
> Cluster Service Manager... Waiting for services to stop: ", which 
> forces me to press the reset button. That may corrupt the file system, 
> so pressing the reset button manually is dangerous.
> Is there any way for me to shut down rgmanager properly?
>
>
> My second question is: why didn't the cluster fail over, even though 
> the status showed that the services were "started"? Is there anything 
> I missed in the configuration process?
>
> Many thanks,
> Dicky
>
>
>
>> Hi,
>>
>> What is output to the "/var/log/messages" files of
>> each node? That should provide a clue as to what the problem is. 
>> Also, did you install the 'fence' RPM and any Clustered LVM / GFS RPMs?
>>
>> You also might consider rebooting the "downed" node
>> - this function is generally taken care of by fencing devices
>> automatically and, as I understand it, "manual fencing" means you gotta
>> reboot :), the assumption being that a failed node won't be allowed
>> back in the cluster until it's restarted.
>>
>> Thanks,
>> Jon
>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster