[Linux-cluster] Severe problems with 64-bit RHCS on RHEL5.1

Gordan Bobic gordan at bobich.net
Thu Apr 17 08:17:30 UTC 2008


Harri.Paivaniemi at tietoenator.com wrote:

> So, this is my sad history with ver 5. Do you use 64-bit ver 5 and what's your feeling?

I only started using it with v5, and I have to say that I haven't had 
any real problems. Some of my clusters have been 64-bit, some 32-bit, 
and I haven't seen any differences yet.

> My problems this time are:
> 
> 1. 2-node cluster. I can't start just one node to get cluster services up - it hangs in fencing and waits until I start the second node; immediately after that, when both nodes are starting cman, the cluster comes up. So if I have lost one node and have to restart the working node for some reason, I can't get the cluster up. It should work like before (both nodes are down, I start one, it fences the other and comes up). Now it just waits... The log says:
> 
> ccsd[25272]: Error while processing connect: Connection refused
> 
> This is such a common error message that it tells me nothing...

I have seen similar error messages before, and they have usually been
caused by either the node names/interfaces/IPs not being listed
correctly in the /etc/hosts file, or iptables firewall rules blocking
communication between the nodes.
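
As a rough illustration of the first check, here is a minimal Python
sketch that scans hosts-file text for cluster node names that are either
missing or mapped to a loopback address. The node names and sample
entries are hypothetical, the function name check_hosts is made up for
this example, and it only inspects the hosts file; it does not test
iptables or actual connectivity between the nodes.

```python
import ipaddress


def _is_loopback(addr):
    """True if addr parses as an IP address in a loopback range."""
    try:
        return ipaddress.ip_address(addr).is_loopback
    except ValueError:
        # Not a parseable address; treat it as "not loopback".
        return False


def check_hosts(hosts_text, node_names):
    """Map each node name to a problem string, or None if it looks fine."""
    mapping = {}
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, []).append(addr)

    result = {}
    for node in node_names:
        addrs = mapping.get(node)
        if not addrs:
            result[node] = "not listed"
        elif any(_is_loopback(a) for a in addrs):
            result[node] = "maps to loopback"
        else:
            result[node] = None
    return result


# Hypothetical hosts-file content and node names, for illustration only.
SAMPLE = """\
127.0.0.1   localhost node1.example.com
10.0.0.1    node1
10.0.0.2    node2.example.com node2   # second node
"""
NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]

if __name__ == "__main__":
    for node, problem in check_hosts(SAMPLE, NODES).items():
        print(node, "->", problem or "ok")
```

To run it against a real system, read /etc/hosts into a string and pass
the node names exactly as they appear in cluster.conf.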

> 2. qdisk doesn't work. 2-node cluster. I start it (both nodes at the same time) to get it up. It works OK: qdisk works, the heuristic works, everything works. If I stop the cluster daemons on one node, that node can't rejoin the cluster without a complete reboot. It joins, the other node says OK, the node itself says OK, quorum is registered and the heuristic is up, but the node's quorum disk stays offline and the other node says this node is offline. If I reboot this machine, it joins the cluster OK.

I believe it's supposed to work that way. When a node fails it needs to 
be fully restarted before it is allowed back into the cluster. I'm sure 
this has been mentioned on the list recently.

> 3. Funny thing: the heuristic ping didn't work at all in the beginning, and support gave me a "ping-script" which made it work... so this describes quite well how experimental this cluster is nowadays...
> 
> I have to tell you it is a FACT that the basics are OK: fencing works OK in a normal situation, I don't have typos, the configs are in sync, everything is OK, but these problems still exist.

I've been in similar situations before, but in the end it always turned
out to be me doing something silly (see above re: the hosts file and
iptables as examples).

> I have twice sent sosreports etc. to RH support. They have spent 3 weeks and still can't say what's wrong...

Sadly, that seems to be the quality of commercial support from any
vendor. Support nowadays seems to have only one purpose: a managerial
back-covering exercise so they can pass the buck. I have always found
that community support is several orders of magnitude better than
commercial support in terms of both response speed and quality.

Gordan



