[Linux-cluster] Testing Failover - Failing in few cases

Tue Jun 12 13:46:18 UTC 2007

On Mon, May 28, 2007 at 06:42:08PM +0530, Satya Daragani wrote:
> Hi Linux-Cluster Team,
> 
> Please help me test failover with RHEL Cluster Suite 4 Update 4. I am
> appending the details of the cluster nodes and configuration here.
> Kindly suggest how to proceed further.
> 
>   6. No fence devices configured (chkconfig --del fenced)

Running without fencing breaks 2-node mode.

> 2nd case
> 
> Currently node1 is running the httpd service. If I bring down the network
> interface (ifconfig eth0 down), the httpd service fails over to node2.
> 
> Then if I bring the interface back up (ifconfig eth0 up) on node1, the
> service does not fail back to node1, and /var/log/messages says "unable
> to contact the cluster infrastructure". *Need your help here*

CMAN doesn't die gracefully in this case; fencing needs to happen to fix
it.

> 3rd case
> 
> Currently node1 is running the httpd service. If I pull the power cord
> (that is, an improper shutdown), the service goes into recovery mode and
> is not started on node2. *Need your help here.*

Are there any log messages here?  I'm not sure whether the lack of fencing
might cause this; I suspect not.  It shouldn't stick in recovery mode.

> Currently node1 is running the httpd service. If I stop it (service httpd
> stop) or killall httpd, failover does not happen. *Need your help
> here.*

That's a bug in the script resource agent.  Edit
/usr/share/cluster/script.sh and change the 'interval' values for the
monitor and status actions from 3600 seconds (1 hour) to 10 seconds.

e.g.
      interval="10"

That should fix it.
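
A minimal sketch of that edit, run against a throwaway sample rather than
the real agent.  The three <action> lines below are an assumed stand-in for
the metadata in /usr/share/cluster/script.sh; the actual lines, and the way
the old interval is written (the "1h" here is a guess), may differ on your
system, so check the file and keep a backup before changing it.

```shell
# Hypothetical excerpt standing in for the agent's metadata; the real
# /usr/share/cluster/script.sh may word these lines differently.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<action name="start" timeout="0"/>
<action name="status" interval="1h" timeout="0"/>
<action name="monitor" interval="1h" timeout="0"/>
EOF

# Rewrite the interval only on the status and monitor action lines,
# leaving every other line untouched.
sed -E '/<action name="(status|monitor)"/ s/interval="[^"]*"/interval="10"/' \
    "$sample" > "$sample.new"

# Show the result for review before applying anything to the real file.
cat "$sample.new"
```

Restricting the substitution to the status and monitor lines avoids
touching any other action that happens to carry an interval attribute.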

Can I convince you to upgrade at least:

   rgmanager
   ccs
   magma
   magma-plugins

...to the update 5 releases?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.