[Linux-cluster] Monitoring services/customize failure criteria

Fri Sep 12 16:24:59 UTC 2008

> 
> > -----Original Message-----
> > Just for clarification, if I use a script resource will the same 
> > script be used with a status parameter to check the status of the 
> > resource?
> 
> Yes.
> 
> > Where is the frequency specified for health checking of the
> > resource, whether it be a custom script or apache? If we want to
> > check health every second, where can I set this frequency? I have
> > used the luci web interface up until now to do my configs and have
> > not seen anywhere yet where I can set the frequency of health checks.
> 
> http://sources.redhat.com/cluster/faq.html#rgm_interval
> 
Thanks! I have made changes to the script.sh and ip.sh scripts, setting
the values as low as 2 seconds. We have a very simple apache setup and we
want quick failover. The checks easily complete within a second, so I
don't believe I am creating endless loops of continuous checking. The log
file, however, still shows that checking is done every 10 seconds. My
changes have increased the frequency, but there seems to be a lower limit
of 10 sec and I can't find a place to override the value.

Sep 12 15:24:11 LONGAPA02ALT clurgmgrd: [14666]: <info> Executing /etc/rc.d/init.d/httpd status
Sep 12 15:24:21 LONGAPA02ALT clurgmgrd: [14666]: <info> Executing /etc/rc.d/init.d/httpd status
Sep 12 15:24:31 LONGAPA02ALT clurgmgrd: [14666]: <info> Executing /etc/rc.d/init.d/httpd status
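
To confirm the effective check interval from the log rather than by eye,
something like the following could be used. It is only a rough sketch:
it assumes syslog-style lines like the ones above (time of day in the
third field), and check_gaps is a made-up helper name, not part of
rgmanager.

```shell
# Hypothetical helper (not an rgmanager tool): read syslog-style lines on
# stdin and print the gap, in seconds, between consecutive clurgmgrd
# "Executing" entries. Assumes the third field is HH:MM:SS and that the
# entries do not cross midnight.
check_gaps() {
    awk '/clurgmgrd/ && /Executing/ {
        split($3, t, ":")                      # t[1]=hours, t[2]=minutes, t[3]=seconds
        now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
        if (seen) print now - prev             # gap since the previous check
        prev = now
        seen = 1
    }'
}
```

It reads from standard input, e.g. check_gaps < /var/log/messages (the
log path is an assumption; use wherever clurgmgrd logs on your system).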

Currently, if I stop the httpd server manually on a node, failover
takes about 15 sec, and if I time it roughly to coincide with the next
check, the best I can get failover down to is 11 sec. It seems that I am
still bound by this 10 sec setting somewhere. It definitely does not take
10 sec to shut down httpd, tear down the IP and then do the reverse on
the other node.

Failover simply does not happen if I physically disconnect a node. I
believe it is a config error somewhere on my part, but I am not sure
exactly what. I have a two-node cluster, so there should be a quorum for
the other node to continue. I am looking into this currently, but some
hints/suggestions would be appreciated. If I bring the box back online,
i.e. start the network services again, failover occurs to the other node,
which I find bizarre.

Regards

> 
> Correct. The script resource is generic - used for when a 
> more specific resource type is not available. For the 
> developers to build a framework to do health checking is 
> adding (potentially) a lot of complexity where it may not be 
> warranted. Simply throw it back on the implementer to deal 
> with - the KISS principle.  :-)
> 
> 
> --Jeff
> Performance Engineer
> 
> OpSource, Inc.
> http://www.opsource.net
> "Your Success is Our Success"
>   
> 
> 
