[Linux-cluster] IP monitor failing periodically

Fri Jul 6 16:34:49 UTC 2007

On Sat, Jun 30, 2007 at 01:41:03PM -0500, Chris Harms wrote:
> I am experiencing periodic failovers due to a floating IP address not 
> passing the status check:
> 
> clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204
> Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip 
> "192.168.13.204" returned 1 (generic error)
> 
> Both nodes have bonded NICs with gigabit connections to redundant 
> switches, so it is unlikely they are going down, and there is nothing in 
> the logs about Linux losing the links.  I parked all the cluster services - 2 
> Postgres services and 1 Apache - on one node and allowed it to run 
> overnight.  There would be no client activity during this time. One 
> Postgres service failed two times in this manner and the other failed 
> once in this manner.  The Apache service did not fail.
> 
> What can I do to resolve this or get more information out of the system 
> to resolve this?

Hmm, with bonded NICs, ip.sh monitors the links of the physical devices
behind the bond. It's supposed to pass the link check, and not complain,
as long as at least one of those links is up.
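
To see what the bonding driver itself reports, the "at least one link is
up" rule can be sketched as a small shell helper. This is only an
illustration and not the code in ip.sh: the function name
bond_any_slave_up is hypothetical, and it assumes the usual
/proc/net/bonding/<bond> layout, where each "Slave Interface:" line is
followed by that slave's "MII Status:" line.

```shell
# Hypothetical helper, NOT the actual ip.sh logic.
# Succeeds (exit 0) if at least one slave listed in a bonding status
# file reports "MII Status: up"; fails otherwise.
bond_any_slave_up() {
    status_file="$1"
    awk '
        # Start of a per-slave section.
        /^Slave Interface:/ { in_slave = 1; next }
        # First MII Status line after a slave header belongs to that slave.
        in_slave && /^MII Status:/ {
            if ($3 == "up") found = 1
            in_slave = 0
        }
        END { exit(found ? 0 : 1) }
    ' "$status_file"
}
```

For example, running `bond_any_slave_up /proc/net/bonding/bond0 || echo
"no slave link up"` on each node (assuming the bond is named bond0) would
show whether the driver ever sees both links down.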

The ping check is a bit odd; you could just disable it in
/usr/share/cluster/ip.sh.

I.e. change the 'ping' line to '/bin/true', so that part of the status
check always succeeds.
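
If you would rather collect more information before disabling the ping,
one option is to run a separate probe and record when it fails. The
sketch below is an assumption-laden example, not part of rgmanager: the
function name log_probe_failure and the log path are made up for
illustration.

```shell
# Hypothetical diagnostic helper; not part of rgmanager or ip.sh.
# Usage: log_probe_failure LOGFILE COMMAND [ARGS...]
# Runs COMMAND quietly and, if it fails, appends a UTC-timestamped
# line to LOGFILE. Returns 0 on success, 1 on failure.
log_probe_failure() {
    logfile="$1"; shift
    if ! "$@" >/dev/null 2>&1; then
        echo "$(date -u '+%Y-%m-%d %H:%M:%S') probe failed: $*" >> "$logfile"
        return 1
    fi
    return 0
}

# Example loop (commented out; the log path is an assumption):
# while sleep 5; do
#     log_probe_failure /tmp/ip-probe.log ping -c 1 -w 2 192.168.13.204
# done
```

Comparing the timestamps in that log with the clurgmgrd warnings should
show whether the address is really unreachable at those moments or
whether only the resource agent's check is failing.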

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.