[Linux-cluster] [UPDATE] IP monitor failing periodically

Sat Jul 21 20:41:04 UTC 2007

We reinstalled our machines with RHEL 5 x86_64 (we were running i386) a 
few weeks ago and the mysterious IP monitoring failures have disappeared. 
I believe it was postulated that a compiler bug regarding -fpie might be 
causing segfaults in i386 binaries, so this would support that theory to 
some degree, although I did not really attempt to confirm it further.  I 
thought the architecture change fixing the random failovers was noteworthy.

### previous thread below

Hi Chris,

I am experiencing the same problem on RHEL 5 and I have a support 
request in with Red Hat.

I was asked to increase the debug level by changing the <rm> line in the 
cluster configuration to:

<rm log_facility="local4" log_level="7">

I then needed to add "local4.* /var/log/cluster" to /etc/syslog.conf and 
run "service syslog restart".
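
A minimal sketch of the syslog step, assuming a sysklogd-style 
/etc/syslog.conf as on RHEL 5. The helper name add_syslog_rule is 
hypothetical; it only appends the rule when no local4 rule is already 
present, so running it twice does not duplicate the line:

```shell
# add_syslog_rule FILE
# Append the local4 rule to FILE unless a local4 rule already exists.
# (Hypothetical helper, not part of any Red Hat tooling.)
add_syslog_rule() {
    conf="$1"
    rule='local4.*    /var/log/cluster'
    if grep -q '^local4\.' "$conf"; then
        return 0    # rule already present, nothing to do
    fi
    printf '%s\n' "$rule" >> "$conf"
}
```

As root, this would be used as "add_syslog_rule /etc/syslog.conf" 
followed by "service syslog restart".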

To apply the change I needed to propagate the updated cluster 
configuration to both nodes:

# ccs_tool update /etc/cluster/cluster.conf
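
One detail worth checking before running the update: as far as I know, 
ccs_tool update expects the config_version attribute on the <cluster> 
element to be higher than the version currently running. The sketch 
below only reads the current value and prints the next one; the helper 
name next_config_version is hypothetical, and it assumes the attribute 
is written as config_version="N" with double quotes:

```shell
# next_config_version FILE
# Print the config_version found in FILE plus one.
# (Hypothetical helper; it does not modify the file.)
next_config_version() {
    cur=$(sed -n 's/.*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1)
    if [ -z "$cur" ]; then
        echo "no config_version found in $1" >&2
        return 1
    fi
    echo $((cur + 1))
}
```

For example, "next_config_version /etc/cluster/cluster.conf" shows the 
value to put in the file before propagating it.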

In the week since I increased the logging, the problem has not 
recurred, although before that it was occurring regularly - 2 to 3 
times a day. One day last week, out of curiosity, I reverted to the 
default settings, and within a few hours the ping check failed on one 
of the clustered IP addresses and the service was restarted.

I now have the logging back at 7 and the support request is pending.
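
To see how often each address fails once the messages are going to 
/var/log/cluster, something like the following could be used. It is 
only a sketch: the helper name count_ping_failures is hypothetical, and 
it assumes the failure lines end with the IP address, as in the 
"Failed to ping" message quoted below:

```shell
# count_ping_failures FILE
# Print "COUNT IP" for every address with a "Failed to ping" line in FILE.
# (Hypothetical helper; assumes the IP is the last field on the line.)
count_ping_failures() {
    awk '/Failed to ping/ { n[$NF]++ }
         END { for (ip in n) print n[ip], ip }' "$1" | sort -k2
}
```

Running "count_ping_failures /var/log/cluster" prints one line per 
address, which makes it easier to tell whether a single floating IP is 
affected or all of them.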

Regards
-- 
David Schroeder
Server Support
Information Services Division
Flinders University
Adelaide, Australia
Ph: +61 8 8201 2689

Chris Harms wrote:
> I am experiencing periodic failovers due to a floating IP address not 
> passing the status check:
> 
> clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204
> Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip "192.168.13.204" returned 1 (generic error)
> 
> Both nodes have bonded NICs with gigabit connections to redundant 
> switches, so it is unlikely the links are going down, and there is 
> nothing in the logs about Linux losing them.  I parked all the cluster 
> services - 2 Postgres services and 1 Apache - on one node and let it 
> run overnight, when there would be no client activity.  One Postgres 
> service failed twice in this manner and the other failed once.  The 
> Apache service did not fail.
> 
> What can I do to resolve this or get more information out of the system 
> to resolve this?
> 
> Thanks in advance,
> Chris