[Linux-cluster] Cluster services stopping

Thu May 31 12:52:28 UTC 2007

I'm trying to figure out why my cluster services keep stopping for no
apparent reason.  What the stopped services have in common is the
following set of resources:  1 GFS file system, 1 IP address, and 1 or 2
init scripts.  The init scripts vary between apache, tomcat, mysql, and
squid.

Normally, if a process dies and a status check on the init script
returns a non-zero value, that event gets logged, but that isn't
happening when these services are stopped.  The first logged event
related to a failed service is the one shown below; after it, the
service is stopped and recovered.

"May 28 19:11:33 tf36 clurgmgrd[4418]: <notice> Stopping service twapp"
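
To rule out the init scripts themselves, the status check can be run by
hand and its exit code inspected.  The following is only a rough sketch
of that idea, not what rgmanager does internally: the helper name
check_status is made up for illustration, and the script path in the
usage note is an assumption that will differ per install.

```shell
#!/bin/sh
# check_status: hypothetical helper (not part of rgmanager or cman).
# Runs "<init script> status" and prints the exit code it returned.
check_status() {
    script="$1"

    # Bail out early if the path is wrong or the script is not executable.
    if [ ! -x "$script" ]; then
        echo "$script: not executable or missing"
        return 2
    fi

    # Capture the exit code without letting a failure abort the caller.
    rc=0
    "$script" status >/dev/null 2>&1 || rc=$?

    if [ "$rc" -eq 0 ]; then
        echo "$script: status OK"
    else
        echo "$script: status returned $rc"
    fi
    return "$rc"
}
```

Calling it as, for example, check_status /etc/init.d/squid (path
assumed) right after a service has been stopped would show whether the
script itself reports a failure at that moment.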

These nodes remain idle nearly all of the time and have a lot of
horsepower.  Some helpful information:

[smccl@tf36 log]$ rpm -q rgmanager cman
rgmanager-1.9.46-0
cman-1.0.4-0

[smccl@tf36 log]$ uname -osrvmpi
Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686
i386 GNU/Linux

[smccl@tf36 log]$ cat /etc/redhat-release
CentOS release 4.3 (Final)

Any help is appreciated.  I can provide more information if that would
be helpful.  Also, is there some sort of debugging I can enable within
rgmanager to see what is actually failing or timing out and forcing a
restart of these services?