[Linux-cluster] Cluster services stopping

Thu May 31 20:30:18 UTC 2007

Hi

We have a similar problem: my server is running three services, but only 
one of them sometimes restarts for no apparent reason.
The server is not under high load.
Weeks or months can pass without any service restart.

The only difference from the other services is an ext3 file system on 
shared external storage.

We have another installation with a similar configuration, and this problem 
does not occur there.
I'm checking the fs.sh script to add more debug info. I think this 
script may be reporting some error, and that may trigger the service restart.
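
A minimal sketch of the kind of extra debug logging meant here. It is 
not taken from fs.sh or rgmanager: the helper name run_and_log, the 
DEBUG_LOG variable and the default log path are all hypothetical. It 
runs a check command and, if the command returns non-zero, appends a 
timestamped line to a log file so a failed check can be matched later 
against the time of the service restart.

```shell
# Hypothetical helper, not part of fs.sh or rgmanager.
# DEBUG_LOG and its default path are assumptions for illustration.
DEBUG_LOG="${DEBUG_LOG:-/tmp/fs-status-debug.log}"

run_and_log() {
    # Run the given command with its arguments unchanged.
    "$@"
    rc=$?
    # Record only failures, with a syslog-style timestamp.
    if [ "$rc" -ne 0 ]; then
        echo "$(date '+%b %d %H:%M:%S') check failed: rc=$rc cmd=$*" >> "$DEBUG_LOG"
    fi
    # Hand the original exit status back to the caller.
    return "$rc"
}
```

For example, a check could be wrapped as 
run_and_log grep -q " /mnt/shared " /proc/mounts 
where /mnt/shared is a placeholder for the real mount point.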

We are running RHEL4 U2 and have only updated rgmanager to "rgmanager-1.9.53-0".

At the moment I'm installing RHEL4 U5 for testing, and we will try to update 
the production host later.
My problem is that I'm not sure whether this update will fix the issue.
Updating production is "complicated", and I will be in serious 
trouble if the update does not fix it.

Bye-bye.
Note: sorry for my bad English.

Scott McClanahan wrote:
> I'm trying to figure out why my cluster services keep stopping for what
> seems to be no obvious reason.  What the stopped services have in
> common is the following set of resources:  1 GFS file system,
> 1 IP address, and 1 or 2 init scripts.  The init scripts vary between
> apache, tomcat, mysql, and squid.
>
> Normally, if a process dies and a status check on the init script
> returns non-zero, that event gets logged, but that isn't happening when
> these services are stopped.  An example of the first logged event
> related to a failed service is shown below and then the service is
> stopped and recovered.
>
> "May 28 19:11:33 tf36 clurgmgrd[4418]: <notice> Stopping service twapp"
>
> These nodes remain quite idle all of the time and have a lot of
> horsepower.  Some helpful information:
>
> [smccl@tf36 log]$ rpm -q rgmanager cman
> rgmanager-1.9.46-0
> cman-1.0.4-0
>
> [smccl@tf36 log]$ uname -osrvmpi
> Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686
> i386 GNU/Linux
>
> [smccl@tf36 log]$ cat /etc/redhat-release
> CentOS release 4.3 (Final)
>
> Any help is appreciated.  I can provide more information if you think it
> is helpful.  Also, is there some sort of debugging within rgmanager I
> can enable to see what is truly failing or timing out and requiring a
> restart of these services?