[Linux-cluster] service restart problem

Marc Lewis marc at blarg.net
Tue Feb 14 00:40:03 UTC 2006


On Mon, Feb 13, 2006 at 11:56:20AM -0800, Marc Lewis wrote:
> On Sat, Feb 11, 2006 at 11:57:32PM +0200, Omer Faruk Sen wrote:
> > 
> > Hi,
> > 
> > I have a problem with Red Hat Cluster Suite.  I have a two-node test
> > cluster.  The cluster master is cluster2, which runs vsftpd and mysql
> > as services.  I manually edited vsftpd.conf so that it couldn't start
> > on cluster2, then killed the vsftpd process.  After that, the cluster
> > neither failed over to cluster1 nor, once I had corrected vsftpd.conf
> > a few seconds later, restarted the service on cluster2.  What I want
> > to ask is: does Red Hat Cluster do a service status check so that it
> > can restart a failed service, and does it fail over when one resource
> > stops working?
> > 
> > Best regards,
> 
> I'm seeing similar issues here.  The script entry doesn't seem to do
> anything when checking status.
> 
> For example, we have a MySQL service defined with an IP address, a shared
> SAN partition, and the /etc/init.d/mysqld script. 
> 
> The service starts up and shuts down fine when done manually via clusvcadm,
> but if I kill the mysql daemon, with the script or manually, clurgmgrd
> doesn't seem to care.  It just runs its status check, which does report
> the service as "stopped", without ever restarting it.

Just wanted to follow up and say that I've solved the status check
problem, sort of.  I decided to play with the exit values of various
init scripts to see what effect, if any, they would have on clurgmgrd,
and managed to get something cobbled together that works.  It's not the
best solution, but it should do.

To get these scripts to work, it's important that the "status" and
"stop" commands return values that clurgmgrd can deal with.

If status returns a non-zero value (i.e. an error) then clurgmgrd will
think the service has failed and attempt to restart it.  It does this by
issuing a "stop" command, taking down the other resources associated with
it, and then bringing them all back up.

It's the stop command that can cause some problems.  For example, take
the service defined above: "MySQL", with the IP address, the shared
storage, and the /etc/init.d/mysqld script.  I start it up with
"clusvcadm -e MySQL" and all is well; it brings everything up in the
correct order and MySQL runs fine.  Every 30 seconds, I see it running
a "status" check in the syslog.  So far so good.

Now, I have modified the mysqld script to return the value of "status"
from /etc/init.d/functions as its exit code, along the lines of the
sketch below.  When everything is running fine, it returns "0" and
clurgmgrd is happy.
If I do a "killall -9 mysqld_safe mysqld", status will now return 2,
which is an error, and clurgmgrd will attempt to restart the service by
issuing the "stop" command to the script.  This is where we run into
problems.

Since the service is already dead, the init script returns an error
when trying to stop it.  clurgmgrd then marks the service as failed,
and the service stays down.

The only way I've found around this is to force the "stop" to return 0 no
matter what.  This way clurgmgrd will believe it has succeeded in shutting
down the service and will restart it.
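
(Roughly this, using killproc and the usual lock file from the stock
script:)

    # The "stop" branch of the same case statement, forced to succeed:
      stop)
            killproc mysqld               # from /etc/init.d/functions
            rm -f /var/lock/subsys/mysqld
            exit 0    # report success even if mysqld was already dead,
                      # so clurgmgrd carries on to the start phase
            ;;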

My reasoning is that it is better to have it fail starting it up than to
have it fail stopping a service that is already dead.  I'm sure there are
other problems with this method, but I haven't identified them yet.

> Also, I've seen clurgmgrd die without logging anything anywhere.
> I'll just check the cluster and it won't be running.  All of the
> services stay running, but the manager is dead.  Restarting it is
> problematic, since it will restart each of the services, causing a
> brief interruption.
> 
> Anyone have any ideas on how to solve either of these two problems?  I've
> been waiting to deploy the cluster we've put together until I could resolve
> these two issues, but have run out of things to try.

I'm still seeing clurgmgrd die periodically for no apparent reason,
though.  I may have to write another script to monitor it as well and
run that out of cron every so often.  That doesn't seem like a very
good solution, though, since restarting clurgmgrd restarts all of the
services running on that node.
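
If I do end up writing it, it would probably be no more than something
like this, run from cron every few minutes (a rough sketch; it assumes
clurgmgrd shows up under that name in pidof and is managed by the
rgmanager init script):

    #!/bin/sh
    # Rough sketch of a cron watchdog: restart rgmanager if clurgmgrd
    # has died.  When it fires, it still bounces every service on this
    # node, so it only papers over the real problem.
    if ! pidof clurgmgrd >/dev/null 2>&1; then
        logger -t clurgmgrd-watchdog "clurgmgrd not running, restarting"
        /sbin/service rgmanager start
    fi

An /etc/cron.d entry along the lines of "*/5 * * * * root
/usr/local/sbin/check-clurgmgrd" (path made up) would run it, but as I
said, it's a stopgap at best.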

 - Marc

-- 
Marc Lewis
Blarg! Online Services, Inc.
