[Linux-cluster] service restart problem

Tue Feb 14 07:53:40 UTC 2006

Isn't it peculiar that nobody has noticed this before? The problem seems to
have existed since the release of RHEL 4.2 (as far as I can tell), or maybe
earlier.

I hope this issue gets solved as soon as possible. By the way, thanks for
the workaround, Marc.
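
For anyone who finds this thread later, here is a rough sketch of the
exit-code behaviour Marc describes below. It is not his actual change: he
edited the init script itself, while this uses a separate wrapper script.
The names cluster_action and REAL_SCRIPT are made up for illustration, and
it assumes the underlying script's "status" already exits non-zero when the
daemon is down (Marc had to modify his mysqld script to get that).

```shell
#!/bin/sh
# Hypothetical wrapper around an existing init script, for use as the
# cluster's script resource. REAL_SCRIPT is an assumed default.
REAL_SCRIPT="${REAL_SCRIPT:-/etc/init.d/mysqld}"

cluster_action() {
    case "$1" in
        status)
            # Pass the real status exit code through unchanged, so a dead
            # daemon is reported as non-zero and a restart is attempted.
            "$REAL_SCRIPT" status
            return $?
            ;;
        stop)
            # Try to stop, but always report success: stopping a daemon
            # that is already dead should not fail the whole service.
            "$REAL_SCRIPT" stop
            return 0
            ;;
        *)
            # start, restart, etc. keep their normal exit codes.
            "$REAL_SCRIPT" "$@"
            return $?
            ;;
    esac
}

# Dispatch only when called with an action, e.g. "wrapper.sh status".
if [ $# -gt 0 ]; then
    cluster_action "$@"
    exit $?
fi
```

The trade-off is the one Marc mentions: a "stop" that really did fail is
hidden, so the failure shows up at the next "start" instead.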

> On Mon, Feb 13, 2006 at 11:56:20AM -0800, Marc Lewis wrote:
>> On Sat, Feb 11, 2006 at 11:57:32PM +0200, Omer Faruk Sen wrote:
>> >
>> > Hi,
>> >
>> > I have a problem with Red Hat Cluster Suite. I have a two-node test
>> > cluster. The cluster master is cluster2, which runs vsftpd and mysql
>> > as services. I manually edited vsftpd.conf so that vsftpd can't start
>> > on cluster2, then killed the vsftpd process. The cluster didn't fail
>> > over to cluster1, and when I corrected vsftpd.conf a few seconds later,
>> > cluster2 didn't start the service again either. What I want to ask is:
>> > does Red Hat Cluster Suite not support service status checks, so that
>> > it can restart a failed service, and does it support failover if one
>> > resource stops working?
>> >
>> > Best regards,
>>
>> I'm seeing similar issues here.  The script entry doesn't seem to do
>> anything when checking status.
>>
>> For example, we have a MySQL service defined with an IP address, a shared
>> SAN partition, and the /etc/init.d/mysqld script.
>>
>> The service starts up and shuts down fine when done manually via
>> clusvcadm, but if I kill the mysql daemon with the script or manually,
>> clurgmgrd doesn't seem to care.  It just runs its status check, which
>> does report it as "stopped", without ever restarting the service.
>
> Just wanted to follow up and say that I've solved the status check problem,
> sort of.  I decided to play with the exit value of various init scripts to
> see what, if any, effect they would have on clurgmgrd, and managed to get
> something cobbled together that works.  It's not the best solution, but
> it should do.
>
> To get these scripts to work, it's important that the "status" and "stop"
> actions return values that clurgmgrd can deal with.
>
> If status returns a non-zero value (i.e. an error) then clurgmgrd will
> think the service has failed and attempt to restart it.  It does this by
> issuing a "stop" command, taking down the other resources associated with
> it, and then bringing them all back up.
>
> It's the stop command that can cause some problems.  For example, in my
> service defined above, I have the service "MySQL", which has the IP
> address, shared storage and the /etc/init.d/mysqld script.  I start it up
> using "clusvcadm -e MySQL" and all is well, it brings everything up in the
> correct order and MySQL is running fine.  Every 30 seconds, I see it
> running a "status" check in the syslog.  So far so good.
>
> Now, I have modified the mysqld script to return the value of "status"
> from /etc/init.d/functions as its exit code.  So, when everything is
> running fine, it returns "0" and clurgmgrd is happy.  If I do a
> "killall -9 mysqld_safe mysqld", status will now return a value of 2,
> which is an error.  clurgmgrd will attempt to restart it by issuing the
> "stop" command to the script.  This is where we run into problems.
>
> Since the service is already dead, the startup scripts return an error
> when trying to stop the service.  clurgmgrd fails the service and the
> service is now down.
>
> The only way I've found around this is to force the "stop" to return 0 no
> matter what.  This way clurgmgrd will believe it has succeeded in shutting
> down the service and will restart it.
>
> My reasoning is that it is better to have it fail starting it up than to
> have it fail stopping a service that is already dead.  I'm sure there are
> other problems with this method, but I haven't identified them yet.
>
>> Also, I've seen clurgmgrd die without logging anything anywhere.  I'll
>> just check the cluster and it won't be running.  All of the services stay
>> running, but the manager is dead.  Restarting it is problematic since it
>> will restart each of the services, causing a brief interruption.
>>
>> Anyone have any ideas on how to solve either of these two problems?  I've
>> been waiting to deploy the cluster we've put together until I could
>> resolve these two issues, but have run out of things to try.
>
> I'm still seeing clurgmgrd die periodically for no reason, though.  I may
> have to write another script to monitor it as well and run that out of
> cron every so often.  That doesn't seem like a very good solution, though,
> since it does restart all of the services that are running on that node.
>
>  - Marc
>
> --
> Marc Lewis
> Blarg! Online Services, Inc.
>
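
On clurgmgrd dying silently: a cron watchdog along the lines Marc suggests
might look roughly like the sketch below. Nothing in it comes from the
thread except the daemon name. The check_manager function, the "rgwatch"
log tag, and the use of an init script called "rgmanager" to start the
daemon are assumptions, so adjust them for your system.

```shell
#!/bin/sh
# Hypothetical cron watchdog for the cluster resource manager daemon.
# Assumptions: pidof is available, and START_CMD is the right way to
# start the daemon on this node.
DAEMON="${DAEMON:-clurgmgrd}"
START_CMD="${START_CMD:-/sbin/service rgmanager start}"

check_manager() {
    if pidof "$DAEMON" >/dev/null 2>&1; then
        return 0    # daemon is alive, nothing to do
    fi
    # Log first (best effort): as noted in the thread, starting the
    # manager also restarts every service running on this node.
    logger -t rgwatch "$DAEMON not running, starting it" 2>/dev/null || true
    $START_CMD
}
```

To use it, add a final line calling check_manager and run the script from
cron every few minutes. It shares the drawback Marc points out: each
restart briefly interrupts the services on that node.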

-- 
Omer Faruk Sen
http://www.faruk.net