[Linux-cluster] Init scripts and cluster suite

Tue Aug 29 15:18:34 UTC 2006

On Mon, 2006-08-28 at 10:27 +0200, Jos Vos wrote:
> Hi,
> 
> Init scripts usually return a non-zero return code when they try to 
> stop a service that isn't running anymore.

According to the LSB, init scripts are supposed to return 0 in
stop-after-stop situations.
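
For illustration only, here is a minimal Python sketch of a wrapper that
enforces that rule around a script that gets it wrong.  It assumes the
script's "status" action follows the LSB exit codes (0 = running; 1, 2, 3 =
not running).  The names lsb_stop and runner, and the example path, are
hypothetical and not part of any shipped tool.

```python
import subprocess
from typing import Callable, Sequence

# LSB "status" exit codes that mean the service is not running:
# 1 = dead but pid file exists, 2 = dead but lock file exists, 3 = not running.
LSB_NOT_RUNNING = {1, 2, 3}


def run(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.call(list(argv))


def lsb_stop(script: str, runner: Callable[[Sequence[str]], int] = run) -> int:
    """Stop a service, treating stop-after-stop as success.

    `script` is a hypothetical init script path, e.g. /etc/init.d/httpd.
    `runner` is injectable so the logic can be exercised without a real script.
    """
    if runner([script, "status"]) in LSB_NOT_RUNNING:
        # Already stopped: per the LSB, "stop" should report success here,
        # so the script's own "stop" action is not invoked at all.
        return 0
    # Running (or status unknown): let the script's own "stop" decide.
    return runner([script, "stop"])
```

Calling lsb_stop("/etc/init.d/httpd") would then return 0 when httpd is
already down, whatever the script's own "stop" action would have returned.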

> When a cluster service has failed for some reason, the cluster suite
> requires you to first disable a service, before enabling it again.
> Disabling a service will try to stop the service, which will fail,
> and thus the service can't be disabled (and also not enabled again).

Disabling (e.g. failed->disabled) should always work, even if a portion
of the 'stop' phase returns nonzero.  It's really the only way to get a
service out of the failed state - so the assumption is that you have
cleaned up (or will clean up) the service before you try to enable it
again.

If this is not working, please file a bugzilla -- failed->disabled should
work (though perhaps it should emit clearer warnings).
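
As a toy model of the intended behavior (this is not rgmanager code; the
function name and the state string are assumptions made up for the
example), a failed service that is being disabled ends up disabled even
when its stop phase returns nonzero, and the failure is only logged:

```python
import logging
from typing import Callable

log = logging.getLogger("disable-sketch")


def disable_failed_service(stop: Callable[[], int]) -> str:
    """Model the failed->disabled transition for a single service.

    `stop` stands in for the service's stop phase and returns its exit code.
    A nonzero code is reported but does not block the transition.
    """
    rc = stop()
    if rc != 0:
        # The administrator is assumed to clean up before re-enabling.
        log.warning("stop returned %d; disabling anyway", rc)
    return "disabled"
```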

> The workaround is to either manually start the service and then
> disable it (bad idea for a cluster service) or to write all
> cluster service scripts yourself, even if you just need to
> control a standard service like httpd.

Well, for httpd, Marek Grac just wrote an agent which plugs into
rgmanager. ^^  On a more serious note, here's the bugzilla entry that
covers the problem you're seeing:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151104

> Is the latter the recommended solution for this problem?

:( Yes.  For now.

The patch included in the above bugzilla should fix the problem for most
Red Hat (CentOS, etc.) installations, but it will not be shipped in any
RHEL4 update, because users and administrators might be relying on the
"stop after stop returns failure" "feature" (even though it is not LSB
compliant).

I'm fairly certain that RHEL5 and later releases will have the problem
corrected (I'm pretty sure FC5 already has it fixed).

-- Lon