[Linux-cluster] Failover trouble.

Mon Oct 25 17:29:21 UTC 2004

On Sun, 2004-10-24 at 01:53 +0200, Eli Elizur wrote:
> I'm having problems failing over to my backup system.

> Yet, when the user script sends an "exit 1" the cluster is stopping
> the service but does not start it on the backup machine.

In RHEL 2.1, the exit status of the 'stop' phase is very important.  A
nonzero exit means that the user script could not, for some reason,
clean up the user service in its entirety.  Because of this, the status
of the user service is unknown.  We do not know that the service was
fully cleaned up, so it is *not* safe to relocate the service to another
node (ever).  In this case, there are generally two things you can do:

(1) Nothing.  Let other services in the cluster and on the node continue
to run normally; the service will remain broken until fixed by an
administrator.

(2) Reboot the node to ensure that any allocated resources are cleaned
up.

In RHEL 2.1, we opted for (1).

In that case, your user script should only ever return nonzero in the
'stop' path if it encounters something that it cannot fix automatically.

This indicates to the cluster software that the service has failed and
is in a state which can *not* be recovered automatically.  If the
service can be recovered, your user script must recover it and return
'0' from the stop path.

(This means that if the service was NOT running at the time the 'stop'
phase was called, you must still return '0' from the stop path.)
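As a rough illustration of the rule above, here is a minimal sketch of a
'stop' handler that returns 0 whenever the service is cleaned up or was
never running, and nonzero only when cleanup really failed.  The service
name, the PIDFILE path, and the function name stop_service are all
hypothetical; they are not part of the cluster software.

```shell
#!/bin/sh
# Hypothetical pid file location for an imaginary "myservice".
PIDFILE="${PIDFILE:-/var/run/myservice.pid}"

stop_service() {
    # Service was not running: nothing to clean up, so report success.
    [ -f "$PIDFILE" ] || return 0

    pid=$(cat "$PIDFILE")
    if kill -0 "$pid" 2>/dev/null; then
        # Ask the process to exit, then wait up to about 10 seconds.
        kill "$pid" 2>/dev/null
        i=0
        while kill -0 "$pid" 2>/dev/null && [ "$i" -lt 10 ]; do
            sleep 1
            i=$((i + 1))
        done
        # Still alive: escalate to SIGKILL and give it a moment.
        if kill -0 "$pid" 2>/dev/null; then
            kill -9 "$pid" 2>/dev/null
            sleep 1
        fi
        # If it survived even that, we cannot fix this automatically.
        if kill -0 "$pid" 2>/dev/null; then
            return 1
        fi
    fi

    # Process is gone (or the pid file was stale): clean up and succeed.
    rm -f "$PIDFILE"
    return 0
}
```

The 'stop' branch of the user script would then call stop_service and
exit with its status, so that "already stopped" and "stopped cleanly"
both look like success to the cluster.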

If you wish for behavior (2), simply change your script to run
"/sbin/reboot -fn" instead of returning nonzero from the stop path.
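One possible way to wire that up is sketched below.  The wrapper name
stop_or_reboot and the REBOOT_CMD variable are assumptions made for
illustration (REBOOT_CMD exists only so the failure path can be dry-run
without actually rebooting); the cleanup command you pass in is whatever
your own script uses to stop the service.

```shell
#!/bin/sh
# Hypothetical override for testing; defaults to the forced reboot.
REBOOT_CMD="${REBOOT_CMD:-/sbin/reboot -fn}"

# Run the given cleanup command.  If it fails, reboot the node instead
# of reporting a nonzero 'stop' status to the cluster.
stop_or_reboot() {
    if "$@"; then
        return 0
    fi
    # Intentionally unquoted so that "-fn" is passed as an argument.
    $REBOOT_CMD
    return 0
}
```

For example, the 'stop' branch could call "stop_or_reboot my_cleanup",
where my_cleanup is your own (hypothetical) cleanup function.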

-- Lon