[Linux-cluster] Monitoring services/customize failure criteria

Mon Sep 15 20:37:24 UTC 2008

> -----Original Message-----
> Is it really only 0 versus non-zero for status? How does the system
> distinguish between running, failed, and cleanly stopped then? I ask
> because both LSB & OCF specify a slightly more differentiated status
> exit code.

Yes, 0 is success and anything else is "not success." If the resource is
not 100% operational (as determined by the script) then it should not
be returning a 0 exit status. If it is, it's broken.
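
To make the 0 versus non-zero split concrete, here is a minimal sketch of
how a caller might treat a script's 'status' action. This is not the actual
Cluster Services code; the function name check_status is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical caller (illustration only): run a script's "status" action
# and collapse its exit code into "success" or "not success".
check_status() {
    rc=0
    # Run the script given as $1 with the "status" argument; keep its exit code.
    "$1" status >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "success"
    else
        # Any non-zero code lands here, whatever its specific meaning.
        echo "not success (exit $rc)"
    fi
}
```

For example, "check_status /etc/init.d/httpd" would report success only if
that script's status action exits 0. The specific non-zero value is shown
but does not change the outcome.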

The LSB is fairly clear on this matter. It gives a table of exit status
codes to use with the 'status' action, of which 0 is the only success
code. All the other codes are failure codes. For all other non-status
actions, it states "the init script shall return an exit status of zero
if the action was successful. Otherwise, the exit status shall be
non-zero." This goes all the way back to LSB 1.0 (published June 29,
2001.)

http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
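
As a rough sketch of what a more differentiated 'status' action can look
like, the function below returns three of the LSB status codes (0, 1 and 3)
based on a pid file. The function name lsb_status and the pid-file argument
are assumptions for illustration, not taken from any particular init script.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): LSB-style status from a pid file.
# Note: kill -0 also fails for a live process owned by another user, so
# this is only reliable when run as root or as the daemon's owner.
lsb_status() {
    pidfile=$1
    if [ ! -f "$pidfile" ]; then
        return 3    # LSB: program is not running
    fi
    pid=$(cat "$pidfile" 2>/dev/null)
    # kill -0 sends no signal; it only tests whether the pid can be signalled.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return 0    # LSB: program is running or service is OK
    fi
    return 1        # LSB: program is dead and pid file exists
}
```

A caller that only looks at 0 versus non-zero would treat both 1 and 3 as
failure, which matches the behaviour described above.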

It is more important to realize that Cluster Services is taking
advantage of the defined LSB actions of init scripts. Another way to
look at it is:

- the LSB *only* applies to init scripts, not a script resource for
Cluster Services.
- you can use *any* script you like for the script resource. It does not
have to be an init script (it simply makes sense to do so.)

Here's an example of a script that can be used as a script resource:

===========================
#!/bin/sh

case "$1" in
   start)
      exit 0
   ;;
   stop)
      exit 0
   ;;
   status)
      # The script's exit status is whatever vmstat returns.
      vmstat 1 1 >> /var/log/stats
   ;;
esac
===========================

It implements the basics required by Cluster Services but it does not
conform to the LSB specification.

--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"