[Linux-cluster] Monitoring services/customize failure criteria

Jeff Stoner jstoner at opsource.net
Tue Sep 16 21:55:25 UTC 2008


> -----Original Message-----
> > LSB is strictly for init scripts.
> > OCF is strictly for a cluster-managed resource.
> 
> That's not quite true; OCF is a backwards-compatible extension to LSB.
> ie, a script can be a fully compliant OCF _and_ LSB script at the same
> time, and behave differently depending on how it is called, and share
> code.

Yes, a script can be compliant with both standards but that can't be
assumed.

I'm using section 4 of the OCF RA API (at
http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD
- if this is not the correct document, then please point me to the
correct one) as the basis of my comparison of LSB and OCF:

"The API tries to make it possible to have RA function both as a normal
LSB init script and a cluster-aware RA, but this is not required
functionality."

I would interpret that to mean OCF RA scripts are not required to be
backwards-compatible with LSB. The statement "OCF is a
backwards-compatible extension to LSB" could be misleading; a better
statement would be "OCF allows a script to be backwards-compatible with
LSB." Each OCF script would have to be individually identified as
backwards-compatible with LSB - it can't be assumed of OCF RA scripts
as a whole. I'm taking the safe stance and treating them as different
because of the wording in the OCF RA API. If an OCF script is compliant
with LSB, then it can be used as an init script, which is better all
around, in my opinion.
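
To make that concrete, here is a rough sketch of what a dual-compliant
script could look like (illustrative only - "mydaemon", its pidfile and
metadata path are placeholders, not a real agent). The same script
behaves differently depending on whether it is called the LSB way
('status') or the OCF way ('monitor'):

=====================================
#!/bin/bash
# Hypothetical dual-mode script: answers both LSB init actions and OCF
# RA actions, returning the exit codes each spec expects.

PIDFILE=/var/run/mydaemon.pid

is_running() {
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

case "$1" in
    start)
        mydaemon && exit 0 || exit 1
        ;;
    stop)
        is_running && kill "$(cat "$PIDFILE")"
        exit 0
        ;;
    status)     # LSB table: 0 = running, 3 = not running
        is_running && exit 0 || exit 3
        ;;
    monitor)    # OCF table: 0 = OCF_SUCCESS, 7 = OCF_NOT_RUNNING
        is_running && exit 0 || exit 7
        ;;
    meta-data)  # OCF-only action
        cat /usr/share/mydaemon/metadata.xml
        exit 0
        ;;
    *)
        exit 3  # unimplemented action
        ;;
esac
=====================================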

The problem, which prompted this whole discussion in the first place,
is the exit codes for 'status' and 'monitor'. As you said, LSB uses two
different tables of codes while OCF has only one. Your original
question from Monday:

> Is it really only 0 versus non-zero for status? How does the system
> distinguish between running, failed, and cleanly stopped then? I ask
> because both LSB & OCF specify a slightly more differentiated status
> exit code.

My answer is: the wrapper script only distinguishes between zero and
non-zero exit codes from scripts. In essence, zero is success and
non-zero is failure. Since the LSB exit codes for 'status' do not map
to the exit codes for 'monitor', /usr/share/cluster/script.sh "mashes"
them into one. The code (from Red Hat 5 Update 1) would seem to confirm
this:

=====================================
# Don't need to catch return codes; this one will work.
ocf_log info "Executing ${OCF_RESKEY_file} $1"
${OCF_RESKEY_file} $1

declare -i rv=$?
if [ $rv -ne 0 ]; then
        ocf_log err "script:$OCF_RESKEY_name: $1 of $OCF_RESKEY_file failed (returned $rv)"
        return $OCF_ERR_GENERIC
fi
=====================================

All non-zero exit codes are mapped to the OCF generic error code of 1.
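
For reference, the two exit code tables the wrapper is collapsing look
roughly like this (the 'status' codes are from the LSB init script
spec, the 'monitor' codes from the OCF RA API; the side-by-side
comparison is mine):

=====================================
LSB 'status':
  0  program is running
  1  program is dead and /var/run pid file exists
  2  program is dead and /var/lock lock file exists
  3  program is not running
  4  program or service status is unknown

OCF 'monitor':
  0  OCF_SUCCESS      (resource is running)
  7  OCF_NOT_RUNNING  (resource is cleanly stopped)
  1  OCF_ERR_GENERIC  (plus the other OCF error codes for failures)
=====================================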

> An LSB-compliant script doesn't need a wrapper to be made 
> OCF-compliant.
> 
> Or, well, yes, of course the calling conventions are 
> different,

That, alone, means you do need a wrapper *if* the calling program
(rgmanager, pacemaker, whatever) does not implement LSB conventions. You
can't call an LSB script with "monitor" and expect a result. If the
resource manager knows to call a script with "status" instead of
"monitor" and interpret the exit code properly, then you have a more
versatile product and don't need a wrapper script. But then the onus is
on you to track the LSB spec along with the OCF spec and handle
differences between them in your code. Rgmanager doesn't have to do that
because it expects only OCF behavior and uses a wrapper shell script to
handle LSB scripts.
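
Put another way, a wrapper for an LSB-only script has to translate at
the boundary in both directions: action names going in and exit codes
coming out. A sketch of the action-name half (illustrative only, not
the actual script.sh code):

=====================================
# Hypothetical wrapper fragment: the resource manager asks for OCF
# actions, but a plain init script only understands LSB ones, so map
# the name before calling it.
action="$1"
case "$action" in
    monitor) action="status" ;;   # OCF 'monitor' -> LSB 'status'
    # start, stop and restart are spelled the same in both specs
esac
${OCF_RESKEY_file} "$action"
=====================================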

> I think the disconnect here is mostly a misunderstanding of the exit
> codes for LSB "status"; it is not merely a 0 or !0 
> distinction. It also
> has "3", which means "not running (unused)". A rather important
> distinction from a failure.

Agreed. Interpreting the exit code needs to be done with prior knowledge
about the state of the service. An exit code of 3 can be either "good"
or "bad" - it depends on what state the service *should* be in. If a
service should be running, a 3 is bad. If it should not be running, a 3
is good. Unfortunately, the wrapper script for rgmanager doesn't make
such distinctions.
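
If the wrapper did want to make that distinction, the exit-code half of
the translation could preserve the "not running" case instead of
folding it into a generic failure, and leave it to rgmanager to judge
the result against the state it expects the service to be in. A rough
sketch (my own, not from the shipped script.sh; it assumes a variable
for the OCF "not running" code 7 alongside the OCF_ERR_GENERIC already
used above):

=====================================
${OCF_RESKEY_file} status
declare -i rv=$?

case $rv in
    0) return 0 ;;                  # running
    3) return $OCF_NOT_RUNNING ;;   # cleanly stopped (OCF code 7) -
                                    # good or bad depending on the
                                    # expected state
    *) return $OCF_ERR_GENERIC ;;   # dead or unknown - a real failure
esac
=====================================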

--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  




