[Linux-cluster] Monitoring services/customize failure criteria

Tue Sep 16 13:12:26 UTC 2008

On 2008-09-16T00:29:49, Jeff Stoner <jstoner at opsource.net> wrote:

> LSB is strictly for init scripts.
> OCF is strictly for a cluster-managed resource.

That's not quite true; OCF is a backwards-compatible extension to LSB.
That is, a script can be fully OCF-compliant _and_ LSB-compliant at the
same time, behave differently depending on how it is called, and share
code between both modes.

(An LSB script basically is _identical_ to an OCF one, except that it
assumes a default for the instance being managed.)

> They are similar but have significant differences. For example, LSB
> scripts are required to implement a 'status' action while OCF scripts
> are required to implement a 'monitor' action. This difference alone
> means, technically, you can't interchange LSB and OCF scripts unless
> they implement both (in some fashion.)

Exactly; the point is that they _can_ implement both. "monitor" really
is simply a cleaner version of "status" (because status is the only op
in the LSB list which has deviating exit codes).

(Detecting whether they are being called as LSB or as OCF is easily
possible by evaluating the environment.)
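
A minimal sketch of such a check, assuming the cluster manager exports
OCF_ROOT or OCF_RESOURCE_INSTANCE (both named in the OCF resource agent
API) and that a plain init invocation sets neither. The function names
are hypothetical, and not every cluster manager necessarily sets both
variables:

```shell
# Hypothetical helper: guess the calling convention from the environment.
# Assumption: an OCF caller exports OCF_ROOT and/or OCF_RESOURCE_INSTANCE;
# a normal init-style (LSB) call exports neither.
calling_convention() {
    if [ -n "${OCF_ROOT:-}" ] || [ -n "${OCF_RESOURCE_INSTANCE:-}" ]; then
        echo ocf
    else
        echo lsb
    fi
}

# Hypothetical helper: name of the health-check action for this convention.
probe_action() {
    case "$(calling_convention)" in
        ocf) echo monitor ;;   # OCF callers ask for "monitor"
        *)   echo status ;;    # LSB callers ask for "status"
    esac
}
```

A script shared between both worlds could branch on this once, then run
the same start/stop code either way.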

If I were to go back, I probably would not redefine "monitor". It seemed
like a good idea at the time - and it perhaps still emphasizes that we
care about correctness a bit more here.

(A _truly_ correct LSB script would be sufficient, but more often than
not, people get the LSB exit codes wrong because normal system start-up
doesn't care. That doesn't mean their scripts aren't buggy or
non-compliant, though - Debian got this wrong very often in the past.)

> I think this is the missing link in our conversation: the script
> resource type in Cluster Services is an attempt to make an LSB-compliant
> script into an OCF-compliant script.

An LSB-compliant script doesn't need a wrapper to be made OCF-compliant.

Or, well, yes, of course the calling conventions are different, but LSB
would already provide everything the cluster needs, and merely putting a
wrapper around it doesn't give the LSB script any more features.

> So, the /usr/share/cluster/script.sh expects the script you specify to
> behave like an LSB script, not an OCF script. As such, the script
> resource type falls back to LSB conventions and uses a binary approach
> to a resource's start/stop/status actions: zero for success and
> non-zero for any failure.

I think the disconnect here is mostly a misunderstanding of the exit
codes for LSB "status"; it is not merely a 0 or !0 distinction. It also
has "3", which means "not running (unused)". A rather important
distinction from a failure.
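
To make the distinction concrete, here is a rough sketch of how a
wrapper might translate an LSB "status" exit code into an OCF "monitor"
result instead of collapsing it to 0 or non-zero. The function name is
hypothetical and the mapping is one possible choice, not what script.sh
or any particular cluster manager does. The LSB status codes are
0 (running), 1 and 2 (dead, but pid or lock file exists), 3 (not
running) and 4 (unknown); the OCF codes used are 0 (OCF_SUCCESS),
1 (OCF_ERR_GENERIC) and 7 (OCF_NOT_RUNNING).

```shell
# Hypothetical helper: map an LSB "status" exit code ($1) to an OCF
# "monitor" exit code, printed on stdout.
# Assumption: anything other than "running" or "cleanly not running"
# is reported as a generic error.
lsb_status_to_ocf_monitor() {
    case "$1" in
        0) echo 0 ;;   # running                -> OCF_SUCCESS
        3) echo 7 ;;   # not running (unused)   -> OCF_NOT_RUNNING
        *) echo 1 ;;   # dead/unknown (1, 2, 4) -> OCF_ERR_GENERIC
    esac
}
```

With a mapping like this, a cleanly stopped service (3) no longer looks
the same to the cluster as a crashed one (1 or 2).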

Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde