[Linux-cluster] Monitoring services/customize failure criteria

Tue Sep 16 10:11:09 UTC 2008

On Tue, Sep 16, 2008 at 01:29, Jeff Stoner <jstoner at opsource.net> wrote:
>> -----Original Message-----
>> It is also the detail of status/monitor which implementers get most
>> frequently wrong. "But it's either running or not!" ... Which is
>> clearly not true, or at least such a case couldn't protect against
>> certain failure modes. (Such as multiple-active on several nodes,
>> which is likely to be _also_ failed.)
>
> Ok. I think I understand where the confusion lies.
>
> LSB is strictly for init scripts.
> OCF is strictly for a cluster-managed resource.

This is an unnecessary distinction.

A true LSB script, by which I mean one that follows _all_ the LSB
guidelines for status^, is perfectly adequate for clustering.
And an OCF script that implements sane parameter defaults can, for the
most part, be easily used as an init script for stopping and starting
services.

^ http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

[quote]
If the status action is requested, the init script will return the
following exit status codes.

0	program is running or service is OK
1	program is dead and /var/run pid file exists
2	program is dead and /var/lock lock file exists
3	program is not running
4	program or service status is unknown
5-99	reserved for future LSB use
100-149	reserved for distribution use
150-199	reserved for application use
200-254	reserved
[end quote]
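
To make the point about status concrete, here is a minimal Python
sketch of how a caller could read those codes as more than "running or
not". The function name and the state labels are hypothetical and not
part of the LSB text; in particular, treating 1 and 2 (dead with a
stale pid or lock file) as a failure rather than a clean stop is an
assumption made for illustration.

```python
# Hypothetical helper: interpret an LSB "status" exit code as a state.
# The grouping below is an assumption, not something the LSB mandates.
def classify_lsb_status(code: int) -> str:
    if code == 0:
        return "running"   # program is running or service is OK
    if code == 3:
        return "stopped"   # program is not running
    if code in (1, 2):
        return "failed"    # dead, but a stale pid or lock file remains
    return "unknown"       # 4, or any reserved value
```

A script that only ever returns 0 or 1 from status cannot express the
difference between "stopped" and "failed" in this scheme.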

>
> They are similar but have significant differences. For example, LSB
> scripts are required to implement a 'status' action while OCF scripts
> are required to implement a 'monitor' action.

Btw, Lars was one of the primary authors of the OCF spec... he's
pretty familiar with it and how it differs from LSB ;-)

> This difference alone
> means, technically, you can't interchange LSB and OCF scripts unless
> they implement both (in some fashion.)
>
> I think this is the missing link in our conversation: the script
> resource type in Cluster Services is an attempt to make an LSB-compliant
> script into an OCF-compliant script. So, the /usr/share/cluster/script.sh
> expects the script you specify to behave like an LSB script, not an OCF
> script. As such, the script resource type falls back to LSB conventions
> and uses a binary approach to a resource's start/stop/status actions:
> zero for success and non-zero for any failure. Other resource types
> (file system, nfs, ip, mysql, samba, etc.) may implement full OCF RA API
> exit codes.
>
> Does this help?

I'm guessing this was what Lars was asking about.

In case it's of interest, our equivalent is the LRMd, which understands
both standards (as well as the old Heartbeat-style scripts) and hides
the ways in which they differ from OCF.  So any cluster manager using
it can treat everything as an OCF resource.

Part of our resource definition is a "class" field, which the admin
uses to tell the LRMd which standard a given resource follows. The LRMd
then automagically maps everything from and to OCF as required (e.g. by
calling status instead of monitor for LSB and remapping the return
codes).
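
As a rough sketch of that kind of mapping (not the LRMd's actual code),
assume a wrapper that picks the action name by class and translates LSB
status codes into OCF ones. The OCF values 0 (OCF_SUCCESS),
1 (OCF_ERR_GENERIC) and 7 (OCF_NOT_RUNNING) come from the OCF RA API;
the function names, the class strings and the translation table itself
are hypothetical.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

def action_for(resource_class: str, ocf_action: str) -> str:
    """Return the action name to pass to the script for this class."""
    if resource_class == "lsb" and ocf_action == "monitor":
        return "status"  # LSB scripts implement status, not monitor
    return ocf_action

def to_ocf_rc(resource_class: str, ocf_action: str, rc: int) -> int:
    """Translate a script's exit code into an OCF return code."""
    if resource_class != "lsb" or ocf_action != "monitor":
        return rc  # assumption: other codes are passed through as-is
    if rc == 0:
        return OCF_SUCCESS       # LSB status 0: running
    if rc == 3:
        return OCF_NOT_RUNNING   # LSB status 3: not running
    return OCF_ERR_GENERIC       # assumption: anything else is a failure
```

With something like this in place, the cluster manager only ever asks
for "monitor" and only ever sees OCF return codes, whatever the class
of the underlying script.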