[Linux-cluster] Monitoring services/customize failure criteria

Subhendu Ghosh sghosh at redhat.com
Tue Sep 16 03:47:07 UTC 2008


Jeff Stoner wrote:
>> -----Original Message-----
>> It is also the detail of status/monitor which implementers get most
>> frequently wrong. "But it's either running or not!" ... which is
>> clearly not true, or at least such a check cannot protect against
>> certain failure modes (such as the service being active on several
>> nodes at once, which is most likely _also_ a failure).
> 
> Ok. I think I understand where the confusion lies.
> 
> LSB is strictly for init scripts.
> OCF is strictly for a cluster-managed resource.
> 
> They are similar but have significant differences. For example, LSB
> scripts are required to implement a 'status' action while OCF scripts
> are required to implement a 'monitor' action. This difference alone
> means, technically, you can't interchange LSB and OCF scripts unless
> they implement both actions (in some fashion).
> 
> I think this is the missing link in our conversation: the script
> resource type in Cluster Services is an attempt to make an LSB-compliant
> script into an OCF-compliant script. So, the /usr/share/cluster/script.sh
> expects the script you specify to behave like an LSB script, not an OCF
> script. As such, the script resource type falls back to LSB conventions
> and uses a binary approach to a resource's start/stop/status actions:
> zero for success and non-zero for any failure. Other resource types
> (file system, nfs, ip, mysql, samba, etc.) may implement full OCF RA API
> exit codes.
> 
> Does this help?
>
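
To put the status/monitor distinction concretely, here is a minimal
sketch (not the actual /usr/share/cluster/script.sh code; "mydaemon"
and its paths are just placeholders). An LSB status check collapses to
running/not-running, while an OCF monitor can also tell the cluster
*why* the resource is not healthy:

#!/bin/sh
# Sketch only; "mydaemon" is a hypothetical service.

# LSB-style status: the script resource type treats this as binary --
# exit 0 means running, any non-zero exit means failure.
lsb_status() {
    if pidof mydaemon >/dev/null 2>&1; then
        return 0    # running
    else
        return 3    # LSB: program is not running
    fi
}

# OCF-style monitor: a full resource agent distinguishes a clean stop
# from a broken installation, so the cluster can react differently.
ocf_monitor() {
    if pidof mydaemon >/dev/null 2>&1; then
        return 0    # OCF_SUCCESS
    elif [ ! -x /usr/sbin/mydaemon ]; then
        return 5    # OCF_ERR_INSTALLED: binary missing
    else
        return 7    # OCF_NOT_RUNNING: cleanly stopped
    fi
}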

Also, internally, rgmanager can recognize other non-zero OCF return codes:

http://git.fedorahosted.org/git/cluster.git?p=cluster.git&a=search&h=HEAD&st=grep&s=OCF_RA
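
For reference, these are the conventional OCF RA exit codes (the names
below follow the OCF resource agent API convention; exactly how
rgmanager maps each one internally is what the grep link above shows):

# Conventional OCF resource-agent exit codes.
OCF_SUCCESS=0            # action succeeded / resource is running
OCF_ERR_GENERIC=1        # generic or unspecified error
OCF_ERR_ARGS=2           # agent called with wrong or insufficient arguments
OCF_ERR_UNIMPLEMENTED=3  # requested action is not implemented
OCF_ERR_PERM=4           # insufficient permissions to complete the action
OCF_ERR_INSTALLED=5      # required program or component is not installed
OCF_ERR_CONFIGURED=6     # resource is not configured correctly
OCF_NOT_RUNNING=7        # resource is verifiably (cleanly) not running

A monitor that returns OCF_NOT_RUNNING for a clean stop, rather than a
bare non-zero exit, is what lets the cluster tell "stopped" apart from
"failed".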

-subhendu



