[Linux-cluster] Monitoring services/customize failure criteria

Lars Marowsky-Bree lmb at suse.de
Mon Sep 15 20:58:28 UTC 2008


On 2008-09-15T21:37:24, Jeff Stoner <jstoner at opsource.net> wrote:

> Yes, 0 is success and anything else is "not success." If the resource is
> not 100% operational (as determined by the script) then it should not
> be returning a 0 exit status. If it is, it's broken.

But "status" needs more than just a binary distinction.

> The LSB is fairly clear on this matter. It gives a table of exit status
> codes to use with the 'status' action, of which 0 is the only success
> code. All the other codes are failure codes. For all other non-status
> actions, it states "the init script shall return an exit status of zero
> if the action was successful. Otherwise, the exit status shall be
> non-zero." This goes all the way back to LSB 1.0 (published June 29,
> 2001.)

Right, but it is also more elaborate on "status". ;-)
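
For reference, the LSB table for the 'status' action can be written
down as a small lookup. This is only a sketch in Python: the helper
name describe_lsb_status is made up for illustration, and the meanings
are paraphrased from the LSB init script section rather than quoted.

```python
# Exit codes for the "status" action, paraphrased from the LSB.
LSB_STATUS = {
    0: "program is running or service is OK",
    1: "program is dead and /var/run pid file exists",
    2: "program is dead and /var/lock lock file exists",
    3: "program is not running",
    4: "program or service status is unknown",
}


def describe_lsb_status(code):
    """Hypothetical helper: turn a 'status' exit code into text."""
    if code in LSB_STATUS:
        return LSB_STATUS[code]
    if 5 <= code <= 99:
        return "reserved for future LSB use"
    if 100 <= code <= 149:
        return "reserved for distribution use"
    if 150 <= code <= 199:
        return "reserved for application use"
    if 200 <= code <= 254:
        return "reserved"
    return "outside the range covered by the LSB table"
```

Note that even this table already separates "dead, but left a pid
file behind" (1) from "not running" (3), so it is more than a yes/no
answer.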

I was trying to understand how this is handled by rgmanager.

Pacemaker needs to distinguish between "running fine", "active but
somehow failed", and "cleanly stopped". (The numeric values are not
relevant for the discussion, but they map to 0, anything else, and 7,
respectively.)
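
To make the three-way split concrete, here is a minimal sketch in
Python of how a monitor exit code could be classified. The function
name classify_monitor and the state labels are invented for this
example; only the values 0 and 7 come from the mapping above (they
correspond to OCF_SUCCESS and OCF_NOT_RUNNING in the OCF resource
agent convention).

```python
OCF_SUCCESS = 0       # running fine
OCF_NOT_RUNNING = 7   # cleanly stopped


def classify_monitor(rc):
    """Hypothetical helper: map a monitor exit code to one of three states."""
    if rc == OCF_SUCCESS:
        return "running"
    if rc == OCF_NOT_RUNNING:
        return "stopped"
    # Every other code means the resource is active in some form,
    # but not healthy.
    return "failed"
```

With only "0 or not 0", the last two branches collapse into one, and
the caller can no longer tell a clean stop from a failure.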

This is also the detail of status/monitor which implementers most
frequently get wrong. "But it's either running or not!" ... Which is
clearly not true, or at least such a binary view could not protect
against certain failure modes. (Such as a resource being active on
several nodes at once, which is likely to be _also_ failed.)
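
The multiple-active case can be sketched as follows, again in Python.
This is an illustration under assumptions, not how Pacemaker or
rgmanager implement it: the function summarize_cluster, the node names
and the state labels "running", "failed" and "stopped" are all
hypothetical.

```python
def summarize_cluster(states):
    """states maps a node name to 'running', 'failed' or 'stopped'."""
    # A failed instance still counts as active: it may hold resources.
    active = sorted(n for n, s in states.items() if s != "stopped")
    failed = sorted(n for n, s in states.items() if s == "failed")
    return {
        "active": active,
        "failed": failed,
        "multiple_active": len(active) > 1,
    }


summary = summarize_cluster(
    {"node1": "running", "node2": "failed", "node3": "stopped"}
)
```

If status could only say "running" or "not running", node2 would be
indistinguishable from node3, and the fact that the resource is
active in two places would go unnoticed.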

Hence my interest in your statement that just two states suffice; I am
just trying to understand.

(BTW, this goes back further than LSB 1.0, as that only documented
behaviour which had been common practice long before ;-)


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



