[Linux-cluster] HA agents (cluster scripts)

Wed Sep 15 16:37:41 UTC 2010

Cleber Rosa <crosa at redhat.com> wrote:
>  If they're bash scripts, you might want to try:
> 
> #bash -x <script> <arguments>.

This won't work.  The interface between the resource manager and the
resource agent script is more than just "run the script".  It includes:

 - What environment does the script get when run by the RM?
   - Which includes a lot of parameters passed from the RM to the script

 - What metadata must the agent script provide to the RM?
   - ... and what the RM will do differently based on this metadata

 - When does the RM call the agent, and with which parameters?
   - What actions will it take in response to exit codes
   - What happens to stdout and stderr (answer below)

As a very basic hack, you can sort of simulate running a resource
agent by doing something like this in bash:

$ sudo OCF_RESKEY_name=... OCF_RESKEY_otherparam=... /usr/share/cluster/myagent status

rg_test is better, but even it won't help troubleshoot all of this.
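
If you use the environment-variable hack a lot, it can be wrapped in a
small bash function.  This is only a sketch: run_agent is a made-up
helper name, not part of RHCS, and it only sets OCF_RESKEY_* variables,
so it reproduces none of the rest of what rgmanager does.  Run it from
a root shell if the agent needs root.

```shell
#!/bin/bash
# run_agent AGENT ACTION [name=value ...]
# Hypothetical helper (not part of RHCS): exports each name=value pair
# as OCF_RESKEY_name, runs the agent with the given action, and reports
# the agent's exit code on stderr.
run_agent() {
    local agent=$1 action=$2
    shift 2
    (
        # The subshell keeps the OCF_RESKEY_* variables out of the
        # calling shell's environment.
        for pair in "$@"; do
            export "OCF_RESKEY_${pair%%=*}=${pair#*=}"
        done
        "$agent" "$action"
    )
    local rc=$?
    echo "exit code: $rc" >&2
    return $rc
}
```

For example, with the same elided values as above:

$ run_agent /usr/share/cluster/myagent status name=... otherparam=...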

P.S. In RHCS at least, a resource agent's stdout and stderr are always
sent to the bitbucket, *except* that when calling meta-data, rgmanager
will read all of the agent's stdout as the metadata.  One problem here
is that if you put ocf_log statements in your script, they *will* write
to stdout in addition to syslog; if any of your ocf_log calls are at the
top level of the script, you have to test that $1 isn't "meta-data" so
you don't write extra debugging or status output along with the XML.
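
A minimal sketch of that $1 test follows.  The ocf_log defined here is
only a stand-in that echoes to stdout, so the example runs by itself;
a real agent would use the ocf_log from the cluster's shell function
library.  log_unless_metadata is a made-up wrapper name.

```shell
#!/bin/bash
# Stand-in for the real ocf_log so this sketch is self-contained.
# It only echoes to stdout; the real one also logs to syslog.
ocf_log() {
    local level=$1
    shift
    echo "<$level> $*"
}

# Hypothetical wrapper: pass the agent's action as the first argument.
# Logging is skipped when the action is meta-data, so nothing but the
# XML reaches stdout in that case.
log_unless_metadata() {
    local action=$1
    shift
    [ "$action" = "meta-data" ] && return 0
    ocf_log "$@"
}

# Top-level use inside an agent would look like:
#   log_unless_metadata "$1" debug "myagent called with action $1"
```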

And here's another tricky and undocumented portion of the interface:
If your XML metadata doesn't validate (for example, if you write some
extra stuff to stdout when called with meta-data, such as ocf_log),
rgmanager will ignore your resource agent as invalid, and will ignore
its resources in your cluster.conf - which means that any service you
define that includes your custom resource will "successfully" start
without your custom resource, and rgmanager will treat that as okay!!

I think that behavior is stunningly awful and broken; a resource
group that includes a resource that failed to validate should fail.

At least in RHEL/CentOS 5.3, it doesn't log anything to indicate this
condition.  I've heard that in 5.5, you do get a log message when the
metadata doesn't validate.
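
Since a silent failure like this is hard to spot, it may be worth
checking an agent's meta-data output by hand before putting it in
cluster.conf.  The sketch below is an assumption-laden sanity check,
not what rgmanager itself does: check_metadata is a made-up name, and
it only checks that stdout starts with XML and, if xmllint happens to
be installed, that the output is well-formed.

```shell
#!/bin/bash
# check_metadata AGENT
# Hypothetical pre-deployment check (not rgmanager's own validation).
# Returns 0 only if the agent's meta-data stdout starts with '<' and,
# when xmllint is available, parses as well-formed XML.
check_metadata() {
    local agent=$1 out
    out=$("$agent" meta-data 2>/dev/null)
    # Anything before the first '<' (for example a stray log line)
    # means extra output got mixed in with the XML.
    case "$out" in
        '<'*) ;;
        *) echo "meta-data output does not start with XML" >&2
           return 1 ;;
    esac
    # Optional well-formedness check; skipped if xmllint is missing.
    if command -v xmllint >/dev/null 2>&1; then
        printf '%s\n' "$out" | xmllint --noout - || return 1
    fi
    return 0
}
```

For example:

$ check_metadata /usr/share/cluster/myagent && echo "metadata looks clean"
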
  -- Cos