[Linux-cluster] rgmanager pretends all is well when it has trouble w/resource script

Ofer Inbar cos at aaaaa.org
Wed Aug 25 16:27:37 UTC 2010


Summary: I introduced a problem into my custom resource script, and
  had a hard time debugging it because rgmanager gave no indication of
  any problem and claimed to have started the service when it hadn't.

[ RHCS using rgmanager-2.0.52-1.el5.centos on CentOS 5.3 ]


As part of my attempt to add logging, described in my earlier message
to this list, I added a line near the top of my custom resource agent
script:

  ocf_log info "Called $0 $*"

I didn't realize at the time that ocf_log also writes to stdout, which
meant that when my script was called for meta-data, this line added
extra output ahead of the XML and made it invalid.
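
In hindsight, the quickest way to catch this would have been to run the
agent by hand the same way rgmanager does and check that the output is
well-formed XML.  Something along these lines (assuming xmllint from
libxml2 is installed; the agent path is the one from my logs below):

  # Invoke the agent the way rgmanager does for metadata and validate
  # the output.  With the stray ocf_log line in place, the first line
  # printed is a log message rather than the XML declaration, so
  # xmllint reports a parse error instead of staying silent.
  /usr/share/cluster/dr.sh meta-data | xmllint --noout -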

Here's the service definition from my cluster.conf
(with some details removed from the <dr> elements):

<rm log_level="6">
  <service autostart="1" name="dr" recovery="relocate">
    <ip address="10.6.9.130" monitor_link="1">
      <dr name="dr_a" ... other arguments .../>
      <dr name="dr_b" ... other arguments .../>
    </ip>
  </service>
</rm>
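
(For context, when rgmanager loads that configuration it runs each
resource agent with "meta-data" and expects well-formed XML back.  The
general shape is something like the sketch below; this is trimmed down
and is not my actual dr.sh metadata, but it shows why any stray line
printed before the XML declaration breaks the parse.)

<?xml version="1.0"?>
<resource-agent name="dr" version="rgmanager 2.0">
  <parameters>
    <parameter name="name" primary="1" unique="1">
      <shortdesc lang="en">Name of this dr instance</shortdesc>
      <content type="string"/>
    </parameter>
    <!-- ... the other arguments shown above would be declared here ... -->
  </parameters>
  <actions>
    <action name="start" timeout="30"/>
    <action name="stop" timeout="30"/>
    <action name="status" interval="60" timeout="30"/>
    <action name="meta-data" timeout="5"/>
  </actions>
</resource-agent>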


Now, here's what I saw in /var/log/messages when I started rgmanager:

Aug 23 16:35:02 drnode1 kernel: dlm: Using TCP for communications
Aug 23 16:35:03 drnode1 clurgmgrd[13754]: <notice> Resource Group Manager Starting 
Aug 23 16:35:03 drnode1 clurgmgrd[13754]: <info> Loading Service Data 
Aug 23 16:35:03 drnode1 clurgmgrd: [13754]: <info> Called /usr/share/cluster/dr.sh meta-data 
Aug 23 16:35:04 drnode1 clurgmgrd[13754]: <info> Initializing Services 
Aug 23 16:35:04 drnode1 clurgmgrd[13754]: <info> Services Initialized 
Aug 23 16:35:04 drnode1 clurgmgrd[13754]: <info> State change: Local UP 
Aug 23 16:35:09 drnode1 clurgmgrd[13754]: <notice> Starting stopped service service:dr 
Aug 23 16:35:09 drnode1 clurgmgrd: [13754]: <info> Adding IPv4 address 10.6.9.130/23 to eth0 
Aug 23 16:35:10 drnode1 clurgmgrd[13754]: <notice> Service service:dr started 

But in fact, it never called the resource script with "start" and never
attempted to start dr_a or dr_b.  This is because the metadata did not
validate.  When I removed that ocf_log statement from the top level of
the script (and, later, when I instead wrapped it in an if so it only
runs when $1 isn't "meta-data"), the service began working again.
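
The working version of the guard is essentially this (a quick sketch;
the only point is to keep the meta-data output free of anything but
the XML):

  # Skip logging on the meta-data call so the XML output stays clean.
  if [ "$1" != "meta-data" ]; then
      ocf_log info "Called $0 $*"
  fi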

It was tricky to diagnose:

1. No indication that, when it called my script for meta-data, it got
   bad XML and decided the script was bad or the "dr" resource was not
   there, or something like that.

2. No indication that it could not find a valid resource agent script
   for a resource that my cluster.conf says is a part of the service.

3. It claimed the service started, even though it didn't even attempt
   to start the resource.  clustat claimed the service started, too.

4. No health checks ran, obviously, so it never noticed anything wrong.

What's going on here?  rgmanager bugs?
Is there anywhere I could look to see indications of these problems?
Shouldn't it be logging something about these problems?

And above all,
  why did it claim the service started successfully?

  -- Cos



