[Linux-cluster] service stuck in "recovering", no attempt to restart

Lon Hohberger lhh at redhat.com
Wed Oct 5 14:39:05 UTC 2011


On 10/04/2011 10:23 PM, Ofer Inbar wrote:
> On a 3 node cluster running:
>    cman-2.0.115-34.el5_5.3
>    rgmanager-2.0.52-6.el5.centos.8
>    openais-0.80.6-16.el5_5.9
>
> We have a custom resource, "dn", for which I wrote the resource agent.
> The service has three resources: a virtual IP (using ip.sh) and two dn children.

You should be able to disable and then re-enable the service - that is, 
you shouldn't need to restart rgmanager to break the "recovering" state.
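For example, assuming the service name "dn" from your cluster.conf and 
the member names from your logs (run from any cluster node):

    clusvcadm -d dn               # disable: drops it to the 'disabled' state
    clusvcadm -e dn               # enable: starts it on an available node
    clusvcadm -e dn -m clustnode2 # or enable it on a specific member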

There's this related bug, but it should have been fixed in 2.0.52-6:

   https://bugzilla.redhat.com/show_bug.cgi?id=530409

> Normally, when one of the dn instances fails its status check,
> rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
> then relocates to another node and starts the service there.

That's what I'd expect to happen.
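
You can watch that happen with clustat; for example (output details 
vary a bit by release):

    clustat -s dn    # show just service:dn and which node owns it
    clustat -i 2     # refresh the full cluster status every 2 seconds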

> Several hours ago, one of the dn instances failed its status check,
> rgmanager stopped it, marked the service "recovering", but then did
> not seem to try to start it on any node.  It just stayed down for
> hours until I logged in to look at it.
>
> Until 17:22 today, the service was running on node1.  Here's what it logged:
>
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<err>  Monitoring Service dn:dn_b>  Service Is Not Running
> Oct  4 17:22:12 clustnode1 clurgmgrd[517]:<notice>  status on dn "dn_b" returned 1 (generic error)
> Oct  4 17:22:12 clustnode1 clurgmgrd[517]:<notice>  Stopping service service:dn
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_b
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<notice>  Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid
> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_b>  Succeed
> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_a
> Oct  4 17:22:15 clustnode1 clurgmgrd: [517]:<notice>  Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid
> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]:<info>  Stopping Service dn:dn_a>  Succeed
> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]:<info>  Removing IPv4 address 10.6.9.136/23 from eth0
> Oct  4 17:22:27 clustnode1 clurgmgrd[517]:<notice>  Service service:dn is recovering
>
> At around that time, node2 also logged this:
>
> Oct  4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
> Oct  4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.

It may be related, but I doubt it.


> Again, this looks the same on all three nodes.
>
> Here's the resource section of cluster.conf (with the values of some
> of the arguments to my custom resource modified so as not to expose
> actual username, path, or port number):
>
> <rm log_level="6">
>    <service autostart="1" name="dn" recovery="relocate">
>      <ip address="10.6.9.136" monitor_link="1">
>        <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
>        <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
>      </ip>
>    </service>
> </rm>
>
> Any ideas why it might be in this state, where everything is
> apparently fine except that the service is "recovering" and rgmanager
> isn't trying to do anything about it and isn't logging any complaints?

The only cause for this that I know of is if we send a message and it 
either doesn't make it or comes back with an unexpected return code -- 
I think rgmanager logs that, though, so this could be a new issue.
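
If it did log something, it'll be mixed in with the clurgmgrd messages 
in syslog; something like this should turn it up (path assumes the 
default syslog configuration):

    grep clurgmgrd /var/log/messages | grep -iE 'error|fail'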

> Attached: strace -fp output of clurgmgrd processes on node1 and node2

The strace data is not likely to be useful, but a dump from rgmanager 
would be.  If you get into this state again, do this:

    kill -USR1 `pidof -s clurgmgrd`

Then look at /tmp/rgmanager-dump* (2.0.x) or 
/var/lib/cluster/rgmanager-dump (3.x.y).
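
So on 2.0.x the whole sequence is roughly:

    kill -USR1 `pidof -s clurgmgrd`
    less /tmp/rgmanager-dump*    # rgmanager's internal state dump (service states, etc.)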

-- Lon



