[Linux-cluster] service stuck in "recovering", no attempt to restart
Lon Hohberger
lhh at redhat.com
Wed Oct 5 14:39:05 UTC 2011
On 10/04/2011 10:23 PM, Ofer Inbar wrote:
> On a 3 node cluster running:
> cman-2.0.115-34.el5_5.3
> rgmanager-2.0.52-6.el5.centos.8
> openais-0.80.6-16.el5_5.9
>
> We have a custom resource, "dn", for which I wrote the resource agent.
> Service has three resources: a virtual IP (using ip.sh), and two dn children.
You should be able to disable and then re-enable the service - that is,
you shouldn't need to restart rgmanager to clear the "recovering" state.
There's this related bug, but it should have been fixed in 2.0.52-6:
https://bugzilla.redhat.com/show_bug.cgi?id=530409
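With clusvcadm, using the service name from your cluster.conf below, that
cycle would look something like this (run it on any cluster member; the
exact target node is up to rgmanager's failover policy):

```shell
# Clear the stuck "recovering" state without restarting rgmanager.
# "dn" is the service name from the <service> element in cluster.conf.
clusvcadm -d dn    # disable the service cluster-wide
clusvcadm -e dn    # re-enable; rgmanager picks a node and starts it
```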
> Normally, when one of the dn instances fails its status check,
> rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
> then relocates to another node and starts the service there.
That's what I'd expect to happen.
> Several hours ago, one of the dn instances failed its status check,
> rgmanager stopped it, marked the service "recovering", but then did
> not seem to try to start it on any node. It just stayed down for
> hours until I logged in to look at it.
>
> Until 17:22 today, service was running on node1. Here's what it logged:
>
> Oct 4 17:22:12 clustnode1 clurgmgrd: [517]:<err> Monitoring Service dn:dn_b> Service Is Not Running
> Oct 4 17:22:12 clustnode1 clurgmgrd[517]:<notice> status on dn "dn_b" returned 1 (generic error)
> Oct 4 17:22:12 clustnode1 clurgmgrd[517]:<notice> Stopping service service:dn
> Oct 4 17:22:12 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_b
> Oct 4 17:22:12 clustnode1 clurgmgrd: [517]:<notice> Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid
> Oct 4 17:22:14 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_b> Succeed
> Oct 4 17:22:14 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_a
> Oct 4 17:22:15 clustnode1 clurgmgrd: [517]:<notice> Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid
> Oct 4 17:22:17 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_a> Succeed
> Oct 4 17:22:17 clustnode1 clurgmgrd: [517]:<info> Removing IPv4 address 10.6.9.136/23 from eth0
> Oct 4 17:22:27 clustnode1 clurgmgrd[517]:<notice> Service service:dn is recovering
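As an aside, the check_pid_file steps in those logs suggest a simple
pid-file liveness test. A rough sketch of what such a helper typically
does (the function body here is an assumption for illustration, not your
agent's actual code):

```shell
# Hypothetical sketch of a pid-file liveness check along the lines of
# the check_pid_file calls logged above; the real agent may differ.
check_pid_file() {
    pidfile="$1"
    [ -f "$pidfile" ] || return 1            # no pid file -> not running
    pid=$(cat "$pidfile")
    kill -0 "$pid" 2>/dev/null || return 1   # no such process -> not running
    return 0                                 # process is alive
}
```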
>
> At around that time, node2 also logged this:
>
> Oct 4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
> Oct 4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
It may be related, but I doubt it.
> Again, this looks the same on all three nodes.
>
> Here's the resource section of cluster.conf (with the values of some
> of the arguments to my custom resource modified so as not to expose
> actual username, path, or port number):
>
> <rm log_level="6">
> <service autostart="1" name="dn" recovery="relocate">
> <ip address="10.6.9.136" monitor_link="1">
> <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
> <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
> </ip>
> </service>
> </rm>
>
> Any ideas why it might be in this state, where everything is
> apparently fine except that the service is "recovering" and rgmanager
> isn't trying to do anything about it and isn't logging any complaints?
The only known cause for this is a message that either never arrives or
comes back with an unexpected return code -- I think rgmanager logs that
case, though, so this could be a new issue.
> Attached: strace -fp output of clurgmgrd processes on node1 and node2
The strace data is not likely to be useful, but a dump from rgmanager
would be. If you get into this state again, do this:
kill -USR1 `pidof -s clurgmgrd`
Then look at /tmp/rgmanager-dump* (2.0.x) or
/var/lib/cluster/rgmanager-dump (3.x.y)
-- Lon