[Linux-cluster] service stuck in "recovering", no attempt to restart

Juan Ramon Martin Blanco robejrm at gmail.com
Wed Oct 5 15:01:46 UTC 2011


On Wed, Oct 5, 2011 at 4:39 PM, Lon Hohberger <lhh at redhat.com> wrote:
> On 10/04/2011 10:23 PM, Ofer Inbar wrote:
>>
>> On a 3 node cluster running:
>>   cman-2.0.115-34.el5_5.3
>>   rgmanager-2.0.52-6.el5.centos.8
>>   openais-0.80.6-16.el5_5.9
>>
>> We have a custom resource, "dn", for which I wrote the resource agent.
>> Service has three resources: a virtual IP (using ip.sh), and two dn
>> children.
>
> You should be able to disable and then re-enable the service - that is, you
> shouldn't need to restart rgmanager to break the recovering state.
>
> There's this related bug, but it should have been fixed in 2.0.52-6:
>
>  https://bugzilla.redhat.com/show_bug.cgi?id=530409
>
I have the same problem with version 2.0.52-6 on RHEL 5; I'll try to
get a dump when it happens again (I didn't know about the USR1 signal).
# rpm -aq | grep -e rgmanager -e openais -e cman
cman-2.0.115-34.el5_5.4
rgmanager-2.0.52-6.el5_5.8
openais-0.80.6-16.el5_5.9
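
Next time I'll also try the disable/re-enable before restarting
rgmanager; if I understand it right, for a service named "dn" as in the
config quoted below, that would just be:

  clusvcadm -d dn    # disable the service (should clear the stuck state)
  clusvcadm -e dn    # enable it again

(please correct me if that is not what you meant)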

Thanks,
Juanra
>> Normally, when one of the dn instances fails its status check,
>> rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
>> then relocates to another node and starts the service there.
>
> That's what I'd expect to happen.
>
>> Several hours ago, one of the dn instances failed its status check;
>> rgmanager stopped it and marked the service "recovering", but then did
>> not seem to try to start it on any node.  It just stayed down for
>> hours until I logged in to look at it.
>>
>> Until 17:22 today, the service was running on node1.  Here's what it logged:
>>
>> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<err>  Monitoring Service
>> dn:dn_b>  Service Is Not Running
>> Oct  4 17:22:12 clustnode1 clurgmgrd[517]:<notice>  status on dn "dn_b"
>> returned 1 (generic error)
>> Oct  4 17:22:12 clustnode1 clurgmgrd[517]:<notice>  Stopping service
>> service:dn
>> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<info>  Stopping Service
>> dn:dn_b
>> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]:<notice>  Checking if stopped:
>> check_pid_file /dn/dn_b/dn_b.pid
>> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]:<info>  Stopping Service
>> dn:dn_b>  Succeed
>> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]:<info>  Stopping Service
>> dn:dn_a
>> Oct  4 17:22:15 clustnode1 clurgmgrd: [517]:<notice>  Checking if stopped:
>> check_pid_file /dn/dn_a/dn_a.pid
>> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]:<info>  Stopping Service
>> dn:dn_a>  Succeed
>> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]:<info>  Removing IPv4 address
>> 10.6.9.136/23 from eth0
>> Oct  4 17:22:27 clustnode1 clurgmgrd[517]:<notice>  Service service:dn is
>> recovering
>>
>> At around that time, node2 also logged this:
>>
>> Oct  4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete
>> comm_header_t.
>> Oct  4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete
>> comm_header_t.
>
> It may be related; I doubt it.
>
>
>> Again, this looks the same on all three nodes.
>>
>> Here's the resource section of cluster.conf (with the values of some
>> of the arguments to my custom resource modified so as not to expose
>> actual username, path, or port number):
>>
>> <rm log_level="6">
>>   <service autostart="1" name="dn" recovery="relocate">
>>     <ip address="10.6.9.136" monitor_link="1">
>>       <dn user="username" dninstall="/dn/path" name="dn_a"
>> monitoringport="portnum"/>
>>       <dn user="username" dninstall="/dn/path" name="dn_b"
>> monitoringport="portnum"/>
>>     </ip>
>>   </service>
>> </rm>
>>
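A side note, mostly for the archives: a custom rgmanager agent like the
"dn" one above is just a script that rgmanager calls with the action as
$1 and with the cluster.conf attributes exported as OCF_RESKEY_*
environment variables; the periodic status call is the one that
"returned 1 (generic error)" in the log. Roughly this shape (a made-up
sketch, not Ofer's actual agent):

#!/bin/bash
# Hypothetical sketch of a custom rgmanager resource agent.
# Attributes from cluster.conf arrive as OCF_RESKEY_<attribute> variables.
# Pid file path guessed from the "check_pid_file /dn/dn_b/dn_b.pid" log line.
PIDFILE="/dn/${OCF_RESKEY_name}/${OCF_RESKEY_name}.pid"

case "$1" in
  start)
    # start the daemon as $OCF_RESKEY_user; real logic omitted
    exit 0
    ;;
  stop)
    # stop the daemon and wait until the pid in $PIDFILE is gone
    exit 0
    ;;
  status|monitor)
    # exit 0 = running, exit 1 = the "generic error" rgmanager logged
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && exit 0
    exit 1
    ;;
  meta-data)
    # print the agent's XML metadata here
    exit 0
    ;;
  *)
    exit 1
    ;;
esac
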
>> Any ideas why it might be in this state, where everything is
>> apparently fine except that the service is stuck in "recovering",
>> rgmanager isn't trying to do anything about it, and it isn't logging
>> any complaints?
>
> The only cause for this is if we send a message and it either doesn't make
> it or we get back a weird return code -- I think rgmanager logs that case,
> though, so this could be a new issue.
>
>> Attached: strace -fp output of the clurgmgrd processes on node1 and node2
>
> The strace data is not likely to be useful, but a dump from rgmanager would be.
>  If you get into this state again, do this:
>
>   kill -USR1 `pidof -s clurgmgrd`
>
> Then look at /tmp/rgmanager-dump* (2.0.x) or /var/lib/cluster/rgmanager-dump
> (3.x.y)
>
> -- Lon
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



