[Linux-cluster] Services getting stuck on node

Sat Sep 1 12:56:47 UTC 2012

Thanks for getting back.

I'll try the debug shutdown with that command.

Though I think it is far from clear what "failed to stop cleanly" means here. The node the service was running on had gone (it was fenced), so there was nothing to stop before starting it on this node.

Thanks

Colin
________________________________
From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com] on behalf of emmanuel segura [emi2fast at gmail.com]
Sent: 01 September 2012 11:04
To: linux clustering
Subject: Re: [Linux-cluster] Services getting stuck on node

Hello Colin

Maybe your services didn't switch over because of this:
======================================================
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
======================================================

To debug the service stop, you can use rg_test test /etc/cluster/cluster.conf stop service <NAME_OF_SERVICE>
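As a rough sketch, the command above could be generated for each of the two services named in the log. This only prints the command lines and does not run them. The service names are assumed to be the part after "service:" in the log messages, and the helper name is made up for illustration.

```python
import shlex

# Path taken from the rg_test command quoted in this thread.
CLUSTER_CONF = "/etc/cluster/cluster.conf"


def rg_test_stop_command(service_name, conf=CLUSTER_CONF):
    """Return the rg_test argument list for a stop of one service.

    Hypothetical helper: it only assembles the command shown in the
    thread and does not execute anything.
    """
    return ["rg_test", "test", conf, "stop", "service", service_name]


if __name__ == "__main__":
    # Assumption: the names are the part after "service:" in the log.
    for name in ("nfsdprj", "httpd"):
        print(shlex.join(rg_test_stop_command(name)))
```

Printing first and running by hand seems safer, since a stop test on a live node would presumably stop resources for real.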

I think it would be easier to help if you showed your cluster.conf

Thanks :-)

2012/9/1 Colin Simpson <Colin.Simpson at iongeo.com>
Hi

I had a strange issue this afternoon. One of my cluster nodes died (possibly a hardware fault or driver issue), but the other node failed to take over a number of its services (2-node cluster), even though the dead node was successfully fenced.

clustat indicated that the services were still on the original node (started), but the top lines correctly stated that the node was "offline". The rgmanager log for this event says:

Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager Taking over service service:httpd from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
A couple of other services did successfully switch after this.
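
The numbered rgmanager messages (#13, #47) in a log like the one above can be pulled out with a short script, which may help when comparing incidents across clusters. This is only a sketch: the line format is assumed from the excerpt in this mail, and the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Format assumed from the excerpt in this thread:
# "<Mon> <day> <HH:MM:SS> rgmanager #<code>: <message>"
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+rgmanager\s+"
    r"#(?P<code>\d+):\s+(?P<msg>.*)$"
)


def numbered_events(lines):
    """Group numbered rgmanager messages by their numeric code."""
    events = defaultdict(list)
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            code = int(match.group("code"))
            events[code].append((match.group("ts"), match.group("msg")))
    return dict(events)


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down member bld1uxn1i
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
""".splitlines()

if __name__ == "__main__":
    for code, hits in sorted(numbered_events(SAMPLE).items()):
        for ts, msg in hits:
            print(f"#{code} {ts} {msg}")
```

To use it on a real log, the lines of the rgmanager log file would be passed in place of SAMPLE.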

I have seen this a few times (randomly) on various clusters since around the time of upgrading from 6.2 to 6.3 (services refusing to stop cleanly on a node). It's hard to reproduce, and when services are down we usually just want a restart as fast as possible, which limits the time for debugging.

How can I see what is causing the "#47: Failed changing service status" error, or is there any more debugging we can turn on in rgmanager to help with this?

Or better still, has anyone else seen anything like this?

Thanks

Colin

________________________________

This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
this is my life and I live it for as long as God wills

