[Linux-cluster] Services getting stuck on node

emmanuel segura emi2fast at gmail.com
Sat Sep 1 10:04:39 UTC 2012


Hello Colin

Maybe your services aren't switching because of this:
======================================================
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop
cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
======================================================

To debug the service stop, you can use:

rg_test test /etc/cluster/cluster.conf stop service <NAME_OF_SERVICE>
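
If several services are affected, a small script can pull the names out of the rgmanager log and print one rg_test command for each. This is only a rough sketch: the function names and the SAMPLE text are made up for illustration (the log lines are copied from your mail), and it only prints the commands, it does not run anything.

```python
import re

# Matches rgmanager's "#13" message. Only the first half of the message is
# required, so a line that the mail client wrapped before "cleanly" still matches.
FAILED_STOP = re.compile(r"#13: Service service:(\S+) failed to stop")


def failed_services(log_text):
    """Return the names of services that failed to stop, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = FAILED_STOP.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def rg_test_commands(services, conf="/etc/cluster/cluster.conf"):
    """Build one 'rg_test test <conf> stop service <name>' command per service."""
    return [f"rg_test test {conf} stop service {name}" for name in services]


# Sample input: lines taken from the log in this thread.
SAMPLE = """\
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop
cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
"""

if __name__ == "__main__":
    for command in rg_test_commands(failed_services(SAMPLE)):
        print(command)
```

As far as I know, rg_test calls the resource agents directly, outside of rgmanager, so it is probably safest to run the printed commands only while the service is disabled in the cluster.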

I think it would be easier to help if you showed your cluster.conf.

Thanks :-)

2012/9/1 Colin Simpson <Colin.Simpson at iongeo.com>

> Hi
>
> I had a strange issue this afternoon. One of my cluster nodes died
> (possible hw fault or driver issue), but the other node failed to take over
> a number of its services (2 node cluster) after the dead node was
> successfully fenced.
>
> clustat indicated that the services were still on the original node
> (started), but the top lines correctly stated that the node was "offline".
> The rgmanager log says for this event:
>
> Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
> Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
> Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down
> member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager Taking over service service:httpd from down
> member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop
> cleanly
> Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
>
> A couple of other services did successfully switch after this.
>
> I have seen this a few times (randomly) on various clusters since around
> the time of upgrading to 6.3 from 6.2 (services refusing to cleanly stop on
> a node). It's hard to reproduce, and when a service is down we usually just
> want a restart as fast as possible (thereby limiting time for debugging).
>
> How can I see what is causing the "#47: Failed changing service status", or
> is there any more debugging we can turn on in rgmanager to help with this?
>
> Or, better still, has anyone else seen anything like this?
>
> Thanks
>
> Colin
>



-- 
This is my life, and I will live it for as long as God wills