[Linux-cluster] Services getting stuck on node

Sat Sep 1 04:23:25 UTC 2012

Hi

 I just started using Red Hat Cluster two weeks ago, so I don't claim to be
an expert.

 Looking at this error, I would recommend checking
/var/log/cluster/fenced.log and running "fence_tool ls" and
"fence_tool dump" to see whether the output reports any errors.
Alternatively, if you have time to investigate, run "service rgmanager stop",
make sure rgmanager is no longer running, then start it in the foreground
with "rgmanager -f" and see what it reports when you reproduce the same
scenario.
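
 The fence checks above could be gathered in one pass with something like
the sketch below. The function name collect_fence_diag and the output file
names are made up for illustration; only "fence_tool ls", "fence_tool dump"
and /var/log/cluster/fenced.log come from the suggestions above. It only
saves the output for later reading and does not interpret it.

```shell
#!/bin/bash
# Hypothetical helper: save fence diagnostics into a directory for later review.
# Arguments (both optional, defaults are assumptions):
#   $1  output directory
#   $2  path to the fenced log
collect_fence_diag() {
    local outdir="${1:-/tmp/fence-diag}"
    local fenced_log="${2:-/var/log/cluster/fenced.log}"
    mkdir -p "$outdir" || return 1

    if command -v fence_tool >/dev/null 2>&1; then
        # Keep stderr too: an error message is exactly what we are looking for.
        fence_tool ls   >"$outdir/fence_tool_ls.txt"   2>&1 || true
        fence_tool dump >"$outdir/fence_tool_dump.txt" 2>&1 || true
    else
        echo "fence_tool not found on this host" >"$outdir/fence_tool_ls.txt"
    fi

    # Save the last part of the fenced log, if it is readable.
    if [ -r "$fenced_log" ]; then
        tail -n 200 "$fenced_log" >"$outdir/fenced_tail.txt"
    fi
    return 0
}
```

 Run as root right after a failed takeover, for example
"collect_fence_diag /root/fence-diag", and then read the saved files.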

 Other than that, your /var/log/messages and /var/log/cluster/*.log files
should tell you something about what is going on.
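
 To pick the numbered rgmanager errors (such as "#47:" and "#13:") out of
those logs, a small filter like the following might help. This is only a
sketch: the function names are hypothetical, and the pattern assumes the
lines look like the ones quoted below, with "rgmanager" followed later by
"#<number>:".

```shell
#!/bin/bash
# Hypothetical helper: print rgmanager lines that carry a numbered error,
# e.g. "#47: Failed changing service status".
rgmanager_errors() {
    # -h: do not prefix matches with file names; -E: extended regex.
    grep -hE 'rgmanager.*#[0-9]+:' "$@" 2>/dev/null
}

# Hypothetical helper: count how often each error number appears,
# most frequent first.
rgmanager_error_counts() {
    rgmanager_errors "$@" \
        | sed -E 's/.*(#[0-9]+):.*/\1/' \
        | sort | uniq -c | sort -rn
}
```

 For example: "rgmanager_error_counts /var/log/messages /var/log/cluster/*.log".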

Param

On Sat, Sep 1, 2012 at 4:03 AM, Colin Simpson <Colin.Simpson at iongeo.com> wrote:

> Hi
>
> I had a strange issue this afternoon. One of my cluster nodes died
> (possible hw fault or driver issue). But the other node failed to take over
> a number of its services (2 node cluster) after the dead node was
> successfully fenced.
>
> The clustat output indicated that the services were still on the original
> node (started), but the top lines correctly stated that the node was
> "offline". The rgmanager log says for this event:
>
> Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
> Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
> Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager Taking over service service:httpd from down member bld1uxn1i
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #47: Failed changing service status
> Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
> Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
> Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
> Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
>
> A couple of other services did successfully switch after this.
>
> I have seen this a few times (randomly) on various clusters since around
> the time of upgrading to 6.3 from 6.2 (services refusing to cleanly stop on
> a node). It's hard to reproduce, and when a node is down we usually just
> want a restart as fast as possible (which limits the time for debugging).
>
> How can I see what is causing the "#47: Failed changing service status",
> and is there any additional debugging we can turn on in rgmanager to help
> with this?
>
> Or, better still, has anyone else seen anything like this?
>
> Thanks
>
> Colin