[Linux-cluster] Services getting stuck on node

Fri Aug 31 22:33:49 UTC 2012

Hi

I had a strange issue this afternoon. One of my cluster nodes died (possibly a hardware fault or driver issue). It was successfully fenced, but the other node (this is a 2-node cluster) failed to take over a number of its services.

clustat showed the services as still on the original node (started), even though its top lines correctly reported that node as "offline". The rgmanager log for this event says:

Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager Taking over service service:httpd from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly

A couple of other services did successfully switch after this.
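
To pick the numbered messages out from among the [ip] lines, something like the following could help. It is only a minimal sketch: the function name summarize_errors and the regular expression are my own, and it assumes every line follows the "Mon DD HH:MM:SS rgmanager #NN: message" layout seen in the log above.

```python
import re
from collections import defaultdict

# Matches numbered rgmanager messages such as
# "Aug 31 17:19:49 rgmanager #47: Failed changing service status".
# The layout is assumed from the log excerpt above, not from rgmanager docs.
ERROR_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) rgmanager #(\d+): (.*)$")


def summarize_errors(lines):
    """Group numbered rgmanager messages by their #NN code."""
    by_code = defaultdict(list)
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if match:
            timestamp, code, message = match.groups()
            by_code[int(code)].append((timestamp, message))
    return dict(by_code)


# A few lines copied from the log above.
SAMPLE = [
    "Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down member bld1uxn1i",
    "Aug 31 17:19:49 rgmanager #47: Failed changing service status",
    "Aug 31 17:19:49 rgmanager [ip] Link detected on bond0",
    "Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly",
    "Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly",
]

if __name__ == "__main__":
    for code, entries in sorted(summarize_errors(SAMPLE).items()):
        for timestamp, message in entries:
            print(f"#{code} {timestamp} {message}")
```

Run against the full log file, this would list only the #47 and #13 entries with their timestamps, which makes it easier to compare the failure pattern between incidents.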

I have seen this a few times (randomly) on various clusters since around the time we upgraded from 6.2 to 6.3: services refusing to stop cleanly on a node. It's hard to reproduce, and when a service is down we usually just want it restarted as fast as possible, which limits the time available for debugging.

How can I find out what is causing the "#47: Failed changing service status" message, and is there any additional debugging we can turn on in rgmanager to help with this?

Or, better still, has anyone else seen anything like this?

Thanks

Colin
