[Linux-cluster] clurgmgrd - <err> #48: Unable to obtain cluster lock: Connection timed out

Tue Jun 19 15:25:57 UTC 2007

Hi all

We just hit this problem again:

Jun 18 08:03:08 lilr623a clurgmgrd[22152]:  #48: Unable to obtain cluster lock: Connection timed out
Jun 18 08:03:35 lilr623f clurgmgrd: [21651]:  Executing /usr/local/swadmin/caa/SAP/P06WD002 status
Jun 18 08:05:29 lilr623f clurgmgrd[21651]:  #49: Failed getting status for RG P06WD002

Is there an open Bugzilla entry for this problem?

We also see that the crash may be related to the cron.daily entries.
Could one of the cron jobs be triggering this DLM bug?

Here is the crontab. cron.daily starts at 08:02, and the cluster got
stuck at 08:03. The previous failure happened at the same time of day.

root@lilr623a:/tmp# cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 8 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

root@lilr623a:/tmp# ls -l /etc/cron.daily
total 28
lrwxrwxrwx  1 root root   28 Oct  5  2006 00-logwatch -> ../log.d/scripts/logwatch.pl
-rwxr-xr-x  1 root root  418 Apr 14  2006 00-makewhatis.cron
-rwxr-xr-x  1 root root  276 Sep 28  2004 0anacron
-rwxr-xr-x  1 root root  180 Jul 13  2005 logrotate
-rwxr-xr-x  1 root root   48 Apr  9  2006 mcelog.cron
-rwxr-xr-x  1 root root 2133 Dec  1  2004 prelink
-rwxr-xr-x  1 root root  121 Aug  8  2005 slocate.cron 
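
The timing correlation described above (cron.daily at 08:02, lock error at 08:03) could be checked across older logs with a small script. This is only a sketch under stated assumptions: the function names, the 5-minute window, and the default year are made up for illustration, it expects unwrapped syslog lines with English month names, and it is not part of rgmanager or any Red Hat tool.

```python
import re
from datetime import datetime, timedelta

# Assumption: syslog lines look like the excerpts in this thread, one per line.
LOCK_ERROR = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+clurgmgrd.*"
    r"#48: Unable to obtain cluster lock"
)


def cron_daily_time(crontab_text):
    """Return (hour, minute) of the /etc/cron.daily run-parts entry, or None."""
    for line in crontab_text.splitlines():
        fields = line.split()
        if (len(fields) >= 7 and fields[-1] == "/etc/cron.daily"
                and fields[0].isdigit() and fields[1].isdigit()):
            # crontab field order is: minute hour day month weekday ...
            return int(fields[1]), int(fields[0])
    return None


def errors_near_cron(log_text, crontab_text, window_minutes=5, year=2007):
    """List (host, timestamp, delay) for #48 errors logged shortly after cron.daily."""
    start = cron_daily_time(crontab_text)
    if start is None:
        return []
    hits = []
    for line in log_text.splitlines():
        match = LOCK_ERROR.match(line)
        if not match:
            continue
        # Syslog timestamps carry no year, so one is supplied by the caller.
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        cron_start = stamp.replace(hour=start[0], minute=start[1], second=0)
        delay = stamp - cron_start
        if timedelta(0) <= delay <= timedelta(minutes=window_minutes):
            hits.append((match.group(2), stamp, delay))
    return hits
```

Feeding it the contents of /etc/crontab and /var/log/messages from each node would show whether every #48 error falls inside the window after cron.daily starts. A hit only shows that the times coincide; it does not say which of the cron.daily scripts, if any, is responsible.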

Thanks for your help

Mike

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger
Sent: Friday, 11 May 2007 22:19
To: linux clustering
Subject: Re: [Linux-cluster] clurgmgrd - <err> #48: Unable to obtain cluster lock: Connection timed out

On Mon, May 07, 2007 at 01:54:56PM -0400, rhurst at bidmc.harvard.edu wrote:
> What could cause clurgmgrd to fail like this?  If clurgmgrd has a hiccup
> like this, is it supposed to shut down its services?  Is there
> something in our implementation that could have prevented this from
> shutting down?
> 
> For unexplained reasons, we just had our CS service (WATSON) go down 
> on its own, and the syslog entry details the event as:
> 
> May  7 13:18:39 db1 clurgmgrd[17888]: <err> #48: Unable to obtain cluster lock: Connection timed out
> May  7 13:18:41 db1 kernel: dlm: Magma: reply from 2 no lock
> May  7 13:18:41 db1 kernel: dlm: reply
> May  7 13:18:41 db1 kernel: rh_cmd 5
> May  7 13:18:41 db1 kernel: rh_lkid 200242
> May  7 13:18:41 db1 kernel: lockstate 2
> May  7 13:18:41 db1 kernel: nodeid 0
> May  7 13:18:41 db1 kernel: status 0
> May  7 13:18:41 db1 kernel: lkid ee0388
> May  7 13:18:41 db1 clurgmgrd[17888]: <notice> Stopping service WATSON

This is usually a DLM bug.  Once the DLM gets into this state,
rgmanager blows up.  Which rgmanager version are you using?

(There's only one lock per service; the complexity of the service
doesn't matter...)

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster