[Linux-cluster] Unkillable clurgmgrd

Mon Nov 12 22:47:18 UTC 2007

On 11/12/07, Lon Hohberger <lhh at redhat.com> wrote:
>
> On Sun, 2007-11-11 at 23:57 +0100, Jos Vos wrote:
> > Hi,
> >
> > I have a node that has an unkillable (kill -9 doesn't work) clurgmgrd
> > running.  I have fenced it now for the third time, with the same
> > result after startup...
> >
> > Stracing clustat gives:
> >
> > [...]
> > socket(PF_FILE, SOCK_STREAM, 0)         = 5
> > connect(5, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"}, 110) = -1 ENOENT (No such file or directory)
> > close(5)                                = 0
> > dup(2)                                  = 5
> > fcntl(5, F_GETFL)                       = 0x8002 (flags O_RDWR|O_LARGEFILE)
> > fstat(5, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
> > lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
> > write(5, "msg_open: No such file or direct"..., 36msg_open: No such file or directory
> > ) = 36
> > close(5)                                = 0
> > munmap(0x2aaaaaaac000, 4096)            = 0
> > [...]
> >
> > How to get this node back up again???
> >
> > This is on a RHEL 5.0 clone.
>
> If it's unkillable, it's stuck waiting on the kernel for something.
>
> echo 1 > /proc/sys/kernel/sysrq
> echo t > /proc/sysrq-trigger
>
> dmesg > foo.out
>
> reply + attach foo.out ;)

I observed a similar problem on a test cluster. It appears that clurgmgrd
deadlocks in some cases in groups.c:count_resource_groups(). It does not
happen every time, but it is reproducible. The surviving node calls
rg_lock(service:mysql) @ groups.c:101 and gets stuck. The resource manager
on the other node then waits indefinitely for the lock:

{3965} rg_lock(service:mysql) @ groups.c:101
[3460] debug: Sending service states to CTX0xa2a7fd0
no key for rg="service:mysql"
no key for rg="service:test"
[3460] debug: Sending node states to CTX0xa2a7fd0
[3460] debug: Sending service states to CTX0xa2a7fd0
no key for rg="service:mysql"
no key for rg="service:test"

To the original poster: clurgmgrd on the surviving node is "unkillable" as
well. You can try rebooting the surviving node; that releases the lock, and
the resource manager on the fenced node will be unblocked and start just
fine. Unfortunately, once you reboot the node the situation may reverse (the
resource manager will hang on the rebooted node).
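
Before rebooting, it may be worth confirming on each node that clurgmgrd
really is in uninterruptible sleep (state "D" in ps), since that is the state
in which kill -9 has no effect until the kernel call returns. A rough sketch,
assuming the procps version of ps; list_d_state is just a helper name made up
for this example, not part of rgmanager:

```shell
# Hypothetical helper: filter "PID STAT WCHAN COMMAND" lines from stdin,
# keeping the header line and every process whose STAT starts with "D"
# (uninterruptible sleep).
list_d_state() {
    awk 'NR == 1 || $2 ~ /^D/ { print }'
}

# WCHAN names the kernel function the process is sleeping in, if known.
ps -eo pid,stat,wchan:32,comm | list_d_state
```

If clurgmgrd shows up in that list on both nodes, the WCHAN column gives a
first hint of what it is waiting on; the sysrq-t dump Lon asked for has the
full stack.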