[Linux-cluster] rgmanager is jammed

Sat May 26 07:05:37 UTC 2012

On 05/25/2012 06:20 PM, Nicolas Ross wrote:
> I am in the process of upgrading one of our clusters from RHEL 6.1 to
> 6.2. It's an 8-node cluster.
> 
> I started with one node: stop all cluster resources, cman, rgmanager et
> al., run yum update, reboot, then move to the next. The first one went fine.
> 
> On the second one, rgmanager started, but doesn't seem to connect to
> the other nodes. I found this in dmesg:
> 
> INFO: task rgmanager:2901 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> rgmanager     D 0000000000000000     0  2901   2900 0x00000080
>  ffff880667299d48 0000000000000082 0000000000000000 ffff8806656aa318
>  ffff88066729c378 0000000000000001 ffff880665bb31b0 00007fffc6c6fa20
>  ffff88066635a678 ffff880667299fd8 000000000000f4e8 ffff88066635a678
> Call Trace:
>  [<ffffffff814ee6fe>] __mutex_lock_slowpath+0x13e/0x180
>  [<ffffffff814ee59b>] mutex_lock+0x2b/0x50
>  [<ffffffffa02c192c>] dlm_new_lockspace+0x3c/0xa30 [dlm]
>  [<ffffffff8115f74c>] ? __kmalloc+0x20c/0x220
>  [<ffffffffa02ca94d>] device_write+0x30d/0x7d0 [dlm]
>  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
>  [<ffffffff8120c646>] ? security_file_permission+0x16/0x20
>  [<ffffffff81176918>] vfs_write+0xb8/0x1a0
>  [<ffffffff810d4932>] ? audit_syscall_entry+0x272/0x2a0
>  [<ffffffff81177321>] sys_write+0x51/0x90
>  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> 
> Tried rebooting, but the shutdown stalled on stopping rgmanager. Fenced
> the node, same outcome.
> 
> Any hints?

This looks like a kernel dlm problem. I can see you found a workaround,
but that should not be necessary since upgrades between releases should
work.

Can you please file a ticket with GSS and escalate it? It might be a good
idea to grab sosreports before those logs are rotated away.
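
To check whether other nodes logged the same hung-task message before the
logs rotate, something along these lines might help. It is only a minimal
sketch: the function name find_hung_tasks is made up for illustration, and
the regex assumes the "blocked for more than N seconds" format quoted above.

```python
import re

# Matches the kernel hung-task line, e.g.
# "INFO: task rgmanager:2901 blocked for more than 120 seconds."
# (format assumed from the dmesg excerpt quoted in this thread)
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task_name, pid, seconds) for each hung-task line in log_text."""
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


# Example using the line from the report above.
sample = "INFO: task rgmanager:2901 blocked for more than 120 seconds.\n"
for name, pid, secs in find_hung_tasks(sample):
    print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

The text passed in would be saved dmesg output or a copy of the kernel log
from each node. The sketch only lists the affected tasks and does not look
at the call trace that follows each message.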

Thanks
Fabio