[Linux-cluster] rgmanager running, but cluster acts as if it's not

Ofer Inbar cos at aaaaa.org
Thu Aug 18 15:38:14 UTC 2011


> 3-node cluster.  rgmanager is running on all three nodes, but service
> won't relocate over to node 3.  clustat doesn't see rgmanager on it.
> Run from nodes 1 and 2, clustat shows all three nodes Online but only
> nodes 1 and 2 have rgmanager.  Run from node 3, clustat shows all
> three Online and no rgmanager.  This is what I'd see if rgmanager were
> not running on node3 at all.  And yet:
[...]

After I sent that email - and about an hour after the problem first
began - node2 spontaneously switched to showing rgmanager="0" in its
clustat -x output, even though node2 was where the service was running.

After rebooting node3 another time, its clurgmgrd was no longer in the
SIGCHLD loop I showed before.  Instead, it was blocked on write(7, ...).
According to lsof, file descriptor 7 was /dev/misc/dlm-control.
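(For anyone wanting to check the same thing on their own cluster,
something along these lines should show it; the PID is a placeholder,
not the one from my node:)

node3 $ sudo strace -p <clurgmgrd pid>    # shows the syscall it is blocked in, here write(7, ...
node3 $ sudo lsof -p <clurgmgrd pid>      # the FD 7 line pointed at /dev/misc/dlm-control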


On #linux-cluster IRC, lon asked what group_tool ls showed...

node1 $ sudo group_tool ls
type             level name       id       state
fence            0     default    00010001 JOIN_STOP_WAIT
[1 2 3 3]
dlm              1     rgmanager  00030001 JOIN_ALL_STOPPED
[1 2 3]

node2 $ sudo group_tool ls
type             level name       id       state
fence            0     default    00010001 JOIN_STOP_WAIT
[1 2 3 3]
dlm              1     rgmanager  00030001 JOIN_ALL_STOPPED
[1 2 3]

node3 $ sudo group_tool ls
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2 3]
dlm              1     rgmanager  00000000 JOIN_STOP_WAIT
[1 2 3]

He also asked me to send SIGUSR1 to clurgmgrd and get the contents of
/tmp/rgmanager-dump*, but clurgmgrd did not respond to SIGUSR1 and I
got no dump files.
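(For reference, the sequence he suggested amounts to something like
the following; as I understand it, SIGUSR1 normally makes rgmanager
write a state dump under /tmp:)

node3 $ sudo kill -USR1 $(pidof clurgmgrd)
node3 $ ls /tmp/rgmanager-dump*           # in my case nothing ever appeared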

Also, I updated cluster.conf to raise <rm log_level="6"> to log_level="7".
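(For completeness, the change itself is just this one attribute on the
<rm> element in /etc/cluster/cluster.conf, with config_version bumped
as usual; the rest of the section is unchanged:)

  <rm log_level="7">
     ...
  </rm>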

I started seeing this in /var/log/messages on node3:

Aug 18 10:00:07 node3 rgmanager: [8121]: <notice> Shutting down Cluster Service Manager... 
Aug 18 10:13:31 node3 kernel: dlm: Using TCP for communications
Aug 18 10:13:31 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:05 node3 kernel: dlm: rgmanager: group join failed -512 0
Aug 18 10:14:05 node3 kernel: dlm: Using TCP for communications
Aug 18 10:14:05 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:33 node3 kernel: dlm: rgmanager: group join failed -512 0
Aug 18 10:14:36 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:36 node3 kernel: dlm: Using TCP for communications
Aug 18 10:26:15 node3 rgmanager: [22290]: <notice> Shutting down Cluster Service Manager... 
Aug 18 10:34:48 node3 kernel: dlm: rgmanager: group join failed -512 0

... and this in /var/log/messages on node1:

Aug 18 10:37:48 node1 kernel: INFO: task clurgmgrd:32606 blocked for more than 120 seconds.
Aug 18 10:37:48 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 10:37:48 node1 kernel: clurgmgrd     D ffff81016ae9abc0     0 32606  32605           633       (NOTLB)
Aug 18 10:37:48 node1 kernel:  ffff810169641de8 0000000000000086 ffff810169641d28 ffff810169641d28
Aug 18 10:37:48 node1 kernel:  0000000000000246 0000000000000008 ffff81006efad820 ffff810168493080
Aug 18 10:37:48 node1 kernel:  0003f21de24fde7f 000000000000f650 ffff81006efada08 000000007eea8300
Aug 18 10:37:48 node1 kernel: Call Trace:
Aug 18 10:37:48 node1 kernel:  [<ffffffff8002cd2c>] mntput_no_expire+0x19/0x89
Aug 18 10:37:48 node1 kernel:  [<ffffffff8000ea75>] link_path_walk+0xa6/0xb2
Aug 18 10:37:48 node1 kernel:  [<ffffffff800656ac>] __down_read+0x7a/0x92
Aug 18 10:37:48 node1 kernel:  [<ffffffff88473380>] :dlm:dlm_clear_proc_locks+0x20/0x1d2
Aug 18 10:37:48 node1 kernel:  [<ffffffff8001adcf>] cp_new_stat+0xe5/0xfd
Aug 18 10:37:48 node1 kernel:  [<ffffffff8847b0a9>] :dlm:device_close+0x55/0x99
Aug 18 10:37:48 node1 kernel:  [<ffffffff80012ac5>] __fput+0xd3/0x1bd
Aug 18 10:37:48 node1 kernel:  [<ffffffff80023bd1>] filp_close+0x5c/0x64
Aug 18 10:37:48 node1 kernel:  [<ffffffff8001dff3>] sys_close+0x88/0xbd
Aug 18 10:37:48 node1 kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Aug 18 10:37:48 node1 kernel:


Finally, I rebooted all three cluster nodes at the same time.
After I did that, everything came back up in a good state.
I'm sending this follow-up in the hope that someone can use this data
to determine what the bug was.  If you do, please reply.  Thanks!
  -- Cos



