[Linux-cluster] rgmanager running, but cluster acts as if it's not
Ofer Inbar
cos at aaaaa.org
Thu Aug 18 15:38:14 UTC 2011
> 3-node cluster. rgmanager is running on all three nodes, but service
> won't relocate over to node 3. clustat doesn't see rgmanager on it.
> Run from nodes 1 and 2, clustat shows all three nodes Online but only
> nodes 1 and 2 have rgmanager. Run from node 3, clustat shows all
> three Online and no rgmanager. This is what I'd see if rgmanager were
> not running on node3 at all. And yet:
[...]
After I sent that email - and about an hour after the problem first
began - node2 spontaneously switched to showing rgmanager="0" in its
clustat -x output, even though node2 was the node the service was running on.
After rebooting node3 another time, its clurgmgrd was no longer in the
SIGCHLD loop I showed before. Instead, it was blocked on write(7, ...
According to lsof, file descriptor 7 was /dev/misc/dlm-control.
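The check that identified the blocked file can be sketched like this (the PID and fd below are stand-ins; on node3 they were clurgmgrd's PID and fd 7, which resolved to /dev/misc/dlm-control):

```shell
# Resolve a given file descriptor of a given PID to the file behind it.
pid=$$    # stand-in for clurgmgrd's PID
fd=1      # stand-in for the blocked fd 7
readlink "/proc/$pid/fd/$fd"    # /proc exposes each open fd as a symlink
# lsof shows the same mapping (FD column, NAME column), if installed:
command -v lsof >/dev/null && \
  lsof -p "$pid" 2>/dev/null | awk -v fd="$fd" '$4 ~ ("^" fd) {print $4, $NF}' || true
```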
On #linux-cluster IRC, lon asked what group_tool ls showed...
node1$ sudo group_tool ls
type level name id state
fence 0 default 00010001 JOIN_STOP_WAIT
[1 2 3 3]
dlm 1 rgmanager 00030001 JOIN_ALL_STOPPED
[1 2 3]
node2$ sudo group_tool ls
type level name id state
fence 0 default 00010001 JOIN_STOP_WAIT
[1 2 3 3]
dlm 1 rgmanager 00030001 JOIN_ALL_STOPPED
[1 2 3]
node3$ sudo group_tool ls
type level name id state
fence 0 default 00000000 JOIN_STOP_WAIT
[1 2 3]
dlm 1 rgmanager 00000000 JOIN_STOP_WAIT
[1 2 3]
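For anyone scanning similar output: in a settled cluster every group's state column reads "none", so the states above show all groups stuck mid-transition. A small sketch of that check, run against the captured output:

```shell
# Flag any group whose state column is not "none" (the quiescent state
# for groupd-managed groups). Header and member-list lines are skipped.
scan_groups() {
  awk 'NF==5 && $1 != "type" && $5 != "none" {print $1, $3, "stuck in", $5}'
}
scan_groups <<'EOF'
type level name id state
fence 0 default 00010001 JOIN_STOP_WAIT
[1 2 3 3]
dlm 1 rgmanager 00030001 JOIN_ALL_STOPPED
[1 2 3]
EOF
```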
He also asked me to send SIGUSR1 to clurgmgrd and get the contents of
/tmp/rgmanager-dump*, but clurgmgrd did not respond to SIGUSR1 and I
got no dump files.
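What was attempted looked roughly like this (a sketch; a healthy clurgmgrd writes a state dump under /tmp/rgmanager-dump* on SIGUSR1, but here it never responded):

```shell
# Ask clurgmgrd to dump internal state, then look for the dump file.
pid=$(pidof clurgmgrd 2>/dev/null)
if [ -n "$pid" ]; then
  kill -USR1 "$pid"   # request a state dump
  sleep 2             # give it a moment to write the file
fi
ls /tmp/rgmanager-dump* 2>/dev/null || echo "no dump produced"
```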
Also, I updated cluster.conf to raise the <rm> log_level from "6" to "7".
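For reference, that was a one-attribute edit on the resource-manager element (fragment only; failover domains, resources, and services inside <rm>, and the surrounding <cluster> element, are elided):

```xml
<!-- cluster.conf fragment: log_level raised from "6" to "7" -->
<rm log_level="7">
  ...
</rm>
```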
I started seeing this in /var/log/messages on node3:
Aug 18 10:00:07 node3 rgmanager: [8121]: <notice> Shutting down Cluster Service Manager...
Aug 18 10:13:31 node3 kernel: dlm: Using TCP for communications
Aug 18 10:13:31 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:05 node3 kernel: dlm: rgmanager: group join failed -512 0
Aug 18 10:14:05 node3 kernel: dlm: Using TCP for communications
Aug 18 10:14:05 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:33 node3 kernel: dlm: rgmanager: group join failed -512 0
Aug 18 10:14:36 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:36 node3 kernel: dlm: Using TCP for communications
Aug 18 10:26:15 node3 rgmanager: [22290]: <notice> Shutting down Cluster Service Manager...
Aug 18 10:34:48 node3 kernel: dlm: rgmanager: group join failed -512 0
... and this in /var/log/messages on node1:
Aug 18 10:37:48 node1 kernel: INFO: task clurgmgrd:32606 blocked for more than 120 seconds.
Aug 18 10:37:48 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 10:37:48 node1 kernel: clurgmgrd D ffff81016ae9abc0 0 32606 32605 633 (NOTLB)
Aug 18 10:37:48 node1 kernel: ffff810169641de8 0000000000000086 ffff810169641d28 ffff810169641d28
Aug 18 10:37:48 node1 kernel: 0000000000000246 0000000000000008 ffff81006efad820 ffff810168493080
Aug 18 10:37:48 node1 kernel: 0003f21de24fde7f 000000000000f650 ffff81006efada08 000000007eea8300
Aug 18 10:37:48 node1 kernel: Call Trace:
Aug 18 10:37:48 node1 kernel: [<ffffffff8002cd2c>] mntput_no_expire+0x19/0x89
Aug 18 10:37:48 node1 kernel: [<ffffffff8000ea75>] link_path_walk+0xa6/0xb2
Aug 18 10:37:48 node1 kernel: [<ffffffff800656ac>] __down_read+0x7a/0x92
Aug 18 10:37:48 node1 kernel: [<ffffffff88473380>] :dlm:dlm_clear_proc_locks+0x20/0x1d2
Aug 18 10:37:48 node1 kernel: [<ffffffff8001adcf>] cp_new_stat+0xe5/0xfd
Aug 18 10:37:48 node1 kernel: [<ffffffff8847b0a9>] :dlm:device_close+0x55/0x99
Aug 18 10:37:48 node1 kernel: [<ffffffff80012ac5>] __fput+0xd3/0x1bd
Aug 18 10:37:48 node1 kernel: [<ffffffff80023bd1>] filp_close+0x5c/0x64
Aug 18 10:37:48 node1 kernel: [<ffffffff8001dff3>] sys_close+0x88/0xbd
Aug 18 10:37:48 node1 kernel: [<ffffffff8005e116>] system_call+0x7e/0x83
Aug 18 10:37:48 node1 kernel:
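A hedged sketch of how a D-state (uninterruptible) task like the clurgmgrd above can be inspected while it is still hung; PID 32606 is taken from the trace, and /proc/<pid>/stack requires root and is only present on newer kernels than the one in this report:

```shell
pid=32606   # clurgmgrd's PID, from the hung-task report above
# Show process state (D = uninterruptible sleep) and the kernel symbol
# it is waiting in:
ps -o pid,stat,wchan:25,cmd -p "$pid" 2>/dev/null || echo "process $pid gone"
# Full kernel stack of the task, where the kernel supports it:
cat "/proc/$pid/stack" 2>/dev/null || echo "no stack (need root, or task exited)"
```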
Finally, I rebooted all three cluster nodes at the same time.
After I did that, everything came back up in a good state.
I'm sending this followup in the hopes that someone can use this data
to determine what the bug was. If you do, please reply. Thanks!
-- Cos