[Linux-cluster] DLM nodes disconnected issue

Bjoern Teipel bjoern.teipel at internetbrands.com
Mon Apr 7 07:26:43 UTC 2014


Hi all,

I ran "dlm_tool leave clvmd" on one node (node06) of a CMAN cluster running CLVMD.
Now clvmd is stuck and all nodes have lost their
connections to DLM.
For some reason DLM wants to fence member 8, I guess, and that might
be blocking the whole DLM?
All other stacks (cman, corosync) look fine...

Thanks,
Bjoern

Error:

dlm: closing connection to node 2
dlm: closing connection to node 3
dlm: closing connection to node 4
dlm: closing connection to node 5
dlm: closing connection to node 6
dlm: closing connection to node 8
dlm: closing connection to node 9
dlm: closing connection to node 10
dlm: closing connection to node 2
dlm: closing connection to node 3
dlm: closing connection to node 4
dlm: closing connection to node 5
dlm: closing connection to node 6
dlm: closing connection to node 8
dlm: closing connection to node 9
dlm: closing connection to node 10
INFO: task dlm_tool:33699 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dlm_tool      D 0000000000000003     0 33699  33698 0x00000080
 ffff88138905dcc0 0000000000000082 ffffffff81168043 ffff88138905dd18
 ffff88138905dd08 ffff88305b30ccc0 ffff88304fa5c800 ffff883058e49900
 ffff881857329058 ffff88138905dfd8 000000000000fb88 ffff881857329058
Call Trace:
 [<ffffffff81168043>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
 [<ffffffff8132f79a>] ? misc_open+0x1ca/0x320
 [<ffffffff81510725>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff81185505>] ? chrdev_open+0x125/0x230
 [<ffffffff815108b6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8117e5ff>] ? __dentry_open+0x23f/0x360
 [<ffffffff81283894>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8150fdb4>] ? down_read+0x24/0x30
 [<ffffffffa06d948d>] dlm_clear_proc_locks+0x3d/0x2a0 [dlm]
 [<ffffffff811dfed6>] ? generic_acl_chmod+0x46/0xd0
 [<ffffffffa06e4b36>] device_close+0x66/0xc0 [dlm]
 [<ffffffff81182b45>] __fput+0xf5/0x210
 [<ffffffff81182c85>] fput+0x25/0x30
 [<ffffffff8117e0dd>] filp_close+0x5d/0x90
 [<ffffffff8117e1b5>] sys_close+0xa5/0x100
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b



Status:

cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  18908   2014-03-24 19:01:00  node01
   2   M  18972   2014-04-06 22:47:57  node02
   3   M  18972   2014-04-06 22:47:57  node03
   4   M  18972   2014-04-06 22:47:57  node04
   5   M  18972   2014-04-06 22:47:57  node05
   6   X  18960                        node06
   7   X  18928                        node07
   8   M  18972   2014-04-06 22:47:57  node08
   9   M  18972   2014-04-06 22:47:57  node09
  10   M  18972   2014-04-06 22:47:57  node10
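
To pick out the nodes that cman does not currently count as members, the table above can be filtered with a few lines of Python. This is only a minimal sketch: the helper name non_members is made up for illustration, and it assumes the column layout shown above (numeric node ID first, status second, node name last).

```python
# Output of "cman_tool nodes", copied from the listing above.
CMAN_NODES = """\
Node  Sts   Inc   Joined               Name
   1   M  18908   2014-03-24 19:01:00  node01
   2   M  18972   2014-04-06 22:47:57  node02
   3   M  18972   2014-04-06 22:47:57  node03
   4   M  18972   2014-04-06 22:47:57  node04
   5   M  18972   2014-04-06 22:47:57  node05
   6   X  18960                        node06
   7   X  18928                        node07
   8   M  18972   2014-04-06 22:47:57  node08
   9   M  18972   2014-04-06 22:47:57  node09
  10   M  18972   2014-04-06 22:47:57  node10
"""


def non_members(text):
    """Return the names of nodes whose Sts column is not 'M' (hypothetical helper)."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric node ID; this skips the header row.
        if len(fields) >= 4 and fields[0].isdigit() and fields[1] != "M":
            names.append(fields[-1])
    return names


print(non_members(CMAN_NODES))
```

With the listing above this reports node06 and node07, the two nodes shown with status X.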

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 8 joined 0 remove 1 failed 0 seq 11,11
members       1 2 3 4 5 8 9 10
new change    member 8 joined 1 remove 0 failed 0 seq 12,41
new status    wait_messages 0 wait_condition 1 fencing
new members   1 2 3 4 5 8 9 10
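
The lockspace status above can be broken into fields to see what the pending change is waiting for. This is a rough sketch, not dlm_tool's own parser: parse_lockspace is a hypothetical helper, and it assumes the number after "member" is a member count rather than a node ID. That reading is consistent with the "counts member 8 joined 1" lines in the DLM dump and with the eight IDs listed under "new members".

```python
# Lockspace status, copied from the listing above.
LOCKSPACE = """\
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 8 joined 0 remove 1 failed 0 seq 11,11
members       1 2 3 4 5 8 9 10
new change    member 8 joined 1 remove 0 failed 0 seq 12,41
new status    wait_messages 0 wait_condition 1 fencing
new members   1 2 3 4 5 8 9 10
"""

# Two-word keys come first so "new change" is not mistaken for "change".
KEYS = ("new change", "new status", "new members",
        "name", "id", "flags", "change", "members")


def parse_lockspace(text):
    """Map each known key to the list of whitespace-separated values after it."""
    info = {}
    for line in text.splitlines():
        for key in KEYS:
            if line.startswith(key + " "):
                info[key] = line[len(key):].split()
                break
    return info


ls = parse_lockspace(LOCKSPACE)

# "member 8 joined 1 remove 0 failed 0" -> {"member": 8, "joined": 1, ...}
pending = ls["new change"]
counts = dict(zip(pending[0:8:2], map(int, pending[1:8:2])))

# Assumption: the last word of "new status" names what the change waits on.
waiting_on = ls["new status"][-1]

print("member count:", counts["member"])
print("IDs in new members:", len(ls["new members"]))
print("waiting on:", waiting_on)
```

Under that assumption, the pending change describes eight members with one node joining, and it is held up waiting on fencing.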



DLM dump:
1396849677 cluster node 2 added seq 18972
1396849677 set_configfs_node 2 10.14.18.66 local 0
1396849677 cluster node 3 added seq 18972
1396849677 set_configfs_node 3 10.14.18.67 local 0
1396849677 cluster node 4 added seq 18972
1396849677 set_configfs_node 4 10.14.18.68 local 0
1396849677 cluster node 5 added seq 18972
1396849677 set_configfs_node 5 10.14.18.70 local 0
1396849677 cluster node 8 added seq 18972
1396849677 set_configfs_node 8 10.14.18.80 local 0
1396849677 cluster node 9 added seq 18972
1396849677 set_configfs_node 9 10.14.18.81 local 0
1396849677 cluster node 10 added seq 18972
1396849677 set_configfs_node 10 10.14.18.77 local 0
1396849677 dlm:ls:clvmd conf 2 1 0 memb 1 3 join 3 left
1396849677 clvmd add_change cg 35 joined nodeid 3
1396849677 clvmd add_change cg 35 counts member 2 joined 1 remove 0 failed 0
1396849677 dlm:ls:clvmd conf 3 1 0 memb 1 2 3 join 2 left
1396849677 clvmd add_change cg 36 joined nodeid 2
1396849677 clvmd add_change cg 36 counts member 3 joined 1 remove 0 failed 0
1396849677 dlm:ls:clvmd conf 4 1 0 memb 1 2 3 9 join 9 left
1396849677 clvmd add_change cg 37 joined nodeid 9
1396849677 clvmd add_change cg 37 counts member 4 joined 1 remove 0 failed 0
1396849677 dlm:ls:clvmd conf 5 1 0 memb 1 2 3 8 9 join 8 left
1396849677 clvmd add_change cg 38 joined nodeid 8
1396849677 clvmd add_change cg 38 counts member 5 joined 1 remove 0 failed 0
1396849677 dlm:ls:clvmd conf 6 1 0 memb 1 2 3 8 9 10 join 10 left
1396849677 clvmd add_change cg 39 joined nodeid 10
1396849677 clvmd add_change cg 39 counts member 6 joined 1 remove 0 failed 0
1396849677 dlm:ls:clvmd conf 7 1 0 memb 1 2 3 5 8 9 10 join 5 left
1396849677 clvmd add_change cg 40 joined nodeid 5
1396849677 clvmd add_change cg 40 counts member 7 joined 1 remove 0 failed 0
1396849677 dlm:ls:clvmd conf 8 1 0 memb 1 2 3 4 5 8 9 10 join 4 left
1396849677 clvmd add_change cg 41 joined nodeid 4
1396849677 clvmd add_change cg 41 counts member 8 joined 1 remove 0 failed 0
1396849677 dlm:controld conf 2 1 0 memb 1 3 join 3 left
1396849677 dlm:controld conf 3 1 0 memb 1 2 3 join 2 left
1396849677 dlm:controld conf 4 1 0 memb 1 2 3 9 join 9 left
1396849677 dlm:controld conf 5 1 0 memb 1 2 3 8 9 join 8 left
1396849677 dlm:controld conf 6 1 0 memb 1 2 3 8 9 10 join 10 left
1396849677 dlm:controld conf 7 1 0 memb 1 2 3 5 8 9 10 join 5 left
1396849677 dlm:controld conf 8 1 0 memb 1 2 3 4 5 8 9 10 join 4 left
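
To line the dump entries up with the cman join times, the leading number on each dump line can be converted to a date. A small sketch, assuming that number is a Unix timestamp in seconds; dump_time_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone


def dump_time_utc(line):
    """Convert the leading field of a dump line to a UTC datetime.

    Assumes the field is Unix epoch seconds.
    """
    seconds = int(line.split()[0])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


stamp = dump_time_utc("1396849677 cluster node 2 added seq 18972")
print(stamp.isoformat())
```

This gives 2014-04-07T05:47:57+00:00, which matches the 22:47:57 join time in the cman_tool output if the nodes report local time at UTC-7 (an assumption about the hosts' time zone).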



