[linux-lvm] clvmd leaving kernel dlm uncontrolled lockspace
pgadmin at pse-consulting.de
Thu Jun 6 06:17:17 UTC 2013
On 05.06.13 17:13, David Teigland wrote:
> A few different topics wrapped together there:
> - With kill -9 clvmd (possibly combined with dlm_tool leave clvmd),
> you can manually clear/remove a userland lockspace like clvmd.
> - If clvmd is blocked in the kernel in uninterruptible sleep, then
> the kill above will not work. To make kill work, you'd locate the
> particular sleep in the kernel and determine if there's a way to
> make it interruptible, and cleanly back it out.
I had clvmds blocked in the kernel, so how do I "locate the sleep and make it
interruptible"?
> - If clvmd is blocked in the kernel for >120s, you probably want to
> investigate what is causing that, rather than being too hasty
> killing clvmd.
INFO: task clvmd:19766 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd D ffff880058ec4870 0 19766 1 0x00000000
ffff880058ec4870 0000000000000282 0000000000000000 ffff8800698d9590
0000000000013740 ffff880063787fd8 ffff880063787fd8 0000000000013740
ffff880058ec4870 ffff880063786010 0000000000000001 0000000100000000
[<ffffffff81367f7a>] ? rwsem_down_failed_common+0xda/0x10e
[<ffffffff811c5924>] ? call_rwsem_down_read_failed+0x14/0x30
[<ffffffff813678da>] ? down_read+0x17/0x19
[<ffffffffa059b705>] ? dlm_user_request+0x3a/0x17e [dlm]
[<ffffffffa05a40e4>] ? device_write+0x279/0x5f7 [dlm]
[<ffffffff810f7d7a>] ? __kmalloc+0x104/0x116
[<ffffffffa05a416b>] ? device_write+0x300/0x5f7 [dlm]
[<ffffffff810042c9>] ? xen_mc_flush+0x12b/0x158
[<ffffffff8117489e>] ? security_file_permission+0x18/0x2d
[<ffffffff81106dd5>] ? vfs_write+0xa4/0xff
[<ffffffff81106ee6>] ? sys_write+0x45/0x6e
[<ffffffff8136d652>] ? system_call_fastpath+0x16/0x1b
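As a rough sketch of how one might spot tasks stuck in uninterruptible sleep like the clvmd above: the helper below filters ps-style output for a state starting with "D". The function name filter_dstate is made up for illustration, and the column layout (PID, STAT, WCHAN, COMMAND) is an assumption matching the ps invocation shown in the comment.

```shell
# Print only the lines whose second column (process state) starts with "D",
# i.e. tasks in uninterruptible sleep.
# Expects "PID STAT WCHAN COMMAND" lines on stdin (hypothetical helper).
filter_dstate() {
    awk '$2 ~ /^D/ { print }'
}

# Possible use on a live system (procps ps):
#   ps -eo pid,stat,wchan:32,comm | filter_dstate
#
# For one blocked PID, root can usually read the kernel stack directly,
# provided the kernel was built with stack trace support:
#   cat /proc/19766/stack
```

This only shows where a task sleeps; whether that sleep can be made interruptible still has to be decided in the kernel source.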
> - If corosync or dlm_controld are killed while dlm lockspaces exist,
> they become "uncontrolled" and would need to be forcibly cleaned up.
> This cleanup may be possible to implement for userland lockspaces,
> but it's not been clear that the benefits would greatly outweigh
> using reboot for this.
On a machine that is a Xen host with 20+ running VMs, I'd clearly prefer to
clean up that orphaned memory space and carry on. I still have 4 hosts to
reboot that serve as Xen hosts and provide their devices from
clvmd-controlled (i.e. now uncontrollable) SAN space.
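To see which lockspaces the kernel still holds on such a host, something like the following might help. It is a minimal sketch, assuming each kernel lockspace appears as a directory under /sys/kernel/dlm; the function name list_kernel_lockspaces is hypothetical.

```shell
# List kernel DLM lockspaces by name.
# Assumption: every lockspace is a directory under /sys/kernel/dlm.
# An alternative directory can be passed as $1 (useful for testing).
list_kernel_lockspaces() {
    dir="${1:-/sys/kernel/dlm}"
    # Nothing to list if the dlm module is not loaded.
    [ -d "$dir" ] || return 0
    for entry in "$dir"/*; do
        [ -d "$entry" ] && basename "$entry"
    done
    return 0
}

# Possible use:
#   list_kernel_lockspaces
#   dlm_tool ls    # lockspaces known to a running dlm_controld
```

A name that shows up in the kernel list while no dlm_controld is managing it would be a candidate for an "uncontrolled" lockspace. This only lists them; it does not clean anything up.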
> - Killing either corosync or dlm_controld is very unlikely to help
> anything, and more likely to cause further problems, so it should
> be avoided as far as possible.
I understand. One reason to upgrade was that I had infrequent
situations where the corosync 1.4.2 instances on all nodes exited
simultaneously without any log notice. Having this happen with the new
corosync 2.3/dlm infrastructure would mean a whole cluster with
uncontrollable SAN space. So either the lockspace should be
automatically reclaimed if dlm_controld finds it uncontrolled, or a
means to clean it up manually should be available.