[Linux-cluster] clvmd hang

Robert Clark cluster at defuturo.co.uk
Wed Apr 2 14:02:58 UTC 2008


  I'm having some problems with clvmd hanging on our 8-node cluster.
Once hung, any lvm commands wait indefinitely. This normally happens
when starting up the cluster or if multiple nodes reboot. After some
experimentation I've managed to reproduce it consistently on a smaller
3-node test cluster by stopping clvmd on one node and then running
vgscan on another. The vgscan hangs, and clvmd hangs with it. Restarting
clvmd on the stopped node doesn't clear the hang.

  Once hung, strace shows three clvmd threads: two waiting on futexes
and one trying to read from /dev/misc/dlm_clvmd. All three threads wait
indefinitely in these system calls. Here's the last part of the strace:

[pid  2951] select(1024, [4 6], NULL, NULL, {90, 0}) = 1 (in [4], left {56, 190000})
[pid  2951] accept(4, {sa_family=AF_FILE, path=@}, [2]) = 5
[pid  2951] ioctl(6, 0x7805, 0)         = 1
[pid  2951] select(1024, [4 5 6], NULL, NULL, {90, 0}) = 1 (in [5], left {90, 0})
[pid  2951] read(5, "3\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\0\4\4P_global\0\0", 4096) = 29
[pid  2951] futex(0x84d64f4, FUTEX_WAIT, 2, NULL <unfinished ...>
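
  To make the read() buffer in the trace easier to inspect, here is a
minimal Python sketch that turns the escaped string strace printed back
into raw bytes and pulls out any printable text. It is not part of clvmd
or strace; the helper names decode_strace_string and printable_runs are
made up for illustration, and it makes no attempt to interpret the
binary fields around the name.

```python
import re

# Escaped buffer copied from the read(5, ...) line of the strace output.
STRACE_BUF = r"3\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\0\4\4P_global\0\0"


def decode_strace_string(escaped):
    # strace prints non-printable bytes as C-style escapes (octal and
    # single-letter forms); Python's unicode_escape codec understands
    # the same forms, and latin-1 maps code points 0-255 back to bytes.
    return escaped.encode("ascii").decode("unicode_escape").encode("latin-1")


def printable_runs(data, min_len=4):
    # Return runs of at least min_len printable ASCII bytes, much like
    # the strings(1) utility would.
    pattern = re.compile(rb"[ -~]{" + str(min_len).encode("ascii") + rb",}")
    return [m.decode("ascii") for m in pattern.findall(data)]


raw = decode_strace_string(STRACE_BUF)
print(len(raw))             # should match the read() return value
print(printable_runs(raw))  # any lock or resource names in the request
```

  The decoded length can be compared with the value read() returned, and
the printable run is the name the client asked clvmd to act on.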

  P_global doesn't show up in /proc/cluster/dlm_locks at this point.
Here's what I can get from dlm_debug:

clvmd rebuilt 5 resources
clvmd purge requests
clvmd purged 0 requests
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
clvmd updated 0 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 22 done
clvmd move flags 0,0,1 ids 11,22,22
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 22 finished
clvmd move flags 1,0,0 ids 22,22,22
clvmd move flags 0,1,0 ids 22,23,22
clvmd move use event 23
clvmd recover event 23
clvmd add node 1
clvmd total nodes 3
clvmd rebuild resource directory
clvmd rebuilt 5 resources
clvmd purge requests
clvmd purged 0 requests
clvmd mark waiting requests
clvmd marked 0 requests
clvmd recover event 23 done
clvmd move flags 0,0,1 ids 22,23,23
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 23 finished
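
  A quick way to check whether any recovery event in output like the
above was left incomplete is to track the last state logged for each
event. The Python sketch below assumes only the line format visible in
this excerpt ("<lockspace> recover event <id>" with an optional "done"
or "finished" suffix); the function names are hypothetical, and treating
"finished" as the terminal state is an assumption drawn from the wording
of the log.

```python
import re

# A few lines copied from the dlm_debug excerpt above.
DLM_DEBUG = """\
clvmd recover event 22 done
clvmd recover event 22 finished
clvmd move use event 23
clvmd recover event 23
clvmd add node 1
clvmd total nodes 3
clvmd recover event 23 done
clvmd recover event 23 finished
"""

EVENT_RE = re.compile(r"^(\S+) recover event (\d+)(?: (done|finished))?$")


def recovery_states(log_text):
    # Map (lockspace, event id) to the last recovery state seen.
    states = {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            key = (m.group(1), int(m.group(2)))
            states[key] = m.group(3) or "started"
    return states


def unfinished(states):
    # Events whose last logged state is not "finished".
    return sorted(k for k, s in states.items() if s != "finished")


states = recovery_states(DLM_DEBUG)
print(states)
print(unfinished(states))
```

  An empty list from unfinished() would mean every recovery event in the
excerpt reached its "finished" line.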

  I'm running 4.6 with kernel-hugemem-2.6.9-67.0.7.EL,
lvm2-cluster-2.02.27-2.el4_6.2 & dlm-kernel-hugemem-2.6.9-52.5. Has
anyone else seen anything like this?

	Thanks,

		Robert



