[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
ccaulfie at redhat.com
Mon Jan 11 09:03:46 UTC 2010
On 08/01/10 22:58, Evan Broder wrote:
> [please preserve the CC when replying, thanks]
> Hi -
> We're attempting to set up a clvm (2.02.56) cluster using OpenAIS
> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
> crashes leaving DLM state around and forcing us to reboot our nodes,
> so we're specifically looking for a solution that doesn't involve
> in-kernel locking.
> We're also running the Pacemaker OpenAIS service, as we're hoping to
> use it for management of some other resources going forward.
> We've managed to form the OpenAIS cluster, and get clvmd running on
> both of our nodes. Operations using LVM succeed, so long as only one
> operation runs at a time. However, if we attempt to run two operations
> (say, one lvcreate on each host) at a time, they both hang, and both
> clvmd processes appear to deadlock.
> When they deadlock, it doesn't appear to affect the other clustering
> processes - both corosync and pacemaker still report a fully formed
> cluster, so it seems the issue is localized to clvmd.
> I've looked at logs from corosync and pacemaker, and I've straced
> various processes, but I don't want to blast a bunch of useless
> information at the list. What information can I provide to make it
> easier to debug and fix this deadlock?
To start with, the most useful logs are clvmd's own debug logs, which you
can get by running clvmd -d (see the man page for details). Ideally these
should be from all nodes in the cluster so they can be correlated. If
you're still using DLM then a dlm lock dump from all nodes is often
helpful in conjunction with the clvmd logs.
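As a rough sketch of collecting those logs, something like the following could be run on each node. The log path is an arbitrary choice for illustration, and it assumes that any already-running clvmd has been stopped first and that -d sends the debug output to stderr; check the clvmd man page for the exact behaviour of -d on your version.

```shell
#!/bin/sh
# Hypothetical per-node log file; the name is only an example,
# chosen so that files from different nodes do not collide.
LOG="/tmp/clvmd-debug-$(uname -n).log"

if command -v clvmd >/dev/null 2>&1; then
    # Assumption: -d writes clvmd's debug output to stderr,
    # so redirect stderr into the per-node log file.
    clvmd -d 2> "$LOG"
else
    echo "clvmd not found on this host" >&2
fi
```

Copying the resulting file from every node to one place makes it easier to line up what each clvmd was doing at the time of the hang.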
Also, did you know it's possible to use clvmd without the DLM? The -I
openais option will tell it to use the Lck service in userspace - though
if there are DLM bugs I think we'd like to fix them if possible ;-)