[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync
ccaulfie at redhat.com
Tue Jan 12 08:54:41 UTC 2010
On 11/01/10 09:38, Christine Caulfield wrote:
> On 11/01/10 09:32, Evan Broder wrote:
>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>> <ccaulfie at redhat.com> wrote:
>>> On 08/01/10 22:58, Evan Broder wrote:
>>>> [please preserve the CC when replying, thanks]
>>>> Hi -
>>>> We're attempting to set up a clvm (2.02.56) cluster using OpenAIS
>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>> so we're specifically looking for a solution that doesn't involve
>>>> in-kernel locking.
>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>> use it for management of some other resources going forward.
>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>> operation runs at a time. However, if we attempt to run two operations
>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>> clvmd processes appear to deadlock.
>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>> processes - both corosync and pacemaker still report a fully formed
>>>> cluster, so it seems the issue is localized to clvmd.
>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>> various processes, but I don't want to blast a bunch of useless
>>>> information at the list. What information can I provide to make it
>>>> easier to debug and fix this deadlock?
>>> To start with, the best logging to produce is the clvmd logs which can
>>> be got with clvmd -d (see the man page for details). Ideally these
>>> should be from all nodes in the cluster so they can be correlated. If
>>> you're still using DLM then a dlm lock dump from all nodes is often
>>> helpful in conjunction with the clvmd logs.
>> Sure, no problem. I've posted the logs from clvmd on both processes in
>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>> few points with what I was doing - the annotations all start with
>> ">>>> ", so they should be easy to spot.
Ironically it looks like a bug in the clvmd-openais code. I can
reproduce it on my systems here. I don't see the problem when using the dlm!
Can you try -Icorosync and see if that helps? In the meantime I'll have
a look at the openais bits to try and find out what is wrong.
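
For illustration only, here is a rough sketch of how the two flags mentioned in this thread (-d for debug logging and -Icorosync for the cluster interface) might be combined. The helper name clvmd_cmd is hypothetical, and it only prints the command line instead of running it, on the assumption that any already-running clvmd has to be stopped first (how to do that depends on the distribution and is not shown).

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the clvmd invocation instead of
# executing it, so the flags can be reviewed before restarting the daemon.
clvmd_cmd() {
    # $1 = cluster interface name handed to -I (e.g. corosync, as suggested above)
    printf 'clvmd -d -I%s\n' "$1"
}

# Show the command that would be run (as root) on each node.
clvmd_cmd corosync
```

Running the printed command on every node should keep the debug output comparable between nodes, as requested earlier in the thread.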