[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync

Christine Caulfield ccaulfie at redhat.com
Wed Jan 13 09:59:30 UTC 2010


On 12/01/10 16:21, Evan Broder wrote:
> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
> <ccaulfie at redhat.com>  wrote:
>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>
>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>
>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>  wrote:
>>>>>
>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>
>>>>>> [please preserve the CC when replying, thanks]
>>>>>>
>>>>>> Hi -
>>>>>> We're attempting to set up a clvm (2.02.56) cluster using OpenAIS
>>>>>> (1.1.1) and Corosync (1.1.2). We've been bitten hard in the past by
>>>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>> in-kernel locking.
>>>>>>
>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>>>> use it for management of some other resources going forward.
>>>>>>
>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>>>> operation runs at a time. However, if we attempt to run two operations
>>>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>>>> clvmd processes appear to deadlock.
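>>>>>>
>>>>>> (As a concrete illustration, with a hypothetical shared VG called
>>>>>> "xenvg", running
>>>>>>
>>>>>>     node1# lvcreate -n test1 -L 1G xenvg
>>>>>>     node2# lvcreate -n test2 -L 1G xenvg
>>>>>>
>>>>>> at roughly the same time is enough to hang both clvmd processes.)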
>>>>>>
>>>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>
>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>> information at the list. What information can I provide to make it
>>>>>> easier to debug and fix this deadlock?
>>>>>>
>>>>>
>>>>> To start with, the best logging to produce is the clvmd logs, which
>>>>> can be obtained with clvmd -d (see the man page for details). Ideally
>>>>> these should be from all nodes in the cluster so they can be
>>>>> correlated. If you're still using the DLM, then a dlm lock dump from
>>>>> all nodes is often helpful in conjunction with the clvmd logs.
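>>>>>
>>>>> For example (a rough sketch - check your man page, as the -d levels
>>>>> differ between versions), on each node:
>>>>>
>>>>>     clvmd -d1 2> /tmp/clvmd-debug.log
>>>>>
>>>>> and, if the dlm is in use, a lock dump from debugfs (assuming it is
>>>>> mounted):
>>>>>
>>>>>     cat /sys/kernel/debug/dlm/clvmd_locks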
>>>>
>>>> Sure, no problem. I've posted the logs from clvmd on both nodes at
>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>> few points with what I was doing - the annotations all start with
>>>> ">> ", so they should be easy to spot.
>>
>>
>> Ironically, it looks like a bug in the clvmd-openais code. I can
>> reproduce it on my systems here, and I don't see the problem when using
>> the dlm!
>>
>> Can you try -Icorosync and see if that helps? In the meantime I'll have
>> a look at the openais bits to try to find out what is wrong.
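>>
>> (Concretely: stop clvmd on all nodes and restart it with the corosync
>> cluster interface selected, e.g.
>>
>>     clvmd -Icorosync
>>
>> - "clvmd -h" should list the interfaces your build supports.)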
>>
>> Chrissie
>>
>
> I'll see what we can pull together, but the nodes running the clvm
> cluster are also Xen dom0s. They're currently running on (Ubuntu
> Hardy's) 2.6.24, so upgrading them to something new enough to support
> DLM 3 would be... challenging.
>
> It would be much, much better for us if we could get clvmd-openais working.
>
> Is there any chance this would work better if we dropped back to
> openais whitetank instead of corosync + openais wilson?
>


OK, I've found the bug and it IS in openais. The attached patch will fix it.
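
(To apply it, something along these lines - the exact -p level depends
on how the patch was generated:

    cd openais-1.1.1
    patch -p1 < lck-continue.patch

then rebuild openais and restart it on each node.)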

Chrissie
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lck-continue.patch
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100113/564048ba/attachment.ksh>

