[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync

Evan Broder broder at mit.edu
Sat Jan 23 17:35:36 UTC 2010


On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 21/01/10 15:17, Evan Broder wrote:
>>
>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>> <ccaulfie at redhat.com>  wrote:
>>>
>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>
>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>    wrote:
>>>>>
>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>
>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>
>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com>    wrote:
>>>>>>>>
>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>
>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>
>>>>>>>>> Hi -
>>>>>>>>> We're attempting to set up a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've been bitten hard in the past by
>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
>>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>>> in-kernel locking.
>>>>>>>>>
>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>
>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only one
>>>>>>>>> operation runs at a time. However, if we attempt to run two operations
>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and both
>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>
>>>>>>>>> When they deadlock, it doesn't appear to affect the other clustering
>>>>>>>>> processes - both corosync and pacemaker still report a fully formed
>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>
>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>> information at the list. What information can I provide to make it
>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>
>>>>>>>>
>>>>>>>> To start with, the best logging to produce is the clvmd logs, which
>>>>>>>> can be obtained with clvmd -d (see the man page for details). Ideally
>>>>>>>> these should be from all nodes in the cluster so they can be
>>>>>>>> correlated. If you're still using the DLM, then a dlm lock dump from
>>>>>>>> all nodes is often helpful in conjunction with the clvmd logs.
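>>>>>>>>
>>>>>>>> For example, a rough sketch of collecting both (the file names are
>>>>>>>> just placeholders, and dlm_tool may not exist on older cluster
>>>>>>>> stacks, so check the clvmd and dlm_tool man pages for your versions):
>>>>>>>>
>>>>>>>>   # on each node: run clvmd in the foreground with debugging enabled
>>>>>>>>   # (some versions take a numeric level, e.g. -d2) and keep the output
>>>>>>>>   clvmd -d 2> /var/tmp/clvmd-debug.log
>>>>>>>>
>>>>>>>>   # on each node, while the commands are hung: dump the lock state of
>>>>>>>>   # the lockspace clvmd uses (normally called "clvmd")
>>>>>>>>   dlm_tool lockdump clvmd > /var/tmp/dlm-clvmd-locks.txt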
>>>>>>>
>>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes in
>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>>> few points with what I was doing - the annotations all start with
>>>>>>> ">>> ", so they should be easy to spot.
>>>>>
>>>>>
>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>> reproduce it on my systems here. I don't see the problem when using
>>>>> the dlm!
>>>>>
>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>> have a look at the openais bits to try and find out what is wrong.
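>>>>>
>>>>> Something like this, for example (assuming your clvmd build has the
>>>>> corosync interface compiled in; the log file name is just a
>>>>> placeholder):
>>>>>
>>>>>   # stop the running clvmd first, then restart it using the corosync
>>>>>   # cluster interface instead of the openais one, with debugging on
>>>>>   clvmd -Icorosync -d 2> /var/tmp/clvmd-corosync.log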
>>>>>
>>>>> Chrissie
>>>>>
>>>>
>>>> I'll see what we can pull together, but the nodes running the clvm
>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>> DLM 3 would be...challenging.
>>>>
>>>> It would be much, much better for us if we could get clvmd-openais
>>>> working.
>>>>
>>>> Is there any chance this would work better if we dropped back to
>>>> openais whitetank instead of corosync + openais wilson?
>>>>
>>>
>>>
>>> OK, I've found the bug and it IS in openais. The attached patch will
>>> fix it.
>>>
>>> Chrissie
>>>
>>
>> Awesome. That patch fixed our problem.
>>
>> We are running into one other problem - performing LVM operations on
>> one node is substantially slower than performing them on the other
>> node:
>>
>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>   Logical volume "test" created
>>
>> real    0m0.309s
>> user    0m0.000s
>> sys     0m0.008s
>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>   Logical volume "test" successfully removed
>>
>> real    0m0.254s
>> user    0m0.004s
>> sys     0m0.008s
>>
>>
>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>   Logical volume "test" created
>>
>> real    0m7.282s
>> user    0m6.396s
>> sys     0m0.312s
>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>   Logical volume "test" successfully removed
>>
>> real    0m7.277s
>> user    0m6.420s
>> sys     0m0.292s
>>
>> Any idea why this is happening and if there's anything we can do about it?
>>
>
>
> I'm not at all sure why that should be happening. I suppose the best
> thing to do would be to enable clvmd logging (clvmd -d) and see what
> is taking the time.
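>
> For example (a rough sketch; the file name is just a placeholder, and on
> some versions -d takes a numeric level, so check the clvmd man page):
>
>   # on the slow node, run clvmd with debugging and keep the output
>   clvmd -d 2> /var/tmp/clvmd-debug.log
>
>   # in another terminal, repeat the slow command, then look for large
>   # gaps between consecutive timestamps in the debug log
>   time lvcreate -n test -L 1G xenvg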
>
> Chrissie
>

No problem. I've collected another set of logs - they're in
<http://web.mit.edu/broder/Public/clvmd-slow/>.

After spinning up corosync and clvmd, the commands I ran were, in order:

  root at black-mesa:~# vgchange -a y xenvg
    0 logical volume(s) in volume group "xenvg" now active
  root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
    Logical volume "test1" created

  real    0m0.685s
  user    0m0.004s
  sys     0m0.000s
  root at black-mesa:~# time lvremove -f /dev/xenvg/test1
    Logical volume "test1" successfully removed

  real    0m0.235s
  user    0m0.004s
  sys     0m0.004s
  root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
    Logical volume "test2" created

  real    0m8.007s
  user    0m6.396s
  sys     0m0.312s
  root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
    Logical volume "test2" successfully removed

  real    0m7.364s
  user    0m6.436s
  sys     0m0.300s
  root at black-mesa:~# vgchange -a n xenvg
    0 logical volume(s) in volume group "xenvg" now active

(black-mesa is node 1, and torchwood-institute is node 2)

Thanks again for your help,
 - Evan



