[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync

Christine Caulfield ccaulfie at redhat.com
Mon Jan 25 13:30:40 UTC 2010


On 25/01/10 13:28, Evan Broder wrote:
> On Mon, Jan 25, 2010 at 8:13 AM, Evan Broder <broder at mit.edu> wrote:
>> On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
>> <ccaulfie at redhat.com> wrote:
>>> On 23/01/10 17:35, Evan Broder wrote:
>>>>
>>>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com> wrote:
>>>>>
>>>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>>>> <ccaulfie at redhat.com> wrote:
>>>>>>>
>>>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>>>> <ccaulfie at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>>>> <ccaulfie at redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi -
>>>>>>>>>>>>> We're attempting to set up a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've been bitten hard in the past
>>>>>>>>>>>>> by crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>>>>> nodes, so we're specifically looking for a solution that doesn't
>>>>>>>>>>>>> involve in-kernel locking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping
>>>>>>>>>>>>> to use it for management of some other resources going forward.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running
>>>>>>>>>>>>> on both of our nodes. Operations using LVM succeed, so long as
>>>>>>>>>>>>> only one operation runs at a time. However, if we attempt to run
>>>>>>>>>>>>> two operations (say, one lvcreate on each host) at a time, they
>>>>>>>>>>>>> both hang, and both clvmd processes appear to deadlock.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>>>> clustering processes - both corosync and pacemaker still report
>>>>>>>>>>>>> a fully formed cluster, so it seems the issue is localized to
>>>>>>>>>>>>> clvmd.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>>>>>> information at the list. What information can I provide to make
>>>>>>>>>>>>> it easier to debug and fix this deadlock?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> To start with, the best logging to produce is the clvmd logs,
>>>>>>>>>>>> which can be obtained with clvmd -d (see the man page for
>>>>>>>>>>>> details). Ideally these should come from all nodes in the cluster
>>>>>>>>>>>> so they can be correlated. If you're still using the DLM then a
>>>>>>>>>>>> dlm lock dump from all nodes is often helpful in conjunction with
>>>>>>>>>>>> the clvmd logs.
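>>>>>>>>>>>>
>>>>>>>>>>>> As a rough sketch (exact flags and the dlm_tool subcommand vary
>>>>>>>>>>>> between versions; the clvmd lockspace is normally just called
>>>>>>>>>>>> "clvmd"), something like:
>>>>>>>>>>>>
>>>>>>>>>>>>    # on each node, run clvmd in the foreground with debug output
>>>>>>>>>>>>    clvmd -d 2> /tmp/clvmd-$(hostname).log
>>>>>>>>>>>>
>>>>>>>>>>>>    # on each node, dump the dlm locks for the clvmd lockspace
>>>>>>>>>>>>    dlm_tool lockdump clvmd > /tmp/dlm-$(hostname).txt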
>>>>>>>>>>>
>>>>>>>>>>> Sure, no problem. I've posted the logs from both clvmd processes
>>>>>>>>>>> in <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them
>>>>>>>>>>> at a few points with what I was doing - the annotations all start
>>>>>>>>>>> with ">>", so they should be easy to spot.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>>>> reproduce it on my systems here. I don't see the problem when using
>>>>>>>>> the dlm!
>>>>>>>>>
>>>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>>>> have a look at the openais bits to try and find out what is wrong.
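>>>>>>>>>
>>>>>>>>> Roughly (a sketch only - stop the running clvmd first, and note that
>>>>>>>>> your init script may pass extra options of its own):
>>>>>>>>>
>>>>>>>>>    clvmd -Icorosync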
>>>>>>>>>
>>>>>>>>> Chrissie
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>>>>>> DLM 3 would be...challenging.
>>>>>>>>
>>>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>>>> working.
>>>>>>>>
>>>>>>>> Is there any chance this would work better if we dropped back to
>>>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>>>> fix it.
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>
>>>>>> Awesome. That patch fixed our problem.
>>>>>>
>>>>>> We are running into one other problem - performing LVM operations on
>>>>>> one node is substantially slower than performing them on the other
>>>>>> node:
>>>>>>
>>>>>> root@black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>>>>    Logical volume "test" created
>>>>>>
>>>>>> real    0m0.309s
>>>>>> user    0m0.000s
>>>>>> sys     0m0.008s
>>>>>> root@black-mesa:~# time lvremove -f /dev/xenvg/test
>>>>>>    Logical volume "test" successfully removed
>>>>>>
>>>>>> real    0m0.254s
>>>>>> user    0m0.004s
>>>>>> sys     0m0.008s
>>>>>>
>>>>>>
>>>>>> root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>>>>    Logical volume "test" created
>>>>>>
>>>>>> real    0m7.282s
>>>>>> user    0m6.396s
>>>>>> sys     0m0.312s
>>>>>> root@torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>>>>    Logical volume "test" successfully removed
>>>>>>
>>>>>> real    0m7.277s
>>>>>> user    0m6.420s
>>>>>> sys     0m0.292s
>>>>>>
>>>>>> Any idea why this is happening and if there's anything we can do about
>>>>>> it?
>>>>>>
>>>>>
>>>>>
>>>>> I'm not at all sure why that should be happening. I suppose the best
>>>>> thing to do would be to enable clvmd logging (clvmd -d) and see what is
>>>>> taking the time.
>>>>>
>>>>> Chrissie
>>>>>
>>>>
>>>> No problem. I've collected another set of logs - they're in
>>>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>>>
>>>> After spinning up corosync and clvmd, the commands I ran were, in order:
>>>>
>>>>    root@black-mesa:~# vgchange -a y xenvg
>>>>      0 logical volume(s) in volume group "xenvg" now active
>>>>    root@black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>>>>      Logical volume "test1" created
>>>>
>>>>    real    0m0.685s
>>>>    user    0m0.004s
>>>>    sys     0m0.000s
>>>>    root@black-mesa:~# time lvremove -f /dev/xenvg/test1
>>>>      Logical volume "test1" successfully removed
>>>>
>>>>    real    0m0.235s
>>>>    user    0m0.004s
>>>>    sys     0m0.004s
>>>>    root@torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>>>>      Logical volume "test2" created
>>>>
>>>>    real    0m8.007s
>>>>    user    0m6.396s
>>>>    sys     0m0.312s
>>>>    root@torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>>>>      Logical volume "test2" successfully removed
>>>>
>>>>    real    0m7.364s
>>>>    user    0m6.436s
>>>>    sys     0m0.300s
>>>>    root@black-mesa:~# vgchange -a n xenvg
>>>>      0 logical volume(s) in volume group "xenvg" now active
>>>>
>>>> (black-mesa is node 1, and torchwood-institute is node 2)
>>>>
>>>> Thanks again for your help,
>>>>
>>>
>>> Hiya,
>>>
>>> Oddly I can't find any delays in the clvmd logs at all. There are some
>>> 7-second gaps in the log files but those are between commands coming from
>>> the lvm command-line and not internal to clvmd itself.
>>>
>>> What sort of storage are you using for these LVs? The only thing I can think
>>> of now is that there is some sort of delay in lvm opening the device for
>>> writing as it updates the metadata. An LVM debug log might be helpful here,
>>> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
>>> really an LVM developer any more.
>>>
>>> Chrissie
>>>
>>
>> Well, uh, you see...at the moment I'm using a VM on a personal server
>> running iscsitarget from the other side of MIT's campus :)
>>
>> But that's only because our normal RAID is in the process of being
>> RMA'd. Normally it's a Fibre Channel RAID sitting right next to the
>> two servers, and we still had the same timing problems (with the same
>> server being slower) then. The two servers are identical hardware and
>> sitting right next to each other.
>>
>> In any case, thanks for the suggestion to look at LVM itself instead
>> of clvmd - I sort of forgot my basic debugging techniques before
>> asking for help on this one. LVM logs to syslog, and its logging mode
>> includes timestamps.
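>>
>> For anyone following along: the knobs for this live in the log section
>> of lvm.conf, along these lines -
>>
>>    log {
>>        syslog = 1    # send messages to syslog, which timestamps them
>>        level = 7     # 7 is the most verbose debug level
>>    }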
>>
>> When I cranked up the debug level, I noticed that the lvcreates and
>> lvremoves were hanging on 'Archiving volume group "xenvg" metadata
>> (seqno 48).' If I turn off the metadata archives in lvm.conf on
>> torchwood-institute, its performance goes back to what I'd expect, so
>> I guess it's having some sort of disk problem.
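>>
>> The switch for that is in the backup section of lvm.conf - not
>> something to leave off permanently, but useful as a test:
>>
>>    backup {
>>        archive = 0    # skip writing archives to /etc/lvm/archive
>>    }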
>>
>> Thanks again for all of your help,
>>   - Evan
>>
>
> Actually, the real problem turned out to be much more straightforward:
>
> root@torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
> 301369
>
> For some reason, torchwood-institute had built up a massive archive of
> files named like
> /etc/lvm/archive/.lvm_torchwood-institute.mit.edu_17470_993238909.
> Listing them all took...about 7 seconds.
>
> Deleting them all put performance back where it should be.
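>
> For anyone who hits the same thing: the backup section of lvm.conf also
> has retention settings that should keep the archive directory from
> growing without bound, e.g.
>
>    backup {
>        retain_days = 30   # expire archive files older than 30 days...
>        retain_min = 10    # ...but always keep at least this many
>    }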

:-)

Sometimes the simpler thing actually is the problem!

clvmd is usually the first place I look for a slowdown, as it's
complicated and does a lot of locking. It's good that you found the
problem elsewhere.

Thanks for letting me know.

Chrissie