[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync

Evan Broder broder at mit.edu
Tue Jan 26 01:56:24 UTC 2010


On Mon, Jan 25, 2010 at 8:30 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 25/01/10 13:28, Evan Broder wrote:
>>
>> On Mon, Jan 25, 2010 at 8:13 AM, Evan Broder<broder at mit.edu>  wrote:
>>>
>>> On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
>>> <ccaulfie at redhat.com>  wrote:
>>>>
>>>> On 23/01/10 17:35, Evan Broder wrote:
>>>>>
>>>>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com>    wrote:
>>>>>>
>>>>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>>
>>>>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>>>>> <ccaulfie at redhat.com>        wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>>>>> <ccaulfie at redhat.com>        wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi -
>>>>>>>>>>>>>> We're attempting to set up a clvm (2.02.56) cluster using
>>>>>>>>>>>>>> OpenAIS
>>>>>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the
>>>>>>>>>>>>>> past
>>>>>>>>>>>>>> by
>>>>>>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>>>>>> nodes,
>>>>>>>>>>>>>> so we're specifically looking for a solution that doesn't
>>>>>>>>>>>>>> involve
>>>>>>>>>>>>>> in-kernel locking.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're
>>>>>>>>>>>>>> hoping
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd
>>>>>>>>>>>>>> running
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>> one
>>>>>>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>>>>>>> operations
>>>>>>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> both
>>>>>>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>>>>> clustering
>>>>>>>>>>>>>> processes - both corosync and pacemaker still report a fully
>>>>>>>>>>>>>> formed
>>>>>>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've
>>>>>>>>>>>>>> straced
>>>>>>>>>>>>>> various processes, but I don't want to blast a bunch of
>>>>>>>>>>>>>> useless
>>>>>>>>>>>>>> information at the list. What information can I provide to
>>>>>>>>>>>>>> make
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> To start with, the best logging to produce is the clvmd logs
>>>>>>>>>>>>> which
>>>>>>>>>>>>> can be
>>>>>>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>>>>>>> should be
>>>>>>>>>>>>> from all nodes in the cluster so they can be correlated. If
>>>>>>>>>>>>> you're
>>>>>>>>>>>>> still
>>>>>>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful
>>>>>>>>>>>>> in
>>>>>>>>>>>>> conjunction with the clvmd logs.
>>>>>>>>>>>>
>>>>>>>>>>>> Sure, no problem. I've posted the logs from clvmd on both
>>>>>>>>>>>> processes in <http://web.mit.edu/broder/Public/clvmd/>. I've
>>>>>>>>>>>> annotated them at a few points with what I was doing - the
>>>>>>>>>>>> annotations all start with ">>", so they should be easy to spot.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>>>>> reproduce
>>>>>>>>>> it
>>>>>>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>>>>>>
>>>>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>>>>> have
>>>>>>>>>> a
>>>>>>>>>> look at the openais bits to try and find out what is wrong.
>>>>>>>>>>
>>>>>>>>>> Chrissie
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to
>>>>>>>>> support
>>>>>>>>> DLM 3 would be...challenging.
>>>>>>>>>
>>>>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>>>>> working.
>>>>>>>>>
>>>>>>>>> Is there any chance this would work better if we dropped back to
>>>>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>>>>> fix
>>>>>>>> it.
>>>>>>>>
>>>>>>>> Chrissie
>>>>>>>>
>>>>>>>
>>>>>>> Awesome. That patch fixed our problem.
>>>>>>>
>>>>>>> We are running into one other problem - performing LVM operations on
>>>>>>> one node is substantially slower than performing them on the other
>>>>>>> node:
>>>>>>>
>>>>>>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>>>>>   Logical volume "test" created
>>>>>>>
>>>>>>> real    0m0.309s
>>>>>>> user    0m0.000s
>>>>>>> sys     0m0.008s
>>>>>>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>>>>>>   Logical volume "test" successfully removed
>>>>>>>
>>>>>>> real    0m0.254s
>>>>>>> user    0m0.004s
>>>>>>> sys     0m0.008s
>>>>>>>
>>>>>>>
>>>>>>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>>>>>   Logical volume "test" created
>>>>>>>
>>>>>>> real    0m7.282s
>>>>>>> user    0m6.396s
>>>>>>> sys     0m0.312s
>>>>>>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>>>>>   Logical volume "test" successfully removed
>>>>>>>
>>>>>>> real    0m7.277s
>>>>>>> user    0m6.420s
>>>>>>> sys     0m0.292s
>>>>>>>
>>>>>>> Any idea why this is happening and if there's anything we can do
>>>>>>> about
>>>>>>> it?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm not at all sure why that should be happening. I suppose the best
>>>>>> thing
>>>>>> to do would be to enable clvmd logging (clvmd -d) and see what is
>>>>>> taking
>>>>>> the
>>>>>> time.
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>
>>>>> No problem. I've collected another set of logs - they're in
>>>>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>>>>
>>>>> After spinning up corosync and clvmd, the commands I ran were, in
>>>>> order:
>>>>>
>>>>>   root at black-mesa:~# vgchange -a y xenvg
>>>>>     0 logical volume(s) in volume group "xenvg" now active
>>>>>   root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>>>>>     Logical volume "test1" created
>>>>>
>>>>>   real    0m0.685s
>>>>>   user    0m0.004s
>>>>>   sys     0m0.000s
>>>>>   root at black-mesa:~# time lvremove -f /dev/xenvg/test1
>>>>>     Logical volume "test1" successfully removed
>>>>>
>>>>>   real    0m0.235s
>>>>>   user    0m0.004s
>>>>>   sys     0m0.004s
>>>>>   root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>>>>>     Logical volume "test2" created
>>>>>
>>>>>   real    0m8.007s
>>>>>   user    0m6.396s
>>>>>   sys     0m0.312s
>>>>>   root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>>>>>     Logical volume "test2" successfully removed
>>>>>
>>>>>   real    0m7.364s
>>>>>   user    0m6.436s
>>>>>   sys     0m0.300s
>>>>>   root at black-mesa:~# vgchange -a n xenvg
>>>>>     0 logical volume(s) in volume group "xenvg" now active
>>>>>
>>>>> (black-mesa is node 1, and torchwood-institute is node 2)
>>>>>
>>>>> Thanks again for your help,
>>>>>
>>>>
>>>> Hiya,
>>>>
>>>> Oddly I can't find any delays in the clvmd logs at all. There are some
>>>> 7-second gaps in the log files but those are between commands coming
>>>> from
>>>> the lvm command-line and not internal to clvmd itself.
>>>>
>>>> What sort of storage are you using for these LVs? The only thing I can
>>>> think
>>>> of now is that there is some sort of delay in lvm opening the device for
>>>> writing as it updates the metadata. An LVM debug log might be helpful
>>>> here,
>>>> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
>>>> really an LVM developer any more.
>>>>
>>>> Chrissie
>>>>
>>>
>>> Well, uh, you see...at the moment I'm using a VM on a personal server
>>> running iscsitarget from the other side of MIT's campus :)
>>>
>>> But that's only because our normal RAID is in the process of being
>>> RMA'd. Normally it's a Fibre Channel RAID sitting right next to the
>>> two servers, and we still had the same timing problems (with the same
>>> server being slower) then. The two servers are identical hardware and
>>> sitting right next to each other.
>>>
>>> In any case, thanks for the suggestion to look at LVM itself instead
>>> of clvmd - I sort of forgot my basic debugging techniques before
>>> asking for help on this one. LVM logs to syslog, and its logging mode
>>> includes timestamps.
>>>
>>> When I cranked up the debug level, I noticed that the lvcreates and
>>> lvremoves were hanging on 'Archiving volume group "xenvg" metadata
>>> (seqno 48)'. If I turn off the metadata archives in lvm.conf on
>>> torchwood-institute, its performance goes back to what I'd expect, so
>>> I guess it's having some sort of disk problem.
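>>>
>>> For reference, the lvm.conf knobs involved here are roughly the
>>> following (just a sketch - the section and option names are from the
>>> stock lvm.conf comments, and the values are only what I used while
>>> testing, not recommendations):
>>>
>>>     log {
>>>         syslog = 1   # send LVM's own messages to syslog
>>>         level = 7    # debug level (0-7); this is the "cranked up" part
>>>     }
>>>     backup {
>>>         archive = 0  # temporarily disable the metadata archives
>>>     }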
>>>
>>> Thanks again for all of your help,
>>>  - Evan
>>>
>>
>> Actually, the real problem turned out to be much more straightforward:
>>
>> root at torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
>> 301369
>>
>> For some reason, torchwood-institute had built up a massive archive of
>> files named like
>> /etc/lvm/archive/.lvm_torchwood-institute.mit.edu_17470_993238909.
>> Listing them all took...about 7 seconds.
>>
>> Deleting them all put performance back where it should be.
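>>
>> For anyone hitting the same thing, clearing them out is just something
>> along the lines of the following (the exact find invocation here is
>> only an illustration):
>>
>>     find /etc/lvm/archive -name '.lvm_*' -delete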
>
> :-)
>
> Sometimes the simpler thing actually is the problem!
>
> clvmd is usually the first place I do look for a slowdown as it's
> complicated and does a lot of locking. It's good that you found the problem
> elsewhere.
>
> Thanks for letting me know.
>
> Chrissie
>

Hi again!

Sorry to bother everyone again, but deadlocks on simultaneous LVM
operations seem to have resurfaced somehow. As far as I know, we
haven't made any changes, other than restarting the cluster a few
times, but if I run

root at black-mesa:~# lvcreate -n test1 -L 1G xenvg & lvcreate -n test2 -L 1G xenvg

it seems to consistently deadlock.
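
In case it's useful for reproducing this, the rough recipe is below (a
sketch: clvmd -d is the debug switch Chrissie mentioned earlier, and the
ps line is just one way to peek at the stuck daemons):

  # run clvmd with debug logging on both nodes (see the clvmd man page
  # for the value that sends the output to stderr vs. syslog)
  clvmd -d

  # then run the two concurrent lvcreates from above; once they hang,
  # this shows what state the clvmd process on each node is in
  ps -o pid,stat,wchan,cmd -C clvmd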

I should have logs of everything in
<http://web.mit.edu/broder/Public/clvmd/node-1-syslog.20100125> and
<http://web.mit.edu/broder/Public/clvmd/node-2-syslog.20100125>

Thanks again!
 - Evan



