[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync

Evan Broder broder at mit.edu
Mon Jan 25 13:13:31 UTC 2010


On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 23/01/10 17:35, Evan Broder wrote:
>>
>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>> <ccaulfie at redhat.com>  wrote:
>>>
>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>
>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>    wrote:
>>>>>
>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>
>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>
>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>
>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>
>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>>>>
>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>
>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>
>>>>>>>>>>> Hi -
>>>>>>>>>>> We're attempting to set up a clvm (2.02.56) cluster using
>>>>>>>>>>> OpenAIS (1.1.1) and Corosync (1.1.2). We've gotten bitten hard
>>>>>>>>>>> in the past by crashes leaving DLM state around and forcing us
>>>>>>>>>>> to reboot our nodes, so we're specifically looking for a
>>>>>>>>>>> solution that doesn't involve in-kernel locking.
>>>>>>>>>>>
>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're
>>>>>>>>>>> hoping to use it for management of some other resources going
>>>>>>>>>>> forward.
>>>>>>>>>>>
>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd
>>>>>>>>>>> running on both of our nodes. Operations using LVM succeed, so
>>>>>>>>>>> long as only one operation runs at a time. However, if we
>>>>>>>>>>> attempt to run two operations (say, one lvcreate on each host)
>>>>>>>>>>> at a time, they both hang, and both clvmd processes appear to
>>>>>>>>>>> deadlock.
>>>>>>>>>>>
>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>> clustering processes - both corosync and pacemaker still report
>>>>>>>>>>> a fully formed cluster, so it seems the issue is localized to
>>>>>>>>>>> clvmd.
>>>>>>>>>>>
>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've
>>>>>>>>>>> straced various processes, but I don't want to blast a bunch of
>>>>>>>>>>> useless information at the list. What information can I provide
>>>>>>>>>>> to make it easier to debug and fix this deadlock?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> To start with, the best logging to produce is the clvmd logs
>>>>>>>>>> which can be got with clvmd -d (see the man page for details).
>>>>>>>>>> Ideally these should be from all nodes in the cluster so they
>>>>>>>>>> can be correlated. If you're still using DLM then a dlm lock
>>>>>>>>>> dump from all nodes is often helpful in conjunction with the
>>>>>>>>>> clvmd logs.
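>>>>>>>>>>
>>>>>>>>>> For example, gathering those might look roughly like this (the
>>>>>>>>>> log file path is only illustrative, and the lock dump only
>>>>>>>>>> applies if the DLM is in use, with the lockspace assumed to be
>>>>>>>>>> called "clvmd"):
>>>>>>>>>>
>>>>>>>>>>   # on every node: run clvmd with debug output enabled,
>>>>>>>>>>   # capturing it to a file
>>>>>>>>>>   clvmd -d 2> /var/log/clvmd-debug.log
>>>>>>>>>>
>>>>>>>>>>   # DLM users only: one way to get a lock dump is via debugfs
>>>>>>>>>>   mount -t debugfs none /sys/kernel/debug
>>>>>>>>>>   cat /sys/kernel/debug/dlm/clvmd_locks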
>>>>>>>>>
>>>>>>>>> Sure, no problem. I've posted the logs from both clvmd processes
>>>>>>>>> in <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them
>>>>>>>>> at a few points with what I was doing - the annotations all start
>>>>>>>>> with ">> ", so they should be easy to spot.
>>>>>>>
>>>>>>>
>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>> reproduce it on my systems here. I don't see the problem when using
>>>>>>> the dlm!
>>>>>>>
>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>> have a look at the openais bits to try and find out what is wrong.
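>>>>>>>
>>>>>>> That is, something along the lines of stopping clvmd on each node
>>>>>>> and restarting it with the corosync cluster interface selected
>>>>>>> explicitly:
>>>>>>>
>>>>>>>   clvmd -I corosync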
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>
>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>>>> DLM 3 would be...challenging.
>>>>>>
>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>> working.
>>>>>>
>>>>>> Is there any chance this would work better if we dropped back to
>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>
>>>>>
>>>>>
>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>> fix it.
>>>>>
>>>>> Chrissie
>>>>>
>>>>
>>>> Awesome. That patch fixed our problem.
>>>>
>>>> We are running into one other problem - performing LVM operations on
>>>> one node is substantially slower than performing them on the other
>>>> node:
>>>>
>>>> root@black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>>   Logical volume "test" created
>>>>
>>>> real    0m0.309s
>>>> user    0m0.000s
>>>> sys     0m0.008s
>>>> root@black-mesa:~# time lvremove -f /dev/xenvg/test
>>>>   Logical volume "test" successfully removed
>>>>
>>>> real    0m0.254s
>>>> user    0m0.004s
>>>> sys     0m0.008s
>>>>
>>>>
>>>> root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>>   Logical volume "test" created
>>>>
>>>> real    0m7.282s
>>>> user    0m6.396s
>>>> sys     0m0.312s
>>>> root@torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>>   Logical volume "test" successfully removed
>>>>
>>>> real    0m7.277s
>>>> user    0m6.420s
>>>> sys     0m0.292s
>>>>
>>>> Any idea why this is happening and if there's anything we can do about
>>>> it?
>>>>
>>>
>>>
>>> I'm not at all sure why that should be happening. I suppose the best
>>> thing to do would be to enable clvmd logging (clvmd -d) and see what is
>>> taking the time.
>>>
>>> Chrissie
>>>
>>
>> No problem. I've collected another set of logs - they're in
>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>
>> After spinning up corosync and clvmd, the commands I ran were, in order:
>>
>>   root@black-mesa:~# vgchange -a y xenvg
>>     0 logical volume(s) in volume group "xenvg" now active
>>   root@black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>>     Logical volume "test1" created
>>
>>   real    0m0.685s
>>   user    0m0.004s
>>   sys     0m0.000s
>>   root@black-mesa:~# time lvremove -f /dev/xenvg/test1
>>     Logical volume "test1" successfully removed
>>
>>   real    0m0.235s
>>   user    0m0.004s
>>   sys     0m0.004s
>>   root@torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>>     Logical volume "test2" created
>>
>>   real    0m8.007s
>>   user    0m6.396s
>>   sys     0m0.312s
>>   root@torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>>     Logical volume "test2" successfully removed
>>
>>   real    0m7.364s
>>   user    0m6.436s
>>   sys     0m0.300s
>>   root@black-mesa:~# vgchange -a n xenvg
>>     0 logical volume(s) in volume group "xenvg" now active
>>
>> (black-mesa is node 1, and torchwood-institute is node 2)
>>
>> Thanks again for your help,
>>
>
> Hiya,
>
> Oddly I can't find any delays in the clvmd logs at all. There are some
> 7-second gaps in the log files but those are between commands coming from
> the lvm command-line and not internal to clvmd itself.
>
> What sort of storage are you using for these LVs? The only thing I can think
> of now is that there is some sort of delay in lvm opening the device for
> writing as it updates the metadata. An LVM debug log might be helpful here,
> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
> really an LVM developer any more.
>
> Chrissie
>

Well, uh, you see...at the moment I'm using a VM on a personal server
running iscsitarget from the other side of MIT's campus :)

But that's only because our normal RAID is in the process of being
RMA'd. Normally it's a Fibre Channel RAID sitting right next to the
two servers, and we still had the same timing problems (with the same
server being slower) then. The two servers are identical hardware and
sitting right next to each other.

In any case, thanks for the suggestion to look at LVM itself instead
of clvmd - I sort of forgot my basic debugging techniques before
asking for help on this one. LVM logs to syslog, and its logging mode
includes timestamps.
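
In case it's useful to anyone else chasing this, the settings involved
are in the log section of lvm.conf - roughly something like the
following (level 7 is the most verbose; when logging via syslog, the
timestamps come from syslogd):

    log {
        # send debug output to syslog
        syslog = 1
        # or write it to a file as well/instead
        # file = "/var/log/lvm2.log"
        # debug level, 0-7, with 7 the most verbose
        level = 7
        # prefix each message with the command name
        command_names = 1
    }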

When I cranked up the debug level, I noticed that the lvcreates and
lvremoves were hanging on "Archiving volume group "xenvg" metadata
(seqno 48).". If I turn off the metadata archives in lvm.conf on
torchwood-institute, its performance goes back to what I'd expect, so
I guess it's having some sort of disk problems.
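
For reference, what I changed boils down to something like this in the
backup section of lvm.conf (the directories mentioned are just the
usual defaults) - though of course it only works around whatever the
underlying disk problem is:

    backup {
        # keep the single pre-change metadata backup in /etc/lvm/backup
        backup = 1
        # skip the per-operation archive copies in /etc/lvm/archive -
        # writing those is where the time was going
        archive = 0
    }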

Thanks again for all of your help,
 - Evan



