[Linux-cluster] Strange behavior(s) of DLM

Fri Aug 6 13:35:39 UTC 2004

Friday, August 6, 2004, 8:54:29 AM, David Teigland wrote:

> On Wed, Aug 04, 2004 at 11:41:45PM -0400, Jeff wrote:
>> The attached routine demonstrates some strange
>> behavior in the DLM and it was responsible for the
>> dmesg text at the end of this note.
>> 
>> This is on a FC2, SMP box running cvs/latest version of
>> cman and the dlm. It's a 2 CPU box configured with 4 logical
>> CPUs.
>> 
>> I have a two node cluster and the two machines are identical
>> as far as I can tell with the exception of which order they are
>> listed in the cluster config file.
>> 
>> On node #1 (in the config file) when I run the attached test from
>> two terminals the output looks reasonable. The same as it does if
>> I run it on Tru64 or VMS (more or less).
>> 
>>       8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0
>>      18730: over last 9.001 seconds, grant 9807, blkast 0, cancel 0
>>      28403: over last 9.001 seconds, grant 9673, blkast 0, cancel 0
>> 
>> If you shut this down and start it up on node #2 (lx4) you start
>> to get messages that look like:
>>      91280: over last 10.000 seconds, grant 91279, blkast 0, cancel 0
>>     125138: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>     125138: NL Blocking Notification on lockid 0x00010312 (mode 0)
>>     125138: NL Blocking Notification Rountine End  ^^^^^^^^^^^^^^^^^^^^
>>     141370: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>     141371: NL Blocking Notification on lockid 0x00010312 (mode 0)
>>     141371: NL Blocking Notification Rountine End  ^^^^^^^^^^^^^^^^^^^^
>>     141373: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^

> You're running the program on two nodes at once right?  The line with "*"
> is when I started the program on a second node, so it appears I get the
> same thing.  I don't get any assertion failure, though.  That may be the
> result of changes I've checked in for some other bugs over the past couple
> days.

>      57150: over last 10.000 seconds, grant 57149, blkast 0, cancel 0
>     116825: over last 9.001 seconds, grant 59675, blkast 0, cancel 0
> *   123790: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^
>     123790: NL Blocking Notification on lockid 0x00010373 (mode 0)
>     123790: NL Blocking Notification Rountine End  ^^^^^^^^^^^^^^^^^^^^
>     123822: NL Blocking Routine Start ^^^^^^^^^^^^^^^^^^^^^^^^^^
>     123822: NL Blocking Notification on lockid 0x00010373 (mode 0)
>     123822: NL Blocking Notification Rountine End  ^^^^^^^^^^^^^^^^^^^^

I'm running the program from two processes on a single node.

Of the two nodes, running the program from two processes on
node #1 does not produce the 'NL Blocking' messages above, while
running it from two processes on node #2 does. When you run it
on two nodes, I suspect you see the NL blocking on only one of
them, never on the other.
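
To compare the two nodes, the test program's output can be tallied per run. The following is a minimal sketch that assumes the output has exactly the line format pasted above; the function name `summarize` and the returned keys are made up for illustration and are not part of the test program.

```python
import re

# Status line, e.g. "8923: over last 10.000 seconds, grant 8922, blkast 0, cancel 0"
STATUS = re.compile(
    r"^(\d+): over last ([\d.]+) seconds, "
    r"grant (\d+), blkast (\d+), cancel (\d+)"
)
# Notification line, e.g. "125138: NL Blocking Notification on lockid 0x00010312 (mode 0)"
NL_BLOCK = re.compile(
    r"^(\d+): NL Blocking Notification on lockid (0x[0-9a-fA-F]+)"
)


def summarize(lines):
    """Tally grants, elapsed seconds, and NL blocking notifications per lock id."""
    grants = 0
    seconds = 0.0
    nl_by_lock = {}
    for line in lines:
        # Drop leading whitespace and any mail quote markers ('>' or '*').
        line = line.lstrip("> *")
        m = STATUS.match(line)
        if m:
            seconds += float(m.group(2))
            grants += int(m.group(3))
            continue
        m = NL_BLOCK.match(line)
        if m:
            lockid = m.group(2)
            nl_by_lock[lockid] = nl_by_lock.get(lockid, 0) + 1
    return {"grants": grants, "seconds": seconds, "nl_by_lock": nl_by_lock}
```

Running this over the output captured on each node would show whether the NL blocking notifications appear on only one of them.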

I'll update the lock module with the recent changes and try
to reproduce the assertion failure. The way I produce it is:

Starting with both nodes freshly rebooted:

1. Install the modules and have both nodes join the cluster,
   node #1 first and then node #2.

2. Run the program on node #1 and CTRL/C it after a minute
   or so.

3. Start the program on node #2 (one process) and let it run
   for 10-20 seconds (one or two status lines).

4. Start another copy on node #2. This usually generates the
   NL messages.

5. CTRL/C that copy and start it again. Maybe CTRL/C the other
   copy and start it again.

At some point after a CTRL/C and restart, the program just
hangs. From then on the process no longer responds to CTRL/C,
and dmesg shows the various failure messages.
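
The interrupt-and-restart part of this procedure on node #2 could be scripted roughly as follows. This is only a sketch under assumptions: the command `./dlmtest` in the usage note is a placeholder for the attached test program, the round count and timeouts are arbitrary, and a copy that does not exit within the timeout after SIGINT is treated as hung.

```python
import signal
import subprocess
import time


def start(cmd):
    """Start one copy of the test program."""
    return subprocess.Popen(cmd)


def interrupt(proc, timeout=5.0):
    """Send SIGINT (the CTRL/C equivalent).

    Returns True if the process exits within `timeout` seconds,
    False if it is still running, i.e. it looks hung.
    """
    proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False


def cycle(cmd, rounds=10, run_for=15.0, timeout=5.0):
    """Keep two copies running and repeatedly interrupt and restart one.

    Returns the round in which a copy stopped responding to SIGINT,
    or None if every round completed and both copies exited cleanly.
    """
    procs = [start(cmd)]
    time.sleep(run_for)          # let the first copy print a status line or two
    procs.append(start(cmd))     # the second copy usually triggers the NL messages
    for n in range(1, rounds + 1):
        time.sleep(run_for)
        victim = n % 2           # alternate between the two copies
        if not interrupt(procs[victim], timeout):
            return n             # leave the processes in place for inspection
        procs[victim] = start(cmd)
    for p in procs:
        interrupt(p, timeout)
    return None
```

Calling `cycle(["./dlmtest"])` on node #2 and checking dmesg when it returns a round number would match the manual steps. A process stuck inside the kernel cannot be cleaned up by the script, so a reboot may still be needed afterwards.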