[Linux-cluster] gfs deadlock situation

Wed Feb 14 15:59:57 UTC 2007

> node1:
> Resource 0000010001218088 (parent 0000000000000000). Name (len=24) "       2
> 1100e7"
> Local Copy, Master is node 2
> Granted Queue
> Conversion Queue
> Waiting Queue
> 5eb00178 PR (EX) Master:     3eeb0117  LQ: 0,0x5

> node2:
> Resource 00000107e462c8c8 (parent 0000000000000000). Name (len=24) "       2
> 1100e7"
> Master Copy
> Granted Queue
> 3eeb0117 PR Remote:   1 5eb00178
> Conversion Queue
> Waiting Queue

The state of the lock on node1 looks bad.  I'm studying the code and
struggling to understand how it could possibly arrive in that state.

Some things to notice:
- the lock is converting (granted PR, requesting EX), so it should be on
  the Conversion Queue, not the Waiting Queue
- lockqueue_state is 0 (the first value in "LQ: 0,0x5"), so either node1
  has not sent a remote request to node2 at all, or it did send one and
  has already received some kind of reply, so it is no longer waiting
- the state of the lock on node2 looks normal
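
If it helps to look for the same pattern in other dumps, the checks above
can be sketched as a small script.  This is only a rough sketch, not part of
the dlm tools: the helper names (parse_dump, oddities) are made up, and the
line layout it assumes (lock id, granted mode, optional requested mode in
parentheses, optional "LQ: state,flags") is inferred from the output quoted
above rather than from the dlm source.

```python
import re

# Assumed layout of a lock line, inferred from the dump quoted above:
#   <lkid> <granted mode> [(<requested mode>)] ... [LQ: <state>,<flags>]
# Treating the second LQ value as "flags" is an assumption.
LOCK_RE = re.compile(
    r"^(?P<lkid>[0-9a-f]+)\s+(?P<granted>[A-Z]{2})"
    r"(?:\s+\((?P<requested>[A-Z]{2})\))?"
    r"(?:.*?LQ:\s*(?P<lq_state>\d+),(?P<lq_flags>0x[0-9a-f]+))?"
)
QUEUES = ("Granted Queue", "Conversion Queue", "Waiting Queue")

NODE1_DUMP = """\
> Resource 0000010001218088 (parent 0000000000000000). Name (len=24) "       2
> 1100e7"
> Local Copy, Master is node 2
> Granted Queue
> Conversion Queue
> Waiting Queue
> 5eb00178 PR (EX) Master:     3eeb0117  LQ: 0,0x5
"""


def parse_dump(text):
    """Return (queue name, parsed fields) for every lock line in a dump."""
    queue, locks = None, []
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # drop email quoting and padding
        if line.startswith("Resource"):
            queue = None                 # a new resource starts here
        elif line in QUEUES:
            queue = line
        elif queue:
            m = LOCK_RE.match(line)
            if m:
                locks.append((queue, m.groupdict()))
    return locks


def oddities(locks):
    """Flag locks that match the two suspicious points in the list above."""
    found = []
    for queue, f in locks:
        # Assumption: a granted mode plus a requested mode means a conversion.
        if f["requested"] is None:
            continue
        if queue != "Conversion Queue":
            found.append("%s: converting %s->%s but on the %s"
                         % (f["lkid"], f["granted"], f["requested"], queue))
        if f["lq_state"] == "0":
            found.append("%s: converting but lockqueue_state is 0" % f["lkid"])
    return found


if __name__ == "__main__":
    for msg in oddities(parse_dump(NODE1_DUMP)):
        print(msg)
```

Run against the node1 dump it flags lock 5eb00178 on both counts; the node2
dump has no requested mode on its one lock, so nothing is flagged there.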

Did you check for suspicious syslog messages on both nodes?  Did any node
mount or unmount this fs, or fail, around the time this happened?  Has this
happened before?  If you'd like to try to reproduce this with some dlm
debugging I could send you a patch (although this is such an odd state I'm
not sure yet where I'd begin to add debugging.)

Dave