[Linux-cluster] fcntl locking lockup (dlm 1.0.7, GFS 6.1.15, kernel 2.6.9-67.EL)

Fri Jan 4 21:18:45 UTC 2008

I'm helping a colleague to collect information on an application lockup 
problem on a two-node DLM/GFS cluster, with GFS on a shared SCSI array.

I'd appreciate advice as to what information to collect next.

Packages in use are:

kernel-smp-2.6.9-67.EL.i686.rpm
dlm-1.0.7-1.i686.rpm
dlm-kernel-smp-2.6.9-52.2.i686.rpm
GFS-kernel-smp-2.6.9-75.9.i686.rpm
GFS-6.1.15-1.i386.rpm
ccs-1.0.11-1.i686.rpm
cman-1.0.17-0.i686.rpm
cman-kernel-smp-2.6.9-53.5.i686.rpm

We've reduced the application code to a simple test case. When run on 
each node, the following code soon blocks, and the process does not 
respond to signals until the peer node is shut down:

...
    fl.l_whence=SEEK_SET;
    fl.l_start=0;
    fl.l_len=1;

    while (1)
    {
      fl.l_type=F_WRLCK;
      retval=fcntl(filedes,F_SETLKW,&fl);
      if (retval==-1)
      {
        perror("lock");
        exit(1);
      }
      // attempt to unlock the index file
      fl.l_type=F_UNLCK;
      retval=fcntl(filedes,F_SETLKW,&fl);
      if (retval==-1)
      {
        perror("unlock");
        exit(1);
      }
    }
...
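
For anyone who wants to reproduce this, a self-contained version of the 
loop might look like the sketch below. The open() call, the command-line 
argument, the iteration limit and the names lock_unlock_loop and 
LOCKTEST_MAIN are assumptions added for illustration; they are not taken 
from the application, which elides those parts.

```c
/* Sketch of the reduced test case as a complete program.
 * Assumptions (not from the original application): the file is opened
 * with O_RDWR|O_CREAT, and the path comes from the command line. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Take and release a write lock on byte 0 of `path`, `iterations` times
 * (forever if iterations < 0).
 * Returns the number of completed cycles, or -1 on error. */
long lock_unlock_loop(const char *path, long iterations)
{
    struct flock fl;
    long done = 0;
    int filedes = open(path, O_RDWR | O_CREAT, 0644);

    if (filedes == -1) {
        perror("open");
        return -1;
    }

    memset(&fl, 0, sizeof fl);
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    while (iterations < 0 || done < iterations) {
        /* blocking request for an exclusive lock on the first byte */
        fl.l_type = F_WRLCK;
        if (fcntl(filedes, F_SETLKW, &fl) == -1) {
            perror("lock");
            close(filedes);
            return -1;
        }
        /* release the same byte range */
        fl.l_type = F_UNLCK;
        if (fcntl(filedes, F_SETLKW, &fl) == -1) {
            perror("unlock");
            close(filedes);
            return -1;
        }
        done++;
    }

    close(filedes);
    return done;
}

#ifdef LOCKTEST_MAIN
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file on the GFS mount>\n", argv[0]);
        return 2;
    }
    /* loop until an error occurs (or, as in our case, until it hangs) */
    return lock_unlock_loop(argv[1], -1) == -1 ? 1 : 0;
}
#endif
```

Built with -DLOCKTEST_MAIN and started on both nodes against the same 
file on the GFS mount, this should behave like the excerpt above. On a 
local filesystem the loop just runs without blocking.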

/proc/cluster/dlm_debug on the respective nodes showed this on the most 
recent run:

Node 1:

 2
FS1 send einval to 2
FS1 send einval to 2
[above line many times]
FS1 send einval to 2
FS1 send einval to 2
FS1 grant lock on lockqueue 2
FS1 process_lockqueue_reply id 5400c2 state 0

Node 2:

FS1 (31613) req reply einval 3de002b1 fr 1 r 1        7
FS1 (31613) req reply einval 3ea30356 fr 1 r 1        7
FS1 (31613) req reply einval 3f0100d5 fr 1 r 1        7
FS1 (31613) req reply einval 3df10367 fr 1 r 1        7
FS1 (31613) req reply einval 3fa600be fr 1 r 1        7
FS1 (31613) req reply einval 3f430355 fr 1 r 1        7
FS1 (31613) req reply einval 3fd20096 fr 1 r 1        7
FS1 (31613) req reply einval 3fc900d3 fr 1 r 1        7
FS1 (31613) req reply einval 3fe60375 fr 1 r 1        7
FS1 (31613) req reply einval 3f870143 fr 1 r 1        7
FS1 (31613) req reply einval 3f690239 fr 1 r 1        7
FS1 (31613) req reply einval 3eb40379 fr 1 r 1        7
FS1 (31613) req reply einval 3fb00352 fr 1 r 1        7
FS1 (31613) req reply einval 40a002f6 fr 1 r 1        7
FS1 (31613) req reply einval 3fb90265 fr 1 r 1        7
FS1 (31613) req reply einval 400b0326 fr 1 r 1        7

I have lockdump files from each node, but don't know how to interpret 
them.

On shutdown, the GFS unmount failed and a kernel panic followed:

Turning off quotas:                                        [  OK  ]
Unmounting file systems:  umount2: Device or resource busy
umount: /diskarray: device is busy
umount2: Device or resource busy
umount: /diskarray: device is busy
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -22
CMAN: we are leaving the cluster.
WARNING: dlm_emergency_shutdown
SM: 00000002 sm_stop: SG still joined
SM: 01000004 sm_stop: SG still joined
SM: 02000006 sm_stop: SG still joined
ds: 007b   es: 007b   ss: 0068
Process gfs_glockd (pid: 5654, threadinfo=f40d2000 task=f3c4b230)
Stack: f8ade2d3 f8bb8000 00000003 f2c4ee80 f8ad98b2 f8c28ede 00000001 f33c0c7c
       f33c0c60 f8c1ed63 f8c55da0 d4aa4940 f33c0c60 f8c55da0 f33c0c60 f8c1e257
       f33c0c60 00000001 f33c0cf4 f8c1e30e f33c0c60 f33c0c7c f8c1e431 00000001
Call Trace:
 [<f8ad98b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
 [<f8c28ede>] gfs_lm_unlock+0x2c/0x42 [gfs]
 [<f8c1ed63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
 [<f8c1e257>] rq_demote+0x7f/0x98 [gfs]
 [<f8c1e30e>] run_queue+0x5a/0xc1 [gfs]
 [<f8c1e431>] unlock_on_glock+0x1f/0x28 [gfs]
 [<f8c203e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
 [<f8c12e05>] gfs_glockd+0x39/0xde [gfs]
 [<c011e7b9>] default_wake_function+0x0/0xc
 [<c02d8522>] ret_from_fork+0x6/0x14
 [<c011e7b9>] default_wake_function+0x0/0xc
 [<f8c12dcc>] gfs_glockd+0x0/0xde [gfs]
 [<c01041f5>] kernel_thread_helper+0x5/0xb
Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 ef e3 ad f8 e8 de 92 64 c7 83 c4 34 68 d3 e2 ad f8 e8 d1 92 64 c7 <0f> 0b 69 01 1b e2 ad f8 68 d5 e2 ad f8 e8 8c 8a 64 c7 5b 5e 5f
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception

---
Charlie