[Linux-cluster] fcntl locking lockup (dlm 1.07, GFS 6.1.5, kernel 2.6.9-67.EL)
Charlie Brady
charlieb-linux-cluster at e-smith.com
Fri Jan 4 21:18:45 UTC 2008
I'm helping a colleague to collect information on an application lockup
problem on a two-node DLM/GFS cluster, with GFS on a shared SCSI array.
I'd appreciate advice as to what information to collect next.
Packages in use are:
kernel-smp-2.6.9-67.EL.i686.rpm
dlm-1.0.7-1.i686.rpm
dlm-kernel-smp-2.6.9-52.2.i686.rpm
GFS-kernel-smp-2.6.9-75.9.i686.rpm
GFS-6.1.15-1.i386.rpm
ccs-1.0.11-1.i686.rpm
cman-1.0.17-0.i686.rpm
cman-kernel-smp-2.6.9-53.5.i686.rpm
We've reduced the application code to a simple test case. The following
code, run on each node, soon blocks, and the process does not respond to
signals until the peer node is shut down:
...
fl.l_whence = SEEK_SET;
fl.l_start = 0;
fl.l_len = 1;

while (1)
{
    fl.l_type = F_WRLCK;
    retval = fcntl(filedes, F_SETLKW, &fl);
    if (retval == -1)
    {
        perror("lock");
        exit(1);
    }

    // attempt to unlock the index file
    fl.l_type = F_UNLCK;
    retval = fcntl(filedes, F_SETLKW, &fl);
    if (retval == -1)
    {
        perror("unlock");
        exit(1);
    }
}
...
/proc/cluster/dlm_debug on the respective nodes showed this on the most
recent run:
Node 1:
2
FS1 send einval to 2
FS1 send einval to 2
[above line many times]
FS1 send einval to 2
FS1 send einval to 2
FS1 grant lock on lockqueue 2
FS1 process_lockqueue_reply id 5400c2 state 0
Node 2:
FS1 (31613) req reply einval 3de002b1 fr 1 r 1 7
FS1 (31613) req reply einval 3ea30356 fr 1 r 1 7
FS1 (31613) req reply einval 3f0100d5 fr 1 r 1 7
FS1 (31613) req reply einval 3df10367 fr 1 r 1 7
FS1 (31613) req reply einval 3fa600be fr 1 r 1 7
FS1 (31613) req reply einval 3f430355 fr 1 r 1 7
FS1 (31613) req reply einval 3fd20096 fr 1 r 1 7
FS1 (31613) req reply einval 3fc900d3 fr 1 r 1 7
FS1 (31613) req reply einval 3fe60375 fr 1 r 1 7
FS1 (31613) req reply einval 3f870143 fr 1 r 1 7
FS1 (31613) req reply einval 3f690239 fr 1 r 1 7
FS1 (31613) req reply einval 3eb40379 fr 1 r 1 7
FS1 (31613) req reply einval 3fb00352 fr 1 r 1 7
FS1 (31613) req reply einval 40a002f6 fr 1 r 1 7
FS1 (31613) req reply einval 3fb90265 fr 1 r 1 7
FS1 (31613) req reply einval 400b0326 fr 1 r 1 7
I have lockdump files from each node, but don't know how to interpret
them.
On shutdown, GFS unmount failed, and a kernel panic followed:
Turning off quotas: [ OK ]
Unmounting file systems: umount2: Device or resource busy
umount: /diskarray: device is busy
umount2: Device or resource busy
umount: /diskarray: device is busy
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -22
CMAN: we are leaving the cluster.
WARNING: dlm_emergency_shutdown
SM: 00000002 sm_stop: SG still joined
SM: 01000004 sm_stop: SG still joined
SM: 02000006 sm_stop: SG still joined
ds: 007b es: 007b ss: 0068
Process gfs_glockd (pid: 5654, threadinfo=f40d2000 task=f3c4b230)
Stack: f8ade2d3 f8bb8000 00000003 f2c4ee80 f8ad98b2 f8c28ede 00000001 f33c0c7c
       f33c0c60 f8c1ed63 f8c55da0 d4aa4940 f33c0c60 f8c55da0 f33c0c60 f8c1e257
       f33c0c60 00000001 f33c0cf4 f8c1e30e f33c0c60 f33c0c7c f8c1e431 00000001
Call Trace:
[<f8ad98b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
[<f8c28ede>] gfs_lm_unlock+0x2c/0x42 [gfs]
[<f8c1ed63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
[<f8c1e257>] rq_demote+0x7f/0x98 [gfs]
[<f8c1e30e>] run_queue+0x5a/0xc1 [gfs]
[<f8c1e431>] unlock_on_glock+0x1f/0x28 [gfs]
[<f8c203e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
[<f8c12e05>] gfs_glockd+0x39/0xde [gfs]
[<c011e7b9>] default_wake_function+0x0/0xc
[<c02d8522>] ret_from_fork+0x6/0x14
[<c011e7b9>] default_wake_function+0x0/0xc
[<f8c12dcc>] gfs_glockd+0x0/0xde [gfs]
[<c01041f5>] kernel_thread_helper+0x5/0xb
Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 ef e3 ad f8
e8 de 92 64 c7 83 c4 34 68 d3 e2 ad f8 e8 d1 92 64 c7 <0f> 0b 69 01 1b e2 ad f8
68 d5 e2 ad f8 e8 8c 8a 64 c7 5b 5e 5f
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
---
Charlie