[Linux-cluster] lock_dlm kernel panics

Paul Tader ptader at fnal.gov
Fri Mar 17 21:38:59 UTC 2006


We're experiencing random kernel panics that all seem to be attributable 
to the lock_dlm module.


(panic text from 3 different systems):

/var/log/messages.1:Mar  9 10:01:45 node1 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]

/var/log/messages.1:Mar  6 16:33:41 node1 kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]

/var/log/messages.1:Mar  7 22:28:53 node1 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]

/var/log/messages.3:Feb 22 13:35:07 node2 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]

/var/log/messages.4:Feb 18 12:17:04 node2 kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]

/var/log/messages.3:Feb 23 04:46:01 node3 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]



On average, nodes stay up for about a week.  The workload is steady and 
mostly disk I/O.  These nodes previously ran RHES3 with GFS 6.0; with 
that setup we experienced much more frequent panics, even when the 
nodes weren't being used.

My thought is that this is a hardware problem: the disk array, fibre 
switch, or HBA?  But I'm posting this message in the hope that there is 
some additional GFS tuning or diagnostic step I can perform that will 
either lead me to a hardware problem or to a GFS configuration change.

Software:
- RHES4
- GFS-6.1.2-0
- GFS-kernel-2.6.9-49.1
- One 1 TB GFS partition

Hardware:
- 5 nodes total
- Dual 2.66 GHz Xeon CPUs
- 2 GB RAM
- 1 Gb Ethernet (eth0)
- QLogic QLA2200 HBA


Latest complete panic message:
---------------------------

Mar 17 11:38:02 nodename kernel:
Mar 17 11:38:02 nodename kernel: d0 purged 0 requests
Mar 17 11:38:02 nodename kernel: d0 mark waiting requests
Mar 17 11:38:02 nodename kernel: d0 marked 0 requests
Mar 17 11:38:02 nodename kernel: d0 recover event 17 done
Mar 17 11:38:02 nodename kernel: d0 move flags 0,0,1 ids 14,17,17
Mar 17 11:38:02 nodename kernel: d0 process held requests
Mar 17 11:38:02 nodename kernel: d0 processed 0 requests
Mar 17 11:38:02 nodename kernel: d0 resend marked requests
Mar 17 11:38:02 nodename kernel: d0 resent 0 requests
Mar 17 11:38:02 nodename kernel: d0 recover event 17 finished
Mar 17 11:38:02 nodename kernel: d0 send einval to 5
Mar 17 11:38:02 nodename kernel: d0 send einval to 5
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval 2da2006d fr 2 r 2
        5  9
Mar 17 11:38:02 nodename kernel: d0 send einval to 5
Mar 17 11:38:02 nodename kernel: d0 send einval to 3
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval 410803b0 fr 5 r 5
        5  a
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval 456f03d1 fr 2 r 2
        5  1
Mar 17 11:38:02 nodename kernel: d0 send einval to 5
Mar 17 11:38:02 nodename kernel: d0 send einval to 5
Mar 17 11:38:02 nodename kernel: d0 send einval to 3
Mar 17 11:38:02 nodename kernel: d0 send einval to 3
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval aca103f2 fr 5 r 5
        5  2
Mar 17 11:38:02 nodename kernel: d0 grant lock on lockqueue 3
Mar 17 11:38:02 nodename kernel: d0 process_lockqueue_reply id bbfe0396 state 0
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval d2d20215 fr 2 r 2
        5  9
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval d5a60059 fr 5 r 5
        5  d
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval d886008f fr 3 r 3
        5  e
Mar 17 11:38:02 nodename kernel: d0 (1983) req reply einval 3130220 fr 2 r 2
        5 c3
Mar 17 11:38:02 nodename kernel: d0 unlock fe20017a no id
Mar 17 11:38:02 nodename kernel: 1976 pr_start last_stop 0 last_start 4 last_finish 0
Mar 17 11:38:02 nodename kernel: 1976 pr_start count 4 type 2 event 4 flags 250
Mar 17 11:38:02 nodename kernel: 1976 claim_jid 2
Mar 17 11:38:02 nodename kernel: 1976 pr_start 4 done 1
Mar 17 11:38:02 nodename kernel: 1976 pr_finish flags 5a
Mar 17 11:38:02 nodename kernel: 1968 recovery_done jid 2 msg 309 a
Mar 17 11:38:02 nodename kernel: 1968 recovery_done nodeid 4 flg 18
Mar 17 11:38:02 nodename kernel: 1976 pr_start last_stop 4 last_start 8 last_finish 4
Mar 17 11:38:02 nodename kernel: 1976 pr_start count 5 type 2 event 8 flags 21a
Mar 17 11:38:02 nodename kernel: 1976 pr_start 8 done 1
Mar 17 11:38:02 nodename kernel: 1976 pr_finish flags 1a
Mar 17 11:38:02 nodename kernel: 1976 rereq 3,624b610 id 7f1d022e 5,0
Mar 17 11:38:02 nodename kernel: 1976 pr_start last_stop 8 last_start 9 last_finish 8
Mar 17 11:38:02 nodename kernel: 1976 pr_start count 4 type 1 event 9 flags 21a
Mar 17 11:38:02 nodename kernel: 1976 pr_start cb jid 0 id 2
Mar 17 11:38:02 nodename kernel: 1976 pr_start 9 done 0
Mar 17 11:38:02 nodename kernel: 1980 recovery_done jid 0 msg 308 11a
Mar 17 11:38:02 nodename kernel: 1980 recovery_done nodeid 2 flg 1b
Mar 17 11:38:02 nodename kernel: 1980 recovery_done start_done 9
Mar 17 11:38:02 nodename kernel: 1976 rereq 3,263e6dd id 7e2d01b9 3,0
Mar 17 11:38:02 nodename kernel: 1977 pr_finish flags 1a
Mar 17 11:38:02 nodename kernel: 1976 pr_start last_stop 9 last_start 13 last_finish 9
Mar 17 11:38:02 nodename kernel: 1976 pr_start count 5 type 2 event 13 flags 21a
Mar 17 11:38:02 nodename kernel: 1976 pr_start 13 done 1
Mar 17 11:38:02 nodename kernel: 1976 pr_finish flags 1a
Mar 17 11:38:02 nodename kernel: 1976 pr_start last_stop 13 last_start 14 last_finish 13
Mar 17 11:38:02 nodename kernel: 1976 pr_start count 4 type 1 event 14 flags 21a
Mar 17 11:38:02 nodename kernel: 1976 pr_start cb jid 4 id 5
Mar 17 11:38:02 nodename kernel: 1976 pr_start 14 done 0
Mar 17 11:38:02 nodename kernel: 1980 recovery_done jid 4 msg 308 11a
Mar 17 11:38:02 nodename kernel: 1980 recovery_done nodeid 5 flg 1b
Mar 17 11:38:02 nodename kernel: 1980 recovery_done start_done 14
Mar 17 11:38:02 nodename kernel: 1977 pr_finish flags 1a
Mar 17 11:38:02 nodename kernel: 1976 pr_start last_stop 14 last_start 18 last_finish 14
Mar 17 11:38:02 nodename kernel: 1976 pr_start count 5 type 2 event 18 flags 21a
Mar 17 11:38:02 nodename kernel: 1976 pr_start 18 done 1
Mar 17 11:38:02 nodename kernel: 1976 pr_finish flags 1a
Mar 17 11:38:02 nodename kernel:
Mar 17 11:38:02 nodename kernel: lock_dlm:  Assertion failed on line 357 of file /mnt/src/4/BUILD/gfs-kernel-2.6.9-45/smp/src/dlm/lock.c
Mar 17 11:38:02 nodename kernel: lock_dlm:  assertion:  "!error"
Mar 17 11:38:02 nodename kernel: lock_dlm:  time = 783572508
Mar 17 11:38:03 nodename kernel: d0: error=-22 num=3,a458688 lkf=9 flags=84
Mar 17 11:38:03 nodename kernel:
Mar 17 11:38:03 nodename kernel: ------------[ cut here ]------------
Mar 17 11:38:03 nodename kernel: kernel BUG at /mnt/src/4/BUILD/gfs-kernel-2.6.9-45/smp/src/dlm/lock.c:357!
Mar 17 11:38:03 nodename kernel: invalid operand: 0000 [#1]
Mar 17 11:38:03 nodename kernel: SMP
Mar 17 11:38:03 nodename kernel: Modules linked in: parport_pc lp parport autofs4 lock_dlm(U) gfs(U) lock_harness(U) nfs lockd dlm(U) cman(U) md5 ipv6 sunrpc dm_mirror button battery ac uhci_hcd ehci_hcd e100 mii e1000 floppy ext3 jbd dm_mod qla2200 qla2xxx scsi_transport_fc sd_mod scsi_mod
Mar 17 11:38:03 nodename kernel: CPU:    1
Mar 17 11:38:03 nodename kernel: EIP:    0060:[<f8bbb5f3>]    Not tainted VLI
Mar 17 11:38:03 nodename kernel: EFLAGS: 00010246   (2.6.9-22.0.2.ELsmp)
Mar 17 11:38:03 nodename kernel: EIP is at do_dlm_unlock+0x8b/0xa0 [lock_dlm]
Mar 17 11:38:03 nodename kernel: eax: 00000001   ebx: f518d380   ecx: f5857f2c   edx: f8bc0155
Mar 17 11:38:03 nodename kernel: esi: ffffffea   edi: f518d380   ebp: f8c3f000   esp: f5857f28
Mar 17 11:38:03 nodename kernel: ds: 007b   es: 007b   ss: 0068
Mar 17 11:38:03 nodename kernel: Process gfs_glockd (pid: 1979, threadinfo=f5857000 task=f5b588b0)
Mar 17 11:38:03 nodename kernel: Stack: f8bc0155 f8c3f000 00000003 f8bbb893 f8d19612 00000001 f514c268 f514c24c
Mar 17 11:38:03 nodename kernel:        f8d0f89e f8d44440 f4bf0cc0 f514c24c f8d44440 f514c24c f8d0ed97 f514c24c
Mar 17 11:38:03 nodename kernel:        00000001 f514c2e0 f8d0ee4e f514c24c f514c268 f8d0ef71 00000001 f514c268
Mar 17 11:38:03 nodename kernel: Call Trace:
Mar 17 11:38:03 nodename kernel:  [<f8bbb893>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
Mar 17 11:38:03 nodename kernel:  [<f8d19612>] gfs_lm_unlock+0x2c/0x42 [gfs]
Mar 17 11:38:03 nodename kernel:  [<f8d0f89e>] gfs_glock_drop_th+0xf3/0x12d [gfs]
Mar 17 11:38:03 nodename kernel:  [<f8d0ed97>] rq_demote+0x7f/0x98 [gfs]
Mar 17 11:38:03 nodename kernel:  [<f8d0ee4e>] run_queue+0x5a/0xc1 [gfs]
Mar 17 11:38:03 nodename kernel:  [<f8d0ef71>] unlock_on_glock+0x1f/0x28 [gfs]
Mar 17 11:38:03 nodename kernel:  [<f8d10ed0>] gfs_reclaim_glock+0xc3/0x13c [gfs]
Mar 17 11:38:03 nodename kernel:  [<f8d03e01>] gfs_glockd+0x39/0xde [gfs]
Mar 17 11:38:03 nodename kernel:  [<c011e481>] default_wake_function+0x0/0xc
Mar 17 11:38:03 nodename kernel:  [<c02d13b2>] ret_from_fork+0x6/0x14
Mar 17 11:38:03 nodename kernel:  [<c011e481>] default_wake_function+0x0/0xc
Mar 17 11:38:03 nodename kernel:  [<f8d03dc8>] gfs_glockd+0x0/0xde [gfs]
Mar 17 11:38:03 nodename kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Mar 17 11:38:03 nodename kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 ff 70 18 68 4d 02 bc f8 e8 84 6c 56 c7 83 c4 34 68 55 01 bc f8 e8 77 6c 56 c7 <0f> 0b 65 01 a2 00 bc f8 68 57 01 bc f8 e8 32 64 56 c7 5b 5e c3
Mar 17 11:38:03 nodename kernel:  <0>Fatal exception: panic in 5 seconds
Mar 17 13:08:01 nodename syslogd 1.4.1: restart.
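
For what it's worth, error=-22 is -EINVAL: the in-kernel DLM rejected 
the unlock request itself, and lock_dlm asserts on any nonzero return 
from dlm_unlock() (the do_dlm_lock panics above are presumably the 
same pattern around dlm_lock()).  Below is a minimal sketch of that 
failing pattern, reconstructed from the log output rather than from 
the actual gfs-kernel source, so the struct fields and names are 
assumptions:

static void do_dlm_unlock(struct gdlm_lock *lp)
{
	int error;

	/* Hand the unlock off to the in-kernel DLM. */
	error = dlm_unlock(lp->ls->dlm_lockspace, lp->lksb.sb_lkid,
			   lp->lkf, NULL, lp);

	if (error) {
		/* Mirrors "d0: error=-22 num=3,a458688 lkf=9 flags=84". */
		printk("lock_dlm: error=%d num=%x,%llx lkf=%x flags=%x\n",
		       error, lp->lockname.ln_type,
		       (unsigned long long)lp->lockname.ln_number,
		       lp->lkf, lp->flags);
		BUG();	/* "assertion: !error" -> the panic above */
	}
}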


Thanks,
Paul


-- 
===========================================================================
Paul Tader  <ptader at fnal.gov>  Computing Div/CSS Dept
Fermi National Accelerator Lab; PO Box 500 Batavia, IL 60510-0500



