[Linux-cluster] Kernel Crashes on all nodes when one dies

isplist at logicore.net isplist at logicore.net
Mon Apr 23 16:12:47 UTC 2007


Someone accidentally misconfigured iptables on a node so that it could no longer 
communicate with the cluster. That should have been the end of the problem, 
one node down, but instead every node died with a kernel crash. 

Here is a paste from one of the logs; I think this is the section that 
shows the nodes dying:
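
In case it helps anyone else, these are roughly the rules we should have had in 
place to keep cluster traffic open. The port numbers are the ones listed in the 
Red Hat Cluster Suite docs, not verified against our own config, so treat this 
as a sketch:

```shell
# Sketch: open the Cluster Suite ports before any restrictive rules.
# Port numbers are assumptions taken from the RHEL4 Cluster Suite
# documentation; verify them against your own cluster configuration.
iptables -A INPUT -p udp --dport 6809        -j ACCEPT   # cman (cluster manager)
iptables -A INPUT -p tcp --dport 21064       -j ACCEPT   # dlm (lock manager)
iptables -A INPUT -p tcp --dport 50006:50009 -j ACCEPT   # ccsd (cluster config)
iptables -A INPUT -p udp --dport 50007       -j ACCEPT   # ccsd broadcast
iptables -A INPUT -p tcp --dport 41966:41969 -j ACCEPT   # rgmanager

# Persist the rules so a reboot does not silently reintroduce the block.
service iptables save
```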

Mike


Apr 22 11:55:26 qm250 kernel: qm move flags 0,1,0 ids 0,3,0
Apr 22 11:55:26 qm250 kernel: qm move use event 3
Apr 22 11:55:26 qm250 kernel: qm recover event 3 (first)
Apr 22 11:55:26 qm250 kernel: qm add nodes
Apr 22 11:55:26 qm250 kernel: qm total nodes 2
Apr 22 11:55:26 qm250 kernel: qm rebuild resource directory
Apr 22 11:55:26 qm250 kernel: qm rebuilt 8 resources
Apr 22 11:55:26 qm250 kernel: qm recover event 3 done
Apr 22 11:55:26 qm250 kernel: qm move flags 0,0,1 ids 0,3,3
Apr 22 11:55:26 qm250 kernel: qm process held requests
Apr 22 11:55:26 qm250 kernel: qm processed 0 requests
Apr 22 11:55:26 qm250 kernel: qm recover event 3 finished
Apr 22 11:55:26 qm250 kernel: clvmd move flags 1,0,0 ids 2,2,2
Apr 22 11:55:26 qm250 kernel: qm move flags 1,0,0 ids 3,3,3
Apr 22 11:55:26 qm250 kernel: 2640 pr_start last_stop 0 last_start 4 
last_finish 0
Apr 22 11:55:26 qm250 kernel: 2640 pr_start count 2 type 2 event 4 flags 250
Apr 22 11:55:26 qm250 kernel: 2640 claim_jid 1
Apr 22 11:55:26 qm250 kernel: 2640 pr_start 4 done 1
Apr 22 11:55:26 qm250 kernel: 2640 pr_finish flags 5a
Apr 22 11:55:27 qm250 kernel: 2566 recovery_done jid 1 msg 309 a
Apr 22 11:55:27 qm250 kernel: 2566 recovery_done nodeid 250 flg 18
Apr 22 11:55:27 qm250 kernel:
Apr 22 11:55:27 qm250 kernel: lock_dlm:  Assertion failed on line 357 of file 
/home/buildcentos/rpmbuild/BUILD/gf
s-kernel-2.6.9-60/up/src/dlm/lock.c
Apr 22 11:55:27 qm250 kernel: lock_dlm:  assertion:  "!error"
Apr 22 11:55:27 qm250 kernel: lock_dlm:  time = 14525882
Apr 22 11:55:27 qm250 kernel: qm: error=-22 num=2,1a lkf=10000 flags=84
Apr 22 11:55:27 qm250 kernel:
Apr 22 11:55:27 qm250 kernel: ------------[ cut here ]------------
Apr 22 11:55:27 qm250 kernel: kernel BUG at 
/home/buildcentos/rpmbuild/BUILD/gfs-kernel-2.6.9-60/up/src/dlm/lock.
c:357!
Apr 22 11:55:27 qm250 kernel: invalid operand: 0000 [#1]
Apr 22 11:55:27 qm250 kernel: Modules linked in: lock_dlm(U) gfs(U) 
lock_harness(U) parport_pc lp parport autofs4
 dlm(U) cman(U) md5 ipv6 sunrpc dm_mirror dm_mod uhci_hcd e100 mii floppy ext3 
jbd qla2200 qla2xxx scsi_transport
_fc sd_mod scsi_mod
Apr 22 11:55:27 qm250 kernel: CPU:    0
Apr 22 11:55:27 qm250 kernel: EIP:    0060:[<e09aacfe>]    Not tainted VLI
Apr 22 11:55:27 qm250 kernel: EFLAGS: 00010246   (2.6.9-42.0.3.EL)
Apr 22 11:55:27 qm250 kernel: EIP is at do_dlm_unlock+0x89/0x9e [lock_dlm]
Apr 22 11:55:27 qm250 kernel: eax: 00000001   ebx: dfd552e0   ecx: e09b089f   
edx: dafe9f44
Apr 22 11:55:27 qm250 kernel: esi: ffffffea   edi: dfd552e0   ebp: e0a62000   
esp: dafe9f40
Apr 22 11:55:27 qm250 kernel: ds: 007b   es: 007b   ss: 0068
Apr 22 11:55:27 qm250 kernel: Process gfs_glockd (pid: 2647, 
threadinfo=dafe9000 task=de442c50)
Apr 22 11:55:27 qm250 kernel: Stack: e09b089f e0a62000 00000003 e09aafff 
e0ae3e51 dfd7d4ac e0a62000 e0b156c0
Apr 22 11:55:27 qm250 kernel:        e0ad6bd4 dfd7d4ac e0b156c0 dafe9fb4 
e0ad5683 dfd7d4ac 00000001 e0ad5840
Apr 22 11:55:27 qm250 kernel:        dfd7d4ac dfd7d4ac e0ad5af9 dfd7d550 
e0ad9182 dafe9000 dafe9fc0 e0ac8e9a
Apr 22 11:55:27 qm250 kernel: Call Trace:
Apr 22 11:55:27 qm250 kernel:  [<e09aafff>] lm_dlm_unlock+0x13/0x1b [lock_dlm]
Apr 22 11:55:27 qm250 kernel:  [<e0ae3e51>] gfs_lm_unlock+0x2b/0x40 [gfs]
Apr 22 11:55:27 qm250 kernel:  [<e0ad6bd4>] gfs_glock_drop_th+0x17a/0x1b0 
[gfs]
Apr 22 11:55:27 qm250 kernel:  [<e0ad5683>] rq_demote+0x15c/0x1da [gfs]
Apr 22 11:55:27 qm250 kernel:  [<e0ad5840>] run_queue+0x5a/0xc1 [gfs]
Apr 22 11:55:27 qm250 kernel:  [<e0ad5af9>] unlock_on_glock+0x6e/0xc8 [gfs]
Apr 22 11:55:27 qm250 kernel:  [<e0ad9182>] gfs_reclaim_glock+0x257/0x2ae 
[gfs]
Apr 22 11:55:27 qm250 kernel:  [<e0ac8e9a>] gfs_glockd+0x38/0xde [gfs]
Apr 22 11:55:27 qm250 kernel:  [<c0120049>] default_wake_function+0x0/0xc
Apr 22 11:55:27 qm250 kernel:  [<c0318d7e>] ret_from_fork+0x6/0x14
Apr 22 11:55:27 qm250 kernel:  [<c0120049>] default_wake_function+0x0/0xc
Apr 22 11:55:28 qm250 kernel:  [<e0ac8e62>] gfs_glockd+0x0/0xde [gfs]
Apr 22 11:55:28 qm250 kernel:  [<c01041dd>] kernel_thread_helper+0x5/0xb
Apr 22 11:55:28 qm250 kernel: Code: 73 34 8b 03 ff 73 2c ff 73 08 ff 73 04 ff 
73 0c 56 ff 70 18 68 ac 09 9b e0 e8
 10 9c 77 df 83 c4 34 68 9f 08 9b e0 e8 03 9c 77 df <0f> 0b 65 01 2e 07 9b e0 
68 a1 08 9b e0 e8 5b 90 77 df 5b 5e
 c3
Apr 22 11:55:28 qm250 kernel:  <0>Fatal exception: panic in 5 seconds