[Linux-cluster] GFS crash

Ivan Pantovic ivanp at yu.net
Wed Jul 5 23:14:03 UTC 2006


Hi,

we have a similar setup, apart from the NFS part: 9 nodes on stable GFS 
1.02 and kernel 2.6.16. The cluster is highly unstable whenever we have 
to reboot individual nodes or they fence each other.

Below is what the nodes complained about when they tried to fence node3. 
What stable kernel version do others here use?
The kernel is not compiled with any preemption support; as was mentioned 
before on this list, preemption is untested with this code, so we didn't 
bother.
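
For reference, the preemption-related lines in our .config look roughly 
like this (the option names are from the 2.6.16 Kconfig; the states 
shown are how we build, i.e. the no-preemption model):

    CONFIG_PREEMPT_NONE=y
    # CONFIG_PREEMPT_VOLUNTARY is not set
    # CONFIG_PREEMPT is not set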

> CMAN: node node7 has been removed from the cluster : Inconsistent 
> cluster view
> CMAN: node node8 has been removed from the cluster : Inconsistent 
> cluster view
> CMAN: node node6 has been removed from the cluster : Inconsistent 
> cluster view
> CMAN: removing node node4 from the cluster : No response to messages
> CMAN: node node2 has been removed from the cluster : Inconsistent 
> cluster view
> CMAN: node node9 has been removed from the cluster : Inconsistent 
> cluster view
> CMAN: removing node node1 from the cluster : No response to messages
> ------------[ cut here ]------------
> kernel BUG at 
> /var/tmp/portage/cman-kernel-1.02.00/work/cluster-1.02.00/cman-kernel/src/membership.c:3151! 
>
> invalid opcode: 0000 [#1]
> SMP
> Modules linked in: iptable_filter ipt_REDIRECT xt_tcpudp iptable_nat 
> ip_nat ip_conntrack ip_tables x_tables bcm5700 lock_dlm dlm cman gfs 
> lock_harness qla2300 qla2xxx_conf qla2xxx firmware_class
> CPU:    0
> EIP:    0060:[<f88de101>]    Not tainted VLI
> EFLAGS: 00010246   (2.6.16-gentoo-r1 #8)
> EIP is at elect_master+0x2a/0x41 [cman]
> eax: 00000080   ebx: 00000080   ecx: f888a000   edx: 00000000
> esi: f88f1084   edi: f5cfdfcc   ebp: f5cfdfb8   esp: f5cfdf70
> ds: 007b   es: 007b   ss: 0068
> Process cman_memb (pid: 7279, threadinfo=f5cfc000 task=f7c48580)
> Stack: <0>f5dcb640 f88db725 f5cfdf8c 00000000 f88e91ac f5dcb640 
> f88d9896 f5ad0d40
>        00000000 f7c48580 f88d9c78 f58dc080 00000001 00000000 f5cfc000 
> 0000001f
>        00000000 c0102b3e 00000000 f7c48580 c0118702 00100100 00200200 
> 00000000
> Call Trace:
>  [<f88db725>] a_node_just_died+0x172/0x1cf [cman]
>  [<f88d9896>] process_dead_nodes+0x74/0x80 [cman]
>  [<f88d9c78>] membership_kthread+0x3d6/0x40e [cman]
>  [<c0102b3e>] ret_from_fork+0x6/0x14
>  [<c0118702>] default_wake_function+0x0/0x12
>  [<f88d98a2>] membership_kthread+0x0/0x40e [cman]
>  [<c0101149>] kernel_thread_helper+0x5/0xb
> Code: c3 53 b8 01 00 00 00 8b 1d 44 1e 8f f8 39 d8 7d 1a 8b 0d 48 1e 
> 8f f8 8b 14 81 85 d2 74 06 83 7a 1c 02 74 13 83 c0 01 39 d8 7c ec <0f> 
> 0b 4f 0c 60 6f 8e f8 31 c0 5b c3 8b 44 24 08 89 10 8b 42 14 
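
For what it's worth, the oops mechanics decode cleanly: the <0f> 0b 
marked in the Code: dump is the x86 ud2 instruction that BUG() emits, 
which is why the trap shows up as "invalid opcode: 0000" and the kernel 
prints the file:line from membership.c. A minimal sketch of the pattern 
(illustrative only, not the real cman source; check_quorum is a made-up 
name):

    #include <linux/kernel.h>    /* pulls in BUG()/BUG_ON() on 2.6 */

    static void check_quorum(int votes, int expected)
    {
            /* BUG_ON() compiles to ud2 (0f 0b) plus a file/line
             * record; executing it raises the invalid-opcode trap
             * and the "kernel BUG at <file>:<line>!" report seen
             * in the oops above. */
            BUG_ON(votes > expected);
    }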


Bas van der Vlies wrote:
> We are using kernel 2.6.16 and CVS STABLE code 1.0.2. We have a 5-node 
> GFS cluster that exports the GFS filesystems over NFS to our cluster. 
> This is the error log; the crash was in gfs_glockd:
> ------------------------------------------
> lisa_vg5_lv2 send einval to 3
> lisa_vg5_lv1 send einval to 4
> [the line above repeats 32 times in all]
> lisa_vg5_lv1 unlock febd02eb no id
> 7367 pr_start cb jid 2 id 3
> 7367 pr_start 121 done 0
> 7428 recovery_done jid 2 msg 308 191b
> 7428 recovery_done nodeid 3 flg 1b
> 7428 recovery_done start_done 121
> 7348 pr_start last_stop 95 last_start 121 last_finish 95
> 7348 pr_start count 4 type 1 event 121 flags a1b
> 7348 pr_start cb jid 2 id 3
> 7348 pr_start 121 done 0
> 7330 pr_start last_stop 87 last_start 121 last_finish 87
> 7330 pr_start count 4 type 1 event 121 flags a1b
> 7330 pr_start cb jid 2 id 3
> 7330 pr_start 121 done 0
> 7409 recovery_done jid 2 msg 308 191b
> 7409 recovery_done nodeid 3 flg 1b
> 7409 recovery_done start_done 121
> 7390 recovery_done jid 2 msg 308 91b
> 7390 recovery_done nodeid 3 flg 1b
> 7390 recovery_done start_done 121
> 7310 pr_start last_stop 75 last_start 121 last_finish 75
> 7310 pr_start count 4 type 1 event 121 flags a1b
> 7310 pr_start cb jid 2 id 3
> 7310 pr_start 121 done 0
> 7371 recovery_done jid 2 msg 308 91b
> 7371 recovery_done nodeid 3 flg 1b
> 7371 recovery_done start_done 121
> 7290 pr_start last_stop 56 last_start 121 last_finish 56
> 7290 pr_start count 4 type 1 event 121 flags a1b
> 7290 pr_start cb jid 2 id 3
> 7290 pr_start 121 done 0
> 7352 recovery_done jid 2 msg 308 91b
> 7352 recovery_done nodeid 3 flg 1b
> 7352 recovery_done start_done 121
> 7271 pr_start last_stop 40 last_start 121 last_finish 40
> 7271 pr_start count 4 type 1 event 121 flags a1b
> 7271 pr_start cb jid 2 id 3
> 7271 pr_start 121 done 0
> 7333 recovery_done jid 2 msg 308 91b
> 7333 recovery_done nodeid 3 flg 1b
> 7333 recovery_done start_done 121
> 7252 pr_start last_stop 24 last_start 121 last_finish 24
> 7252 pr_start count 4 type 1 event 121 flags 1a1b
> 7252 pr_start cb jid 2 id 3
> 7252 pr_start 121 done 0
> 7314 recovery_done jid 2 msg 308 91b
> 7314 recovery_done nodeid 3 flg 1b
> 7314 recovery_done start_done 121
> 7294 recovery_done jid 2 msg 308 91b
> 7294 recovery_done nodeid 3 flg 1b
> 7294 recovery_done start_done 121
> 7275 recovery_done jid 2 msg 308 91b
> 7275 recovery_done nodeid 3 flg 1b
> 7275 recovery_done start_done 121
> 7256 recovery_done jid 2 msg 308 191b
> 7256 recovery_done nodeid 3 flg 1b
> 7256 recovery_done start_done 121
> 7310 pr_finish flags 81b
> 7368 pr_finish flags 81b
> 7348 pr_finish flags 81b
> 7444 pr_finish flags 181b
> 7329 pr_finish flags 81b
> 7425 pr_finish flags 181b
> 7405 pr_finish flags 181b
> 7290 pr_finish flags 81b
> 7252 pr_finish flags 181b
> 7386 pr_finish flags 81b
> 7272 pr_finish flags 81b
> 7251 pr_start last_stop 121 last_start 125 last_finish 121
> 7251 pr_start count 5 type 2 event 125 flags 1a1b
> 7251 pr_start 125 done 1
> 7252 pr_finish flags 181b
> 7271 pr_start last_stop 121 last_start 127 last_finish 121
> 7271 pr_start count 5 type 2 event 127 flags a1b
> 7271 pr_start 127 done 1
> 7271 pr_finish flags 81b
> 7291 pr_start last_stop 121 last_start 129 last_finish 121
> 7291 pr_start count 5 type 2 event 129 flags a1b
> 7291 pr_start 129 done 1
> 7291 pr_finish flags 81b
> 7311 pr_start last_stop 121 last_start 131 last_finish 121
> 7311 pr_start count 5 type 2 event 131 flags a1b
> 7311 pr_start 131 done 1
> 7311 pr_finish flags 81b
> 7330 pr_start last_stop 121 last_start 133 last_finish 121
> 7330 pr_start count 5 type 2 event 133 flags a1b
> 7330 pr_start 133 done 1
> 7330 pr_finish flags 81b
> 7349 pr_start last_stop 121 last_start 135 last_finish 121
> 7349 pr_start count 5 type 2 event 135 flags a1b
> 7349 pr_start 135 done 1
> 7349 pr_finish flags 81b
> 7367 pr_start last_stop 121 last_start 137 last_finish 121
> 7367 pr_start count 5 type 2 event 137 flags a1b
> 7367 pr_start 137 done 1
> 7367 pr_finish flags 81b
> 7386 pr_start last_stop 121 last_start 139 last_finish 121
> 7386 pr_start count 5 type 2 event 139 flags a1b
> 7386 pr_start 139 done 1
> 7386 pr_finish flags 81b
> 7406 pr_start last_stop 121 last_start 141 last_finish 121
> 7406 pr_start count 5 type 2 event 141 flags 1a1b
> 7406 pr_start 141 done 1
> 7406 pr_finish flags 181b
> 7425 pr_start last_stop 121 last_start 143 last_finish 121
> 7425 pr_start count 5 type 2 event 143 flags 1a1b
> 7425 pr_start 143 done 1
> 7425 pr_finish flags 181b
> 7443 pr_start last_stop 121 last_start 145 last_finish 121
> 7443 pr_start count 5 type 2 event 145 flags 1a1b
> 7443 pr_start 145 done 1
> 7443 pr_finish flags 181b
>
> lock_dlm:  Assertion failed on line 357 of file 
> /usr/src/gfs/stable_1.0.2/stable/cluster/gfs-kernel/src/dlm/lock.c
> lock_dlm:  assertion:  "!error"
> lock_dlm:  time = 1486517232
> lisa_vg5_lv1: error=-22 num=3,990448c lkf=9 flags=84
>
> ------------[ cut here ]------------
> kernel BUG at 
> /usr/src/gfs/stable_1.0.2/stable/cluster/gfs-kernel/src/dlm/lock.c:357!
> invalid opcode: 0000 [#1]
> SMP
> Modules linked in: lock_dlm dlm cman dm_round_robin dm_multipath sg 
> ide_floppy ide_cd cdrom qla2xxx siimage piix e1000 gfs lock_harness 
> dm_mod
> CPU:    0
> EIP:    0060:[<f8aa5586>]    Tainted: GF     VLI
> EFLAGS: 00010246   (2.6.16-rc5-sara3 #1)
> EIP is at do_dlm_unlock+0x91/0xaa [lock_dlm]
> eax: 00000004   ebx: dbdff440   ecx: 00014e5f   edx: 00000246
> esi: ffffffea   edi: f8c0b000   ebp: f22bdee0   esp: f22bded4
> ds: 007b   es: 007b   ss: 0068
> Process gfs_glockd (pid: 7427, threadinfo=f22bc000 task=f209d030)
> Stack: <0>f8aa9d89 f8c0b000 dbdf7120 f22bdeec f8aa5824 dbdff440 
> f22bdf00 f899a7bc
>        dbdff440 00000003 dbdf7144 f22bdf24 f8990ca4 f8c0b000 dbdff440 
> 00000003
>        f89c4f00 dbde1200 dbdf7120 dbdf7120 f22bdf40 f899393a dbdf7120 
> dbde1200
> Call Trace:
>  [<c0103599>] show_stack_log_lvl+0xad/0xb5
>  [<c01036db>] show_registers+0x10d/0x176
>  [<c01038ad>] die+0xf2/0x16d
>  [<c0103996>] do_trap+0x6e/0x8a
>  [<c0103bed>] do_invalid_op+0x90/0x97
>  [<c010322f>] error_code+0x4f/0x54
>  [<f8aa5824>] lm_dlm_unlock+0x1d/0x24 [lock_dlm]
>  [<f899a7bc>] gfs_lm_unlock+0x2c/0x46 [gfs]
>  [<f8990ca4>] gfs_glock_drop_th+0xf0/0x12d [gfs]
>  [<f899393a>] rgrp_go_drop_th+0x1d/0x24 [gfs]
>  [<f89901f9>] rq_demote+0x79/0x95 [gfs]
>  [<f89902b4>] run_queue+0x56/0xbb [gfs]
>  [<f89903d6>] unlock_on_glock+0x1f/0x29 [gfs]
>  [<f899232a>] gfs_reclaim_glock+0xbf/0x138 [gfs]
>  [<f8986682>] gfs_glockd+0x3b/0xe3 [gfs]
>  [<c0100ed9>] kernel_thread_helper+0x5/0xb
> Code: 73 34 ff 73 2c ff 73 08 ff 73 04 ff 73 0c 56 8b 03 ff 70 18 68 
> a0 a6 aa f8 e8 80 19 67 c7 83 c4 34 68 89 9d aa f8 e8 73 19 67 c7 <0f> 
> 0b 65 01 c0 a4 aa f8 68 a0 a5 aa f8 e8 27 12 67 c7 8d 65 f8
>  <3>fh_update: test2/CHGCAR already up-to-date!
> fh_update: test2/CHGCAR already up-to-date!
> fh_update: test2/WAVECAR already up-to-date!
> fh_update: test2/WAVECAR already up-to-date!
>
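
Bas, your trace decodes the same way: error=-22 is -EINVAL coming back 
from dlm_unlock(), which matches the "send einval" flood at the top of 
your log; lock_dlm then fails its "!error" assertion and calls BUG(), 
hence the same <0f> 0b invalid-opcode oops. A hedged paraphrase of the 
failing pattern in do_dlm_unlock() (only dlm_unlock() and BUG() are 
real symbols here; the wrapper, its arguments, and the header path are 
illustrative):

    #include <linux/kernel.h>
    #include "dlm.h"    /* shipped by the out-of-tree cluster suite */

    static void unlock_or_die(dlm_lockspace_t *ls, uint32_t lkid,
                              uint32_t lkf, struct dlm_lksb *lksb,
                              void *astarg)
    {
            int error = dlm_unlock(ls, lkid, lkf, lksb, astarg);
            if (error) {
                    /* matches "lisa_vg5_lv1: error=-22 ... lkf=9" */
                    printk("lock_dlm: dlm_unlock error=%d lkf=%x\n",
                           error, lkf);
                    BUG();  /* "Assertion failed on line 357 ...
                             * assertion: "!error"" */
            }
    }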


-- 
Ivan Pantovic, System Engineer
-----
YUnet International  http://www.eunet.yu
Dubrovacka 35/III,   11000 Belgrade
Tel: +381 11 311 9901;  Fax: +381 11 311 9901; Mob: +381 63 302 288
-----
This  e-mail  is confidential and intended only for the recipient.
Unauthorized  distribution,  modification  or  disclosure  of  its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone  +381 11 311 9901.
-----



