[Linux-cluster] weird happenings on my cluster and another panic.

jason at monsterjam.org jason at monsterjam.org
Thu Oct 26 00:56:01 UTC 2006


ok, I was just logging into the two nodes of my cluster, tf1 and tf2, when I noticed that tf1 was NOT 
reachable via ssh, but tf2 was. tf1 was pingable, but that was it. I looked on tf2 and 
noticed that it had taken over the cluster virtual IP address:

2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:11:43:d7:c9:c6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.6/24 brd 192.168.1.255 scope global eth0
    inet 192.168.1.7/32 scope global eth0
    inet6 fe80::211:43ff:fed7:c9c6/64 scope link 
       valid_lft forever preferred_lft forever
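
(For reference, that looks like the output of "ip addr show eth0" on tf2; the 192.168.1.7/32 secondary address is the cluster service IP that tf2 picked up. Assuming eth0 is the cluster-facing interface on both boxes, a quick way to see which node currently owns it is something like:)

    ip -4 addr show dev eth0 | grep 192.168.1.7    # only the node holding the service IP prints a line
    clustat                                        # rgmanager's view of the members and who owns "Apache Service"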

and in the syslog on tf2, I saw
Oct 25 20:26:00 tf2 kernel: CMAN: removing node tf1 from the cluster : Missed too many heartbeats
Oct 25 20:26:00 tf2 fenced[4091]: tf1 not a cluster member after 0 sec post_fail_delay
Oct 25 20:26:00 tf2 fenced[4091]: fencing node "tf1"
Oct 25 20:26:04 tf2 kernel: e100: eth2: e100_watchdog: link down
Oct 25 20:26:08 tf2 fenced[4091]: fence "tf1" success
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Trying to acquire journal lock...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Looking at journal...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Acquiring the transaction lock...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Replaying journal...
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Replayed 0 of 11 blocks
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: replays = 0, skips = 0, sames = 11
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Journal replayed in 1s
Oct 25 20:26:15 tf2 kernel: GFS: fsid=progressive:lv1.1: jid=0: Done
Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> Magma Event: Membership Change 
Oct 25 20:26:27 tf2 clurgmgrd[4903]: <info> State change: tf1 DOWN 
Oct 25 20:26:27 tf2 clurgmgrd[4903]: <notice> Starting stopped service Apache Service 
Oct 25 20:26:29 tf2 httpd: httpd startup succeeded
Oct 25 20:26:29 tf2 clurgmgrd[4903]: <notice> Service Apache Service started 
Oct 25 20:26:36 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
Oct 25 20:28:08 tf2 kernel: e100: eth2: e100_watchdog: link down
Oct 25 20:28:10 tf2 kernel: e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
Oct 25 20:29:40 tf2 kernel: CMAN: node tf1 rejoining
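
(A side note on the fencing above: the "0 sec post_fail_delay" should come from the fence_daemon line in /etc/cluster/cluster.conf. I believe the entry looks roughly like the fragment below (example values only, not what we actually run), so raising post_fail_delay would give a flapping heartbeat link a few seconds to recover before the node gets fenced:)

    <!-- fragment of /etc/cluster/cluster.conf, example values only -->
    <fence_daemon post_fail_delay="20" post_join_delay="3"/>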

So I noticed that after a few more minutes, tf1 *appeared* to be rebooting,
and I saw this in the syslog of tf2:

Oct 25 20:34:25 tf2 kernel: CMAN: too many transition restarts - will die
Oct 25 20:34:25 tf2 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
Oct 25 20:34:25 tf2 kernel: WARNING: dlm_emergency_shutdown
Oct 25 20:34:25 tf2 clurgmgrd[4903]: <warning> #67: Shutting down uncleanly 
Oct 25 20:34:25 tf2 kernel: WARNING: dlm_emergency_shutdown
Oct 25 20:34:25 tf2 kernel: SM: 00000001 sm_stop: SG still joined
Oct 25 20:34:25 tf2 kernel: SM: 01000002 sm_stop: SG still joined
Oct 25 20:34:25 tf2 kernel: SM: 02000004 sm_stop: SG still joined
Oct 25 20:34:25 tf2 kernel: SM: 03000005 sm_stop: SG still joined
Oct 25 20:34:25 tf2 ccsd[3988]: Cluster manager shutdown.  Attemping to reconnect... 
Oct 25 20:34:26 tf2 httpd: httpd shutdown succeeded
Oct 25 20:34:26 tf2 kernel: parted nodes
Oct 25 20:34:26 tf2 kernel: clvmd rebuilt 0 resources
Oct 25 20:34:26 tf2 kernel: clvmd purge requests
Oct 25 20:34:26 tf2 kernel: clvmd purged 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd mark waiting requests
Oct 25 20:34:26 tf2 kernel: clvmd marked 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd purge locks of departed nodes
Oct 25 20:34:26 tf2 kernel: lv1 purged 1 locks
Oct 25 20:34:26 tf2 kernel: lv1 update remastered resources
Oct 25 20:34:26 tf2 kernel: clvmd purged 0 locks
Oct 25 20:34:26 tf2 kernel: clvmd update remastered resources
Oct 25 20:34:26 tf2 kernel: clvmd updated 1 resources
Oct 25 20:34:26 tf2 kernel: clvmd rebuild locks
Oct 25 20:34:26 tf2 kernel: clvmd rebuilt 0 locks
Oct 25 20:34:26 tf2 kernel: clvmd recover event 7 done
Oct 25 20:34:26 tf2 kernel: Magma move flags 0,0,1 ids 6,7,7
Oct 25 20:34:26 tf2 kernel: Magma process held requests
Oct 25 20:34:26 tf2 kernel: Magma processed 0 requests
Oct 25 20:34:26 tf2 kernel: Magma resend marked requests
Oct 25 20:34:26 tf2 kernel: Magma resend 6403d9 lq 1 flg 200000 node -1/-1 "usrm::vf"
Oct 25 20:34:26 tf2 kernel: Magma resent 1 requests
Oct 25 20:34:26 tf2 kernel: Magma recover event 7 finished
Oct 25 20:34:26 tf2 kernel: clvmd move flags 0,0,1 ids 2,7,7
Oct 25 20:34:26 tf2 kernel: clvmd process held requests
Oct 25 20:34:26 tf2 kernel: clvmd processed 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd resend marked requests
Oct 25 20:34:26 tf2 kernel: clvmd resent 0 requests
Oct 25 20:34:26 tf2 kernel: clvmd recover event 7 finished
Oct 25 20:34:26 tf2 kernel: lv1 updated 525 resources
Oct 25 20:34:26 tf2 kernel: lv1 rebuild locks
Oct 25 20:34:26 tf2 kernel: lv1 rebuilt 0 locks
Oct 25 20:34:26 tf2 kernel: lv1 recover event 7 done
Oct 25 20:34:26 tf2 kernel: lv1 move flags 0,0,1 ids 3,7,7
Oct 25 20:34:26 tf2 kernel: lv1 process held requests
Oct 25 20:34:26 tf2 kernel: lv1 processed 0 requests
Oct 25 20:34:26 tf2 kernel: lv1 resend marked requests
Oct 25 20:34:26 tf2 kernel: lv1 resent 0 requests
Oct 25 20:34:26 tf2 kernel: lv1 recover event 7 finished
Oct 25 20:34:26 tf2 kernel: 4189 pr_start last_stop 0 last_start 4 last_finish 0
Oct 25 20:34:26 tf2 kernel: 4189 pr_start count 2 type 2 event 4 flags 250
Oct 25 20:34:26 tf2 kernel: 4189 claim_jid 1
Oct 25 20:34:26 tf2 kernel: 4189 pr_start 4 done 1
Oct 25 20:34:26 tf2 kernel: 4189 pr_finish flags 5a
Oct 25 20:34:26 tf2 kernel: 4168 recovery_done jid 1 msg 309 a
Oct 25 20:34:26 tf2 kernel: 4168 recovery_done nodeid 2 flg 18
Oct 25 20:34:26 tf2 kernel: 4189 pr_start last_stop 4 last_start 7 last_finish 4
Oct 25 20:34:26 tf2 kernel: 4189 pr_start count 1 type 1 event 7 flags 21a
Oct 25 20:34:26 tf2 kernel: 4189 pr_start cb jid 0 id 1
Oct 25 20:34:26 tf2 kernel: 4189 pr_start 7 done 0
Oct 25 20:34:26 tf2 kernel: 4192 recovery_done jid 0 msg 309 11a
Oct 25 20:34:26 tf2 kernel: 4192 recovery_done nodeid 1 flg 1b
Oct 25 20:34:26 tf2 kernel: 4192 recovery_done start_done 7
Oct 25 20:34:26 tf2 kernel: 4189 pr_finish flags 1a
Oct 25 20:34:26 tf2 kernel: 
Oct 25 20:34:26 tf2 kernel: lock_dlm:  Assertion failed on line 428 of file /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
Oct 25 20:34:26 tf2 kernel: lock_dlm:  assertion:  "!error"
Oct 25 20:34:26 tf2 kernel: lock_dlm:  time = 623964971
Oct 25 20:34:26 tf2 kernel: lv1: num=2,1a err=-22 cur=-1 req=3 lkf=10000
Oct 25 20:34:26 tf2 kernel: 
Oct 25 20:34:26 tf2 kernel: ------------[ cut here ]------------
Oct 25 20:34:26 tf2 kernel: kernel BUG at /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c:428!
Oct 25 20:34:26 tf2 kernel: invalid operand: 0000 [#1]
Oct 25 20:34:26 tf2 kernel: SMP 
Oct 25 20:34:26 tf2 kernel: Modules linked in: dcdipm(U) dcdbas(U) parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd hw_random shpchp eepro100 e100 mii e1000 floppy sg ext3 jbd dm_mod aic7xxx megaraid_mbox megaraid_mm sd_mod scsi_mod
Oct 25 20:34:26 tf2 kernel: CPU:    2
Oct 25 20:34:26 tf2 kernel: EIP:    0060:[<f8acc779>]    Tainted: P      VLI
Oct 25 20:34:26 tf2 kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp) 
Oct 25 20:34:26 tf2 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]
Oct 25 20:34:26 tf2 kernel: eax: 00000001   ebx: ffffffea   ecx: f1be9d50   edx: f8ad115f
Oct 25 20:34:26 tf2 kernel: esi: f8acc798   edi: f7e7da00   ebp: c2355b00   esp: f1be9d4c
Oct 25 20:34:26 tf2 kernel: ds: 007b   es: 007b   ss: 0068
Oct 25 20:34:26 tf2 kernel: Process umount (pid: 13456, threadinfo=f1be9000 task=f66c7230)
Oct 25 20:34:26 tf2 kernel: Stack: f8ad115f 20202020 32202020 20202020 20202020 20202020 61312020 f1f40018 
Oct 25 20:34:26 tf2 kernel:        f1f422b8 c2355b00 00000003 00000000 c2355b00 f8acc828 00000003 f8ad4860 
Oct 25 20:34:26 tf2 kernel:        f8b20000 f8bf45b2 00000008 00000001 f4fbc5c4 f4fbc5a8 f8b20000 f8bea5cd 
Oct 25 20:34:26 tf2 kernel: Call Trace:
Oct 25 20:34:26 tf2 kernel:  [<f8acc828>] lm_dlm_lock+0x49/0x52 [lock_dlm]
Oct 25 20:34:26 tf2 kernel:  [<f8bf45b2>] gfs_lm_lock+0x35/0x4d [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8bea5cd>] gfs_glock_xmote_th+0x130/0x172 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8be9c91>] rq_promote+0xc8/0x147 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8be9e7d>] run_queue+0x91/0xc1 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8beae88>] gfs_glock_nq+0xcf/0x116 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8beb40f>] gfs_glock_nq_init+0x13/0x26 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8c02e64>] gfs_permission+0x0/0x61 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8c02e9e>] gfs_permission+0x3a/0x61 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<f8c02e64>] gfs_permission+0x0/0x61 [gfs]
Oct 25 20:34:26 tf2 kernel:  [<c0165870>] permission+0x2b/0x4f
Oct 25 20:34:26 tf2 kernel:  [<c0165dbf>] __link_path_walk+0x148/0xbb5
Oct 25 20:34:26 tf2 kernel:  [<c016686f>] link_path_walk+0x43/0xbe
Oct 25 20:34:26 tf2 kernel:  [<c0150309>] do_brk+0x1f2/0x22c
Oct 25 20:34:26 tf2 kernel:  [<c0166c04>] path_lookup+0x14b/0x17f
Oct 25 20:34:26 tf2 kernel:  [<c0166d4c>] __user_walk+0x21/0x51
Oct 25 20:34:26 tf2 kernel:  [<c0162460>] sys_readlink+0x20/0x82
Oct 25 20:34:26 tf2 kernel:  [<c0150309>] do_brk+0x1f2/0x22c
Oct 25 20:34:26 tf2 kernel:  [<c011ad21>] do_page_fault+0x0/0x5c6
Oct 25 20:34:26 tf2 kernel:  [<c02d2657>] syscall_call+0x7/0xb
Oct 25 20:34:26 tf2 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 8a 12 ad f8 e8 ce 5e 65 c7 83 c4 38 68 5f 11 ad f8 e8 c1 5e 65 c7 <0f> 0b ac 01 a7 10 ad f8 68 61 11 ad f8 e8 7c 56 65 c7 83 c4 20 
Oct 25 20:34:26 tf2 kernel:  <0>Fatal exception: panic in 5 seconds

and now tf2 is unreachable too. Ideas? Suggestions?
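
fwiw, if I'm reading the lock_dlm assertion right, the err=-22 is dlm_lock() handing -EINVAL back to a umount that was still trying to take a glock after the "WARNING: dlm_emergency_shutdown" above had already torn the lockspaces down. 22 is EINVAL:

    python -c 'import errno, os; print errno.errorcode[22], "=", os.strerror(22)'    # EINVAL = Invalid argument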
 

Jason



