[Linux-cluster] [2nd try: servers crashing while not doing much.]

Jason jason at monsterjam.org
Tue Jun 20 22:11:05 UTC 2006


hey folks,

I have 2 nodes running GFS 6.1.5
[root@tf1 ~]# rpm -qa | grep -i gfs
GFS-6.1.5-0
GFS-kernheaders-2.6.9-49.1
GFS-kernel-smp-2.6.9-49.1
[root@tf1 ~]# rpm -qa | grep -i ccs
ccs-devel-1.0.3-0
ccs-1.0.3-0
[root@tf1 ~]# 
[root@tf1 ~]# uname -a
Linux tf1.localdomain 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST 2006 i686 i686 i386 GNU/Linux
[root@tf1 ~]# 


and last week both of them went down on us unexpectedly:
one had panicked and the other was powered off.

These systems are NOT in production yet. There was some data on the GFS partition, but I'm pretty
sure that there was not much activity when the boxes went down. Any help on what to do about this
would be appreciated.

Here is the log from the one that panicked.



Jun 10 03:59:07 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45030 seconds. 
Jun 10 03:59:37 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45060 seconds. 
Jun 10 04:00:07 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45090 seconds. 
Jun 10 04:00:37 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45120 seconds. 
Jun 10 04:01:01 tf1 crond(pam_unix)[15618]: session opened for user root by (uid=0)
Jun 10 04:01:01 tf1 crond(pam_unix)[15618]: session closed for user root
Jun 10 04:01:07 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45150 seconds. 
Jun 10 04:01:37 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45180 seconds. 
Jun 10 04:02:01 tf1 crond(pam_unix)[15620]: session opened for user root by (uid=0)
Jun 10 04:02:03 tf1 kernel: des 1
Jun 10 04:02:03 tf1 kernel: clvmd total nodes 1
Jun 10 04:02:03 tf1 kernel: lv1 rebuild resource directory
Jun 10 04:02:03 tf1 kernel: clvmd rebuild resource directory
Jun 10 04:02:03 tf1 kernel: clvmd rebuilt 0 resources
Jun 10 04:02:03 tf1 kernel: clvmd purge requests
Jun 10 04:02:03 tf1 kernel: clvmd purged 0 requests
Jun 10 04:02:03 tf1 kernel: clvmd mark waiting requests
Jun 10 04:02:03 tf1 kernel: clvmd marked 0 requests
Jun 10 04:02:03 tf1 kernel: clvmd purge locks of departed nodes
Jun 10 04:02:03 tf1 kernel: clvmd purged 0 locks
Jun 10 04:02:03 tf1 kernel: clvmd update remastered resources
Jun 10 04:02:03 tf1 kernel: clvmd updated 1 resources
Jun 10 04:02:03 tf1 kernel: clvmd rebuild locks
Jun 10 04:02:03 tf1 kernel: clvmd rebuilt 0 locks
Jun 10 04:02:03 tf1 kernel: clvmd recover event 7 done
Jun 10 04:02:03 tf1 kernel: clvmd move flags 0,0,1 ids 4,7,7
Jun 10 04:02:03 tf1 kernel: clvmd process held requests
Jun 10 04:02:03 tf1 kernel: clvmd processed 0 requests
Jun 10 04:02:03 tf1 kernel: clvmd resend marked requests
Jun 10 04:02:03 tf1 kernel: clvmd resent 0 requests
Jun 10 04:02:03 tf1 kernel: clvmd recover event 7 finished
Jun 10 04:02:03 tf1 kernel: lv1 rebuilt 518 resources
Jun 10 04:02:03 tf1 kernel: lv1 purge requests
Jun 10 04:02:03 tf1 kernel: lv1 purged 0 requests
Jun 10 04:02:03 tf1 kernel: lv1 mark waiting requests
Jun 10 04:02:03 tf1 kernel: lv1 marked 0 requests
Jun 10 04:02:03 tf1 kernel: lv1 purge locks of departed nodes
Jun 10 04:02:03 tf1 kernel: lv1 purged 530 locks
Jun 10 04:02:03 tf1 kernel: lv1 update remastered resources
Jun 10 04:02:03 tf1 kernel: lv1 updated 20609 resources
Jun 10 04:02:03 tf1 kernel: lv1 rebuild locks
Jun 10 04:02:03 tf1 kernel: lv1 rebuilt 0 locks
Jun 10 04:02:03 tf1 kernel: lv1 recover event 7 done
Jun 10 04:02:03 tf1 kernel: lv1 move flags 0,0,1 ids 5,7,7
Jun 10 04:02:03 tf1 kernel: lv1 process held requests
Jun 10 04:02:03 tf1 kernel: lv1 processed 0 requests
Jun 10 04:02:03 tf1 kernel: lv1 resend marked requests
Jun 10 04:02:03 tf1 kernel: lv1 resent 0 requests
Jun 10 04:02:03 tf1 kernel: lv1 recover event 7 finished
Jun 10 04:02:03 tf1 kernel: 6851 pr_start last_stop 0 last_start 6 last_finish 0
Jun 10 04:02:03 tf1 kernel: 6851 pr_start count 2 type 2 event 6 flags 250
Jun 10 04:02:03 tf1 kernel: 6851 claim_jid 1
Jun 10 04:02:03 tf1 kernel: 6851 pr_start 6 done 1
Jun 10 04:02:03 tf1 kernel: 6851 pr_finish flags 5a
Jun 10 04:02:03 tf1 kernel: 6840 recovery_done jid 1 msg 309 a
Jun 10 04:02:03 tf1 kernel: 6840 recovery_done nodeid 1 flg 18
Jun 10 04:02:03 tf1 kernel: 6851 pr_start last_stop 6 last_start 7 last_finish 6
Jun 10 04:02:03 tf1 kernel: 6851 pr_start count 1 type 1 event 7 flags 21a
Jun 10 04:02:03 tf1 kernel: 6851 pr_start cb jid 0 id 2
Jun 10 04:02:03 tf1 kernel: 6851 pr_start 7 done 0
Jun 10 04:02:03 tf1 kernel: 6854 recovery_done jid 0 msg 309 11a
Jun 10 04:02:03 tf1 kernel: 6854 recovery_done nodeid 2 flg 1b
Jun 10 04:02:03 tf1 kernel: 6854 recovery_done start_done 7
Jun 10 04:02:03 tf1 kernel: 6850 pr_finish flags 1a
Jun 10 04:02:03 tf1 kernel: 
Jun 10 04:02:03 tf1 kernel: 
Jun 10 04:02:03 tf1 kernel: lock_dlm:  Assertion failed on line 428 of file /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
Jun 10 04:02:03 tf1 kernel: lock_dlm:  assertion:  "!error"
Jun 10 04:02:03 tf1 kernel: lock_dlm:  time = 1252230568
Jun 10 04:02:03 tf1 kernel: lv1: num=3,11 err=-22 cur=-1 req=3 lkf=8
Jun 10 04:02:03 tf1 kernel: 
Jun 10 04:02:03 tf1 kernel: ------------[ cut here ]------------
Jun 10 04:02:03 tf1 kernel: kernel BUG at /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c:428!
Jun 10 04:02:03 tf1 kernel: invalid operand: 0000 [#1]
Jun 10 04:02:03 tf1 kernel: SMP 
Jun 10 04:02:03 tf1 kernel: Modules linked in: nls_utf8 vfat fat usb_storage lock_dlm(U) dcdipm(U) dcdbas(U) parport_pc lp parport autofs4 i2c_dev i2c_core gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd hw_random shpchp eepro100 e100 mii e1000 floppy sg ext3 jbd dm_mod aic7xxx megaraid_mbox megaraid_mm sd_mod scsi_mod
Jun 10 04:02:03 tf1 kernel: CPU:    3
Jun 10 04:02:03 tf1 kernel: EIP:    0060:[<f8bc7779>]    Tainted: P      VLI
Jun 10 04:02:03 tf1 kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp) 
Jun 10 04:02:03 tf1 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]
Jun 10 04:02:03 tf1 kernel: eax: 00000001   ebx: ffffffea   ecx: c585ace8   edx: f8bcc15f
Jun 10 04:02:03 tf1 kernel: esi: f8bc7798   edi: f77c8400   ebp: c2361600   esp: c585ace4
Jun 10 04:02:03 tf1 kernel: ds: 007b   es: 007b   ss: 0068
Jun 10 04:02:03 tf1 kernel: Process df (pid: 15930, threadinfo=c585a000 task=d94fa6b0)
Jun 10 04:02:03 tf1 kernel: Stack: f8bcc15f 20202020 33202020 20202020 20202020 20202020 31312020 00000018
Jun 10 04:02:03 tf1 kernel:        d2956694 c2361600 00000003 00000000 c2361600 f8bc7828 00000003 f8bcf860
Jun 10 04:02:03 tf1 kernel:        f8ba0000 f8bf45b2 00000000 00000001 f4fd2064 f4fd2048 f8ba0000 f8bea5cd
Jun 10 04:02:03 tf1 kernel: Call Trace:
Jun 10 04:02:03 tf1 kernel:  [<f8bc7828>] lm_dlm_lock+0x49/0x52 [lock_dlm]
Jun 10 04:02:03 tf1 kernel:  [<f8bf45b2>] gfs_lm_lock+0x35/0x4d [gfs]
Jun 10 04:02:03 tf1 kernel:  [<f8bea5cd>] gfs_glock_xmote_th+0x130/0x172 [gfs]
Jun 10 04:02:03 tf1 kernel:  [<f8be9c91>] rq_promote+0xc8/0x147 [gfs]
Jun 10 04:02:03 tf1 kernel:  [<f8be9e7d>] run_queue+0x91/0xc1 [gfs]
Jun 10 04:02:03 tf1 kernel:  [<f8beae88>] gfs_glock_nq+0xcf/0x116 [gfs]
Jun 10 04:02:03 tf1 kernel:  [<f8beb40f>] gfs_glock_nq_init+0x13/0x26 [gfs]
Jun 10 04:02:03 tf1 kernel:  [<f8c0b6d6>] stat_gfs_async+0x119/0x187 [gfs]
Jun 10 04:02:03 tf1 kernel:  [<f8c0b80b>] gfs_stat_gfs+0x27/0x4e [gfs]
Jun 10 04:02:03 tf1 kernel:  [<c01aa436>] superblock_has_perm+0x1f/0x23
Jun 10 04:02:03 tf1 kernel:  [<f8c0387e>] gfs_statfs+0x26/0xc7 [gfs]
Jun 10 04:02:03 tf1 kernel:  [<c0158675>] vfs_statfs+0x41/0x59
Jun 10 04:02:03 tf1 kernel:  [<c015876b>] vfs_statfs64+0xe/0x28
Jun 10 04:02:03 tf1 kernel:  [<c0166d75>] __user_walk+0x4a/0x51
Jun 10 04:02:03 tf1 kernel:  [<c0158876>] sys_statfs64+0x52/0xb2
Jun 10 04:02:03 tf1 kernel:  [<c014f598>] do_mmap_pgoff+0x568/0x666
Jun 10 04:02:03 tf1 kernel:  [<c010b693>] sys_mmap2+0x7e/0xaf
Jun 10 04:02:03 tf1 kernel:  [<c011ad21>] do_page_fault+0x0/0x5c6
Jun 10 04:02:03 tf1 kernel:  [<c02d2657>] syscall_call+0x7/0xb
Jun 10 04:02:03 tf1 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 8a c2 bc f8 e8 ce ae 55 c7 83 c4 38 68 5f c1 bc f8 e8 c1 ae 55 c7 <0f> 0b ac 01 a7 c0 bc f8 68 61 c1 bc f8 e8 7c a6 55 c7 83 c4 20
Jun 10 04:02:03 tf1 kernel:  <0>Fatal exception: panic in 5 seconds
Jun 10 04:02:07 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45210 seconds. 
Jun 16 10:48:47 tf1 syslogd 1.4.1: restart.


----- End forwarded message -----
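
For anyone skimming the log above, here is a rough Python sketch that pulls
out the two facts that seem most relevant: the error code on the lock_dlm
line, and roughly when ccsd started failing to reach the cluster. The helper
names (parse_lock_error, ccsd_retry_start) are made up for illustration, the
year 2006 is assumed from the date of this post (syslog does not record it),
and reading err=-22 as a negated Linux errno is an assumption based on the
"!error" assertion, not something confirmed from the lock_dlm source.

```python
import errno
import re
from datetime import datetime, timedelta

# Three lines copied from the log above (trailing whitespace trimmed).
LOG = """\
Jun 10 04:02:03 tf1 kernel: lock_dlm:  assertion:  "!error"
Jun 10 04:02:03 tf1 kernel: lv1: num=3,11 err=-22 cur=-1 req=3 lkf=8
Jun 10 04:02:07 tf1 ccsd[3939]: Unable to connect to cluster infrastructure after 45210 seconds.
"""


def parse_lock_error(log):
    """Return the key=value fields of the lock_dlm detail line as strings."""
    m = re.search(
        r"kernel: (\w+): (num=\S+ err=\S+ cur=\S+ req=\S+ lkf=\S+)", log
    )
    if m is None:
        return None
    fields = dict(pair.split("=", 1) for pair in m.group(2).split())
    fields["fs"] = m.group(1)  # name printed before the fields ("lv1")
    return fields


def ccsd_retry_start(log, year=2006):
    """Estimate when ccsd began retrying, from its last 'after N seconds' line.

    The year is an assumption: syslog timestamps do not include one.
    """
    matches = re.findall(
        r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ ccsd\[\d+\]: Unable to connect "
        r"to cluster infrastructure after (\d+) seconds",
        log,
        re.MULTILINE,
    )
    if not matches:
        return None
    stamp_text, seconds = matches[-1]
    stamp = datetime.strptime(f"{year} {stamp_text}", "%Y %b %d %H:%M:%S")
    return stamp - timedelta(seconds=int(seconds))


if __name__ == "__main__":
    fields = parse_lock_error(LOG)
    code = -int(fields["err"])
    # errno.errorcode maps a numeric errno to its symbolic name.
    print(fields["fs"], "err", fields["err"], "=", errno.errorcode.get(code, "?"))
    print("ccsd retrying since about", ccsd_retry_start(LOG))
```

On the excerpt above this reports EINVAL for err=-22 and a ccsd start time of
2006-06-09 15:28:37, i.e. ccsd had been unable to reach the cluster
infrastructure for about twelve and a half hours before the panic.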

-- 
================================================
|    Jason Welsh   jason at monsterjam.org        |
| http://monsterjam.org    DSS PGP: 0x5E30CC98 |
|    gpg key: http://monsterjam.org/gpg/       |
================================================



