[Linux-cluster] weird happenings on my cluster and another panic.

Fri Oct 27 01:03:15 UTC 2006

OK, here's the rest of the story, i.e. here are the messages from tf1 (the original master):
Oct 25 20:31:14 tf1 rpcidmapd: rpc.idmapd startup succeeded
Oct 25 20:31:14 tf1 kernel:   Vendor: DELL      Model: PERC 4/DC         Rev: 351X
Oct 25 20:31:14 tf1 kernel:   Type:   Processor                          ANSI SCSI revision: 02
Oct 25 20:31:14 tf1 kernel: scsi[1]: scanning scsi channel 1 [Phy 1] for non-raid devices
Oct 25 20:31:14 tf1 kernel:   Vendor: DELL      Model: PERC 4/DC         Rev: 351X
Oct 25 20:31:14 tf1 kernel:   Type:   Processor                          ANSI SCSI revision: 02
Oct 25 20:31:14 tf1 kernel:   Vendor: DELL      Model: PV22XS            Rev: E.17
Oct 25 20:31:14 tf1 kernel:   Type:   Processor                          ANSI SCSI revision: 03
Oct 25 20:31:14 tf1 kernel: scsi[1]: scanning scsi channel 2 [virtual] for logical drives
Oct 25 20:31:14 tf1 kernel:   Vendor: MegaRAID  Model: LD 0 RAID5  139G  Rev: 351X
Oct 25 20:31:14 tf1 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02
Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict
Oct 25 20:31:14 tf1 last message repeated 2 times
Oct 25 20:31:14 tf1 kernel: sdb: Unit Not Ready, error = 0x70018
Oct 25 20:31:14 tf1 kernel: SCSI device sdb: 286228480 512-byte hdwr sectors (146549 MB)
Oct 25 20:31:14 tf1 kernel: sdb: asking for cache data failed
Oct 25 20:31:14 tf1 kernel: sdb: assuming drive cache: write through
Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict
Oct 25 20:31:14 tf1 last message repeated 2 times
Oct 25 20:31:14 tf1 kernel: sdb: Unit Not Ready, error = 0x70018
Oct 25 20:31:14 tf1 kernel: SCSI device sdb: 286228480 512-byte hdwr sectors (146549 MB)
Oct 25 20:31:14 tf1 kernel: sdb: asking for cache data failed
Oct 25 20:31:14 tf1 kernel: sdb: assuming drive cache: write through
Oct 25 20:31:14 tf1 kernel:  sdb: sdb1
Oct 25 20:31:14 tf1 kernel: Attached scsi disk sdb at scsi1, channel 2, id 0, lun 0
Oct 25 20:31:14 tf1 kernel: Adaptec aacraid driver (1.1-5[2412])
Oct 25 20:31:14 tf1 kernel: device-mapper: 4.5.0-ioctl (2005-10-04) initialised: dm-devel at redhat.com
Oct 25 20:31:14 tf1 kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
Oct 25 20:31:14 tf1 kernel: EXT3-fs: write access will be enabled during recovery.

So sdb is the GFS volume, and my guess is that it is already locked by the other server at this point.
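
For anyone who wants to check the same thing on their own node, here is a rough sketch (not part of any cluster tooling) that tallies the "reservation conflict" kernel messages per SCSI address, including the syslog "last message repeated N times" follow-ups. The function name and the regular expressions are my own assumptions, based only on the line format shown in the excerpt above.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above, e.g.:
#   "Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict"
CONFLICT_RE = re.compile(
    r"kernel: (?P<host>scsi\d+) \((?P<addr>\d+,\d+,\d+)\) : reservation conflict"
)
# syslogd collapses duplicates into "last message repeated N times".
REPEAT_RE = re.compile(r"last message repeated (?P<n>\d+) times")


def count_reservation_conflicts(lines):
    """Return a Counter keyed by (scsi host, 'channel,id,lun')."""
    counts = Counter()
    last_key = None  # set only while the previous line was a conflict
    for line in lines:
        m = CONFLICT_RE.search(line)
        if m:
            last_key = (m.group("host"), m.group("addr"))
            counts[last_key] += 1
            continue
        r = REPEAT_RE.search(line)
        if r and last_key is not None:
            # The repeat line refers to the conflict directly before it.
            counts[last_key] += int(r.group("n"))
        last_key = None
    return counts


sample = [
    "Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict",
    "Oct 25 20:31:14 tf1 last message repeated 2 times",
    "Oct 25 20:31:14 tf1 kernel: sdb: Unit Not Ready, error = 0x70018",
    "Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict",
    "Oct 25 20:31:14 tf1 last message repeated 2 times",
]
conflicts = count_reservation_conflicts(sample)
for (host, addr), n in conflicts.items():
    print(f"{host} ({addr}): {n} reservation conflicts")
```

A non-zero count for the address where sdb attaches (scsi1, channel 2, id 0, lun 0 in the log above) would be consistent with the guess that another host holds a reservation on that volume, though it does not prove which host that is.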

Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_shpchprm:   Slot sun(0) at s:b:d:f=0x00:04:1f:00
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 last message repeated 4 times
Oct 25 20:31:14 tf1 ccsd[4128]: Starting ccsd 1.0.3: 
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 ccsd[4128]:  Built: May 22 2006 16:15:59 
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 ccsd[4128]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.VPR0 OSHP fails=0x5
Oct 25 20:31:14 tf1 last message repeated 7 times
Oct 25 20:31:14 tf1 kernel: shpchp: shpc_init : shpc_cap_offset == 0
Oct 25 20:31:14 tf1 last message repeated 8 times
Oct 25 20:31:14 tf1 kernel: shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
Oct 25 20:31:14 tf1 kernel: hw_random hardware driver 1.0.0 loaded

...
Oct 25 20:29:48 tf1 rc.sysinit: Setting clock : Wed Oct 25 20:29:48 EDT 2006 succeeded 
Oct 25 20:29:48 tf1 rc.sysinit: Loading default keymap succeeded 
Oct 25 20:29:48 tf1 rc.sysinit: Setting hostname tf1.localdomain:  succeeded 
Oct 25 20:29:53 tf1 fsck: /: clean, 36754/131616 files, 139153/263056 blocks 
Oct 25 20:29:53 tf1 rc.sysinit: Checking root filesystem succeeded 
Oct 25 20:29:53 tf1 rc.sysinit: Remounting root filesystem in read-write mode:  succeeded 
Oct 25 20:29:53 tf1 lvm.static:   Locking inactive: ignoring clustered volume group diskarray 
Oct 25 20:29:53 tf1 rc.sysinit: Setting up Logical Volume Management: failed 
Oct 25 20:29:53 tf1 fsck: fsck.gfs: invalid option -- a 
Oct 25 20:29:53 tf1 fsck: Please use '-h' for usage. 

Oct 25 20:31:17 tf1 ccsd[4128]: Remote copy of cluster.conf is from quorate node. 
Oct 25 20:31:17 tf1 ccsd[4128]:  Local version # : 22 
Oct 25 20:31:17 tf1 ccsd[4128]:  Remote version #: 22 
Oct 25 20:31:17 tf1 kernel: CMAN: Waiting to join or form a Linux-cluster
Oct 25 20:31:18 tf1 ccsd[4128]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 
Oct 25 20:31:18 tf1 ccsd[4128]: Initial status:: Inquorate 
Oct 25 20:31:19 tf1 kernel: CMAN: sending membership request
Oct 25 20:31:19 tf1 kernel: CMAN: got node tf2
Oct 25 20:33:17 tf1 cman: Timed-out waiting for cluster failed
Oct 25 20:33:17 tf1 lock_gulmd: no <gulm> section detected in /etc/cluster/cluster.conf succeeded
Oct 25 20:35:17 tf1 fenced: startup failed
Oct 25 20:36:13 tf1 kernel: CMAN: removing node tf2 from the cluster : No response to messages
Oct 25 20:36:13 tf1 kernel: ------------[ cut here ]------------
Oct 25 20:36:13 tf1 kernel: kernel BUG at /usr/src/redhat/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150!
Oct 25 20:36:13 tf1 kernel: invalid operand: 0000 [#1]
Oct 25 20:36:13 tf1 kernel: SMP 
Oct 25 20:36:13 tf1 kernel: Modules linked in: cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd hw_random shpchp eepro100 e100 mii e1000 floppy sg ext3 jbd dm_mod aic7xxx megaraid_mbox megaraid_mm sd_mod scsi_mod
Oct 25 20:36:13 tf1 kernel: CPU:    3
Oct 25 20:36:13 tf1 kernel: EIP:    0060:[<f896ae2a>]    Not tainted VLI
Oct 25 20:36:13 tf1 kernel: EFLAGS: 00010246   (2.6.9-34.ELsmp) 
Oct 25 20:36:13 tf1 kernel: EIP is at elect_master+0x2e/0x3a [cman]
Oct 25 20:36:13 tf1 kernel: eax: 00000000   ebx: f619dfa0   ecx: 00000080   edx: 00000080
Oct 25 20:36:13 tf1 kernel: esi: f897dfc4   edi: f619dfd8   ebp: 00000000   esp: f619df98
Oct 25 20:36:13 tf1 kernel: ds: 007b   es: 007b   ss: 0068
Oct 25 20:36:13 tf1 kernel: Process cman_memb (pid: 4192, threadinfo=f619d000 task=f757f1b0)
Oct 25 20:36:13 tf1 kernel: Stack: f897de88 f89688d1 c22b9800 f62bd540 f8966eb7 f757f1b0 f757f1b0 f896709a 
Oct 25 20:36:13 tf1 kernel:        0000001f 00000000 f6622bb0 00000000 f757f1b0 c011e71b 00100100 00200200 
Oct 25 20:36:13 tf1 kernel:        00000000 00000000 0000007b f8966ed8 00000000 00000000 c01041f5 00000000 
Oct 25 20:36:13 tf1 kernel: Call Trace:
Oct 25 20:36:13 tf1 kernel:  [<f89688d1>] a_node_just_died+0x13a/0x199 [cman]
Oct 25 20:36:13 tf1 kernel:  [<f8966eb7>] process_dead_nodes+0x4e/0x6f [cman]
Oct 25 20:36:13 tf1 kernel:  [<f896709a>] membership_kthread+0x1c2/0x39d [cman]
Oct 25 20:36:13 tf1 kernel:  [<c011e71b>] default_wake_function+0x0/0xc
Oct 25 20:36:13 tf1 kernel:  [<f8966ed8>] membership_kthread+0x0/0x39d [cman]
Oct 25 20:36:13 tf1 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Oct 25 20:36:13 tf1 kernel: Code: a8 ed 97 f8 89 c3 ba 01 00 00 00 39 ca 7d 1c a1 ac ed 97 f8 8b 04 90 85 c0 74 0d 83 78 1c 02 75 07 89 03 8b 40 14 eb 0d 42 eb e0 <0f> 0b 4e 0c 68 1d 97 f8 31 c0 5b c3 a1 ac ed 97 f8 e8 79 80 7e
Oct 25 20:36:13 tf1 kernel:  <0>Fatal exception: panic in 5 seconds
Oct 26 12:19:46 tf1 syslogd 1.4.1: restart.

So my question now: it appears that I have something misconfigured. tf1 should come up as secondary while tf2 is running as primary, right? Or should tf1 come up and take over as primary, with tf2 letting it?

Jason