[Linux-cluster] weird happenings on my cluster and another panic.
jason at monsterjam.org
Fri Oct 27 01:03:15 UTC 2006
OK, here's the rest of the story, i.e. here are the messages from tf1 (the original master):
Oct 25 20:31:14 tf1 rpcidmapd: rpc.idmapd startup succeeded
Oct 25 20:31:14 tf1 kernel: Vendor: DELL Model: PERC 4/DC Rev: 351X
Oct 25 20:31:14 tf1 kernel: Type: Processor ANSI SCSI revision: 02
Oct 25 20:31:14 tf1 kernel: scsi[1]: scanning scsi channel 1 [Phy 1] for non-raid devices
Oct 25 20:31:14 tf1 kernel: Vendor: DELL Model: PERC 4/DC Rev: 351X
Oct 25 20:31:14 tf1 kernel: Type: Processor ANSI SCSI revision: 02
Oct 25 20:31:14 tf1 kernel: Vendor: DELL Model: PV22XS Rev: E.17
Oct 25 20:31:14 tf1 kernel: Type: Processor ANSI SCSI revision: 03
Oct 25 20:31:14 tf1 kernel: scsi[1]: scanning scsi channel 2 [virtual] for logical drives
Oct 25 20:31:14 tf1 kernel: Vendor: MegaRAID Model: LD 0 RAID5 139G Rev: 351X
Oct 25 20:31:14 tf1 kernel: Type: Direct-Access ANSI SCSI revision: 02
Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict
Oct 25 20:31:14 tf1 last message repeated 2 times
Oct 25 20:31:14 tf1 kernel: sdb: Unit Not Ready, error = 0x70018
Oct 25 20:31:14 tf1 kernel: SCSI device sdb: 286228480 512-byte hdwr sectors (146549 MB)
Oct 25 20:31:14 tf1 kernel: sdb: asking for cache data failed
Oct 25 20:31:14 tf1 kernel: sdb: assuming drive cache: write through
Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict
Oct 25 20:31:14 tf1 last message repeated 2 times
Oct 25 20:31:14 tf1 kernel: sdb: Unit Not Ready, error = 0x70018
Oct 25 20:31:14 tf1 kernel: SCSI device sdb: 286228480 512-byte hdwr sectors (146549 MB)
Oct 25 20:31:14 tf1 kernel: sdb: asking for cache data failed
Oct 25 20:31:14 tf1 kernel: sdb: assuming drive cache: write through
Oct 25 20:31:14 tf1 kernel: sdb: sdb1
Oct 25 20:31:14 tf1 kernel: Attached scsi disk sdb at scsi1, channel 2, id 0, lun 0
Oct 25 20:31:14 tf1 kernel: Adaptec aacraid driver (1.1-5[2412])
Oct 25 20:31:14 tf1 kernel: device-mapper: 4.5.0-ioctl (2005-10-04) initialised: dm-devel at redhat.com
Oct 25 20:31:14 tf1 kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
Oct 25 20:31:14 tf1 kernel: EXT3-fs: write access will be enabled during recovery.
So sdb is the GFS volume, and my guess is that it is already locked by the other server at this point.
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_shpchprm: Slot sun(0) at s:b:d:f=0x00:04:1f:00
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 last message repeated 4 times
Oct 25 20:31:14 tf1 ccsd[4128]: Starting ccsd 1.0.3:
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 ccsd[4128]: Built: May 22 2006 16:15:59
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.PBLO OSHP fails=0x5
Oct 25 20:31:14 tf1 ccsd[4128]: Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Oct 25 20:31:14 tf1 kernel: shpchp: acpi_pciehprm:\_SB_.PCI0.VPR0 OSHP fails=0x5
Oct 25 20:31:14 tf1 last message repeated 7 times
Oct 25 20:31:14 tf1 kernel: shpchp: shpc_init : shpc_cap_offset == 0
Oct 25 20:31:14 tf1 last message repeated 8 times
Oct 25 20:31:14 tf1 kernel: shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
Oct 25 20:31:14 tf1 kernel: hw_random hardware driver 1.0.0 loaded
...
Oct 25 20:29:48 tf1 rc.sysinit: Setting clock : Wed Oct 25 20:29:48 EDT 2006 succeeded
Oct 25 20:29:48 tf1 rc.sysinit: Loading default keymap succeeded
Oct 25 20:29:48 tf1 rc.sysinit: Setting hostname tf1.localdomain: succeeded
Oct 25 20:29:53 tf1 fsck: /: clean, 36754/131616 files, 139153/263056 blocks
Oct 25 20:29:53 tf1 rc.sysinit: Checking root filesystem succeeded
Oct 25 20:29:53 tf1 rc.sysinit: Remounting root filesystem in read-write mode: succeeded
Oct 25 20:29:53 tf1 lvm.static: Locking inactive: ignoring clustered volume group diskarray
Oct 25 20:29:53 tf1 rc.sysinit: Setting up Logical Volume Management: failed
Oct 25 20:29:53 tf1 fsck: fsck.gfs: invalid option -- a
Oct 25 20:29:53 tf1 fsck: Please use '-h' for usage.
Oct 25 20:31:17 tf1 ccsd[4128]: Remote copy of cluster.conf is from quorate node.
Oct 25 20:31:17 tf1 ccsd[4128]: Local version # : 22
Oct 25 20:31:17 tf1 ccsd[4128]: Remote version #: 22
Oct 25 20:31:17 tf1 kernel: CMAN: Waiting to join or form a Linux-cluster
Oct 25 20:31:18 tf1 ccsd[4128]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5
Oct 25 20:31:18 tf1 ccsd[4128]: Initial status:: Inquorate
Oct 25 20:31:19 tf1 kernel: CMAN: sending membership request
Oct 25 20:31:19 tf1 kernel: CMAN: got node tf2
Oct 25 20:33:17 tf1 cman: Timed-out waiting for cluster failed
Oct 25 20:33:17 tf1 lock_gulmd: no <gulm> section detected in /etc/cluster/cluster.conf succeeded
Oct 25 20:35:17 tf1 fenced: startup failed
Oct 25 20:36:13 tf1 kernel: CMAN: removing node tf2 from the cluster : No response to messages
Oct 25 20:36:13 tf1 kernel: ------------[ cut here ]------------
Oct 25 20:36:13 tf1 kernel: kernel BUG at /usr/src/redhat/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150!
Oct 25 20:36:13 tf1 kernel: invalid operand: 0000 [#1]
Oct 25 20:36:13 tf1 kernel: SMP
Oct 25 20:36:13 tf1 kernel: Modules linked in: cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd hw_random shpchp eepro100 e100 mii e1000 floppy sg ext3 jbd dm_mod aic7xxx megaraid_mbox megaraid_mm sd_mod scsi_mod
Oct 25 20:36:13 tf1 kernel: CPU: 3
Oct 25 20:36:13 tf1 kernel: EIP: 0060:[<f896ae2a>] Not tainted VLI
Oct 25 20:36:13 tf1 kernel: EFLAGS: 00010246 (2.6.9-34.ELsmp)
Oct 25 20:36:13 tf1 kernel: EIP is at elect_master+0x2e/0x3a [cman]
Oct 25 20:36:13 tf1 kernel: eax: 00000000 ebx: f619dfa0 ecx: 00000080 edx: 00000080
Oct 25 20:36:13 tf1 kernel: esi: f897dfc4 edi: f619dfd8 ebp: 00000000 esp: f619df98
Oct 25 20:36:13 tf1 kernel: ds: 007b es: 007b ss: 0068
Oct 25 20:36:13 tf1 kernel: Process cman_memb (pid: 4192, threadinfo=f619d000 task=f757f1b0)
Oct 25 20:36:13 tf1 kernel: Stack: f897de88 f89688d1 c22b9800 f62bd540 f8966eb7 f757f1b0 f757f1b0 f896709a
Oct 25 20:36:13 tf1 kernel: 0000001f 00000000 f6622bb0 00000000 f757f1b0 c011e71b 00100100 00200200
Oct 25 20:36:13 tf1 kernel: 00000000 00000000 0000007b f8966ed8 00000000 00000000 c01041f5 00000000
Oct 25 20:36:13 tf1 kernel: Call Trace:
Oct 25 20:36:13 tf1 kernel: [<f89688d1>] a_node_just_died+0x13a/0x199 [cman]
Oct 25 20:36:13 tf1 kernel: [<f8966eb7>] process_dead_nodes+0x4e/0x6f [cman]
Oct 25 20:36:13 tf1 kernel: [<f896709a>] membership_kthread+0x1c2/0x39d [cman]
Oct 25 20:36:13 tf1 kernel: [<c011e71b>] default_wake_function+0x0/0xc
Oct 25 20:36:13 tf1 kernel: [<f8966ed8>] membership_kthread+0x0/0x39d [cman]
Oct 25 20:36:13 tf1 kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb
Oct 25 20:36:13 tf1 kernel: Code: a8 ed 97 f8 89 c3 ba 01 00 00 00 39 ca 7d 1c a1 ac ed 97 f8 8b 04 90 85 c0 74 0d 83 78 1c 02 75 07 89 03 8b 40 14 eb 0d 42 eb e0 <0f> 0b 4e 0c 68 1d 97 f8 31 c0 5b c3 a1 ac ed 97 f8 e8 79 80 7e
Oct 25 20:36:13 tf1 kernel: <0>Fatal exception: panic in 5 seconds
Oct 26 12:19:46 tf1 syslogd 1.4.1: restart.
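For anyone wanting to line up the sequence in a log like this one (reservation conflict, membership request, timeout, node removal, BUG), here is a rough Python sketch that pulls out just those lines. The pattern list and the key_events name are my own illustrative choices based on the messages quoted above, not part of any cluster tool, and the line format is assumed to be classic syslog ("Mon DD HH:MM:SS host message").

```python
import re

# Illustrative patterns only, taken from the messages quoted in this post.
PATTERNS = {
    "scsi_reservation": re.compile(r"reservation conflict"),
    "cman_membership": re.compile(
        r"CMAN: (sending membership request|got node \S+|removing node \S+)"
    ),
    "cman_timeout": re.compile(r"Timed-out waiting for cluster"),
    "fenced_failed": re.compile(r"fenced: startup failed"),
    "kernel_bug": re.compile(r"kernel BUG at "),
}

# Assumed classic syslog layout: "Oct 25 20:31:14 tf1 kernel: message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")


def key_events(lines):
    """Return (timestamp, host, label, message) for lines matching a pattern."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip wrapped fragments and non-syslog lines
        stamp, host, rest = m.groups()
        for label, pattern in PATTERNS.items():
            if pattern.search(rest):
                events.append((stamp, host, label, rest))
                break  # first matching label wins
    return events


if __name__ == "__main__":
    sample = [
        "Oct 25 20:31:14 tf1 kernel: scsi1 (2,0,0) : reservation conflict",
        "Oct 25 20:31:19 tf1 kernel: CMAN: sending membership request",
        "Oct 25 20:33:17 tf1 cman: Timed-out waiting for cluster failed",
        "Oct 25 20:35:17 tf1 fenced: startup failed",
    ]
    for stamp, host, label, rest in key_events(sample):
        print(stamp, host, label, rest, sep=" | ")
```

Events come back in file order rather than sorted by time, so a paste where the timestamps jump around (as in this one) stays exactly as it was logged.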
So my question now: it appears that I have something misconfigured. tf1 should come up as secondary while tf2 is running as primary, right? Or should tf1 come up and take over as primary, with tf2 letting it?
Jason