[Linux-cluster] CLVM/GFS will not mount or communicate with cluster

Barry Brimer lists at brimer.org
Tue Dec 5 14:18:27 UTC 2006



On Mon, 4 Dec 2006, Robert Peterson wrote:

> Barry Brimer wrote:
>> This is a repeat of the post I made a few minutes ago.  I thought
>> adding a subject would be helpful.
>>
>> I have a 2 node cluster for a shared GFS filesystem.  One of the nodes
>> fenced the other, and the node that got fenced is no longer able to
>> communicate with the cluster.
>>
>> While booting the problem node, I receive the following error message:
>> Setting up Logical Volume Management:  Locking inactive: ignoring
>> clustered volume group vg00
>>
>> I have compared the /etc/lvm/lvm.conf files on both nodes.  They are
>> identical.  The disk (/dev/sda1) is listed when typing "fdisk -l".
>>
>> There are no iptables firewalls active (although /etc/sysconfig/iptables
>> exists, iptables is chkconfig'd off).  I have written a simple iptables
>> logging rule (iptables -I INPUT -s <problem node> -j LOG) on the working
>> node to verify that packets are reaching it, but no messages are being
>> logged in /var/log/messages on the working node that acknowledge any
>> cluster activity from the problem node.
>>
>> Both machines have the same RH packages installed and are mostly up to
>> date; they are missing the same packages, none of which involve the
>> kernel, RHCS, or GFS.
>>
>> When I boot the problem node, it successfully starts ccsd, but it fails
>> after a while on cman and fails after a while on fenced.  I have given
>> the clvmd process an hour, and it still will not start.
>>
>> vgchange -ay on the problem node returns:
>>
>> # vgchange -ay
>>   connect() failed on local socket: Connection refused
>>   Locking type 2 initialisation failed.
>>
>> I have the contents of /var/log/messages on the working node and the
>> problem node at the time of the fence, if that would be helpful.
>>
>> Any help is greatly appreciated.
>>
>> Thanks,
>> Barry
>> 
> Hi Barry,
>
> Well, vgchange and other lvm functions won't work on the clustered
> volume unless clvmd is running, and clvmd won't run properly until the
> node is talking happily through the cluster infrastructure.  So as I
> see it, your problem is that cman is not starting properly.
> Unfortunately, you haven't told us much about the system, so it's hard
> to determine why; there can be many reasons.

Agreed.  Although it did not seem relevant at the time of the post,
there were network outages around the time of the failure.  What happens
now is that on the problem node, ccsd starts, but when cman starts it
sends membership requests that are never acknowledged by the working
node.  The iptables LOG rule shows packets from the problem node
arriving on UDP port 6809 in /var/log/messages on the working node, but
cman on the working node never acknowledges them.
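
For anyone following along, this is roughly the check I'm running on the
working node (eth0 is an assumption; substitute your cluster interface):

# iptables -I INPUT -s <problem node> -j LOG --log-prefix "cluster-dbg: "
# tail -f /var/log/messages | grep cluster-dbg

or, to watch the cman traffic directly:

# tcpdump -n -i eth0 udp port 6809 and host <problem node>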

The problem node had this in its /var/log/messages at the time of the 
problem:

Dec  1 14:29:38 server1 kernel: CMAN: Being told to leave the cluster by node 1
Dec  1 14:29:38 server1 kernel: CMAN: we are leaving the cluster.
Dec  1 14:29:38 server1 kernel: WARNING: dlm_emergency_shutdown
Dec  1 14:29:38 server1 kernel: WARNING: dlm_emergency_shutdown
Dec  1 14:29:38 server1 kernel: SM: 00000001 sm_stop: SG still joined
Dec  1 14:29:38 server1 kernel: SM: 01000002 sm_stop: SG still joined
Dec  1 14:29:38 server1 kernel: SM: 02000006 sm_stop: SG still joined
Dec  1 14:29:38 server1 ccsd[3080]: Cluster manager shutdown.  Attemping to reconnect...
Dec  1 14:30:02 server1 kernel: clvmd move flags 0,1,0 ids 0,2,0
Dec  1 14:30:02 server1 kernel: clvmd move use event 2
Dec  1 14:30:02 server1 kernel: clvmd recover event 2 (first)
Dec  1 14:30:02 server1 kernel: clvmd add nodes
Dec  1 14:30:02 server1 kernel: clvmd total nodes 2
Dec  1 14:30:02 server1 kernel: clvmd rebuild resource directory
Dec  1 14:30:02 server1 kernel: clvmd rebuilt 1 resources
Dec  1 14:30:02 server1 kernel: clvmd recover event 2 done
Dec  1 14:30:02 server1 kernel: clvmd move flags 0,0,1 ids 0,2,2
Dec  1 14:30:02 server1 kernel: clvmd process held requests
Dec  1 14:30:02 server1 kernel: clvmd processed 0 requests
Dec  1 14:30:02 server1 kernel: clvmd recover event 2 finished
Dec  1 14:30:02 server1 kernel: ems move flags 0,1,0 ids 0,3,0
Dec  1 14:30:02 server1 kernel: ems move use event 3
Dec  1 14:30:02 server1 kernel: ems recover event 3 (first)
Dec  1 14:30:02 server1 kernel: ems add nodes
Dec  1 14:30:02 server1 kernel: ems total nodes 2
Dec  1 14:30:02 server1 kernel: ems rebuild resource directory
Dec  1 14:30:02 server1 kernel: ems rebuilt 77 resources
Dec  1 14:30:02 server1 kernel: ems recover event 3 done
Dec  1 14:30:02 server1 kernel: ems move flags 0,0,1 ids 0,3,3
Dec  1 14:30:02 server1 kernel: ems process held requests
Dec  1 14:30:02 server1 kernel: ems processed 0 requests
Dec  1 14:30:02 server1 kernel: ems recover event 3 finished
Dec  1 14:30:02 server1 kernel: 803 en punlock 7,2da070
Dec  1 14:30:02 server1 kernel: 10803 ex punlock 0
Dec  1 14:30:02 server1 kernel: 10803 en punlock 7,28aa0a
Dec  1 14:30:02 server1 kernel: 10803 ex punlock 0
Dec  1 14:30:02 server1 kernel: 14054 en punlock 7,2e1a27
Dec  1 14:30:02 server1 kernel: 14054 ex punlock 0
Dec  1 14:30:02 server1 kernel: 14054 en punlock 7,2da070
Dec  1 14:30:02 server1 kernel: 14054 ex punlock 0
Dec  1 14:30:02 server1 kernel: 14054 en punlock 7,28aa0a
Dec  1 14:30:02 server1 kernel: 14054 ex punlock 0
Dec  1 14:30:02 server1 kernel: 12215 en punlock 7,2e1a27
Dec  1 14:30:02 server1 kernel: 12215 ex punlock 0
Dec  1 14:30:02 server1 kernel: 12215 en punlock 7,2da070
Dec  1 14:30:02 server1 kernel: 12215 ex punlock 0
Dec  1 14:30:02 server1 kernel: 12215 en punlock 7,28aa0a
Dec  1 14:30:02 server1 kernel: 12215 ex punlock 0
Dec  1 14:30:02 server1 kernel: 10961 en punlock 7,2e1a27
Dec  1 14:30:02 server1 kernel: 10961 ex punlock 0
Dec  1 14:30:02 server1 kernel: 10961 en punlock 7,2da070
Dec  1 14:30:02 server1 kernel: 10961 ex punlock 0
Dec  1 14:30:02 server1 kernel: 10961 en punlock 7,28aa0a
Dec  1 14:30:02 server1 kernel: 10961 ex punlock 0
Dec  1 14:30:02 server1 kernel: 10737 en punlock 7,2e1a27
Dec  1 14:30:02 server1 kernel: 10737 ex punlock 0
Dec  1 14:30:02 server1 kernel: 10737 en punlock 7,2da070
Dec  1 14:30:02 server1 kernel: 10737 ex punlock 0
Dec  1 14:30:02 server1 kernel: 10737 en punlock 7,28aa0a
Dec  1 14:30:02 server1 kernel: 10737 ex punlock 0
[... identical "en punlock"/"ex punlock" pairs for 26 more PIDs snipped;
every "ex punlock" returned 0 ...]
Dec  1 14:30:02 server1 kernel:
Dec  1 14:30:02 server1 kernel: lock_dlm:  Assertion failed on line 428 of file /builddir/build/BUILD/gfs-kernel-2.6.9-60/smp/src/dlm/lock.c
Dec  1 14:30:02 server1 kernel: lock_dlm:  assertion:  "!error"
Dec  1 14:30:02 server1 kernel: lock_dlm:  time = 3382292560
Dec  1 14:30:02 server1 kernel: ems: num=3,11 err=-22 cur=-1 req=3 lkf=8
Dec  1 14:30:02 server1 kernel:
Dec  1 14:30:02 server1 kernel: ------------[ cut here ]------------
Dec  1 14:30:02 server1 kernel: kernel BUG at /builddir/build/BUILD/gfs-kernel-2.6.9-60/smp/src/dlm/lock.c:428!
Dec  1 14:30:02 server1 kernel: invalid operand: 0000 [#1]
Dec  1 14:30:02 server1 kernel: SMP
Dec  1 14:30:02 server1 kernel: Modules linked in: nfs lockd nfs_acl sunrpc autofs4 lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) iptable_filter ip_tables dm_mirror dm_multipath button battery ac uhci_hcd ehci_hcd hw_random bcm5700(U) floppy ext3 jbd dm_mod qla6312(U) qla2400(U) qla2300(U) qla2xxx(U) qla2xxx_conf(U) cciss sd_mod scsi_mod
Dec  1 14:30:02 server1 kernel: CPU:    3
Dec  1 14:30:02 server1 kernel: EIP:    0060:[<f8cd1779>]    Not tainted VLI
Dec  1 14:30:02 server1 kernel: EFLAGS: 00010246   (2.6.9-42.0.3.ELsmp)
Dec  1 14:30:02 server1 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]
Dec  1 14:30:02 server1 kernel: eax: 00000001   ebx: ffffffea   ecx: f29ead08   edx: f8cd6217
Dec  1 14:30:02 server1 kernel: esi: f8cd1798   edi: f7e93a00   ebp: f7ee4d00   esp: f29ead04
Dec  1 14:30:02 server1 kernel: ds: 007b   es: 007b   ss: 0068
Dec  1 14:30:02 server1 kernel: Process tibhawkhma (pid: 7150, threadinfo=f29ea000 task=f24e32b0)
Dec  1 14:30:02 server1 kernel: Stack: f8cd6217 20202020 33202020 20202020 20202020 20202020 31312020 c2b40018
Dec  1 14:30:02 server1 kernel:        e75e9118 f7ee4d00 00000003 00000000 f7ee4d00 f8cd1828 00000003 f8cd9a80
Dec  1 14:30:02 server1 kernel:        f8c5c000 f8d30936 00000000 00000001 f292de80 f292de64 f8c5c000 f8d268fe
Dec  1 14:30:02 server1 kernel: Call Trace:
Dec  1 14:30:02 server1 kernel:  [<f8cd1828>] lm_dlm_lock+0x49/0x52 [lock_dlm]
Dec  1 14:30:02 server1 kernel:  [<f8d30936>] gfs_lm_lock+0x35/0x4d [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d268fe>] gfs_glock_xmote_th+0x130/0x172 [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d25fbd>] rq_promote+0xc8/0x147 [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d261a9>] run_queue+0x91/0xc1 [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d271b9>] gfs_glock_nq+0xcf/0x116 [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d2778f>] gfs_glock_nq_init+0x13/0x26 [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d47b22>] stat_gfs_async+0x119/0x187 [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d47c57>] gfs_stat_gfs+0x27/0x4e [gfs]
Dec  1 14:30:02 server1 kernel:  [<f8d3fcca>] gfs_statfs+0x26/0xc7 [gfs]
Dec  1 14:30:02 server1 kernel:  [<c0159149>] vfs_statfs+0x41/0x59
Dec  1 14:30:02 server1 kernel:  [<c015916f>] vfs_statfs_native+0xe/0xd0
Dec  1 14:30:02 server1 kernel:  [<c01678e5>] __user_walk+0x4a/0x51
Dec  1 14:30:02 server1 kernel:  [<c0159298>] sys_statfs+0x3f/0x9f
Dec  1 14:30:02 server1 kernel:  [<c010b052>] do_gettimeofday+0x1a/0x9c
Dec  1 14:30:02 server1 kernel:  [<c012614f>] sys_time+0xf/0x58
Dec  1 14:30:02 server1 kernel:  [<c02d47cb>] syscall_call+0x7/0xb
Dec  1 14:30:02 server1 kernel:  [<c02d007b>] packet_rcv+0x17e/0x307
Dec  1 14:30:02 server1 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 42 63 cd f8 e8 32 11 45 c7 83 c4 38 68 17 62 cd f8 e8 25 11 45 c7 <0f> 0b ac 01 5f 61 cd f8 68 19 62 cd f8 e8 e0 08 45 c7 83 c4 20
Dec  1 14:30:02 server1 kernel:  <0>Fatal exception: panic in 5 seconds


> For now, let me assume that the two were working properly in a cluster 
> before it was fenced, and therefore I'll assume that the software and 
> configurations are all okay.  I think one reason this might happen is if 
> you're using manual fencing and haven't yet done your:
>
> fence_ack_manual -n <fenced_node>
>
> on the remaining node to acknowledge that the reboot actually happened.

Everything was working fine for several months on this cluster.  The
cluster software and kernel are the latest provided by Red Hat for
RHEL 4.  I am using fence_ilo, and the working node fenced the problem
node.
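
A quick way to compare what each side thinks the membership is (this
assumes the stock RHEL4 cman tools and /proc interface) is to run the
following on each node and check that both agree on the member list:

# cman_tool status
# cman_tool nodes
# cat /proc/cluster/services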

> Also, you might want to test communications between the boxes to make
> sure they can communicate with each other in general. 
> You might also get this kind of problem if you had updated the cluster 
> software, so that the cman on one node is incompatible with the cman on the other.
> Ordinarily, there are no problems or incompatibilities with upgrading, but
> if you upgraded cman from RHEL4U1 to RHEL4U4, for example, you might
> get this because the cman protocol changed slightly between RHEL4U1 and U2.

Same versions on both nodes - the latest Red Hat packages for RHEL 4:

Problem node:

# rpm -qa '*lvm*'
system-config-lvm-1.0.19-1.0
lvm2-2.02.06-6.0.RHEL4
lvm2-cluster-2.02.06-7.0.RHEL4

Working node:

# rpm -qa '*lvm*'
lvm2-2.02.06-6.0.RHEL4
lvm2-cluster-2.02.06-7.0.RHEL4
system-config-lvm-1.0.19-1.0
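
To rule out the upgrade-mismatch scenario above, a rough way to compare
the cluster packages as a whole (the egrep pattern is just my guess at
the relevant package names) is to run this on each node and diff the
two lists:

# rpm -qa | egrep -i 'cman|dlm|ccs|fence|gfs|magma' | sort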

> Next time, it would also be helpful to post what version of the cluster 
> software
> you're running and possibly snippets from /var/log/messages showing why
> cman is not connecting.

I've since discovered that another GFS cluster (non-production) had a
similar issue, and a reboot of both nodes solved the problem there.
With the original (production) cluster, I am trying to figure out how
to get the problem node back into the cluster without having to unmount
the GFS volume from the remaining working node.
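
For the archives, the order I'm starting services in on the problem
node (the stock RHEL4 init-script order) and where it currently fails:

# service ccsd start      <-- starts cleanly
# service cman start      <-- fails after a while
# service fenced start    <-- fails after a while
# service clvmd start     <-- hangs; I have given it an hour
# service gfs start       <-- never reached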

Thank you so much for your input; it is greatly appreciated.

If you have any more suggestions, particularly on how to get my problem
node back into the cluster without unmounting the GFS volume from the
working node, please let me know.

Thanks,
Barry



