[Linux-cluster] another cman/sm problem

Thu Dec 23 00:35:25 UTC 2004

cl030:
CMAN: node cl031a is not responding - removing from the cluster
dlm: closing connection to node 1
dlm: closing connection to node 2
SM: 00000001 sm_stop: SG still joined
SM: 01000932 sm_stop: SG still joined
SM: 02000933 sm_stop: SG still joined
Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
c0119677
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_nolock lock_dlm dlm qla2200 qla2xxx gfs lock_harness cman dm_mod
CPU:    1
EIP:    0060:[<c0119677>]    Not tainted VLI
EFLAGS: 00010096   (2.6.9)
EIP is at task_rq_lock+0x27/0x70
eax: ea000f2c   ebx: c052e000   ecx: 00000000   edx: 00000000
esi: c0533020   edi: c052e000   ebp: ea000ef4   esp: ea000ee8
ds: 007b   es: 007b   ss: 0068
Process cman_comms (pid: 3739, threadinfo=ea000000 task=e9de98f0)
Stack: c1b037b0 e617daf4 f781243c ea000f3c c0119d92 00000000 ea000f2c c014a82f
       c181f040 00000020 f7810750 f8d04755 ffffff95 02000933 00000000 ea000f50
       f8d047b5 00000296 c1b037b0 e617daf4 f781243c ea000f50 c011a02e 00000000
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0118683>] do_page_fault+0x293/0x7c1
 [<c0105e59>] error_code+0x2d/0x38
 [<c0119d92>] try_to_wake_up+0x22/0x2a0
 [<c011a02e>] wake_up_process+0x1e/0x30
 [<f8d048b0>] callback_startdone_barrier+0x20/0x30 [cman]
 [<f8cfc641>] node_shutdown+0x291/0x3c0 [cman]
 [<f8cf847a>] cluster_kthread+0x2aa/0x350 [cman]
 [<c0103325>] kernel_thread_helper+0x5/0x10

cl031:
dlm: closing connection to node 1
dlm: closing connection to node 2
SM: 00000001 sm_stop: SG still joined
SM: 01000932 sm_stop: SG still joined
SM: 02000933 sm_stop: SG still joined
Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
c0119677
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm qla2200 qla2xxx gfs lock_harness cman dm_mod
CPU:    1
EIP:    0060:[<c0119677>]    Not tainted VLI
EFLAGS: 00010096   (2.6.9)
EIP is at task_rq_lock+0x27/0x70
eax: eacc0f2c   ebx: c052e000   ecx: 00000000   edx: 00000000
esi: c0533020   edi: c052e000   ebp: eacc0ef4   esp: eacc0ee8
ds: 007b   es: 007b   ss: 0068
Process cman_comms (pid: 2876, threadinfo=eacc0000 task=eae75a30)
Stack: ea3a285c caeb7da4 f502e1a8 eacc0f3c c0119d92 00000000 eacc0f2c c014a82f
       c181f040 00000020 f7d35f38 f8d04755 ffffff95 02000933 00000000 eacc0f50
       f8d047b5 00000296 ea3a285c caeb7da4 f502e1a8 eacc0f50 c011a02e 00000000
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0118683>] do_page_fault+0x293/0x7c1
 [<c0105e59>] error_code+0x2d/0x38
 [<c0119d92>] try_to_wake_up+0x22/0x2a0
 [<c011a02e>] wake_up_process+0x1e/0x30
 [<f8d04880>] callback_startdone_barrier_new+0x20/0x30 [cman]
 [<f8cfc641>] node_shutdown+0x291/0x3c0 [cman]
 [<f8cf847a>] cluster_kthread+0x2aa/0x350 [cman]
 [<c0103325>] kernel_thread_helper+0x5/0x10
Code: 00 00 00 00 55 89 e5 83 ec 0c 89 1c 24 89 74 24 04 89 7c 24 08
 8b 45 0c 9c 8f 00 fa be 20 30 53 c0 bb 00 e0 52 c0 8b 55 08 89 df
 <8b> 42 04 8b 40 10 8b 0c
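
A note on reading the Code: bytes above. The byte marked <8b> is where the fault happened, and the register dump shows edx: 00000000. The small sketch below decodes just that instruction and the load before it. It is only an illustration: the decoder handles one opcode (0x8b, a 32-bit MOV load) with an 8-bit displacement and nothing else, and the function name is made up for this example.

```python
# Sketch: decode the 32-bit x86 MOV loads around the fault point in the
# Code: line. Handles only opcode 0x8b with a ModRM byte using mod=01
# (8-bit displacement) and no SIB byte, which is all these bytes need.

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_load(code):
    """Decode 'mov disp8(%base),%dest' from three bytes: 8b, ModRM, disp8."""
    opcode, modrm, disp = code
    if opcode != 0x8B:
        raise ValueError("not a MOV r32, r/m32")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only disp8 forms without a SIB byte are handled")
    if disp >= 0x80:  # disp8 is a signed byte
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


if __name__ == "__main__":
    # Register values taken from the oops above (cl031).
    regs = {"edx": 0x00000000}

    # "8b 55 08" is the load just before the fault, "<8b> 42 04" is the fault.
    for hexbytes in ("8b5508", "8b4204"):
        dest, base, disp = decode_mov_load(bytes.fromhex(hexbytes))
        line = f"mov {disp:#x}(%{base}),%{dest}"
        if base in regs:
            addr = (regs[base] + disp) & 0xFFFFFFFF
            line += f"   reads address {addr:08x}"
        print(line)
```

The faulting instruction comes out as mov 0x4(%edx),%eax, and with edx at zero that is a read of address 00000004, matching the "virtual address 00000004" line. The load before it fills edx from 0x8(%ebp), so the pointer being dereferenced looks like an argument to task_rq_lock. That would be consistent with wake_up_process() being called on a NULL task pointer from the cman callback in the trace, but the logs alone do not show where that pointer came from.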

cl032:
Dec 22 05:29:19 cl032 sshd(pam_unix)[18296]: session closed for user root
Dec 22 05:42:14 cl032 kernel: CMAN: bad generation number 15 in HELLO message, expected 14
Dec 22 05:42:17 cl032 kernel: CMAN: Node cl030a is leaving the cluster, Shutdown
Dec 22 05:42:17 cl032 kernel: CMAN: quorum lost, blocking activity

My test does a lot of mounting and umounting.  Wouldn't that
stress the SM code a lot?  Is SM causing the problem?
http://developer.osdl.org/daniel/GFS/cman.21dec2004/
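
For anyone who wants to try a similar load, here is a rough sketch of a mount/umount stress loop. This is not the actual test script. The device path, mount point, and iteration count are placeholders, and the helper names are invented for this example. It defaults to a dry run that only prints the commands.

```python
# Sketch of a mount/umount stress loop. All paths and counts below are
# placeholder assumptions, not values from the real test.
import subprocess


def mount_cycle_commands(device, mountpoint, iterations, fstype="gfs"):
    """Build the list of mount/umount commands for the given number of cycles."""
    cmds = []
    for _ in range(iterations):
        cmds.append(["mount", "-t", fstype, device, mountpoint])
        cmds.append(["umount", mountpoint])
    return cmds


def run(cmds, dry_run=True):
    """Print each command, or execute it (needs root) when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical device and mount point; adjust for the cluster under test.
    run(mount_cycle_commands("/dev/example_vg/gfs0", "/mnt/gfs0", iterations=3))
```

Running the same loop on several nodes at once would keep nodes joining and leaving the mount groups repeatedly, which is the kind of load described above.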

Daniel