[Linux-cluster] DLM or SM bug after 50 hours
Daniel McNeil
daniel at osdl.org
Fri Dec 17 01:18:40 UTC 2004
My tests ran for 50 hours! This is a new record, and the run was
using my up_write()-before-queue_ast() patch.
It hit an error during a 2-node test (GFS mounted on cl030 and cl031;
cl032 was a member of the cluster but had no GFS file system mounted).
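For context, the patch just swaps the order of the two calls so the
rwsem is released before the AST is queued.  A minimal sketch of the
idea (the rwsem, the lkb type, and the function name below are
placeholders, not the actual dlm-kernel code):

/* Sketch only -- illustrates the reordering, not the real code. */
#include <linux/rwsem.h>

struct lkb;                        /* lock block, opaque here       */
void queue_ast(struct lkb *lkb);   /* queues the completion AST     */

static void grant_and_ast(struct rw_semaphore *sem, struct lkb *lkb)
{
	/*
	 * Old order: queue_ast(lkb) was called before up_write(sem),
	 * i.e. while the semaphore was still held for write.  The
	 * patch releases the rwsem first, then queues the AST.
	 */
	up_write(sem);       /* drop the write lock first */
	queue_ast(lkb);      /* then queue the AST        */
}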
On cl030 console:
SM: 00000001 sm_stop: SG still joined
SM: 01000410 sm_stop: SG still joined
/proc/cluster/status shows that cl030 is no longer in the cluster
On cl031 console:
CMAN: node cl030a is not responding - removing from the cluster
dlm: stripefs: recover event 6388
CMAN: node cl030a is not responding - removing from the cluster
dlm: stripefs: recover event 6388
name " 5 54bdb0" flags 2 nodeid 0 ref 1
G 00240122 gr 3 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,0
[60,000 lines of this]
------------[ cut here ]------------
kernel BUG at /Views/redhat-cluster/cluster/dlm-kernel/src/reccomms.c:128!
invalid operand: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod
CPU: 1
EIP: 0060:[<f8b3e243>] Not tainted VLI
EFLAGS: 00010286 (2.6.9)
EIP is at rcom_send_message+0x193/0x250 [dlm]
eax: 00000001 ebx: c27813cc ecx: c0456c0c edx: 00000286
esi: da046eb4 edi: c27812d8 ebp: da046e90 esp: da046e6c
ds: 007b es: 007b ss: 0068
Process dlm_recoverd (pid: 28108, threadinfo=da046000 task=f6d656f0)
Stack: f8b44904 ffffff97 f8b46c60 f8b448ed 0af345bb ffffff97 c27812d8 da046000
da046eb4 da046ee0 f8b3eff1 c27812d8 00000001 00000001 da046eb4 00000001
c181f040 00000001 00150014 00000000 01000410 00000008 01000001 c7062300
Call Trace:
[<c010626f>] show_stack+0x7f/0xa0
[<c010641e>] show_registers+0x15e/0x1d0
[<c010663e>] die+0xfe/0x190
[<c0106bd7>] do_invalid_op+0x107/0x110
[<c0105e59>] error_code+0x2d/0x38
[<f8b3eff1>] dlm_wait_status_low+0x71/0xa0 [dlm]
[<f8b38e19>] nodes_reconfig_wait+0x29/0x80 [dlm]
[<f8b39051>] ls_nodes_reconfig+0x161/0x350 [dlm]
[<f8b4077b>] ls_reconfig+0x6b/0x250 [dlm]
[<f8b41685>] do_ls_recovery+0x195/0x4a0 [dlm]
[<f8b41a88>] dlm_recoverd+0xf8/0x100 [dlm]
[<c0134cca>] kthread+0xba/0xc0
[<c0103325>] kernel_thread_helper+0x5/0x10
Code: 44 24 04 80 00 00 00 e8 dc 1c 5e c7 8b 45 f0 c7 04 24 f8 48 b4 f8 89 44 24 04 e8 c9 1c 5e c7 c7 04 24 04 49 b4 f8 e8 bd 1c 5e c7 <0f> 0b 80 00 60 6c b4 f8 c7 04 24 a0 6c b4 f8 e8 59 14 5e c7 89
<1>Unable to handle kernel paging request at virtual address 6b6b6b7b
printing eip:
c011967a
*pde = 00000000
Oops: 0000 [#2]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod
CPU: 0
EIP: 0060:[<c011967a>] Not tainted VLI
EFLAGS: 00010086 (2.6.9)
EIP is at task_rq_lock+0x2a/0x70
eax: 6b6b6b6b ebx: c052e000 ecx: c2781350 edx: f6d656f0
esi: c0533020 edi: c052e000 ebp: eb6b8e9c esp: eb6b8e90
ds: 007b es: 007b ss: 0068
Process cman_comms (pid: 3628, threadinfo=eb6b8000 task=eb9e0910)
Stack: c2781350 c27812d8 00000002 eb6b8ee4 c0119d92 f6d656f0 eb6b8ed4 0af34b37
c0456ac8 00100100 00200200 0af34b37 00000001 dead4ead 00000000 c0129790
eb9e0910 00000286 c2781350 c27812d8 00000002 eb6b8ef8 c011a02e f6d656f0
Call Trace:
[<c010626f>] show_stack+0x7f/0xa0
[<c010641e>] show_registers+0x15e/0x1d0
[<c010663e>] die+0xfe/0x190
[<c0118683>] do_page_fault+0x293/0x7c1
[<c0105e59>] error_code+0x2d/0x38
[<c0119d92>] try_to_wake_up+0x22/0x2a0
[<c011a02e>] wake_up_process+0x1e/0x30
[<f8b41c28>] dlm_recoverd_stop+0x48/0x6b [dlm]
[<f8b350c8>] release_lockspace+0x38/0x2f0 [dlm]
[<f8b3541c>] dlm_emergency_shutdown+0x4c/0x70 [dlm]
[<f8a8057a>] notify_kernel_listeners+0x5a/0x90 [cman]
[<f8a8440e>] node_shutdown+0x5e/0x3c0 [cman]
[<f8a8047a>] cluster_kthread+0x2aa/0x350 [cman]
[<c0103325>] kernel_thread_helper+0x5/0x10
Code: 00 55 89 e5 83 ec 0c 89 1c 24 89 74 24 04 89 7c 24 08 8b 45 0c 9c 8f 00 fa be 20 30 53 c0 bb 00 e0 52 c0 8b 55 08 89 df 8b 42 04 <8b> 40 10 8b 0c 86 01 cf 89 f8 e8 e7 2c 2c 00 8b 55 08 8b 42 04
cl032 console shows:
SM: process_reply invalid id=7783 nodeid=2
CMAN: quorum lost, blocking activity
The test was unmounting the GFS file system on cl030 when this
occurred. The GFS file system is still mounted on cl031
according to /proc/mounts.
The stack trace on cl030 for umount shows:
umount D 00000008 0 10862 10856 (NOTLB)
e383de00 00000082 e383ddf0 00000008 00000002 e0b661e7 00000008 0000007d
f71b37f8 00000001 e383dde8 c011b77b e383dde0 c0119881 eb59d8b0 e0ba257b
c1716f60 00000001 00053db9 0fb9cfc1 0000a65f d678b790 d678b8f8 c1716f60
Call Trace:
[<c03dbac4>] wait_for_completion+0xa4/0xe0
[<f8a92aee>] kcl_leave_service+0xfe/0x180 [cman]
[<f8b35366>] release_lockspace+0x2d6/0x2f0 [dlm]
[<f8b5215c>] release_gdlm+0x1c/0x30 [lock_dlm]
[<f8b52464>] lm_dlm_unmount+0x24/0x50 [lock_dlm]
[<f8964496>] lm_unmount+0x46/0xac [lock_harness]
[<f8b0eb2f>] gfs_put_super+0x30f/0x3c0 [gfs]
[<c0167f07>] generic_shutdown_super+0x1b7/0x1d0
[<c0168c0d>] kill_block_super+0x1d/0x40
[<c0167c10>] deactivate_super+0xa0/0xd0
[<c017f6ac>] sys_umount+0x3c/0xa0
[<c017f729>] sys_oldumount+0x19/0x20
[<c010537d>] sysenter_past_esp+0x52/0x71
So my guess is that the umount on cl030 triggered the assert on
cl031, and both nodes got kicked out of the cluster.
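One more observation on the second oops: eax is 6b6b6b6b (the slab
poison pattern for freed objects), and the fault comes from
wake_up_process() called out of dlm_recoverd_stop().  That looks like
the stop path waking a dlm_recoverd task_struct that had already been
freed after the first oops killed the thread.  A minimal sketch of the
usual way to keep that pointer safe, pinning the task with
get_task_struct() when the thread is started (names below are
illustrative, not the actual dlm_recoverd code):

/* Sketch only -- shows the reference-counting pattern, not real code. */
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

struct recoverd_sketch {
	struct task_struct *task;
};

static int recoverd_start_sketch(struct recoverd_sketch *r,
				 int (*fn)(void *), void *arg)
{
	struct task_struct *t = kthread_create(fn, arg, "dlm_recoverd");

	if (IS_ERR(t))
		return PTR_ERR(t);
	get_task_struct(t);     /* pin it: the task_struct stays valid
				 * even after the thread exits        */
	r->task = t;
	wake_up_process(t);
	return 0;
}

static void recoverd_stop_sketch(struct recoverd_sketch *r)
{
	/* ... set whatever stop flag the thread checks ... */
	wake_up_process(r->task);   /* safe: we still hold a reference.
				     * Without it this can dereference
				     * freed, poisoned (0x6b6b6b6b)
				     * memory, as in the second oops. */
	/* ... wait for the thread to exit ... */
	put_task_struct(r->task);
	r->task = NULL;
}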
All the data is available here:
http://developer.osdl.org/daniel/GFS/panic.16dec2004/
I included /proc/cluster/dlm_debug and sm_debug (I am not sure
what the data in those files means).
Thoughts?
Daniel