[Linux-cluster] Cluster Node Crash
Steve Rigler
srigler at marathonoil.com
Fri Jul 27 19:28:49 UTC 2007
On Fri, 2007-07-27 at 14:21 -0500, Steve Rigler wrote:
> Hello All,
>
> We are running GFS on RHEL4U3 (x86_64). One of our cluster nodes
> crashes this afternoon. We are able to capture some of the message from
> netdump (pasted below) before fencing killed the node.
>
> Any advice would be appreciated.
>
> Thanks,
> Steve
>
>
As a follow-up, that should have been past tense (the word "crashes" should
have been "crashed"). One of the other nodes panicked after the first one
tried to rejoin the cluster (this is a 3-node cluster).
The dump from that node had these messages near the beginning of its
crash:
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
SM: 00000001 sm_stop: SG still joined
SM: 01000002 sm_stop: SG still joined
SM: 02000004 sm_stop: SG still joined
SM: 0300000d sm_stop: SG still joined
Followed by this:
lock_dlm: Assertion failed on line 428 of file /usr/src/build/714650-x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 5442621324
STUL03E: num=1,2 err=-22 cur=-1 req=3 lkf=0
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs nfs lockd nfs_acl parport_pc lp parport
netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U)
lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket
pcmcia_core dm_mirror dm_round_robin dm_multipath button battery ac
uhci_hcd ehci_hcd hw_random tg3 floppy ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc cciss sd_mod scsi_mod
Pid: 30604, comm: umount Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02689e7>] <ffffffffa02689e7>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001002ab6dc38 EFLAGS: 00010216
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000000246
RDX: 000000000000996e RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010117945c80 R08: 0000000000000004 R09: 00000000ffffffea
R10: 0000000000000000 R11: 00000000000000e4 R12: 00000100dfd23400
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000003
FS: 0000002a95575b00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003f95fc60c0 CR3: 0000000000101000 CR4: 00000000000006e0
Process umount (pid: 30604, threadinfo 000001002ab6c000, task 00000101120da030)
Stack: 0000000000000003 0000000000000000 3120202020202020 2020202020202020
       3220202020202020 0000000000000018 0000010117945c80 0000000000000000
       0000000000000003 0000000000000000
Call Trace:<ffffffffa0268b2a>{:lock_dlm:lm_dlm_lock+214}
<ffffffffa022f93f>{:gfs:gfs_lm_lock+50}
<ffffffffa02269da>{:gfs:gfs_glock_xmote_th+357}
<ffffffffa0224cdd>{:gfs:run_queue+667}
<ffffffffa0225ccf>{:gfs:gfs_glock_nq+938}
<ffffffffa0225f11>{:gfs:gfs_glock_nq_init+20}
<ffffffffa024629b>{:gfs:gfs_make_fs_ro+39}
<ffffffffa023e508>{:gfs:gfs_put_super+630}
<ffffffff8017d0c9>{generic_shutdown_super+202}
<ffffffffa023c009>{:gfs:gfs_kill_sb+42}
<ffffffff801ccb78>{dummy_inode_permission+0}
<ffffffff8017cfe6>{deactivate_super+95}
<ffffffff80192537>{sys_umount+925} <ffffffff80180264>{sys_newstat+17}
<ffffffff80110c61>{error_exit+0} <ffffffff801101c6>{system_call+126}