[Linux-cluster] Cluster Node Crash

Fri Jul 27 19:28:49 UTC 2007

On Fri, 2007-07-27 at 14:21 -0500, Steve Rigler wrote:
> Hello All,
> 
> We are running GFS on RHEL4U3 (x86_64).  One of our cluster nodes
> crashes this afternoon.  We are able to capture some of the message from
> netdump (pasted below) before fencing killed the node.
> 
> Any advice would be appreciated.
> 
> Thanks,
> Steve
> 
> 

As a follow-up: that should have been past tense ("crashes" should have
been "crashed").  One of the other nodes then panicked after the first one
tried to rejoin the cluster (this is a three-node cluster).

The dump from that node had these messages near the beginning of its
crash:
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
SM: 00000001 sm_stop: SG still joined
SM: 01000002 sm_stop: SG still joined
SM: 02000004 sm_stop: SG still joined
SM: 0300000d sm_stop: SG still joined

Followed by this:

lock_dlm:  Assertion failed on line 428 of file /usr/src/build/714650-x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 5442621324
STUL03E: num=1,2 err=-22 cur=-1 req=3 lkf=0

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs nfs lockd nfs_acl parport_pc lp parport
netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U)
lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket
pcmcia_core dm_mirror dm_round_robin dm_multipath button battery ac
uhci_hcd ehci_hcd hw_random tg3 floppy ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc cciss sd_mod scsi_mod
Pid: 30604, comm: umount Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02689e7>] <ffffffffa02689e7>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001002ab6dc38  EFLAGS: 00010216
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000000246
RDX: 000000000000996e RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010117945c80 R08: 0000000000000004 R09: 00000000ffffffea
R10: 0000000000000000 R11: 00000000000000e4 R12: 00000100dfd23400
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000003
FS:  0000002a95575b00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003f95fc60c0 CR3: 0000000000101000 CR4: 00000000000006e0
Process umount (pid: 30604, threadinfo 000001002ab6c000, task 00000101120da030)
Stack: 0000000000000003 0000000000000000 3120202020202020 2020202020202020
       3220202020202020 0000000000000018 0000010117945c80 0000000000000000
       0000000000000003 0000000000000000
Call Trace:<ffffffffa0268b2a>{:lock_dlm:lm_dlm_lock+214} <ffffffffa022f93f>{:gfs:gfs_lm_lock+50}
       <ffffffffa02269da>{:gfs:gfs_glock_xmote_th+357} <ffffffffa0224cdd>{:gfs:run_queue+667}
       <ffffffffa0225ccf>{:gfs:gfs_glock_nq+938} <ffffffffa0225f11>{:gfs:gfs_glock_nq_init+20}
       <ffffffffa024629b>{:gfs:gfs_make_fs_ro+39} <ffffffffa023e508>{:gfs:gfs_put_super+630}
       <ffffffff8017d0c9>{generic_shutdown_super+202} <ffffffffa023c009>{:gfs:gfs_kill_sb+42}
       <ffffffff801ccb78>{dummy_inode_permission+0} <ffffffff8017cfe6>{deactivate_super+95}
       <ffffffff80192537>{sys_umount+925} <ffffffff80180264>{sys_newstat+17}
       <ffffffff80110c61>{error_exit+0} <ffffffff801101c6>{system_call+126}
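
For anyone reading the dump above: the "err=-22" on the lock_dlm line and the value in RBX (00000000ffffffea) look like the same number. Below is a minimal Python sketch for decoding them. It assumes that err is a negated Linux errno and that the low 32 bits of RBX hold that value; both are assumptions, not something the log itself states. The helper names are made up for illustration.

```python
import errno
import os


def decode_kernel_err(err):
    """Map a negative kernel-style return value to (errno name, message)."""
    code = -err
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def as_signed32(value):
    """Interpret the low 32 bits of a register value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# Values copied from the dump: err=-22 and RBX: 00000000ffffffea
name, message = decode_kernel_err(-22)
rbx = as_signed32(0x00000000FFFFFFEA)
print(name, message, rbx)
```

Under those assumptions the error decodes to EINVAL, and RBX, read as a signed 32-bit value, is also -22. That would be consistent with the failed "!error" assertion in do_dlm_lock, but it does not by itself explain why the lock request was rejected.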