[Linux-cluster] Same OOPS on both cluster nodes, separated by a week.

Mon Jun 20 16:07:26 UTC 2005

I got the following oops messages on my cluster nodes, at different
times.  The first was on node A: I was running clustat and hit ctrl-4 to
kill it (it was taking a long while to run and seemed to be blocked by
something).  The second time I did that, OOPS#1 showed up.  The
second oops showed up on node B.  The cluster was running, and I
wasn't doing anything beyond watching a tcpdump to see
some data flow by.  I went away for about 10 minutes, and when I came back
node B had locked up and had been fenced by A.  The oops was in the
messages file.

These events were separated by about a week, and in between I had
updated everything to RHEL4 U1 and recompiled the cluster code (checked
out from the RHEL4 branch) for the new kernel.

Yes, both of these nodes have VMware loaded. I can move the virtual
machines off to another host, disable VMware, and try to replicate
the problem again if you think VMware might be causing it. (It
may take a week or so, since this problem seems to be intermittent.)

The cluster has two nodes with shared ext3 partitions and a few services
(Apache, PostgreSQL, a VMware virtual machine).  Both nodes run Red Hat
Enterprise Linux 4 on identical HP DL380 G4 dual-Xeon boxes with
hyperthreading enabled.  Memtest86 on node B completed two
successful passes, run soon after the oops.

Any help would be appreciated, including a pointer in the right direction
for debugging this problem.
Eric Kerin

OOPS#1:  Node A - ctrl-4ing clustat from a root shell
Unable to handle kernel NULL pointer dereference at virtual address
0000001c
 printing eip:
c02c4f92
*pde = 34a2c001
Oops: 0000 [#1]
SMP
Modules linked in: nfsd exportfs lockd nls_utf8 vmnet(U) vmmon(U)
parport_pc lp parport autofs4 i2c_dev i2c_core dlm(U) cman(U) sunrpc
button
 battery ac md5 ipv6 uhci_hcd ehci_hcd hw_random tg3 floppy dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
CPU:    2
EIP:    0060:[<c02c4f92>]    Tainted: PF     VLI
EFLAGS: 00010206   (2.6.9-5.0.5.ELsmp)
EIP is at _spin_lock+0x3/0x34
eax: 00000018   ebx: 00000018   ecx: f466ae00   edx: f466ae00
esi: f466ae00   edi: 00000000   ebp: 00000000   esp: f50eff70
ds: 007b   es: 007b   ss: 0068
Process dlm_astd (pid: 2440, threadinfo=f50ef000 task=f515c130)
Stack: f466ae00 f89993be f466ae00 00000000 0011ab26 00000000 f466ae00
00000005
       f466ae00 f8999493 00000000 f587e2e8 f8999423 f89984c7 00000000
e22d518c
       f7e24600 f89b36a8 00000000 00000000 f8998b61 f8998cb2 f50ef000
f53caeac
Call Trace:
 [<f89993be>] add_to_astqueue+0x79/0xc7 [dlm]
 [<f8999493>] ast_routine+0x70/0x130 [dlm]
 [<f8999423>] ast_routine+0x0/0x130 [dlm]
 [<f89984c7>] process_asts+0x15c/0x1c2 [dlm]
 [<f8998b61>] dlm_astd+0x0/0x1a9 [dlm]
 [<f8998cb2>] dlm_astd+0x151/0x1a9 [dlm]
 [<c0131d3d>] kthread+0x73/0x9b
 [<c0131cca>] kthread+0x0/0x9b
 [<c01041f1>] kernel_thread_helper+0x5/0xb
Code: c0 84 d2 0f 9f c0 c3 89 c2 f0 81 28 00 00 00 01 0f 94 c0 84 c0 b9
01 00 00 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3 53 89 c3 <81> 78
04 ad 4e ad de 74 18 ff 74 24 04 68 4d 83 2d c0 e8 61 bb
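
A note on reading the registers in OOPS#1, offered as a sketch rather
than a diagnosis.  The faulting address is 0000001c and eax is 00000018.
The marked bytes in the Code line (81 78 04 ad 4e ad de) decode on i386
as a compare of the dword at 4(%eax) against 0xdead4ead, which looks like
a spinlock debug check.  So _spin_lock appears to have been handed the
address 0x18, which is what taking the address of a lock field at offset
0x18 would give if the enclosing structure pointer were NULL.  Which
structure in add_to_astqueue that would be is not something the trace
shows.  OOPS#2 reports the same fault address and the same eax.  The
function name below is hypothetical and only illustrates the arithmetic:

```python
import re

def fault_displacement(oops_text):
    """Return (fault_addr, eax, fault_addr - eax) parsed from an i386 oops.

    Hypothetical helper for illustration; it only handles the two
    fields used here and assumes 8-digit lowercase hex values.
    """
    # The mail client wrapped the address onto the next line, so \s+ must
    # be allowed to span the newline.
    fault = int(re.search(r"virtual address\s+([0-9a-f]{8})", oops_text).group(1), 16)
    eax = int(re.search(r"eax:\s*([0-9a-f]{8})", oops_text).group(1), 16)
    return fault, eax, fault - eax

# Lines copied from OOPS#1.
sample = """Unable to handle kernel NULL pointer dereference at virtual address
0000001c
eax: 00000018   ebx: 00000018   ecx: f466ae00   edx: f466ae00
"""

fault, eax, disp = fault_displacement(sample)
print(hex(fault), hex(eax), hex(disp))  # prints: 0x1c 0x18 0x4
```

The displacement of 4 matches the 04 byte in the faulting instruction,
which is consistent with the fault happening on the first access through
the lock pointer rather than on a corrupted lock.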

OOPS#2: Node B - Nothing out of the ordinary, just watching a tcpdump
 Unable to handle kernel NULL pointer dereference at virtual address
0000001c
  printing eip:
 c02c5ee4
 *pde = 3509b001
 Oops: 0000 [#1]
 SMP
 Modules linked in: dlm(U) cman(U) vmnet(U) parport_pc vmmon(U) lp
parport autofs4 i2c_dev i2c_core sunrpc button battery ac md5 ipv6
uhci_hcd ehci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror
ext3 jbd dm_mod cciss sd_mod scsi_mod
 CPU:    2
 EIP:    0060:[<c02c5ee4>]    Tainted: PF     VLI
 EFLAGS: 00010206   (2.6.9-11.ELsmp)
 EIP is at _spin_lock+0x3/0x34
 eax: 00000018   ebx: 00000018   ecx: c2b67b80   edx: c2b67b80
 esi: c2b67b80   edi: 00000000   ebp: 00000000   esp: f5083f70
 ds: 007b   es: 007b   ss: 0068
 Process dlm_astd (pid: 4733, threadinfo=f5083000 task=f7601730)
 Stack: c2b67b80 f8c73446 c2b67b80 00000000 008ecb26 00000000 c2b67b80
00000005
        c2b67b80 f8c7351b 00000000 f5a853f0 f8c734ab f8c724c7 00000000
d95e6eac
        f74fa400 f8c8d7a8 00000000 00000000 f8c72b61 f8c72cb2 f5083000
f519feac
 Call Trace:
  [<f8c73446>] add_to_astqueue+0x79/0xc7 [dlm]
  [<f8c7351b>] ast_routine+0x70/0x130 [dlm]
  [<f8c734ab>] ast_routine+0x0/0x130 [dlm]
  [<f8c724c7>] process_asts+0x15c/0x1c2 [dlm]
  [<f8c72b61>] dlm_astd+0x0/0x1a9 [dlm]
  [<f8c72cb2>] dlm_astd+0x151/0x1a9 [dlm]
  [<c0132e31>] kthread+0x73/0x9b
  [<c0132dbe>] kthread+0x0/0x9b
  [<c01041f1>] kernel_thread_helper+0x5/0xb
 Code: c0 84 d2 0f 9f c0 c3 89 c2 f0 81 28 00 00 00 01 0f 94 c0 84 c0 b9
01 00 00 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3 53 89 c3 <81>78 04
ad 4e ad de 74 18 ff 74 24 04 68 2a 97 2d c0 e8 db ba
  <0>Fatal exception: panic in 5 seconds