[Linux-cluster] kernel oops on mount and sendmsg failed: -22

Thu Sep 21 22:54:46 UTC 2006

I have a two-node cluster. One node (node A) runs Linux kernel 2.6.11.12
while the other (node B) runs 2.6.18. Both report cman_tool version
5.0.1. I start node A first, then node B joins. Node A can mount the
GFS file systems, but when node B tries to mount them it gets a kernel
oops, which is pasted at the end of this email (see "KERNEL OOPS output").

So I reboot node B and try to rejoin, but it does not seem to be able to
communicate with node A correctly, as if the cluster were stuck in some
stale state (see "node B rejoin kernel messages"). Node A's log shows
that it received the join request, but it apparently never sent an
acknowledgement, and then node A simply left the cluster (see "node A
kernel messages").

I suspect the problem is that the two nodes run different versions of
the cluster software (even though --version doesn't say so), but the
newest -rSTABLE no longer compiles against 2.6.11.12. What is the
recommended solution for a cluster that must run different kernel
versions?

tia,
dan

---

<KERNEL OOPS output>

BUG: unable to handle kernel NULL pointer dereference at virtual address 0000001c
 printing eip:
c01825e6
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx
firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod
scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core
serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror
dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core
usbcore tg3 thermal processor fan unix
CPU:    2
EIP:    0060:[<c01825e6>]    Tainted: GF     VLI
EFLAGS: 00010293   (2.6.18 #1)
EIP is at do_add_mount+0x66/0x130
eax: 0000000c   ebx: f3843f24   ecx: c24fbac0   edx: f443f550
esi: df907200   edi: 00000000   ebp: 00000000   esp: f3843df4
ds: 007b   es: 007b   ss: 0068
Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000)
Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d df907200
       f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe 00000000
       c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 df98330c
Call Trace:
 [<c018321d>] do_mount+0x33d/0x760
 [<c0175080>] link_path_walk+0x80/0x100
 [<c01507e3>] __handle_mm_fault+0x233/0x980
 [<c0150a86>] __handle_mm_fault+0x4d6/0x980
 [<c0147cdf>] __alloc_pages+0x4f/0x2f0
 [<c0147fad>] __get_free_pages+0x2d/0x40
 [<c0181ed7>] copy_mount_options+0x47/0x130
 [<c01836dd>] sys_mount+0x9d/0xe0
 [<c01031fb>] syscall_call+0x7/0xb
Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b> 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44
EIP: [<c01825e6>] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4
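
A rough reading of the oops above, offered as a sketch rather than a
diagnosis: the bytes marked in the Code line, <8b> 40 10, decode on i386
as a 32-bit load from 0x10(%eax) into %eax. With eax = 0000000c that load
targets 0x1c, which matches the reported fault address. The Python
snippet below only redoes that arithmetic from values copied out of the
dump; the variable names are illustrative and nothing here identifies
which kernel structure held the bad pointer.

```python
import re

# Lines copied from the oops above (BUG line joined onto one line).
oops = """\
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000001c
EIP is at do_add_mount+0x66/0x130
eax: 0000000c   ebx: f3843f24   ecx: c24fbac0   edx: f443f550
"""

# Pull the fault address, eax, and the symbol+offset/size out of the text.
fault_addr = int(re.search(r"virtual\s+address ([0-9a-f]{8})", oops).group(1), 16)
eax = int(re.search(r"eax: ([0-9a-f]{8})", oops).group(1), 16)
symbol, offset, size = re.search(
    r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", oops
).groups()

# Faulting instruction bytes from the Code line: <8b> 40 10
#   0x8b = mov r32, r/m32
#   0x40 = ModRM: mod=01 (8-bit displacement), reg=000 (eax), rm=000 (eax)
#   0x10 = the displacement
opcode, modrm, disp8 = 0x8B, 0x40, 0x10
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7

# Effective address of the load: base register (eax) plus displacement.
effective = eax + disp8

print(f"fault in {symbol} at offset 0x{offset} of 0x{size}")
print(f"eax + 0x{disp8:x} = 0x{effective:x}, reported fault address 0x{fault_addr:x}")
```

In other words, the value just loaded into eax was a small non-pointer
number (0xc), and dereferencing it at offset 0x10 is what faulted.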

<node B rejoin kernel messages>
CMAN: Waiting to join or form a Linux-cluster
CMAN: sending membership request (message repeated 30 times)
CMAN: Been in JOINWAIT for too long - giving up
CMAN: sendmsg failed: -22
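
For reference, the -22 above looks like a negated errno value, which is
how kernel code conventionally reports errors; on Linux, errno 22 is
EINVAL ("Invalid argument"). A quick way to look such codes up, sketched
in Python (the helper name is made up for illustration, and the message
does not say which argument to sendmsg was considered invalid):

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code to its errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


print(describe_kernel_error(-22))
```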

<node A kernel messages>
CMAN: node blade14 rejoining
CMAN: too many transition restarts - will die
CMAN: we are leaving the cluster. Inconsistent cluster view