[Linux-cluster] kernel oops on mount and sendmsg failed: -22

Fri Sep 22 07:36:16 UTC 2006

Dan B. Phung wrote:
> I have a two-node cluster: one node (node A) runs Linux kernel 2.6.11.12
> while the other (node B) runs 2.6.18.  Both are running cman_tool
> version 5.0.1.  I first start up node A, then node B joins.  Node A can
> mount the GFS file systems, but when node B tries to, it gets a kernel
> oops, which is pasted at the end of the email (see "KERNEL OOPS output").
> So I reboot node B and try to rejoin, but it seems to not be able to
> communicate with node A correctly, as if the cluster is in some stale
> state (see "node B rejoin kernel messages").  Upon viewing node A, it
> seemed to have received the join message, but it looks like it didn't
> send an ack or something, and then node A simply quits...(see "node A
> kernel messages").
> 
> I think the problem lies in my use of two different cluster software
> versions (even though --version doesn't say so), but the newest -rSTABLE
> doesn't compile with 2.6.11.12 anymore.  What is the recommended
> solution for a cluster that must run different kernel versions?
> 
> tia,
> dan
> 
> ---
> 
> <KERNEL OOPS output>
> 
> BUG: unable to handle kernel NULL pointer dereference at virtual address 0000001c
> printing eip:
> c01825e6
> *pde = 00000000
> Oops: 0000 [#1]
> PREEMPT SMP
> Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx
> firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod
> scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core
> serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror
> dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core
> usbcore tg3 thermal processor fan unix
> CPU:    2
> EIP:    0060:[<c01825e6>]    Tainted: GF     VLI
> EFLAGS: 00010293   (2.6.18 #1)
> EIP is at do_add_mount+0x66/0x130
> eax: 0000000c   ebx: f3843f24   ecx: c24fbac0   edx: f443f550
> esi: df907200   edi: 00000000   ebp: 00000000   esp: f3843df4
> ds: 007b   es: 007b   ss: 0068
> Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000)
> Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d df907200
>       f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe 00000000
>       c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 df98330c
> Call Trace:
> [<c018321d>] do_mount+0x33d/0x760
> [<c0175080>] link_path_walk+0x80/0x100
> [<c01507e3>] __handle_mm_fault+0x233/0x980
> [<c0150a86>] __handle_mm_fault+0x4d6/0x980
> [<c0147cdf>] __alloc_pages+0x4f/0x2f0
> [<c0147fad>] __get_free_pages+0x2d/0x40
> [<c0181ed7>] copy_mount_options+0x47/0x130
> [<c01836dd>] sys_mount+0x9d/0xe0
> [<c01031fb>] syscall_call+0x7/0xb
> Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b> 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44
> EIP: [<c01825e6>] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4
> 
> <node B rejoin kernel messages>
> CMAN: Waiting to join or form a Linux-cluster
> CMAN: sending membership request (message repeated 30 times)
> CMAN: Been in JOINWAIT for too long - giving up
> CMAN: sendmsg failed: -22
> 
> <node A kernel messages>
> CMAN: node blade14 rejoining
> CMAN: too many transition restarts - will die
> CMAN: we are leaving the cluster. Inconsistent cluster view

That's a known bug. Upgrade the kernel component of cman.
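
For anyone reading the "sendmsg failed: -22" line later: the kernel reports errors as negated errno values, and 22 is EINVAL (invalid argument). That fits a node sending a message the other side's cman does not accept, though the log itself does not say which argument was rejected. Below is a minimal sketch for decoding such lines. The helper name decode_kernel_errno and the assumption that the code appears as a trailing ": -N" are illustrative only and not part of cman.

```python
import errno
import os
import re


def decode_kernel_errno(line):
    """Decode a trailing negative errno such as ': -22' in a log line.

    Hypothetical helper for illustration; assumes the error code is the
    last token on the line, written as ': -<number>'.
    Returns (number, symbolic name, description), or None if no code is found.
    """
    match = re.search(r":\s*-(\d+)\s*$", line)
    if match is None:
        return None
    number = int(match.group(1))
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(number, "unknown")
    return number, name, os.strerror(number)


result = decode_kernel_errno("CMAN: sendmsg failed: -22")
no_code = decode_kernel_errno("CMAN: node blade14 rejoining")
print(result)
print(no_code)
```

The description string comes from the local C library, so its exact wording can vary between systems. The symbolic name is the more reliable thing to search for.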

-- 

patrick