[Linux-cluster] Oops

Thu May 24 14:23:32 UTC 2007

On Thu, May 24, 2007 at 03:51:08PM +0200, Wagner Ferenc wrote:
> Hi,
> 
> I wasn't sure whether to send this to LKML or here, but DLM seems
> involved.  Please let me know if I'd better repost it to somewhere
> else.

Here is good.

> It's a vanilla 2.6.21 kernel patched by cluster-2.00.00 (with the
> three extra exports for GFS1).  Config attached.  The machine froze
> during the morning updatedb cronjob, which performed a recursive find
> into the shared GFS filesystem.  Two other nodes doing the same at the
> same time are still up.
> 
> I experienced a similar hang with cluster-1 not long ago, though that
> didn't lock up the whole machine, but the cluster software only.

Running updatedb on GFS, even on just one node (much less all of them),
is never going to be a good idea; our standard response is "don't do that".

> Please ask back if I didn't provide all information necessary.

I also ran into this bug last week and was testing some patches from
Patrick to try to figure it out.  I got distracted with other things but
will get back to it soon.  The test that hit it was a mount/unmount loop
running on four nodes.

Thanks for the good report.

Dave

> CPU:    2
> EIP:    0060:[<c012f476>]    Not tainted VLI
> EFLAGS: 00010213   (2.6.21gfs-xeon #2)
> EIP is at queue_work+0x2f/0x49
> eax: dfb176e4   ebx: 00000002   ecx: f7e66a80   edx: dfb176e0
> esi: 00000002   edi: e2bfa080   ebp: 00000000   esp: f7a91bb4
> ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
> Process dlm_recv/2 (pid: 10261, ti=f7a90000 task=c196aa50 task.ti=f7a90000)
> Stack: f798d434 f7c5a980 c026dc79 ab0ee1c1 e2bfa080 dfaea000 f798d434 00200000 
>        00000020 00000000 c1b6bd80 0101e520 e2bfa080 e2bfa080 c0272f90 000000d0 
>        0000000e f7c5a980 00000000 00000039 00000000 00000000 00000000 00000286 
> Call Trace:
>  [<c026dc79>] tcp_rcv_established+0x53a/0x7d1
>  [<c0272f90>] tcp_v4_do_rcv+0x28/0x2c5
>  [<c0275306>] tcp_v4_rcv+0x81b/0x88d
>  [<c02957a8>] packet_rcv_spkt+0x0/0x150
>  [<c024035d>] dev_hard_start_xmit+0x1be/0x21d
>  [<c025ccef>] ip_local_deliver+0x187/0x230
>  [<c025cb2f>] ip_rcv+0x409/0x442
>  [<c02958ed>] packet_rcv_spkt+0x145/0x150
>  [<c011b434>] __wake_up+0x32/0x43
>  [<c023ff15>] netif_receive_skb+0x2dc/0x350
>  [<f8879cfa>] tg3_poll+0x5b6/0x82f [tg3]
>  [<c0241a00>] net_rx_action+0x9d/0x1a8
>  [<c012608e>] __do_softirq+0x66/0xcc
>  [<c0126137>] do_softirq+0x43/0x51
>  [<c010648f>] do_IRQ+0x5c/0x71
>  [<c010474b>] common_interrupt+0x23/0x28
>  [<c0134e03>] down_read_trylock+0x10/0x1d
>  [<f8c9d90a>] dlm_receive_message+0xa2/0xc0b [dlm]
>  [<c023870d>] sock_common_recvmsg+0x3e/0x54
>  [<c02371ff>] sock_recvmsg+0xec/0x107
>  [<f8c9fe36>] dlm_process_incoming_buffer+0x11a/0x18c [dlm]
>  [<f8ca3e4c>] receive_from_sock+0x124/0x217 [dlm]
>  [<c010648f>] do_IRQ+0x5c/0x71
>  [<f8ca3b4e>] process_recv_sockets+0xf/0x15 [dlm]
>  [<c012f559>] run_workqueue+0x85/0x125
>  [<f8ca3b3f>] process_recv_sockets+0x0/0x15 [dlm]
>  [<c012fde7>] worker_thread+0xf9/0x124
>  [<c011d23f>] default_wake_function+0x0/0xc
>  [<c012fcee>] worker_thread+0x0/0x124
>  [<c013248a>] kthread+0xb2/0xdc
>  [<c01323d8>] kthread+0x0/0xdc
>  [<c0104993>] kernel_thread_helper+0x7/0x10
>  =======================
> Code: 64 8b 35 04 00 00 00 f0 0f ba 2a 00 19 c0 31 db 85 c0 75 2c 8d 41 08 39 41 08 8b 1d f4 94 39 c0 0f 45 de 8d 42 04 39 42 04 74 04 <0f> 0b eb fe 8b 01 f7 d0 8b 04 98 e8 34 ff ff ff bb 01 00 00 00 
> EIP: [<c012f476>] queue_work+0x2f/0x49 SS:ESP 0068:f7a91bb4
> Kernel panic - not syncing: Fatal exception in interrupt