[Linux-cluster] Kernel panic

Mon Mar 10 14:28:45 UTC 2008

My cluster has just crashed yet again, but this time I was able to 
capture the full kernel panic.

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
  [<0000000000000000>] _stext+0x7ffff000/0x1000
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /kernel/dlm/rgmanager/control
CPU 1
Modules linked in: gfs(U) nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 ib_iser rdma_cm ib_cm iw_cm ib_addr ib_local_sa ib_sa ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport shpchp serio_raw tg3 sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage ata_piix libata sd_mod scsi_mod raid1 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 6215, comm: nfsd Not tainted 2.6.18-53.1.4.el5 #1
RIP: 0010:[<0000000000000000>]  [<0000000000000000>] _stext+0x7ffff000/0x1000
RSP: 0018:ffff81006abd56e8  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff8100210a4518 RCX: 0000000000000f88
RDX: 0000000000000000 RSI: ffff81000148e2c0 RDI: ffff81007f5757c0
RBP: ffff81000148e2c0 R08: 0400000000000000 R09: 0100000073747261
R10: 000000000c000000 R11: 0c000000c41d0000 R12: 0000000000000f88
R13: 0000000000000f88 R14: ffff81006a170078 R15: 0000000000000000
FS:  00002aaaab0166e0(0000) GS:ffff81007fe357c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process nfsd (pid: 6215, threadinfo ffff81006abd4000, task ffff81006afbc7e0)
Stack:  ffffffff8000fc3c 0000000000000f88 ffff81006abd5d08 0000000000000000
  0000000000000001 ffff81006abd5910 0000000000000001 00000f8800000000
  ffff81007f5757c0 ffff810021d855c0 ffff8100210a4518 ffff810021d854b0
Call Trace:
  [<ffffffff8000fc3c>] generic_file_buffered_write+0x4cb/0x6d8
  [<ffffffff8000ddd9>] current_fs_time+0x3b/0x40
  [<ffffffff80015dc6>] __generic_file_aio_write_nolock+0x36c/0x3b8
  [<ffffffff885c0a5d>] :gfs:gfs_dreread+0x72/0xc7
  [<ffffffff800be014>] generic_file_aio_write_nolock+0x20/0x6c
  [<ffffffff800be3e0>] generic_file_write_nolock+0x8f/0xa8
  [<ffffffff8009b492>] autoremove_wake_function+0x0/0x2e
  [<ffffffff885e7e68>] :gfs:gfs_trans_begin_i+0x13c/0x1b2
  [<ffffffff885db3a1>] :gfs:do_write_buf+0x443/0x67e
  [<ffffffff885dabb6>] :gfs:walk_vm+0x10e/0x311
  [<ffffffff885daf5e>] :gfs:do_write_buf+0x0/0x67e
  [<ffffffff8006108d>] wait_for_completion+0x1f/0xa2
  [<ffffffff885dae65>] :gfs:__gfs_write+0xac/0xc6
  [<ffffffff800d5ee7>] do_readv_writev+0x198/0x295
  [<ffffffff885daea8>] :gfs:gfs_write+0x0/0x8
  [<ffffffff885dc429>] :gfs:gfs_open+0x12c/0x15e
  [<ffffffff8857a77d>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
  [<ffffffff885dc2fd>] :gfs:gfs_open+0x0/0x15e
  [<ffffffff8001e115>] __dentry_open+0x101/0x1dc
  [<ffffffff8857aff1>] :nfsd:nfsd_write+0xb5/0xd5
  [<ffffffff88581c96>] :nfsd:nfsd3_proc_write+0xea/0x109
  [<ffffffff885771c4>] :nfsd:nfsd_dispatch+0xd7/0x198
  [<ffffffff883e1514>] :sunrpc:svc_process+0x44d/0x70b
  [<ffffffff800625bf>] __down_read+0x12/0x92
  [<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
  [<ffffffff885776fb>] :nfsd:nfsd+0x1ae/0x2db
  [<ffffffff8005bfb1>] child_rip+0xa/0x11
  [<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
  [<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
  [<ffffffff8005bfa7>] child_rip+0x0/0x11

Code:  Bad RIP value.
RIP  [<0000000000000000>] _stext+0x7ffff000/0x1000
  RSP <ffff81006abd56e8>
CR2: 0000000000000000
  <0>Kernel panic - not syncing: Fatal exception
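
For anyone skimming the trace above, a rough way to see which modules the frames belong to is to tally them. The following is only a minimal sketch: the names FRAME_RE and frames_by_module are hypothetical helpers made up for illustration (not part of any cluster or kernel tooling), and it assumes the 2.6.18-style frame format shown in this oops, where module frames look like ":gfs:symbol+0x.../0x..." and core-kernel frames have no module prefix.

```python
import re
from collections import Counter

# Hypothetical helper, not part of any Red Hat tooling.
# Matches frames such as
#   [<ffffffff885db3a1>] :gfs:do_write_buf+0x443/0x67e
#   [<ffffffff8000fc3c>] generic_file_buffered_write+0x4cb/0x6d8
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?::(?P<module>\w+):)?(?P<symbol>\w+)\+"
)


def frames_by_module(trace_text):
    """Count call-trace frames per module.

    Frames without a ':module:' prefix are counted under 'kernel'.
    """
    counts = Counter()
    for match in FRAME_RE.finditer(trace_text):
        counts[match.group("module") or "kernel"] += 1
    return counts


# A few frames copied from the trace above, used as sample input.
sample = """
  [<ffffffff8000fc3c>] generic_file_buffered_write+0x4cb/0x6d8
  [<ffffffff885c0a5d>] :gfs:gfs_dreread+0x72/0xc7
  [<ffffffff885db3a1>] :gfs:do_write_buf+0x443/0x67e
  [<ffffffff8857a77d>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
  [<ffffffff883e1514>] :sunrpc:svc_process+0x44d/0x70b
"""

for module, count in frames_by_module(sample).most_common():
    print(module, count)
```

A tally like this only shows where the call path runs (here nfsd, sunrpc, gfs and the core kernel); it does not by itself identify which component is at fault.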

I'm experiencing upwards of 8 crashes a day because of this.  What can I do 
about it?

Thanks,

James

On Wed, 5 Mar 2008, James Chamberlain wrote:

> Two of the three nodes in my CS/GFS cluster just crashed, which dissolved 
> quorum and allowed me to finally capture part of the kernel panic.  Here is 
> what was displayed on the screen:
>
> [<ffffffff885daea8>] :gfs:gfs_write+0x0/0x8
> [<ffffffff885cb2a7>] :gfs:gfs_glock_dq+0x15c/0x16c
> [<ffffffff885dc429>] :gfs:gfs_open+0x12c/0x15e
> [<ffffffff8857a77d>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
> [<ffffffff885dc2fd>] :gfs:gfs_open+0x0/0x15e
> [<ffffffff8001e115>] __dentry_open+0x101/0x1dc
> [<ffffffff8857aff1>] :nfsd:nfsd_write+0xb5/0xd5
> [<ffffffff88581c96>] :nfsd:nfsd3_proc_write+0xea/0x109
> [<ffffffff885771c4>] :nfsd:nfsd_dispatch+0xd7/0x198
> [<ffffffff883e1514>] :sunrpc:svc_process+0x44d/0x70b
> [<ffffffff800625bf>] __down_read+0x12/0x92
> [<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
> [<ffffffff885776fb>] :nfsd:nfsd+0x1ae/0x2db
> [<ffffffff8005bfb1>] child_rip+0xa/0x11
> [<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
> [<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
> [<ffffffff8005bfa7>] child_rip+0x0/0x11
>
>
> Code:  Bad RIP value.
> RIP  [<0000000000000000>] _stext+0x7ffff000/0x1000
> RSP <ffff81006ac9f6e8>
> CR2: 0000000000000000
>  <0>Kernel panic - not syncing: Fatal exception
>
>
> Is this enough to figure out what happened, and how can I prevent this from 
> happening in the future? I suspect that all the instability I have had with 
> my CS/GFS cluster is related to this sort of crash.  I am using the
> following on all three nodes:
>
> cman-2.0.73-1.el5_1.1
> openais-0.80.3-7.el5
> rgmanager-2.0.31-1.el5.centos
> lvm2-cluster-2.02.26-1.el5
> luci-0.10.0-6.el5.centos.1
> ricci-0.10.0-6.el5.centos.1
> kernel-2.6.18-53.1.4.el5
> gfs-utils-0.1.12-1.el5
> kmod-gfs-0.1.19-7.el5_1.1
>
> Thanks,
>
> James