[Linux-cluster] Hard lockups during file transfer to GNBD/GFS device

Thu Sep 28 16:15:43 UTC 2006

Here is our setup: 2 GNBD servers attached to a shared SCSI array. Each of
the 9 nodes uses multipath to import the shared device from both servers. We
are also using GFS on top of that for our shared storage.

I need to transfer a large number of files (about 1.5 million) from a node's
local storage to the network storage. I'm running rsync locally on the node
to move all the files. Originally my problem was that the OOM killer would
start running partway through the transfer and the machine would then be
unusable (however, it was still up enough that it wasn't fenced). Here is
that log:

Sep 27 12:21:43 db2 kernel: oom-killer: gfp_mask=0xd0
Sep 27 12:21:43 db2 kernel: Mem-info:
Sep 27 12:21:43 db2 kernel: DMA per-cpu:
Sep 27 12:21:43 db2 kernel: cpu 0 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 0 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 1 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 1 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 2 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 2 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 3 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 3 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 4 hot: low 2, high 6, batch 1
Sep 27 12:21:44 db2 kernel: cpu 4 cold: low 0, high 2, batch 1
Sep 27 12:21:53 db2 in[15473]: 1159374113||chericee at herr-sacco.com
|2852|timeout|1
Sep 27 12:21:54 db2 kernel: cpu 5 hot: low 2, high 6, batch 1
Sep 27 12:21:54 db2 kernel: cpu 5 cold: low 0, high 2, batch 1
Sep 27 12:21:54 db2 kernel: cpu 6 hot: low 2, high 6, batch 1
Sep 27 12:21:54 db2 kernel: cpu 6 cold: low 0, high 2, batch 1
Sep 27 12:21:54 db2 kernel: cpu 7 hot: low 2, high 6, batch 1
Sep 27 12:21:54 db2 kernel: cpu 7 cold: low 0, high 2, batch 1
Sep 27 12:21:54 db2 kernel: Normal per-cpu:
Sep 27 12:21:54 db2 kernel: cpu 0 hot: low 32, high 96, batch 16
Sep 27 12:21:54 db2 kernel: cpu 0 cold: low 0, high 32, batch 16
Sep 27 12:21:54 db2 kernel: cpu 1 hot: low 32, high 96, batch 16
Sep 27 12:21:54 db2 kernel: cpu 1 cold: low 0, high 32, batch 16
Sep 27 12:21:54 db2 kernel: cpu 2 hot: low 32, high 96, batch 16
Sep 27 12:27:59 db2 syslogd 1.4.1: restart.
Sep 27 12:27:59 db2 syslog: syslogd startup succeeded
Sep 27 12:27:59 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 27 12:27:59 db2 kernel: Linux version 2.6.9-42.0.2.ELsmp (
buildsvn at build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3))
#1 SMP Wed Aug 23 00:17:26 CDT 2006
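
For anyone reading along, the length of the outage can be read straight off
the log above by comparing the last message before each "syslogd ... restart"
line with the restart itself. Below is a minimal sketch of that check. The
function name restart_gaps and the hard-coded year are my own assumptions
(syslog timestamps carry no year), and the two sample lines are copied from
the log above.

```python
from datetime import datetime


def restart_gaps(lines, year=2006):
    """Yield (last_seen, restarted, gap_seconds) for each syslogd restart.

    The year is an assumption: classic syslog timestamps do not include one.
    """
    previous = None
    for line in lines:
        try:
            # The first 15 characters look like "Sep 27 12:21:54".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation line with no timestamp
        if "syslogd" in line and "restart" in line and previous is not None:
            yield previous, stamp, (stamp - previous).total_seconds()
        previous = stamp


sample = [
    "Sep 27 12:21:54 db2 kernel: cpu 2 hot: low 32, high 96, batch 16",
    "Sep 27 12:27:59 db2 syslogd 1.4.1: restart.",
]
for last_seen, restarted, gap in restart_gaps(sample):
    print(f"silent for {gap:.0f}s between {last_seen} and {restarted}")
```

On the two sample lines this reports a gap of 365 seconds, i.e. the node
logged nothing for about six minutes before syslogd came back up.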

I found a few postings saying that using the hugemem kernel would solve the
problem (they claimed it was a known Red Hat SMP bug), so all my systems
are now running that kernel. It did solve the out-of-memory problem, but
it seems to have introduced some new ones. Here are the logs from the most
recent crashes:

Sep 28 11:15:05 db2 kernel: do_IRQ: stack overflow: 412
Sep 28 11:15:05 db2 kernel:  [<02107c6b>] do_IRQ+0x49/0x1ae<1>Unable to
handle kernel NULL pointer dereference at virtual address
00000000
Sep 28 11:15:05 db2 kernel:  printing eip:
Sep 28 11:15:05 db2 kernel: 0212928c
Sep 28 11:15:05 db2 kernel: *pde = 00004001
Sep 28 11:15:05 db2 kernel: Oops: 0002 [#1]
Sep 28 11:15:05 db2 kernel: SMP
Sep 28 11:15:05 db2 kernel: Modules linked in: mptctl mptbase dell_rbu nfsd
exportfs lockd nfs_acl parport_pc lp parport autofs4 i2c_dev i2c_core
lock_dlm(U) gfs(U) lock_harness(U) dm_round_robin gnbd(U) dlm(U) cman(U)
sunrpc ipmi_devintf ipmi_si ipmi_msghandler iptable_filter iptable_mangle
iptable_nat ip_conntrack ip_tables md5 ipv6 dm_multipath joydev button
battery ac uhci_hcd ehci_hcd hw_random e1000 bonding(U) floppy sg
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm
sd_mod scsi_mod
Sep 28 11:15:05 db2 kernel: CPU:    1548750336
Sep 28 11:15:05 db2 kernel: EIP:    0060:[<0212928c>]    Not tainted VLI
Sep 28 11:15:05 db2 kernel: EFLAGS: 00010002   (2.6.9-42.0.2.ELhugemem)
Sep 28 11:15:05 db2 kernel: EIP is at internal_add_timer+0x84/0x8c
Sep 28 11:15:05 db2 kernel: eax: 00000000   ebx: 023b7900   ecx: 023b8680
edx: 02447620
Sep 28 11:15:05 db2 kernel: esi: 00000000   edi: 023b7900   ebp: 02ee0c94
esp: 48552fb4
Sep 28 11:15:05 db2 kernel: ds: 007b   es: 007b   ss: 0068
Sep 28 11:15:05 db2 kernel: Process  (pid: 1, threadinfo=48552000
task=6d641a00)
Sep 28 11:17:54 db2 syslogd 1.4.1: restart.
Sep 28 11:17:54 db2 syslog: syslogd startup succeeded
Sep 28 11:17:54 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 28 11:17:54 db2 syslog: klogd startup succeeded
Sep 28 11:17:54 db2 kernel: Linux version 2.6.9-42.0.2.ELhugemem
(buildsvn at build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3))
#1 SMP Wed Aug 23 00:38:38 CDT 2006

The GNBD servers stay online and don't have any problems; all the trouble is
on the client. Is this a bug, or is something not set up right?

If you need more info I'll be happy to provide it.

Thanks.