[Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device

David Brieck Jr. dbrieck at gmail.com
Thu Sep 28 19:27:20 UTC 2006


On 9/28/06, David Brieck Jr. <dbrieck at gmail.com> wrote:
> On 9/28/06, David Brieck Jr. <dbrieck at gmail.com> wrote:
> > Here is our setup: 2 GNBD servers attached to a shared SCSI array. Each of the 9 nodes uses multipath to import the shared device from both servers. We are also using GFS on top of that for our shared storage.
> >
> > What is happening is that I need to transfer a large number of files (about 1.5 million) from a node's local storage to the network storage. I'm using rsync locally to move all the files. Originally my problem was that the OOM killer would start running partway through the transfer and the machine would then be unusable (however, it was still up enough that it wasn't fenced). Here is that log:
> > <snip>
> >
> >
> > I found a few postings saying that using the hugemem kernel would solve the problem (they claimed it was a known SMP bug by Red Hat), so all my systems are now running on that kernel. It did solve the out-of-memory problem, but it seems to have introduced some new ones. Here are the logs from the most recent crashes:
> >
> >
> > <snip>
> >
> > The GNBD servers stay online and don't have any problems; it's just the client where all the trouble is coming from. Is this a bug, or is something not set up right?
> >
> > If you need more info I'll be happy to provide it.
> >
> > Thanks.
>
>
> I just tried to move the same data by tar-ing it up to the network,
> same result. Again, this is about 94GB and 1.5 million files that I
> seem to be unable to move from local storage to shared. Anyone have
> any suggestions?
>

I forgot to include the kernel message; see below:

Sep 28 15:01:56 db2 kernel: do_IRQ: stack overflow: 460
Sep 28 15:01:56 db2 kernel:  [<02107c6b>] do_IRQ+0x49/0x1ae
Sep 28 15:01:56 db2 kernel:  [<f89e3574>] tcp_in_window+0x1c6/0x3ad [ip_conntrack]
Sep 28 15:01:56 db2 kernel:  [<f89e3d0e>] tcp_packet+0x338/0x412 [ip_conntrack]
Sep 28 15:01:56 db2 kernel:  [<f89e1c3b>] __ip_conntrack_find+0xf/0xa1 [ip_conntrack]
Sep 28 15:01:56 db2 kernel:  [<f89e24e6>] ip_conntrack_in+0x1dc/0x2a6 [ip_conntrack]
Sep 28 15:01:56 db2 kernel:  [<0228227b>] nf_iterate+0x40/0x81
Sep 28 15:01:56 db2 kernel:  [<022927d8>] dst_output+0x0/0x1a
Sep 28 15:01:56 db2 kernel:  [<02282581>] nf_hook_slow+0x47/0xbc
Sep 28 15:01:56 db2 kernel:  [<022927d8>] dst_output+0x0/0x1a
Sep 28 15:01:56 db2 kernel:  [<02293093>] ip_queue_xmit+0x395/0x3f9
Sep 28 15:04:39 db2 syslogd 1.4.1: restart.
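
To make the trace above easier to read at a glance, here is a minimal Python sketch that tallies the frames by module. The helper name frames_by_module and the "kernel" label for frames that carry no module tag are my own choices for illustration; this is not part of any existing tool, and it only counts frames, it does not diagnose the overflow.

```python
import re
from collections import Counter

# Frames copied from the trace above, with the syslog prefix dropped
# and the wrapped "[ip_conntrack]" tags joined back onto their lines.
TRACE = """\
[<02107c6b>] do_IRQ+0x49/0x1ae
[<f89e3574>] tcp_in_window+0x1c6/0x3ad [ip_conntrack]
[<f89e3d0e>] tcp_packet+0x338/0x412 [ip_conntrack]
[<f89e1c3b>] __ip_conntrack_find+0xf/0xa1 [ip_conntrack]
[<f89e24e6>] ip_conntrack_in+0x1dc/0x2a6 [ip_conntrack]
[<0228227b>] nf_iterate+0x40/0x81
[<022927d8>] dst_output+0x0/0x1a
[<02282581>] nf_hook_slow+0x47/0xbc
[<022927d8>] dst_output+0x0/0x1a
[<02293093>] ip_queue_xmit+0x395/0x3f9
"""

# address, symbol+offset/size, then an optional [module] tag
# (the tag may sit on the next line if the mail client wrapped it).
FRAME = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)

def frames_by_module(trace):
    """Count stack frames per module; untagged frames count as 'kernel'."""
    counts = Counter()
    for match in FRAME.finditer(trace):
        counts[match.group(3) or "kernel"] += 1
    return counts

if __name__ == "__main__":
    for module, n in frames_by_module(TRACE).most_common():
        print(f"{module}: {n}")
```

Run against the frames above, it separates the ip_conntrack frames from the ones in the core kernel, which shows how much of the visible call chain sits in connection tracking.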



