[Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device

David Brieck Jr. dbrieck at gmail.com
Mon Oct 23 17:39:08 UTC 2006


On 10/23/06, David Brieck Jr. <dbrieck at gmail.com> wrote:
> On 10/23/06, Benjamin Marzinski <bmarzins at redhat.com> wrote:
> > On Mon, Oct 23, 2006 at 09:48:34AM -0400, David Brieck Jr. wrote:
> > > On 10/10/06, David Brieck Jr. <dbrieck at gmail.com> wrote:
> > > >On 9/29/06, David Brieck Jr. <dbrieck at gmail.com> wrote:
> > > >> On 9/28/06, David Teigland <teigland at redhat.com> wrote:
> > > >> >
> > > >> > Could you try it without multipath?  You have quite a few layers there.
> > > >> > Dave
> > > >> >
> > > >> >
> > > >>
> > > >> Thanks for the response. I unloaded gfs, clvm, gnbd and multipath, then
> > > >> reloaded gnbd, clvm and gfs. It was only talking to one of the gnbd
> > > >> servers and without multipath. Here's the log from this crash. It
> > > >> seems to have more info in it.
> > > >>
> > > >> I'm kinda confused why it still has references to multipath though. I
> > > >> unloaded the multipath module so I'm not sure why it's still in there.
> > > >> SNIP
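
For the archive, the teardown/reload above was roughly along these lines on
a RHEL4-style cluster node; the mount point, VG/LV name and gnbd server name
are placeholders, and the exact service and module names may differ on
another setup:

  # tear down, top of the stack first
  umount /mnt/http                   # stop all users of the GFS mount
  vgchange -aln                      # deactivate the clustered LVs on this node
  service clvmd stop                 # RHEL-style service names
  service multipathd stop
  multipath -F                       # flush remaining multipath maps
  gnbd_import -R                     # drop the imports (see gnbd_import(8))
  modprobe -r gfs lock_dlm dm_multipath gnbd

  # bring it back without the multipath layer, pointed at a single server
  modprobe gnbd
  gnbd_import -i gnbdserver1         # "gnbdserver1" is a placeholder
  service clvmd start
  vgchange -aly
  mount -t gfs /dev/vg_http/lv_http /mnt/http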
> > > >
> > > >Since I didn't hear back from anyone I decided to try things a little
> > > >differently. Instead of rsyncing on the local machine, I ran the rsync
> > > >from another cluster member that also mounts the same partition I was
> > > >trying to move things to.
> > > >
> > > >So instead of
> > > >
> > > >rsync -a /path/to/files/ /mnt/http/
> > > >
> > > >I used
> > > >
> > > >rsync -a root@10.1.1.121::/path/to/files/ /mnt/http/
> > > >
> > > >and I didn't have a crash at all. Why would this not cause a problem
> > > >when the first one did? Is this more of an rsync problem maybe? I do
> > > >have 2 NFS exports, could those have been causing the problem?
> > > >
> > > >Thanks.
> > > >
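
A side note on the two invocations, since the difference matters for where
the load lands: the local form does both the source reads and the /mnt/http
writes on this one node, while the pull form moves the source reads (and half
of rsync's work) onto the other member. The single-colon and double-colon
remote forms are also different transports; roughly, with the daemon module
name below as a placeholder:

  # local copy: all reads and writes happen on this node
  rsync -a /path/to/files/ /mnt/http/

  # pull over ssh from another cluster member (single colon)
  rsync -a root@10.1.1.121:/path/to/files/ /mnt/http/

  # pull from an rsync daemon on that member (double colon; needs a matching
  # "[files]" module stanza in that host's rsyncd.conf)
  rsync -a root@10.1.1.121::files/ /mnt/http/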
> > >
> > > I am now seeing other crashes. Here's one from this weekend. It's huge
> > > and I'm not even sure how to see where things might have gone wrong.
> > > At the time of the crash the system was doing a backup to the gnbd
> > > server (if that helps). I'd also appreciate it if maybe someone could
> > > explain how to read this to see where the problem actually started.
> >
> > I'd bet that the problem started exactly where the crash says it did,
> > with a stack overflow. This is with running gfs on top of clvm on top of
> > multipath on top of gnbd, right?
>
> Yes, that's correct, GFS, CLVM, Multipath, and GNBD client.
>
> > So for IO to complete, it needs to go through
> > gfs, device-mapper, and gnbd functions. It's possible that some functions in
> > those modules aren't incredibly efficient with their stack space usage (A
> > number of functions like this have been found and fixed in GFS over the years).
> > Since it's a pretty long function stack, it wouldn't take much waste in
> > a couple of functions to put this over the edge.
> >
> > Which means that this setup probably needs some testing, and functions need
> > auditing.
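
For whoever finds this in the archive later, one way to get a rough list of
the stack-hungry functions in those modules is the kernel's checkstack
script, which ships with the kernel source and just parses objdump output.
A sketch; the module names, paths and the i386 arch argument are assumptions
for this kind of setup:

  objdump -d $(find /lib/modules/$(uname -r) -name gfs.ko -o -name gnbd.ko \
                    -o -name 'dm-*.ko') |
      perl /usr/src/linux/scripts/checkstack.pl i386 | head -20

With CONFIG_DEBUG_STACK_USAGE enabled, the sysrq-t task dump also reports
each task's stack low-water mark, and on i386 it is worth checking whether
the kernel was built with CONFIG_4KSTACKS, since 4k of stack plus a
gfs/dm/gnbd/bonding call chain leaves very little headroom.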
> >
> > Just to make sure of some things:
> >
> > This is not a crash on a gnbd server node, correct? More specifically, you
> > aren't running a gnbd client and server on the same node for the same device.
> > That is bad. On a gnbd server machine, you cannot gnbd_import the devices that
> > you just gnbd_exported from that machine, and there's no reason to anyway.
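
To make the direction concrete: the export only ever lives on the server and
the import only ever happens on the clients. Roughly, per gnbd_export(8) and
gnbd_import(8), with the device, export and host names as placeholders:

  # on the gnbd server: export a local block device under a name
  gnbd_export -d /dev/sdb1 -e http_disk

  # on each gnbd client: import everything that server exports; the device
  # then appears under /dev/gnbd/ (here /dev/gnbd/http_disk)
  gnbd_import -i gnbdserver1

  # never run gnbd_import against the machine's own exports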
>
> This is a GNBD client node and not a server. The node simply mounts the
> imported device; nothing is exported and then re-imported on the same machine.
>
> >
> > Is anything else running that would affect the gfs file system on this node?
> > You mentioned NFS earlier. Are the gnbd client machines using NFS to serve
> > up the gfs file system?
>
> The node that was crashing that started this whole thing does have an
> NFS export; however, this is a different node, which does not have
> anything exported or imported via NFS.
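
For what it's worth, presumably the relevance of the NFS question is that
knfsd adds its own frames on top of the same gfs/device-mapper/gnbd call path
on whichever node serves NFS. The export itself is just an ordinary
/etc/exports entry for the GFS mount point; a sketch, with the network and
fsid values as placeholders (an explicit fsid= is commonly recommended when
exporting a cluster filesystem such as GFS so file handles stay consistent):

  # /etc/exports on the node that re-exports the GFS mount over NFS
  /mnt/http  10.1.1.0/24(rw,sync,fsid=97)
  # then apply with: exportfs -ra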
>
> >
> > Are you running bonded ethernet?
>
> Yes, there are two dual-port Intel NICs in each of my nodes. Two ports are
> bonded for the heartbeat, gnbd exports, cluster manager, etc., and the other
> two are bonded for normal network traffic. I have two switches that are
> then trunked and set up with two VLANs, one for cluster traffic and another
> for normal traffic. So in effect each bonded channel has one port in
> one switch and one in the other, allowing me to lose an entire switch
> and stay online.
>
> > There's nothing wrong with that, it just adds
> > more functions to the stack. That is an amazingly ugly stack trace, and I'm
> > trying to figure out what all is on there.
>
> As far as stack traces go, I'm pretty much a novice; it's honestly
> not that often (if ever) that I've had to investigate why a Linux box
> crashed. This one is just amazingly huge compared to the one I posted
> before and others I've run into.
>
> >
> > > Thanks again.
> > >
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
>

One other thing I wanted to mention: the bonded interfaces are mode=0
(round-robin). I'm not sure if that would have any effect, but it's worth
mentioning.
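
For reference, that bonding layout boils down to something like the following
on a RHEL4-style box; the interface names, miimon value and max_bonds setting
are placeholders, and mode=0 is the round-robin (balance-rr) policy:

  # /etc/modprobe.conf
  alias bond0 bonding
  alias bond1 bonding
  options bonding mode=0 miimon=100 max_bonds=2

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (one slave of bond0;
  # repeat for the other slaves and for bond1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

One general caveat with balance-rr is that it can deliver TCP segments out of
order across the two switches, which tends to show up as retransmits rather
than anything stack-related, but it is worth keeping in mind for the gnbd
traffic.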



