[Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device

David Brieck Jr. dbrieck at gmail.com
Mon Oct 23 17:32:48 UTC 2006


On 10/23/06, Benjamin Marzinski <bmarzins at redhat.com> wrote:
> On Mon, Oct 23, 2006 at 09:48:34AM -0400, David Brieck Jr. wrote:
> > On 10/10/06, David Brieck Jr. <dbrieck at gmail.com> wrote:
> > >On 9/29/06, David Brieck Jr. <dbrieck at gmail.com> wrote:
> > >> On 9/28/06, David Teigland <teigland at redhat.com> wrote:
> > >> >
> > >> > Could you try it without multipath?  You have quite a few layers there.
> > >> > Dave
> > >> >
> > >> >
> > >>
> > >> Thanks for the response. I unloaded gfs, clvm, gnbd and multipath, then
> > >> reloaded gnbd, clvm and gfs. It was only talking to one of the gnbd
> > >> servers and without multipath. Here's the log from this crash. It
> > >> seems to have more info in it.
> > >>
> > >> I'm kinda confused why it still has references to multipath though. I
> > >> unloaded the multipath module so I'm not sure why it's still in there.
> > >> SNIP
> > >
> > >Since I didn't hear back from anyone I decided to try things a little
> > >differently. Instead of rsyncing on the local machine, I ran the rsync
> > >from another cluster member that also mounts the same partition I was
> > >trying to move things to.
> > >
> > >So instead of
> > >
> > >rsync -a /path/to/files/ /mnt/http/
> > >
> > >I used
> > >
> > >rsync -a root@10.1.1.121::/path/to/files/ /mnt/http/
> > >
> > >and I didn't have a crash at all. Why would this not cause a problem
> > >when the first one did? Is this more of an rsync problem, maybe? I do
> > >have 2 NFS exports; could those have been causing the problem?
> > >
> > >Thanks.
> > >
> >
> > I am now seeing other crashes. Here's one from this weekend. It's huge
> > and I'm not even sure how to see where things might have gone wrong.
> > At the time of the crash the system was doing a backup to the gnbd
> > server (if that helps). I'd also appreciate it if someone could
> > explain how to read this trace to find where the problem actually started.
>
> I'd bet that the problem started exactly where the crash says it did,
> with a stack overflow. This is with running gfs on top of clvm on top of
> multipath on top of gnbd, right?

Yes, that's correct: GFS, CLVM, multipath, and the GNBD client.
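
For reference, here's roughly how the client side is layered (the server,
device and volume names below are just placeholders, not my actual config):

  # on the gnbd server: export the raw device
  gnbd_export -d /dev/sdb1 -e web_store

  # on this client node: import everything that server exports
  gnbd_import -i gnbdserver1

  # multipath builds a dm device on top of the imported gnbds,
  # clvmd/LVM2 sits on top of that, and gfs is mounted on the LV
  multipath -ll
  vgchange -ay webvg
  mount -t gfs /dev/webvg/httplv /mnt/http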

> So for IO to complete, it needs to go through
> gfs, device-mapper, and gnbd functions. It's possible that some functions in
> those modules aren't incredibly efficient with their stack space usage (A
> number of functions like this have been found and fixed in GFS over the years).
> Since it's a pretty long function stack, it wouldn't take much waste in
> a couple of functions to put this over the edge.
>
> Which means that this setup probably needs some testing, and functions need
> auditing.
>
> Just to make sure of some things:
>
> This is not a crash on a gnbd server node, correct? More specifically, you
> aren't running a gnbd client and server on the same node for the same device.
> That is bad. On a gnbd server machine, you cannot gnbd_import the devices that
> you just gnbd_exported from that machine, and there's no reason to anyway.

This is a GNBD client node and not a server. My server simply mounts the
device directly and does not re-import its own export.
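
In other words, roughly (the exact flags are from memory, so they may be
slightly off):

  # on the gnbd server: list what is exported; nothing is imported there
  gnbd_export -l

  # on this client node: list what is imported; nothing is exported here
  gnbd_import -l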

>
> Is anything else running that would affect the gfs file system on this node?
> You mentioned NFS earlier. Are the gnbd client machines using NFS to serve
> up the gfs file system?

The node whose crashes started this whole thing does have an NFS
export; however, this is a different node, which does not have
anything exported or imported via NFS.
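
Put another way, nothing NFS-related shows up on this node with checks
like these:

  # nothing is exported over NFS from this node...
  exportfs -v
  showmount -e localhost

  # ...and no nfs filesystems are mounted on it either
  grep " nfs " /proc/mounts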

>
> Are you running bonded ethernet?

Yes, there are two dual-port Intel NICs in each of my nodes. Two ports are
bonded for the heartbeat, gnbd exports, cluster manager, etc., and the other
two are bonded for normal network traffic. I have two switches that are
trunked together and set up with two VLANs, one for cluster traffic and
another for normal traffic. So in effect each bonded channel has one port in
one switch and one in the other, allowing me to lose an entire switch and
stay online.
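
In case it matters for the stack question, the bonding is just the usual
RHEL-style setup, roughly like this (the mode, address and interface names
here are only an illustration, not copied from my actual configs):

  # /etc/modprobe.conf
  alias bond0 bonding
  alias bond1 bonding
  options bonding mode=active-backup miimon=100 max_bonds=2

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (cluster/gnbd VLAN)
  DEVICE=bond0
  IPADDR=10.1.1.50
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (one of bond0's two slaves)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none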

> There's nothing wrong with that, it just adds
> more functions to the stack. That is an amazingly ugly stack trace, and I'm
> trying to figure out what all is on there.

As far as stack traces go, I'm pretty much a novice; it's honestly
not that often (if ever) that I've had to investigate why a Linux box
crashed. This one is just amazingly huge compared to the one I posted
before and to others I've run into.
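
If it would help narrow this down, I could rebuild the kernel on that node
with the stack debugging options turned on. From my reading of the config
help it would be something like this, so correct me if these are the wrong
knobs:

  # kernel .config fragment
  CONFIG_DEBUG_STACKOVERFLOW=y    # log a warning when a task runs low on stack
  CONFIG_DEBUG_STACK_USAGE=y      # report per-task stack usage in the sysrq-t dump

  # then, while the backup/rsync is running:
  echo t > /proc/sysrq-trigger
  dmesg | grep -i stack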

>
> > Thanks again.
> >