[Linux-cluster] Re: SMP and GFS
Axel Thimm
Axel.Thimm at ATrpms.net
Mon Oct 3 11:40:17 UTC 2005
On Mon, Oct 03, 2005 at 12:02:40PM +0100, Patrick Caulfield wrote:
> Axel Thimm wrote:
> >>showing that a node has been kicked out of the cluster for not responding
> >>quickly enough to messages. You could try increasing the value in
> >>
> >>/proc/cluster/config/cman/max_retries
> >
> > I know, but that doesn't explain the einval messages, or does it? Or
> > formulated differently: the einval messages show that the dual Xeon
> > box had some issues with sockets and its being kicked out could be
> > just a symptom of that.
>
> It probably does explain them. If the node is kicked out of the cluster, the DLM
> starts returning -EINVAL from lock ops (because the lockspace no longer exists).
> This very often causes the GFS lock_dlm module to oops.
>
>
> The bugzillas are confused about this but it sort-of exists as
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165160
Thanks, that bugzilla explains a lot. It's the same situation as
Corey's: two nodes were shut down, quorum was lost, and one of the two
removed nodes was using the filesystem with lock_dlm on it. So it
panicked.
It all very much makes sense now. The two remaining issues are
o why did the network interface blow up twice and kill the
  communication between the nodes (it looks like it once killed
  all UDP communication permanently, including syslog)? We replaced
  all cabling and switches; the next step is to use a dedicated GBit
  network only for cman/dlm. That's of course something we need to
  investigate and should not be an issue with GFS.
o why did the filesystem desync across members? That may be a
  consequence of the previous cman/dlm failures and kernel panics,
  or of the broken networking between the nodes. In either case,
  while the trigger seems to be the networking between the nodes,
  filesystem inconsistency should not happen and points to some
  bug in GFS. See also
  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169693
BTW what is "revolver"? Is that a stress test used at RH for GFS?
Would it be possible to share this tool?
Thanks!
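
P.S. For anyone following along, here is a minimal sketch of how the
max_retries tunable quoted above could be inspected and raised. The
path is the one Patrick gave; the helper names are hypothetical, and I
am assuming the file holds a single integer and accepts a plain write
as root. Treat it as an illustration, not a tested procedure.

```python
from pathlib import Path

# Path as quoted earlier in this thread; it only exists on a node
# where this generation of cman is loaded.
MAX_RETRIES = Path("/proc/cluster/config/cman/max_retries")


def read_tunable(path=MAX_RETRIES):
    """Return the current value as an int, or None if the file is absent.

    Assumption: the file contains a single integer.
    """
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


def write_tunable(value, path=MAX_RETRIES):
    """Write a new integer value (requires root on a real node)."""
    path.write_text(f"{int(value)}\n")


if __name__ == "__main__":
    current = read_tunable()
    if current is None:
        print("cman tunable not found on this machine")
    else:
        print(f"max_retries is currently {current}")
```

Raising the value only gives a slow node more time to answer; it
would not address whatever is breaking the network underneath.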
--
Axel.Thimm at ATrpms.net