[Linux-cluster] Re: SMP and GFS

Axel Thimm Axel.Thimm at ATrpms.net
Mon Oct 3 11:40:17 UTC 2005


On Mon, Oct 03, 2005 at 12:02:40PM +0100, Patrick Caulfield wrote:
> Axel Thimm wrote:
> >>showing that a node has been kicked out of the cluster for not responding
> >>quickly enough to messages. You could try increasing the value in
> >>
> >>/proc/cluster/config/cman/max_retries
> > 
> > I know, but that doesn't explain the einval messages, or does it? Or
> > formulated differently: the einval messages show that the dual Xeon
> > box had some issues with sockets and its being kicked out could be
> > just a symptom of that.
> 
> It probably does explain them. If the node is kicked out of the cluster, the DLM
> starts returning -EINVAL from lock ops (because the lockspace no longer exists).
> This very often causes the GFS lock_dlm module to oops.
> 
> 
> The bugzillas are confused about this but it sort-of exists as
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=165160

Thanks, that bugzilla explains a lot. It's the same situation as
Corey's: two nodes were shut down, quorum was lost, and one of the two
nodes removed was using the filesystem and had lock_dlm active on
it. So it panicked.
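
For anyone else who gets evicted like this: the max_retries tunable
Patrick mentions above can be inspected and raised roughly as
follows. This is only a sketch. The path is the one from his message;
the helper names and the CMAN_RETRIES_FILE override are made up for
illustration, and I am not suggesting any particular value.

```shell
# Path as quoted in the thread. CMAN_RETRIES_FILE is a hypothetical
# override so the helpers can be tried against an ordinary file.
CMAN_RETRIES_FILE="${CMAN_RETRIES_FILE:-/proc/cluster/config/cman/max_retries}"

# Print the current retry limit.
show_max_retries() {
    cat "$CMAN_RETRIES_FILE"
}

# Write a new retry limit (needs root on a real cluster node).
# Rejects anything that is not a plain non-negative integer.
set_max_retries() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: set_max_retries <integer>" >&2
            return 1
            ;;
    esac
    echo "$1" > "$CMAN_RETRIES_FILE"
}
```

On a node one would run show_max_retries first, then something like
"set_max_retries 10" (10 being an arbitrary example). A value written
to /proc this way would presumably not survive a reboot or a cman
restart.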

It all very much makes sense now. The two remaining issues are

o why did the network interface blow up twice and kill the
  communication between the nodes (it looks like it once killed
  all UDP communication permanently, including syslog)? We replaced
  all cabling and switches; the next step is to use a dedicated GBit
  network only for cman/dlm. That's of course something we need to
  investigate, and it should not be an issue with GFS.

o why did the filesystem desync across members? That may or may not be
  a consequence of the previous cman/dlm failures and kernel panics,
  or may be a consequence of the broken networking between the
  nodes. In both cases, while the triggering problem seems to be in the
  networking between the nodes, filesystem inconsistency should not
  happen, and it points to some bug in GFS. See also

  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169693

BTW what is "revolver"? Is that a stress test used at RH for GFS?
Would it be possible to share this tool?

Thanks!
-- 
Axel.Thimm at ATrpms.net