[Linux-cluster] cman bad generation number

Daniel McNeil daniel at osdl.org
Wed Jan 5 22:19:01 UTC 2005


On Wed, 2005-01-05 at 01:00, Patrick Caulfield wrote:
> On Tue, Jan 04, 2005 at 02:46:17PM -0800, Daniel McNeil wrote:
> > 
> > One thing I do not understand is that I am leaving the nodes in the
> > cluster and just doing mounting and umounting, so the generation number
> > should not be changing.
> > 
> > I think you are saying that the lock traffic is so high that the heartbeats
> > are lost, so the node being kicked out is seeing the new heartbeat
> > from the other nodes and doesn't know they are not receiving his
> > heartbeat messages.  This node must be seeing the other nodes'
> > heartbeat messages or it would have started a membership transition
> > without the other nodes.  Do I have this right?
> 
> Yes, I think. It's all a bit vague. If it wasn't I might have an answer by now
> :-(
>  
> > Shouldn't the heartbeat messages have higher priority
> > over the lock traffic messages? 
> 
> They do. That's why I am puzzled. I'm currently investigating if the heartbeat
> thread is being starved of CPU time by either the DLM or GFS.
>  
> > Shouldn't there be a way of throttling back the lock traffic and seeing
> > if heartbeat connection can be re-established before starting a
> > membership transition?
> 
> DLM & CMAN are not that tightly coupled.

Do DLM and CMAN use a common communication layer?

I was expecting that they would since having multiple
interfaces for redundancy would be something they
would both want.  DLM should just want to be able
to send messages to other nodes and shouldn't care
how they get there.  I was expecting this to be
part of CMAN since it should know which interfaces are
connected to which nodes and their state.  It could
also load balance on multiple networks.  Is there a
description of how multiple interfaces are handled today?
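
As a rough illustration of the kind of shared layer described above, here
is a toy sketch. It is not how CMAN or DLM actually work, and every name in
it (CommLayer, Link, add_link, set_state, send) is made up for the example.
It only shows the idea: callers send to a node by name, and the layer
tracks which interfaces reach that node, skips the ones that are down, and
spreads traffic across the rest.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """One network path to a peer node (hypothetical, not a CMAN structure)."""
    interface: str
    up: bool = True


@dataclass
class CommLayer:
    """Toy shared messaging layer that a lock manager and a membership
    service could both call, so neither needs to know about interfaces."""
    links: dict = field(default_factory=dict)   # node name -> list of Link
    _next: dict = field(default_factory=dict)   # node name -> round-robin counter
    sent: list = field(default_factory=list)    # record of (node, interface, payload)

    def add_link(self, node, interface):
        """Register an interface that can reach the given node."""
        self.links.setdefault(node, []).append(Link(interface))

    def set_state(self, node, interface, up):
        """Mark one interface to a node as up or down."""
        for link in self.links.get(node, []):
            if link.interface == interface:
                link.up = up

    def send(self, node, payload):
        """Pick the next usable link round-robin and record the send.

        Returns the interface used; raises ConnectionError if no link is up.
        No real network I/O happens here.
        """
        usable = [link for link in self.links.get(node, []) if link.up]
        if not usable:
            raise ConnectionError(f"no usable link to {node}")
        i = self._next.get(node, 0)
        link = usable[i % len(usable)]
        self._next[node] = i + 1
        self.sent.append((node, link.interface, payload))
        return link.interface
```

With two interfaces registered for a node, successive sends alternate
between them, and marking one down makes all traffic fall back to the
other without the caller changing anything.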

Thanks,

Daniel



