[Linux-cluster] cman bad generation number

Patrick Caulfield pcaulfie at redhat.com
Wed Jan 5 09:00:44 UTC 2005


On Tue, Jan 04, 2005 at 02:46:17PM -0800, Daniel McNeil wrote:
> 
> One thing I do not understand is that I am leaving the nodes in the
> cluster and just doing mounting and umounting, so the generation number
> should not be changing.
> 
> I think you are saying that the lock traffic is so high that the
> heartbeats are lost, so the node being kicked out is seeing the new
> heartbeat from the other nodes and doesn't know they are not receiving
> its heartbeat messages.  This node must be seeing the other nodes'
> heartbeat messages or it would have started a membership transition
> without the other nodes.  Do I have this right?

Yes, I think. It's all a bit vague. If it wasn't I might have an answer by now
:-(
 
> Shouldn't the heartbeat messages have higher priority
> over the lock traffic messages? 

They do. That's why I am puzzled. I'm currently investigating if the heartbeat
thread is being starved of CPU time by either the DLM or GFS.
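
To illustrate the starvation theory only (this is not cman code): if the
heartbeat sender is delayed long enough, the peers see a gap longer than their
dead-node timeout and evict the node, even though it never stopped receiving
their heartbeats. A minimal sketch, assuming hypothetical send timestamps in
seconds; the function names, the 5-second interval and the 21-second timeout
are made up for the example.

```python
def max_gap(send_times):
    """Largest interval between consecutive heartbeat send times (seconds)."""
    return max((b - a for a, b in zip(send_times, send_times[1:])), default=0.0)


def would_be_evicted(send_times, deadnode_timeout):
    """True if peers would have seen a silence longer than the timeout."""
    return max_gap(send_times) > deadnode_timeout


# Hypothetical traces: one heartbeat every 5 s, timeout of 21 s (assumed values).
healthy = [0, 5, 10, 15, 20]
starved = [0, 5, 10, 36, 41]  # sender thread got no CPU for 26 s

print(would_be_evicted(healthy, 21))
print(would_be_evicted(starved, 21))
```

Logging the real send times on the affected node and comparing the largest gap
against the configured timeout would show whether starvation is the cause.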
 
> Shouldn't there be a way of throttling back the lock traffic and seeing
> if heartbeat connection can be re-established before starting a
> membership transition?

DLM & CMAN are not that tightly coupled.

-- 

patrick



