[Linux-cluster] cman bad generation number

Wed Dec 22 09:08:32 UTC 2004

On Tue, Dec 21, 2004 at 10:34:41AM -0800, Daniel McNeil wrote:
> Another test run that managed 52 hours before hitting a cman bug:
> 
> cl032:
> Dec 18 19:56:05 cl032 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
> Dec 18 19:56:06 cl032 kernel: CMAN: killed by STARTTRANS or NOMINATE
> Dec 18 19:56:06 cl032 kernel: CMAN: we are leaving the cluster.
> Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 2
> Dec 18 19:56:07 cl032 kernel: dlm: closing connection to node 3
> Dec 18 19:56:07 cl032 kernel: SM: 00000001 sm_stop: SG still joined
> Dec 18 19:56:07 cl032 kernel: SM: 0100081e sm_stop: SG still joined
> Dec 18 19:56:07 cl032 kernel: SM: 0200081f sm_stop: SG still joined
> 
> cl031:
> Dec 18 19:56:02 cl031 kernel: CMAN: node cl032a is not responding - removing from the cluster
> Dec 18 19:56:06 cl031 kernel: CMAN: Being told to leave the cluster by node 1
> Dec 18 19:56:06 cl031 kernel: CMAN: we are leaving the cluster.
> Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 2
> Dec 18 19:56:07 cl031 kernel: dlm: closing connection to node 3
> Dec 18 19:56:07 cl031 kernel: SM: 00000001 sm_stop: SG still joined
> Dec 18 19:56:07 cl031 kernel: SM: 0100081e sm_stop: SG still joined
> 
> cl030:
> Dec 18 19:56:05 cl030 kernel: CMAN: bad generation number 10 in HELLO message, expected 9
> Dec 18 19:56:06 cl030 kernel: CMAN: Node cl031a is leaving the cluster, Shutdown
> Dec 18 19:56:06 cl030 kernel: CMAN: quorum lost, blocking activity
> 
> Looks like cl032 had the most problems.  It hit a bunch of asserts:
> $ grep BUG cl032.messages
> Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:400!
> Dec 18 19:56:48 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> Dec 18 20:01:06 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> Dec 18 20:01:07 cl032 kernel: kernel BUG at /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c:342!
> 
> Questions:
> Any ideas on what is going on here?
> 
> How does one know what the current "generation" number is?

You don't, cman does. It's the current "generation" of the cluster, which is
incremented for each state transition. Are you taking nodes up and down during
these tests?
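
As a rough illustration of the idea (a toy model only, not cman's actual code;
the Node class and its method names are made up for this sketch), a node that
missed a state transition will see a HELLO carrying a generation it does not
expect:

```python
class Node:
    """Toy cluster member that tracks the cluster generation (hypothetical)."""

    def __init__(self, name, generation=0):
        self.name = name
        self.generation = generation

    def state_transition(self):
        # Every state transition (e.g. a node joining or leaving) bumps
        # the generation by one.
        self.generation += 1

    def receive_hello(self, sender_generation):
        # A HELLO with a different generation means sender and receiver
        # disagree about how many transitions have happened.
        if sender_generation != self.generation:
            return ("bad generation number %d in HELLO message, expected %d"
                    % (sender_generation, self.generation))
        return None


a = Node("cl030", generation=9)
b = Node("cl031", generation=9)
b.state_transition()                 # b went through a transition that a missed
msg = a.receive_hello(b.generation)  # a still expects generation 9
print(msg)
```

One possible reading of the logs above, and it is only an assumption, is that
cl031 went through a transition when it removed cl032a while the other two
nodes had not yet caught up.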

It does seem that cman is susceptible to heavy network traffic, despite my best
efforts to increase its priority. I'm going to check in a change that will allow
you to change the retry count, but it's a bit of a hack really.

> When CMAN gets an error, it is not shutting down all the cluster
> software correctly.  GFS is still mounted and anything accessing
> it is hung.  For debugging it is ok for the machine to stay up
> so we can figure out what is going on, but for a real operational
> cluster this is very bad.  In normal operation, if the cluster
> hits a bug like this, shouldn't it just reboot, so hopefully
> all the other nodes can recover?

If you have power switch fencing and the remainder of the cluster is quorate then
surely the failed node should be powercycled?
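
A minimal sketch of that decision, assuming one vote per node and a simple
majority rule for quorum (the function names are hypothetical and this is not
how the fence daemon is actually implemented):

```python
def is_quorate(votes_present, expected_votes):
    # Assumption: quorate means strictly more than half of the expected votes.
    return votes_present > expected_votes // 2


def nodes_to_fence(members, failed, expected_votes):
    # Only a quorate remainder is allowed to fence (powercycle) failed nodes.
    survivors = [n for n in members if n not in failed]
    if is_quorate(len(survivors), expected_votes):
        return sorted(failed)
    return []  # quorum lost: nobody is entitled to fence


cluster = ["cl030", "cl031", "cl032"]

# One node fails, two remain: the remainder is quorate and fences it.
one_failed = nodes_to_fence(cluster, {"cl032"}, expected_votes=3)

# Two nodes leave, one remains: quorum is lost, so nothing gets fenced.
two_failed = nodes_to_fence(cluster, {"cl031", "cl032"}, expected_votes=3)

print(one_failed, two_failed)
```

Under these assumptions the second case matches what cl030 logged above: with
both other nodes gone it lost quorum and blocked activity, so no node was left
to do the powercycling.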

patrick