[Linux-cluster] cman bad generation number

Tue Jan 4 11:29:24 UTC 2005

On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote:
> > > 
> > > How does one know what the current "generation" number is?
> > 
> > You don't, cman does. It's the current "generation" of the cluster, which is
> > incremented for each state transition. Are you taking nodes up and down during
> > these tests?
> 
> The nodes are staying up.  I am mounting and umounting a lot.
> Any reason not to add the generation to /proc/cluster/status?  (It would help
> debugging at least.)

No reason at all not to, except that I really don't think it will tell anyone
anything useful. The cause of the problem is that the CMAN heartbeat messages
are being lost on a network flooded by lock traffic. Generation mismatches are
just a symptom of that.
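
To illustrate why a mismatch is only a symptom, here is a toy sketch (not
cman's actual code or protocol). It assumes each node keeps its own copy of
the generation and bumps it when it sees a state-transition message; the
Node class, the transition() helper, and the choice of cl032 as the node
that drops a message are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster member holding its own view of the generation."""
    name: str
    generation: int = 0

    def matches(self, sender_generation: int) -> bool:
        # A message is accepted only if the sender's generation equals ours.
        return sender_generation == self.generation


def transition(nodes, lost=()):
    """Apply one state transition; nodes named in `lost` never see the message."""
    for node in nodes:
        if node.name not in lost:
            node.generation += 1


nodes = [Node("cl030"), Node("cl031"), Node("cl032")]
transition(nodes)                   # every node sees this transition
transition(nodes, lost={"cl032"})   # assumed: cl032's copy is dropped on the wire

# cl030 now sends a message stamped with its generation; find who disagrees.
mismatched = [n.name for n in nodes if not n.matches(nodes[0].generation)]
print(mismatched)
```

In this sketch the stale node is perfectly healthy; it simply missed a
message, which is why exposing the number alone would not point at the
real cause.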

> 
> I currently have it set up for manual fencing and I have yet to see that
> work correctly.  This was a 3 node cluster.  cl032 got the bad
> generation number and cman was "killed by STARTTRANS or NOMINATE"
> cl030 got a bad generation number (but stayed up) and cl031 leaves
> the cluster because it says cl030 told it to.  So that leaves me
> with 1 node up without quorum.  I did not see any fencing messages.
> 
> Should the surviving node (cl030) have attempted fencing or does
> it only do that if it has quorum?

Ah, no: fencing will only happen if the cluster has quorum.
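
As a rough sketch of that rule, assuming one vote per node and a simple
majority threshold (the function names below are made up for illustration
and are not part of cman or the fence daemon):

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def may_fence(votes_present: int, expected_votes: int) -> bool:
    """Fencing is attempted only while the surviving partition is quorate."""
    return votes_present >= quorum_threshold(expected_votes)


# Three-node cluster, one vote each (assumed).
print(may_fence(1, 3))  # a lone survivor such as cl030
print(may_fence(2, 3))  # two nodes still up
```

With a single node left out of three, the threshold of two votes is not
met, so no fencing messages would be expected.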

> I do not seem to be able to keep cman up for much past 2 days if 
> I have my tests running.  (it stays up with no load, of course).
> My tests are not that complicated currently either.  Just tar, du
> and rm in separate directories from 1, 2 and then 3 nodes
> simultaneously.  Who knows what will happen if I add tests
> to cause lots of dlm lock conflict.
> How long does cman stay up in your testing?

I've never had iSCSI stay up long enough to find out :(

-- 

patrick