[Linux-cluster] CMAN: got WAIT barrier not in phase 1 TRANSITION.96 (2)

Tom Mornini tmornini at engineyard.com
Wed Oct 18 17:05:27 UTC 2006


On Oct 16, 2006, at 8:34 AM, Patrick Caulfield wrote:

> Tom Mornini wrote:
>> We're getting problems when adding cluster nodes to our cluster.

snip...

>> Oct 13 04:09:04 ey00-s00017 kernel: CMAN: Waiting to join or form a Linux-cluster
>> Oct 13 04:09:05 ey00-s00017 kernel: CMAN: sending membership request
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00025
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00019
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00030
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00024
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00010
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00016
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00004
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00011
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00005
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00009
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00002
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00015
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00014
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00008
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00003
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00006
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00012
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00013
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00007
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00001
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00000
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-04
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-05
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-03
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-00
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-01
>> Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-02
>> Oct 13 04:09:06 ey00-s00017 kernel: dlm: no version for "kcl_register_service" found: kernel tainted.
>> Oct 13 04:09:06 ey00-s00017 kernel: DLM 1.03.00 (built Sep  8 2006 03:50:23) installed
>> Oct 13 04:09:57 ey00-s00017 kernel: CMAN: node ey00-s00018 rejoining
>> Oct 13 04:17:18 ey00-s00017 kernel: CMAN: got WAIT barrier not in phase 1 TRANSITION.96 (2)
>
> That message should be harmless. Does it prevent the cluster from reaching quorum?


Hello Patrick / list. I've been working with Tom on this problem.

It doesn't prevent quorum, although after this point the new nodes
mysteriously can't seem to join the fence domain.  I've checked, and it
doesn't appear that anyone is trying to fence anyone else, so I'm at a
bit of a loss to explain what's going on.

The really bizarre thing is that the old nodes don't seem to interact
with the new ones even though the new ones have joined the cluster
(i.e. the fence domain on the old nodes shows running, while the fence
domain on a new node says joining indefinitely).  If you prod it enough
(start enough new nodes), the existing cluster eventually blows apart
(nodes start kicking each other out for inconsistency and the like).
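
For reference, here's roughly how we're watching this from each node
(just the stock cman 1.03 tools; I believe /proc/cluster/status is the
right path on these kernels):

    cman_tool services        # fence domain state on old vs. new nodes
    cman_tool nodes           # membership as each node sees it
    cat /proc/cluster/status  # vote totals and quorum state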

Let me explain a few things about our cluster:

We are running Xen.

The control VM for each node is in the cluster with 1 vote.

The application VMs are dynamically spawned and are entered into the
cluster.

The application VMs have 0 votes (so as to prevent one physical machine
from accidentally grabbing a quorum of votes if it has too many
application VMs running on it).

We are currently using fence_manual for debugging purposes (we have an
APC MasterSwitch to eventually use for fencing).
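
For concreteness, here's a trimmed sketch of the relevant parts of our
cluster.conf (cut down to one node of each type; the config_version and
fence device name are illustrative):

    <?xml version="1.0"?>
    <cluster name="ey00" config_version="42">
      <clusternodes>
        <!-- control VM: carries a real vote -->
        <clusternode name="ey00-00" votes="1">
          <fence>
            <method name="1">
              <device name="human" nodename="ey00-00"/>
            </method>
          </fence>
        </clusternode>
        <!-- application VM: zero votes, so one loaded physical
             machine can't accidentally grab quorum -->
        <clusternode name="ey00-s00017" votes="0">
          <fence>
            <method name="1">
              <device name="human" nodename="ey00-s00017"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice name="human" agent="fence_manual"/>
      </fencedevices>
    </cluster>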

We are experiencing the following problems:

After a certain size (about 20 cluster members) we start having serious
issues keeping the cluster together.  Nodes are sometimes kicked out for
having an inconsistent view, and there are often complaints about the
member count not matching between nodes as well.  Right now we have the
1.03 version of everything installed (it came packaged, and we are
trying to avoid building too much from scratch).

When a node starts up with an old cluster.conf, it never seems to update
automatically to the newer version.  If the file is updated while a node
is down, must it be manually synced before that node rejoins?
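
For what it's worth, this is roughly how we push a new config to the
nodes that are already up (if there's a more sanctioned procedure, I'd
love to hear it):

    # after bumping config_version in /etc/cluster/cluster.conf:
    ccs_tool update /etc/cluster/cluster.conf  # propagate to the running ccsd daemons
    cman_tool version -r 43                    # tell cman the new version (43 = new config_version)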

Finally, a random question.  When I'm debugging this stuff, I use
"cman_tool services" to keep tabs on things.  What do the values in the
Code column mean?

-- 

Jayson Vantuyl
Quality Humans, Inc.




