[Linux-cluster] strange cluster behavior

brem belguebli brem.belguebli at gmail.com
Wed Mar 3 02:11:50 UTC 2010


Hi,

I experienced a strange cluster behavior that I couldn't explain.

I have a 4-node RHEL 5.4 cluster (node1, node2, node3 and node4).

Node1 and node2 are connected to one Ethernet switch (sw1); node3 and
node4 are connected to another switch (sw2). All 4 nodes are on the same
VLAN.

sw1 and sw2 are connected through a pair of core switches, and the nodes'
VLAN is properly propagated across the network I just described.

Latency between node1 and node4 (on 2 different switches) doesn't exceed
0.3 ms.

The cluster is normally configured with an iSCSI quorum device located on
another switch.

I wanted to check how the cluster would behave with the quorum disk
inactive (removed from cluster.conf) when a member node becomes isolated
(link up, but on the wrong VLAN).

Node3 is the one I played with.

The fence device for this node is intentionally misconfigured so that I
can follow what happens on the node's console.

When I change the VLAN membership of node3, the results are as expected:
the 3 remaining nodes see it go offline after the totem timer expires, and
node1 (lowest node ID) starts trying to fence node3 (without success, since
the fence device is intentionally misconfigured).

Node3 sees itself as the only member of the cluster, which is inquorate.
That is consistent, since it became a single-node partition.

When I restore node3's VLAN configuration to the correct value, things go
wrong.

Node1, node2 and node4 instruct cman on node3 to kill itself because it
reappeared with already-existing state. Fair enough.

Node1 and node2 then report that the quorum is dissolved and see themselves
as offline (????), node3 as offline and node4 as online.

Node4 sees itself as online but the cluster as inquorate, since node1 and
node2 were lost as well.
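
For reference, the vote arithmetic behind these states can be sketched as
below. This is only a minimal sketch, not cman's actual code: it assumes
each node carries exactly one vote, that no quorum disk is configured, and
that the quorum threshold is a simple majority (expected_votes // 2 + 1).
The function names and partition labels are hypothetical, chosen for
illustration.

```python
# Sketch only: assumes one vote per node and no quorum disk,
# with quorum = expected_votes // 2 + 1 (simple majority).

def quorum_threshold(expected_votes):
    """Minimum number of votes a partition needs to be quorate."""
    return expected_votes // 2 + 1


def is_quorate(members, expected_votes, votes_per_node=1):
    """True if the partition formed by `members` holds enough votes."""
    return len(members) * votes_per_node >= quorum_threshold(expected_votes)


EXPECTED_VOTES = 4  # node1..node4, one vote each (assumption)

# The partitions described in this message.
partitions = {
    "node1+node2+node4 (node3 isolated)": ["node1", "node2", "node4"],
    "node3 alone": ["node3"],
    "node4 alone (node1 and node2 gone)": ["node4"],
}

for label, members in partitions.items():
    state = "quorate" if is_quorate(members, EXPECTED_VOTES) else "inquorate"
    print(f"{label}: {len(members)} vote(s), "
          f"need {quorum_threshold(EXPECTED_VOTES)}, {state}")
```

Under these assumptions the 3-node partition keeps quorum while node3 is
isolated, and node4 alone cannot. That matches what I observe, but it does
not explain why node1 and node2 drop out in the first place.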

I thought about potential multicast problems, but it behaves the same
way when cman is configured to use broadcast.

The same test run with qdisk enabled behaves normally: when node3 comes
back onto the network it is automatically rebooted (thanks to qdisk), and
the cluster remains stable.

Any idea why node1 and node2 go bad when node3 comes back?

Thanks

Brem  



  



