[Linux-cluster] strange cluster behavior

Xavier Montagutelli xavier.montagutelli at unilim.fr
Wed Mar 3 07:16:45 UTC 2010


On Wednesday 03 March 2010 03:11:50 brem belguebli wrote:
> Hi,
> 
> I experienced a strange cluster behavior that I couldn't explain.
> 
> I have a 4-node RHEL 5.4 cluster (node1, node2, node3 and node4).
> 
> Node1 and node2 are connected to an Ethernet switch (sw1), node3 and
> node4 are connected to another switch (sw2). The 4 nodes are on the same
> VLAN.
> 
> sw1 and sw2 are connected through a pair of core switches, and the
> nodes' VLAN is correctly propagated across the network I just described.
> 
> Latency between node1 and node4 (on 2 different switches) doesn't exceed
> 0.3 ms.
> 
> The cluster is normally configured with an iSCSI quorum device located
> on another switch.
> 
> I wanted to check how it would behave without the quorum disk (removed
> from cluster.conf) if a member node became isolated (link up but on the
> wrong VLAN).
> 
> Node3 is the one I played with.
> 
> The fence device for this node is intentionally misconfigured so that I
> can follow what happens on the node's console.
> 
> When I change the VLAN membership of node3, the results are as expected:
> the 3 remaining nodes see it go offline after the totem timer expires,
> and node1 (lowest node id) starts trying to fence node3 (without success,
> as the fence device is intentionally misconfigured).
> 
> Node3 sees itself as the only member of the cluster, which is inquorate.
> That is consistent, as it became a single-node partition.
> 
> When I put node3's VLAN configuration back to the right value, things go bad.

(My two cents)

You just put it back in the right VLAN, without restarting the host?

I did this kind of test (under RHEL 5.3), and things always go bad if a node 
that is supposed to be fenced is not really fenced and comes back. Perhaps 
this is intended behaviour to prevent "split brain" cases (even at the cost 
of the whole cluster going down)? Or perhaps it depends on how your 
misconfigured fence device behaves (does it return an exit status? If so, 
which one?).
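
To answer the exit status question, one option is to run the fence agent by 
hand and look at what it returns. Below is a minimal Python sketch of that 
check. The helper names (agent_exit_status, fence_succeeded) are mine, the 
command list is a placeholder rather than a real agent invocation, and I am 
assuming the usual convention that an agent exits with 0 on success and 
non-zero on failure.

```python
import subprocess
import sys


def agent_exit_status(cmd):
    """Run a command (e.g. a fence agent invoked by hand) and return its exit status.

    `cmd` is a list such as ["/path/to/agent", ...]; the path and arguments
    are placeholders that depend on your own fence device.
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode


def fence_succeeded(exit_status):
    # Assumption: 0 means the agent reported success, anything else a failure.
    return exit_status == 0


if __name__ == "__main__":
    # Stand-in for a misconfigured agent: a child process that exits with 1.
    status = agent_exit_status([sys.executable, "-c", "raise SystemExit(1)"])
    print("exit status:", status, "fence succeeded:", fence_succeeded(status))
```

If the misconfigured agent turned out to return 0, the cluster would believe 
node3 had been fenced, which would be a different situation from a fence 
operation that keeps failing.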

> 
> Node1, node2 and node4 instruct cman on node3 to kill itself, as it
> reappeared with existing state. Why not.
> 
> Node1 and node2 then say that the quorum is dissolved and see themselves
> offline (????), node3 offline and node4 online.
> 
> Node4 sees itself online but the cluster inquorate, as we also lost
> node1 and node2.
> 
> I thought about potential multicast problems, but it behaves the same
> way when cman is configured to broadcast.
> 
> The same test run with qdisk enabled behaves normally: when node3 gets
> back on the network it is automatically rebooted (thanks to qdisk) and
> the cluster remains stable.
> 
> Any idea why node1 and node2 go bad when node3 is back?
> 
> Thanks
> 
> Brem
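
For reference, the vote arithmetic behind the quorate/inquorate states you 
describe can be sketched as below. This assumes one vote per node, no quorum 
disk, and the usual majority rule (quorum = expected_votes / 2 + 1, integer 
division). The VOTES table and function names are only for illustration, they 
are not taken from your cluster.conf.

```python
# Assumption: each of the four nodes carries one vote and there is no qdisk.
VOTES = {"node1": 1, "node2": 1, "node3": 1, "node4": 1}


def quorum_threshold(expected_votes):
    """Smallest number of votes that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(partition, votes=VOTES):
    """Return True if the nodes in `partition` together hold a majority of votes."""
    expected = sum(votes.values())
    present = sum(votes[name] for name in partition)
    return present >= quorum_threshold(expected)


if __name__ == "__main__":
    # The partitions mentioned in the thread.
    for partition in (["node1", "node2", "node4"], ["node3"], ["node4"]):
        print(partition, "quorate" if is_quorate(partition) else "inquorate")
```

With 4 votes expected the threshold is 3, so node1 + node2 + node4 stay 
quorate while node3 is isolated, and node4 on its own cannot be quorate once 
node1 and node2 drop out. It does not explain why node1 and node2 drop out 
in the first place.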

-- 
Xavier Montagutelli                      Tel : +33 (0)5 55 45 77 20
Service Commun Informatique              Fax : +33 (0)5 55 45 75 95
Universite de Limoges
123, avenue Albert Thomas
87060 Limoges cedex



