[Linux-cluster] new cluster acting odd

Christine Caulfield ccaulfie at redhat.com
Tue Dec 2 08:46:29 UTC 2014


On 01/12/14 14:16, Megan . wrote:
> Good Day,
>
> I'm fairly new to the cluster world, so I apologize in advance for
> silly questions.  Thank you for any help.
>
> We decided to use this cluster solution in order to share GFS2 mounts
> across servers.  We have a newly set up 7-node cluster, but it is
> acting oddly.  It has 3 VMware guest hosts and 4 physical hosts (Dells
> with iDRACs).  They are all running CentOS 6.6.  I have fencing
> working (I'm able to run fence_node <node> and it fences
> successfully).  I do not have the GFS2 mounts in the cluster yet.
>
> When I don't touch the servers, my cluster looks perfect with all
> nodes online.  But when I start testing fencing, I have an odd problem
> where I end up with a split brain between some of the nodes.  They
> won't automatically fence each other when it gets like this.
>
> In the corosync.log for the node that gets split out, I see the totem
> chatter, but it seems confused and just keeps repeating the lines
> below over and over:
>
>
> Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>
> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 21 23 24 25 26 27 28 29 2a 2b 32
> ..
> ..
> ..
> Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
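First, to confirm what you're seeing: on a CentOS 6 cman stack the
quickest way to compare what each node believes about membership is to
run the stock cman/corosync tools on every node (nothing here is
specific to your setup) and check that the output matches everywhere:

   cman_tool status       # quorum state, expected vs. total votes on this node
   cman_tool nodes        # membership list: M = member, X = dead
   corosync-cfgtool -s    # totem ring status for this node

A split brain shows up as the nodes disagreeing about who is a member.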

The retransmit messages are the key to your problem, and nothing will be
fixed until you can get rid of them. As Digimer said, they are often
caused by a congested network, but they can also mean multicast traffic
is not being passed between nodes - a mix of physical and virtual nodes
could easily be contributing to that. The easiest way to prove it (and
possibly get the system working) is to switch from multicast to plain
UDP unicast by setting

<cman transport="udpu"/>

in cluster.conf. You'll need to do this on all nodes and then reboot the
whole cluster. All in all, it's probably easier than messing around
checking routers, switches and kernel routing parameters in a
mixed-mode cluster!
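
For reference, the attribute just goes on the existing <cman> element;
a minimal sketch of the relevant part of cluster.conf might look like
this (the cluster and node names below are placeholders, and
config_version has to be bumped whenever the file changes):

   <?xml version="1.0"?>
   <cluster name="mycluster" config_version="2">
     <cman transport="udpu"/>
     <clusternodes>
       <clusternode name="node1.example.com" nodeid="1">
         <fence>
           <!-- keep your existing fence method/device entries here -->
         </fence>
       </clusternode>
       <!-- ...and the other six nodes... -->
     </clusternodes>
     <fencedevices>
       <!-- existing fence device definitions stay as they are -->
     </fencedevices>
   </cluster>

ccs_config_validate will catch syntax errors before you restart, and if
you'd rather stay on multicast, running omping across all seven nodes
at the same time is a simple way to check whether multicast actually
passes between the VMware guests and the physical hosts.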

Chrissie



