[Linux-cluster] corosync ring failure

Fri Jul 25 06:22:09 UTC 2014

>> the switch has no load, interface utilization is below 10%, no crc
>> errors on the ports and no errors in the log. On the same switch a
>> second cluster (four machines, similar config) is running fine.
>
> did you vlan the switches so the two clusters are "logically separate"?  if
> they're on the same VLAN they might interfere with each other...
>
> also i second Michael Schwartzkopff's suggestion of looking into Spanning
> Tree Protocol (STP).  if your switches (i'm assUming you're using two) are
> not stacked(1), you may be running into that, as well.

VLANs, spanning tree, and stacking are all in use.

Both clusters (actually three) are in the same VLANs: one VLAN for
cluster-internal traffic and one for external traffic, so there is one
internal ring and one external ring. On the internal ring each cluster
uses its own IP subnet. Communication is udpu (UDP unicast), so the
clusters should not interfere with each other.

Spanning tree shows no events; all ports are always in forwarding mode.
Since the ring failure happens every 40 seconds, I think spanning tree
is unlikely to be the cause.

I captured a wire dump on one of the nodes (432), but I can't really
make sense of it.

orf packets from node 431 arrive, and node 432 sends out orf packets to
node 430, so the ring looks fine.

I modified the cluster config to include

<dlm protocol="sctp"/> <!-- missed the note in "man cman" about this -->
<totem rrp_mode="active" /> <!-- missed this note also -->

I then rebooted all nodes (for another reason), and everything looks
fine now. I don't know whether the config change or the reboot fixed
it. I will capture another wire dump and compare.
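
For anyone finding this in the archive: a rough sketch of where those
two elements would sit in cluster.conf, assuming a cman-based cluster
using udpu with a second ring defined through altname. The cluster
name, node names, node IDs and config_version below are placeholders,
not the actual configuration, and fencing and other sections are
omitted.

    <cluster name="examplecluster" config_version="2">
      <!-- udpu = UDP unicast transport -->
      <cman transport="udpu"/>
      <!-- redundant ring mode, see the note in "man cman" -->
      <totem rrp_mode="active"/>
      <!-- DLM over SCTP when more than one ring is used -->
      <dlm protocol="sctp"/>
      <clusternodes>
        <clusternode name="node-a-int" nodeid="1">
          <!-- altname = this node's address on the second ring -->
          <altname name="node-a-ext"/>
        </clusternode>
        <clusternode name="node-b-int" nodeid="2">
          <altname name="node-b-ext"/>
        </clusternode>
      </clusternodes>
    </cluster>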

Greetings
   Christoph