[Linux-cluster] corosync ring failure

Thu Jul 24 07:30:01 UTC 2014

>>> I run a cluster with two corosync rings. One of the rings is marked
>>> faulty every forty seconds, only to recover a second later.
>>> The other ring is stable.
>>>
>>> I have no idea how I should debug this.
>>>
>>>
>>> We are running SL 6.5 with pacemaker 1.1.10, cman 3.0.12, corosync 1.4.1.
>>> The cluster consists of three machines. Ring1 is running on 10 Gigabit
>>> interfaces, Ring0 on 1 Gigabit interfaces. Neither ring leaves its
>>> respective switch.

>> Any logs in the switch? Is the multicast group being deleted/recreated?

> I believe there would be no multicast with the UDPU transport.

> Can you check whether any of the devices (servers and switches) is dropping
> UDP packets, be it due to congestion or damage?

The switch has no load: interface utilization is below 10%, there are no
CRC errors on the ports, and there are no errors in the log. On the same
switch a second cluster (four machines, similar config) is running fine.
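
For the server side of that question, the kernel's UDP counters can be
compared across one fault cycle. The following is only a minimal sketch,
assuming the nodes expose /proc/net/snmp (standard on Linux). The function
names and the 45 second window are illustrative choices, not anything
provided by corosync. Note that the counters are host-wide, so they cover
all UDP traffic and not just the totem ring.

```python
import time


def parse_udp_counters(snmp_text):
    """Return the 'Udp:' counters from /proc/net/snmp content as a dict of ints."""
    # The file holds a header line and a value line per protocol,
    # both starting with the same "Udp:" prefix.
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = udp_lines[0], udp_lines[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def counter_deltas(before, after):
    """Return only the counters that changed between two samples."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] != before[name]}


def sample_deltas(seconds=45, path="/proc/net/snmp"):
    """Sample the counters twice, 'seconds' apart (assumed to span one fault)."""
    with open(path) as handle:
        before = parse_udp_counters(handle.read())
    time.sleep(seconds)
    with open(path) as handle:
        after = parse_udp_counters(handle.read())
    return counter_deltas(before, after)
```

Calling sample_deltas() on each node while the ring flaps and looking for
growth in InErrors or RcvbufErrors would hint at drops on the host rather
than on the switch.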

Greetings
   Christoph