[Linux-cluster] Spontaneous eviction of cluster node

Thu Dec 15 15:53:41 UTC 2011

Hello,

I'm currently investigating problems with qdisk time-outs at a site. The SAN is slow, which leads to the time-outs; additional hardware has been ordered. At the moment almost all clusters manage to stay on-line, mostly thanks to huge time-out values (150-300 seconds). We've seen disk time-outs of more than 100 seconds, but they've now dropped to a maximum of 15 seconds.

However, there is one cluster that keeps killing its primary node, and I'm unable to find out why. All that's logged on this cluster are the lines below:

Dec 15 00:15:10 node2 last message repeated 2 times
Dec 15 00:16:59 node2 qdiskd[3073]: <notice> Writing eviction notice for node 1
Dec 15 00:17:02 node2 qdiskd[3073]: <notice> Node 1 evicted
Dec 15 00:17:54 node2 openais[3040]: [TOTEM] The token was lost in the OPERATIONAL state.

Dec 15 00:15:10 node1 last message repeated 2 times
Dec 15 00:16:59 node1 openais[3297]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application
Dec 15 00:17:00 node1 openais[3297]: [SERV ] Unloading all openais components

The quorumd has an interval of 3 and a tko of 50 (to get a 150-second time-out while still receiving warnings for delays of more than 3 seconds).
The quorum_dev_poll is set to 300000, the same as the totem token value.
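
For reference, the timing arithmetic behind these settings can be sketched as follows. This is only a minimal Python sketch using the values quoted above; the variable and function names are illustrative and are not cluster.conf syntax, and it assumes the interval is in seconds and the token and poll values are in milliseconds.

```python
# Values taken from this post; names are illustrative, not cluster.conf syntax.
QDISK_INTERVAL_S = 3          # quorumd interval (assumed to be seconds per cycle)
QDISK_TKO = 50                # missed cycles before qdiskd evicts a node
QUORUM_DEV_POLL_MS = 300000   # quorum_dev_poll (assumed to be milliseconds)
TOTEM_TOKEN_MS = 300000       # totem token (assumed to be milliseconds)


def qdisk_timeout_s(interval_s, tko):
    """Time after which qdiskd considers a silent node dead."""
    return interval_s * tko


eviction_s = qdisk_timeout_s(QDISK_INTERVAL_S, QDISK_TKO)
token_s = TOTEM_TOKEN_MS / 1000
poll_s = QUORUM_DEV_POLL_MS / 1000

print("qdisk eviction time-out:", eviction_s, "s")
print("totem token time-out:   ", token_s, "s")
print("quorum_dev_poll:        ", poll_s, "s")
print("token / qdisk ratio:    ", token_s / eviction_s)
```

With these numbers the qdisk time-out is 150 seconds, and both the totem token and quorum_dev_poll are exactly twice that.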

I can't find any differences from the other clusters, which are now working fine, except for one big one: there are no resources configured. The set-up exists only to get GFS working. Both nodes have their own IP, nothing is shared (other than the GFS file systems, which are concurrently available), and there are no checks other than qdisk availability and totem tokens.

Is this behaviour to be expected when running a cluster without resources configured?

Greetings,

Jan Huijsmans