[Linux-cluster] totem token & post_fail_delay question

Tue Aug 26 08:23:14 UTC 2014

On 26/08/14 07:56, Vasil Valchev wrote:
> Hello,
>
> I have a cluster that sometimes has intermittent network issues on the
> heartbeat network.
> Unfortunately improving the network is not an option, so I am looking
> for a way to tolerate longer interruptions.
>
> Previously it seemed to me the post_fail_delay option is suitable, but
> after some research it might not be what I am looking for.
>
> If I am correct, when a member leaves (due to token timeout) the cluster
> will wait the post_fail_delay before fencing. If the member rejoins
> before that, it will still be fenced, because it has previous state?
> From a recent fencing on this cluster there is a strange message:
>
> Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl despite it rejoining the cluster with existing state, it has a lower node ID
>
> What does this mean?
>

It's an attempt by cman to sort out which node to kill in the situation
where a node rejoins too quickly. If both nodes sent a 'kill' message,
both would leave the cluster, leaving you with no active nodes. So cman
(and fencing) prioritise the node with the lowest nodeID as a
tie-break. You should see a corresponding message on the other node:
"Killing node %s because it has rejoined the cluster with existing state 
and has higher node ID"
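
As a rough illustration of that tie-break (a minimal sketch in Python, not
cman's actual code): each node compares its own node ID with the ID of the
peer that rejoined with existing state, and only the node with the lower ID
sends the kill. The function name and the node name "node2cl" are
hypothetical; the message texts are the two quoted in this thread.

```python
def tie_break_message(local_id: int, remote_id: int, remote_name: str) -> str:
    """Decide what the local node does about a peer that rejoined with state.

    Sketch only: the node with the lower ID kills the one with the higher ID,
    so exactly one of the two nodes survives.
    """
    if remote_id > local_id:
        # We hold the lower ID, so we win the tie-break and kill the peer.
        return (f"Killing node {remote_name} because it has rejoined the "
                "cluster with existing state and has higher node ID")
    # The peer holds the lower ID, so we leave it alone.
    return (f"Not killing node {remote_name} despite it rejoining the "
            "cluster with existing state, it has a lower node ID")


# View from node2 (ID 2) looking at node1cl (ID 1), as in the log above.
print(tie_break_message(2, 1, "node1cl"))
# View from node1 (ID 1) looking at a hypothetical peer "node2cl" (ID 2).
print(tie_break_message(1, 2, "node2cl"))
```

Because the comparison is the same on both sides, only one of the two nodes
ever takes the "Killing" branch.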

> And lastly is increasing the totem token timeout the way to go?
>

If there is no option for improving the network situation then, yes,
increasing the token timeout is probably your best option.
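
A minimal sketch of what that change might look like, assuming the usual
cman layout where cluster.conf has a <totem token="..."/> element (value in
milliseconds) under <cluster>, and where config_version must be incremented
on every edit. The sample config, the helper name and the 30000 ms value are
illustrative assumptions, not recommendations; check the cluster.conf
documentation for your release before applying anything like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed cluster.conf used only for illustration.
SAMPLE_CONF = """<cluster name="example" config_version="7">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
</cluster>"""


def set_token_timeout(conf_xml: str, token_ms: int) -> str:
    """Return conf_xml with the totem token set and config_version bumped."""
    root = ET.fromstring(conf_xml)
    totem = root.find("totem")
    if totem is None:
        # No <totem> element yet, so add one under <cluster>.
        totem = ET.SubElement(root, "totem")
    totem.set("token", str(token_ms))
    # Increment config_version so the edit counts as a new config revision.
    version = int(root.get("config_version", "0"))
    root.set("config_version", str(version + 1))
    return ET.tostring(root, encoding="unicode")


new_conf = set_token_timeout(SAMPLE_CONF, 30000)  # 30 s, an arbitrary example
print(new_conf)
```

The updated file still has to be distributed to all nodes by whatever means
your cluster tooling provides; that step is outside this sketch.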

Chrissie