[Linux-cluster] Heartbeat time outs in rhel4 understanding

Wed May 6 08:06:17 UTC 2009

Elias, Michael wrote:
> I am trying to understand how these timers interact with each other.
>
> In a RHEL4 cluster the heartbeat defaults are:
>
> hello_timer:5
> max_retries:5
> deadnode_timeout:21
>
> Meaning a heartbeat message is sent every 5 seconds; if it fails to
> receive a response it will start a deadnode counter at 21 seconds. It
> will also try to send 5 more heartbeat requests. What is the interval of
> those retries? If none of those requests receives a response, 5 seconds
> pass, there are 15 seconds left on the deadnode timer, and we try up to 5
> times to get a response. This goes on until we hit the 4th iteration
> of the hello_timer; it tries again up to 5 times and fails. We then hit
> 21 seconds on the deadnode timer, fenced takes over, and wham, reboot.
>
> Is my understanding of this correct?
> 

No, I'm afraid it isn't :-)

max_retries has nothing to do with the heartbeat. It applies to
cluster messages, such as service join requests, clvmd messages or the
messages used in the membership protocol.

So the heartbeat system is just a heartbeat sent every 5 seconds
(hello_timer). If nothing is heard from a node for 21 seconds
(deadnode_timeout), that node will be evicted from the cluster and
(usually) fenced.
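
As a rough illustration of how the two heartbeat values relate, here is a
minimal Python sketch. It is not the cman code: the function names are
made up for this example, and treating a silence of exactly 21 seconds as
"dead" is an assumption.

```python
# Illustrative sketch only; names are hypothetical, not taken from cman.
HELLO_TIMER = 5        # seconds between heartbeats (hello_timer)
DEADNODE_TIMEOUT = 21  # seconds of silence before eviction (deadnode_timeout)


def heartbeats_missed_before_eviction(hello_timer=HELLO_TIMER,
                                      deadnode_timeout=DEADNODE_TIMEOUT):
    """Number of heartbeats that fall due, but never arrive, inside the
    timeout window that starts at the last heartbeat received."""
    return deadnode_timeout // hello_timer


def is_dead(last_heard, now, deadnode_timeout=DEADNODE_TIMEOUT):
    """True once the node has been silent for the whole timeout.
    Assumption: the boundary case (exactly 21 seconds) counts as dead."""
    return now - last_heard >= deadnode_timeout


if __name__ == "__main__":
    print(heartbeats_missed_before_eviction())  # heartbeats due at 5, 10, 15, 20
    print(is_dead(last_heard=0, now=20))
    print(is_dead(last_heard=0, now=21))
```

With the defaults, a node can miss the heartbeats due at 5, 10, 15 and 20
seconds before the 21 second timeout expires. There are no retries
involved at any point.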

The same happens for data messages if max_retries is exceeded: the node
is evicted. The retry period here starts at 1 second and increases each
time to avoid filling the Ethernet buffers.
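
To make the retry behaviour concrete, here is a small Python sketch of an
increasing retry schedule. Only the 1 second starting value and
max_retries come from the description above. How much the interval grows
each time is not stated here, so the 1 second step is purely an
assumption for illustration, and the function name is hypothetical.

```python
# Illustrative sketch only; the growth step is an assumption, not cman's.
def retry_delays(max_retries=5, first_delay=1, step=1):
    """Return the wait (in seconds) before each retry of an
    unacknowledged cluster message.

    first_delay=1 matches the description above; step=1 (linear growth)
    is an assumed value chosen only to show an increasing schedule.
    """
    return [first_delay + i * step for i in range(max_retries)]


if __name__ == "__main__":
    delays = retry_delays()
    print(delays)       # one delay per retry, each longer than the last
    print(sum(delays))  # total wait before max_retries is exhausted
```

Under these assumed values the five retries would be spread over 15
seconds in total. The real figure depends on how cman actually grows the
interval.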

I hope this helps,

Chrissie