[Linux-cluster] Heartbeat time outs in rhel4 understanding

Wed May 6 13:33:53 UTC 2009

OK, so let me ask this. I ran a tcpdump between nodes. Is the heartbeat
the UDP packet I see? I also see an XML doc. It looks like node 1 keeps
uptime and other cluster info for itself and node 2, node 2 keeps uptime
and cluster info for nodes 1 and 3, node 3 does the same for nodes 2 and
4, and so on. I assume that if a node dies, the next closest node starts
watching the uptime for that node until the failed node rejoins.

Thanks again

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Chrissie
Caulfield
Sent: Wednesday, May 06, 2009 4:06 AM
To: linux clustering
Subject: Re: [Linux-cluster] Heartbeat time outs in rhel4 understanding

Elias, Michael wrote:
> I am trying to understand how these timers interact with each other.
> 
>  
> 
> In a RHEL4 cluster the heartbeat defaults are;
> 
> hello_timer:5
> 
> max_retries:5
> 
> deadnode_timeout:21
> 
>  
> 
> Meaning a heartbeat message is sent every 5 seconds, if it fails to
> receive a response it will start a deadnode counter @ 21 seconds. It
> will also try to send 5 more heartbeat requests. What is the interval of
> those retries? If none of those requests receive a response. 5 seconds
> pass.. there is 15 seconds left on the deadnode timer and we try upto 5
> times to get a response.... This goes on until we hit the 4th iteration
> of the hellotimer it tries again upto 5 times and fails... we then hit the
> 21 second on the deadnode time.. fenced takes over and wham reboot.
> 
>  
> 
> Is my understanding of this correct????
> 

No, I'm afraid it isn't :-)

max_retries has nothing to do with the heartbeat. It is to do with
cluster messages, such as service join requests, clvmd messages or the
messages used in the membership protocol.

So the heartbeat system is just a 5 second heartbeat and after 21
seconds the node will be evicted from the cluster and (usually) fenced.
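
As a rough illustration of that timing, here is a minimal Python sketch. It
is not the cman code: the function names are made up for this example, and
treating the node as dead at exactly 21 seconds (>= rather than >) is an
assumption. Only the two default values come from the thread.

```python
# Defaults quoted earlier in the thread for a RHEL4 cluster.
HELLO_TIMER = 5        # seconds between heartbeats
DEADNODE_TIMEOUT = 21  # seconds of silence before a node is evicted


def is_dead(last_heard, now, deadnode_timeout=DEADNODE_TIMEOUT):
    """Return True once nothing has been heard from the node for the timeout.

    Assumption: the boundary is inclusive (dead at exactly 21 seconds).
    """
    return (now - last_heard) >= deadnode_timeout


def missed_heartbeats_before_eviction(hello_timer=HELLO_TIMER,
                                      deadnode_timeout=DEADNODE_TIMEOUT):
    """Count the heartbeats that fail to arrive before the timeout expires.

    With the last heartbeat seen at t=0, the next ones are expected at
    t=5, 10, 15 and 20, all of which fall before the 21 second limit.
    """
    return deadnode_timeout // hello_timer


if __name__ == "__main__":
    print(missed_heartbeats_before_eviction())
    print(is_dead(last_heard=0, now=20))
    print(is_dead(last_heard=0, now=21))
```

The point of the sketch is that max_retries does not appear anywhere in it:
only the heartbeat interval and the dead-node timeout decide eviction.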

The same happens for data messages if max_retries is exceeded. The retry
period here starts at 1 second and increases each time to avoid filling
the ethernet buffers.
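
To make the retry behaviour concrete, here is a small Python sketch of an
increasing retry schedule. Only the 1 second starting value and the
max_retries default of 5 come from this thread. The amount the delay grows
by each time is not stated here, so the 1 second step below is purely an
assumption for illustration, and the function name is hypothetical.

```python
def retry_delays(max_retries=5, first_delay=1, step=1):
    """Return the wait before each retry, in seconds.

    first_delay=1 matches the "starts at 1 second" described above.
    step=1 is an ASSUMPTION: the real increment is not given in this thread.
    """
    return [first_delay + i * step for i in range(max_retries)]


if __name__ == "__main__":
    delays = retry_delays()
    print(delays)       # per-retry waits under the assumed step
    print(sum(delays))  # total time spent retrying under the assumed step
```

Whatever the real increment is, the shape is the same: each unanswered
message waits longer than the last, and once max_retries is used up the
node is treated like one that missed its heartbeats.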

I hope this helps,

Chrissie

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster