[Linux-cluster] partly OT: failover <500ms

Steven Dake sdake at mvista.com
Fri Sep 2 18:45:22 UTC 2005


On Fri, 2005-09-02 at 08:03 +0100, Patrick Caulfield wrote:
> Lon Hohberger wrote:
> > On Thu, 2005-09-01 at 21:58 +0200, Jure Pečar wrote:
> > 
> >>Hi all,
> >>
> >>Sorry if this is somewhat offtopic here ...
> >>
> >>Our telco is looking into linux HA solutions for their VoIP needs. Their
> >>main requirement is that the failover happens in the order of a few 100ms. 
> >>
> >>Can redhat cluster be tweaked to work reliably with such short time
> >>periods? This would mean heartbeat on the level of few ms and status probes
> >>on the level of 10ms. Is this even feasible?
> > 
> > 
> > Possibly, I don't think it can do it right now.  A couple of things to
> > remember:
> > 
> > * For such a fast requirement, you'll want a dedicated network for
> > cluster traffic and a real-time kernel.
> > 
> > * Also, "detection and initiation of recovery" is all the cluster
> > software can do for you; your application - by itself - may take longer
> > than this to recover.
> > 
> > * It's practically impossible to guarantee completion of I/O fencing in
> > this amount of time, so your application must be able to do without, or
> > you need to create a new specialized fencing mechanism which is
> > guaranteed to complete within a very fast time.
> > 
> > * I *think* CMAN is currently at whole-second granularity, so some
> > changes would need to be made to give it finer granularity.  This
> > shouldn't be difficult (but I'll let the developers of CMAN answer
> > that definitively... ;) )
> > 
> 
> All true :) All cman timers are calibrated in seconds. I ran some tests a
> while ago with them in milliseconds and 100ms timeouts, and it worked
> /reasonably/ well. However, without an RT kernel I wouldn't like to put this
> into a production system - we've had several instances of the cman kernel
> thread (which runs at the top RT priority) being stalled for up to 5 seconds
> and that node being fenced. Smaller stalls may be more common, so with
> timeouts set that low you may well get nodes fenced for small delays.
> 
> To be quite honest I'm not really sure what causes these stalls. As they
> generally happen under heavy IO load I assume (possibly wrongly) that they
> are related to disk flushes, but someone who knows the VM better may put me
> right on this.
> 
> 

These systems could have swap.  Swap doesn't work here, because a
swapped-out page can take 1-10 seconds to be paged back into memory.
The mlockall() system call resolves this particular problem.
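
As a rough sketch (not taken from any particular cluster daemon), a
process can lock its current and future pages into RAM once at startup.
This needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK:

/* Lock every current and future page so a swapped-out page can never
 * add seconds of latency to the heartbeat path. */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");
                return EXIT_FAILURE;
        }
        /* ... real-time heartbeat / membership work runs here ... */
        return EXIT_SUCCESS;
}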

The poll, sendmsg, and recvmsg system calls (and some others that
require memory) can block while allocating memory under low-memory
conditions.  This unfortunately means longer timeouts are necessary
when the system is overloaded.  One solution would be to change these
system calls, via some kind of socket option, to allocate the memory
for their operation ahead of time, but I don't know of anything like
that yet.

I have measured failover with openais at 3 msec from detection to
assignment of new CSIs to components.  Application failures are
detected in 100 msec.  Node failures are detected in 100 msec.  On a
system that hits the scenario above, it is possible for a processor to
be excluded from the membership during a low-memory condition.

This is a reasonable choice: the processor is having difficulty
responding to requests in a timely fashion, and it should be removed
until overload-control software on that processor frees up memory.

Regards
-steve




