[Cluster-devel] fencing conditions: what should trigger a fencing operation?

David Teigland teigland at redhat.com
Thu Nov 19 19:49:52 UTC 2009


On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
> David Teigland wrote:
> > On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> > 
> >> - what are the current fencing policies?
> > 
> > node failure
> > 
> >> - what can we do to improve them?
> > 
> > node failure is a simple, black-and-white fact
> > 
> >> - should we monitor for more failures than we do now?
> > 
> > corosync *exists* to detect node failure
> > 
> >> It is a known issue that node1 will crash at some point (kernel OOPS).
> > 
> > oops is not necessarily node failure; if you *want* it to be, then you
> > sysctl -w kernel.panic_on_oops=1
> > 
> > (gfs has also had its own mount options over the years to force this
> > behavior, even if the sysctl isn't set properly; it's a common issue.
> > It seems panic_on_oops has had inconsistent default values over various
> > releases, sometimes 0, sometimes 1; setting it has historically been part
> > of cluster/gfs documentation since most customers want it to be 1.)
> 
> So a cluster can hang because our code failed, but we don't detect that
> it did fail.... so what determines a node failure? only when corosync dies?

The error is detected in gfs.  For every error in every bit of code, the
developer needs to consider what the appropriate error handling should be:
What are the consequences (with respect to availability and data
integrity), both locally and remotely, of the error handling they choose?
It's case by case.

If the error could lead to data corruption, then the proper error handling
is usually to fail fast and hard.
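
To make "fail fast and hard" concrete, here is a rough userspace sketch
(not actual gfs code; the struct, magic value and function names are made
up).  The point is simply that once a consistency check fails, the safest
thing is to stop immediately rather than risk writing bad data back to
shared storage:

/* illustrative only: abort as soon as corruption is detected */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define JOURNAL_MAGIC 0x01161970u          /* made-up magic value */

struct journal_header {
	uint32_t magic;
	uint32_t sequence;
};

static void validate_header(const struct journal_header *jh)
{
	if (jh->magic != JOURNAL_MAGIC) {
		/* corruption: stop before anything else gets written;
		   in kernel code the equivalent is an assert/panic */
		fprintf(stderr, "bad journal magic 0x%x, failing fast\n",
			jh->magic);
		abort();
	}
}

int main(void)
{
	struct journal_header good = { JOURNAL_MAGIC, 1 };

	validate_header(&good);
	printf("header ok, sequence %u\n", good.sequence);
	return 0;
}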

If the error can result in remote nodes being blocked, then the proper
error handling is usually self-sacrifice to avoid blocking other nodes.

Self-sacrifice means forcibly removing the local node from the cluster so
that others can recover for it and move on.  There are different ways of
doing self-sacrifice (a rough sketch of the daemon case follows the list):

- panic the local machine (kernel code usually uses this method)
- kill corosync on the local machine (daemons usually do this)
- call reboot (I think rgmanager has used this method)
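
A rough sketch of the second method (the daemon case), assuming corosync's
pid can be read from a pid file; the path used below is an assumption for
illustration, not a documented interface:

/* sketch of daemon-style self-sacrifice: kill corosync locally so the
   rest of the cluster sees this node fail and can fence/recover it */
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void self_sacrifice(void)
{
	FILE *f = fopen("/var/run/corosync.pid", "r");  /* assumed path */
	int pid = 0;

	if (f && fscanf(f, "%d", &pid) == 1 && pid > 0) {
		/* dropping out of membership lets the other nodes
		   recover for this one instead of staying blocked */
		kill((pid_t)pid, SIGKILL);
	}
	if (f)
		fclose(f);

	/* if corosync couldn't be killed, fall back to taking the
	   whole node down (the reboot/panic variants above) */
	exit(EXIT_FAILURE);
}

int main(void)
{
	self_sacrifice();
	return 0;
}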

> panic_on_oops is not cluster specific, and not all oopses are panics ==
> not a clean solution.

So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
result in a panic?  There's probably a combination of options that would
produce this effect.  Most people interested in HA will want all oopses to
result in a panic and recovery since an oops puts a node in a precarious
position regardless of where it came from.
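
For completeness, the sysctl quoted above can also be set by writing to
/proc/sys directly; this trivial C equivalent is illustrative only, and in
practice most people would just put kernel.panic_on_oops = 1 in
/etc/sysctl.conf:

/* write 1 to /proc/sys/kernel/panic_on_oops, same effect as
   "sysctl -w kernel.panic_on_oops=1" */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/panic_on_oops", "w");

	if (!f) {
		perror("panic_on_oops");
		return 1;
	}
	/* 1 = any oops escalates to a panic, so the node dies cleanly
	   and the rest of the cluster can fence it and move on */
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}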

Dave



