[Cluster-devel] fencing conditions: what should trigger a fencing operation?

Fabio M. Di Nitto fdinitto at redhat.com
Fri Nov 20 07:26:57 UTC 2009


David Teigland wrote:
> On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
>
> The error is detected in gfs.  For every error in every bit of code, the
> developer needs to consider what the appropriate error handling should be:
> What are the consequences (with respect to availability and data
> integrity), both locally and remotely, of the error handling they choose?
> It's case by case.
> 
> If the error could lead to data corruption, then the proper error handling
> is usually to fail fast and hard.

of course, agreed.

> 
> If the error can result in remote nodes being blocked, then the proper
> error handling is usually self-sacrifice to avoid blocking other nodes.

OK, so this is the case we are seeing here: the cluster is half blocked,
but no self-sacrifice action is happening.

> 
> Self-sacrifice means forcibly removing the local node from the cluster so
> that others can recover for it and move on.  There are different ways of
> doing self-sacrifice:
> 
> - panic the local machine (kernel code usually uses this method)
> - killing corosync on the local machine (daemons usually do this)
> - calling reboot (I think rgmanager has used this method)

I don't really have an opinion on how it happens, as long as it works.
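For illustration, the three self-sacrifice mechanisms listed above map onto standard Linux commands roughly as follows. This is only a sketch of the idea, not what any particular daemon actually runs; every one of these forcibly takes the node down, so do not try them on a machine you care about:

```shell
# 1. Panic the local machine (the kernel-code style), here simulated via
#    the magic SysRq interface; requires kernel.sysrq to be enabled.
echo c > /proc/sysrq-trigger

# 2. Kill corosync so cluster membership drops this node and the others
#    fence it and recover (the daemon style).
killall -9 corosync

# 3. Force an immediate reboot with no clean shutdown (the style
#    rgmanager has reportedly used).
reboot -f
```

The common property is that the node stops participating abruptly, so the surviving nodes can detect the failure, fence it, and recover its resources instead of staying blocked.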

> 
>> panic_on_oops is not cluster specific and not all OOPS are panic == not
>> a clean solution.
> 
> So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
> result in a panic? 

Well, partially yes.

We can't make decisions for OOPSes that are not generated within our own
code. The user will have to configure that via panic_on_oops or other
means. Maybe our task is to make sure users are aware of this
situation/option (I didn't check whether it is documented).
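For reference, the global oops-to-panic behaviour mentioned here is controlled through sysctl. A configuration sketch (the timeout value is an example, not a recommendation):

```shell
# Turn every kernel oops into a panic...
sysctl -w kernel.panic_on_oops=1
# ...and have the panic reboot the node after 10 seconds rather than
# hanging, so the rest of the cluster can recover.
sysctl -w kernel.panic=10

# To persist across reboots, the equivalent /etc/sysctl.conf entries:
#   kernel.panic_on_oops = 1
#   kernel.panic = 10
```

Note these knobs are system-wide: they apply to every oops, not just ones raised by cluster code, which is exactly the trade-off under discussion.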

You have a point in saying that it varies from error to error, and this
is exactly where I'd like to head. Maybe it's time to review our error
paths and make better decisions about what to do, at least within our
own code.

> There's probably a combination of options that would
> produce this effect.  Most people interested in HA will want all oopses to
> result in a panic and recovery since an oops puts a node in a precarious
> position regardless of where it came from.

I agree, but I don't think we can kill the node on every OOPS by
default. We can agree that it has to be a user-configurable choice, but
we can improve our code to do the right thing (or to do better what it
does now).

Fabio



