[Cluster-devel] fencing conditions: what should trigger a fencing operation?
Fabio M. Di Nitto
fdinitto at redhat.com
Fri Nov 20 07:26:57 UTC 2009
David Teigland wrote:
> On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
>
> The error is detected in gfs. For every error in every bit of code, the
> developer needs to consider what the appropriate error handling should be:
> What are the consequences (with respect to availability and data
> integrity), both locally and remotely, of the error handling they choose?
> It's case by case.
>
> If the error could lead to data corruption, then the proper error handling
> is usually to fail fast and hard.
Of course, agreed.
>
> If the error can result in remote nodes being blocked, then the proper
> error handling is usually self-sacrifice to avoid blocking other nodes.
OK, so this is the case we are seeing here: the cluster is half blocked,
but no self-sacrifice action is happening.
>
> Self-sacrifice means forcibly removing the local node from the cluster so
> that others can recover for it and move on. There are different ways of
> doing self-sacrifice:
>
> - panic the local machine (kernel code usually uses this method)
> - killing corosync on the local machine (daemons usually do this)
> - calling reboot (I think rgmanager has used this method)
I don't have an opinion on how it happens, really, as long as it works.
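For reference, the three self-sacrifice methods listed above roughly correspond to the commands below. This is an illustrative sketch only (these are destructive, and the sysrq trigger assumes a kernel built with sysrq support); it is not what any of our daemons literally run:

```shell
# Three ways a node can self-sacrifice so the rest of the cluster
# can fence it and recover. Illustrative and destructive -- do not
# run these on a healthy node.

# 1. Panic the machine (kernel code calls panic() directly; from
#    userspace, the sysrq trigger forces a crash with the same effect):
echo c > /proc/sysrq-trigger

# 2. Kill corosync locally, so the node drops out of membership
#    and the surviving nodes fence it:
killall -9 corosync

# 3. Force an immediate reboot (the rgmanager-style approach):
reboot -f
```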
>
>> panic_on_oops is not cluster specific and not all OOPS are panic == not
>> a clean solution.
>
> So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
> result in a panic?
Well, partially yes.
We can't make decisions for OOPSes that are not generated within our
code. The user will have to configure that via panic_on_oops or other
means. Maybe our task is to make sure users are aware of this
situation/option (I didn't check whether it is documented).
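For users who do want every oops to take the node down, the standard sysctl knobs already cover it; a sketch of what that configuration might look like (these are stock Linux kernel sysctls, nothing cluster-specific):

```shell
# Escalate any kernel oops to a panic, and have the panic trigger an
# automatic reboot so the node leaves the cluster quickly instead of
# limping along half-dead.

# Runtime (non-persistent):
sysctl -w kernel.panic_on_oops=1   # panic instead of continuing after an oops
sysctl -w kernel.panic=10          # reboot 10 seconds after a panic

# Persistent, e.g. in /etc/sysctl.conf or a file under /etc/sysctl.d/:
#   kernel.panic_on_oops = 1
#   kernel.panic = 10
```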
You have a point in saying that it depends on the error, and this is
exactly where I'd like to head. Maybe it's time to review our error
paths and make better decisions about what to do, at least within our code.
> There's probably a combination of options that would
> produce this effect. Most people interested in HA will want all oopses to
> result in a panic and recovery since an oops puts a node in a precarious
> position regardless of where it came from.
I agree, but I don't think we can kill the node on every OOPS by
default. We can agree that it has to be a user-configurable choice, but we
can improve our code to do the right thing (or do better what it does now).
Fabio