[Linux-cluster] How to disable node?

Marc - A. Dahlhaus [ Administration | Westermann GmbH ] mad at wol.de
Tue Sep 1 10:29:36 UTC 2009


On Tuesday, 01.09.2009 at 11:26 +0200, Jakov Sosic wrote:
> On Mon, 31 Aug 2009 23:26:06 +0200
> "Marc - A. Dahlhaus" <mad at wol.de> wrote:
> 
> > I think your so-called 'limitation' is more related to mistakes that
> > were made during the planning phase of your cluster setup than to
> > missing functionality.
> 
> Yeah, and what can be that mistake? I'll feel free to quote John:
> 
> > The best course of action to take would be to remove that missing
> > node from your cluster configuration using conga,
> > system-config-cluster, or by hand
> > editing /etc/cluster/cluster.conf.  As long as it exists in the
> > configuration then the other nodes will expect it to join the
> > cluster, and they will attempt to fence it when they try to join the
> > cluster and see it is not present.
> 
> Where's the issue with my config there? It seems to be an issue with
> RHCS misbehaving with one fence device missing.

It isn't misbehaving at all here.

The job of RHCS in this case is to protect your data against failure.

If fenced can't fence a node successfully, RHCS will wait in a stalled
state (because it doesn't get a successful response from the fence agent)
until someone who knows what he is doing comes around and fixes the
problem. If it didn't behave that way, a separated node could eat up
your data. It is the job of fenced to stop all activity until fencing
is in working shape again.
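
To picture the stall described above, here is a toy model in Python. It
is only a sketch and not how fenced is implemented: the names `recover`
and `fence_agent` are made up for illustration, and the only rule it
encodes is "nothing resumes while any unreachable node is still unfenced".

```python
from typing import Callable, Dict, Iterable, List, Union


def recover(
    unreachable: Iterable[str],
    fence_agent: Callable[[str], bool],
) -> Dict[str, Union[bool, List[str]]]:
    """Toy model (hypothetical, not fenced's real logic).

    Every unreachable node must be fenced successfully before the
    cluster may resume. If the agent fails for any node, we stay stalled.
    """
    # Collect the nodes for which the fence agent did NOT report success.
    unfenced = [node for node in unreachable if not fence_agent(node)]
    return {"stalled": bool(unfenced), "unfenced": unfenced}


# Hypothetical agent: the management board of node3 is gone with the node,
# so fencing node3 can never succeed.
def ilo_like_agent(node: str) -> bool:
    return node != "node3"


state = recover(["node3"], ilo_like_agent)
```

In this model `state["stalled"]` stays true for as long as the agent for
node3 keeps failing, which is the situation discussed in this thread.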

This behaviour is perfectly fine IMO...

The mistakes in the planning phase of your cluster setup are:

- You use system-dependent fencing like "HP iLO", which will be missing
  if your system is missing, and no independent fencing like an
  APC PowerSwitch...

  Think about a power surge that kills both PSUs of a system:
  a system-dependent management device would be missing from your
  network in this case, leading to exactly the problem you're faced with.

- You haven't read through the related documentation (read on and
  you'll see what I am referring to).

> > Please take a look at the qdisk manpage and additionally at the cman
> > FAQ sections about tiebreakers, qdisks and especially the last man
> > standing setup...
> 
> qdisk already set up. I never said I lost quorum. I have quorum. But
> with one node missing completely, along with its fence device, rgmanager
> just doesn't start up the services, and is not listed in clustat. I
> repeat, I HAVE GAINED QUORUM, and I have qdisk for the case two out of
> three are out.

Your mistake is that you started fenced in normal mode, in which it will
fence all nodes that it can't reach, to guard against a possible
split-brain scenario. You need to start fenced in "clean start" mode,
without startup fencing (read the fenced manpage, it is documented
there), because you know that everything is all right. RHCS can't know
that on its own: for it, the missing node is separated at the
network/link layer and could be eating up all your data until it gets
fenced. As long as the missing node isn't joining, it will not get
fenced by the other nodes in the clean start mode of fenced, so that
will be your way out of this problem.
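
One possible way to request clean start is through the configuration
rather than on the command line. The sketch below assumes the
`clean_start` attribute of the `<fence_daemon>` element in cluster.conf
and that `config_version` has to be raised on every change; please check
both against the fenced manpage for your release. The helper name
`enable_clean_start` and the sample document are made up for
illustration.

```python
import xml.etree.ElementTree as ET


def enable_clean_start(conf_xml: str) -> str:
    """Return a copy of a cluster.conf document with clean_start="1".

    Assumption: <fence_daemon> is a direct child of <cluster> and
    accepts a clean_start attribute, as described in the fenced manpage.
    """
    root = ET.fromstring(conf_xml)

    # Reuse the existing <fence_daemon> element or create one.
    daemon = root.find("fence_daemon")
    if daemon is None:
        daemon = ET.SubElement(root, "fence_daemon")
    daemon.set("clean_start", "1")

    # Bump config_version so the change counts as a new configuration.
    version = int(root.get("config_version", "0"))
    root.set("config_version", str(version + 1))

    return ET.tostring(root, encoding="unicode")


# Made-up minimal document, NOT a complete cluster.conf.
sample = (
    '<cluster name="demo" config_version="7">'
    '<fence_daemon post_join_delay="3"/>'
    "</cluster>"
)
updated = enable_clean_start(sample)
```

Clean start turns off the startup fencing that protects against
split-brain, so it only makes sense while you are certain the missing
node is really down, and it should be set back to 0 afterwards.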


Marc



