[Linux-cluster] How to disable node?

Jakov Sosic jakov.sosic at srce.hr
Tue Sep 1 09:19:54 UTC 2009


On Mon, 31 Aug 2009 14:22:07 -0700
Rick Stevens <ricks at nerd.com> wrote:

> I don't see that there's anything to fix.  You had a three-node
> cluster so you needed a majority of nodes up to maintain a quorum.
> One node died, killing quorum and thus stopping the cluster

Nope. Quorum is still there. I have 3 nodes with a qdisk, and two nodes
remained in the quorum. Then I had to reboot those two nodes because of
some multipath/SCSI changes, and after that they only keep trying to
fence the missing node, they can't reach its fencing device, and
rgmanager doesn't show up in my output. Quorum is regained after both
nodes restart.

So, basically, what I mean is that you cannot start the cluster with
one node and its fence device missing, even though you have regained
quorum. Two nodes plus a qdisk is much more than I need - I need only
one node plus the qdisk for the cluster to function properly.
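
Just to illustrate the vote arithmetic I have in mind (a simplified
sketch, the node and qdisk names are made up): with three 1-vote nodes
and a qdisk carrying 2 votes, expected_votes is 5 and quorum is 3, so
even a single node plus the qdisk (1 + 2 = 3 votes) should stay
quorate:

  <cluster name="testcluster" config_version="1">
    <!-- 3 node votes + 2 qdisk votes -->
    <cman expected_votes="5"/>
    <!-- qdisk gets N-1 votes so one node + qdisk still meets quorum -->
    <quorumd label="testqdisk" interval="1" tko="10" votes="2"/>
    <clusternodes>
      <clusternode name="node1" nodeid="1" votes="1"/>
      <clusternode name="node2" nodeid="2" votes="1"/>
      <clusternode name="node3" nodeid="3" votes="1"/>
    </clusternodes>
  </cluster>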


> As a three-node cluster, it's dead.
> It can't be run as a three-node cluster until the third node is
> fixed.  Those are the rules.

Well, this is the part that I don't like :) Why can't I, for example,
put 10 missing nodes in my cluster.conf? If the other nodes don't gain
quorum, they shouldn't start services, and that's it. But if they do
gain quorum, what's the point of constantly trying to fence the missing
node through the fence device that is missing along with it?!
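
To be concrete (again just a sketch - the agent, the names and the
address are made up), the dead node's cluster.conf entry points at a
fence device that left the building together with the node, and yet
the surviving nodes keep retrying it:

  <clusternode name="node3" nodeid="3" votes="1">
    <fence>
      <method name="1">
        <device name="ipmi-node3"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <!-- unreachable while node3 is away at the dealer -->
    <fencedevice agent="fence_ipmilan" name="ipmi-node3"
                 ipaddr="10.0.0.13" login="admin" passwd="secret"/>
  </fencedevices>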

 
> A two node cluster requires special handling of things to prevent the
> dread split-brain situation, which is what two_node does.  Running the
> surviving nodes as a two-node cluster is, by definition, a
> reconfiguration.  I'd say simply requiring you to set two_node is
> pretty damned innocuous to let you run a dead (ok, mortally wounded)
> cluster.
> 
> If you pulled a drive out of a RAID6--thus degrading it to a RAID5--
> would you complain because it didn't remain a RAID6?

First of all, RAID6 with one disk missing _IS NOT_ RAID5. In terms of
redundancy they are equivalent, but the on-disk data layout is not the
same, so the two are not equal.

And yes - I would complain if I had to _REBUILD_ the degraded array as
a RAID5. And if the array were unavailable until that rebuild finished,
it would be a major issue - what's the point of redundancy if I lose
the whole array/cluster when one unit fails? But with RAID6 I don't
have to rebuild. As a matter of fact, I can lose one more drive and
leave the array in that state until I buy two new drives and hotplug
them into the chassis. In other words: as long as quorum (redundancy)
is maintained, the array and the data on it are not jeopardized. With
RHCS that should be the same, shouldn't it?

I'm just asking: why can't I leave the missing node in the
configuration, where it will simply become active again once it returns
from the dealer? Why do I have to reconfigure the cluster? That is not
good behaviour IMHO - there should be some command to mark a node as
missing, and the cluster should keep working fine with two nodes plus
the qdisk, because it has quorum. Isn't that the point of quorum?
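
As far as I understand it, the "reconfiguration" being suggested boils
down to switching cluster.conf into the special two-node mode, roughly
(a sketch):

  <!-- two_node mode: the cluster is quorate with a single vote and
       fencing alone arbitrates a split brain -->
  <cman two_node="1" expected_votes="1"/>

Making that change by hand, and reverting it once the node comes back,
is exactly the manual step I'd like to avoid.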

What's the point of a cluster if one node cannot malfunction and be
taken away for repairs without having to set up a new cluster?

Applied to your RAID6 analogy, it's as if taking away one disk broke
the array until you rebuilt it...


-- 
|    Jakov Sosic    |    ICQ: 28410271    |   PGP: 0x965CAE2D   |
=================================================================
| start fighting cancer -> http://www.worldcommunitygrid.org/   |



