[Linux-cluster] Heuristics for quorum disk used as a tiebreaker in a two node cluster.

Jankowski, Chris Chris.Jankowski at hp.com
Fri Dec 3 10:10:10 UTC 2010


Hi,

I am configuring a two node HA cluster that has only one service.
The sole purpose of the cluster is to keep the service up with minimum disruption for the widest possible range of failure scenarios.

I configured a quorum disk to make sure that after a failure of a node, the cluster (now consisting of only one node) continues to have quorum.

I am considering a partitioned cluster scenario. By partitioned I mean that the cluster nodes have lost the cluster communication path. Without a quorum disk, each of the nodes in the cluster will fence the other.

However, the manual page for qdisk holds out the promise of solving this problem, in the list of design requirements that it apparently fulfils:

Quote:
Ability to use the external reasons for deciding which partition is the quorate partition in a partitioned cluster.  For example, a user may have a service running on one node, and that node must always be the master in the event of a network partition.
Unquote.

This is exactly what I would like to achieve. I know which node should stay alive: the one running my service. It is trivial for me to find this out directly, as I can query the service status locally on a node, so I do not have to use the network. This can be used as a heuristic for the quorum disk.
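To make the idea concrete, here is a minimal sketch in Python of what such a heuristic program could look like. It assumes that qdiskd treats an exit status of 0 as success and anything else as failure (as I read the qdisk manual page), and that the service has a local status command whose exit status is 0 when the service is running on this node. The path /etc/init.d/myservice and the function names are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch of a qdisk heuristic: succeed only if the service runs on this node."""
import subprocess

# Hypothetical status command; assumed to exit 0 when the service is up locally.
STATUS_CMD = ["/etc/init.d/myservice", "status"]


def service_is_local(cmd=STATUS_CMD, timeout=5):
    """Return True if the local status command exits with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing or hung: treat as "service not running here".
        return False
    return result.returncode == 0


def main():
    """Exit status for the heuristic: 0 means pass, 1 means fail."""
    return 0 if service_is_local() else 1

# To use this as a standalone heuristic program, the file would end with:
#     raise SystemExit(main())
```

The check is purely local and never touches the network, which is the property I want from the heuristic.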

What I am missing is how to make that into a workable whole.  Specifically the following aspects are of concern:

1.
I do not want the other node to be ejected from the cluster just because it does not run the service.  But the test is binary, so it looks like it will be ejected.

2.
Startup time, before the service has started.  As no node has the service yet, both will be candidates for ejection.

3.
Service migration time.
During service migration from one node to another, there is a transient period of time when the service is not active on either node.
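The three concerns above can be illustrated with a toy model in Python. This is only a sketch of my reasoning about a binary "service is local" test, not a model of how qdiskd actually scores heuristics or elects a master; the node names and function names are hypothetical.

```python
def heuristic_passes(node, service_owner):
    """Binary test: a node passes only if it currently runs the service."""
    return service_owner == node


def nodes_failing(nodes, service_owner):
    """Nodes that would fail the heuristic for a given service placement."""
    return [n for n in nodes if not heuristic_passes(n, service_owner)]


NODES = ["node1", "node2"]  # hypothetical node names

# service_owner is None when the service is not active on any node.
scenarios = {
    "steady state (service on node1)": "node1",
    "startup (service not yet started)": None,
    "migration (service between nodes)": None,
}

for name, owner in scenarios.items():
    print(name, "->", nodes_failing(NODES, owner))
```

In the steady state the standby node fails the test (concern 1), and during startup or migration both nodes fail it (concerns 2 and 3).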

Questions:

1.
How do I put all of this together to achieve the overall objective of the node with the service surviving the partitioning event uninterrupted?

2.
What is the relationship between fencing and node suicide due to communication through the quorum disk?

3.
How does the master election relate to this?

I would be grateful for any insights, pointers to documentation, etc.

Thanks and regards,

Chris Jankowski




