[Cluster-devel] cluster/cman/man Makefile mkqdisk.8 qdisk.5 qd ...

lhh at sourceware.org
Fri Jul 21 17:55:06 UTC 2006


CVSROOT:	/cvs/cluster
Module name:	cluster
Changes by:	lhh at sourceware.org	2006-07-21 17:55:04

Modified files:
	cman/man       : Makefile 
Added files:
	cman/man       : mkqdisk.8 qdisk.5 qdiskd.8 

Log message:
	Add man pages for qdisk

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/mkqdisk.8.diff?cvsroot=cluster&r1=1.1&r2=1.2
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdisk.5.diff?cvsroot=cluster&r1=1.1&r2=1.2
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/qdiskd.8.diff?cvsroot=cluster&r1=1.1&r2=1.2
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/cman/man/Makefile.diff?cvsroot=cluster&r1=1.1&r2=1.2

--- cluster/cman/man/mkqdisk.8	2006/07/21 17:53:08	1.1
+++ cluster/cman/man/mkqdisk.8	2006/07/21 17:55:04	1.2
@@ -0,0 +1,23 @@
+.TH "mkqdisk" "8" "July 2006" "" "Quorum Disk Management"
+.SH "NAME"
+mkqdisk \- Cluster Quorum Disk Utility
+.SH "WARNING"
+Use of this command can cause the cluster to malfunction.
+.SH "SYNOPSIS"
+\fBmkqdisk [\-?|\-h] | [\-L] | [\-f \fPlabel\fB] | [\-c \fPdevice\fB \-l \fPlabel\fB]\fP
+.SH "DESCRIPTION"
+.PP 
+The \fBmkqdisk\fP command is used to create a new quorum disk or display
+existing quorum disks accessible from a given cluster node.
+.SH "OPTIONS"
+.IP "\-c device \-l label"
+Initialize a new cluster quorum disk.  This will destroy all data on the given
+device.  If a cluster is currently using that device as a quorum disk, the
+entire cluster will malfunction.  Do not run this command on a device that
+is in use by a running cluster.
+.IP "\-f label"
+Find the cluster quorum disk with the given label and display information about it.
+.IP "\-L"
+Display information on all accessible cluster quorum disks.
+
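+.SH "EXAMPLE"
+To initialize a new quorum disk labeled "quorum1" on a shared partition
+(the device name below is only an example), and then verify that it is
+visible:
+.PP
+.in 12
+mkqdisk \-c /dev/sdc1 \-l quorum1
+.br
+mkqdisk \-L
+.in 0
+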
+.SH "SEE ALSO"
+qdisk(5) qdiskd(8)
--- cluster/cman/man/qdisk.5	2006/07/21 17:53:08	1.1
+++ cluster/cman/man/qdisk.5	2006/07/21 17:55:04	1.2
@@ -0,0 +1,309 @@
+.TH "QDisk" "8" "July 2006" "" "Cluster Quorum Disk"
+.SH "NAME"
+QDisk 1.0 \- a disk-based quorum daemon for CMAN / Linux-Cluster
+.SH "1. Overview"
+.SH "1.1 Problem"
+In some situations, it may be necessary or desirable to sustain
+a majority node failure of a cluster without introducing the need for
+asymmetric cluster configurations (e.g. client-server, or heavily-weighted
+voting nodes).
+
+.SH "1.2. Design Requirements"
+* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
+danger of a simple network partition causing a split brain.  That is, we
+need to be able to ensure that the majority failure case is not merely
+the result of a network partition.
+
+* Ability to use external reasons for deciding which partition is the
+quorate partition in a partitioned cluster.  For example, a user may
+have a service running on one node, and that node must always be the master
+in the event of a network partition.  Or, a node might lose all network
+connectivity except the cluster communication path - in which case, a
+user may wish that node to be evicted from the cluster.
+
+* Integration with CMAN.  CMAN must work the same whether or not the
+quorum daemon is running.  Linux-Cluster does not normally require a
+quorum disk; introducing new requirements on how the base Linux-Cluster
+stack operates is not allowed.
+
+* Data integrity.  In order to recover from a majority failure, fencing
+is required.  The fencing subsystem is already provided by Linux-Cluster.
+
+* Non-reliance on hardware or protocol specific methods (i.e. SCSI
+reservations).  This ensures the quorum disk algorithm can be used on the
+widest range of hardware configurations possible.
+
+* Little or no memory allocation after initialization.  In critical paths
+during failover, we do not want to risk being killed by the Linux OOM
+killer because a page fault under memory pressure forced an allocation.
+
+.SH "1.3. Hardware Considerations and Requirements"
+.SH "1.3.1. Concurrent, Synchronous, Read/Write Access"
+This quorum daemon requires a shared block device with concurrent read/write
+access from all nodes in the cluster.  The shared block device can be
+a multi-port SCSI RAID array, a Fibre Channel RAID SAN, a RAIDed iSCSI
+target, or even GNBD.  The quorum daemon uses O_DIRECT to write to the
+device.
+
+.SH "1.3.2. Bargain-basement JBODs need not apply"
+There is a minimum performance requirement inherent when using disk-based
+cluster quorum algorithms, so design your cluster accordingly.  Using a
+cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause 
+problems at the first load spike.  Plan your loads accordingly; a node's
+inability to write to the quorum disk in a timely manner will cause the
+cluster to evict the node.  Using host-RAID or multi-initiator parallel
+SCSI configurations with the qdisk daemon is unlikely to work, and will
+probably cause administrators a lot of frustration.  That having been
+said, because the timeouts are configurable, most hardware should work
+if the timeouts are set high enough.
+
+.SH "1.3.3. Fencing is Required"
+In order to maintain data integrity under all failure scenarios, use of
+this quorum daemon requires adequate fencing, preferably power-based
+fencing.  Watchdog timers and software-based solutions to reboot the node
+internally, while possibly sufficient, are not considered 'fencing' for 
+the purposes of using the quorum disk.
+
+.SH "1.4. Limitations"
+* At this time, this daemon supports a maximum of 16 nodes.  This is
+primarily a scalability issue: As we increase the node count, we increase
+the amount of synchronous I/O contention on the shared quorum disk.
+
+* Cluster node IDs must be statically configured in cluster.conf and
+must be numbered from 1..16 (there can be gaps, of course); see the
+example following this list.
+
+* Cluster node votes should be more or less equal.
+
+* CMAN must be running before the qdiskd program can start.
+
+* CMAN's eviction timeout should be at least 2x the quorum daemon's
+to give the quorum daemon adequate time to converge on a master during a
+failure + load spike situation.
+
+* The total number of votes assigned to the quorum device should be
+equal to or greater than the total number of node-votes in the cluster.
+While it is possible to assign only one (or a few) votes to the quorum
+device, the effects of doing so have not been explored.
+
+* Currently, the quorum disk daemon is difficult to use with CLVM if
+the quorum disk resides on a CLVM logical volume.  CLVM requires a
+quorate cluster to correctly operate, which introduces a chicken-and-egg
+problem for starting the cluster: CLVM needs quorum, but the quorum daemon
+needs CLVM (if and only if the quorum device lies on CLVM-managed storage).
+One way to work around this is to *not* set the cluster's expected votes
+to include the quorum daemon's votes.  Bring all nodes online, and start
+the quorum daemon *after* the whole cluster is running.  This will allow
+the expected votes to increase naturally.
+
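+For example, node IDs for a three-node cluster might be statically assigned
+in cluster.conf as follows (node names and vote counts are illustrative):
+
+.in 8
+<clusternodes>
+.in 12
+<clusternode name="node1" nodeid="1" votes="1"/>
+.br
+<clusternode name="node2" nodeid="2" votes="1"/>
+.br
+<clusternode name="node3" nodeid="3" votes="1"/>
+.br
+.in 8
+</clusternodes>
+.in 0
+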
+.SH "2. Algorithms"
+.SH "2.1. Heartbeating & Liveliness Determination"
+Nodes update individual status blocks on the quorum disk at a user-
+defined rate.  Each write of a status block alters the timestamp, which
+is what other nodes use to decide whether a node has hung or not.  After
+a user-defined number of 'misses' (that is, failures to update the
+timestamp), a node is declared offline.  After a certain number of 'hits'
+(changed timestamp + "i am alive" state), the node is declared online.
+
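+The eviction timing is simply the product of the two settings described in
+section 3.1: for example (values are illustrative), with \fIinterval\fP="1"
+and \fItko\fP="10", a node which fails to update its status block for
+roughly 10 seconds is declared offline.
+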
+The status block contains additional information, such as a bitmask of
+the nodes that node believes are online.  Some of this information is
+used by the master, while some is just for performance recording, and
+may be used at a later time.  The most important pieces of information
+a node writes to its status block are:
+
+.in 12
+- Timestamp
+.br
+- Internal state (available / not available)
+.br
+- Score
+.br
+- Known max score (may be used in the future to detect invalid configurations)
+.br
+- Vote/bid messages
+.br
+- Other nodes it thinks are online
+.in 0
+
+.SH "2.2. Scoring & Heuristics"
+The administrator can configure up to 10 purely arbitrary heuristics, and
+must exercise caution in doing so.  At least one administrator-
+defined heuristic is required for operation, but it is generally a good
+idea to have more than one heuristic.  By default, only nodes scoring over
+1/2 of the total maximum score will claim they are available via the
+quorum disk, and a node (master or otherwise) whose score drops too low
+will remove itself (usually, by rebooting).
+
+The heuristics themselves can be any command executable by 'sh -c'.  For
+example, in early testing the following was used:
+
+.ti 12
+<\fBheuristic \fP\fIprogram\fP\fB="\fP[ -f /quorum ]\fB" \fP\fIscore\fP\fB="\fP10\fB" \fP\fIinterval\fP\fB="\fP2\fB"/>\fP
+
+This is a literal sh-ism which tests for the existence of a file called
+"/quorum".  Without that file, the node would claim it was unavailable.
+This is an awful example, and should never, ever be used in production,
+but is provided as an illustration of what one could do.
+
+Typically, the heuristics should be snippets of shell code or commands which
+help determine a node's usefulness to the cluster or clients.  Ideally, you
+want to add traces for all of your network paths (e.g. check links, or
+ping routers), and methods to detect availability of shared storage.
+
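+For instance, a heuristic which pings the local router (the address below
+is purely an example) might look like:
+
+.ti 12
+<heuristic program="ping 192.0.2.1 -c1 -t1" score="1" interval="2"/>
+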
+.SH "2.3. Master Election"
+Only one master is present at any one time in the cluster, regardless of
+how many partitions exist within the cluster itself.  The master is
+elected by a simple voting scheme in which the node with the lowest node ID
+that believes it is capable of running (i.e. scores high enough) bids for
+master status.
+If the other nodes agree, it becomes the master.  This algorithm is 
+run whenever no master is present.
+
+If another node comes online with a lower node ID while a node is still
+bidding for master status, it will rescind its bid and vote for the lower
+node ID.  If a master dies or a bidding node dies, the voting algorithm
+is started over.  The voting algorithm typically takes two passes to
+complete.
+
+Master deaths take marginally longer to recover from than non-master
+deaths, because a new master must be elected before the old master can
+be evicted & fenced.
+
+.SH "2.4. Master Duties"
+The master node decides who is or is not in the master partition and
+handles eviction of dead nodes (both via the quorum disk and via
+the linux-cluster fencing system, using the cman_kill_node() API).
+
+.SH "2.5. How it All Ties Together"
+When a master is present, and if the master believes a node to be online,
+that node will advertise to CMAN that the quorum disk is available.  The
+master will only grant a node membership if:
+
+.in 12
+(a) CMAN believes the node to be online, and
+.br
+(b) that node has made enough consecutive, timely writes
+.in 16
+to the quorum disk, and
+.in 12
+(c) the node has a high enough score to consider itself online.
+.in 0
+
+.SH "3. Configuration"
+.SH "3.1. The <quorumd> tag"
+This tag is a child of the top-level <cluster> tag.
+
+.in 8
+\fB<quorumd\fP
+.in 9
+\fIinterval\fP\fB="\fP1\fB"\fP
+.in 12 
+This is the frequency of read/write cycles, in seconds.
+
+.in 9
+\fItko\fP\fB="\fP10\fB"\fP
+.in 12
+This is the number of cycles a node must miss in order to be declared dead.
+
+.in 9
+\fIvotes\fP\fB="\fP3\fB"\fP
+.in 12
+This is the number of votes the quorum daemon advertises to CMAN when it
+has a high enough score.
+
+.in 9
+\fIlog_level\fP\fB="\fP4\fB"\fP
+.in 12
+This controls the verbosity of the quorum daemon in the system logs.
+0 = emergencies; 7 = debug.
+
+.in 9
+\fIlog_facility\fP\fB="\fPlocal4\fB"\fP
+.in 12
+This controls the syslog facility used by the quorum daemon when logging.
+For a complete list of available facilities, see \fBsyslog.conf(5)\fP.
+
+.in 9
+\fIstatus_file\fP\fB="\fP/foo\fB"\fP
+.in 12
+Write internal states out to this file periodically ("-" = use stdout).
+This is primarily used for debugging.
+
+.in 9
+\fImin_score\fP\fB="\fP3\fB"\fP
+.in 12
+Absolute minimum score for a node to consider itself "alive".  If omitted,
+or set to 0, the default function "floor((n+1)/2)" is used, where \fIn\fP
+is the sum of all defined heuristics' \fIscore\fP attributes.
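+For example (a purely illustrative configuration), three heuristics with
+\fIscore\fP values of 1, 1, and 2 give \fIn\fP=4, so the default minimum
+score is floor((4+1)/2) = 2.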
+
+.in 9
+\fIdevice\fP\fB="\fP/dev/sda1\fB"\fP
+.in 12
+This is the device the quorum daemon will use.  This device must be the
+same on all nodes.
+
+.in 9
+\fIlabel\fP\fB="\fPmylabel\fB"/>\fP
+.in 12
+This overrides the device field if present.  If specified, the quorum
+daemon will read /proc/partitions and check for qdisk signatures
+on every block device found, comparing the label against the specified
+label.  This is useful in configurations where the block device name
+differs on a per-node basis.
+.in 0
+
+.SH "3.2.  The <heuristic> tag"
+This tag is a child of the <quorumd> tag.
+
+.in 8
+\fB<heuristic\fP
+.in 9
+\fIprogram\fP\fB="\fP/test.sh\fB"\fP
+.in 12
+This is the program used to determine if this heuristic is alive.  This
+can be anything which may be executed by \fI/bin/sh -c\fP.  A return
+value of zero indicates success; anything else indicates failure.
+
+.in 9
+\fIscore\fP\fB="\fP1\fB"\fP
+.in 12
+This is the weight of this heuristic.  Be careful when determining scores
+for heuristics.
+
+.in 9
+\fIinterval\fP\fB="\fP2\fB"/>\fP
+.in 12
+This is the frequency at which we poll the heuristic.
+.in 0
+
+.SH "3.3. Example"
+.in 8
+<quorumd interval="1" tko="10" votes="3" label="testing">
+.in 12
+<heuristic program="ping A -c1 -t1" score="1" interval="2"/>
+.br
+<heuristic program="ping B -c1 -t1" score="1" interval="2"/>
+.br
+<heuristic program="ping C -c1 -t1" score="1" interval="2"/>
+.br
+.in 8
+</quorumd>
+.in 0
+
+.SH "3.4. Heuristic score considerations"
+* Heuristic intervals should be set high enough to allow the previous run
+of a given heuristic to complete.
+
+* Heuristic scripts returning anything except 0 as their return code 
+are considered failed.
+
+* The worst-case for improperly configured quorum heuristics is a race
+to fence where two partitions simultaneously try to kill each other.
+
+.SH "3.5. Creating a quorum disk partition"
+The mkqdisk utility can create and list currently configured quorum disks
+visible to the local node; see
+.B mkqdisk(8)
+for more details.
+
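+For example, the quorum disk labeled "testing" in the configuration example
+above could be created with (the device name is only an example):
+
+.in 12
+mkqdisk \-c /dev/sdc1 \-l testing
+.in 0
+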
+.SH "SEE ALSO"
+mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5)
--- cluster/cman/man/qdiskd.8	2006/07/21 17:53:08	1.1
+++ cluster/cman/man/qdiskd.8	2006/07/21 17:55:04	1.2
@@ -0,0 +1,20 @@
+.TH "qdiskd" "8" "July 2006" "" "Quorum Disk Management"
+.SH "NAME"
+qdiskd \- Cluster Quorum Disk Daemon
+.SH "SYNOPSIS"
+\fBqdiskd [\-f] [\-d]\fP
+.SH "DESCRIPTION"
+.PP 
+The \fBqdiskd\fP daemon talks to CMAN and provides a mechanism for determining
+node-fitness in a cluster environment.  See
+.B
+qdisk(5)
+for configuration information.
+.SH "OPTIONS"
+.IP "\-f"
+Run in the foreground (do not fork / daemonize).
+.IP "\-d"
+Enable debug output.
+
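+.SH "EXAMPLE"
+To troubleshoot a configuration interactively, the daemon can be run in the
+foreground with debugging output enabled:
+.PP
+.in 12
+qdiskd \-f \-d
+.in 0
+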
+.SH "SEE ALSO"
+mkqdisk(8), qdisk(5), cman(5)
--- cluster/cman/man/Makefile	2004/08/13 06:38:22	1.1
+++ cluster/cman/man/Makefile	2006/07/21 17:55:04	1.2
@@ -18,10 +18,10 @@
 install:
 	install -d ${mandir}/man5
 	install -d ${mandir}/man8
-	install cman.5 ${mandir}/man5
-	install cman_tool.8 ${mandir}/man8
+	install cman.5 qdisk.5 ${mandir}/man5
+	install cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8
 
 uninstall:
-	${UNINSTALL} cman.5 ${mandir}/man5
-	${UNINSTALL} cman_tool.8 ${mandir}/man8
+	${UNINSTALL} cman.5 qdisk.5 ${mandir}/man5
+	${UNINSTALL} cman_tool.8 qdiskd.8 mkqdisk.8 ${mandir}/man8
 



