[Linux-cluster] Clustering Tutorial

Michael Will mwill at penguincomputing.com
Thu Oct 20 16:18:15 UTC 2005


http://www.phy.duke.edu/resources/computing/brahma/Resources/beowulf_book.php
is a good start.
http://www.beowulf.org is another good place; it is also the home of the
original Beowulf mailing list.

Generally I would recommend digging through recent mailing list postings,
because they often contain very well-informed answers to questions.

Lon just answered a fencing question a few days ago:

"STONITH, STOMITH, etc. are indeed implementations of I/O fencing.

Fencing is the act of forcefully preventing a node from being able to
access resources after that node has been evicted from the cluster in an
attempt to avoid corruption.

The canonical example of when it is needed is the live-hang scenario, as
you described:

1. node A hangs with I/Os pending to a shared file system
2. node B and node C decide that node A is dead and recover resources
allocated on node A (including the shared file system)
3. node A resumes normal operation
4. node A completes I/Os to shared file system

At this point, the shared file system is probably corrupt.  If you're
lucky, fsck will fix it -- if you're not, you'll need to restore from
backup.  I/O fencing (STONITH, or whatever we want to call it) prevents
the last step (step 4) from happening.

How fencing is done (power cycling via external switch, SCSI
reservations, FC zoning, integrated methods like IPMI, iLO, manual
intervention, etc.) is unimportant - so long as whatever method is used
can guarantee that step 4 can not complete."

"GFS can use fabric-level fencing - that is, you can tell the iSCSI
server to cut a node off, or ask the fiber-channel switch to disable a
port.  This is in addition to "power-cycle" fencing."
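
To make the four steps in Lon's live-hang example concrete, here is a toy
sketch in Python. It is only an illustration: SharedStorage, fence() and
live_hang() are made-up names, not part of GFS or any real cluster stack,
and the "fencing" here is just a set lookup standing in for whatever real
method (power switch, SCSI reservation, FC zoning, ...) would be used.

```python
class SharedStorage:
    """Toy stand-in for a shared file system (hypothetical, not a real API)."""

    def __init__(self):
        self.blocks = {}      # block number -> (writer, data)
        self.fenced = set()   # nodes that have been cut off from storage

    def fence(self, node):
        # Stands in for any fencing method; all that matters is that
        # later writes from `node` cannot complete.
        self.fenced.add(node)

    def write(self, node, block, data):
        if node in self.fenced:
            return False      # I/O rejected: the node was fenced
        self.blocks[block] = (node, data)
        return True


def live_hang(use_fencing):
    """Replay the four steps and return (did A's write land, final block)."""
    storage = SharedStorage()

    # Step 1: node A hangs with a write to block 7 still pending.
    pending = ("A", 7, "stale data from A")

    # Step 2: B and C decide A is dead and recover its resources.
    if use_fencing:
        storage.fence("A")
    storage.write("B", 7, "recovered by B")

    # Steps 3 and 4: A resumes and tries to complete its pending I/O.
    completed = storage.write(*pending)
    return completed, storage.blocks[7]


if __name__ == "__main__":
    print("without fencing:", live_hang(use_fencing=False))
    print("with fencing:   ", live_hang(use_fencing=True))
```

Without fencing, A's stale write lands on top of the block B already
recovered; with fencing, step 4 is refused and B's data survives.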


Michael



