[Linux-cluster] two fencing problems
gforte at leopard.us.udel.edu
Tue Dec 20 19:22:53 UTC 2005
Lon Hohberger wrote:
> On Wed, 2005-12-07 at 10:08 -0500, Greg Forte wrote:
>> <device name="FENCE1"
>> <device name="FENCE2"
>>and increased the reboot wait time on the PDUs to make sure it'd wait
>>long enough, and that SEEMS to work (once I remembered to turn off ccsd
>>before updating my cluster.conf by hand so that it didn't end up
>>replacing it with the old one immediately ;-)
> I don't know how I missed this, but this is a poor idea.
> What if fenced hangs in the middle? Then you haven't turned off the
> power at all, but the cluster thinks you did! Goodbye, file systems!
> There's no way to guarantee that both ports were turned off
> simultaneously, irrespective of the timeout values. :(
> You could do:
> <device name="FENCE1" option="off" port="1"/>
> <device name="FENCE2" option="reboot" port="1"/>
> <device name="FENCE1" option="on" port="1"/>
> ...but that's about as "optimal" as you can get while still being safe.
Thinking about this a bit further, how is the second example any better
than the first? If fenced hangs after issuing the "off" to FENCE1 in
your conf, but before or during issuing the reboot to FENCE2, how is
that different than it hanging between issuing the two reboots in mine?
Aside from the fact that mine (in theory) leaves both power outlets on,
whereas yours leaves one off, isn't the net effect that the node didn't
get fenced but the cluster thinks it did? The same argument applies to
the "off","off","on","on" configuration that I'd just as soon use.
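For reference, that ordering would look something like this in the
<fence> section of cluster.conf (a sketch only - the device names and
port numbers are just the placeholders from the snippets above, and the
method name is arbitrary):

    <fence>
      <method name="1">
        <device name="FENCE1" option="off" port="1"/>
        <device name="FENCE2" option="off" port="1"/>
        <device name="FENCE1" option="on" port="1"/>
        <device name="FENCE2" option="on" port="1"/>
      </method>
    </fence>

The idea being that both outlets are cut before either is restored, so
a hang partway through at least never leaves the node powered.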
I guess the real ambiguity here is in this concept of "thinks" -
wouldn't cman expect to get X "OK" responses from fenced, where X is
the number of entries in the <fence> section, and if it didn't receive
X responses then assume something was amiss? Otherwise it seems like
fencing with redundant fence devices is inherently unsafe ...
On a slightly related note, system-config-cluster strikes again - I
started it to monitor the cluster services, and it appears to have
clobbered the "illegal" fence sections that it didn't like. How would
one go about controlling (restarting, disabling) cluster services
without the GUI? I know cman_tool lets you check status, but it
doesn't seem to have any options for service control. Thanks.