[Linux-cluster] two fencing problems

Greg Forte gforte at leopard.us.udel.edu
Tue Dec 20 21:26:02 UTC 2005


Lon Hohberger wrote:
> On Tuesday 20 December 2005 14:22, Greg Forte wrote:
> 
>>Lon Hohberger wrote:
>>
>>>   <device name="FENCE1" option="off" port="1"/>
>>>   <device name="FENCE2" option="reboot" port="1"/>
>>>   <device name="FENCE1" option="on" port="1"/>
>>>
>>>...but that's about as "optimal" as you can get while still being safe.
>>
>>Thinking about this a bit further, how is the second example any better
>>than the first?  If fenced hangs after issuing the "off" to FENCE1 in
>>your conf, but before or during issuing the reboot to FENCE2, how is
>>that different than it hanging between issuing the two reboots in mine?
> 
> 
> Sorry, I was not very clear...
> 
> A power switch "rebooting" a port means turning that port off, then on, 
> optionally after some delay.
> 
> If you hang between two "reboot" operations and recover a few seconds later, 
> the second reboot cycle can occur after the first reboot cycle has completed.  
> That is - the first power outlet has power restored prior to the second 
> outlet being turned off.  Fencing has succeeded as configured (with some 
> delay for the hang), but the host has never lost power.  This is dangerous.
> 
> In the "off-reboot-on" case, if you hang between "off" (occurring first) and 
> "reboot" and recover a few seconds later, the first outlet is still off when 
> the second operation ("reboot") occurs.  Fencing has succeeded as configured, 
> and the host has lost power.
> 
> Similarly, with the "off-off-on-on" case, if you hang between two "off" 
> operations, the first port will still be off when the second "off" operation 
> occurs, so the host loses power like it should.
> 
> Put simply, if we expand the "reboot" operation to what it really is to a 
> power switch - "off-on" - we end up with the following:
> 
> "reboot-reboot" ==> "(off-on)-(off-on)" ==> bad!
> "off-reboot-on" ==> "off-(off-on)-on" ==> good

Ah, that makes more sense, yes - I was misinterpreting your use of
"hang".  Certainly true, but OTOH I set the wait time in the fence
devices to 10 seconds (I think; I'll have to check that), and if fenced
hangs for 10 seconds between two operations then I've got other big
problems.  Anyway, this was only a stop-gap until the patched fenced
makes its way into the next update - or until I get un-lazy enough to
patch it myself.  ;-)
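
For the archive, here is a rough Python sketch of the argument above.  It
is only a toy model, not anything fenced actually does.  The assumptions
are mine: the host has two power supplies fed by FENCE1 and FENCE2, the
switch carries out a "reboot" on its own (off, wait, on) once the command
is sent, the reboot delay is the 10 seconds mentioned above, each command
takes 1 second to issue, and the hang happens right after the first
operation.  The function names timeline() and host_lost_power() are made
up for this illustration.

```python
def timeline(ops, reboot_delay=10.0, step=1.0, hang_after=None, hang=0.0):
    """Expand fence operations into time-sorted (time, outlet, is_on) events.

    ops: list of (outlet, action) with action in {"off", "on", "reboot"}.
    hang_after: index of the operation after which fenced stalls for
    `hang` seconds before issuing the next one.
    """
    events = []
    t = 0.0
    for i, (outlet, action) in enumerate(ops):
        if action == "reboot":
            # Assumption: the switch does off, wait, on by itself.
            events.append((t, outlet, False))
            events.append((t + reboot_delay, outlet, True))
        elif action in ("off", "on"):
            events.append((t, outlet, action == "on"))
        else:
            raise ValueError(f"unknown action: {action!r}")
        t += step  # assumed time needed to issue one command
        if i == hang_after:
            t += hang
    return sorted(events, key=lambda e: e[0])


def host_lost_power(events):
    """Return True if all outlets feeding the host were off at once."""
    is_on = {outlet: True for _, outlet, _ in events}
    for _, outlet, state in events:
        is_on[outlet] = state
        if not any(is_on.values()):
            return True
    return False


REBOOT_REBOOT = [("FENCE1", "reboot"), ("FENCE2", "reboot")]
OFF_REBOOT_ON = [("FENCE1", "off"), ("FENCE2", "reboot"), ("FENCE1", "on")]
OFF_OFF_ON_ON = [("FENCE1", "off"), ("FENCE2", "off"),
                 ("FENCE1", "on"), ("FENCE2", "on")]

if __name__ == "__main__":
    configs = [("reboot-reboot", REBOOT_REBOOT),
               ("off-reboot-on", OFF_REBOOT_ON),
               ("off-off-on-on", OFF_OFF_ON_ON)]
    for name, ops in configs:
        for hang in (0.0, 15.0):
            lost = host_lost_power(timeline(ops, hang_after=0, hang=hang))
            print(f"{name:14s} hang={hang:4.1f}s  host lost power: {lost}")
```

Under those assumptions, the only combination that reports the host never
losing power is reboot-reboot with a hang longer than the reboot delay,
which is exactly the case Lon describes.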

>>... it appears to have
>>clobbered my "illegal" fence sections that it didn't like.  How would
>>one go about controlling (restarting, disabling) cluster services
>>without the gui?
> 
> Your fencing section *was* illegal :) , but I do not think it should have 
> clobbered (i.e. removed) it.  If there are other problems, please file a 
> bugzilla.  

I will have to see if I can reproduce it - right now neither of my nodes
will boot because I changed the lvm vg names and need to fix their
initrds ... and I'm on vacation.  ;-)  But I will fiddle with it in the
new year.

> See clustat(8), clusvcadm(8) for service control.

Thanks.

-g



