[Linux-cluster] Re: Power based fencing in cluster causes single point of failure that can take down a cluster

Jonathan Biggar jon at levanta.com
Tue Jan 9 19:22:10 UTC 2007


Josef Whiter wrote:
> You can either have redundant fence devices, or look into qdisk.

Thanks for the reply.  Can you explain how qdisk would solve the 
problem?  As far as I can tell, qdisk doesn't help in the case where a 
failed fencing device also takes down the cluster member it is 
attached to.

Does qdisk have some feedback mechanism that tells the cluster it's 
OK to restart the failed services on another node even though fencing 
did not succeed?  I can't see how that could work reliably and still 
prevent split-brain problems.

> On Tue, Jan 09, 2007 at 10:50:53AM -0800, Jonathan Biggar wrote:
>> If we set up a cluster and use network power switches for fencing, won't 
>> the failure of the power switch attached to a cluster member cause all 
>> services that were running on that node to fail to migrate to other 
>> cluster members?
>>
>> This seems to happen to us in practice, because fencing the offline 
>> member fails due to the power switch being unavailable, so rgmanager 
>> never migrates the failed service(s) to another member.
>>
>> Is there a general solution to this problem that I'm missing?
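
To make the failure mode concrete, here is a toy Python model of the 
behavior described above.  It is only a sketch, not actual fenced or 
rgmanager code: every name in it (fence_node, recover_services, 
dead_power_switch, backup_fence_device) is hypothetical.  It assumes 
that recovery is gated on at least one fence method succeeding, and 
that fence methods are tried in order.

```python
from typing import Callable, List

# A fence method takes a node name and reports whether fencing succeeded.
FenceMethod = Callable[[str], bool]


def fence_node(node: str, methods: List[FenceMethod]) -> bool:
    """Try each fence method in order; succeed on the first one that works."""
    for method in methods:
        if method(node):
            return True
    return False


def recover_services(node: str, services: List[str],
                     methods: List[FenceMethod]) -> List[str]:
    """Return the services that may be restarted on another member.

    Nothing is recovered unless the failed node was fenced, because
    restarting services while the node might still be running risks
    split brain.
    """
    if not fence_node(node, methods):
        return []  # fencing failed: services stay down
    return list(services)


def dead_power_switch(node: str) -> bool:
    # Hypothetical: the switch lost power along with the node it feeds.
    return False


def backup_fence_device(node: str) -> bool:
    # Hypothetical second, independent fence device that still responds.
    return True


services = ["nfs", "httpd"]

# One fence device, and it is the one that failed: recovery is blocked.
blocked = recover_services("node1", services, [dead_power_switch])

# Redundant fence devices: the second method succeeds, recovery proceeds.
recovered = recover_services("node1", services,
                             [dead_power_switch, backup_fence_device])
```

In this model a single power switch is a single point of failure for 
recovery, while a second, independent fence device removes it.  The 
model says nothing about qdisk.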

-- 
Jon Biggar
Levanta
jon at levanta.com
650-403-7252