[Linux-cluster] Quorum disks and two-node clusters

Wed Oct 18 07:32:08 UTC 2006

> On Tue, 2006-10-10 at 10:21 +0200, Pena, Francisco Javier wrote:
> > [...]
> > Lon,
> > 
> > Do you know if there are any plans to release an updated cman package,
> > including this bugfix, before U5? My company is planning to start 
> > implementing GFS, and this would definitely be required for us.
> > 
> > Regards,
> > 
> > Javier Peña
> 
> Hi Javier,
> 
> I do not think there is a plan to release the fix early, but 
> that could change at any time.  Note that because the changes 
> were very simple, you can use the checked-out version from 
> CVS and it will work on U4+errata as a drop-in replacement.
> 
> I am interested to know how you are going to use qdiskd such 
> that this is a show-stopper for your organization.
> 
> The only case where the lack of rebooting becomes problematic 
> is if you have heuristics which are not concerned with 
> network connectivity (note: this is a perfectly valid case, 
> of course).
> 
> That is, if two nodes see each other, but one thinks it is 
> inquorate, the quorate node will *not* fence the inquorate 
> node, and... well, things will not work very smoothly (this 
> is the case where a reboot is needed).
> 
> In network outages (one of the things qdiskd was designed to 
> help with), the nodes will not see each other, so the node 
> which still thinks it is quorate will fence the inquorate 
> node, and the cluster will cleanly continue.
> 
> -- Lon
> 

Hi Lon,

After doing some additional checks in my test environment, I think I was too quick to assume this would be absolutely required. I assumed that having the failed node reboot itself would eliminate the need to fence that node, but it looks like this is not the case. Which makes me wonder: why do we want the server to reboot itself if it is going to be fenced anyway?

If we could avoid fencing the failed node, we would be able to solve some problems I found with iLO fencing: if a node loses power completely, its iLO card will not work either, so we will never be able to fence the failed node, and the whole cluster will be stopped. If we could assume that an inquorate node will immediately reboot, we might continue working without any manual intervention.
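
To check that I have understood the cases correctly, here is a small Python sketch of them as I read your description. The function names and return values are only my own illustration and do not correspond to anything in cman or qdiskd; the "cluster can continue" rule for the iLO case is what I observed in my tests, not documented behaviour.

```python
def quorate_node_action(sees_peer: bool, peer_quorate: bool) -> str:
    """Hypothetical model of what the quorate node does about its peer."""
    if peer_quorate:
        # Both nodes are healthy: nothing to do.
        return "none"
    if sees_peer:
        # Peer is inquorate but still visible on the network:
        # it is NOT fenced, so it has to reboot itself.
        return "wait_for_peer_reboot"
    # Network outage: the peer is not visible, so it gets fenced.
    return "fence_peer"


def cluster_can_continue(sees_peer: bool, peer_quorate: bool,
                         fence_device_reachable: bool) -> bool:
    """Hypothetical model: can the surviving node carry on without help?"""
    action = quorate_node_action(sees_peer, peer_quorate)
    if action == "none":
        return True
    if action == "fence_peer":
        # With iLO fencing, a node that lost all power also lost its iLO
        # card, so the fence operation never succeeds.
        return fence_device_reachable
    # Visible but inquorate peer: things do not work smoothly
    # until that peer reboots.
    return False


if __name__ == "__main__":
    # Total power loss on the peer: not visible, iLO unreachable.
    print(cluster_can_continue(sees_peer=False, peer_quorate=False,
                               fence_device_reachable=False))
```

The last call is the case that worries me: the peer is gone, the fence device is gone with it, and the surviving node waits for a fence that cannot complete.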

Thanks for your answer. Regards,

Javier Peña