[Linux-cluster] Problems with Cluster

Manish Kathuria mkathuria at tuxtechnologies.co.in
Wed Jun 13 16:15:03 UTC 2007


On 6/12/07, Marc Grimme <grimme at atix.de> wrote:
> On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote:
> > On 6/11/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
> > > If ilo itself is off, fencing doesn't work.
> >
> > Isn't there any timeout setting such that if the ILO doesn't respond
> > for a certain amount of time, it is treated as fenced and the node is
> > considered to be dead and the failover takes place?
> As far as I remember there is only a TCP timeout when establishing the
> connection to the ilo card, and it takes a very long time to occur (that's a
> default setting and takes minutes). I'm not sure how and where to set it.

We did wait for quite some time and followed the messages appearing in
/var/log/messages. It kept on trying to contact the ILO of the node
which was powered off.
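
For reference, the kind of short connect timeout Marc describes could be
sketched roughly as below. This is only an illustration in Python, not code
taken from fence_ilo or from the comoonics agent. The function name
ilo_reachable and the port 443 are my assumptions, and the 10 second default
simply mirrors the Config.timeout value mentioned in this thread.

```python
import socket


def ilo_reachable(host, port=443, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds.

    Hypothetical helper: the port and the default timeout are assumptions.
    """
    try:
        # create_connection applies the timeout to the connect attempt itself,
        # so a powered-off card fails after roughly `timeout` seconds instead
        # of waiting for the much longer default TCP connect timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

A check like this only says that the card did not answer in time. It does not
tell us that the node behind it is really down.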

>
> But we've had this discussion (especially with ILO cards) nearly every time
> we used them, so for that and for other reasons we had to build
> our own fence_ilo agent. I'm quite sure that we solved the timeout problem in
> the end. It is set to 10 sec by default (Config.timeout).
> You can find it at
> http://download.atix.de/yum/comoonics/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm
> or directly use the yum/up2date-channel as described here:
> http://www.open-sharedroot.org/faq/can-i-use-yum-or-up2date-to-install-the-software/
> then install "comoonics-bootimage-fenceclient-ilo" and there you go.

Thanks, I will try and see if they agree to use this version.

> >
> > > Did you add ilo as a fence device? And create a user? You create a user
> > > in the ilo for that blade, not on the chassis. You have to reboot the
> > > blade to get to the ilo manager.
> >
> > Yes, we had added the respective ILOs as fence devices for both the servers
> > and created users as well.
> We are doing so as well. Always a power user for ilo devices.
> We are also automating this with the ilo client.
> There is an undocumented switch -x in the fence_ilo client referenced above
> where you reference a file that might look as follows and you'll have your
> user.
> > I just want to make sure that automatic fencing happens and failover
> > takes place even when there is a complete power failure for one node
> If the timeout thing works you'll also need a second fence mechanism.
> You might think about using fence_manual as a last resort, to bring the cluster
> back online after a power failure, once someone has intervened manually.
>
> Regards Marc.

Just wondering if there is any undocumented option / switch which will
force an automatic failover to one node if the ILO on the other one
fails to respond within a certain time period (maybe a few minutes).
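
To make clear what I am after, here is a rough Python sketch of the behaviour:
each fence method gets a bounded amount of time, and if it hangs or fails the
next one is tried. This is not how fenced is implemented and it is not part of
any fence agent. The function name run_fence_methods and the (name, argv) list
format are made up for illustration, and the commands passed in would be
whatever agents are configured.

```python
import subprocess


def run_fence_methods(methods, timeout=10.0):
    """Try each (name, argv) pair in order; return the name of the first that succeeds.

    Hypothetical helper: `methods` is a list of (name, argv) tuples, where argv
    is the command line of a fence agent. Returns None if every method fails.
    """
    for name, argv in methods:
        try:
            # subprocess.run kills the child and raises TimeoutExpired if it
            # does not finish within `timeout` seconds.
            result = subprocess.run(argv, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            # The agent hung or could not be started. That is a failure, not
            # a successful fence, so move on to the next method.
            continue
        if result.returncode == 0:
            return name
    return None
```

Note that a timeout here is never counted as a successful fence. It only
shortens the wait before the second method (for example fence_manual, as Marc
suggests) gets its turn.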

Regards,
--
Manish



