[Linux-cluster] Problems with Cluster

Marc Grimme grimme at atix.de
Tue Jun 12 06:16:30 UTC 2007


On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote:
> On 6/11/07, Robert Gil <Robert.Gil at americanhm.com> wrote:
> > If ilo itself is off, fencing doesn't work.
>
> Isn't there any timeout setting such that if the ILO doesn't respond
> for a certain amount of time, it is treated as fenced and the node is
> considered to be dead and the failover takes place?
As far as I remember, there is only a TCP timeout when establishing the 
connection to the iLO card, and it takes a very long time to occur (it is a 
default setting and takes minutes). I'm not sure how or where to set it.

But we've had this discussion nearly every time we've used iLO cards. For 
that and for other reasons we had to build our own fence_ilo agent. I'm quite 
sure that we solved the timeout problem in the end. The timeout defaults to 
10 seconds (Config.timeout).
You can find it at 
http://download.atix.de/yum/comoonics/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm
or directly use the yum/up2date-channel as described here:
http://www.open-sharedroot.org/faq/can-i-use-yum-or-up2date-to-install-the-software/ 
then install "comoonics-bootimage-fenceclient-ilo" and there you go.
>
> > Did you add ilo as a fence device? And create a user? You create a user
> > in the ilo for that blade, not on the chassis. You have to reboot the
> > blade to get to the ilo manager.
>
> Yes, had added respective ILOs as fence devices for both the servers
> and created users also.
We do the same: always a power user for iLO devices.
We also automate this with the iLO client.
There is an undocumented switch -x in the fence_ilo client referenced above. 
You pass it a file that might look like the following, and you'll have your 
user.

  <USER_INFO MODE="write">
    <ADD_USER
      USER_NAME="power"
      USER_LOGIN="power"
      PASSWORD="the_password">
      <ADMIN_PRIV value ="N"/>
      <REMOTE_CONS_PRIV value ="N"/>
      <RESET_SERVER_PRIV value ="Y"/>
      <VIRTUAL_MEDIA_PRIV value ="N"/>
      <!--        Firmware support information for next tag:         -->
      <!--            iLO 2 - All versions.                          -->
      <!--              iLO - All versions.                          -->
      <!--         RILOE II - None                                   -->
      <CONFIG_ILO_PRIV value="Yes"/>
      <!--        Firmware support information for next 3 tags:      -->
      <!--            iLO 2 - None.                                  -->
      <!--              iLO - None.                                  -->
      <!--         RILOE II - All versions.                          -->
      <!--
      <CONFIG_RILO_PRIV value="Y"/>
      <LOGIN_PRIV value ="Y"/>
      <CLIENT_RANGE value ="10.10.10.1 - 254.255.255.255"/>
      -->
      <!--        Firmware support information for next 6 tags:      -->
      <!--            iLO 2 - None.                                  -->
      <!--              iLO - Version 1.40 and earlier.              -->
      <!--         RILOE II - None.                                  -->
      <!--
      <VIEW_LOGS_PRIV value="Yes"/>
      <CLEAR_LOGS_PRIV value="Yes"/>
      <EMS_PRIV value="Yes"/>
      <UPDATE_ILO_PRIV value="No"/>
      <CONFIG_RACK_PRIV value="Yes"/>
      <DIAG_PRIV value="Yes"/>
      -->
    </ADD_USER>
  </USER_INFO>

>
>
> I just want to make sure that automatic fencing happens and failover
> takes place even when there is a complete power failure for one node
Even if the timeout works, the iLO fence attempt will only fail sooner, so 
you'll also need a second fence mechanism. You might consider fence_manual as 
a last resort, so that the cluster can be brought back online after a power 
failure once someone has intervened manually.
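
As a rough sketch of what I mean by a second mechanism, the relevant parts of 
cluster.conf could look something like this. The node and device names (node1, 
ilo-node1, human) and the hostname are made up, and the attributes are those of 
the stock fence_ilo and fence_manual agents as far as I remember them (older 
releases use ipaddr instead of nodename for fence_manual), so please check them 
against your version:

```xml
<!-- Fragment of cluster.conf, not a complete file. All names are examples. -->
<clusternodes>
  <clusternode name="node1" votes="1">
    <fence>
      <!-- First method: power fencing through the node's iLO -->
      <method name="1">
        <device name="ilo-node1"/>
      </method>
      <!-- Second method: only tried when the iLO method fails,
           for example because the iLO itself has no power -->
      <method name="2">
        <device name="human" nodename="node1"/>
      </method>
    </fence>
  </clusternode>
  <!-- the second node is configured the same way with its own iLO -->
</clusternodes>

<fencedevices>
  <fencedevice agent="fence_ilo" name="ilo-node1"
               hostname="ilo-node1.example.com"
               login="power" passwd="the_password"/>
  <fencedevice agent="fence_manual" name="human"/>
</fencedevices>
```

The methods are tried in order, so fence_manual only comes into play when the 
iLO cannot be reached, and then someone has to confirm with fence_ack_manual 
that the node is really down before the services fail over.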

Regards Marc.
>
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Manish Kathuria
> > Sent: Monday, June 11, 2007 12:45 PM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Problems with Cluster
> >
> > On 6/11/07, Maciej Bogucki <maciej.bogucki at artegence.com> wrote:
> > > Manish Kathuria wrote:
> > > > We want the failover to happen when the power supply fails to either
> > > > of the nodes. In order to test the scenario, we removed the power
> > > > cables from one of the nodes. However the failover did not happen
> > > > and upon observing the logs we found that the alive node could not
> > > > connect to the fence device (ILO in this case) of the dead node
> > > > since it was powered off and the fencing could not take place. Does
> > > > this mean that we would not be able to have a failover in case of
> > > > power failure for one of the nodes? Is there a way we can do it?
> > > > How is the cluster supposed to react when the ILO itself is powered
> > > > off?
> > >
> > > You need to perform manual fencing (administrator intervention) when
> > > that happens.
> >
> > Isn't there any way that is automated and does not require manual
> > intervention? Otherwise, the whole purpose is defeated.
>



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht München
Registernummer: HRB 131682
USt.-Id.: DE209485962

Geschäftsführung: Marc Grimme, Mark Hlawatschek, Thomas Merz




