[Linux-cluster] Fence Issue on BL 460C G6

Wahyu Darmawan wahyu at vivastor.co.id
Sat Oct 30 10:48:07 UTC 2010


Hi all,

Thanks. I've replaced the mainboard on both servers, and both servers are
active again after the replacement. But there's another problem.

 

But when I restart the active node, the other node gets restarted as well;
this happens during fencing.

The cycle then repeats, so both nodes end up restarting each other over and over.

 

Need your suggestions, please.

Please find /var/log/messages attached.

And here's my cluster.conf:

<?xml version="1.0"?>
<cluster alias="PORTAL_WORLD" config_version="32" name="PORTAL_WORLD">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhel-cluster-node1.mgmt.local" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="NODE1-ILO"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="rhel-cluster-node2.mgmt.local" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="NODE2-ILO"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <quorumd device="/dev/sdf1" interval="3" label="quorum_disk1" tko="23" votes="2">
                <heuristic interval="2" program="ping 10.4.0.1 -c1 -t1" score="1"/>
        </quorumd>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ilo" hostname="ilo-node2" login="Administrator" name="NODE2-ILO" passwd="password"/>
                <fencedevice agent="fence_ilo" hostname="ilo-node1" login="Administrator" name="NODE1-ILO" passwd="password"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="Failover" nofailback="0" ordered="0" restricted="0">
                                <failoverdomainnode name="rhel-cluster-node2.mgmt.local" priority="1"/>
                                <failoverdomainnode name="rhel-cluster-node1.mgmt.local" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.4.1.103" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="Failover" exclusive="0" name="IP_Virtual" recovery="relocate">
                        <ip ref="10.4.1.103"/>
                </service>
        </rm>
</cluster>
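Since both nodes end up fencing each other, it is worth checking the fence
agent by hand before letting fenced drive it. This is only a sketch, assuming
your fence_ilo build accepts the usual -a/-l/-p/-o options, and it reuses the
iLO hostnames and credentials from the config above; the commands are echoed
as a dry run, so remove the `echo` to actually query each iLO:

```shell
# Dry run: print the manual fence-agent status checks for both iLOs.
# Remove the leading 'echo' to actually contact each iLO.
for ilo in ilo-node1 ilo-node2; do
  echo fence_ilo -a "$ilo" -l Administrator -p password -o status
done
```

A working agent should report the node's power status; if it errors out here,
fenced will loop exactly as described above.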

 

Thanks,

 

 

 

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dustin Henry Offutt
Sent: Thursday, October 28, 2010 11:46 PM
To: linux clustering
Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6

 

I believe your problem is being caused by "nofailback" being set to "1":

                       <failoverdomain name="Failover" nofailback="1" ordered="0" restricted="0">

Set it to zero and I believe your problem will be resolved.
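Concretely, with that change the failoverdomain block would read:

```xml
<failoverdomain name="Failover" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="rhel-cluster-node2.mgmt.local" priority="1"/>
        <failoverdomainnode name="rhel-cluster-node1.mgmt.local" priority="1"/>
</failoverdomain>
```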

On Wed, Oct 27, 2010 at 10:43 PM, Wahyu Darmawan <wahyu at vivastor.co.id>
wrote:

Hi Ben,
Here is my cluster.conf. I need your help, please.


<?xml version="1.0"?>
<cluster alias="PORTAL_WORLD" config_version="32" name="PORTAL_WORLD">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhel-cluster-node1.mgmt.local" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="NODE1-ILO"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="rhel-cluster-node2.mgmt.local" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="NODE2-ILO"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <quorumd device="/dev/sdf1" interval="3" label="quorum_disk1" tko="23" votes="2">
                <heuristic interval="2" program="ping 10.4.0.1 -c1 -t1" score="1"/>
        </quorumd>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ilo" hostname="ilo-node2" login="Administrator" name="NODE2-ILO" passwd="password"/>
                <fencedevice agent="fence_ilo" hostname="ilo-node1" login="Administrator" name="NODE1-ILO" passwd="password"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="Failover" nofailback="1" ordered="0" restricted="0">
                                <failoverdomainnode name="rhel-cluster-node2.mgmt.local" priority="1"/>
                                <failoverdomainnode name="rhel-cluster-node1.mgmt.local" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.4.1.103" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="Failover" exclusive="0" name="IP_Virtual" recovery="relocate">
                        <ip ref="10.4.1.103"/>
                </service>
        </rm>
</cluster>

Many thanks,
Wahyu


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ben Turner
Sent: Thursday, October 28, 2010 12:18 AM
To: linux clustering
Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6

My guess is there is a problem with fencing. Are you running fence_ilo with
an HP blade? IIRC the iLOs on the blades have a different CLI, and I don't
think fence_ilo will work with them. What do you see in the messages file
during these events? If you see failed-fence messages you may want to look
into using fence_ipmilan:

http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig
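For reference, a fence_ipmilan setup would replace the fence_ilo devices with
something along these lines. This is only a sketch: the ipaddr values reuse
the iLO hostnames from the posted cluster.conf, and lanplus="1" assumes the
iLO firmware speaks IPMI 2.0, so verify both against your hardware:

```xml
<fencedevices>
        <fencedevice agent="fence_ipmilan" name="NODE1-ILO" ipaddr="ilo-node1"
                     login="Administrator" passwd="password" lanplus="1"/>
        <fencedevice agent="fence_ipmilan" name="NODE2-ILO" ipaddr="ilo-node2"
                     login="Administrator" passwd="password" lanplus="1"/>
</fencedevices>
```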

If you post a snip of your messages file from this event, along with your
cluster.conf, I will have a better idea of what is going on.

-b



----- "Wahyu Darmawan" <wahyu at vivastor.co.id> wrote:

> Hi all,
>
>
>
> For fencing, I'm using HP iLO and the server is a BL460c G6. The problem
> is that the resource only starts moving to the passive node when the
> failed node is powered on again, which is really strange to me. For
> example, I shut down node1 and physically removed the node1 machine from
> the blade chassis while monitoring the clustat output; clustat still
> showed the resource on node1, even though node1 was powered down and
> removed from the c7000 blade chassis. But when I plugged the failed node1
> back into the c7000 blade chassis and it powered on, clustat showed the
> resource starting to move from the failed node to the passive node.
> I power down the blade server with the power button on its front and then
> remove it from the chassis. If we hit a hardware problem on the active
> node and the active node goes down, how will the resource move to the
> passive node? In addition, when I reboot or shut down the machine from
> the CLI, the resource moves to the passive node successfully. Furthermore,
> when I shut down the active node with "shutdown -hy 0", the active node
> automatically restarts after shutting down.
>
> Please help me.
>
>
>
> Many Thanks,
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20101030/52c951fc/attachment.htm>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages.txt
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20101030/52c951fc/attachment.txt>

