[Linux-cluster] Fence Issue on BL 460C G6

Ben Turner bturner at redhat.com
Tue Nov 2 16:04:27 UTC 2010


Your nodes don't seem to be able to communicate:

Oct 30 16:08:15 rhel-cluster-node2 fenced[3549]: rhel-cluster-node1.mgmt.local not a cluster member after 3 sec post_join_delay
Oct 30 16:08:15 rhel-cluster-node2 fenced[3549]: fencing node "rhel-cluster-node1.mgmt.local"
Oct 30 16:08:29 rhel-cluster-node2 fenced[3549]: fence "rhel-cluster-node1.mgmt.local" success
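
Node2 only waits post_join_delay (3 seconds in your config) for node1 to join before fencing it.  As a stopgap while you chase down the communication problem you could raise that value, for example (just a suggestion, use whatever delay suits your environment):

<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>

That only gives the nodes more time to find each other, though; it doesn't fix the underlying issue.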

I never see them form a cluster:

Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM  ] New Configuration:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM  ]       r(0) ip(10.4.1.102)
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM  ] Members Left:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM  ] Members Joined:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM  ] New Configuration:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM  ]       r(0) ip(10.4.1.102)
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM  ] Members Left:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM  ] Members Joined:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [SYNC ] This node is within the primary component and will provide service.
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [TOTEM] entering OPERATIONAL state.

Are the nodes just rebooting each other in a cycle?  If so, my guess is that you are having trouble routing the multicast traffic.  An easy test is to switch to broadcast.  Change your cman tag to:

<cman expected_votes="1" two_node="1" broadcast="yes"/>

If your nodes can form a cluster with that set, then you need to evaluate your multicast config.
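
If they do, a quick way to see whether the multicast traffic is actually making it between the nodes is to watch for it on the wire.  This is only a rough sketch: eth0 is a placeholder for your cluster interface, and cman_tool status should show the multicast group cman is really using.

# Find the multicast address the cluster is configured for
cman_tool status | grep -i multicast

# On each node, watch whether that traffic arrives from the other node
tcpdump -n -i eth0 net 224.0.0.0/4

If you have omping available, running it on both nodes at the same time gives a two-way multicast connectivity test as well:

omping rhel-cluster-node1.mgmt.local rhel-cluster-node2.mgmt.local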

-Ben



----- "Wahyu Darmawan" <wahyu at vivastor.co.id> wrote:

> Hi all,
> 
> Thanks. I've replaced the mainboard on both servers. But there's
> another problem. Both servers are active after the mainboard
> replacement.
> 
> 
> 
> But when I restart the node that is active, the other node gets
> restarted as well. This happens during fencing.
> 
> It keeps recurring, which in turn leads to both nodes restarting
> repeatedly.
> 
> 
> 
> Need your suggestion, please.
> 
> Please find /var/log/messages attached.
> 
> And here's my cluster.conf:
> 
> <?xml version="1.0"?>
> <cluster alias="PORTAL_WORLD" config_version="32" name="PORTAL_WORLD">
>     <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>     <clusternodes>
>         <clusternode name="rhel-cluster-node1.mgmt.local" nodeid="1" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="NODE1-ILO"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="rhel-cluster-node2.mgmt.local" nodeid="2" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="NODE2-ILO"/>
>                 </method>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <quorumd device="/dev/sdf1" interval="3" label="quorum_disk1" tko="23" votes="2">
>         <heuristic interval="2" program="ping 10.4.0.1 -c1 -t1" score="1"/>
>     </quorumd>
>     <cman expected_votes="1" two_node="1"/>
>     <fencedevices>
>         <fencedevice agent="fence_ilo" hostname="ilo-node2" login="Administrator" name="NODE2-ILO" passwd="password"/>
>         <fencedevice agent="fence_ilo" hostname="ilo-node1" login="Administrator" name="NODE1-ILO" passwd="password"/>
>     </fencedevices>
>     <rm>
>         <failoverdomains>
>             <failoverdomain name="Failover" nofailback="0" ordered="0" restricted="0">
>                 <failoverdomainnode name="rhel-cluster-node2.mgmt.local" priority="1"/>
>                 <failoverdomainnode name="rhel-cluster-node1.mgmt.local" priority="1"/>
>             </failoverdomain>
>         </failoverdomains>
>         <resources>
>             <ip address="10.4.1.103" monitor_link="1"/>
>         </resources>
>         <service autostart="1" domain="Failover" exclusive="0" name="IP_Virtual" recovery="relocate">
>             <ip ref="10.4.1.103"/>
>         </service>
>     </rm>
> </cluster>
> 
> 
> 
> Thanks,
> 
> 
> 
> 
> 
> 
> 
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dustin Henry Offutt
> Sent: Thursday, October 28, 2010 11:46 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6
> 
> 
> 
> I believe your problem is being caused by "nofailback" being set to
> "1":
> 
> <failoverdomain name="Failover" nofailback="1" ordered="0" restricted="0">
> 
> Set it to zero and I believe your problem will be resolved.
> 
> 
> On Wed, Oct 27, 2010 at 10:43 PM, Wahyu Darmawan <wahyu at vivastor.co.id> wrote:
> 
> Hi Ben,
> Here is my cluster.conf. Need your help please.
> 
> 
> <?xml version="1.0"?>
> <cluster alias="PORTAL_WORLD" config_version="32" name="PORTAL_WORLD">
>     <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>     <clusternodes>
>         <clusternode name="rhel-cluster-node1.mgmt.local" nodeid="1" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="NODE1-ILO"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="rhel-cluster-node2.mgmt.local" nodeid="2" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="NODE2-ILO"/>
>                 </method>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <quorumd device="/dev/sdf1" interval="3" label="quorum_disk1" tko="23" votes="2">
>         <heuristic interval="2" program="ping 10.4.0.1 -c1 -t1" score="1"/>
>     </quorumd>
>     <cman expected_votes="1" two_node="1"/>
>     <fencedevices>
>         <fencedevice agent="fence_ilo" hostname="ilo-node2" login="Administrator" name="NODE2-ILO" passwd="password"/>
>         <fencedevice agent="fence_ilo" hostname="ilo-node1" login="Administrator" name="NODE1-ILO" passwd="password"/>
>     </fencedevices>
>     <rm>
>         <failoverdomains>
>             <failoverdomain name="Failover" nofailback="1" ordered="0" restricted="0">
>                 <failoverdomainnode name="rhel-cluster-node2.mgmt.local" priority="1"/>
>                 <failoverdomainnode name="rhel-cluster-node1.mgmt.local" priority="1"/>
>             </failoverdomain>
>         </failoverdomains>
>         <resources>
>             <ip address="10.4.1.103" monitor_link="1"/>
>         </resources>
>         <service autostart="1" domain="Failover" exclusive="0" name="IP_Virtual" recovery="relocate">
>             <ip ref="10.4.1.103"/>
>         </service>
>     </rm>
> </cluster>
> 
> Many thanks,
> Wahyu
> 
> 
> 
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ben Turner
> Sent: Thursday, October 28, 2010 12:18 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6
> 
> My guess is that there is a problem with fencing. Are you running
> fence_ilo with an HP blade? IIRC the iLOs on the blades have a
> different CLI, and I don't think fence_ilo will work with them. What
> do you see in the messages files during these events? If you see
> failed fence messages, you may want to look into using fence_ipmilan:
> 
> http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig
> 
> If you post a snip of your messages file from this event and your
> cluster.conf I will have a better idea of what is going on.
> 
> -b
> 
> 
> 
> ----- "Wahyu Darmawan" < wahyu at vivastor.co.id > wrote:
> 
> > Hi all,
> >
> >
> >
> > For fencing, I'm using HP iLO and the servers are BL460c G6. The
> > problem is that the resource only starts moving to the passive node
> > when the failed node is powered on. It is really strange to me. For
> > example, I shut down node1 and physically removed the node1 machine
> > from the blade chassis while monitoring the clustat output; clustat
> > still showed the resource on node1, even though node1 was powered
> > down and removed from the c7000 blade chassis. But when I plugged
> > the failed node1 back into the c7000 blade chassis and it powered
> > on, clustat showed the resource starting to move from the failed
> > node to the passive node.
> > I'm powering down the blade server with the power button on the
> > front of it and then removing it from the chassis. If we hit a
> > hardware problem on the active node and the active node goes down,
> > how does the resource move to the passive node? In addition, when I
> > reboot or shut down the machine from the CLI, the resource moves
> > successfully to the passive node. Furthermore, when I shut down the
> > active node with the "shutdown -hy 0" command, after shutting down,
> > the active node automatically restarts.
> >
> > Please help me.
> >
> >
> >
> > Many Thanks,
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



