[Linux-cluster] fence 'node1' failed if eth0 down

Stewart Walters spods at iinet.net.au
Thu Feb 19 07:40:49 UTC 2009


Hi,

If you have no fencing devices, you *must* use fence_manual.  AFAIK, CMAN
operations will not work without some form of fencing in place.  Fencing is
critical to cluster services - in fact, I'm surprised CMAN even starts given
that you have no fence devices in /etc/cluster/cluster.conf - no matter though.

fence_manual is a bit of a cop-out, though.  It too will not provide proper
fencing in the event of split brain - and fencing needs to be automatic to
ensure data corruption does not occur on the shared storage that the nodes
write to.
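
For clarity, "manual" fencing just means fenced waits for a human: after a
failure, the fence operation blocks until an administrator has verified that
the node really is down and acknowledges it by hand on a surviving node,
along these lines (the node name here is just an example):

    # only after confirming node1 is genuinely powered off / disconnected:
    fence_ack_manual -n node1

Until that is run, GFS and rgmanager stay blocked - which is exactly why it
is unsuitable for production.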

Seeing as you're using Xen for your cluster nodes, check the man pages for
fence_xvm and fence_xvmd.  I have no idea what the difference is between the
two (as I use physical hosts), but they appear to be the fencing agents for
Xen virtualised cluster nodes.

Someone else on the list might have a better understanding of these fencing
agents, so they might be able to clue you (and me!) in a bit better.

I'd also suggest that, in your situation, you configure fence_manual as a
backup fencing method once fence_xvm(d) is in place as the primary fence
agent - something like the sketch below.
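
As a rough, untested sketch only (the method and device names and the
domain="..." values are my assumptions - the domains must match your actual
Xen domU names, and fence_xvmd needs to be running on the dom0 with the
shared key in place), the relevant parts of cluster.conf might look
something like:

    <clusternode name="node1" nodeid="1" votes="1">
            <fence>
                    <!-- primary: fence the domU via fence_xvmd on the dom0 -->
                    <method name="1">
                            <device name="xvm" domain="node1"/>
                    </method>
                    <!-- backup: manual acknowledgement -->
                    <method name="2">
                            <device name="manual" nodename="node1"/>
                    </method>
            </fence>
    </clusternode>

    <fencedevices>
            <fencedevice agent="fence_xvm" name="xvm"/>
            <fencedevice agent="fence_manual" name="manual"/>
    </fencedevices>

fenced tries the methods in the order listed, so fence_manual would only be
consulted if the fence_xvm attempt fails.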

Realistically though, running both Xen cluster nodes on a single physical
dom0 host is itself a single point of failure, and doing so mostly negates
the need for Red Hat Cluster Suite in the stack at all (if you are trying to
use it to provide high availability).

But if you're just doing a bit of testing on RHCS, fence_manual might get you
by.  I wouldn't recommend staying on it, though - I find RHCS + fence_manual
tends to cause more interruption to users than not having it implemented at all.

If you want documentation on how to implement fencing, check the Cluster
Project FAQ section on fencing
(http://sources.redhat.com/cluster/faq.html#fence_what) or the Configuring and
Managing a Red Hat Cluster document for RHEL 5.2 from the Red Hat Cluster
Suite documentation
(http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.2/html/Cluster_Administration/index.html).
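
Once you do have fence devices defined, you can exercise them from the
command line before trusting them in a real failure - fence_node asks the
fence daemon to fence the named node using whatever is configured in
cluster.conf:

    # should forcibly take node1 out of the cluster:
    fence_node node1

If that fails, automatic failover won't work either.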

Regards,

Stewart





On Tue Feb 17 16:48, ESGLinux sent:

>Hi,
>
>first, thank you very much for your answer,
>
>You are right, I have no fencing devices at all, but for one reason: I
>haven't got any!!!
>
>I'm just testing with 2 Xen virtual machines running on the same host and
>mounting an iSCSI disk on another host to simulate shared storage.
>
>On the other hand, I think I don't understand the concept of fencing.
>
>I tried to configure fencing devices with luci, but I don't know what to
>select from the combo of fencing devices. (Perhaps manual fencing, although
>it's not recommended for production.)
>
>So, as I think this is a newbie and perhaps silly question:
>
>Can you give any good reference about fencing to learn from, or an example
>configuration with fence devices to see how it must be done?
>
>thanks again,
>
>ESG
>
>
>2009/2/17 spods at iinet.net.au <spods at iinet.net.au>
>
>A couple of things.
>
>You don't have any fencing devices defined in cluster.conf at all.  No power
>fencing, no I/O fencing, not even manual fencing.
>
>You need to define how each node of the cluster is to be fenced (forcibly
>removed from the cluster) for proper failover operations to occur.
>
>Secondly, if the only connection shared between the two nodes is the network
>cord you just disconnected, then of course nothing will happen - each node
>has just lost the only common connection through which to control the faulty
>node (i.e. through fencing).
>
>There need to be more connections between the nodes of a cluster than just a
>network card.  This can be achieved with a second NIC, I/O fencing, or
>centralised or individual power controls (I/O switches or IPMI).
>
>That way, if the network connection is the single point of failure between
>the two nodes, at least a node can be fenced when it misbehaves.
>
>Once the faulty node is fenced, the remaining nodes should continue
>providing cluster services.
>
>Regards,
>
>Stewart
>
>On Mon Feb 16 16:29, ESGLinux sent:
>
>>Hello All,
>>
>>I have a cluster with two nodes running one service (mysql). The two nodes
>>use an iSCSI disk with GFS on it.
>>I haven't configured fencing at all.
>>
>>I have tested different failure situations and these are my results:
>>
>>If I halt node1, the service relocates to node2 - OK
>>If I kill the process on node1, the service relocates to node2 - OK
>>
>>but
>>
>>if I unplug the wire of the ethernet device or do ifdown eth0 on node1, the
>>whole cluster fails. The service doesn't relocate.
>>
>>On node2 I get the messages:
>>
>>Feb 15 13:29:34 localhost fenced[3405]: fencing node "192.168.1.188"
>>Feb 15 13:29:34 localhost fenced[3405]: fence "192.168.1.188" failed
>>Feb 15 13:29:39 localhost fenced[3405]: fencing node "192.168.1.188"
>>Feb 15 13:29:39 localhost fenced[3405]: fence "192.168.1.188" failed
>>
>>again and again. node2 never runs the service, and if I try to reboot node1
>>the machine hangs waiting for the services to stop.
>>
>>In this situation all I can do is switch off the power of node1 and reboot
>>node2. This is not acceptable at all.
>>
>>I think the problem is just with fencing, but I don't know how to apply it
>>to this situation (I have RTFM from the Red Hat site but I haven't seen how
>>to apply it. :-( )
>>
>>this is my cluster.conf file
>>
>><cluster alias="MICLUSTER" config_version="62" name="MICLUSTER">
>>        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>        <clusternodes>
>>                <clusternode name="node1" nodeid="1" votes="1">
>>                        <fence/>
>>                </clusternode>
>>                <clusternode name="node2" nodeid="2" votes="1">
>>                        <fence/>
>>                </clusternode>
>>        </clusternodes>
>>        <cman expected_votes="1" two_node="1"/>
>>        <fencedevices/>
>>        <rm>
>>                <failoverdomains>
>>                        <failoverdomain name="DOMINIOFAIL" nofailback="0" ordered="0" restricted="1">
>>                                <failoverdomainnode name="node1" priority="1"/>
>>                                <failoverdomainnode name="node2" priority="1"/>
>>                        </failoverdomain>
>>                </failoverdomains>
>>                <resources/>
>>                <service domain="DOMINIOFAIL" exclusive="0" name="BBDD" revovery="restart">
>>                        <mysql config_file="/etc/my.cnf" listen_address="" mysql_options="" name="mydb" shutdown_wait="3"/>
>>                        <ip address="192.168.1.183" monitor_link="1"/>
>>                </service>
>>        </rm>
>></cluster>
>>
>>Any idea? References?
>>
>>Thanks in advance
>>
>>Greetings
>>
>>ESG





